HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"

0

I am reading files from S3 and using a Glue ETL job to populate a Data Catalog Table.

My S3 files look like this:

VID,ALTID,VTYPE,TIME,TIMEGMT,VALUE
ABC, ABC, DATAFEED,31-10-2021 22:00:00,01-11-2021 02:00:00,   11775.685

The scheme read in looks like this

root
|-- VID: string
|-- ALTID: string
|-- VTYPE: string
|-- TIME: string
|-- TIMEGMT: string
|-- VALUE: string

I am changing fields 3 and 4 from "strings" to timestamps, which matches the schema of my Data Catalog Table. I am also doing a few other transformations.

I am transforming them like so:

df = df.withColumn("time", to_timestamp("time", 'dd-MM-yyyy HH:mm:ss'))
df = df.withColumn("timegmt", to_timestamp("timegmt", 'dd-MM-yyyy HH:mm:ss'))

When I try to read the data with Athena, I get an error:

HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"

The Data Catalog Table Schema looks like this:

Column NameData Type
vidstring
altidstring
vtypestring
timetimestamp
timegmttimestamp
valueint
filenamestring

And the line in the run-1647806179090-part-r-00000 file it is choking on looks like this:

vid,altid,vtype,time,timegmt,value,filename
ABC,ABC, DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"lf_buf_20211101_005904.csv"

Does anyone know why it would throw this error? I believe according to the documentation, this is the correct timestamp format.

bfeeny
질문됨 2년 전3204회 조회
1개 답변
0
수락된 답변

It turns out, the documentation for Athena is either incorrect or at best misleading. The excellent answer by Alexandre says it best here https://stackoverflow.com/questions/52564194/athena-unable-to-parse-date-using-opencsvserde

Basically you need to store the date or the timestamp in UNIX Epoch time. You wouldn't know that, because of all the emphasis on the format of the time. I tried as a timestamp, and that is why I got this error. As soon as I stored it as UNIX time I got somewhere. However, the unix_timestamp() function only returns time in seconds (long) and timestamp wants time in milliseconds (double). So I simply multiplied by 1000:

df = df.withColumn("time", f.unix_timestamp("time", 'dd-MM-yyyy HH:mm:ss') * 1000) After doing this, you will have a 13 digit double, and Athena will properly produce a timestamp from it if you have selected Timestamp as the Data Type.

bfeeny
답변함 2년 전

로그인하지 않았습니다. 로그인해야 답변을 게시할 수 있습니다.

좋은 답변은 질문에 명확하게 답하고 건설적인 피드백을 제공하며 질문자의 전문적인 성장을 장려합니다.

질문 답변하기에 대한 가이드라인

관련 콘텐츠