HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"

0

I am reading files from S3 and using a Glue ETL job to populate a Data Catalog Table.

My S3 files look like this:

VID,ALTID,VTYPE,TIME,TIMEGMT,VALUE
ABC, ABC, DATAFEED,31-10-2021 22:00:00,01-11-2021 02:00:00,   11775.685

The scheme read in looks like this

root
|-- VID: string
|-- ALTID: string
|-- VTYPE: string
|-- TIME: string
|-- TIMEGMT: string
|-- VALUE: string

I am changing fields 3 and 4 from "strings" to timestamps, which matches the schema of my Data Catalog Table. I am also doing a few other transformations.

I am transforming them like so:

df = df.withColumn("time", to_timestamp("time", 'dd-MM-yyyy HH:mm:ss'))
df = df.withColumn("timegmt", to_timestamp("timegmt", 'dd-MM-yyyy HH:mm:ss'))

When I try to read the data with Athena, I get an error:

HIVE_BAD_DATA: Error parsing field value '2021-10-31 22:00:00.0' for field 3: For input string: "2021-10-31 22:00:00.0"

The Data Catalog Table Schema looks like this:

Column NameData Type
vidstring
altidstring
vtypestring
timetimestamp
timegmttimestamp
valueint
filenamestring

And the line in the run-1647806179090-part-r-00000 file it is choking on looks like this:

vid,altid,vtype,time,timegmt,value,filename
ABC,ABC, DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"lf_buf_20211101_005904.csv"

Does anyone know why it would throw this error? I believe according to the documentation, this is the correct timestamp format.

bfeeny
gefragt vor 2 Jahren3204 Aufrufe
1 Antwort
0
Akzeptierte Antwort

It turns out, the documentation for Athena is either incorrect or at best misleading. The excellent answer by Alexandre says it best here https://stackoverflow.com/questions/52564194/athena-unable-to-parse-date-using-opencsvserde

Basically you need to store the date or the timestamp in UNIX Epoch time. You wouldn't know that, because of all the emphasis on the format of the time. I tried as a timestamp, and that is why I got this error. As soon as I stored it as UNIX time I got somewhere. However, the unix_timestamp() function only returns time in seconds (long) and timestamp wants time in milliseconds (double). So I simply multiplied by 1000:

df = df.withColumn("time", f.unix_timestamp("time", 'dd-MM-yyyy HH:mm:ss') * 1000) After doing this, you will have a 13 digit double, and Athena will properly produce a timestamp from it if you have selected Timestamp as the Data Type.

bfeeny
beantwortet vor 2 Jahren

Du bist nicht angemeldet. Anmelden um eine Antwort zu veröffentlichen.

Eine gute Antwort beantwortet die Frage klar, gibt konstruktives Feedback und fördert die berufliche Weiterentwicklung des Fragenstellers.

Richtlinien für die Beantwortung von Fragen