I am trying to read data from S3, transform it and insert it into a table I have created in the Data Catalog.
My ETL job reads data with create_dynamic_frame, converts to a Spark DataFrame, does its transforms, converts back to a Dynamic DataFrame and then writes that frame to the Data Catalog.
I do the write to the Data Catalog table like so:
DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
frame = dynamic_df,
database = db_name,
table_name = tbl_name,
transformation_ctx = "DataCatalogtable_node2",
)
I manually defined my Data Catalog table
Column Name | Data Type |
---|
vid | string |
altid | string |
vtype | string |
time | timestamp |
timegmt | timestamp |
value | int |
filename | string |
This matches the schema of my Dynamic DataFrame write before I do the write operation
root
|-- vid: string
|-- altid: string
|-- vtype: string
|-- time: timestamp
|-- timegmt: timestamp
|-- value: int
|-- filename: string
When I attempt to look at the Data in Athena after the ETL job runs, I get errors, because stored in "time" the first value is called "time". My timestamps look like and I think they are in the correct format.
vid,altid,vtype,time,timegmt,value,filename
ABC,ABC,DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"s3://bpf-load-data/abc_data/lf_buf_20211101_005904.csv"
Originally (in s3) time is in the format 31-10-2021 22:00:00
and so in order to get it into the correct format that the Data Catalog/Athena use for timestamp
I am doing:
df = df.withColumn("time", to_timestamp("time", 'dd-MM-yyyy HH:mm:ss'))
I am not sure if this is correct. I get 2021-10-31 22:00:00.0
instead of what I would suspect which is 2021-10-31 22:00:00
In any case, in Athena when I attempt to read the table, the actually first Data in time
is the word time
and therefore it does not match timestamp and it all errors out.
HIVE_BAD_DATA: Error parsing field value 'TIME' for field 3: For input string: "TIME"
It is almost as if it is trying to insert the "header" of the dataframe as actual data.
Does anyone know what could be going on?