By using AWS re:Post, you agree to the AWS re:Post Terms of Use

I am trying to write an ETL job to the Data Catalog but its writing the Headers as Data

0

I am trying to read data from S3, transform it and insert it into a table I have created in the Data Catalog. My ETL job reads data with create_dynamic_frame, converts to a Spark DataFrame, does its transforms, converts back to a Dynamic DataFrame and then writes that frame to the Data Catalog.

I do the write to the Data Catalog table like so:

DataCatalogtable_node2 = glueContext.write_dynamic_frame.from_catalog(
    frame = dynamic_df,
    database = db_name,
    table_name = tbl_name,
    transformation_ctx = "DataCatalogtable_node2",
)

I manually defined my Data Catalog table

Column NameData Type
vidstring
altidstring
vtypestring
timetimestamp
timegmttimestamp
valueint
filenamestring

This matches the schema of my Dynamic DataFrame write before I do the write operation

root
|-- vid: string
|-- altid: string
|-- vtype: string
|-- time: timestamp
|-- timegmt: timestamp
|-- value: int
|-- filename: string

When I attempt to look at the Data in Athena after the ETL job runs, I get errors, because stored in "time" the first value is called "time". My timestamps look like and I think they are in the correct format.

vid,altid,vtype,time,timegmt,value,filename
ABC,ABC,DATAFEED,"2021-10-31 22:00:00.0","2021-11-01 02:00:00.0",11775,"s3://bpf-load-data/abc_data/lf_buf_20211101_005904.csv"

Originally (in s3) time is in the format 31-10-2021 22:00:00 and so in order to get it into the correct format that the Data Catalog/Athena use for timestamp I am doing:

df = df.withColumn("time", to_timestamp("time", 'dd-MM-yyyy HH:mm:ss'))

I am not sure if this is correct. I get 2021-10-31 22:00:00.0 instead of what I would suspect which is 2021-10-31 22:00:00

In any case, in Athena when I attempt to read the table, the actually first Data in time is the word time and therefore it does not match timestamp and it all errors out.

HIVE_BAD_DATA: Error parsing field value 'TIME' for field 3: For input string: "TIME"

It is almost as if it is trying to insert the "header" of the dataframe as actual data.

Does anyone know what could be going on?

asked 3 years ago3.4K views
1 Answer
1
Accepted Answer

I was able to figure this out. I was creating my Data Catalog table manually, and I needed to add the table property "skip.header.line.count = 1".

answered 3 years ago
profile picture
EXPERT
reviewed 8 months ago
AWS
EXPERT
reviewed 3 years ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions