Read the input file name from S3 in AWS Glue and write it into Redshift


I am reading multiple files from S3 and writing the output to a Redshift database. Below is the code that reads all the files from an S3 location (s3://abc/oms/YFS_CATEGORY_ITEM/):

yfs_category_item_df = glueContext.create_dynamic_frame.from_options(
    format_options={},
    connection_type="s3",
    format="parquet",
    connection_options={
        # "paths" takes a list of S3 locations; every file under the prefix is read
        "paths": [
            # "s3://" + args['s3_bucket'] + "/" + args['s3_key']
            "s3://abc/oms/YFS_CATEGORY_ITEM/"
        ],
        "recurse": True,
    },
    transformation_ctx="yfs_category_item_df",
)

Below is my code to write the data to Redshift:

RedshiftCluster_node3 = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=yfs_category_item_df,
    catalog_connection="dodsprd_connection",
    connection_options={
        "database": "dodsprod",
        "dbtable": "dw_replatform_stage.rt_category_item",
        "preactions": pre_query,
        "postactions": post_query,
    },
    redshift_tmp_dir=tmp_dir,
    transformation_ctx="upsert_to_redshift",
)

All of this works as expected. My new requirement is to capture the input file name (the file being read from S3) and insert it into my target table as a new column. Can someone please help me with this? I am not using a crawler or Glue tables anywhere in the script. I also tried input_file_name(), but that does not work with the JDBC option and seems to work only when a crawler/Glue table is used. I also tried format_options, which is not supported by from_jdbc_conf either. Any help, please?

2 Answers

The sink being JDBC has no impact on what you are trying to do (by the way, for Redshift it would be better to use the Redshift connector). You could show() the DataFrame before writing: if the filename column is there, it will be written like any other column.
You could use the attachFilename option on the source (see https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format.html#aws-glue-programming-etl-format-shared-reference), or convert to a DataFrame, call the function, and then convert back to a DynamicFrame, as sketched below (input_file_name() is not an "option", it's a SQL function).
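
A minimal sketch of that DataFrame round trip (the column name source_file is just an example):

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql.functions import input_file_name

# Convert to a Spark DataFrame, attach the source file path as a new column,
# then convert back so the existing JDBC write keeps working unchanged.
df = yfs_category_item_df.toDF().withColumn("source_file", input_file_name())
yfs_category_item_df = DynamicFrame.fromDF(df, glueContext, "yfs_category_item_df")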

AWS EXPERT
answered 5 months ago
Reviewed by an AWS SUPPORT ENGINEER 5 months ago
  • Thanks for your input. "attachFilename" cannot be added when we are using the JDBC method to write to Redshift; it threw an error. You are right, input_file_name() is a SQL function. The way I worded my statement was misleading: when I said I "tried using the input_file_name() option", I meant that I tried this function as well and it did not work.


Just a small tweak is needed to get the filename:

yfs_category_item_df = glueContext.create_dynamic_frame.from_options(
    format_options={"attachFilename": "your_filename_column_name"},
    connection_type="s3",
    format="parquet",
    connection_options={
        "paths": [
            # "s3://" + args['s3_bucket'] + "/" + args['s3_key']
            "s3://abc/oms/YFS_CATEGORY_ITEM/"
        ],
        "recurse": True,
    },
    transformation_ctx="yfs_category_item_df",
)

Change the value of attachFilename in format_options to whatever column name you want; a quick way to confirm it worked is shown below.
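
A small sanity-check sketch, assuming the column name used above:

# The attached column should appear in the schema and hold the full s3:// path
# of the file each record was read from.
yfs_category_item_df.toDF().select("your_filename_column_name").show(5, truncate=False)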

Don_D (AWS)
answered 5 months ago
  • I tried this as well, but it did not work; I got a blank value. I read somewhere that it works only when a crawler/Glue table is used. Anyway, I found a workaround. In the code you mentioned above, I am using args['s3_key']. That value comes from Lambda for me; I pass it to a variable and use that variable in my post query when calling write_dynamic_frame.from_jdbc_conf. Thanks for your inputs.
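
A hypothetical sketch of that workaround (only s3_key comes from the thread; the input_file_name column and the shape of the UPDATE are assumptions):

import sys
from awsglue.utils import getResolvedOptions

# s3_key arrives from Lambda as a job argument
args = getResolvedOptions(sys.argv, ['s3_bucket', 's3_key'])
file_name = args['s3_key']

# Stamp the freshly loaded rows with the file name after the load; assumes
# new rows are identifiable, here by a NULL in the hypothetical column.
post_query = f"""
    UPDATE dw_replatform_stage.rt_category_item
    SET input_file_name = '{file_name}'
    WHERE input_file_name IS NULL;
"""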
