AWS Glue Error - File already Exists


I have an AWS Glue job that reads from Redshift (schema_1) and writes back to Redshift (schema_2). This is done as follows:

Redshift_read = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "sampleQuery": sample_query,
        "redshiftTmpDir": tmp_dir,
        "useConnectionProperties": "true",
        "connectionName": "dodsprd_connection",
        "sse_kms_key" : "abc-fhrt-2345-8663",
    },
    transformation_ctx="Redshift_read",
)

Redshift_write = glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=Redshift_read,
    catalog_connection="dodsprd_connection",
    connection_options={
        "database": "dodsprod",
        "dbtable": "dw_replatform_stage.rt_order_line_cancellations_1",
        "preactions": pre_query,
        "postactions": post_query,
    },
    redshift_tmp_dir=tmp_dir,
    transformation_ctx="Redshift_write",
)

The "sample_query" is a normal SQL query with some business logic. When I run this Glue job, I get the following error:

An error occurred while calling o106.pyWriteDynamicFrame. File already exists:

When I run the same SQL query manually in SQL Workbench, I get the proper output. Can anyone please help me with this?

Joe
asked 2 months ago · 369 views
1 Answer

Since you are using Redshift, I suspect the error comes from the temporary files used for COPY/UNLOAD; check the stack trace to confirm.
A quick solution could be giving the read and the write different temporary subpaths.
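For example, the answer's suggestion could be sketched like this (the bucket and prefix names below are illustrative, not from the original job): derive separate temporary subpaths for the read and the write from one base path, so their COPY/UNLOAD files cannot land under the same S3 prefix.

```python
import uuid

def make_tmp_dirs(base_tmp_dir: str) -> tuple[str, str]:
    """Derive distinct temp subpaths for the read and the write so their
    COPY/UNLOAD spill files cannot collide under the same S3 prefix."""
    run_id = uuid.uuid4().hex  # unique per job run
    base = base_tmp_dir.rstrip("/")
    read_dir = f"{base}/{run_id}/read/"
    write_dir = f"{base}/{run_id}/write/"
    return read_dir, write_dir

# Illustrative base path; substitute your own bucket/prefix.
read_tmp, write_tmp = make_tmp_dirs("s3://my-bucket/temp/")
# Pass read_tmp as "redshiftTmpDir" in the read's connection_options,
# and write_tmp as redshift_tmp_dir on write_dynamic_frame.from_jdbc_conf.
```

Because the subpaths also include a per-run UUID, retried runs of the same job will not reuse leftover temp files from a failed attempt.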

AWS
EXPERT
answered 2 months ago
  • Can you please tell me how to check the stack trace?

    I also tried giving different temp subpaths, as below:
    Old path: s3://bjs-digital-dods-data-lake-processed-prod/temp/fact_order_line_returns/
    New path: s3://bjs-digital-dods-data-lake-processed-prod/temp/temp_test/fact_order_line_returns/
    Still, I get the same error.

  • The full error stack trace (including both Python and Scala) will be in the error log

  • I contacted AWS Support and we worked together on this issue, but could not fix it. The query I am using in this logic is very large, and for some reason, when the query is very large, I get this error. The support team told me that some internal node failure happens when the data is COPYed from S3 to Redshift, and that failure surfaces as the error I mentioned. I still have not found a real solution, but I did find a workaround: I write the output to an S3 file, and a Lambda loads it into Redshift through an event trigger. This is working smoothly.
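The commenter's workaround (write to S3, let an event-triggered Lambda load Redshift) could look roughly like the sketch below. It only builds the COPY statement such a Lambda might submit (for example via the Redshift Data API); the bucket, IAM role, and file format (Parquet) are assumptions for illustration, not details from the thread.

```python
def build_copy_sql(table: str, s3_path: str, iam_role: str) -> str:
    """Build the COPY statement a Lambda handler might run (e.g. via the
    Redshift Data API) when an S3 put event fires for the output file."""
    return (
        f"COPY {table} "
        f"FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        f"FORMAT AS PARQUET;"
    )

# Table name is from the original job; bucket and role are hypothetical.
sql = build_copy_sql(
    "dw_replatform_stage.rt_order_line_cancellations_1",
    "s3://my-bucket/output/fact_order_line_returns/",
    "arn:aws:iam::123456789012:role/redshift-copy-role",
)
```

Splitting the job this way moves the COPY out of the Glue driver entirely, which matches the commenter's observation that the failure happened inside Glue's own S3-to-Redshift copy step.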
