
dropDuplicates transform in AWS Glue ETL does not delete records in Redshift table


I've created a Glue ETL job with a Redshift table as the source node and a dropDuplicates transform over some specific keys (not the entire row, since each row carries its own created timestamp). I understand there is no need for a target node unless you want to store the transformed data elsewhere. The job runs successfully, but no rows are deleted.

Is this expected for Redshift? Can I use a DataBrew recipe instead?

1 Answer
Accepted Answer

The behavior you're experiencing with the dropDuplicates transform in AWS Glue ETL not deleting records in your Redshift table is expected. There are a few important points to understand about this situation:

  1. The dropDuplicates transform in AWS Glue works on the data in memory during the ETL process. It doesn't directly modify the source data in your Redshift table.

  2. When you use dropDuplicates without a target node, you're essentially just transforming the data in memory without writing the results back to any destination. This is why you don't see any changes in your Redshift table.

  3. To actually remove duplicates from your Redshift table, you would need to add a target node to your Glue job. This target node would write the deduplicated data back to Redshift, either to the same table or to a new one (a script-level sketch of what this looks like follows this list).

  4. When using dropDuplicates, it's important to note that the transform is case-sensitive and treats all values as strings. Also, when duplicates are found, the first occurrence is kept and subsequent duplicates are dropped.

  5. If you're specifying keys for deduplication (as you mentioned), all fields will still be kept in the resulting dataframe, even if they weren't part of the specified keys.
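To make point 3 concrete, here is a minimal sketch of what the generated job roughly amounts to at the script level. It is not your job's actual code: the catalog database, table name, connection name, key columns, and S3 temp path below are all placeholders, and it assumes the Redshift table has already been crawled into the Data Catalog.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source table from Redshift via the Data Catalog
# (placeholder database/table/temp-dir values).
source = glue_context.create_dynamic_frame.from_catalog(
    database="my_catalog_db",
    table_name="public_orders",
    redshift_tmp_dir="s3://my-temp-bucket/glue-temp/",
)

# Deduplicate in memory on specific key columns. Spark's dropDuplicates
# keeps the first occurrence it encounters for each key combination.
deduped_df = source.toDF().dropDuplicates(["order_id", "customer_id"])
deduped = DynamicFrame.fromDF(deduped_df, glue_context, "deduped")

# Without a write step like this, the deduplicated data never leaves the
# job, which is why the source table appears untouched. Writing to a new
# table (or adding a preaction that truncates/replaces the original) is
# what actually changes what is stored in Redshift.
glue_context.write_dynamic_frame.from_jdbc_conf(
    frame=deduped,
    catalog_connection="my-redshift-connection",
    connection_options={"dbtable": "public.orders_deduped", "database": "dev"},
    redshift_tmp_dir="s3://my-temp-bucket/glue-temp/",
)

job.commit()
```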

Regarding your question about using AWS Glue DataBrew instead:

Yes, you can use an AWS Glue DataBrew recipe as an alternative approach. DataBrew provides a visual interface for data preparation tasks, including deduplication. However, keep in mind that DataBrew, like Glue ETL, doesn't directly modify your source data. You would still need to write the transformed data back to your Redshift table or to a new destination.

To effectively remove duplicates from your Redshift table, you have a few options:

  1. Modify your Glue job to include a target node that writes the deduplicated data back to Redshift.
  2. Use DataBrew to prepare a recipe for deduplication and then write the results back to Redshift.
  3. Consider using Redshift's own SQL commands for deduplication if that's more suitable for your use case (see the sketch below).
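For option 3, the deduplication can be done entirely inside Redshift with a window function and then run from the query editor, or from Python via the Redshift Data API as sketched below. The cluster identifier, database, schema, table, and column names are all placeholders for illustration.

```python
import boto3

# Hypothetical names throughout -- replace the schema, table, key columns,
# and timestamp column with your own.
DEDUPE_SQL = """
CREATE TABLE analytics.orders_deduped AS
SELECT order_id, customer_id, status, created_at
FROM (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id, customer_id   -- the duplicate-key columns
               ORDER BY created_at DESC             -- keep the newest copy
           ) AS rn
    FROM analytics.orders
) t
WHERE rn = 1;
"""

client = boto3.client("redshift-data")
resp = client.execute_statement(
    ClusterIdentifier="my-cluster",   # use WorkgroupName=... for Redshift Serverless
    Database="dev",
    DbUser="awsuser",
    Sql=DEDUPE_SQL,
)
# Poll describe_statement(Id=resp["Id"]) to confirm the statement finished,
# then swap or rename tables once you have validated the results.
```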

Remember, when working with large datasets in Redshift, it's important to consider performance implications of your deduplication strategy.
Sources
Using Drop Duplicates - AWS Glue
CodeGenConfigurationNode - AWS Glue
Duplicate entries in target Glue data Catalog table using ETL | AWS re:Post

answered 7 months ago

