The behavior you're seeing, where the dropDuplicates transform in your AWS Glue ETL job does not delete records from your Redshift table, is expected. There are a few important points to understand:
- The dropDuplicates transform operates on the data in memory during the ETL run. It does not modify the source data in your Redshift table.
- When you use dropDuplicates without a target node, you are only transforming the data in memory, never writing the result to any destination. This is why you see no changes in your Redshift table.
- To actually remove duplicates from your Redshift table, you need to add a target node to your Glue job that writes the deduplicated data back to Redshift, either to the same table or to a new one (see the sketch after this list).
- Note that the transform is case-sensitive and treats all values as strings. When duplicates are found, the first occurrence is kept and subsequent occurrences are dropped.
- If you specify keys for deduplication (as you mentioned), all fields are still kept in the resulting DataFrame, including the ones that were not part of the specified keys.
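For illustration, here is a minimal sketch of what such a job script could look like in PySpark. The Glue connection name (`redshift-conn`), table names, key columns, and S3 temp path are hypothetical placeholders; substitute the values from your environment.

```python
import sys

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Redshift table into memory; the source table itself is untouched.
source = glue_context.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "redshift-conn",            # hypothetical Glue connection
        "dbtable": "public.my_table",                 # hypothetical source table
        "redshiftTmpDir": "s3://my-bucket/redshift-tmp/",
    },
)

# Deduplicate on chosen key columns; non-key columns are retained in the result.
deduped_df = source.toDF().dropDuplicates(["order_id", "line_id"])
deduped = DynamicFrame.fromDF(deduped_df, glue_context, "deduped")

# Target node: write the deduplicated rows back to Redshift. Writing to a
# separate table (truncated first via "preactions") avoids re-appending rows.
glue_context.write_dynamic_frame.from_options(
    frame=deduped,
    connection_type="redshift",
    connection_options={
        "useConnectionProperties": "true",
        "connectionName": "redshift-conn",
        "dbtable": "public.my_table_deduped",         # hypothetical target table
        "redshiftTmpDir": "s3://my-bucket/redshift-tmp/",
        "preactions": "TRUNCATE TABLE public.my_table_deduped;",
    },
)

job.commit()
```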
Regarding your question about using AWS Glue DataBrew instead:
Yes, you can use an AWS Glue DataBrew recipe as an alternative approach. DataBrew provides a visual interface for data preparation tasks, including deduplication. However, keep in mind that DataBrew, like Glue ETL, doesn't directly modify your source data. You would still need to write the transformed data back to your Redshift table or to a new destination.
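If you go the DataBrew route, the recipe job's output destination is part of the job definition, so you configure where the deduplicated data lands when you create the job. As a rough sketch, assuming you have already created a recipe job (the name `dedupe-my-table-job` below is hypothetical), you could trigger a run with boto3:

```python
import boto3

databrew = boto3.client("databrew")

# Start a run of an existing DataBrew recipe job. The output location
# is defined on the job itself, not passed to this call.
run = databrew.start_job_run(Name="dedupe-my-table-job")
print("Started DataBrew job run:", run["RunId"])
```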
To effectively remove duplicates from your Redshift table, you have a few options:
- Modify your Glue job to include a target node that writes the deduplicated data back to Redshift.
- Use DataBrew to prepare a recipe for deduplication and then write the results back to Redshift.
- Consider using Redshift's own SQL commands for deduplication if that's more suitable for your use case.
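For the third option, a common Redshift pattern is to build a deduplicated copy of the table with SELECT DISTINCT and swap it in for the original. A rough sketch using the redshift_connector package follows; the connection details and table names are placeholders, and you could equally run the same SQL from any client:

```python
import redshift_connector

# Placeholder connection details -- replace with your cluster's values.
conn = redshift_connector.connect(
    host="my-cluster.example.us-east-1.redshift.amazonaws.com",
    database="dev",
    user="awsuser",
    password="REPLACE_ME",
)
cursor = conn.cursor()

# Build a deduplicated copy of the table, then swap it in for the original.
# Table names are hypothetical; all statements run in one transaction.
cursor.execute(
    "CREATE TABLE public.my_table_dedup AS SELECT DISTINCT * FROM public.my_table"
)
cursor.execute("ALTER TABLE public.my_table RENAME TO my_table_old")
cursor.execute("ALTER TABLE public.my_table_dedup RENAME TO my_table")
cursor.execute("DROP TABLE public.my_table_old")

conn.commit()
conn.close()
```

Note that SELECT DISTINCT only removes rows that are identical across all columns; if your duplicates match only on key columns, a ROW_NUMBER()-based variant would be needed instead.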
Remember, when working with large datasets in Redshift, it's important to consider the performance implications of your deduplication strategy.
Sources
- Using Drop Duplicates - AWS Glue
- CodeGenConfigurationNode - AWS Glue
- Duplicate entries in target Glue data Catalog table using ETL | AWS re:Post
