2 Answers
1
Hi,
The Drop Duplicates feature is what you need to avoid duplicates in subsequent runs of the ETL job: https://docs.aws.amazon.com/glue/latest/dg/transforms-drop-duplicates.html
This saves you from writing custom code to do the same thing.
Best,
Didier
0
To prevent duplicate entries in your AWS Glue catalog table when using an ETL job to fetch data from MongoDB, follow these steps:
- Modify your AWS Glue ETL job to perform deduplication before writing data to the Glue catalog table. Load data from MongoDB into a temporary staging table in Glue, run deduplication (e.g., using Pandas or Spark) on the staging table to remove duplicates based on a unique identifier (such as a primary key), and write the deduplicated data from the staging table to the final Glue catalog table.
- Ensure your MongoDB data has a unique identifier (e.g., the `_id` field that MongoDB assigns to every document) and use it as the key that prevents duplicate entries in the Glue catalog table. When writing data, check whether a record with the same identifier already exists; if it does, update that record instead of creating a new one.
- Configure the Glue ETL job’s output options to handle duplicates. Set “Update Behavior” to “Update in Place” or “Overwrite Files” (if applicable), and specify the S3 path where deduplicated data should be stored.
- Schedule the Glue ETL job to run at intervals (e.g., daily or hourly) to keep the Glue catalog table up to date.
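As a rough illustration of the "update if the key exists, otherwise insert" step described above, here is a minimal sketch using Pandas. It is not a Glue-specific API: the function name `upsert_by_key`, the column names, and the sample rows are all made up for the example, and it assumes `_id` is the unique identifier. In a real job you would read the current target data, merge the new batch into it like this (or do the equivalent with Spark DataFrames), and write the merged result back.

```python
import pandas as pd


def upsert_by_key(existing: pd.DataFrame, incoming: pd.DataFrame,
                  key: str = "_id") -> pd.DataFrame:
    """Return `existing` with rows from `incoming` inserted or updated by `key`.

    Hypothetical helper for illustration only; not part of AWS Glue.
    """
    # Remove duplicates inside the new batch first, keeping the last occurrence.
    incoming = incoming.drop_duplicates(subset=[key], keep="last")
    # Stack old rows first and new rows second, so that keep="last"
    # prefers the incoming version of any key present in both.
    combined = pd.concat([existing, incoming], ignore_index=True)
    merged = combined.drop_duplicates(subset=[key], keep="last")
    return merged.sort_values(key).reset_index(drop=True)


# Assumed sample data: the target already holds "a" and "b";
# the new run brings an updated "b" and a new "c".
target = pd.DataFrame({"_id": ["a", "b"], "status": ["new", "new"]})
batch = pd.DataFrame({"_id": ["b", "c"], "status": ["shipped", "new"]})

result = upsert_by_key(target, batch)
print(result)
```

Running the same batch through `upsert_by_key` a second time leaves the row count unchanged, which is the behavior you want from repeated job runs.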
Hi Giovanni, thanks for the answer. Regarding your point 1: in the end I still need to write the data to my destination table, where all the entries get duplicated after each run. Also, my source has no duplicate entries. The main problem is that every time the ETL job runs, it appends entries to the target table instead of updating the ones that are already present. I have updated my question with a screenshot of my Visual ETL job.