Duplicate entries in target Glue Data Catalog table using ETL


I am using an AWS Glue ETL job that fetches data from MongoDB and writes it to an AWS Glue Data Catalog table, but every time the job runs it creates duplicate entries. (If there are 1,000 entries in the source MongoDB collection, the first run creates 1,000 entries in the target Glue catalog table, and the second run creates another 1,000 entries.) I don't want duplicate entries in the target table; please help. Please find the screenshot of the Visual ETL job attached.

2 Answers

Hi,

The Drop Duplicates feature is what you need to avoid duplicates in subsequent runs of the ETL job: https://docs.aws.amazon.com/glue/latest/dg/transforms-drop-duplicates.html

This saves you from writing code to do the same.
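In script mode the transform boils down to dropping repeated rows before the write. Here is a plain-Python illustration of what the transform does conceptually (this is not the Glue API, just the underlying idea; the row data is made up):

```python
def drop_duplicates(rows):
    """Keep only the first occurrence of each identical row,
    mirroring what the Drop Duplicates transform does."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out

rows = [{"_id": 1, "x": "a"}, {"_id": 1, "x": "a"}, {"_id": 2, "x": "b"}]
drop_duplicates(rows)  # two distinct rows remain
```

In a real Glue job the equivalent operation runs on the DynamicFrame/DataFrame, so it scales beyond what a Python list can hold.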

Best,

Didier

Didier (AWS) — answered 3 months ago

To prevent duplicate entries in your AWS Glue catalog table when using an ETL job to fetch data from MongoDB, follow these steps:

  • Modify your AWS Glue ETL job to perform deduplication before writing data to the Glue catalog table. Load data from MongoDB into a temporary staging table in Glue, run deduplication (e.g., using Pandas or Spark) on the staging table to remove duplicates based on a unique identifier (such as a primary key), and write the deduplicated data from the staging table to the final Glue catalog table.
  • Ensure your MongoDB data has a unique identifier (e.g., a primary key). Use this identifier to prevent duplicate entries in the Glue catalog table. When writing data, check if a record with the same unique identifier exists. Update it instead of creating a new one.
  • Configure the Glue ETL job’s output options to handle duplicates. Set “Update Behavior” to “Update in Place” or “Overwrite Files” (if applicable), and specify the S3 path where deduplicated data should be stored.
  • Schedule the Glue ETL job to run at intervals (e.g., daily or hourly) to keep the Glue catalog table up to date.
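A minimal plain-Python sketch of the upsert logic behind the steps above (the field name `_id` is an assumption for the unique identifier; in an actual Glue job this merge would run on Spark DataFrames, not Python lists):

```python
def upsert(existing, incoming, key="_id"):
    """Merge incoming records into the existing rows: a row whose key
    already exists is updated in place, everything else is appended,
    so re-running the job never grows the table with duplicates."""
    merged = {row[key]: row for row in existing}
    for row in incoming:
        merged[row[key]] = row  # update-or-insert by unique identifier
    return list(merged.values())

# Running the same extract twice leaves the target unchanged:
source = [{"_id": 1, "name": "a"}, {"_id": 2, "name": "b"}]
target = upsert([], source)      # first run: 2 rows
target = upsert(target, source)  # second run: still 2 rows
```

The key design point is that the write is keyed on the identifier rather than blindly appended, which is exactly what an append-only sink cannot do on its own.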
answered 3 months ago
Sandeep — reviewed 3 months ago
  • Hi Giovanni, thanks for the answer, but regarding your first point: in the end I still need to write the data into my destination table, where all the entries get duplicated after each run. Also, my source has no duplicate entries; the main problem is that every time the ETL job runs it appends the entries instead of updating them in the target table when they are already present. I have updated my question with a screenshot of my Visual ETL job.
