DMS: How does a replication task handle duplicate data w/o Primary/Unique keys on the table

Good afternoon!

I am trying to determine how DMS replication tasks (DB2 LUW -> Redshift) handle duplicate data when a Full-Load or Full-Load-and-CDC task is paused/resumed, or interrupted and restarted, and the target tables don't have Primary Keys or Unique Identifiers.

If a Full-Load is running to the target, and the job is paused and resumed, does it simply pick up at the same location it left off in the transaction? Is there a possibility of duplication at that point? If the replication task is cancelled and re-started at a later time, does it overwrite the data in the target table due to the full-load operation? Or is that all determined by task settings?

If a Full-Load-and-CDC task is running and is paused/resumed during the ongoing CDC replication, does it simply finish that CDC replication task from the point where it was paused, or is there still potential for duplication at that point? Also, if the same task is cancelled and re-started at a later time, does the Full-Load portion of the task overwrite the table contents, or is there a possibility for duplication?

I know that in all instances it would be optimal if I were able to add Primary Keys/Unique IDs to the tables in Redshift, but I'm curious how DMS handles target tables w/o those fields.

Thanks in advance! -mike

1 Answer

Hi Mike, I understand you want to know how DMS handles duplicate data.

You ask: if a Full Load is stopped/paused and then resumed, what happens? A full load is not designed to resume mid-table. Each table either finishes loading in one pass or fails, so any table that was in flight when the task stopped is loaded again from the beginning on restart. Note that the restart is per table: tables that already completed are not reloaded, but a partially loaded table starts over. This follows from the documentation: "When using a single AZ or Multi-AZ replication instance during a FULL LOAD and a failover or host replacement occurs, the full load task is expected to fail. You can restart the task from the point of failure for the remaining tables that didn't complete, or are in an error state." The doc is here: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.FullLoad.html
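The resume-vs-restart distinction surfaces in the StartReplicationTaskType parameter of the StartReplicationTask API. A minimal sketch of the three start types, building the request parameters only (no AWS call is made, and the task ARN is a placeholder, not a real resource):

```python
# Sketch of the StartReplicationTask start types. "start_params" is an
# illustrative helper, not a DMS API; the ARN below is a placeholder.

def start_params(task_arn, how):
    """Build the parameters you'd pass to the StartReplicationTask API."""
    allowed = {"start-replication", "resume-processing", "reload-target"}
    if how not in allowed:
        raise ValueError(f"unknown start type: {how}")
    return {"ReplicationTaskArn": task_arn, "StartReplicationTaskType": how}

# resume-processing continues from the last checkpoint (CDC);
# reload-target reruns the full load from scratch.
params = start_params("arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",
                      "resume-processing")
print(params["StartReplicationTaskType"])   # resume-processing
```
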

With CDC it is different: when you stop or pause the task, DMS sets a checkpoint, and when you resume, replication continues from that exact position in the transaction log, so already-applied changes are not replayed. See: https://aws.amazon.com/about-aws/whats-new/2018/06/aws-dms-can-start-replication-anywhere-in-a-transaction-log/
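A toy model of that checkpoint behavior, showing why a paused-and-resumed CDC stream does not re-apply changes. The integer "LSN" and event shape are hypothetical; real DMS checkpoints are opaque strings tied to the source's transaction log:

```python
# Minimal sketch of checkpoint-based resume. Everything here is a toy
# model; it is not the DMS implementation.

def apply_changes(events, target, checkpoint):
    """Apply only events past the saved checkpoint; return the new checkpoint."""
    for lsn, row in events:
        if lsn <= checkpoint:        # already applied before the pause
            continue
        target.append(row)
        checkpoint = lsn             # advance the checkpoint as we go
    return checkpoint

target = []
events = [(1, "a"), (2, "b"), (3, "c"), (4, "d")]

# First run is interrupted after applying LSNs 1-2.
ckpt = apply_changes(events[:2], target, checkpoint=0)

# On resume, the reader re-scans the log but skips everything at or before
# the checkpoint, so rows "a" and "b" are not applied a second time.
ckpt = apply_changes(events, target, ckpt)
print(target)   # ['a', 'b', 'c', 'd'] -- no duplicates
```
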

Now that that's clear: a DMS task has a full-load setting, TargetTablePrepMode, that controls how it prepares the target tables before loading data. The options are:

DO_NOTHING – Data and metadata of the existing target table aren't affected.

DROP_AND_CREATE – The existing table is dropped and a new table is created in its place.

TRUNCATE_BEFORE_LOAD – Data is truncated without affecting the table metadata.

If a DMS task fails and restarts with DROP_AND_CREATE or TRUNCATE_BEFORE_LOAD, the full load will not produce duplicate rows, because the target table is emptied (or recreated) before the load runs again. Only DO_NOTHING can leave duplicate data after a pause/restart, since the reloaded rows are appended to whatever is already in the table. These settings are documented here: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.FullLoad.html
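A toy model of the three preparation modes, showing why only DO_NOTHING can leave duplicates when a full load reruns against a keyless table. The function is illustrative, not a DMS API:

```python
# Sketch: what each TargetTablePrepMode means for a restarted full load
# against a table with no primary key or unique index.

def full_load(source_rows, target, prep_mode):
    if prep_mode in ("DROP_AND_CREATE", "TRUNCATE_BEFORE_LOAD"):
        target.clear()               # table is recreated or truncated first
    target.extend(source_rows)       # keyless target: rows are just appended
    return target

source = [("order-1", 100), ("order-2", 200)]

# Task is interrupted and restarted, so the load runs twice.
do_nothing = full_load(source, full_load(source, [], "DO_NOTHING"),
                       "DO_NOTHING")
truncated = full_load(source, full_load(source, [], "TRUNCATE_BEFORE_LOAD"),
                      "TRUNCATE_BEFORE_LOAD")

print(len(do_nothing))   # 4 -- every row loaded twice
print(len(truncated))    # 2 -- second load starts from an empty table
```
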

Lastly, Mike, on primary keys: DMS uses them to apply changes to the target more efficiently, and it is the constraint itself, not DMS, that guarantees uniqueness: if a task tries to insert a duplicate record, the target database rejects it as a primary-key violation. Note, though, that Redshift treats primary-key and unique constraints as informational only and does not enforce them, so on a Redshift target the reliable way to avoid duplicate data is to use DROP_AND_CREATE or TRUNCATE_BEFORE_LOAD as the task's target table preparation mode.

You can also set the error-handling behavior of your replication task during change data capture (CDC) using the following setting:

FullLoadIgnoreConflicts – Set this option to true to have AWS DMS ignore "zero rows affected" and "duplicates" errors when applying cached events. If set to false, AWS DMS reports all errors instead of ignoring them. The default is true.
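Putting the two knobs together, here is roughly where they live in the task-settings JSON (the field names follow the DMS task-settings documentation; the values are just examples, and the surrounding sections of a real settings document are omitted):

```python
import json

# Sketch of the relevant fragment of a DMS task-settings document.
task_settings = {
    "FullLoadSettings": {
        # or "DROP_AND_CREATE" / "DO_NOTHING"
        "TargetTablePrepMode": "TRUNCATE_BEFORE_LOAD",
    },
    "ErrorBehavior": {
        # ignore "zero rows affected"/"duplicates" errors on cached events
        "FullLoadIgnoreConflicts": True,
    },
}

# This JSON string is what you would supply as the task settings, e.g. via
# the --replication-task-settings flag of modify-replication-task.
print(json.dumps(task_settings, indent=2))
```
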

Duplicate records occur on a target table without a primary key.

A primary key is required to uniquely identify each row. Running a full load and CDC task can create duplicate records on target tables that lack a primary key or unique index. To avoid duplicating records during full load and CDC, make sure the target tables have a primary key or unique index.
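If duplicates do land in a keyless target, the only handle you have is the full row contents. A sketch of exact-duplicate removal (the rough SQL analogue on Redshift would be creating a new table from SELECT DISTINCT and swapping it in; the helper name here is hypothetical):

```python
# Sketch: cleaning up exact duplicates when there is no key to match on.

def dedupe_keyless(rows):
    """Drop exact-duplicate rows while preserving first-seen order."""
    seen = set()
    unique = []
    for row in rows:
        if row not in seen:          # whole row is the only identity we have
            seen.add(row)
            unique.append(row)
    return unique

rows = [("order-1", 100), ("order-2", 200), ("order-1", 100)]
print(dedupe_keyless(rows))   # [('order-1', 100), ('order-2', 200)]
```

Note this only removes rows that are identical in every column; rows that differ in any field (e.g. a changed amount replicated twice) cannot be told apart without a key.
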

answered 2 years ago
  • Thanks for the response! I've got a couple questions, but the first is in regards to this comment:

    "A full load is not designed to resume mid-table ... You can restart the task from the point of failure for the remaining tables that didn't complete, or are in an error state." (https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TaskSettings.FullLoad.html)

    Doesn't that statement contradict itself? You say that a Full Load Task has only one of two states - completion or failure, with no resume capability. Yet in the quote, it says "You can restart the task from the point of failure for the remaining tables that didn't complete, or are in an error state."

    Any chance of clarification on that?
