DMS between two RDS instances, growing OldestReplicationSlotLag

Question

Hello.
I have a DMS task (full load + CDC) replicating only one table between two AWS RDS instances (source, target).
This table has a primary key, but the table is updated very rarely.

Once I start the DMS task, OldestReplicationSlotLag and TransactionLogsDiskUsage (source db) keep growing until I restart the DMS task. Despite that, changes to the table do show up on both source and target.

What could be the reason for this issue?

Answer

For ongoing replication (change data capture, or CDC), DMS reads changes from the source database's transaction logs. On PostgreSQL (which I assume your RDS instances are running, since you mention replication slots), DMS does this through a logical replication slot.

Here's what the two metrics you mentioned indicate:

1. OldestReplicationSlotLag Growing: This indicates that the replication slot used by DMS is not being advanced as expected. In PostgreSQL, a replication slot retains WAL (write-ahead log) segments until the client (in this case, DMS) confirms it has consumed them. If DMS doesn't confirm progress, the slot's position stays put, WAL keeps being retained, and the reported lag grows.

2. TransactionLogsDiskUsage Growing: This is directly related to the above issue. Since the logs aren't being consumed, they take up more and more disk space.
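To make the relationship between a stuck slot and retained WAL concrete, here is a minimal Python sketch. It relies on the fact that a PostgreSQL LSN such as `16/B374D848` is two hexadecimal numbers (the high and low 32 bits of a byte position in the WAL). The function names and the sample LSN values are invented for illustration; real values would come from `pg_replication_slots` and `pg_current_wal_lsn()` on the source.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between the slot's restart_lsn and the current WAL position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


# Hypothetical values: the slot's restart_lsn never moves, while writes to
# other tables keep pushing the instance's current WAL position forward.
restart_lsn = "0/5000000"
samples = ["0/5000000", "0/9000000", "1/1000000"]

lags = [retained_wal_bytes(lsn, restart_lsn) for lsn in samples]
for lsn, lag in zip(samples, lags):
    print(f"current={lsn} retained={lag / 1024 / 1024:.0f} MiB")
```

The retained amount grows even though nothing is written to the replicated table, which is the pattern both metrics are reporting.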

Here are some potential causes and solutions:

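1. DMS Isn't Advancing the Slot: This is the core issue. WAL is shared by the whole instance, so writes to other tables keep generating WAL even while the replicated table is idle. If the slot position only moves when DMS has changes to process, a rarely updated table can leave the slot pinned while WAL accumulates behind it. DMS has a WAL heartbeat feature for this case: enabling `HeartbeatEnable` on the PostgreSQL source endpoint makes DMS issue periodic dummy transactions so that an idle slot keeps advancing and old WAL is released (`HeartbeatFrequency` and `HeartbeatSchema` control the interval and the schema used for the heartbeat objects).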

2. Network Issues: Ensure there are no network bottlenecks or interruptions between the DMS replication instance and the RDS source instance.

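3. DMS Task Settings: Review the `ChangeProcessingTuning` section of your task settings. Large values for settings such as `BatchApplyTimeoutMin`, `BatchApplyTimeoutMax`, or `CommitTimeout` can delay how quickly changes are applied. Note that `BatchApplyPreserveTransaction` is a boolean rather than a numeric limit, and it only matters when batch apply is enabled.

4. Monitor DMS Metrics: Use CloudWatch to monitor DMS task metrics, especially `CDCLatencySource` and `CDCLatencyTarget`. The first shows how far DMS is behind in capturing changes from the source; the second shows how long captured changes wait before being committed on the target. Together they tell you whether the delay is on the capture side or the apply side.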

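5. Replication Slot: Ensure that the replication slot used by DMS is active. Sometimes, manual interventions or other issues might invalidate a slot. You can check the status of replication slots using the `pg_replication_slots` view in PostgreSQL; the `active`, `restart_lsn`, and `confirmed_flush_lsn` columns show whether a consumer is attached and how far it has progressed.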

6. Source DB Parameters: Ensure that `wal_level` is set to `logical` (on RDS for PostgreSQL this is done by setting `rds.logical_replication` to 1 in the DB parameter group), that `max_replication_slots` is appropriately configured, and that `max_wal_senders` is set to a number that accommodates DMS and any other replication processes you might have.

7. Manual Cleanup: As a temporary measure, you can manually drop the replication slot used by DMS and restart the task, which causes DMS to create a new slot. This releases the retained WAL but doesn't address the underlying issue. Be cautious: any changes that were retained only by the dropped slot can no longer be replayed, so the target may end up inconsistent unless the task is reloaded or restarted from a known-good point.

8. DMS Version: Ensure the replication instance is running the latest DMS engine version. AWS occasionally releases updates, and using the latest version can help avoid bugs.
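As a rough illustration of the slot check in item 5, the sketch below pairs a query against `pg_replication_slots` with a small Python helper that flags slots worth investigating. The query uses `pg_wal_lsn_diff` and `pg_current_wal_lsn`, which are the PostgreSQL 10+ function names. The helper name, the 1 GiB threshold, and the slot names in the sample rows are assumptions made up for this example; running the query itself needs a database driver, which is left out here.

```python
# SQL to run on the source instance (function names are for PostgreSQL 10+).
SLOT_QUERY = """
SELECT slot_name,
       active,
       restart_lsn,
       confirmed_flush_lsn,
       pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
FROM pg_replication_slots
ORDER BY retained_bytes DESC NULLS LAST;
"""


def flag_slots(rows, max_retained_bytes=1024 ** 3):
    """Return (slot_name, reason) pairs for slots that look unhealthy.

    rows: iterable of dicts keyed by the column names in SLOT_QUERY.
    max_retained_bytes: hypothetical alert threshold (1 GiB by default).
    """
    problems = []
    for row in rows:
        retained = row["retained_bytes"]
        if not row["active"]:
            problems.append((row["slot_name"], "inactive: no consumer attached"))
        elif retained is not None and retained > max_retained_bytes:
            problems.append((row["slot_name"], "active but retaining too much WAL"))
    return problems


# Hypothetical result set; the slot names are invented for illustration.
example_rows = [
    {"slot_name": "dms_slot_a", "active": True, "retained_bytes": 5 * 1024 ** 3},
    {"slot_name": "old_slot_b", "active": False, "retained_bytes": 200 * 1024 ** 2},
    {"slot_name": "healthy_slot_c", "active": True, "retained_bytes": 16 * 1024 ** 2},
]

for name, reason in flag_slots(example_rows):
    print(f"{name}: {reason}")
```

An active slot with a large and growing `retained_bytes` matches the symptom in the question (DMS is connected but the slot isn't advancing), while an inactive slot usually points to a leftover slot that nothing is reading.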
