I want to troubleshoot the restart or failover of my AWS Database Migration Service (AWS DMS) replication instance.
Short description
An AWS DMS replication instance automatically restarts for the following reasons:
- There's an infrastructure issue with the primary instance, such as loss of network connectivity, a compute unit issue, or a storage issue.
- The instance class type changed as a result of a vertical scaling activity.
- There's a software patch in progress on the host of the instance during a specific maintenance window. For more information, see Working with replication engine versions.
- You used the Reboot or Reboot with planned failover options to issue a manual reboot of the instance.
When the replication instance experiences issues and fails to respond to AWS DMS health checks, AWS DMS automatically initiates a recovery or a failover. For Single-AZ deployments, AWS DMS initiates a recovery. For Multi-AZ deployments, AWS DMS initiates a failover. AWS DMS then restarts the replication instance. After the restart completes, you can manually resume the database migration tasks.
Resolution
Review AWS DMS events to identify the root cause
To identify the cause of the restart or failover of your instance, view the AWS DMS events for the last 24 hours. Open the AWS DMS console, and choose Events.
Note: By default, AWS DMS registers events in the UTC time zone.
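The same 24-hour view can also be pulled programmatically. The following is a minimal sketch, assuming the AWS SDK for Python (Boto3) and its DMS describe_events operation. The function name recent_instance_events and the instance identifier are hypothetical, and the client is passed in so that the function itself has no hard dependency on Boto3.

```python
from datetime import timezone


def recent_instance_events(dms_client, instance_id, minutes=1440):
    """Return sorted (UTC timestamp, message) pairs for one replication instance.

    dms_client is assumed to be a Boto3 DMS client, for example
    boto3.client("dms"). instance_id is a hypothetical identifier.
    """
    paginator = dms_client.get_paginator("describe_events")
    pages = paginator.paginate(
        SourceIdentifier=instance_id,
        SourceType="replication-instance",
        Duration=minutes,  # look-back window in minutes; 1440 = 24 hours
    )
    events = []
    for page in pages:
        for event in page.get("Events", []):
            # Normalize to UTC so the output lines up with the console view.
            when = event["Date"].astimezone(timezone.utc)
            events.append((when.isoformat(), event["Message"]))
    return sorted(events)
```

To use it, you might call recent_instance_events(boto3.client("dms"), "my-replication-instance") and print each pair. Treat it as a starting point and not a complete tool.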
To retain events for a longer period, send the AWS DMS events to Amazon EventBridge. For more information, see Implement an automated approach for handling AWS DMS operational events.
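As a rough illustration of the EventBridge side, the sketch below builds the arguments for a rule that matches every event whose source is aws.dms. The helper name dms_event_rule_kwargs and the rule name are hypothetical, and the broad pattern is an assumption chosen for simplicity. A narrower pattern might suit your use case better.

```python
import json


def dms_event_rule_kwargs(rule_name):
    """Build keyword arguments for an EventBridge put_rule call.

    The pattern matches all events emitted with the source "aws.dms".
    rule_name is a hypothetical name chosen by the caller.
    """
    pattern = {"source": ["aws.dms"]}
    return {
        "Name": rule_name,
        "EventPattern": json.dumps(pattern),
        "State": "ENABLED",
    }
```

With Boto3, these arguments could be passed as boto3.client("events").put_rule(**dms_event_rule_kwargs("dms-events")). The rule alone doesn't store anything. You must also attach a target, such as a log group or a queue, with a separate put_targets call.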
If you see the event message "Replication instance patched", then there was an engine version upgrade to the replication instance. An upgrade can occur immediately after instance modification, or during your scheduled maintenance window.
If the instance class type changes, then you see the event message "The replication instance class for this replication instance is being changed" or "The replication instance class for this replication instance has changed". Single-AZ deployments are unavailable for a few minutes during a scaling operation. Multi-AZ deployments are unavailable only for the duration of the failover, which usually takes 60 seconds. AWS DMS upgrades the standby instance before it fails over to the newly sized instance.
You might see the event messages "Multi-AZ instance failover started" or "Multi-AZ instance failover completed" for the following reasons:
- The primary replication instance is unresponsive.
- The instance was manually rebooted with the options Reboot or Reboot with planned failover.
- The replication instance experiences intermittent network issues with the underlying host.
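The mapping from event message to likely cause described in this section can be sketched as a small lookup. This is an illustrative helper only. The function name likely_cause and the cause labels are hypothetical, and the matching relies on the message wording quoted in this article.

```python
# Ordered (substring, cause) pairs based on the event messages in this article.
CAUSES = [
    ("replication instance patched", "engine version upgrade"),
    ("replication instance class", "instance class change (vertical scaling)"),
    ("failover", "Multi-AZ failover"),
]


def likely_cause(message):
    """Return a coarse cause label for an AWS DMS event message."""
    text = message.lower()
    for needle, cause in CAUSES:
        if needle in text:
            return cause
    return "unclassified"
```

A failover message alone doesn't tell you whether the trigger was an unresponsive primary, a manual reboot, or a network issue, so you still need to correlate the timestamp with your own actions and other events.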
Monitor AWS DMS metrics with the enhanced monitoring dashboard
AWS DMS delivers metrics from the enhanced monitoring dashboard to Amazon CloudWatch Logs. View the Replication instance log for performance, resource utilization, and health metrics.
Note: AWS DMS serverless replications don't support enhanced monitoring.
Turn on Multi-AZ deployments to reduce downtime
To reduce downtime, turn on Multi-AZ deployments. In a Multi-AZ deployment, a standby replica of the replication instance is available in a different Availability Zone. For more information, see Resilience in AWS Database Migration Service.
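If you prefer to script this change, the following minimal sketch assumes a Boto3 DMS client and its modify_replication_instance operation. The function name enable_multi_az is hypothetical, and the ARN is supplied by the caller.

```python
def enable_multi_az(dms_client, instance_arn, apply_immediately=False):
    """Request a Multi-AZ deployment for an existing replication instance.

    dms_client is assumed to be a Boto3 DMS client, for example
    boto3.client("dms"). Returns the instance description from the response.
    """
    response = dms_client.modify_replication_instance(
        ReplicationInstanceArn=instance_arn,
        MultiAZ=True,
        # When False, the change waits for the next maintenance window.
        ApplyImmediately=apply_immediately,
    )
    return response["ReplicationInstance"]
```

The default here defers the change to the maintenance window, because modifying a running instance can interrupt tasks. Pass apply_immediately=True only when a brief interruption is acceptable.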
Note: For instances that use Amazon Simple Storage Service (Amazon S3) as a target, AWS DMS might write duplicate records to your S3 bucket. This occurs when you resume your task after a restart or failover and the TargetTablePrepMode is set to DO_NOTHING.
Related information
Best practices for AWS Database Migration Service
Working with an AWS DMS replication instance