How can I troubleshoot replication lag or a backlog on my Windows source server for Application Migration Service?

I see a lag or backlog in my Windows source server when replicating data using AWS Application Migration Service.

Short description

The following are factors that contribute to replication lag and backlog when replicating data from a source server to a target server:

  • Network uplink speed and bandwidth availability: The network connection speed between the source server and the replication server might have a significant impact on replication performance. Slow connections might prevent the replication process from completing. Also, limited bandwidth restricts the amount of data that you can replicate in a given time.
  • Changes in disk while replicating: During the replication process, the source server might continue to write new data to its disks. If there's a large spike in the amount of new data that the source server is writing, then data accumulates and creates a significant backlog. The AWS Replication Agent must send this backlog with the initial sync. The larger the backlog, the longer it takes to complete the data replication.
  • I/O speed of the storage disks: During the replication process, the AWS Replication Agent reads storage blocks of disks and transmits data to the replication server. However, high read latency on the source server disks might impact the speed and efficiency of the data replication. Slow disks cause delays, and fast disks improve replication speed.
  • Load on source server: Resource contention on the source server might lead to high CPU utilization, memory consumption, I/O wait, or other resource constraints. For example, high CPU utilization might cause replication bottlenecks. This is because the system struggles to allocate CPU resources between the AWS Replication Agent and other processes. Similarly, high memory consumption might cause the system to swap memory pages to disk. This results in increased I/O wait and a slowdown in the replication process.
  • Under-provisioned replication resources: Staging Amazon Elastic Block Store (Amazon EBS) volumes with low throughput and IOPS might cause high read and write latency and high queue length, all of which degrade replication performance. Also, a replication server instance type with low network throughput and Amazon EBS bandwidth leads to replication performance issues.

Resolution

To determine the reason for the lag, first perform checks on the source server. Then, perform checks on the staging area.

Source server checks

Verify that the source server is booted and running

Make sure that the source server for the migration is booted and running.

Verify that AWS Replication Agent processes are running

Run the following command from PowerShell to list the running AWS Replication Agent services:

Get-Service | Where-Object Name -like "*AWSR*"

The following output shows the default services that are running. Verify that the main AWSReplicationService is running:

PS C:\Users\Administrator> Get-Service | Where-Object Name -like "*AWSR*"

Status   Name               DisplayName
------   ----               -----------
Running  AwsReplicationD... AwsReplicationDriverLogger
Running  AwsReplicationL... AwsReplicationLogger
Stopped  AwsReplicationP... AwsReplicationPostConvertService
Running  AwsReplicationS... AwsReplicationService
Running  AwsReplicationV... AwsReplicationVolumeUpdaterService

Alternatively, press Windows + R, type services.msc, and then press Enter. Verify that AWSReplicationService is running.
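If you check many source servers, a short script can parse the service table for you. The following Python sketch is illustrative only and isn't part of the AWS tooling. It assumes that you saved the Get-Service output as text, and the function name service_statuses is hypothetical.

```python
# Known values of the Windows service status column.
KNOWN_STATUSES = {
    "Running", "Stopped", "Paused", "StartPending",
    "StopPending", "ContinuePending", "PausePending",
}


def service_statuses(get_service_output: str) -> dict:
    """Map each DisplayName to its Status from Get-Service table text."""
    statuses = {}
    for line in get_service_output.splitlines():
        # Columns: Status, Name (may be truncated with "..."), DisplayName
        parts = line.split(None, 2)
        if len(parts) != 3 or parts[0] not in KNOWN_STATUSES:
            continue  # skip the header, separator, prompt, and blank lines
        status, _name, display_name = parts
        statuses[display_name.strip()] = status
    return statuses


# Sample text copied from the output shown above.
SAMPLE = """\
Status   Name               DisplayName
------   ----               -----------
Running  AwsReplicationD... AwsReplicationDriverLogger
Running  AwsReplicationL... AwsReplicationLogger
Stopped  AwsReplicationP... AwsReplicationPostConvertService
Running  AwsReplicationS... AwsReplicationService
Running  AwsReplicationV... AwsReplicationVolumeUpdaterService
"""

statuses = service_statuses(SAMPLE)
print(statuses.get("AwsReplicationService"))
```

The sketch keys on the DisplayName column because the Name column is truncated in the default table view.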

Verify active TCP connections

Verify that there are five active TCP connections established with the replication server on TCP Port 1500.

Run the following command:

C:\Users\Administrator>netstat -an | find "1500"

Check the command output for the active connections:

TCP    172.31.82.135:50929    <Replicator Instance IP>:1500    ESTABLISHED
TCP    172.31.82.135:50930    <Replicator Instance IP>:1500    ESTABLISHED
TCP    172.31.82.135:50931    <Replicator Instance IP>:1500    ESTABLISHED
TCP    172.31.82.135:50933    <Replicator Instance IP>:1500    ESTABLISHED
TCP    172.31.82.135:50934    <Replicator Instance IP>:1500    ESTABLISHED
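Because find "1500" matches any line that contains the text 1500 (for example, a local port such as 51500), it can help to count only the connections whose remote endpoint uses port 1500. The following Python sketch shows one way to do that on saved netstat output. The function name count_established is hypothetical, and 203.0.113.10 is a documentation-range address that stands in for the replication server IP.

```python
def count_established(netstat_output: str, port: int = 1500) -> int:
    """Count ESTABLISHED TCP connections whose remote endpoint uses `port`."""
    suffix = f":{port}"
    count = 0
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expected layout: Proto, Local Address, Foreign Address, State
        if len(fields) != 4 or fields[0] != "TCP":
            continue
        _proto, _local, foreign, state = fields
        if foreign.endswith(suffix) and state == "ESTABLISHED":
            count += 1
    return count


# Placeholder data: 203.0.113.10 stands in for the replication server.
SAMPLE = """\
  TCP    172.31.82.135:50929    203.0.113.10:1500    ESTABLISHED
  TCP    172.31.82.135:50930    203.0.113.10:1500    ESTABLISHED
  TCP    172.31.82.135:50931    203.0.113.10:1500    ESTABLISHED
  TCP    172.31.82.135:50933    203.0.113.10:1500    ESTABLISHED
  TCP    172.31.82.135:50934    203.0.113.10:1500    ESTABLISHED
  TCP    172.31.82.135:51500    203.0.113.20:443     ESTABLISHED
  TCP    172.31.82.135:50940    203.0.113.10:1500    TIME_WAIT
"""

print(count_established(SAMPLE))
```

In this sample, the last two lines also contain the text 1500 but aren't counted, because one uses 1500 only in the local port and the other isn't in the ESTABLISHED state.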

Use Windows Resource Monitor to check the performance on the source server

The AWS Replication Agent operates on one CPU core at a time. If CPU utilization is high on the core where the AWS Replication Agent is running, then data replication slows. To check CPU utilization, complete the following steps:

1.    Use one of the following methods to launch the Windows Resource Monitor:

  • On the Performance tab of the Task Manager, choose Open Resource Monitor.
  • In the Control Panel, choose Administrative Tools, and then choose Resource Monitor.
  • Run resmon.exe from the command line or PowerShell.
  • Choose the Windows icon, and then enter resmon.exe.

2.    Check the CPU utilization of the CPU core that the AWS Replication Agent is running on.

  • If the CPU utilization is high on that core, then investigate which process is consuming most of the CPU.
  • The agent uses at least 5% of the CPU. Make sure that there's enough CPU available for the agent to perform the data replication.

3.    Check disk performance on the source server.

Check the Read (B/sec), Write (B/sec), and Response Time metrics.

You can find these metrics under Disk Activity. If read throughput on the source disk is low, then the agent reads and replicates less data. Note any increase in the disk read and disk write metrics.

Note: The bandwidth that's required to transfer the replicated data over TCP Port 1500 depends on the write speed of the participating source servers. It's a best practice to have available bandwidth that's at least the sum of the average write speeds of all replicated source servers.

4.    Check the source server for a spike in write operations.

To check for write operations under Disk Activity, look for Write (B/sec).

As the workload changes, check the disk performance periodically to determine the I/O load. If the write throughput (MB/s) exceeds the available network throughput, then you see replication lag.

Note: To calculate the required bandwidth from source server to replication server, see Calculating the required bandwidth for TCP Port 1500.

If your source server is write heavy and writes data faster than the data can be replicated, then the backlog keeps growing.
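The relationship between write speed, bandwidth, and backlog can be sketched with simple arithmetic. The following Python example is a rough model, not an official AWS calculation. It assumes constant rates, treats 1 GB as 1,024 MB, and uses hypothetical function names and sample numbers.

```python
def required_bandwidth_mbps(avg_write_mb_per_s):
    """Sum of average write speeds (MB/s) converted to megabits per second."""
    return sum(avg_write_mb_per_s) * 8  # 1 byte = 8 bits


def backlog_growth_mb_per_s(write_mb_per_s, replication_mb_per_s):
    """Positive result: the backlog grows. Negative result: it drains."""
    return write_mb_per_s - replication_mb_per_s


def hours_to_drain(backlog_gb, write_mb_per_s, replication_mb_per_s):
    """Estimate the hours needed to clear a backlog; None if it never drains."""
    drain_rate = replication_mb_per_s - write_mb_per_s
    if drain_rate <= 0:
        return None  # writes keep pace with or outrun replication
    return backlog_gb * 1024 / drain_rate / 3600


# Three hypothetical source servers that write 2, 3, and 5 MB/s on average.
print(required_bandwidth_mbps([2, 3, 5]))   # megabits per second
print(backlog_growth_mb_per_s(12, 10))      # MB/s of backlog growth
print(hours_to_drain(45, 5, 10))            # hours to clear a 45 GB backlog
```

Real write rates fluctuate, so treat the results as an order-of-magnitude estimate and measure over a representative period.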

Check replication speed and available bandwidth from source server to the staging area subnet

For information on how to run a speed test, see How can I perform an SSL connectivity and bandwidth test?

Check for a source server that was shut down ungracefully

If a source server is shut down ungracefully, then the AWS Replication Agent rescans all the disks after the server reboots. While the AWS Replication Agent rereads the disks, the lag grows continuously until the rescan completes. For more information, see Which Windows and Linux OSs support no-rescan upon reboot?

To verify how the source machine was shut down, complete the following steps:

1.    Press Windows + R, type eventvwr.msc, and then press Enter.

2.    In the Event Viewer, double-click Windows Logs to expand it.

3.    Right-click System.

4.    Choose Filter Current Log.

5.    Choose the dropdown arrow near Event sources, and then choose USER32.

6.    In the <All Event IDs> field, type 1074, and then choose OK. Event Viewer lists the shutdown (power off) and restart events, each with its Shutdown Type.

7.    To see the dates and times of all unexpected computer shutdowns, type 6008 in the <All Event IDs> field, and then choose OK.

Verify that TCP Port 1500 isn't blocked outbound

1.    Make sure that TCP Port 1500 isn't blocked outbound from the source server to the replication server.

2.    Use Telnet or Test-NetConnection (TNC) to test connectivity to the replication server IP address, as shown in the following examples:

  • From CMD: telnet <replication subnet IP address> 1500
  • From PowerShell: TNC <replication subnet IP address> -port 1500

3.    Make sure that your local firewall allows connectivity from the source server to the replication server over TCP Port 1500. To allow this connectivity in the OS firewall, complete the following actions:

  • On the source server, open the Windows Firewall console.
  • On the console, choose the Outbound Rules option from the tree.
  • On the Outbound Rules table, select the rule that relates to the Remote Port - 1500 connection. Verify that the Enabled status is set to Yes.
  • If the Enabled status of the rule is No, then right-click and select Enable Rule.

4.    Make sure that your corporate firewall allows outbound traffic over TCP Port 1500.

Note: You can also perform the SSL Speed test or iperf test. For more information, see How can I perform an SSL connectivity and bandwidth test?
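If Telnet isn't installed on the source server, then a basic TCP reachability check similar to the Telnet and TNC tests can be scripted. The following Python sketch uses only the standard library. The function name can_connect and the 5-second timeout are assumptions for illustration.

```python
import socket


def can_connect(host: str, port: int = 1500, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False
```

For example, can_connect("<replication subnet IP address>") returns False when a firewall drops or rejects the traffic. A True result shows only that the port is reachable. It doesn't confirm that replication traffic flows at the expected speed.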

Verify that bandwidth throttling is turned off in the replication settings on the source server

Turn off bandwidth throttling in the replication settings of the source server. Turning off bandwidth throttling helps maintain sufficient bandwidth for data transfer from the source server to the staging area subnet.

Turning on bandwidth throttling on the source server can cause the lag to grow steadily or to remain stagnant. This is because throttling limits data replication from the source server to the replication server.

To check for bandwidth throttling, complete the following steps:

1.    Open the Application Migration Service console.

2.    Choose Settings, and then select the replication template under Data routing and throttling.

3.    If Throttle network bandwidth is turned on, check the value and compare it with the replication speed that's listed. For more information, see the preceding section Check replication speed and available bandwidth from source server to the staging area subnet.

Staging area resource checks

Verify that TCP Port 1500 isn't blocked inbound

Make sure that TCP Port 1500 isn't blocked inbound in the replication server security groups.

Note: You must complete the following steps in the Amazon Elastic Compute Cloud (Amazon EC2) console.

1.    Open the Amazon EC2 console.

2.    Select the security group that's attached to the replicator instance.

3.    Verify that inbound TCP Port 1500 is allowed on the attached security group.

Check the snapshot quota in the target Region

Make sure that your AWS account hasn't reached the snapshot quota in the Region where you're replicating source servers. Run the following AWS Command Line Interface (AWS CLI) commands. The first command returns the snapshot quota for the Region, and the second command returns the number of snapshots that your account owns in that Region. In the following example, replace region with your target Region.

Note: If you receive errors when running AWS CLI commands, make sure that you're using the most recent version of the AWS CLI.

aws service-quotas get-service-quota --service-code ebs --quota-code L-309BACF6 --region region --query "Quota.Value"
aws ec2 describe-snapshots --owner-ids self --region region --query "length(Snapshots)"
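The comparison of the two values can also be scripted. The following Python sketch wraps the same two AWS CLI commands and reports the remaining headroom. It assumes that the aws executable is on your PATH and that credentials are configured. The function names aws_cli_json, snapshot_headroom, and check_region are hypothetical.

```python
import json
import subprocess


def aws_cli_json(args):
    """Run an AWS CLI command and parse its JSON output."""
    result = subprocess.run(
        ["aws", *args, "--output", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(result.stdout)


def snapshot_headroom(quota, used):
    """Return how many more snapshots fit under the quota (never negative)."""
    return max(int(quota) - int(used), 0)


def check_region(region):
    """Fetch the quota and the snapshot count, then return the headroom."""
    quota = aws_cli_json([
        "service-quotas", "get-service-quota",
        "--service-code", "ebs", "--quota-code", "L-309BACF6",
        "--region", region, "--query", "Quota.Value",
    ])
    used = aws_cli_json([
        "ec2", "describe-snapshots", "--owner-ids", "self",
        "--region", region, "--query", "length(Snapshots)",
    ])
    return snapshot_headroom(quota, used)
```

Call check_region with your target Region name. A result of 0 means that the account is at or over the quota, so new snapshots in that Region are likely to fail until you delete snapshots or request a quota increase.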

Related information

Identification of replication bottlenecks when using AWS Application Migration Service

AWS OFFICIAL
Updated a year ago