How can I troubleshoot replication lag or a backlog on my Linux Source Server for Application Migration Service?

I see lag or a backlog on my Linux source server when I replicate data using AWS Application Migration Service.

Short description

The following are factors that contribute to replication lag and backlog when replicating data from a source server to a target server:

  • Network uplink speed and bandwidth availability: The network connection speed between the source server and the replication server might have significant impact on replication performance. Slow connections might prevent the replication process from completing. Also, limited bandwidth limits the amount of data that you can replicate in a given time.
  • Changes in disk while replicating: During the replication process, the source server might continue to write new data to its disks. If there's a large spike in the amount of new data that the source server is writing, then data accumulates and creates a significant backlog. The AWS Replication Agent must send this backlog with the initial sync. The larger the backlog, the longer it takes to complete the data replication.
  • I/O speed of the storage disks: During the replication process, the AWS Replication Agent reads storage blocks of disks and transmits data to the replication server. However, high read latency on the source server disks might impact the speed and efficiency of the data replication. Slow disks cause delays, and fast disks improve replication speed.
  • Load on source server: Resource contention on the source server might lead to high CPU utilization, memory consumption, I/O wait, or other resource constraints. For example, high CPU utilization might cause replication bottlenecks. This is because the system struggles to allocate CPU resources between the AWS Replication Agent and other processes. Similarly, high memory consumption might cause the system to swap memory pages to disk. This results in increased I/O wait and a slowdown in the replication process.
  • Under-provisioned replication resources: Staging Amazon Elastic Block Store (Amazon EBS) volumes with lower throughput and IOPS might cause high read and write latency and a high queue length. All of these issues impact replication performance. Also, a replication server instance type with low network throughput and Amazon EBS bandwidth leads to replication performance issues.

Resolution

To determine the underlying reason for the lag, first perform checks on the source server. Then, perform checks on the staging area.

Source server checks

Verify that the source server is booted and running

Make sure that the Source Server for the migration is booted and running.

Verify that the source server can establish an SSL connection with the regional Application Migration Service API endpoint and the replication server

Make sure that SSL certificates aren't intercepted and changed at any point between the source server and the Application Migration Service API endpoint, or between the source server and the replication server. To check this, run the following commands:

# echo -n | openssl s_client -connect mgn.<region>.amazonaws.com:443
# echo -n | openssl s_client -connect <replication server IP>:1500

Note: Use the command listed in the following Verify active TCP connections section to find the replication server's IP address.
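
The s_client output is long, so it can help to reduce it to the few lines that show who issued the certificate. The following is a minimal sketch, not part of the official procedure: summarize_s_client is a hypothetical helper name, and it only filters the text that openssl prints. An issuer that names a proxy or firewall product instead of the expected certificate authority suggests interception. This article doesn't state what a healthy issuer looks like for the replication server, so compare against a source server that replicates normally.

```shell
# Hypothetical helper: reads `openssl s_client` output on stdin and keeps
# only the subject, issuer, and verification result lines.
summarize_s_client() {
  grep -E '^(subject|issuer)=|Verify return code:'
}

# Example usage (placeholders as in the commands above):
#   echo -n | openssl s_client -connect mgn.<region>.amazonaws.com:443 2>/dev/null | summarize_s_client
#   echo -n | openssl s_client -connect <replication server IP>:1500 2>/dev/null | summarize_s_client
```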

Verify that all AWS Replication Agent processes are running

Run the following command to list the running AWS Replication Agent services:

# ps -u aws-replication

The following output shows the AWS Replication Agent processes that must be running:

 PID  TTY TIME    CMD
 30878 ? 00:00:00 update_onprem_v
 30879 ? 00:00:00 run_linux_migra
 30880 ? 00:00:00 tailer
 30881 ? 00:04:45 java
 30902 ? 00:00:01 tailer
 30904 ? 00:00:00 run_linux_migra
 30905 ? 00:00:10 update_onprem_v
 31023 ? 00:00:00 tail
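
To compare a live process list against the sample output above without reading it line by line, a small filter can report which expected process names are absent. This is a sketch under assumptions: missing_agent_procs is a hypothetical helper, and the expected names are taken only from the sample output in this article, so they might differ between agent versions.

```shell
# Hypothetical helper: reads process names (one per line) on stdin and
# prints each expected AWS Replication Agent process name that is missing.
# The expected names come from the sample output in this article.
missing_agent_procs() {
  seen=$(cat)
  for name in update_onprem_v run_linux_migra tailer java tail; do
    # -x matches the whole line, so "tail" does not match "tailer".
    printf '%s\n' "$seen" | grep -qxF "$name" || printf '%s\n' "$name"
  done
}

# Example usage (comm= prints the command name without a header):
#   ps -u aws-replication -o comm= | missing_agent_procs
```

No output means that every expected process name was found.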

Verify active TCP connections

Run the following command to verify that there are five active TCP connections established with the replication server on TCP Port 1500.

# sudo netstat -anp | awk '$5 ~ /:1500$/ {print}'

Check the command output for the active connections:

tcp6       0      0 172.31.1.39:54814       172.31.0.82:1500        ESTABLISHED 30881/java          
tcp6       0      0 172.31.1.39:54828       172.31.0.82:1500        ESTABLISHED 30881/java          
tcp6       0      0 172.31.1.39:54832       172.31.0.82:1500        ESTABLISHED 30881/java          
tcp6       0      0 172.31.1.39:54812       172.31.0.82:1500        ESTABLISHED 30881/java          
tcp6       0      0 172.31.1.39:54800       172.31.0.82:1500        ESTABLISHED 30881/java
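
Counting the connections by eye is error prone when the output is mixed with other traffic. The following sketch counts only ESTABLISHED connections to remote port 1500 and lists the remote addresses, which is one way to find the replication server's IP address. Both function names are hypothetical, and the column positions assume the netstat layout shown above.

```shell
# Hypothetical helpers that read `netstat -anp` output on stdin.

# Prints the number of ESTABLISHED connections whose remote port is 1500.
count_established_1500() {
  awk '$5 ~ /:1500$/ && $6 == "ESTABLISHED" { n++ } END { print n + 0 }'
}

# Prints the distinct remote addresses that are contacted on port 1500.
replication_server_ips() {
  awk '$5 ~ /:1500$/ { sub(/:1500$/, "", $5); print $5 }' | sort -u
}

# Example usage:
#   sudo netstat -anp | count_established_1500    # the article expects 5
#   sudo netstat -anp | replication_server_ips
```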

Check CPU utilization on the CPU core where the AWS Replication Agent is running

The AWS Replication Agent is a single-threaded process that operates on one CPU core at a time. If CPU utilization is high on the core where the AWS Replication Agent is running, then data replication slows.

1.    Run the following commands and then review the output to determine the following:

  • The process ID of the AWS Replication Agent.
  • The CPU core (indicated by psr) that it's running on.
# ps --pid $(pidof /var/lib/aws-replication-agent/jre/bin/java) -o psr,pid,comm

# mpstat -P <psr column value> 3

2.    Then, check the CPU utilization of the identified CPU core.
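
The two steps above can be chained so that the core number feeds directly into mpstat. The following is a rough sketch: core_busy_pct is a hypothetical helper that reads mpstat output and prints 100 minus the %idle value from the Average line. It assumes that %idle is the last column, as in common sysstat versions, and that the output uses a period as the decimal separator (hence LC_ALL=C).

```shell
# Hypothetical helper: reads mpstat output on stdin and prints the busy
# percentage (100 - %idle) from each "Average:" data row.
core_busy_pct() {
  LC_ALL=C awk '$1 == "Average:" && $2 != "CPU" { printf "%.1f\n", 100 - $NF }'
}

# Example usage: find the agent's core, then sample it 5 times at 3-second intervals.
#   psr=$(ps --pid "$(pidof /var/lib/aws-replication-agent/jre/bin/java)" -o psr= | tr -d ' ')
#   LC_ALL=C mpstat -P "$psr" 3 5 | core_busy_pct
```

The scheduler can move the process to another core, so rerun the ps command if the result looks unexpectedly idle.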

Check disk performance on the source server

If there's low read throughput (rMB/s) on source disks, then less data is read and replicated. Make a note of any increase in the IO depth (avgqu-sz) and I/O wait (await) metrics. You can use sar or iostat tools to measure the disk read throughput:

# iostat -myx 3
# sar -dp 2
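
Column order in iostat output differs between sysstat versions, so picking rMB/s and wMB/s by position is fragile. The following sketch locates both columns by header name and prints them for each device. disk_throughput is a hypothetical helper, and it assumes the extended report format with a header line that starts with "Device".

```shell
# Hypothetical helper: reads `iostat -myx` output on stdin and prints
# "<device> <rMB/s> <wMB/s>" for every device row.
disk_throughput() {
  LC_ALL=C awk '
    # A blank line or the CPU block ends the current device table.
    NF == 0 || $1 == "avg-cpu:" { r = w = 0; next }
    # Header row: remember where the read and write throughput columns are.
    $1 ~ /^Device/ {
      r = w = 0
      for (i = 1; i <= NF; i++) {
        if ($i == "rMB/s") r = i
        if ($i == "wMB/s") w = i
      }
      next
    }
    # Device row: print only when both columns were found in the header.
    r && w && NF >= r && NF >= w { print $1, $r, $w }
  '
}

# Example usage:
#   iostat -myx 3 5 | disk_throughput
```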

Check the source server for a spike in write operations

A spike of write operations in the source server might cause replication lag growth. This growth continues until the AWS Replication Agent flushes all the written data to the replication server. Run the iostat test periodically to determine what the I/O load is as the workload changes. If the write throughput (wMB/s) exceeds the available network throughput, then you see replication lag.

Note: To calculate the required bandwidth from source server to replication server, see Calculating the required bandwidth for TCP Port 1500.
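
As a rough first estimate, write throughput in MB/s can be converted to Mbps by multiplying by 8 and then compared with the available uplink. The following sketch does only that unit conversion. required_mbps is a hypothetical helper, and the referenced AWS bandwidth guidance remains the authority for sizing.

```shell
# Hypothetical helper: sums per-disk write rates given in MB/s and prints
# the equivalent network rate in Mbps (1 byte = 8 bits).
required_mbps() {
  printf '%s\n' "$@" | LC_ALL=C awk '{ total += $1 } END { printf "%.0f\n", total * 8 }'
}

# Example: two disks writing 5 MB/s and 7.5 MB/s need about 100 Mbps.
#   required_mbps 5 7.5
```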

Check replication speed and available bandwidth from the source server to the staging area subnet

1.    In your target AWS Region, launch a test Amazon Elastic Compute Cloud (Amazon EC2) instance using the public AMI CE-ssl-speedtest. The EC2 instance must be the same instance type as the replication server.

2.    Select the same subnet as the subnet used in the replication settings of your source server.

3.    Make sure that the security group allows TCP Port 1500 inbound access.

4.    On the source server, build and configure the LibreSpeed CLI as shown in the following example:

# cd /tmp
# git clone https://github.com/librespeed/speedtest-cli.git
# cd speedtest-cli/
# ls -l
# ./build.sh
# cat << EOF >> ./servers.json
[
  {
    "id": 1,
    "name": "PHP Backend",
    "server": "https://<test server private IP>:1500/speedtest/",
    "dlURL": "/garbage.php",
    "ulURL": "/empty.php",
    "pingURL": "/empty.php"
  }
 ]
EOF

Note: In the preceding example, make sure to replace the test server's IP address. If you're using the test server's public IP for a speed test, include "getIpURL": "/getIP.php" after the "pingURL" line.

5.    Run the LibreSpeed CLI as shown in the following example to test the replication speed:

# ./out/librespeed-cli-linux-amd64 --local-json ./servers.json --server 1 --no-icmp --skip-cert-verify --simple
Ping: 11.00 ms Jitter: 0.00 ms
Download rate: 503.84 Mbps
Upload rate: 493.56 Mbps

Check for a source server that was shut down ungracefully

If a source server is shut down ungracefully, then the AWS Replication Agent rescans all of the disks after the server reboots. The AWS Replication Agent rereads the disks, and the lag continuously grows until rescanning completes. For more information, see Which Windows and Linux OSs support no-rescan upon reboot?

Check for a kernel upgrade

If the kernel is upgraded on the source server and the server is rebooted, then the AWS Replication Agent fails to run. The running kernel version must match the kernel version that the AWS Replication Agent driver was compiled for during the agent installation.

Run the following commands to verify that the running kernel version matches the kernel version that the AWS Replication Agent driver was compiled for:

$ uname -r
$ modinfo -F vermagic /var/lib/aws-replication-agent/aws-replication-driver.ko

Note: The vermagic field shows the kernel version that the kernel driver was compiled for.
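
The comparison can be scripted so that a mismatch is flagged without reading both outputs by eye. In this sketch, vermagic_matches is a hypothetical helper. It assumes that the first space-separated field of the vermagic string is the kernel release, which is the usual modinfo layout.

```shell
# Hypothetical helper: succeeds when the running kernel release ($1)
# equals the first field of the module's vermagic string ($2).
vermagic_matches() {
  running=$1
  built=${2%% *}   # drop everything after the first space
  [ "$running" = "$built" ]
}

# Example usage:
#   if vermagic_matches "$(uname -r)" \
#        "$(modinfo -F vermagic /var/lib/aws-replication-agent/aws-replication-driver.ko)"; then
#     echo "driver matches running kernel"
#   else
#     echo "driver was built for a different kernel"
#   fi
```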

Verify that TCP Port 1500 isn't blocked outbound

Make sure that TCP Port 1500 isn't blocked outbound from the source server to the replication server.
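
One quick way to test outbound reachability from the source server is a plain TCP connect attempt with a timeout. The following is a sketch that assumes bash and the coreutils timeout command are available; port_open is a hypothetical helper name. A successful connect only shows that the port is reachable. It doesn't validate the replication protocol itself.

```shell
# Hypothetical helper: succeeds if a TCP connection to host $1, port $2
# can be opened within 5 seconds. Uses bash's /dev/tcp pseudo-device.
port_open() {
  timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# Example usage:
#   port_open <replication server IP> 1500 && echo "reachable" || echo "blocked or filtered"
```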

Review the MGN Agent logs

Inspect the MGN Agent logs for any connectivity problems with the replication server on TCP Port 1500. Also, check for replication irregularities that indicate frequent connection loss. After identifying these issues, review the network topology to investigate further.

Verify that intermediate devices don't have a lower MTU

Confirm that none of the intermediate devices in the replication path have a lower MTU. A lower MTU reduces the replication speed and causes delays in the process. It's a best practice to maintain a consistent MTU size throughout the replication path. If a device in the path has a lower MTU, then update or replace it with a higher MTU device.
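
A common way to probe the path MTU is to send pings with the "don't fragment" flag set and a payload sized to fill the MTU exactly. The following sketch computes that payload size for IPv4, where 20 bytes of IP header and 8 bytes of ICMP header are subtracted from the MTU. payload_for_mtu is a hypothetical helper. The probe assumes that ICMP is allowed along the path, which might not be true for the replication server's security group.

```shell
# Hypothetical helper: prints the ICMP payload size that produces an IPv4
# packet of exactly $1 bytes (20-byte IP header + 8-byte ICMP header).
payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Example usage on Linux (iputils ping; -M do sets the don't-fragment flag):
#   ping -M do -s "$(payload_for_mtu 1500)" -c 3 <replication server IP>
# A "message too long" error or total packet loss at this size, while smaller
# sizes succeed, points to a lower MTU somewhere on the path.
```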

Note: If you're replicating over the public internet, then make sure that the MTU is 1500. 1500 is the largest MTU that internet gateway, peering, and VPN connections support. Jumbo frames work only within Amazon Virtual Private Cloud (Amazon VPC) or over AWS Direct Connect, and they have their own limitations.

Verify that network bandwidth throttling is turned off in the replication settings on the source server

Bandwidth throttling must be turned off in the replication settings of the source server.

Turning on bandwidth throttling in the source server throttles the data transfer rate of the AWS Replication Agent. If there's a backlog on the source server, then throttling might cause the lag to keep growing or to stagnate. Turn on network bandwidth throttling only when you must keep data transfer within a constant, limited bandwidth.

To check for bandwidth throttling, complete the following steps:

1.    Open the Application Migration Service console.

2.    Choose Source servers, and then select the source server.

3.    Choose the Replication settings tab.

4.    If Throttle network bandwidth is turned on, then make sure that the throttled value is equal to or greater than the required bandwidth for data replication. For more information, see the note in the preceding section, Check the source server for a spike in write operations.

Staging area resource checks

Verify that TCP Port 1500 isn't blocked inbound

Make sure that TCP Port 1500 isn't blocked inbound in the replication server's security groups.

Note: You must complete the following steps in the Amazon Elastic Compute Cloud (Amazon EC2) console.

1.    Open the Amazon EC2 console.

2.    Select the security group that's attached to the replicator instance.

3.    Verify that inbound TCP Port 1500 is allowed on the attached security group.

Check the NetworkIn CloudWatch metric

If the NetworkIn Amazon CloudWatch metric for the replication server nears the instance's bandwidth limit, then throttling might occur. Throttling results in slower replication speed and increased lag. Consider upgrading to a larger instance type that can handle the required bandwidth.
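
NetworkIn is reported in bytes per period, while instance bandwidth is usually quoted in Mbps or Gbps, so a conversion is needed before the two can be compared. The following sketch performs that conversion. bytes_to_mbps is a hypothetical helper, and the instance ID and time range in the usage comment are placeholders.

```shell
# Hypothetical helper: converts a NetworkIn Sum value ($1, bytes) over a
# period ($2, seconds) to an average rate in megabits per second.
bytes_to_mbps() {
  LC_ALL=C awk -v bytes="$1" -v period="$2" \
    'BEGIN { if (period <= 0) exit 1; printf "%.1f\n", bytes * 8 / period / 1000000 }'
}

# Example usage with 5-minute periods (replace the placeholders):
#   aws cloudwatch get-metric-statistics --namespace AWS/EC2 --metric-name NetworkIn \
#     --dimensions Name=InstanceId,Value=<replication server instance ID> \
#     --start-time <start> --end-time <end> --period 300 --statistics Sum \
#     --query "Datapoints[].Sum" --output text
#   bytes_to_mbps 3750000000 300    # 3.75 GB in 5 minutes is 100.0 Mbps
```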

Check the aggregated throughput and IOPS of the replication server's EBS volumes

Replication server performance might be throttled if the Amazon Elastic Block Store (Amazon EBS) volumes' aggregated throughput and IOPS exceed the instance limits. If throttling occurs, change to a replication server instance type that accommodates your replication needs and sustains performance without throttling. It's a best practice to use a current generation EBS-optimized instance type for replication servers. On instances without support for EBS-optimized throughput, network traffic contends with traffic between your instance and your EBS volumes. On EBS-optimized instances, the two types of traffic are kept separate. Monitor replication server network and EBS CloudWatch metrics.

Monitor metrics for all replication EBS volumes

Lag and backlog accumulate when the replication server's volume write speed can't match the rate of change on the source server. To avoid replication lag, use a faster volume type with higher IOPS and bandwidth. For optimal EBS volume performance, it's a best practice to monitor the CloudWatch metrics for every replication EBS volume.

Check for EBS volumes created from a snapshot

Replication servers that have EBS volumes created from a snapshot might have increased I/O latency the first time that each block is accessed. This latency might cause lag growth or stagnation until the rescanning process is complete. For more information, see Be aware of the performance penalty when initializing volumes from snapshots.

Verify the snapshot quota in the target Region

Make sure that your AWS account hasn't reached snapshot quota limits in the AWS Region where you're replicating source servers. Use the following AWS Command Line Interface (AWS CLI) commands to verify if you've reached the snapshot quota in the Region. In the following example, replace region with your target AWS Region:

# aws service-quotas get-service-quota --service-code ebs --quota-code L-309BACF6 --region region --query "Quota.Value"
# aws ec2 describe-snapshots --owner-ids self --region region --query "length(Snapshots)"
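
The two values are easier to interpret as a single usage percentage. The following sketch divides the snapshot count by the quota. quota_pct is a hypothetical helper, and the usage comment reuses the two commands above with --output text so that the results can be captured in variables.

```shell
# Hypothetical helper: prints snapshot usage as a whole-number percentage
# of the quota. $1 = snapshot count, $2 = quota value (may be a float).
quota_pct() {
  LC_ALL=C awk -v used="$1" -v quota="$2" \
    'BEGIN { if (quota <= 0) exit 1; printf "%.0f\n", used / quota * 100 }'
}

# Example usage (replace region with your target AWS Region):
#   quota=$(aws service-quotas get-service-quota --service-code ebs --quota-code L-309BACF6 \
#             --region region --query "Quota.Value" --output text)
#   used=$(aws ec2 describe-snapshots --owner-ids self --region region \
#            --query "length(Snapshots)" --output text)
#   echo "Snapshot quota used: $(quota_pct "$used" "$quota")%"
```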

Note: If you receive errors when running the AWS CLI commands, make sure that you're using the most recent version of the AWS CLI.

Related information

Identification of replication bottlenecks when using AWS Application Migration Service

AWS OFFICIAL
Updated 2 years ago