This article provides a comprehensive guide for diagnosing and resolving replication performance bottlenecks, focusing on replication instance types and staging disk configurations. It explains how to interpret CloudWatch metrics, understand EBS performance limits, and recognize instance-level constraints. It also covers concepts such as EBS microbursting and volume throttling, and offers practical recommendations for performance optimization.
When experiencing slow replication in AWS Application Migration Service (MGN) or Elastic Disaster Recovery (DRS), it's crucial to examine two key components: the replication instance and staging disk configuration.
By default, these services use a t3.small instance, which can lead to performance bottlenecks during concurrent replication tasks, resulting in slower replication speeds and extended initial data synchronization periods. While upgrading the instance type can improve performance, it's important to note that this will also increase costs.
Note: Changing the replication server instance type will not affect data replication. Data replication will automatically pick up where it left off, using the new instance type you selected.
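As a minimal sketch, you can change the default replication server instance type programmatically through the MGN replication configuration template (DRS exposes an equivalent API); the template ID below is a hypothetical placeholder, and the same change can be made in the console under the replication settings.

import boto3

mgn = boto3.client("mgn", region_name="us-east-1")

# Hypothetical template ID; find yours in the MGN console or by calling
# describe_replication_configuration_templates.
template_id = "rct-0123456789abcdef0"

# Upgrade the default replication server from t3.small to a larger type.
# New source servers pick this up automatically; servers already replicating
# keep their per-server replication configuration unless you update it too.
mgn.update_replication_configuration_template(
    replicationConfigurationTemplateID=template_id,
    replicationServerInstanceType="m5.large",
)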
Staging disk type:
The staging disk configuration also plays a vital role in replication performance. For volumes smaller than 125 GiB, AWS typically uses Magnetic HDD (standard) volumes. These offer average performance of about 100 IOPS, with the ability to burst to several hundred IOPS for short periods. However, this may not be sufficient for high-throughput replication tasks.
You can check standard volume configurations here:
Previous generation volumes
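To confirm which volume types your staging disks are actually using, you can describe the volumes attached to a replication server. A minimal boto3 sketch, assuming a hypothetical replication server instance ID:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Hypothetical replication server instance ID; find yours in the EC2 console
# (MGN/DRS replication servers run in your staging area subnet).
replication_server_id = "i-0123456789abcdef0"

volumes = ec2.describe_volumes(
    Filters=[{"Name": "attachment.instance-id", "Values": [replication_server_id]}]
)["Volumes"]

for vol in volumes:
    # "standard" is the previous-generation Magnetic type (~100 IOPS average).
    print(vol["VolumeId"], vol["VolumeType"], vol["Size"], "GiB,",
          vol.get("Iops", "n/a"), "IOPS")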
To identify performance bottlenecks, it's essential to monitor several CloudWatch metrics. These include VolumeQueueLength, VolumeReadOps, VolumeWriteOps, VolumeThroughputPercentage, VolumeIdleTime, and VolumeConsumedReadWriteOps. These metrics can help detect EBS volume throttling, which occurs when you exceed your provisioned IOPS or throughput limits.
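As a minimal sketch of how to retrieve one of these metrics with boto3 (the volume ID is a hypothetical placeholder):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

volume_id = "vol-0123456789abcdef0"  # hypothetical staging volume ID
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)

# A sustained high VolumeQueueLength indicates I/O is queuing up faster
# than the volume can service it.
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="VolumeQueueLength",
    Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
    StartTime=start,
    EndTime=end,
    Period=300,
    Statistics=["Average"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2))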
Another factor that can impact replication performance is EBS microbursting - short bursts of I/O activity that exceed baseline performance. Microbursting can cause inconsistent replication speeds, temporary failures, and may be misidentified as general performance issues.
You can read more about microbursting in How do I identify whether my EBS volume is micro-bursting and make sure that it doesn't affect performance?
To check if your replication is slow due to EBS volume performance limits, you can monitor two important CloudWatch metrics:
- VolumeIOPSExceededCheck
- VolumeThroughputExceededCheck
These metrics help you understand if your replication is being throttled due to EBS volume limitations:
If VolumeIOPSExceededCheck shows 1, your replication is trying to perform more IOPS than the provisioned IOPS of the EBS volume, causing slowness in MGN/DRS data transfer.
If VolumeThroughputExceededCheck shows 1, your replication is attempting to transfer more data than the provisioned throughput of the EBS volume, resulting in slower replication speeds.
If both metrics show 0, your EBS volume performance is not the cause of slow replication, and you should investigate other factors.
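For illustration, a short boto3 sketch that scans both check metrics for a staging volume over the last few hours (the volume ID is a hypothetical placeholder; the namespace and dimension follow the standard AWS/EBS per-volume pattern):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
volume_id = "vol-0123456789abcdef0"  # hypothetical staging volume ID

end = datetime.now(timezone.utc)
start = end - timedelta(hours=3)

for metric in ("VolumeIOPSExceededCheck", "VolumeThroughputExceededCheck"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EBS",
        MetricName=metric,
        Dimensions=[{"Name": "VolumeId", "Value": volume_id}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Maximum"],
    )
    # Any datapoint with Maximum == 1 means the volume hit its provisioned
    # limit at least once in that period.
    throttled = [p for p in stats["Datapoints"] if p["Maximum"] >= 1]
    print(metric, "- throttled periods:", len(throttled))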
Recommended actions:
- If you see consistent 1s, consider upgrading your EBS volume type or size.
- For MGN/DRS replication, gp3 or io2 volumes are recommended for better performance.
- Monitor these metrics during both the initial sync and continuous replication phases.
This will help you identify if EBS performance is causing replication bottlenecks in your migration or disaster recovery setup.
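For illustration, the direct EBS API call to convert a volume to gp3 with higher provisioned performance looks like the sketch below. Because MGN and DRS manage their staging volumes, the durable fix is usually to change the staging disk type in the service's replication settings; the volume ID and performance values here are hypothetical placeholders.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
volume_id = "vol-0123456789abcdef0"  # hypothetical staging volume ID

# Convert the volume to gp3 and raise its provisioned performance.
# gp3 supports up to 16,000 IOPS and 1,000 MiB/s independently of size,
# and the modification happens online, so replication is not interrupted.
ec2.modify_volume(
    VolumeId=volume_id,
    VolumeType="gp3",
    Iops=6000,
    Throughput=500,  # MiB/s; illustrative value
)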
You can check more details about EBS CloudWatch metrics here:
Amazon CloudWatch metrics for Amazon EBS
It's also crucial to understand instance-level constraints. EBS-optimized instances, such as t3.micro, provide dedicated capacity for EBS I/O operations, separating storage traffic from regular network traffic. While this generally enhances EBS performance, each instance type also imposes its own limits on the IOPS and throughput available to attached EBS volumes.
Each EBS-optimized instance type has specific baseline and maximum IOPS and throughput limits, which can become a bottleneck if the EBS volume is capable of higher performance. For example, consider a t3.small instance:
Baseline performance:
- EBS Throughput: 21.75 MB/s
- EBS IOPS: 1000
Maximum burst performance:
- EBS Throughput: 260.62 MB/s
- EBS IOPS: 11,800
In this case, if you attach a high-performance io1 EBS volume capable of 20,000 IOPS to a t3.small instance, the volume's performance will be limited by the instance's capabilities. During normal operations, you'll be constrained to 1000 IOPS and 21.75 MB/s throughput. During burst periods, you might achieve up to 11,800 IOPS and 260.62 MB/s throughput, but only for short durations. This demonstrates how the instance type can become a bottleneck for EBS performance, even if the volume itself is capable of higher performance levels.
By default, EBS-optimized instances operate at their baseline performance and can sustain maximum performance for only 30 minutes at least once every 24 hours.
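You don't have to memorize these numbers; the documented per-type baseline and maximum limits can be retrieved with the EC2 DescribeInstanceTypes API. A short boto3 example:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Look up the documented EBS baseline and maximum limits for an instance type.
info = ec2.describe_instance_types(InstanceTypes=["t3.small"])
ebs = info["InstanceTypes"][0]["EbsInfo"]["EbsOptimizedInfo"]

print("Baseline:", ebs["BaselineIops"], "IOPS,",
      ebs["BaselineThroughputInMBps"], "MB/s")
print("Maximum: ", ebs["MaximumIops"], "IOPS,",
      ebs["MaximumThroughputInMBps"], "MB/s")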
To monitor these limits, AWS provides two crucial CloudWatch metrics:
EBSIOBalance% - This metric shows the percentage of I/O credits remaining in your burst bucket. When this value drops to zero, your IOPS will be throttled to the baseline performance level. For example, if your t3.small instance consistently operates above 1000 IOPS, you'll see this percentage gradually decrease.
EBSByteBalance% - This metric indicates the percentage of throughput credits remaining. When this drops to zero, your throughput will be throttled to the baseline level. For instance, if your t3.small instance consistently transfers data above 21.75 MB/s, this percentage will decline.
Both metrics start at 100% and decrease when your instance exceeds its baseline performance.
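A minimal sketch of checking both balance metrics for a replication server over the last few hours (the instance ID is a hypothetical placeholder; these metrics are published in the AWS/EC2 namespace for Nitro-based instances such as the t3 family):

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # hypothetical replication server ID

end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

for metric in ("EBSIOBalance%", "EBSByteBalance%"):
    stats = cloudwatch.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName=metric,
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=300,
        Statistics=["Minimum"],
    )
    # A balance approaching 0% means throttling back to baseline is imminent.
    low = [p for p in stats["Datapoints"] if p["Minimum"] < 20]
    print(metric, "- periods below 20%:", len(low))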
For baseline and maximum performance of all EBS optimized instances, please check - Amazon EBS-optimized instance types
To optimize replication performance, consider upgrading your instance type if you consistently require high IOPS/throughput. Additionally, monitor your EBS volume metrics closely and consider upgrading to io2 or gp3 volumes for better performance. Implementing CloudWatch alarms for critical metrics and regularly monitoring burst balance credits can help you stay ahead of potential performance issues.
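As one example, a CloudWatch alarm that fires when the replication server's I/O burst credits run low might look like the following sketch; the alarm name, threshold, and instance ID are illustrative choices, not prescribed values.

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
instance_id = "i-0123456789abcdef0"  # hypothetical replication server ID

# Alarm when the instance's EBS I/O burst credits run low, so you can react
# before IOPS are throttled back to the baseline.
cloudwatch.put_metric_alarm(
    AlarmName="mgn-replication-ebs-io-balance-low",
    Namespace="AWS/EC2",
    MetricName="EBSIOBalance%",
    Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
    Statistic="Minimum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=20.0,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="notBreaching",
    # AlarmActions=["arn:aws:sns:..."],  # add an SNS topic to get notified
)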
If a thorough analysis of the AWS-side components described in this section reveals no bottlenecks that could account for the slow replication performance in AWS Application Migration Service (MGN) or Elastic Disaster Recovery (DRS), shift your focus to the source environment.
Even when the AWS infrastructure is functioning optimally, source-side constraints can significantly impact replication performance, so a comprehensive examination of the source machine is the next step.
For a detailed guide on performing these essential source-side checks, specifically for Windows environments, please check - How do I fix replication lag for Windows source machines in AWS Application Migration Service?
By understanding these factors and implementing appropriate monitoring and optimization strategies, you can significantly improve the speed and reliability of your replication processes in AWS MGN and DRS services.