AWS announces preview of AWS Interconnect - multicloud
AWS announces AWS Interconnect – multicloud (preview), providing simple, resilient, high-speed private connections to other cloud service providers. AWS Interconnect - multicloud is easy to configure and provides high-speed, resilient connectivity with dedicated bandwidth, enabling customers to interconnect AWS networking services such as AWS Transit Gateway, AWS Cloud WAN, and Amazon VPC to other cloud service providers with ease.
How do I troubleshoot high binlog replica lag with Amazon RDS for MySQL and Aurora MySQL?
I want to know why there's replica lag when I use Amazon Relational Database Service (Amazon RDS) for MySQL or Amazon Aurora MySQL.
Short description
Amazon RDS for MySQL uses asynchronous replication, sometimes a replica lags behind the primary DB instance. To monitor a replication lag, use an Amazon RDS for MySQL read replica with binary log file position-based replication.
To check the ReplicaLag metric for Amazon RDS, use Amazon CloudWatch. The ReplicaLag metric reports the value of the Seconds_Behind_Master field of the SHOW SLAVE STATUS command.
For Aurora, the AuroraBinlogReplicaLag metric measures replica lag between Aurora DB clusters that use binary logs.
Note: For MySQL versions 8.0.22 and later, the SHOW REPLICA STATUS command has replaced the SHOW SLAVE STATUS command. For more information, see SHOW SLAVE | REPLICA STATUS statement on the MySQL website.
The Seconds_Behind_Master field shows the current lag in seconds that the replica DB instance is behind the source instance. It also shows the original value that's logged on the primary DB instance for the event that processes on the replica DB instance.
MySQL replication uses the binary log dump, replication I/O receiver, and replication SQL applier threads. For more information about how the threads work, see Replication threads on the MySQL website. If there's a delay in replication, then check whether IO_THREAD replica or SQL_THREAD replica causes the lag. Then, you can identify the root cause of the lag.
Resolution
Identify the replication thread that lags
Complete the following steps:
- To identify where the source or primary DB instance writes binary logs to, run the SHOW MASTER STATUS command on the primary DB instance:
Example output:mysql> SHOW MASTER STATUS;
Note: In the preceding example output, the source or primary DB instance writes the binary logs to the mysql-bin.066552 file.+----------------------------+----------+--------------+------------------+-------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +----------------------------+----------+--------------+------------------+-------------------+ | mysql-bin.066552 | 521 | | | | +----------------------------+----------+--------------+------------------+-------------------+ 1 row in set (0.00 sec) - To find your output status, run the SHOW SLAVE STATUS or SHOW REPLICA STATUS command on the replica DB instance:
mysql> SHOW SLAVE STATUS\G;
Example output 1:
*************************** 1. row *************************** Master_Log_File: mysql-bin.066548 Read_Master_Log_Pos: 10050480 Relay_Master_Log_File: mysql-bin.066548 Exec_Master_Log_Pos: 10050300 Slave_IO_Running: Yes Slave_SQL_Running: Yes
In the preceding example output, the Master_Log_File: mysql-bin.066548 shows that the IO_THREAD replica reads from the mysql-bin.066548 binary log file. The primary DB instance writes the binary logs to the mysql-bin.066552 file. The IO_THREAD replica is four binlogs behind. However, because the Relay_Master_Log_File is mysql-bin.066548, the SQL_THREAD replica reads from same file as IO_THREAD. The SQL_THREAD replica maintains speed, but the replica IO_THREAD lags.
Example output 2:
*************************** 1. row *************************** Master_Log_File: mysql-bin.066552 Read_Master_Log_Pos: 430 Relay_Master_Log_File: mysql-bin.066530 Exec_Master_Log_Pos: 50360 Slave_IO_Running: Yes Slave_SQL_Running: Yes
The preceding example output shows that the primary instance's log file is mysql-bin.066552. The IO_THREAD maintains speed with the primary DB instance. In the replica output, the SQL thread performs Relay_Master_Log_File: mysql-bin.066530. As a result, SQL_THREAD lags 22 binary logs behind.
Because the IO_THREAD reads the binary logs only from the primary or source instance, the IO_THREAD doesn't usually cause large replication delays. However, network connectivity and network latency can affect the speed of the reads between the servers. High bandwidth usage can cause the IO_THREAD replica to perform more slowly.
If the SQL_THREAD replica causes the replication delays, then use the following troubleshooting steps to resolve your issue.
Long-running write queries on the primary instance
Long-running write queries on the primary DB instance that take an equal amount of time to run on the replica DB instance can increase seconds_behind_master. For example, if a change to the primary instance takes 1 hour to run, then the lag is 1 hour. If the change also takes 1 hour to complete on the replica, then the total lag is 2 hours.
To minimize lag, monitor the slow query log on the primary instance and identify queries that need optimization. You can also reduce long-running statements to smaller statements or transactions.
To troubleshoot binlog-related performance issues, analyze wait events found in Performance Insights for more information. You can also adjust isolation levels.
Insufficient DB instance class size or storage
If the replica DB instance's class or storage configuration is lower than the primary instance's, then the replica might throttle because of insufficient resources. The replica can't maintain the number of changes on the primary instance.
To resolve this issue, make sure that the replica's DB instance type is the same as or higher than the primary DB instance. For replications to effectively operate, each read replica requires the same number of compute and storage resources as the source DB instance.
Parallel queries run on the primary DB instance
By default, MySQL replication is single threaded. When you run queries in parallel on the primary instance, the queries commit on the replica in a serial order. When a high volume of writes to the source instance occurs in parallel, the writes to the read replica use a single SQL_THREAD to serialize. Then, a lag between the source DB instance and read replica might occur.
With MySQL versions 8.0.27 and later, the default value for replica_parallel_workers is 4. This value means that replicas are multi-threaded by default. Similarly, for Aurora MySQL version 3.04 and later, replication is multithreaded by default, with the replica_parallel_workers value set as 4. You can modify this parameter in your custom parameter group.
For more information about multi-threaded replication, see Binary logging options and variables on the MySQL website.
Multi-threaded replication can cause gaps in replication. For example, it's difficult to identify transactions that you skip. When you use multi-threaded replication, it's not a best practice to skip the replication errors. Gaps in data consistency between the primary and replica DB instances might occur.
Binary logs synced to the disk on the replica DB instance
When you turn on automatic backups on the replica, overhead occurs to sync the binary logs to the disk on the replica. The default value of the sync_binlog parameter is set to 1. If you change the value to 0, then the MySQL server can't synchronize the binary log to the disk. Instead, the operating system (OS) occasionally flushes the binary logs to disk.
To reduce the performance overhead that's required to sync the binary logs to disk on every commit, turn off the binary log synchronization. However, if there's a power failure or the OS crashes, then some of the commits might not synchronize to the binary logs. Asynchronization can affect point-in-time restore (PITR) capabilities. For more information, see sync_binlog on the MySQL website.
Binlog_format is set to ROW
When replication occurs in the following scenarios, the SQL thread performs a full table scan:
- The binlog_format on the primary DB instance is set to ROW.
- The source table doesn't have a primary key.
This scan happens because the default value of the slave_rows_search_algorithms parameter is TABLE_SCAN,INDEX_SCAN.
To temporarily resolve this issue, change the search algorithm to INDEX_SCAN,HASH_SCAN to reduce the overhead of a full table scan. For a more permanent solution, it's a best practice to add an explicit primary key to each table.
For more information about the slave-rows-search-algorithms parameter, see slave_rows_search_algorithms on the MySQL website.
Replica creation lag (Applicable for RDS MySQL)
To create a read replica of a MySQL primary instance, Amazon RDS takes a DB snapshot. Then, Amazon RDS restores the snapshot to create a new DB instance and establishes replication between the two.
After Amazon RDS establishes replication, a lag occurs when Amazon RDS creates a backup of the primary DB instance. To minimize the lag, create a manual backup before you call for the replica creation. The DB snapshot then becomes an incremental backup.
When you restore a read replica from a snapshot, the replica doesn't wait for all the data to transfer from the source. The replica DB instance is available to perform the DB operations. The existing Amazon Elastic Block Store (Amazon EBS) snapshots create a new volume in the background.
Note: For Amazon RDS for MySQL replicas (Amazon EBS-backed volumes), the replica lag might initially increase because lazy loading can affect replication performance.
To reduce the effects of lazy loading on tables for your new read replica, perform operations that use full-table scans. For example, run mysqldump on your read replica for specific tables or databases to prioritize all backed-up table data from Amazon Simple Storage Service (Amazon S3).
You can also use the on demand InnoDB cache warming feature for RDS MySQL. The InnoDB cache warming feature saves the buffer pool state on disk in a file that's named ib_buffer_pool in the InnoDB data directory. Because Amazon RDS dumps the current state of the primary DB instance buffer pool before you create the read replica, performance improves with the lesser load. Then, you can reload the buffer pool after you create the read replica.
Effects of replica Lag
A large replica lag on the replica instance or cluster can cause accumulation of binary logs on the source. To prevent possible issues, monitor and manage binlog file sizes and disk space usage. For RDS, you can monitor the BinLogDiskUsage metric on the source DB instance.
Configure optimal binlog retention period to balance point-in-time recovery capabilities with storage usage in an RDS MySQL instance and on Aurora. Additionally, to leverage binlogs for replication and recovery, understand binlog file naming conventions and rotation behavior. To get a list of the available logs, run the SHOW BINARY LOGS command.
Related information
Working with MySQL replication in Amazon RDS
- Language
- English
Related videos


Relevant content
- asked 4 years ago
- asked a year ago
- asked 3 years ago
- asked 3 years ago