Replication lag issue during and after upgrade from mysql5.7 to mysql 8.0.36

0

The main issue is the replication lag that isn't decreasing

Good to know: the replication on mysql5.7 with the same setup, multi-az, same instances and storage was almost perfect with only some minor peaks in replication lag.

The road to 8

  • From 5.7 with a blue/green deployment upgrading to 8 was not possible, the blue/green replication lag didn't reach zero, it only steadily increased. Blue/green had similar instance type and storage size
  • The upgrade was managed by shutting down all sites and workers, create a snapshot and create a new instance from that snapshot and upgrade to 8

The workload is write-heavy and requires at least one replica. Primary and replica are a db.r5.x2large with 1000GB storage.

The replica

  • Adding a replica throws the notice The database is using the doublewrite buffer. RDS Optimized Writes is incompatible with the storage configuration. For more information, see the RDS Optimized Writes for MYSQL documentation. but isn't the main focus for this thread (although all criteria from the docs are met... https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/rds-optimized-writes.html#rds-optimized-writes-using)
  • After the creation, which takes around 4h (because of the storage optimisation) the replication lag is a steady increasing line - even when the workload on the primary is lower.

The current "poor man's solution" is to create a new replica twice a day, this way we keep the replication lag under 14h.

Observations:

  1. Via https://repost.aws/knowledge-center/rds-mysql-high-replica-lag we have learned that SQL_THREAD is the source of the delay. None of the given solutions make a difference.
  2. The binlog disk usage on the primary easily gets to 100GB
  3. The replica is under-performing and not reaching a total 3000IOPS

Things we tried and didn't solve the issue

  1. A brand new blue/green deployment on top of the current mysql 8 instance - same story, replication lag steadily increases
  2. Added 500GB to a fresh replica, total IOPS didn't budge and stayed well under 2000 - where the max should be 4500 (Note: the primary is easily reaching the max of 3000 IOPS in total)
  3. s-l-a-v-e_parallel_workers (or replica_parallel_workers) from "default" (either zero, 1 or 4, the var-name and docs are unclear) to 8 - this didn't increase the throughput at all (note; this setting was applied + rebooted on primary and replica)
  4. Disabled the sync of the binlog via sync_binlog = 0 on the replica - no changes.

At this point we are out of options and hope there is somebody here that can assist us with this challenge.

Thanks Rob


Update 1

A comparison between the NetworkTransmitThroughput and ReplicaLag. The lag is "a bit better" but still increasing.

On mysql 5.7 with multi-az, high troughput, a bit of lag sometimes Enter image description here

On mysql 8, the first section is without multi-az, the second section is with multi-az - which shows an increase in throughput, but not anywhere near the throughput while on mysql5.7 Enter image description here


Update 2

With all suggested config and params tried, replication still increases.

At this point we think the delay is caused by not having optimised writes - somehow the instance keep throwing The database is using the doublewrite buffer. RDS Optimized Writes is incompatible with the storage configuration. For more information, see the RDS Optimized Writes for MYSQL documentation even though all settings are correct.

Next steps are waiting for support from AWS or move to a single but powerful instance or move to Aurora (although we would waste a lot of reserved instance resources)


Update 3

Switching to a single instance helped (obviously) and it's able to take all reads + writes. Not the best solution. In the meantime we'll wait on support (and give the replication-thread-workers option another try)


Update 4

We started another fresh replica and gathered as much data as possible:

Replica lag increases steadily after creation replica-lag-increase

PI overview of replica - stays empty (According to the AWS expert, this should have shown some stats like the primary?) performance-insight-on-replica-empty

PI overview of primary performance-insights-on-primary

Replica Monitoring screenshots replica-monitoring-1 replica-monitoring-2 replica-monitoring-3

Output of show master status

| File                        | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set |
|-----------------------------|----------|--------------|------------------|-------------------|
| mysql-bin-changelog.1193668 | 19399441 |              |                  |                   |

More details in my next comment, since the max post-size has been reached.

Rob
asked a month ago168 views
2 Answers
0

Consider to troubleshooting the following:

  1. Review Instance Class and Storage Configuration:

    • Ensure that the instance class and storage type/configuration meet the performance requirements of your workload. Consider upgrading the instance class or adjusting the storage configuration if necessary to handle the workload's I/O demands.
  2. Optimize Binlog Configuration:

    • Review and optimize the binlog configuration parameters on both the primary and replica instances. Ensure that the binlog format (binlog_format) and binlog retention settings are appropriate for your workload.
  3. Monitor Disk I/O and Throughput:

    • Use Amazon CloudWatch or other monitoring tools to closely monitor disk I/O metrics on both the primary and replica instances. Identify any spikes or bottlenecks in disk throughput that could be contributing to replication lag.
  4. Review Network Configuration:

    • Check the network configuration between the primary and replica instances. Ensure that there are no network latency issues or bandwidth constraints affecting replication performance.
  5. Optimize Database Queries:

    • Review and optimize the database queries executed on the primary instance to reduce the workload's impact on replication. Consider optimizing queries, adding appropriate indexes, and avoiding long-running transactions.
  6. Review AWS Support:

    • Consider reaching out to AWS Support for assistance. They may be able to provide insights and recommendations specific to your RDS setup and help troubleshoot the replication lag issues.
  7. Consider Alternative Solutions:

    • If replication lag persists despite optimization efforts, consider alternative solutions such as using a different replication method (e.g., asynchronous replication with delayed secondary) or exploring managed database services other than RDS.
  8. Database Schema Optimization:

    • Review and optimize the database schema design to reduce contention and improve replication efficiency. This may involve partitioning large tables, denormalizing where appropriate, or optimizing schema indexes.
  9. Database Maintenance:

    • Regularly perform database maintenance tasks such as vacuuming, analyzing, and optimizing tables to maintain database health and performance.
  10. Database Migration:

    • If all else fails and replication lag remains a persistent issue, consider migrating to a different database solution or architecture that better suits your performance and scalability requirements.
profile picture
EXPERT
answered a month ago
  • Thanks - I have added the most important line to my post: replication on the same setup running mysql 5.7 was near perfect with only some minor peaks in lag once in a while. With the switch to 8 we expected to get better performance and similar replication.

0

The follow-up on update #4

Output of show slave status

| Slave_IO_State                   | Master_Host | Master_User  | Master_Port | Connect_Retry | Master_Log_File             | Read_Master_Log_Pos | Relay_Log_File  | Relay_Log_Pos | Relay_Master_Log_File       | Slave_IO_Running | Slave_SQL_Running | Replicate_Do_DB | Replicate_Ignore_DB | Replicate_Do_Table | Replicate_Ignore_Table                                                                                                                                                                                            | Replicate_Wild_Do_Table | Replicate_Wild_Ignore_Table | Last_Errno | Last_Error | Skip_Counter | Exec_Master_Log_Pos | Relay_Log_Space | Until_Condition | Until_Log_File | Until_Log_Pos | Master_SSL_Allowed | Master_SSL_CA_File | Master_SSL_CA_Path | Master_SSL_Cert | Master_SSL_Cipher | Master_SSL_Key | Seconds_Behind_Master | Master_SSL_Verify_Server_Cert | Last_IO_Errno | Last_IO_Error | Last_SQL_Errno | Last_SQL_Error | Replicate_Ignore_Server_Ids | Master_Server_Id | Master_UUID                          | Master_Info_File        | SQL_Delay | SQL_Remaining_Delay | Slave_SQL_Running_State                           | Master_Retry_Count | Master_Bind | Last_IO_Error_Timestamp | Last_SQL_Error_Timestamp | Master_SSL_Crl | Master_SSL_Crlpath | Retrieved_Gtid_Set | Executed_Gtid_Set | Auto_Position | Replicate_Rewrite_DB | Channel_Name | Master_TLS_Version | Master_public_key_path | Get_master_public_key | Network_Namespace |
|----------------------------------|-------------|--------------|-------------|---------------|-----------------------------|---------------------|-----------------|---------------|-----------------------------|------------------|-------------------|-----------------|---------------------|--------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|-----------------------------|------------|------------|--------------|---------------------|-----------------|-----------------|----------------|---------------|--------------------|--------------------|--------------------|-----------------|-------------------|----------------|-----------------------|-------------------------------|---------------|---------------|----------------|----------------|-----------------------------|------------------|--------------------------------------|-------------------------|-----------|---------------------|---------------------------------------------------|--------------------|-------------|-------------------------|--------------------------|----------------|--------------------|--------------------|-------------------|---------------|----------------------|--------------|--------------------|------------------------|-----------------------|-------------------|
| Waiting for master to send event | 172.20.2.70 | rdsrepladmin |        3306 |            60 | mysql-bin-changelog.1193659 |            17271874 | relaylog.000175 |      77384349 | mysql-bin-changelog.1193367 | Yes              | Yes               |                 |                     |                    | innodb_memcache.cache_policies,innodb_memcache.config_options,mysql.plugin,mysql.rds_configuration,mysql.rds_history,mysql.rds_monitor,mysql.rds_replication_status,mysql.rds_sysinfo,mysql.rds_upgrade_prechecks |                         |                             |          0 |            |            0 |            77384111 |     36609107623 | None            |                |             0 | Yes                |                    |                    |                 |                   |                |                 10603 | No                            |             0 |               |              0 |                |                             |       1628242188 | 009a9080-ebc1-11ee-a3af-027b96cf0647 | mysql.slave_master_info |         0 |                     | Waiting for slave workers to process their queues |              86400 |             |                         |                          |                |                    |                    |                   |             0 |                      |              |                    |                        |                     0 |                   |

Output of SHOW PROCESSLIST Multiple outputs of all variants I could see. #1

| Id | User            | Host              | db     | Command | Time  | State                                             | Info             |
|----|-----------------|-------------------|--------|---------|-------|---------------------------------------------------|------------------|
|  5 | event_scheduler | localhost         |        | Daemon  |  9722 | Waiting on empty queue                            |                  |
|  7 | rdsadmin        | localhost         |        | Sleep   |     8 |                                                   |                  |
|  8 | system user     | connecting host   |        | Connect |  9668 | Waiting for master to send event                  |                  |
|  9 | system user     |                   |        | Query   |     0 | Waiting for slave workers to process their queues |                  |
| 10 | system user     |                   |        | Connect | 10725 | waiting for handler commit                        |                  |
| 11 | system user     |                   |        | Connect | 10725 | waiting for handler commit                        |                  |
| 12 | rdsadmin        | localhost         |        | Sleep   |     0 |                                                   |                  |
| 46 | master          | 172.17.98.7:60464 | SEARCH | Query   |     0 | init                                              | SHOW PROCESSLIST |

#2

| Id | User            | Host              | db     | Command | Time  | State                                             | Info                                                                                                 |
|----|-----------------|-------------------|--------|---------|-------|---------------------------------------------------|------------------------------------------------------------------------------------------------------|
|  5 | event_scheduler | localhost         |        | Daemon  |  9762 | Waiting on empty queue                            |                                                                                                      |
|  7 | rdsadmin        | localhost         |        | Sleep   |     8 |                                                   |                                                                                                      |
|  8 | system user     | connecting host   |        | Connect |  9708 | Waiting for master to send event                  |                                                                                                      |
|  9 | system user     |                   |        | Query   |     0 | Waiting for slave workers to process their queues |                                                                                                      |
| 10 | system user     |                   | SEARCH | Connect | 10756 | starting                                          | INSERT INTO `SEARCH_PRICE` (
| 11 | system user     |                   | SEARCH | Connect | 10756 | update                                            | INSERT INTO `SEARCH_PRICE` (
| 12 | rdsadmin        | localhost         |        | Sleep   |     0 |                                                   |                                                                                                      |
| 46 | master          | 172.17.98.7:60464 | SEARCH | Query   |     0 | init                                              | SHOW PROCESSLIST                                                                                     |

#3

|----|-----------------|-------------------|--------|---------|-------|---------------------------------------------|------------------|
|  5 | event_scheduler | localhost         |        | Daemon  |  9815 | Waiting on empty queue                      |                  |
|  7 | rdsadmin        | localhost         |        | Sleep   |     1 |                                             |                  |
|  8 | system user     | connecting host   |        | Connect |  9761 | Waiting for master to send event            |                  |
|  9 | system user     |                   |        | Query   |     0 | Waiting for dependent transaction to commit |                  |
| 10 | system user     |                   | SEARCH | Connect | 10799 | waiting for handler commit                  | COMMIT           |
| 11 | system user     |                   |        | Connect | 10799 | Waiting for an event from Coordinator       |                  |
| 12 | rdsadmin        | localhost         |        | Sleep   |     1 |                                             |                  |
| 46 | master          | 172.17.98.7:60464 | SEARCH | Query   |     0 | init                                        | SHOW PROCESSLIST |

#4

| Id | User            | Host              | db     | Command | Time  | State                                             | Info                                                                                                 |
|----|-----------------|-------------------|--------|---------|-------|---------------------------------------------------|------------------------------------------------------------------------------------------------------|
|  5 | event_scheduler | localhost         |        | Daemon  |  9856 | Waiting on empty queue                            |                                                                                                      |
|  7 | rdsadmin        | localhost         |        | Sleep   |    12 |                                                   |                                                                                                      |
|  8 | system user     | connecting host   |        | Connect |  9802 | Waiting for master to send event                  |                                                                                                      |
|  9 | system user     |                   |        | Query   |     0 | Waiting for slave workers to process their queues |                                                                                                      |
| 10 | system user     |                   |        | Connect | 10832 | Waiting for preceding transaction to commit       |                                                                                                      |
| 11 | system user     |                   | SEARCH | Connect | 10832 | updating                                          | UPDATE SEARCH_PRICE SET BOOKABLE = 0 WHERE BOOKABLE = 1 AND `DATE` >= '2024-05-01' AND `DATE` <= '20 |
| 12 | rdsadmin        | localhost         |        | Sleep   |     1 |                                                   |                                                                                                      |
| 46 | master          | 172.17.98.7:60464 | SEARCH | Query   |     0 | init                                              | SHOW PROCESSLIST                                                                                     |

#5

| Id | User            | Host              | db     | Command | Time  | State                                       | Info                                                                                                 |
|----|-----------------|-------------------|--------|---------|-------|---------------------------------------------|------------------------------------------------------------------------------------------------------|
|  5 | event_scheduler | localhost         |        | Daemon  |  9905 | Waiting on empty queue                      |                                                                                                      |
|  7 | rdsadmin        | localhost         |        | Sleep   |     1 |                                             |                                                                                                      |
|  8 | system user     | connecting host   |        | Connect |  9851 | Waiting for master to send event            |                                                                                                      |
|  9 | system user     |                   |        | Query   |     0 | Waiting for dependent transaction to commit |                                                                                                      |
| 10 | system user     |                   | SEARCH | Connect | 10870 | System lock                                 | UPDATE SEARCH_PRICE SET BOOKABLE = 0 WHERE BOOKABLE = 1 AND `DATE` >= '2024-03-27' AND `DATE` <= '20 |
| 11 | system user     |                   |        | Connect | 10870 | Waiting for an event from Coordinator       |                                                                                                      |
| 12 | rdsadmin        | localhost         |        | Sleep   |     0 |                                             |                                                                                                      |
| 46 | master          | 172.17.98.7:60464 | SEARCH | Query   |     0 | init                                        | SHOW PROCESSLIST                                                                                     |
Rob
answered 9 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.

Guidelines for Answering Questions