We are experiencing a problem during the manual failover of a cluster of AWS Aurora MySQL.
The cluster has just one writer and one reader, the problem is that short after the failover the DNS still resolve with IP of the old writer that has become now a reader and if an update is triggered the client gets the error: "The MySQL server is running with the --read-only option so it cannot execute this statement"
You can reproduce the error with this simple bash script:
#!/bin/bash
while [[ 1 ]]; do
date
mysql -h clusterendpoint -u user -ppassword -D test -e "select @@hostname; update test_table set date = now() where id = 1; " 2>&1 | grep -v "Using a password on the command line interface"
dig +short clusterendpoint
echo ""
sleep 1
done
this is an example of output after triggering the failover manually:
Wed 26 Apr 2023 03:40:54 PM UTC
@@hostname
ip-10-4-2-74
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:55 PM UTC
@@hostname
ip-10-4-2-74
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:56 PM UTC
@@hostname
ip-10-4-2-74
ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
# ---> the cluster start the failover so the writer endpoint goes down and all the TCP connections to the writer are terminated
Wed 26 Apr 2023 03:40:57 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:58 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:40:59 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:00 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:01 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:02 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:03 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:04 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:05 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
# ---> this is the bug
# 1) immediately after the finish of the down the clusterendpoint is still resolving with the IP of the old writer (that is now the reader), the IP is still the same as before the failover (ip-10-4-2-74 - 172.31.24.7)
# 2) the instance accepts connection but now is READ ONLY mode so an update triggers an error
Wed 26 Apr 2023 03:41:06 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:07 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:08 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:09 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
Wed 26 Apr 2023 03:41:10 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7
# ---> after some seconds the DNS is now correct, it now points to the new writer 172.31.11.137 and the update query runs without problem
# after some secs goes to the correct ip
Wed 26 Apr 2023 03:41:11 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:41:12 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
Wed 26 Apr 2023 03:41:13 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137
in this simple example is not a big issue since 5 seconds after all is working correctly.
The real problem is that we are using a software written in java that as soon that the connection is available (after it has been closed for the failover) connects immediately to the IP resolved by the cluster endpoint and since for some seconds it get the old writer instance it connects to that host thinking that is the writer instance and start to log errors: "The MySQL server is running with the --read-only option so it cannot execute this statement". To fix the problem we need to restart the application so that it resolves the correct IP from the cluster endpoint.
Why AWS after the down and the disconnections of the active TCP connections for some seconds resolves with the old writer / current reader?
I think you do not have understood fully the problem and the impatcs, I'm NOT complaining of the down, I know that there is the down but this is not the problem!
The problem is that AFTER the down the client is correctly trying to reconnect but the DNS still give the OLD IP and not the NEW one! I have also found an article online where they had the same error during the failover: https://proxysql.com/blog/failover-comparison-in-aurora-mysql-2-10-0-using-proxysql-vs-auroras-cluster-endpoint/ search for: "ERROR 1290 The MySQL server is running with the –read-only "
Sorry I can’t be of assistance. All the best
no problem, thanks for the help, I have wrote here hoping that someone developer from AWS was watching this forum, this is a bug on AWS, I have opened a case in the support