AWS Aurora MySQL cluster - BUG - Wrong DNS resolution for 5 seconds after failover

0

We are experiencing a problem during a manual failover of an AWS Aurora MySQL cluster.

The cluster has just one writer and one reader. The problem is that shortly after the failover the DNS still resolves to the IP of the old writer, which has now become a reader, and if an update is triggered the client gets the error: "The MySQL server is running with the --read-only option so it cannot execute this statement"

You can reproduce the error with this simple bash script:

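#!/bin/bash

# Once per second: print the time, run a read and a write through the
# cluster endpoint, then show what the endpoint currently resolves to.
while true; do
  date
  mysql -h clusterendpoint -u user -ppassword -D test -e "select @@hostname; update test_table set date = now() where id = 1;" 2>&1 | grep -v "Using a password on the command line interface"
  dig +short clusterendpoint
  echo ""
  sleep 1
done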

This is an example of the output after triggering the failover manually:

Wed 26 Apr 2023 03:40:54 PM UTC
@@hostname
ip-10-4-2-74
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:55 PM UTC
@@hostname
ip-10-4-2-74
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:56 PM UTC
@@hostname
ip-10-4-2-74
ERROR 2013 (HY000) at line 1: Lost connection to MySQL server during query
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

# ---> the cluster starts the failover, so the writer endpoint goes down and all TCP connections to the writer are terminated

Wed 26 Apr 2023 03:40:57 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:58 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:40:59 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:00 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:01 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:02 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:03 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:04 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:05 PM UTC
ERROR 2003 (HY000): Can't connect to MySQL server on 'clusterendpoint.eu-west-1.rds.amazonaws.com:3306' (111)
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7


# ---> this is the bug
#  1) immediately after the downtime ends, the clusterendpoint still resolves to the IP of the old writer (which is now the reader); the IP is the same as before the failover (ip-10-4-2-74 - 172.31.24.7)
#  2) the instance accepts connections but is now in READ ONLY mode, so an update triggers an error

Wed 26 Apr 2023 03:41:06 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:07 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:08 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:09 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

Wed 26 Apr 2023 03:41:10 PM UTC
@@hostname
ip-10-4-2-74
ERROR 1290 (HY000) at line 1: The MySQL server is running with the --read-only option so it cannot execute this statement
cluster-instance2.eu-west-1.rds.amazonaws.com.
172.31.24.7

# ---> after a few seconds the DNS is correct: it now points to the new writer 172.31.11.137 and the update query runs without problems

Wed 26 Apr 2023 03:41:11 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137

Wed 26 Apr 2023 03:41:12 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137

Wed 26 Apr 2023 03:41:13 PM UTC
@@hostname
ip-10-4-1-206
cluster-instance1.eu-west-1.rds.amazonaws.com.
172.31.11.137

In this simple example it is not a big issue, since about 5 seconds later everything works correctly again.

The real problem is that we use software written in Java that reconnects as soon as a connection is available again after the failover has closed it. It connects to whatever IP the cluster endpoint resolves to at that moment, and because for a few seconds that is still the old writer instance, it connects to that host believing it is the writer and starts logging errors: "The MySQL server is running with the --read-only option so it cannot execute this statement". To fix the problem we have to restart the application so that it resolves the correct IP from the cluster endpoint.

Why does AWS, after the downtime and the termination of the active TCP connections, still resolve the cluster endpoint to the old writer (now the reader) for a few seconds?
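To make the symptom easier to observe, the check below could be run alongside the reproduction script. It is only a sketch and rests on one assumption: that on Aurora MySQL a reader reports innodb_read_only = 1 and the writer reports 0. The function names role_from_flag and current_role are made up for illustration, and the host, user and password are the same placeholders as in the script above.

```shell
#!/bin/bash

# Map the value of innodb_read_only to a role.
# Assumption: Aurora MySQL readers report 1, the writer reports 0.
role_from_flag() {
  case "$1" in
    0) echo "writer" ;;
    1) echo "reader" ;;
    *) echo "unknown" ;;   # empty or unexpected value, e.g. connection failed
  esac
}

# Hypothetical helper: ask whichever instance the endpoint currently
# resolves to for its flag. -N drops the column header, -B gives plain output.
# Arguments: host, user, password.
current_role() {
  local flag
  flag=$(mysql -h "$1" -u "$2" -p"$3" -N -B -e "select @@innodb_read_only;" 2>/dev/null)
  role_from_flag "$flag"
}

# Example: current_role clusterendpoint user password
```

During the window described above, a call through the cluster endpoint would be expected to print "reader" even though the endpoint is supposed to lead to the writer.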

asked a year ago, 536 views
4 Answers
0

I have found this article: https://aws.amazon.com/blogs/database/improve-application-availability-on-amazon-aurora/ If you have the same problem, please upvote the comment on that article asking AWS to fix it.

answered a year ago
-1

Applications experience minimal interruption of service if they connect using the cluster endpoint and implement connection retry logic. During the failover, AWS modifies the cluster endpoint to point to the newly created/promoted DB instance. Well-architected applications reconnect automatically. The downtime during failover depends on the existence of healthy Read Replicas. If no Read Replicas are configured, or if existing Read Replicas are not healthy, then you might notice increased downtime to create a new instance.

What you are seeing is normal according to the documentation. The application needs to handle this kind of failure.

They even show an example of your failure.

As expected, an error occurs because read-only replicas don’t support writable transactions.

https://aws.amazon.com/blogs/database/failover-with-amazon-aurora-postgresql/

EXPERT
answered a year ago
  • I think you have not fully understood the problem and its impact. I'm NOT complaining about the downtime. I know there is downtime, but that is not the problem!

    The problem is that AFTER the downtime the client correctly tries to reconnect, but the DNS still gives the OLD IP and not the NEW one! I have also found an article online where they hit the same error during the failover: https://proxysql.com/blog/failover-comparison-in-aurora-mysql-2-10-0-using-proxysql-vs-auroras-cluster-endpoint/ search for: "ERROR 1290 The MySQL server is running with the –read-only "

  • Sorry I can’t be of assistance. All the best

  • No problem, thanks for the help. I wrote here hoping that a developer from AWS was watching this forum. This is a bug on the AWS side, and I have opened a case with support.

-1

A lot of applications cache the IP address of a host name and never re-resolve it after a failure. You may also want to check that your app handles a database disconnect correctly.

If you have to restart your app each time, then it is not correctly handling a database loss.

Otherwise, the TTL on the DNS record is whatever AWS sets.
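One way to see the TTL that AWS sets is to look at the answer section that dig returns for the cluster endpoint. The snippet below is a sketch rather than an official procedure: parse_answer is a made-up helper name, and it assumes the usual dig answer line layout of NAME TTL CLASS TYPE RDATA.

```shell
#!/bin/bash

# Print "TTL TYPE TARGET" for each record in dig's answer section.
# Each answer line is expected to look like: NAME TTL CLASS TYPE RDATA
parse_answer() {
  awk 'NF >= 5 { print $2, $4, $5 }'
}

# Usage against the placeholder endpoint from the question:
#   dig +noall +answer clusterendpoint | parse_answer
```

A comment further down this thread reports a 5-second TTL on the CNAME. A client that caches the resolved address for longer than the TTL would keep using the old IP no matter what the record says.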

EXPERT
answered a year ago
  • Hi, what is the contribution of this answer to the problem?

  • Apologies. I have reanswered. This is expected behaviour.

    The application code needs to handle this type of failure.

-1

It sounds like this has to do with the time-to-live (TTL) on the DNS entries for the cluster. Can you check for the instance being read-only and, if it is, reset the connection within your application after some amount of waiting?
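The idea can be sketched as a small wait loop, shown in bash to match the script in the question. This is only an illustration under stated assumptions: wait_until and is_writer are made-up names, the probe assumes that a writable Aurora MySQL instance reports innodb_read_only = 0, and the host, user and password are the question's placeholders. Each mysql invocation here is a new process and so resolves the endpoint again, which a long-running application with its own DNS cache may not do.

```shell
#!/bin/bash

# Run a probe command repeatedly until it succeeds or the attempts run out.
# $1 = maximum attempts, $2 = seconds to wait between attempts,
# remaining arguments = the probe command and its arguments.
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    sleep "$delay"
  done
  return 1
}

# Hypothetical probe: succeeds only when the instance behind the endpoint
# is writable. Assumption: the writer reports innodb_read_only = 0.
# Arguments: host, user, password.
is_writer() {
  [ "$(mysql -h "$1" -u "$2" -p"$3" -N -B -e 'select @@innodb_read_only;' 2>/dev/null)" = "0" ]
}

# Example: wait_until 30 1 is_writer clusterendpoint user password
```

In an application, the equivalent would be to discard the connection when the check fails and open a new one after the wait, rather than reusing the connection to the old writer.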

answered a year ago
  • What do you mean exactly? The TTL expiration must be handled by AWS during the downtime, and the cluster should not accept new connections until the TTL has expired. From what I see, the TTL on the CNAME is 5 seconds. The application is legacy and we cannot modify it.
