Troubleshooting RDS outage on March 9, 2024


We had a rather bizarre outage on the evening of March 9, 2024 (around 7 PM Mountain time). The RDS clusters in two regions (us-east-1 and us-west-1), on two different AWS accounts, stopped responding. One of them was fronted by an RDS Proxy, and even that failed. A manual connection attempt from an EC2 instance failed with this error:

ERROR 2002 (HY000): Received error packet before completion of TLS handshake. The authenticity of the following error cannot be verified: 1040 - Received error packet before completion of TLS handshake. The authenticity of the following error cannot be verified:

I looked at RDS > Certificates to see if perhaps a certificate had expired. The page wouldn't load (just spinning, and it is still spinning three days later).

The system recovered on its own after about an hour, and all servers reconnected to the database. AWS has no record of an outage, and the ticket I logged the same day has not been assigned or responded to.

We are on the tail end of converting our entire infrastructure to RDS, and this makes me nervous: it looks like a single point of failure in a system that is designed to be highly redundant. Even a replica in an entirely different region would not have helped here.

My questions are these:

  1. Did anyone else have an outage at that date and time?
  2. How can I go about determining the cause?
  3. Has anyone else experienced such a failure, and do you have any ideas on how we can prevent it?
asked 2 months ago · 323 views
3 Answers

Hi there,

I understand you faced an outage on March 9, 2024, in which RDS clusters in two regions (us-east-1 and us-west-1) on two different AWS accounts stopped responding.

I have checked internally, and no global issue was detected during the reported time frame. We also have not received reports from other customers of outages at the same date and time.

I additionally attempted to locate the error message you received; searching for "ERROR 2002 (HY000)" on its own returned several matches, none of which corresponded to the issue you described.

To thoroughly understand the root cause of the outage, a deeper analysis of your resources is necessary. I would therefore recommend following up on the case you created, as that team can access the resources in question and offer more tailored recommendations.

AWS
answered a month ago
EXPERT
reviewed a month ago
Accepted Answer

On 4/1/2024 this happened again, this time with a single customer (3 EC2 instances, 1 RDS cluster). I was able to reproduce the same error when connecting with the mysql client. This time I captured a tcpdump of the traffic during the connection attempt while logging in with mysql, then analyzed it with Wireshark:

tcpdump port 3306 -w /root/file.cap

The connection error reported by the mysql client was hiding the real error from the server: "too many connections". I had mistakenly assumed the connections were building up because a TLS connection could not be made; in fact, the TLS error reported by the client was caused by the server rejecting the connection due to "too many connections". With this knowledge I can now go back and figure out what is rapidly ramping up connections beyond the already large established pool, and doing so in a way that triggers the limit for all servers in two different regions and two different accounts at the same time. Here are my theories:

  1. The connections are building up due to poor coding and are then hit with a spike that puts them over the edge (the limit established for RDS).
  2. The spike could be from our external monitoring system which hits all the clouds on a schedule.
  3. Or it could be from the health check done by the AWS load balancers.
  4. Or it could be legitimate, spiked HTTP traffic (unlikely, but possible).
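For reference, here is a short sketch of what the capture revealed. Before any TLS negotiation, the server answers with a plain MySQL ERR packet: a 3-byte little-endian payload length, a 1-byte sequence id, then a 0xFF marker, a 2-byte little-endian error code, and the message. The bytes below are synthetic, reconstructed to match error 1040 — they are not taken from the actual capture:

```python
def parse_mysql_first_packet(data: bytes):
    """Classify the first packet a MySQL server sends on a new connection.

    Returns ("greeting", None) for a normal handshake greeting, or
    ("error", (code, message)) for a pre-handshake ERR packet.
    """
    payload_len = int.from_bytes(data[0:3], "little")  # 3-byte LE payload length
    payload = data[4:4 + payload_len]                  # byte 3 is the sequence id
    if payload[:1] == b"\xff":                         # ERR packet marker
        code = int.from_bytes(payload[1:3], "little")
        msg = payload[3:]
        if msg[:1] == b"#":                            # optional SQL-state marker
            msg = msg[6:]
        return ("error", (code, msg.decode("utf-8", "replace")))
    return ("greeting", None)

# Synthetic ERR packet: payload length 0x17, sequence 0, 0xff, code 1040 (0x10 0x04 LE)
err_1040 = b"\x17\x00\x00\x00" + b"\xff\x10\x04" + b"Too many connections"
print(parse_mysql_first_packet(err_1040))  # ('error', (1040, 'Too many connections'))
```

The mysql client receives this server-side ERR packet at the point where it expects the TLS exchange to proceed, which is why it reports it as an error whose authenticity "cannot be verified" before completion of the TLS handshake.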
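To sanity-check theory 1, a rough headroom calculation helps. If the cluster uses the default RDS MySQL parameter group, max_connections is derived from instance memory as {DBInstanceClassMemory/12582880}. The pool and spike numbers below are illustrative assumptions, not measurements from the outage:

```python
def rds_mysql_default_max_connections(instance_memory_bytes: int) -> int:
    """Default max_connections for RDS MySQL: {DBInstanceClassMemory/12582880}."""
    return instance_memory_bytes // 12582880

def will_exhaust(max_connections: int, steady_pool: int, spike: int) -> bool:
    """True if a transient spike on top of the steady connection pools
    would trip error 1040 'Too many connections'."""
    return steady_pool + spike > max_connections

limit = rds_mysql_default_max_connections(16 * 1024**3)  # e.g. a 16 GiB instance
print(limit)                                             # 1365
print(will_exhaust(limit, steady_pool=1200, spike=200))  # True: a burst tips it over
print(will_exhaust(limit, steady_pool=1200, spike=100))  # False: still inside the limit
```

Comparing the CloudWatch DatabaseConnections metric against this limit around the outage window should show whether the steady pools were already running this close to the ceiling.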
answered 25 days ago

As of 3/19/2024, here is the status. I marked the issue originally created for the RDS team as resolved, basically because, as you outlined, there were no indications of an outage other than some RDS logs reporting failed connections. The RDS team suggested I escalate a new issue to the EC2 team. The EC2 team did an initial review and asked for the exact date/time of the outage and any log files. The log files I submitted came from 30 different EC2 instances in 2 different regions in 2 different accounts; all concur that the outage happened at the same time and lasted roughly 60 minutes.

My theory is still that something went wrong with a common TLS handshake service. The JDBC logs give no indication that the connection failure was TLS-related; the theory is based on the one manual connection I attempted (described above) and subsequent analysis of the packets between the EC2 instance and RDS during authentication, where Wireshark shows TLS is involved.

At this point I am waiting on analysis by the EC2 team, and I have suggested this be escalated, as it potentially reveals a single point of failure in a system designed to be highly available.

answered a month ago
