Why did my Amazon RDS DB instance restart, recover, or failover?

I want to know the root cause for the restart, recover, or failover of my Amazon Relational Database Service (Amazon RDS) database instance.

Short description

The Amazon RDS DB instance automatically performs a restart under the following conditions:

  • The primary Availability Zone loses availability, or performance bottlenecks and resource contention cause an excessive workload.
  • There's an underlying infrastructure issue with the primary instance. This issue can be a loss of network connectivity to the primary instance, or a compute unit or storage issue on the primary instance.
  • The DB instance class type is changed as part of a DB instance vertical scaling activity.
  • The underlying host of the DB instance is undergoing OS patching during a specific maintenance window. For more information, see Maintaining a DB instance and Upgrading a DB instance engine version.
  • You used the Reboot or Reboot with failover option to initiate a manual reboot of the DB instance.

When the DB instance shows potential issues and fails to respond to Amazon RDS health checks, Amazon RDS automatically takes one of the following actions:

  • For a Single-AZ deployment, Amazon RDS initiates a Single-AZ recovery.
  • For a Multi-AZ deployment, Amazon RDS initiates a Multi-AZ failover.

Then, Amazon RDS restarts the DB instance so that you can resume database operations as quickly as possible without administrative intervention.
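
To know which of the two actions to expect, you can check whether the instance is a Multi-AZ deployment. The following Python code is a minimal sketch, not a definitive implementation. It assumes a boto3 RDS client that you create yourself (for example, boto3.client("rds")) and pass in. The function name expected_recovery_action is hypothetical.

```python
def expected_recovery_action(rds_client, db_instance_id):
    """Return the automatic action described above for this deployment type."""
    response = rds_client.describe_db_instances(DBInstanceIdentifier=db_instance_id)
    instance = response["DBInstances"][0]
    # MultiAZ is a boolean field on the DB instance description.
    if instance.get("MultiAZ"):
        return "Multi-AZ failover"
    return "Single-AZ recovery"
```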

Resolution

To identify the cause of the outage, check the following logs and metrics for your DB instance. Also, make sure to follow best practices.

Amazon RDS event messages

To identify the root cause of an unplanned outage in your instance, view all the Amazon RDS events for the last 24 hours. By default, Amazon RDS records event times in UTC/GMT. To store events for a longer period, send the Amazon RDS events to Amazon CloudWatch Events. When your instance restarts, one of the following messages appears in your Amazon RDS event notifications.
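
As an alternative to the console, you might retrieve the last 24 hours of events programmatically. The following Python code is a sketch under stated assumptions: it expects a boto3 RDS client that you pass in, and the function name recent_rds_events is hypothetical. It calls the DescribeEvents API operation with a look-back window of 1,440 minutes and follows the pagination marker.

```python
from datetime import timezone


def recent_rds_events(rds_client, db_instance_id, minutes=1440):
    """Return (UTC timestamp, message) pairs for one DB instance, oldest first."""
    kwargs = {
        "SourceIdentifier": db_instance_id,
        "SourceType": "db-instance",
        "Duration": minutes,  # look-back window in minutes; 1440 = 24 hours
    }
    events = []
    while True:
        page = rds_client.describe_events(**kwargs)
        for event in page.get("Events", []):
            # Normalize to UTC so timestamps line up with other AWS data.
            when = event["Date"].astimezone(timezone.utc)
            events.append((when, event["Message"]))
        marker = page.get("Marker")
        if not marker:
            break
        kwargs["Marker"] = marker  # request the next page
    return sorted(events)
```

Because the client is a parameter, the function runs against any object that exposes a compatible describe_events method, which makes it straightforward to exercise without AWS credentials.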

The RDS instance was modified by customer

This RDS event message indicates that an RDS instance modification initiated the failover.

Applying modification to database instance class

This RDS event message indicates that the DB instance class type was changed. The impact depends on the deployment type:

  • Single-AZ deployments become unavailable for a few minutes during this scaling operation.
  • Multi-AZ deployments are unavailable for the time that it takes the instance to fail over, usually about 60 seconds. This is because Amazon RDS upgrades the standby database first and then fails over to the newly sized database. Then, your database restarts, and the engine performs recovery to make sure that your database remains in a consistent state.

The user requested a failover of the DB instance

This message indicates that a user used the Reboot or Reboot with failover option to initiate a manual reboot of the DB instance.

The primary host of the RDS Multi-AZ instance is unhealthy

This reason indicates that a transient underlying hardware issue led to the loss of communication to the primary instance. This issue might render the instance unhealthy because the RDS monitoring system couldn't communicate with the RDS instance to perform health checks.

The primary host of the RDS Multi-AZ instance is unreachable due to loss of network connectivity

This reason indicates that a transient network issue that affected the primary host of your Multi-AZ deployment caused the Multi-AZ failover and database instance restart. The internal monitoring system detected this issue and initiated a failover.

The RDS Multi-AZ primary instance is busy and unresponsive, the Multi-AZ instance activation started, or the Multi-AZ instance activation completed

The event log shows these messages under the following situations:

  • The primary DB instance is unresponsive.
  • A memory crunch after excessive memory consumption in the database prevented the Amazon RDS monitoring system from contacting the underlying host. As a proactive measure, the monitoring system restarts the database.
  • The DB instance experienced intermittent network issues with the underlying host.
  • The instance experienced a heavy database load. In this case, you might notice spikes in the CloudWatch metrics CPUUtilization and DatabaseConnections, and in the IOPS and throughput metrics. You might also notice depletion of FreeableMemory.

Database instance patched

This message indicates that the DB instance underwent a minor version upgrade during a maintenance window. This message occurs because the Auto minor version upgrade setting is turned on for the instance.
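
To triage a list of events quickly, you can map each message to the causes described above. The following Python code is an illustrative sketch only. The fragments come from the event messages in this article, but the case-insensitive substring matching, the CAUSES table, and the function name classify_event are assumptions. Actual event text can vary, so treat an unmatched message as a prompt to read it manually.

```python
# (message fragment, likely cause) pairs based on the event messages above.
CAUSES = [
    ("modified by customer", "instance modification initiated the failover"),
    ("applying modification to database instance class",
     "DB instance class change (vertical scaling)"),
    ("user requested a failover", "manual Reboot or Reboot with failover"),
    ("is unhealthy", "transient underlying hardware issue on the primary host"),
    ("loss of network connectivity", "transient network issue on the primary host"),
    ("busy and unresponsive",
     "primary unresponsive, possibly from memory pressure or heavy load"),
    ("database instance patched",
     "minor version upgrade during a maintenance window"),
]


def classify_event(message):
    """Return the likely cause for an RDS event message, or None if unknown."""
    text = message.lower()
    for fragment, cause in CAUSES:
        if fragment in text:
            return cause
    return None
```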

CloudWatch metrics

To check if a database load issue caused the outage, review the CloudWatch metrics for your Amazon RDS instance. Spikes in the following key metrics, or a sharp drop in FreeableMemory, might indicate an issue in the availability and health status of your RDS instance:

  • DatabaseConnections
  • CPUUtilization
  • FreeableMemory
  • WriteIOPS
  • ReadIOPS
  • ReadThroughput
  • WriteThroughput
  • DiskQueueDepth
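
To review these metrics around the time of the outage, you might pull the extreme value of each one with the GetMetricStatistics API operation. The following Python code is a minimal sketch. It assumes a boto3 CloudWatch client that you pass in, and the function name metric_extremes, the 3-hour window, and the 60-second period are hypothetical choices. For FreeableMemory it reports the lowest value, because low memory rather than high memory signals pressure.

```python
from datetime import timedelta

METRICS = [
    "DatabaseConnections", "CPUUtilization", "FreeableMemory", "WriteIOPS",
    "ReadIOPS", "ReadThroughput", "WriteThroughput", "DiskQueueDepth",
]


def metric_extremes(cloudwatch_client, db_instance_id, end_time, hours=3, period=60):
    """Return {metric name: (timestamp, value)} for the worst datapoint of each metric."""
    start_time = end_time - timedelta(hours=hours)
    extremes = {}
    for name in METRICS:
        response = cloudwatch_client.get_metric_statistics(
            Namespace="AWS/RDS",
            MetricName=name,
            Dimensions=[{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
            StartTime=start_time,
            EndTime=end_time,
            Period=period,
            Statistics=["Minimum", "Maximum"],
        )
        points = response.get("Datapoints", [])
        if not points:
            continue  # no data in the window for this metric
        # Datapoints are not guaranteed to be ordered, so scan all of them.
        if name == "FreeableMemory":
            worst = min(points, key=lambda p: p["Minimum"])
            extremes[name] = (worst["Timestamp"], worst["Minimum"])
        else:
            worst = max(points, key=lambda p: p["Maximum"])
            extremes[name] = (worst["Timestamp"], worst["Maximum"])
    return extremes
```

Compare the returned timestamps with the time of the restart event to see whether load built up before the outage.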

Enhanced Monitoring

Amazon RDS delivers metrics from Enhanced Monitoring into your Amazon CloudWatch Logs account. This feature provides metrics in real time for the operating system that your DB instance runs on. You can view all the system metrics and process information for your DB instances on the console.
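
Because Enhanced Monitoring data lands in CloudWatch Logs, you can also read it outside the console. The following Python code is a hedged sketch: it assumes that the metrics are in the RDSOSMetrics log group, that the log stream is named after the instance's resource ID, and that each log message is a JSON document with cpuUtilization.total and memory.free fields. Verify these names against your own log data. The function name os_metric_samples is hypothetical, and the logs client (for example, boto3.client("logs")) is passed in.

```python
import json


def os_metric_samples(logs_client, dbi_resource_id, start_ms, end_ms):
    """Return OS-level samples between two epoch-millisecond timestamps."""
    kwargs = {
        "logGroupName": "RDSOSMetrics",        # assumed Enhanced Monitoring log group
        "logStreamNames": [dbi_resource_id],   # assumed: stream named by resource ID
        "startTime": start_ms,
        "endTime": end_ms,
    }
    samples = []
    while True:
        page = logs_client.filter_log_events(**kwargs)
        for event in page.get("events", []):
            payload = json.loads(event["message"])
            samples.append({
                "timestamp_ms": event["timestamp"],
                # Field names below are assumptions about the JSON payload.
                "cpu_total": payload.get("cpuUtilization", {}).get("total"),
                "memory_free": payload.get("memory", {}).get("free"),
            })
        token = page.get("nextToken")
        if not token:
            break
        kwargs["nextToken"] = token  # request the next page
    return samples
```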

Database Insights

The CloudWatch Database Insights Dashboard contains information related to your database performance that you can use to analyze and troubleshoot performance issues. Use the Database Insights Dashboard to analyze statistics for a query. It's a best practice to use the information on this dashboard to tune the performance of the query and optimize your workload. For more information, see Viewing the Database Instance Dashboard for CloudWatch Database Insights.

Note: To reduce issues, it's a best practice to work with your database administrator to make these changes.

RDS database logs

To troubleshoot the cause of the outage for your DB instance, you can use the Amazon Aurora and RDS console or Amazon RDS API operations to view, download, or monitor database log files. You can also query the database log files that are loaded into database tables. For more information, see Monitoring Amazon RDS log files.
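
As an example of the API route, the following Python code lists the log files written since a given time and fetches the most recent lines of each. It is a minimal sketch under stated assumptions: the boto3 RDS client is passed in, the function name recent_log_tails and the 200-line default are hypothetical, and only the first page of the log file listing is read.

```python
def recent_log_tails(rds_client, db_instance_id, since_ms, lines=200):
    """Return {log file name: most recent lines} for files written since since_ms."""
    listing = rds_client.describe_db_log_files(
        DBInstanceIdentifier=db_instance_id,
        FileLastWritten=since_ms,  # epoch milliseconds
    )
    tails = {}
    for entry in listing.get("DescribeDBLogFiles", []):
        name = entry["LogFileName"]
        # With NumberOfLines set and no Marker, the API returns lines
        # from the end of the file.
        portion = rds_client.download_db_log_file_portion(
            DBInstanceIdentifier=db_instance_id,
            LogFileName=name,
            NumberOfLines=lines,
        )
        tails[name] = portion.get("LogFileData") or ""
    return tails
```

Search the returned text for errors that were logged shortly before the restart, such as out-of-memory or crash recovery messages from the database engine.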

Keep the following best practices in mind when dealing with RDS instance outages:

Related information

What factors affect my downtime or database performance in Amazon RDS?

Why did my Amazon RDS DB instance fail over?

How do I minimize downtime during required Amazon RDS maintenance?

How do I check running queries and diagnose resource consumption issues for my Amazon RDS for PostgreSQL or Aurora PostgreSQL DB instance?