Unexpected increase in disk latency

0

One of my production DBs suddenly stopped responding, and AWS triggered an automatic failover. No log files were written, and in CloudWatch I can only see spikes in several metrics. I do not see any process other than the usual ones, let alone one making more calls than usual.

I do not have a support plan. Everything points to a disk issue: SSD io1, 1756 GiB, 1500 IOPS.

What could have happened, or what else could I check to investigate the failure?

Recent events (newest first):

  • May 13, 2026, 01:00 (UTC+02:00): Finished updating DB parameter group
  • May 13, 2026, 00:59 (UTC+02:00): Performance Insights has been enabled
  • May 13, 2026, 00:59 (UTC+02:00): Monitoring Interval changed to 60
  • May 13, 2026, 00:50 (UTC+02:00): Multi-AZ instance failover completed
  • May 13, 2026, 00:50 (UTC+02:00): The RDS Multi-AZ primary instance is busy and unresponsive.
  • May 13, 2026, 00:50 (UTC+02:00): DB instance restarted
  • May 13, 2026, 00:48 (UTC+02:00): Multi-AZ instance failover started.
  • May 12, 2026, 08:52 (UTC+02:00): Finished DB Instance backup

Thanks in advance.

asked a month ago · 43 views
3 Answers
2

In my experience, spikes in latency are a symptom of spikes in usage that exceed provisioned capacity/performance. I'd start by checking logs and performance metrics in the upstream systems/app servers.

I see you wrote "One of my production DBs..." and "I do not have a support plan." I suggest you weigh the cost of your time spent investigating an issue, as well as the cost to your reputation when you have a service-impacting event.

It's unusual but possible that there was an AWS 'event' that impacted your system. You could try opening a Billing Support ticket and see if they'll tell you if there was an AWS incident at that time.

AWS
answered a month ago
1

The symptoms suggest you hit a resource limit, most likely either the IOPS or the throughput limit of your io1 volume.

Key investigation steps:

  • Check DiskQueueDepth: If this spiked during the incident, your DB was waiting for I/O. Even if IOPS stayed under 1500, you might have hit the throughput (MiB/s) limit, which for a 1756 GiB io1 volume is capped (typically 16 × provisioned IOPS, up to a maximum that depends on the instance type).

  • Performance Insights: Since you just enabled it, check the "Top SQL" and "Load by Wait" sections during the spike. Look for io/table_io or io/file_io wait events to identify the specific query causing the load.

  • Enhanced Monitoring: Check the OS processes list in RDS Enhanced Monitoring (if enabled) to see if system tasks like vacuuming or internal maintenance coincided with your application load.

  • Configuration Change: You noted a DB parameter group update at around 01:00. Verify whether any parameters that affect memory use (e.g., max_connections or buffer sizes) were changed, as insufficient memory often leads to increased disk swapping.
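To make the metric checks above concrete, here is a minimal sketch that builds the request parameters for CloudWatch's `get_metric_statistics` call around the incident window. The instance identifier `my-prod-db` and the 30/15-minute window are assumptions for illustration; the incident time is taken from the "busy and unresponsive" event at 00:50 (UTC+02:00).

```python
from datetime import datetime, timedelta, timezone

# Hypothetical identifier; replace with the real DB instance name.
DB_INSTANCE_ID = "my-prod-db"

# The event log reports the primary as unresponsive at 00:50 local time (UTC+02:00).
LOCAL_TZ = timezone(timedelta(hours=2))
INCIDENT = datetime(2026, 5, 13, 0, 50, tzinfo=LOCAL_TZ)


def build_metric_request(metric_name, minutes_before=30, minutes_after=15):
    """Return keyword arguments for CloudWatch get_metric_statistics."""
    start = (INCIDENT - timedelta(minutes=minutes_before)).astimezone(timezone.utc)
    end = (INCIDENT + timedelta(minutes=minutes_after)).astimezone(timezone.utc)
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": DB_INSTANCE_ID}],
        "StartTime": start,
        "EndTime": end,
        "Period": 60,  # one-minute resolution
        "Statistics": ["Average", "Maximum"],
    }


metric_requests = {
    name: build_metric_request(name)
    for name in ("DiskQueueDepth", "ReadThroughput", "WriteThroughput")
}

# With boto3 installed and credentials configured, a request could be sent as:
#   import boto3
#   cw = boto3.client("cloudwatch")
#   points = cw.get_metric_statistics(**metric_requests["DiskQueueDepth"])["Datapoints"]
```

Keeping the request construction separate from the API call makes it easy to check the UTC conversion before running anything against the account.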

Recommendation: Consider migrating to gp3. It includes a baseline of 3,000 IOPS and 125 MiB/s at no additional charge, which is higher than your current provisioned io1 setup, and it is usually more cost-effective.

EXPERT
answered a month ago
0

Based on the symptoms you're describing - sudden unresponsiveness, automatic failover, and spikes in CloudWatch metrics - this does appear to be related to I/O throttling or storage performance issues. Here are key areas to investigate:

Metrics to Check in CloudWatch:

  1. ReadLatency and WriteLatency - Look for unusual spikes that would indicate storage performance degradation
  2. DiskQueueDepth - A consistently high value indicates a large number of I/O requests are queuing, which can cause the database to become unresponsive
  3. IOPS metrics (ReadIOPS/WriteIOPS) - Check if you reached your provisioned 1500 IOPS limit
  4. EBSIOBalance% and EBSByteBalance% - These track the burst bucket of the instance's EBS connection rather than the io1 volume itself (io1 does not burst); if they reached 0%, the instance exhausted its burst capacity
  5. Throughput metrics - Verify if you hit throughput limits, as io1 volumes have both IOPS and throughput constraints
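As a rough illustration of item 3, the sketch below flags the samples where combined read and write IOPS come close to the provisioned 1500. The sample values are made up for the example and are not data from the incident; the 95% threshold is also an arbitrary assumption.

```python
PROVISIONED_IOPS = 1500  # from the question


def saturated_samples(read_iops, write_iops, limit=PROVISIONED_IOPS, threshold=0.95):
    """Return indexes of samples whose combined IOPS reach `threshold` of the limit."""
    hits = []
    for i, (r, w) in enumerate(zip(read_iops, write_iops)):
        if r + w >= threshold * limit:
            hits.append(i)
    return hits


# Made-up per-minute averages, for illustration only.
read_iops = [300, 350, 900, 1100, 1000]
write_iops = [200, 250, 400, 420, 480]

print(saturated_samples(read_iops, write_iops))  # [3, 4]
```

Summing ReadIOPS and WriteIOPS matters here because the provisioned figure applies to the total, so neither metric on its own may look alarming.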

What Could Have Happened:

Even without seeing unusual processes, your workload may have generated I/O patterns that exceeded your storage's capacity. With io1 at 1500 IOPS, if your database suddenly needed more IOPS or throughput than provisioned, the storage would throttle requests, causing the queue depth to spike and the database to become unresponsive enough to trigger failover.
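The queueing effect described above can be sketched with a deliberately simplified model: each second the volume completes at most the provisioned number of requests, and anything beyond that waits. The demand figures are invented for illustration, and real EBS behaviour is more nuanced than this.

```python
def backlog_over_time(demand_iops, provisioned_iops=1500):
    """Return the number of pending I/O requests after each one-second step."""
    backlog = 0
    history = []
    for demand in demand_iops:
        # At most `provisioned_iops` requests complete per second; the rest queue up.
        backlog = max(0, backlog + demand - provisioned_iops)
        history.append(backlog)
    return history


# Hypothetical demand: normal load, a three-second burst, then a drop.
demand = [1200, 1400, 2500, 2500, 2500, 1000]

print(backlog_over_time(demand))  # [0, 0, 1000, 2000, 3000, 2500]
```

Note that the backlog keeps the database slow even after demand falls back below the limit, which is consistent with a short burst producing a longer period of unresponsiveness.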

Next Steps:

answered a month ago
