In my experience, spikes in latency are a symptom of spikes in usage that exceed provisioned capacity/performance. I'd start by checking logs and performance metrics in the upstream systems/app servers.
I see you wrote: "One of my production DBs..." & "I do not have a support plan." I suggest you consider the cost of your time investigating an issue, as well as the cost to your reputation when you have a 'service impacting event.'
It's unusual but possible that there was an AWS 'event' that impacted your system. You could try opening a Billing Support ticket and see if they'll tell you if there was an AWS incident at that time.
The symptoms suggest you hit a resource limit, most likely either the IOPS or the throughput limit of your io1 volume.
Key investigation steps:
- Check DiskQueueDepth: If this spiked during the incident, your DB was waiting for I/O. Even if IOPS stayed under 1500, you might have hit the throughput (MiB/s) limit, which for a 1756 GiB io1 volume is capped (typically 16 × provisioned IOPS, up to a maximum that depends on the instance type).
- Performance Insights: Since you just enabled it, check the "Top SQL" and "Load by Wait" sections during the spike. Look for io/table_io or io/file_io wait events to identify the specific query causing the load.
- Enhanced Monitoring: Check the OS processes list in RDS Enhanced Monitoring (if enabled) to see if system tasks like vacuuming or internal maintenance coincided with your application load.
- Configuration Change: You noted a DB parameter group update at 00:59. Verify if any parameters related to memory (e.g., max_connections or buffer sizes) were changed, as insufficient memory often leads to increased disk swapping.
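To see how a volume can hit its throughput limit while staying under its IOPS limit, it helps to work out the average I/O size. CloudWatch reports ReadThroughput/WriteThroughput in bytes per second and ReadIOPS/WriteIOPS in operations per second, so dividing one by the other gives bytes per operation. The sketch below is only arithmetic; the function names and the sample figures are hypothetical and are not taken from the incident.

```python
KIB = 1024
MIB = 1024 * 1024


def avg_io_size_kib(throughput_bytes_per_s, iops):
    """Average size of one I/O in KiB, from CloudWatch-style readings."""
    if iops <= 0:
        return 0.0
    return throughput_bytes_per_s / iops / KIB


def throughput_mib_s(iops, io_size_kib):
    """Throughput in MiB/s generated by `iops` operations of `io_size_kib` each."""
    return iops * io_size_kib * KIB / MIB


if __name__ == "__main__":
    # Hypothetical readings: 1,200 IOPS is under a 1,500 IOPS limit,
    # but with large I/Os the byte rate is still substantial.
    iops = 1200
    io_size = 256  # KiB per operation (assumed for illustration)
    print(f"{iops} IOPS at {io_size} KiB = {throughput_mib_s(iops, io_size):.0f} MiB/s")
```

If the computed average I/O size during the spike is much larger than usual, a throughput ceiling (volume or instance) is a more plausible culprit than the IOPS limit.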
Recommendation: Consider migrating to gp3. Its baseline of 3,000 IOPS and 125 MiB/s is included in the storage price, which is higher than your current provisioned io1 setup, and it is usually more cost-effective.
Based on the symptoms you're describing - sudden unresponsiveness, automatic failover, and spikes in CloudWatch metrics - this does appear to be related to I/O throttling or storage performance issues. Here are key areas to investigate:
Metrics to Check in CloudWatch:
- ReadLatency and WriteLatency - Look for unusual spikes that would indicate storage performance degradation
- DiskQueueDepth - A consistently high value indicates a large number of I/O requests are queuing, which can cause the database to become unresponsive
- IOPS metrics (ReadIOPS/WriteIOPS) - Check if you reached your provisioned 1500 IOPS limit
- EBSIOBalance% and EBSByteBalance% - These track the burst bucket of the DB instance rather than the io1 volume itself (io1 has no burst credits); if they reached 0%, the instance exhausted its burst capacity
- Throughput metrics - Verify if you hit throughput limits, as io1 volumes have both IOPS and throughput constraints
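One way to pull these metrics for the incident window is the CloudWatch `get_metric_statistics` API. The following is a minimal sketch, assuming boto3 is installed and credentials are configured; the helper names (`build_query`, `fetch_datapoints`, `points_at_or_above`) and the instance identifier you pass in are illustrative, not part of any AWS SDK.

```python
from datetime import datetime, timedelta, timezone

# Metric names from the list above, as published in the AWS/RDS namespace.
RDS_METRICS = [
    "ReadLatency", "WriteLatency", "DiskQueueDepth",
    "ReadIOPS", "WriteIOPS", "ReadThroughput", "WriteThroughput",
]


def build_query(metric_name, db_instance_id, start, end, period_s=60):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    return {
        "Namespace": "AWS/RDS",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": db_instance_id}],
        "StartTime": start,
        "EndTime": end,
        "Period": period_s,
        "Statistics": ["Maximum"],
    }


def fetch_datapoints(metric_name, db_instance_id, start, end):
    """Fetch datapoints sorted by time (needs boto3 and AWS credentials)."""
    import boto3  # third-party; imported here so the helpers work without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **build_query(metric_name, db_instance_id, start, end)
    )
    return sorted(response["Datapoints"], key=lambda p: p["Timestamp"])


def points_at_or_above(datapoints, threshold, stat="Maximum"):
    """Return (timestamp, value) pairs where the statistic reached the threshold."""
    return [(p["Timestamp"], p[stat]) for p in datapoints if p[stat] >= threshold]
```

For example, calling `fetch_datapoints("WriteIOPS", ...)` over the hour around the failover and passing the result to `points_at_or_above(..., 1500)` would list the minutes in which write IOPS alone reached the provisioned figure. Reads and writes count against the same limit, so the two series should also be looked at together.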
What Could Have Happened:
Even without seeing unusual processes, your workload may have generated I/O patterns that exceeded your storage's capacity. With io1 at 1500 IOPS, if your database suddenly needed more IOPS or throughput than provisioned, the storage would throttle requests, causing the queue depth to spike and the database to become unresponsive enough to trigger failover.
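The throttling mechanism described above can be illustrated with a toy backlog model: each second, any demand beyond the provisioned rate is carried over to the next second. This is a deliberately simplified sketch with made-up demand figures; it is not how EBS actually schedules I/O, and `simulate_backlog` is a hypothetical helper.

```python
def simulate_backlog(demand_iops, provisioned_iops):
    """Return the queued I/O count at the end of each one-second step."""
    backlog = 0
    history = []
    for demand in demand_iops:
        pending = backlog + demand               # carried-over plus new requests
        served = min(pending, provisioned_iops)  # storage serves at most its limit
        backlog = pending - served               # the rest waits in the queue
        history.append(backlog)
    return history


if __name__ == "__main__":
    # Hypothetical load: a three-second burst above a 1,500 IOPS limit.
    demand = [1000, 2500, 2500, 2500, 1000, 1000]
    print(simulate_backlog(demand, provisioned_iops=1500))
```

In this model the backlog keeps growing for as long as demand exceeds the limit and drains only slowly afterwards, which matches the pattern of a DiskQueueDepth spike that outlasts the burst that caused it.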
Next Steps:
- Review the specific time period of the incident in CloudWatch, focusing on the metrics mentioned above
- Consider whether 1500 IOPS is sufficient for your workload's peak demands
- Evaluate if increasing provisioned IOPS or switching to gp3 storage (which offers better baseline performance) would prevent recurrence
- Look for any correlation between the incident time and application behavior changes
