RollbackSegmentHistoryListLength

Question

Once in a while I am getting this alert. 
`
```
Threshold Crossed: 10 out of the last 10 datapoints were greater than the threshold (50000.0). The most recent datapoints which crossed the threshold: [58377.8 (20/05/23 19:13:00), 57030.933333333334 (20/05/23 19:08:00), 56191.933333333334 (20/05/23 19:03:00), 55343.26666666667 (20/05/23 18:58:00), 54466.2 (20/05/23 18:53:00)] (minimum 10 datapoints for OK -> ALARM transition).
```
`

`Threshold:
- The alarm is in the ALARM state when the metric is GreaterThanThreshold 50000.0 for at least 10 of the last 10 period(s) of 300 seconds.

Monitored Metric:
- MetricNamespace:                     AWS/RDS
- MetricName:                          RollbackSegmentHistoryListLength
- Dimensions:                          [DBClusterIdentifier = prod-aurora-cluster]
- Period:                              300 seconds
- Statistic:                           Average
- Unit:                                not specified
- TreatMissingData:                    missing

`

What shall I do to get rid of it for good? 
Thanks

Answer

The RollbackSegmentHistoryListLength metric in Amazon RDS for Aurora refers to the length of the undo log or rollback segment history list. 
This list contains the before images of the database records and is used during the process of a transaction rollback or to provide a consistent read view for long-running transactions.
A high value for RollbackSegmentHistoryListLength may indicate:

* Long-Running Transactions: If you have long-running transactions in your database, they could be holding onto undo logs for a long time, causing the length of the history list to increase.
* High Rate of DML Operations: A high rate of Data Manipulation Language (DML) operations (INSERT, UPDATE, DELETE) could generate a large number of undo logs, increasing the length of the history list.
* Resource Intensive Queries: Queries that are resource-intensive or not properly optimized could also be causing this issue.

Here's what you can do to resolve this issue:

* **Identify Long-Running Transactions:** Use performance insights, slow query logs, or the SHOW FULL PROCESSLIST command to identify any long-running transactions and either terminate them or optimize them to run faster.
* **Optimize DML Operations:** If your application performs a lot of DML operations, consider optimizing them. For example, batch your INSERTs, UPDATEs, and DELETEs, where possible.
* **Tune Database Parameters:** There are some InnoDB parameters like innodb_max_undo_log_size, innodb_purge_threads, and innodb_purge_batch_size which could help control the size of the undo log. However, before making any changes, make sure to thoroughly test in a non-production environment as these could have other impacts on your database performance.
* **Monitor and Alert:** Finally, set up CloudWatch Alarms to alert you when the RollbackSegmentHistoryListLength metric crosses a threshold that you consider critical for your application. This way, you can intervene promptly if the issue arises.

RollbackSegmentHistoryListLength

Relevant content