When CloudWatch anomaly detection bands suddenly change like you're experiencing, it's typically due to the model adapting to new patterns in your metric data. There are a couple of key things happening here:
For most AWS metrics, CloudWatch keeps the anomaly detection bands within logical values, for example keeping utilization metrics between 0 and 100%. In your case, however, it appears that the model is not properly constraining the upper bound for RDS EBSByteBalance%.
When sudden metric pattern changes occur, it can take several hours for the anomaly detection bands to adapt to the new level. During this adaptation period, false alarms can occur, which is likely what you're experiencing.
To address this issue, you have a few options:
- Modify the sensitivity of the anomaly detection to change the band's width. Lower sensitivity will create wider bands (potentially reducing false alarms), while higher sensitivity creates narrower bands.
- Delete and recreate the anomaly detector to manually retrain the model. This forces CloudWatch to build a new model based on the most recent data.
- If you know certain time periods have unusual patterns that shouldn't be used for training the model, you can exclude those specific time periods from being used to train the anomaly detection model.
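For the sensitivity option, here is a minimal sketch of where the band width lives when the alarm is defined through the API instead of the console. In an anomaly detection alarm, the second argument of the `ANOMALY_DETECTION_BAND` metric math expression is the number of standard deviations used for the band, so a larger value gives a wider band. The helper name `build_anomaly_alarm`, the alarm name, the instance identifier, the period, the evaluation periods, and the choice of `LessThanLowerThreshold` are all assumptions for illustration, not values taken from your setup.

```python
def build_anomaly_alarm(alarm_name, db_instance_id, band_width=2.0):
    """Build keyword arguments for CloudWatch put_metric_alarm (hypothetical helper).

    band_width is the second argument of ANOMALY_DETECTION_BAND:
    a larger value means a wider band, i.e. a less sensitive alarm.
    """
    if band_width <= 0:
        raise ValueError("band_width must be positive")
    return {
        "AlarmName": alarm_name,
        # Assumption: only alarm when the balance drops below the band.
        "ComparisonOperator": "LessThanLowerThreshold",
        "EvaluationPeriods": 3,  # assumed value
        "ThresholdMetricId": "band",
        "Metrics": [
            {
                "Id": "m1",
                "ReturnData": True,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/RDS",
                        "MetricName": "EBSByteBalance%",
                        "Dimensions": [
                            {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
                        ],
                    },
                    "Period": 300,  # assumed 5-minute period
                    "Stat": "Average",
                },
            },
            {
                "Id": "band",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected range",
                "ReturnData": True,
            },
        ],
    }


# Usage (needs AWS credentials, so it is left commented out):
# import boto3
# boto3.client("cloudwatch").put_metric_alarm(
#     **build_anomaly_alarm("rds-ebs-byte-balance-anomaly", "my-db-instance", band_width=4)
# )
```

Ignoring the upper bound entirely, as this sketch does, may be a reasonable workaround for a balance metric where only drops matter, but whether that fits your case is a judgment call.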
These approaches should help stabilize your anomaly detection bands and reduce false alarms for your RDS EBSByteBalance% metric.
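The retraining and exclusion options can also be done through the API. The following is a rough sketch, assuming boto3 and a single-metric detector on the `Average` statistic; the helper name `build_detector_request`, the instance identifier, and the dates are hypothetical. It only builds the request, and the actual `delete_anomaly_detector` and `put_anomaly_detector` calls are shown as comments because they need credentials.

```python
from datetime import datetime, timezone


def build_detector_request(db_instance_id, excluded_ranges=()):
    """Build keyword arguments for put_anomaly_detector (hypothetical helper).

    excluded_ranges is an iterable of (start, end) datetime pairs that
    should not be used to train the model.
    """
    request = {
        "SingleMetricAnomalyDetector": {
            "Namespace": "AWS/RDS",
            "MetricName": "EBSByteBalance%",
            "Dimensions": [
                {"Name": "DBInstanceIdentifier", "Value": db_instance_id}
            ],
            "Stat": "Average",  # assumed statistic; must match the alarm
        }
    }
    ranges = []
    for start, end in excluded_ranges:
        if end <= start:
            raise ValueError("each excluded range must end after it starts")
        ranges.append({"StartTime": start, "EndTime": end})
    if ranges:
        request["Configuration"] = {"ExcludedTimeRanges": ranges}
    return request


# Example: exclude one (made-up) day from training.
request = build_detector_request(
    "my-db-instance",
    [(datetime(2024, 1, 1, tzinfo=timezone.utc),
      datetime(2024, 1, 2, tzinfo=timezone.utc))],
)

# To retrain from scratch, delete the detector and create it again:
# import boto3
# cw = boto3.client("cloudwatch")
# cw.delete_anomaly_detector(
#     SingleMetricAnomalyDetector=request["SingleMetricAnomalyDetector"]
# )
# cw.put_anomaly_detector(**request)
```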
Sources
Manage and troubleshoot CloudWatch anomaly detection bands | AWS re:Post
Using CloudWatch anomaly detection - Amazon CloudWatch
- The metric data pattern has not changed and has been very stable for the last 6 days.
- How do I modify the sensitivity? I can't find an option for that in the alarm configuration wizard.
- I updated the alarm to use a static threshold, and then updated it again to use anomaly detection, and the band looks the same, with the same artifacts.
- that's not something that happens.
Since you mentioned the metric data pattern has been very stable, excluding time periods may not be the solution here. Given the stable nature of your RDS EBSByteBalance% metric, the erratic behavior of the anomaly detection bands is quite puzzling. A few additional thoughts:
- Try contacting AWS Support - they may be able to further investigate what is causing the anomaly detection model to behave this way for your specific use case.
- Consider using a static threshold alarm instead of anomaly detection, at least temporarily, to avoid the false alarms.
- Monitor the issue closely and document any pattern changes in the anomaly detection model over time. This may help identify the root cause.
The key here seems to be that the underlying metric data is stable, so the anomaly detection model should also be stable. The fact that it is behaving erratically suggests there may be an issue with the anomaly detection algorithm itself for your particular metric. Reaching out to AWS Support is likely your best path forward.
The main problem is the "Try contacting AWS Support" part. Without paying for additional support (at 10% of our spend, it's a bit much for the occasional support call once a year), AWS relegates everything else to "have a complaint? Post in re:Post, and maybe (unlikely, but maybe) someone from AWS will someday look at it and decide whether it's a bug that requires engineering work". I'm pretty sure this is a bug in the CloudWatch alarm model, so that is basically what the original post is: "look AWS, there's a bug which I think you need to fix", because there is no other way to report bugs (other than, as noted, paying 10% for the privilege of telling AWS where their problems are).
I had already done all the other things before posting the original "question":
2. There's a static alarm that is much less strict and will hopefully trigger only if there's a real problem, but early enough that it can be dealt with before it becomes crippling.
3. I'm monitoring the situation and the occasional flare-up, and documenting the problem here.