
CloudWatch anomaly detection going haywire for a bit


I'm looking at an alarm I have set up on a metric that is pretty stable (RDS EBSByteBalance%), where the alarm uses anomaly detection. From time to time the anomaly detection band goes completely haywire - here's a screenshot of the last 5 days:

[Screenshot: anomaly detection band, last 5 days]

Even though the metric itself is very stable, the anomaly detection band dips to less than half of the actual measurement for some hours. Thankfully the alarm is set to "if below", so it didn't wake me up in the middle of the night, but today it went high in an even weirder way:

[Screenshot: anomaly detection band going high]

And this one did trigger an alarm, which is completely wrong.

What is going on? This is an internal percentage metric that never goes above 100% - how did anomaly detection decide it might go up to 150%?

Is there a way to avoid this?

Update after a couple of hours

Now it is even worse: [Screenshot: band peaks for a while, then climbs steadily]

It is now going steadily up instead of just peaking and coming back down. I have to disable this alarm because it makes no sense.

Final update after a day

[Screenshot: band seems to be back to normal]

The anomaly detection model went haywire for the rest of the day and then realigned. I'm re-enabling the alarm, but this is pretty annoying and I expect it to happen again.

Update on 27-06-2025

It happened again: [Screenshot: another failure]

I think the problem is that the silly little algorithm has gotten it into its head that we're going to have a very low EBS metric every Thursday evening, followed by a high and fluctuating EBS metric the following Friday, and that this is somehow OK. Here's an 8-day view of the metric and the associated anomaly detection band: [Screenshot: 8-day history]

asked 10 months ago · 395 views
2 Answers

When CloudWatch anomaly detection bands suddenly change like you're experiencing, it's typically due to the model adapting to new patterns in your metric data. There are a couple of key things happening here:

For many AWS metrics, CloudWatch constrains the anomaly detection band to logical values - for example, keeping utilization metrics between 0% and 100%. In your case, however, the model does not appear to be constraining the upper bound of RDS EBSByteBalance% properly.

When the metric's pattern changes suddenly, it can take several hours for the anomaly detection band to adapt to the new level, and false alarms can occur during this adaptation period - which is likely what you're experiencing.

To address this issue, you have a few options:

  1. Modify the sensitivity of the anomaly detection to change the band's width. Lower sensitivity creates wider bands (potentially reducing false alarms), while higher sensitivity creates narrower bands. Programmatically, the width is the number of standard deviations passed as the second argument to the ANOMALY_DETECTION_BAND expression (see the sketch after this list).

  2. Delete and recreate the anomaly detector to manually retrain the model. This forces CloudWatch to build a new model based on the most recent data.

  3. If you know certain time periods have unusual patterns that shouldn't be used for training the model, you can exclude those specific time periods from being used to train the anomaly detection model.
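All three options can also be driven from the API rather than the console. Below is a minimal boto3 sketch; the alarm name, the DB instance identifier "my-db-instance", and the excluded dates are illustrative placeholders, not values from the question:

```python
import boto3
from datetime import datetime, timezone

cw = boto3.client("cloudwatch")

# Option 1: recreate the alarm with a wider band. The second argument to
# ANOMALY_DETECTION_BAND is the band width in standard deviations; the
# default is 2, so raising it to 4 makes the band considerably wider.
cw.put_metric_alarm(
    AlarmName="rds-ebs-byte-balance-anomaly",  # placeholder name
    ComparisonOperator="LessThanLowerThreshold",
    EvaluationPeriods=3,
    ThresholdMetricId="ad1",
    TreatMissingData="missing",
    Metrics=[
        {
            "Id": "m1",
            "ReturnData": True,
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/RDS",
                    "MetricName": "EBSByteBalance%",
                    "Dimensions": [
                        {"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}
                    ],
                },
                "Period": 300,
                "Stat": "Average",
            },
        },
        {
            "Id": "ad1",
            "Expression": "ANOMALY_DETECTION_BAND(m1, 4)",
            "Label": "EBSByteBalance% (expected)",
            "ReturnData": True,
        },
    ],
)

detector = {
    "Namespace": "AWS/RDS",
    "MetricName": "EBSByteBalance%",
    "Dimensions": [{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
    "Stat": "Average",
}

# Option 2: delete the model so CloudWatch trains a fresh one on recent data.
cw.delete_anomaly_detector(SingleMetricAnomalyDetector=detector)

# Option 3: recreate the detector, excluding a window you don't want used
# for training (the dates below are placeholders).
cw.put_anomaly_detector(
    SingleMetricAnomalyDetector=detector,
    Configuration={
        "ExcludedTimeRanges": [
            {
                "StartTime": datetime(2025, 6, 26, 18, 0, tzinfo=timezone.utc),
                "EndTime": datetime(2025, 6, 27, 12, 0, tzinfo=timezone.utc),
            }
        ],
        "MetricTimezone": "UTC",
    },
)
```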

These approaches should help stabilize your anomaly detection bands and reduce false alarms for your RDS EBSByteBalance% metric.
Sources
Manage and troubleshoot CloudWatch anomaly detection bands | AWS re:Post
Using CloudWatch anomaly detection - Amazon CloudWatch

answered 10 months ago
    • The metric data pattern has not changed and has been very stable for the last 6 days.
    1. How do I modify the sensitivity? I can't find an option for that in the alarm configuration wizard.
    2. I updated the alarm to use a static threshold, and then updated it again to use anomaly detection - and it looks the same, with the same artifacts.
    3. That's not something that happens here.
Accepted Answer

Since you mentioned the metric data pattern has been very stable, excluding time periods may not be the solution here. Given the stable nature of your RDS EBSByteBalance% metric, the erratic behavior of the anomaly detection bands is quite puzzling. A few additional thoughts:

  • Try contacting AWS Support - they may be able to further investigate what is causing the anomaly detection model to behave this way for your specific use case.
  • Consider using a static threshold alarm instead of anomaly detection, at least temporarily, to avoid the false alarms (see the sketch after this list).
  • Monitor the issue closely and document any pattern changes in the anomaly detection model over time. This may help identify the root cause.
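For the static-threshold fallback, here is a minimal boto3 sketch; the alarm name, DB instance identifier, and the 50% threshold are illustrative placeholders, not values from the question:

```python
import boto3

cw = boto3.client("cloudwatch")

# Static fallback: alarm if EBSByteBalance% stays below 50% for 15 minutes
# (3 x 5-minute periods). Threshold and names are placeholders.
cw.put_metric_alarm(
    AlarmName="rds-ebs-byte-balance-static-low",
    Namespace="AWS/RDS",
    MetricName="EBSByteBalance%",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "my-db-instance"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=50,
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="missing",
)
```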

The key here seems to be that the underlying metric data is stable, so the anomaly detection model should also be stable. The fact that it is behaving erratically suggests there may be an issue with the anomaly detection algorithm itself for your particular metric. Reaching out to AWS Support is likely your best path forward.

AWS EXPERT
answered 10 months ago
  • The main problem is the "Try contacting AWS Support" part. Without paying for additional support (at 10% of our spend it's a bit much for the occasional support call once a year), everything else AWS relegates to "have a complaint? Post on re:Post, and maybe - unlikely, but maybe - someone from AWS will someday look at it and decide whether it's a bug that requires engineering work". I'm pretty sure this is a bug in the CloudWatch alarm model, so this post is basically the OP saying "look AWS, there's a bug which I think you need to fix", because there is no other way to report bugs (other than, as noted, paying 10% for the privilege of telling AWS where their problems are).

    All the other things I had already done before posting the original "question": 2. There's a static alarm that is much less strict and will hopefully trigger only on a real problem, but early enough that it can be dealt with before it becomes crippling. 3. I'm monitoring the situation and the occasional flare-up, and documenting the problem here.
