Why did my CloudWatch alarm initiate when the monitored metric doesn't have any breaching datapoints?

My Amazon CloudWatch alarm changed to the ALARM state. When I check the metric that's monitored, the CloudWatch graph doesn't show any breaching datapoints. However, the Alarm History contains an entry with a breaching datapoint. I want to know what initiated my CloudWatch alarm.

Short description

CloudWatch alarms evaluate metrics based on the datapoints that are immediately available. The alarm's history shows a record of the datapoints that the alarm evaluated at that timestamp. However, after the alarm evaluation, CloudWatch can publish new samples. The new samples might affect the value that's calculated when CloudWatch aggregates the metric data.

Resolution

Find the breaching datapoints

If your CloudWatch graph doesn't show any breaching datapoints, then samples were likely published after the alarm evaluation, and those samples changed the aggregated value for the evaluated period.

For example, X samples are available when an alarm evaluation occurs. The X samples result in an aggregated value of A. Then, new samples are posted. So, Y samples are retrieved for the same timestamp. The Y samples result in an aggregated value of B.

In the following example, an alarm is configured with the following parameters:

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 60 seconds (1 minute)
  • Evaluation Period: 1

When the alarm evaluates the period from 12:00:00 - 12:01:00 UTC, the alarm retrieves the following samples:

Sample-1: 12:00:00 UTC, numeric value: 0.00675  
Sample-2: 12:00:00 UTC, numeric value: 0.00789  
Sample-3: 12:00:00 UTC, numeric value: 0.00421

Because the average of these values is 0.006283333, the average breaches the threshold of 0.005 seconds, and the alarm changes to the ALARM state. The alarm's history shows the aggregated values that exceed the threshold.

Suppose that the host temporarily experiences a performance issue that affects the client application that's responsible for publishing metrics. As a result, the host doesn't post samples at regular intervals, and additional samples for the 12:00 timestamp are published after the alarm evaluation occurs.

The following example represents all the samples for the 12:00 timestamp:

Sample-1: 12:00:00 UTC, numeric value: 0.00675  
Sample-2: 12:00:00 UTC, numeric value: 0.00789  
Sample-3: 12:00:00 UTC, numeric value: 0.00421  
Sample-4: 12:00:00 UTC, numeric value: 0.00002  
Sample-5: 12:00:00 UTC, numeric value: 0.00007

When you receive an alert from the alarm, generate a CloudWatch graph to review the metric behavior. CloudWatch retrieves the five samples from 12:00:00 - 12:01:00 UTC and aggregates them as an average of 0.003788. So, the value changed from the previously calculated value and is below the threshold. If additional samples are posted after the alarm evaluation occurs, then the breaching datapoints aren't visible in the time range.
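The arithmetic in this example can be reproduced with a short Python sketch. The sample values and the threshold come from the example above. The variable names and the breaches helper are illustrative only and aren't part of any CloudWatch API.

```python
from statistics import mean

THRESHOLD = 0.005  # seconds, from the example alarm

# Samples that were available when the alarm evaluated the 12:00 period.
samples_at_evaluation = [0.00675, 0.00789, 0.00421]
# Samples that were published for the same period after the evaluation.
late_samples = [0.00002, 0.00007]


def breaches(samples, threshold=THRESHOLD):
    """Return the average and whether it is greater than the threshold."""
    average = mean(samples)
    return average, average > threshold


# Value that the alarm saw, and value that a later graph shows.
avg_alarm, alarm_breach = breaches(samples_at_evaluation)
avg_graph, graph_breach = breaches(samples_at_evaluation + late_samples)

print(f"At evaluation: average={avg_alarm:.9f}, breaching={alarm_breach}")
print(f"On the graph:  average={avg_graph:.6f}, breaching={graph_breach}")
```

The first average is greater than the threshold and the second isn't, which matches the difference between the Alarm History entry and the graph.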

Increase the alarm evaluation interval

If an alarm generates false alerts because of delayed metrics, then increase the alarm's evaluation interval. With a longer evaluation interval, the delayed datapoints are more likely to be included in the alarm evaluation. The inclusion of delayed datapoints reduces the number of false alerts.

To increase the evaluation interval, use one of the following options.

Increase the period. In the following example, the period is increased to 5 minutes:

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 300 seconds (5 minutes)
  • Evaluation Period: 1

Or, configure M out of N Datapoints to Alarm. In the following example, the alarm requires two out of three datapoints to breach the threshold:

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 60 seconds (1 minute)
  • Evaluation Period (N): 3
  • Datapoints To Alarm (M): 2
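As a rough sketch, both configurations above could be expressed as parameters for the CloudWatch PutMetricAlarm operation, for example through the boto3 put_metric_alarm call. The alarm name is hypothetical, and the sketch assumes that the dimension name is host and the dimension value is h_04254448d4e964956. The build_alarm_params and put_alarm helpers are illustrative, and put_alarm needs boto3 and AWS credentials to run.

```python
def build_alarm_params(period_seconds, evaluation_periods, datapoints_to_alarm):
    """Build keyword arguments for the CloudWatch PutMetricAlarm operation."""
    if not 1 <= datapoints_to_alarm <= evaluation_periods:
        raise ValueError("datapoints_to_alarm must be between 1 and evaluation_periods")
    return {
        "AlarmName": "web-app-response-time",  # hypothetical alarm name
        "Namespace": "Web_App",
        "MetricName": "ResponseTime",
        # Assumption: "host" is the dimension name, the ID is its value.
        "Dimensions": [{"Name": "host", "Value": "h_04254448d4e964956"}],
        "Statistic": "Average",
        "Threshold": 0.005,
        "ComparisonOperator": "GreaterThanThreshold",
        "Period": period_seconds,
        "EvaluationPeriods": evaluation_periods,
        "DatapointsToAlarm": datapoints_to_alarm,
    }


def put_alarm(params):
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the rest of the sketch runs without it

    boto3.client("cloudwatch").put_metric_alarm(**params)


# Option 1: one 5-minute period.
longer_period = build_alarm_params(300, 1, 1)
# Option 2: two out of three 1-minute datapoints.
two_out_of_three = build_alarm_params(60, 3, 2)
```

To apply one of the configurations, you would call put_alarm(two_out_of_three) from an environment that has permission to update the alarm.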

Datapoints to Alarm is M and Evaluation Periods is N. When you set the two to different values, you create an M out of N alarm. The evaluation interval is the number of evaluation periods multiplied by the period. For example, if you configure four out of five datapoints with a period of 1 minute, then the evaluation interval is 5 minutes. If you configure three out of three datapoints with a period of 10 minutes, then the evaluation interval is 30 minutes.

With an M out of N configuration, the alarm evaluates the N most recent datapoints and changes to the ALARM state only when at least M of those datapoints breach the threshold. You can use Datapoints to Alarm to make the alarm activate on a single breaching datapoint, or to require multiple breaching datapoints before the alarm transitions to the ALARM state.
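The following Python sketch illustrates the evaluation interval arithmetic and a simplified M out of N check. It isn't CloudWatch's actual evaluation logic: it ignores missing data handling and assumes the GreaterThanThreshold operator. The function names and the datapoint values are made up for illustration.

```python
def evaluation_interval_seconds(period_seconds, evaluation_periods):
    """Evaluation interval = period multiplied by evaluation periods (N)."""
    return period_seconds * evaluation_periods


def is_alarm(datapoints, threshold, datapoints_to_alarm, evaluation_periods):
    """Simplified M out of N check on the N most recent datapoints."""
    recent = datapoints[-evaluation_periods:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= datapoints_to_alarm


THRESHOLD = 0.005

# Hypothetical per-minute averages: one isolated spike vs. a sustained breach.
single_spike = [0.0041, 0.0063, 0.0038]
sustained = [0.0063, 0.0038, 0.0071]

print(evaluation_interval_seconds(60, 5))    # four out of five, 1-minute period
print(evaluation_interval_seconds(600, 3))   # three out of three, 10-minute period
print(is_alarm(single_spike, THRESHOLD, 1, 3))  # one breaching datapoint is enough
print(is_alarm(single_spike, THRESHOLD, 2, 3))  # two breaching datapoints required
print(is_alarm(sustained, THRESHOLD, 2, 3))
```

In this sketch, the isolated spike changes the alarm state only when a single breaching datapoint is enough. A two out of three setting requires the breach to repeat within the evaluation interval.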

For more information, see Create a CloudWatch alarm based on a static threshold and Configuring how CloudWatch alarms treat missing data.

Related information

Why didn't I receive an SNS notification for my CloudWatch alarm trigger?

How do I troubleshoot my CloudWatch alarm in the INSUFFICIENT_DATA state?

Why did my CloudWatch alarm send me a notification after a single breached data point?

AWS OFFICIAL
Updated 3 days ago