Why did my CloudWatch alarm initiate when the monitored metric doesn't have breaching datapoints?

My Amazon CloudWatch alarm changed to the ALARM state. When I check the metric that's monitored, I don't see breaching datapoints on the CloudWatch graph. However, the Alarm History contains an entry with a breaching datapoint. I want to know what initiated my CloudWatch alarm.

Short description

If your application publishes additional datapoints for the same period after the alarm evaluates, then the CloudWatch graph shows an updated aggregated value. The updated aggregated value includes the delayed datapoints. The graph might not show a breach even though the alarm history recorded one based on the datapoints available during evaluation.

Resolution

Find the breaching datapoints

The following example shows how delayed datapoints cause the CloudWatch graph to differ from the alarm history.

The alarm in this example is configured with the following parameters:

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 60 seconds (1 minute)
  • Evaluation Period: 1

When the alarm evaluates the period from 12:00:00 - 12:01:00 UTC, CloudWatch retrieves the following samples for the metric:

  • Sample 1: 12:00:00 UTC, numeric value: 0.00675
  • Sample 2: 12:00:00 UTC, numeric value: 0.00789
  • Sample 3: 12:00:00 UTC, numeric value: 0.00421

The average of these three samples is 0.006283333, which breaches the threshold of 0.005 seconds, so the alarm changes to the ALARM state. The alarm history shows the aggregated value that exceeded the threshold.

If the host temporarily experiences a performance issue, then the client application that publishes the metrics is also affected. As a result, the host might not post data points at equal intervals. In this case, your application publishes additional samples with the 12:00 timestamp after the alarm evaluation occurs.

The following example represents all samples for the 12:00 timestamp.

  • Sample 1: 12:00:00 UTC, numeric value: 0.00675
  • Sample 2: 12:00:00 UTC, numeric value: 0.00789
  • Sample 3: 12:00:00 UTC, numeric value: 0.00421
  • Sample 4: 12:00:00 UTC, numeric value: 0.00002
  • Sample 5: 12:00:00 UTC, numeric value: 0.00007

When you receive an alert from the alarm, generate a CloudWatch graph to review the metric behavior. CloudWatch retrieves the five samples from 12:00:00 - 12:01:00 UTC and aggregates them as an average of 0.003788. This value differs from the value that the alarm calculated and is below the threshold. If your application publishes additional samples after the alarm evaluation occurs, then the breaching datapoint is no longer visible on the graph for that time range.
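The arithmetic in this example can be reproduced with a short Python sketch. The variable names are illustrative only, and the sample values and threshold are the ones listed in this article.

```python
from statistics import mean

THRESHOLD = 0.005  # seconds, from the example alarm configuration

# Samples that were available when the alarm evaluated the 12:00 period
samples_at_evaluation = [0.00675, 0.00789, 0.00421]
# Samples published later with the same 12:00 timestamp
late_samples = [0.00002, 0.00007]

# Value that the alarm saw (recorded in the alarm history)
avg_at_evaluation = mean(samples_at_evaluation)
# Value that the graph shows after the late samples arrive
avg_on_graph = mean(samples_at_evaluation + late_samples)

print(f"{avg_at_evaluation:.9f}", avg_at_evaluation > THRESHOLD)  # 0.006283333 True
print(f"{avg_on_graph:.6f}", avg_on_graph > THRESHOLD)            # 0.003788 False
```

The same period produces a breaching average at evaluation time and a non-breaching average after the late samples are included.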

Increase the alarm evaluation interval

To account for data points that arrive late, increase the alarm's evaluation interval. A longer evaluation interval gives CloudWatch more time to receive late data points before the alarm changes state, which reduces the number of false alerts. However, a longer interval doesn't guarantee that all delayed data points arrive within the evaluation interval. The effectiveness depends on the amount of delay.

To increase the evaluation interval, use one of the following options.

Increase the period

In the following example, the period is increased to 5 minutes:

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 300 seconds (5 minutes)
  • Evaluation Period: 1

Configure M out of N Datapoints to Alarm

Configure the alarm to require multiple breaching datapoints before the alarm changes to the ALARM state.

In the following example, M out of N datapoints are configured to two out of three datapoints:

  • Namespace: Web_App
  • Metric: ResponseTime
  • Dimension: host,h_04254448d4e964956
  • Statistic: Average
  • Threshold: 0.005
  • ComparisonOperator: GreaterThanThreshold
  • Period: 60 seconds (1 minute)
  • Evaluation Period (N): 3
  • Datapoints To Alarm (M): 2
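If you manage alarms with the AWS SDK for Python (Boto3), the configuration above might look like the following sketch. The alarm name is hypothetical, and the dimension is assumed to have the name host and the value h_04254448d4e964956. The helper functions are illustrative and not part of Boto3; only put_metric_alarm and its parameters come from the CloudWatch API.

```python
def build_alarm_kwargs(period, evaluation_periods, datapoints_to_alarm):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": "web-app-response-time",  # hypothetical alarm name
        "Namespace": "Web_App",
        "MetricName": "ResponseTime",
        # Assumption: dimension name "host", value "h_04254448d4e964956"
        "Dimensions": [{"Name": "host", "Value": "h_04254448d4e964956"}],
        "Statistic": "Average",
        "Threshold": 0.005,
        "ComparisonOperator": "GreaterThanThreshold",
        "Period": period,                          # seconds
        "EvaluationPeriods": evaluation_periods,   # N
        "DatapointsToAlarm": datapoints_to_alarm,  # M
    }


def apply_alarm(kwargs):
    """Create or update the alarm. Requires boto3 and AWS credentials."""
    import boto3  # imported here so the sketch can be read without AWS access

    boto3.client("cloudwatch").put_metric_alarm(**kwargs)


# Two out of three datapoints with a 1-minute period
two_of_three = build_alarm_kwargs(
    period=60, evaluation_periods=3, datapoints_to_alarm=2
)
```

To create or update the alarm, call apply_alarm(two_of_three). For the longer-period option, the same helper can be called with period=300, evaluation_periods=1, and datapoints_to_alarm=1.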

When you configure Evaluation Periods and Datapoints to Alarm as different values, you create an M out of N alarm. Datapoints to Alarm is set to M and Evaluation Periods is set to N. The evaluation interval is the number of evaluation periods multiplied by the period. For example, if you configure four out of five datapoints with a period of 1 minute, then the evaluation interval is 5 minutes. If you configure three out of three datapoints with a period of 10 minutes, then the evaluation interval is 30 minutes.

An M out of N configuration increases the evaluation range of the alarm. Use the Datapoints to Alarm parameter to adjust the alarm so that it activates on a single datapoint or requires multiple datapoints to transition to the ALARM state.
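The following Python sketch illustrates the evaluation interval arithmetic and a simplified M out of N check. It is not how CloudWatch is implemented: it assumes a GreaterThanThreshold comparison, and it ignores missing-data handling and the wider evaluation range that CloudWatch can use. The function names are hypothetical.

```python
def evaluation_interval_seconds(period_seconds, evaluation_periods):
    """Evaluation interval = period multiplied by Evaluation Periods (N)."""
    return period_seconds * evaluation_periods


def should_alarm(datapoints, threshold, m, n):
    """Return True if at least m of the last n datapoints exceed the threshold.

    Simplified: assumes one aggregated value per period, oldest first.
    """
    recent = datapoints[-n:]
    breaching = sum(1 for value in recent if value > threshold)
    return breaching >= m


# Four out of five datapoints with a 1-minute period: 5-minute interval
print(evaluation_interval_seconds(60, 5))   # 300
# Three out of three datapoints with a 10-minute period: 30-minute interval
print(evaluation_interval_seconds(600, 3))  # 1800

# One breaching datapoint out of three doesn't satisfy a 2 out of 3 alarm
print(should_alarm([0.006283333, 0.003, 0.004], 0.005, m=2, n=3))  # False
# Two breaching datapoints out of three do
print(should_alarm([0.006283333, 0.003, 0.007], 0.005, m=2, n=3))  # True
```

With two out of three datapoints required, a single period that breaches only because of late samples no longer changes the alarm state on its own.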

For more information, see Create a CloudWatch alarm based on a static threshold and Configuring how CloudWatch alarms treat missing data.

Related information

Why didn't I receive an Amazon SNS notification for my CloudWatch alarm trigger?

How do I troubleshoot my CloudWatch alarm in the INSUFFICIENT_DATA state?

Why did my CloudWatch alarm send me a notification after a single breached data point?