CloudWatch: Why is there a delay between the threshold breach and the alarm being triggered?

Hello,

I have a CloudWatch alarm that triggers when the InvocationsPerInstance metric for our SageMaker endpoint breaches a threshold in a one-minute period. The general setup:

MetricName='InvocationsPerInstance',
Namespace='AWS/SageMaker',
Statistic='Sum',
Period=60,
EvaluationPeriods=1,
DatapointsToAlarm=1

From this, I would expect the alarm to be triggered as soon as we see a one-minute period above the threshold. However, there seems to be a 3-minute delay between when the data breaches the threshold and when the alarm is triggered. If possible, I want the alarm to trigger as soon as there is data that breaches the threshold.

Here is the newState of the alarm. You can see that the breaching datapoint is at 15:24:00, but the alarm does not get triggered until 15:27:26:

"newState": {
  "stateValue": "ALARM",
  "stateReason": "Threshold Crossed: 1 out of the last 1 datapoints [29752.0 (17/01/24 15:24:00)] was greater than the threshold (28800.0) (minimum 1 datapoint for OK -> ALARM transition).",
  "stateReasonData": {
    "version": "1.0",
    "queryDate": "2024-01-17T15:27:26.429+0000",
    "startDate": "2024-01-17T15:24:00.000+0000",
    "statistic": "Sum",
    "period": 60,
    "recentDatapoints": [ 29752 ],
    "threshold": 28800,
    "evaluatedDatapoints": [
      {
        "timestamp": "2024-01-17T15:24:00.000+0000",
        "sampleCount": 29752,
        "value": 29752
      }
    ]
  }
}
}
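To put a number on the delay, the two timestamps in the state data can be subtracted directly. The following is a minimal sketch, not part of the original setup. It assumes the datapoint timestamp marks the start of the 60-second period, so the period itself only closes at 15:25:00. The variable names are illustrative.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the alarm's stateReasonData.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"

# Values copied from the newState payload above.
start = datetime.strptime("2024-01-17T15:24:00.000+0000", TIMESTAMP_FORMAT)
query = datetime.strptime("2024-01-17T15:27:26.429+0000", TIMESTAMP_FORMAT)
period = timedelta(seconds=60)

# Assumption: the datapoint timestamp is the START of the period,
# so the full minute of data is only available after start + period.
period_end = start + period

delay_from_period_start = query - start
delay_from_period_end = query - period_end

print("Delay measured from the datapoint timestamp:", delay_from_period_start)
print("Delay measured from the end of the period:  ", delay_from_period_end)
```

Under that assumption, about 206 seconds pass between the datapoint timestamp and the alarm evaluation, of which the first 60 seconds are the period itself. The delay after the period closes is therefore about 146 seconds rather than the full 3 minutes.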

1 Answer
I'd recommend checking the "Missing data treatment" setting of the alarm.

As per the documentation: "CloudWatch enables you to specify how to treat missing data points when evaluating an alarm. This helps you to configure your alarm so that it goes to ALARM state only when appropriate for the type of data being monitored. You can avoid false positives when missing data doesn't indicate a problem."

For each alarm, you can configure CloudWatch to treat missing data points as any of the following:

notBreaching – Missing data points are treated as "good" and within the threshold

breaching – Missing data points are treated as "bad" and breaching the threshold

ignore – The current alarm state is maintained

missing – If all data points in the alarm evaluation range are missing, the alarm transitions to INSUFFICIENT_DATA.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html
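The four options above correspond to the TreatMissingData parameter of the boto3 put_metric_alarm call. Below is a minimal sketch that combines the settings from the question with that parameter. The alarm name, endpoint name, and variant name are hypothetical placeholders, and GreaterThanThreshold is inferred from the "was greater than the threshold" wording in the alarm's state reason. Which TreatMissingData value suits this endpoint is not something the sketch decides.

```python
# Values accepted by the TreatMissingData parameter of put_metric_alarm.
VALID_TREAT_MISSING_DATA = {"notBreaching", "breaching", "ignore", "missing"}


def build_alarm_params(endpoint_name, variant_name, threshold,
                       treat_missing_data="missing"):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    The alarm name and dimension values are illustrative assumptions;
    the metric settings mirror the ones listed in the question.
    """
    if treat_missing_data not in VALID_TREAT_MISSING_DATA:
        raise ValueError(f"Unsupported TreatMissingData: {treat_missing_data}")
    return {
        "AlarmName": f"{endpoint_name}-invocations-per-instance",  # hypothetical name
        "MetricName": "InvocationsPerInstance",
        "Namespace": "AWS/SageMaker",
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        "Statistic": "Sum",
        "Period": 60,
        "EvaluationPeriods": 1,
        "DatapointsToAlarm": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",  # inferred from the state reason
        "TreatMissingData": treat_missing_data,
    }


def put_alarm(cloudwatch_client, params):
    """Create or update the alarm using an existing boto3 CloudWatch client."""
    cloudwatch_client.put_metric_alarm(**params)
```

With boto3 installed and credentials configured, this could be used as put_alarm(boto3.client("cloudwatch"), build_alarm_params("my-endpoint", "AllTraffic", 28800.0)), where "my-endpoint" and "AllTraffic" stand in for the real endpoint and variant names.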

I hope that helps you configure the alarm so that it goes into the ALARM state after a one-minute period above the threshold.

Thanks

AWS
answered 10 months ago
