Hello fellow developers,
I have encountered a problem, in which there is a significant delay between a Cloudwatch Alarm breach and the actual Alarm state change.
I can not figure out where the delay is coming from and after searching through the documentation and the developer forums without any success, I require your assistance :)
What I have done? (Steps to reproduce)
In concrete, I deployed a webserver using ECS Fargate behind an internet-facing application loadbalancer. The metric, on which the alarm should trigger, is the amount of 4xx Response Codes, which the application loadbalancer receives from the webserver. In case the sum of failed requests (yielding in a 4xx response code) over a timeperiod of 1 minute is bigger than 1, the alarm should switch from the OK in the ALARM state.
The whole setup can be replicated using the following cdk code:
const vpc = new ec2.Vpc(scope, "SampleVpc", {
subnetConfiguration: [{ // avoid autonomous creation of private subnets and costly NAT gateways
cidrMask: 18,
name: 'Public',
subnetType: ec2.SubnetType.PUBLIC,
}]
});
//nginx provides a sample page served over port 80
const initialContainerImage = ecs.ContainerImage.fromRegistry("nginx");
const loadBalancedFargateService = new ecsPatterns.ApplicationLoadBalancedFargateService(this, "FargateService", {
assignPublicIp: true,
taskImageOptions: {
image: initialContainerImage,
},
publicLoadBalancer: true,
desiredCount: 1,
vpc: vpc
});
//create an alarm, when loadbalancer register a HTTP 4xx response code from targets
const metric4xxresponseCode = loadBalancedFargateService.loadBalancer.metricHttpCodeTarget(elbv2.HttpCodeTarget.TARGET_4XX_COUNT)
const alarm = new cw.Alarm(this, "4xxAlarm", {
alarmName: "4xxAlarm",
metric: metric4xxresponseCode,
threshold: 1,
evaluationPeriods: 1,
statistic: "sum",
period: cdk.Duration.minutes(1),
treatMissingData: cw.TreatMissingData.NOT_BREACHING
});
What was the expected behaviour?
Each minute, the Application Loadbalancer publishes the 4xxResponseCode metric to Cloudwatch (also written in this documentation: https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-cloudwatch-metrics.html).
Because the period of the alarm is also set to one minute and only the datapoints of the last period are evaluated (defined through the evaluationPeriod = 1), the alarm should trigger within one minute, after the treshold is violated. Assuming the 4xx response code only occurs one time, the alarm should also switch back from the ALARM state to the OK state one minute after the ALARM state was reached, since the period is set to one minute, right?
What did actually happen?
After provoking some 404 error codes, the metric HTTPCode_Target_4XX_Count
shows at 10:10 a sum of 3, therefore violationg the Treshold HTTPCode_Target_4XX_Count >= 1 for 1 datapoints within 1 minute.
Anyway, the alarm state changes 4 minutes later at 10:14:
$ aws cloudwatch describe-alarm-history --alarm-name 4xxAlarm
{
"AlarmHistoryItems": [
{
"AlarmName": "4xxAlarm",
"Timestamp": "2020-03-13T10:20:37.155000+00:00",
"HistoryItemType": "StateUpdate",
"HistorySummary": "Alarm updated from ALARM to OK",
"HistoryData": "{\"version\":\"1.0\",\"oldState\":{\"stateValue\":\"ALARM\",\"stateReason\":\"Threshold Crossed: 1 datapoint [3.0 (13/03/20 10:10:00)] was greater than or equal to the threshold (1.0).\",\"stateReasonData\":{\"version\":\"1.0\",\"queryDate\":\"2020-03-13T10:14:37.152+0000\",\"startDate\":\"2020-03-13T10:10:00.000+0000\",\"statistic\":\"Sum\",\"period\":60,\"recentDatapoints\":[3.0],\"threshold\":1.0}},\"newState\":{\"stateValue\":\"OK\",\"stateReason\":\"Threshold Crossed: no datapoints were received for 1 period and 1 missing datapoint was treated as [NonBreaching].\",\"stateReasonData\":{\"version\":\"1.0\",\"queryDate\":\"2020-03-13T10:20:37.149+0000\",\"statistic\":\"Sum\",\"period\":60,\"recentDatapoints\":[],\"threshold\":1.0}}}"
},
{
"AlarmName": "4xxAlarm",
"Timestamp": "2020-03-13T10:14:37.157000+00:00",
"HistoryItemType": "StateUpdate",
"HistorySummary": "Alarm updated from OK to ALARM",
"HistoryData": "{\"version\":\"1.0\",\"oldState\":{\"stateValue\":\"OK\",\"stateReason\":\"Threshold Crossed: no datapoints were received for 1 period and 1 missing datapoint was treated as [NonBreaching].\",\"stateReasonData\":{\"version\":\"1.0\",\"queryDate\":\"2020-03-13T10:05:37.195+0000\",\"statistic\":\"Sum\",\"period\":60,\"recentDatapoints\":[],\"threshold\":1.0}},\"newState\":{\"stateValue\":\"ALARM\",\"stateReason\":\"Threshold Crossed: 1 datapoint [3.0 (13/03/20 10:10:00)] was greater than or equal to the threshold (1.0).\",\"stateReasonData\":{\"version\":\"1.0\",\"queryDate\":\"2020-03-13T10:14:37.152+0000\",\"startDate\":\"2020-03-13T10:10:00.000+0000\",\"statistic\":\"Sum\",\"period\":60,\"recentDatapoints\":[3.0],\"threshold\":1.0}}}"
},
[...]
}
Note, that the switch back to OK states takes 6 minutes. Why is there such a huge delay, even when the period can be set to 1 minute?
Thanks for your help!