CloudWatch Alarm for Glue Job Failures

Hi all, is there a direct way to set up CloudWatch alarms that alert you when a Glue job fails, without using a Lambda function? For example, by using a direct metric such as glue.driver.aggregate.numFailedTasks.

Thanks!

2 Answers

Accepted Answer

Do you have any idea if the glue.error.ALL observability group metric would work on CloudWatch alarms instead of the glue.driver.aggregate.numFailedTasks?

==>> Yes. I replicated this on my end with the CloudWatch alarm configuration below, and it works as expected:

Statistic: Sum

Period: 60 (seconds)

MetricName: glue.error.ALL

TreatMissingData: notBreaching

ComparisonOperator: GreaterThanOrEqualToThreshold

Threshold: 1

Note: With a 60-second period, the alarm evaluates the metric every minute. If a data point in that minute breaches the threshold, you will get an alert.
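For anyone who prefers to script this instead of clicking through the console, here is a minimal boto3 sketch of the same alarm settings. The alarm name, the EvaluationPeriods value of 1, and the dimension set are my assumptions and are not part of the configuration above. Check the dimensions against what the CloudWatch console shows for your glue.error.ALL metric before relying on it.

```python
def build_glue_error_alarm(job_name, sns_topic_arn):
    """Return keyword arguments for CloudWatch put_metric_alarm."""
    return {
        # Hypothetical naming scheme, pick whatever suits your account.
        "AlarmName": f"glue-job-failure-{job_name}",
        "Namespace": "Glue",
        "MetricName": "glue.error.ALL",
        # ASSUMPTION: these dimensions must match the metric exactly as it
        # appears in your account; verify them in the CloudWatch console.
        "Dimensions": [
            {"Name": "JobName", "Value": job_name},
            {"Name": "JobRunId", "Value": "ALL"},
            {"Name": "Type", "Value": "count"},
            {"Name": "ObservabilityGroup", "Value": "error"},
        ],
        "Statistic": "Sum",
        "Period": 60,  # seconds
        "EvaluationPeriods": 1,  # ASSUMPTION: not stated in the answer
        "Threshold": 1.0,
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # SNS topic that sends the email
    }


def create_alarm(job_name, sns_topic_arn):
    """Create the alarm; needs boto3 and AWS credentials."""
    import boto3  # imported here so the builder above works without it

    cloudwatch = boto3.client("cloudwatch")
    cloudwatch.put_metric_alarm(**build_glue_error_alarm(job_name, sns_topic_arn))
```

Calling create_alarm("my-job", "<your SNS topic ARN>") would create the alarm; both arguments are placeholders.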

References:

[1] Using Amazon CloudWatch alarms - https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/AlarmThatSendsEmail.html

[2] Create a CPU usage alarm - Setting up a CPU usage alarm using the AWS Management Console - https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/US_AlarmAtThresholdEC2.html#cpu-usage-alarm-console

AWS Support Engineer, answered 5 months ago (Expert reviewed 2 months ago)
  • Thanks! I tested this as well, and it works as expected: I get an email when a Glue job fails.

  • If you want a notification when a Glue job fails, you can use Amazon EventBridge; no Lambda function is required.

  • Create an EventBridge rule with an event pattern like the one below for a job failure, and set the target to an SNS topic.

  • Subscribe to the SNS topic to receive the notification.

  • Sample Event Rule pattern to monitor a Glue job failure: { "detail-type": ["Glue Job State Change"], "source": ["aws.glue"], "detail": { "jobName": ["<Glue job Name>"], "state": ["FAILED"] } }

  • See this article for details: https://repost.aws/knowledge-center/glue-sns-notification-state

  • Also, the glue.driver.aggregate.numFailedTasks metric does not reliably reflect the Glue job status. It counts failed Spark tasks, and a task can fail and then succeed when Spark retries it, so a Glue job can succeed even with failed Spark tasks.
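The EventBridge steps above can also be scripted. This is a rough boto3 sketch, not a definitive setup: the rule name and target Id are hypothetical, and the job name and SNS topic ARN are placeholders you supply. The event pattern itself is the one from the sample above.

```python
import json


def build_failure_pattern(job_name):
    """Event pattern matching a FAILED state change for one Glue job."""
    return {
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"jobName": [job_name], "state": ["FAILED"]},
    }


def create_failure_rule(job_name, sns_topic_arn, rule_name="glue-job-failed"):
    """Create the rule and point it at an SNS topic.

    Needs boto3 and AWS credentials. rule_name and the target Id are
    hypothetical names chosen for this example.
    """
    import boto3  # imported here so the pattern builder works without it

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(build_failure_pattern(job_name)),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "sns-notify", "Arn": sns_topic_arn}],
    )
```

One thing to watch: the SNS topic's access policy has to allow events.amazonaws.com to publish to it, otherwise the rule matches but no notification arrives.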

AWS, answered 5 months ago (Expert reviewed 5 months ago)
  • Got it, thanks so much. I will have to test this on Monday, as I don't have the permissions required to create a rule in EventBridge. Do you have any idea if the glue.error.ALL observability group metric would work with CloudWatch alarms instead of glue.driver.aggregate.numFailedTasks?
