How can I detect why a Fargate RunTask triggered by an EventBridge rule fails?


We use EventBridge to trigger jobs in Fargate. This has been working well for a long time. Lately, it seems like starting the task in Fargate sometimes fails silently.

We run thousands of these jobs, and the failures seem to be random and rare. I have done some digging in CloudTrail and I can see that RunTask is executed. There is no corresponding CreateLogStream after the RunTask call, and consequently there are no logs in CloudWatch for these tasks either.

Because this happens rarely and stopped tasks tend to be cleaned up quickly, I have not yet been able to look at a stopped task in Fargate, but I'm on the lookout.

I have seen this happen when we were well below our Fargate quota, so it should not be related to any service quotas.

  • I have now been able to inspect a failed job in the console and found the stopped reason "ResourceInitializationError: failed to configure ENI: failed to setup regular eni: netplugin failed with no error message". Knowing the reason doesn't solve the problem, since the task still fails silently.

asked 2 years ago, 368 views
1 Answer
Accepted Answer

Hello Knut,

The error message ResourceInitializationError: failed to configure ENI could be due to a transient issue within the Fargate workflow. If this Fargate task was part of an ECS service, then the ECS Service Scheduler would have attempted to re-launch the task automatically.

However, when EventBridge launches an ECS task, it performs the RunTask API operation to trigger the creation of a new task. Starting a task through the RunTask API involves an asynchronous workflow.

If the workflow started successfully, then a success code is returned. However, this doesn't mean that the task is in the RUNNING state. The RunTask caller is expected to verify that the task reaches the RUNNING state and, if it does not, to retry the operation.
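For a caller that starts tasks itself (for example a controller process), the verify-and-retry step described above could look roughly like the sketch below. It assumes `ecs` behaves like a boto3 ECS client (`run_task` and `describe_tasks`); the names `run_task_with_retry` and `TaskStartError` and all default values are made up for illustration. A task is treated as started once it is RUNNING or has a `startedAt` timestamp, and the `stoppedReason` of every failed attempt is collected so the failure is no longer silent.

```python
import time


class TaskStartError(Exception):
    """Raised when no attempt produced a task that reached RUNNING."""


def run_task_with_retry(ecs, run_task_kwargs, max_attempts=4, base_delay=2.0,
                        poll_interval=5.0, max_polls=60, sleep=time.sleep):
    """Call RunTask, confirm the task starts, and retry with backoff if not.

    `ecs` is assumed to behave like a boto3 ECS client.
    Returns the ARN of the task that started.
    """
    cluster = run_task_kwargs.get("cluster", "default")
    reasons = []
    for attempt in range(max_attempts):
        if attempt:
            # Exponential backoff between attempts: base, 2x base, 4x base, ...
            sleep(base_delay * 2 ** (attempt - 1))
        response = ecs.run_task(**run_task_kwargs)
        if not response.get("tasks"):
            # RunTask itself reported a failure instead of creating a task.
            reasons.append(str(response.get("failures")))
            continue
        task_arn = response["tasks"][0]["taskArn"]
        for _ in range(max_polls):
            described = ecs.describe_tasks(cluster=cluster, tasks=[task_arn])
            task = (described.get("tasks") or [{}])[0]
            if task.get("lastStatus") == "RUNNING" or task.get("startedAt"):
                return task_arn
            if task.get("lastStatus") == "STOPPED":
                # Stopped without ever starting: record why, then retry.
                reasons.append(task.get("stoppedReason", "unknown"))
                break
            sleep(poll_interval)
        else:
            # Still pending after all polls; do not relaunch, to avoid duplicates.
            raise TaskStartError(f"{task_arn} did not reach RUNNING in time")
    raise TaskStartError(f"task failed to start {max_attempts} times: {reasons}")
```

With boto3 this would be called as `run_task_with_retry(boto3.client("ecs"), {...})`, passing the same arguments you would otherwise give to `run_task`. Whether a retry is safe depends on the job being idempotent, which this sketch does not check.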

Retries with exponential backoff can be automated by using AWS Step Functions.

Here is a knowledge-center article that explains how to use Step Functions to implement the retry-backoff functionality to mitigate your problem.
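As a rough illustration of the Step Functions approach (not necessarily what the article does), the sketch below builds an Amazon States Language definition with a single state that runs the Fargate task through the `ecs:runTask.sync` integration and retries with exponential backoff. The function name `build_state_machine`, its parameters, and the retry values are assumptions for illustration; the ARNs and subnets are yours to supply.

```python
import json


def build_state_machine(cluster_arn, task_definition_arn, subnets,
                        max_attempts=5, interval_seconds=10, backoff_rate=2.0):
    """Return an Amazon States Language definition as a JSON string."""
    definition = {
        "Comment": "Run a Fargate task and retry if it fails",
        "StartAt": "RunFargateTask",
        "States": {
            "RunFargateTask": {
                "Type": "Task",
                # The .sync suffix makes Step Functions wait for the task to stop.
                "Resource": "arn:aws:states:::ecs:runTask.sync",
                "Parameters": {
                    "LaunchType": "FARGATE",
                    "Cluster": cluster_arn,
                    "TaskDefinition": task_definition_arn,
                    "NetworkConfiguration": {
                        "AwsvpcConfiguration": {
                            "Subnets": subnets,
                            "AssignPublicIp": "DISABLED",
                        }
                    },
                },
                # Wait interval_seconds, then multiply by backoff_rate each retry.
                "Retry": [{
                    "ErrorEquals": ["States.TaskFailed"],
                    "IntervalSeconds": interval_seconds,
                    "MaxAttempts": max_attempts,
                    "BackoffRate": backoff_rate,
                }],
                "End": True,
            }
        },
    }
    return json.dumps(definition, indent=2)
```

The EventBridge rule would then target the state machine instead of the ECS task. Note that `States.TaskFailed` is a broad error name, so this retry would most likely also re-run tasks that started and then failed for application reasons; narrow it if your jobs are not safe to repeat.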

I hope this is helpful to you. Please add a comment if you have any concerns with this approach.

Thank you!

AWS SUPPORT ENGINEER
answered 2 years ago
EXPERT
reviewed 6 months ago
EXPERT
reviewed a year ago
  • Thanks for the response, Venkat. It is helpful in that it tells me that no one should ever use EventBridge rules to trigger Fargate tasks. Since all our Fargate jobs are managed through a controller to avoid hitting service quotas, we can at least implement the retry there rather than in every place we would otherwise have needed it. I'm currently worried about which Step Functions quotas we would struggle with if we chose that solution.
