TimeoutError The action failed because a job worker exceeded its timelimit


Hello,
I created a pipeline that calls a step function that deploys to CloudFormation. The pipeline invokes a lambda function that starts the step function.

The step function runs fine by itself, but the pipeline times out at 20 minutes.

I added a sort of keep-alive using the put_job_success_result continuation token. A step in the step function calls it every 5 minutes, which in turn re-invokes the lambda function. The lambda function grabs the new job_id from that invocation and stores it in SSM, so the step function can send put_job_success_result/put_job_failure_result calls and tokens downstream. The lambda function does not re-invoke the step function when a continuation token is in the event.

But even with the continuation token I still see timeouts on long executions. I have everything logged so I can see the continuation token is being passed and the new job is started, but sometimes it just fails.

"TimeoutError The action failed because a job worker exceeded its timelimit. If this is a custom action, make sure that the job worker is configured correctly."

Is there a better way to handle this lambda invoke action timeout? The documentation says the timeout should be an hour for an invoke action, but I see 20 minutes, and sometimes much less.

Thanks,
LM

asked 5 years ago · 1132 views
1 Answer

Hello! Sorry for taking so long to respond.

I just want to make sure that we're on the same page about how to use continuation tokens.

When you use the lambda invoke action with continuations, you need to call PutJobSuccessResult for the job ID that your lambda was given within 15 minutes of receiving the job. Note that the JobID matters - you can't keep calling PutJobSuccessResult with the original job ID, because using continuations results in a new job (the first job is done - it was a success :) )
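To illustrate the point about job IDs, here is a minimal sketch of a handler that always acknowledges the job ID from the current event. It assumes `codepipeline` is a boto3 CodePipeline client (for example `boto3.client("codepipeline")`), and that the event carries the job under `event["CodePipeline.job"]`. The `is_done` callback and the `attempt` counter stored in the token are hypothetical, used only for illustration; your step function check would go there instead.

```python
import json


def handle_job(event, codepipeline, is_done):
    """Acknowledge the CodePipeline job delivered with this invocation.

    codepipeline: a boto3 CodePipeline client, passed in by the caller.
    is_done: hypothetical callable returning True once the long-running
             work (e.g. the step function execution) has finished.
    """
    job = event["CodePipeline.job"]
    # Always use the ID from *this* event. Each continuation creates a
    # new job, so an ID saved from an earlier invocation is stale.
    job_id = job["id"]
    token = job["data"].get("continuationToken")

    if token is None:
        # First invocation: no token yet, so start tracking state.
        state = {"attempt": 1}
    else:
        # Continuation: restore the state we sent with the previous job.
        state = json.loads(token)
        state["attempt"] += 1

    if is_done(state):
        # No token: this ends the action successfully.
        codepipeline.put_job_success_result(jobId=job_id)
        return "done"

    # With a token: the current job succeeds and a new job is created,
    # which arrives with a new job ID.
    codepipeline.put_job_success_result(
        jobId=job_id, continuationToken=json.dumps(state)
    )
    return "continuing"
```

If the job ID is stored elsewhere (SSM in your case), whatever reads it needs to pick up the ID written by the most recent invocation before calling PutJobSuccessResult.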

What I think is happening is that you are sending the first job ID with subsequent PutJobSuccessResult calls - could you check whether this is the case?

This hypothesis is supported by the time limit being 20 minutes. If you're calling PutJobSuccessResult for the first time five minutes after you get the first job, then the second job would time out 15 minutes after that, or 20 minutes into the action.
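The arithmetic behind that estimate, as a small sketch. The 15-minute window is the figure given in this answer, and the function name is made up for illustration.

```python
from datetime import timedelta

# Acknowledgement window per job, as stated in this answer.
ACK_WINDOW = timedelta(minutes=15)


def timeout_if_stale_id(first_keepalive, ack_window=ACK_WINDOW):
    """Time into the action at which the second job would time out if
    every later PutJobSuccessResult call reuses the first job ID."""
    # The second job is created when the first keep-alive succeeds, and
    # it is never acknowledged, so it expires one window later.
    return first_keepalive + ack_window


print(timeout_if_stale_id(timedelta(minutes=5)))
```

With a keep-alive every 5 minutes this gives 20 minutes, which matches what you are seeing.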

If you'd like, please PM me a job ID and the region the action is in, and I can take a look at what's going on.

Thanks!

Matthew

answered 5 years ago
