Hello,
Are you using XComs in Airflow? If you are, I would imagine this shouldn't happen.
The behavior you describe is related to the SQS visibility timeout, which is 12 hours.
As you probably know, ApproximateAgeOfOldestMessage is the age of the oldest Task on the SQS Queue used by your environment. A non-zero value means there are Tasks on the SQS Queue. When a Task is scheduled, the Executor sends it to the SQS Queue. When a Worker picks up the Task, the message becomes invisible to other Workers until the Worker finishes executing the Task and removes it from the SQS Queue. If the Task fails on the Worker, or the Worker itself shuts down before the Task finishes, the Task is not removed from the SQS Queue and stays hidden there until the visibility timeout expires. In that case you might see ‘WorkerLostError’ or other Worker-related errors in the logs.
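To make the mechanism above concrete, here is a minimal in-memory sketch of a queue with a visibility timeout. It is a toy model, not the SQS API: `ToyQueue` and all of its methods are hypothetical names made up for illustration, and the sketch assumes that the age metric also counts messages that are currently invisible.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class _Message:
    body: str
    sent_at: float
    invisible_until: float = 0.0  # visible whenever now >= this value


class ToyQueue:
    """In-memory stand-in for a queue with a visibility timeout (not the SQS API)."""

    def __init__(self, visibility_timeout: float) -> None:
        self.visibility_timeout = visibility_timeout
        self._messages: Dict[int, _Message] = {}
        self._next_id = 0

    def send(self, body: str, now: float) -> int:
        self._next_id += 1
        self._messages[self._next_id] = _Message(body, sent_at=now)
        return self._next_id

    def receive(self, now: float) -> Optional[int]:
        """Hand out the oldest visible message and hide it for the timeout."""
        for msg_id, msg in self._messages.items():  # dicts keep insertion order
            if now >= msg.invisible_until:
                msg.invisible_until = now + self.visibility_timeout
                return msg_id
        return None

    def delete(self, msg_id: int) -> None:
        """What a worker does after it finishes the task."""
        self._messages.pop(msg_id, None)

    def age_of_oldest(self, now: float) -> float:
        """Age in seconds of the oldest message still queued (assumed to include hidden ones)."""
        if not self._messages:
            return 0.0
        return now - min(m.sent_at for m in self._messages.values())


HOUR = 3600.0
queue = ToyQueue(visibility_timeout=12 * HOUR)

task_id = queue.send("task-a", now=0.0)
picked = queue.receive(now=10.0)                # worker A picks up the task
# Worker A dies here and never calls delete().

hidden = queue.receive(now=1 * HOUR)            # None: the message is still invisible
age_at_6h = queue.age_of_oldest(now=6 * HOUR)   # keeps growing while the task is stuck
retry = queue.receive(now=10.0 + 12 * HOUR)     # visible again once the timeout expires
queue.delete(retry)                             # worker B finishes and removes it
age_after = queue.age_of_oldest(now=13 * HOUR)  # back to zero
```

In this model the task lost by worker A cannot be picked up by anyone else for 12 hours, which is why the age of the oldest message keeps climbing during that window.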
Next steps:
- Check the scheduler and worker logs to find out why the task was not picked up and was left stuck or waiting.
- Check the CloudWatch metric ApproximateAgeOfOldestMessage for the time when the issue happens.
- It might also be an edge case where the worker was marked for removal by autoscaling (a downscale of workers in the environment) and the previous task ran on that node.
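The metric check in the steps above could be scripted roughly as follows. This is a sketch under assumptions: it uses boto3's CloudWatch `get_metric_statistics` call, the namespace and dimensions under which your environment publishes the metric are left as caller-supplied parameters because they are not stated here, and `fetch_oldest_message_age` and `stuck_periods` are hypothetical helper names.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict, List, Tuple


def fetch_oldest_message_age(namespace: str, dimensions: List[Dict[str, str]],
                             hours: int = 24, period: int = 300) -> List[Dict[str, Any]]:
    """Fetch Maximum datapoints for ApproximateAgeOfOldestMessage from CloudWatch."""
    import boto3  # imported lazily so stuck_periods() works without AWS access

    end = datetime.now(timezone.utc)
    response = boto3.client("cloudwatch").get_metric_statistics(
        Namespace=namespace,      # assumption: supplied by the caller
        MetricName="ApproximateAgeOfOldestMessage",
        Dimensions=dimensions,    # assumption: [{"Name": ..., "Value": ...}, ...]
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=period,
        Statistics=["Maximum"],
    )
    return response["Datapoints"]


def stuck_periods(datapoints: List[Dict[str, Any]],
                  threshold_seconds: float) -> List[Tuple[datetime, float]]:
    """Return (timestamp, age) pairs whose age exceeds the threshold, oldest first."""
    flagged = [(p["Timestamp"], p["Maximum"])
               for p in datapoints
               if p["Maximum"] > threshold_seconds]
    return sorted(flagged, key=lambda pair: pair[0])
```

Passing the result of `fetch_oldest_message_age(...)` to `stuck_periods(..., threshold_seconds=3600)` would list the periods in which a task had been sitting on the queue for more than an hour, and those timestamps can then be lined up with the scheduler and worker logs.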