MWAA inconsistent task states


Hi,

We are using MWAA with Airflow 2.6.3, mw1.small, 2 schedulers, and 1-12 workers. Every now and then, maybe once a month, something strange happens in a DAG with multiple tasks: an upstream task is marked with success status, but the following downstream task fails with upstream_failed status. [Screenshot: strange task states]

There is nothing in the Airflow task logs indicating why it happened. Other users observe similar behaviour as well, for example here: https://github.com/apache/airflow/discussions/33528 . The Airflow developers suggest it is specific to MWAA. How can we investigate this issue? It is not possible to reproduce it intentionally, and there is no pattern to the behaviour.

1 Answer

When the Celery process on the worker is forked, every object in memory at that point is copied. Sessions and Connections in SQLAlchemy are not thread-safe, so even with NullPool, initializing a worker with celery.worker_autoscale = 20,20 (the default for a large worker), for example, creates 20 copies of the same Session object right away. If those objects open initial connections, the connections are copied too, and there is a race condition: one process releases the connection and Postgres discards it, while another worker process still holds a reference to it, leading to sqlalchemy.exc.OperationalError. I suspect you might be observing such errors in the logs; kindly verify this. By setting celery.worker_autoscale = 20,0, no processes are created initially, so open connections are not copied (or at least are less likely to be), because processes are spun up on demand rather than all at once (the 0 specifies the initial minimum pool and the 20 is the maximum).
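To illustrate the failure mode, here is a minimal, self-contained Python sketch (not MWAA or Airflow code): a connection opened before os.fork() is inherited by the child, so parent and child end up sharing the same Postgres socket, and one side closing it can break the other mid-query. The connection string is a placeholder.

```python
# Illustrative sketch only: shows how a forked child inherits the parent's
# open SQLAlchemy connection, so both processes share one DB socket.
import os
from sqlalchemy import create_engine, text
from sqlalchemy.pool import NullPool

# Placeholder DSN; point it at a real Postgres instance to actually run this.
engine = create_engine(
    "postgresql+psycopg2://user:pass@localhost/airflow",
    poolclass=NullPool,
)

conn = engine.connect()   # connection opened *before* forking
pid = os.fork()           # child gets a copy of `conn` and its socket

try:
    # Parent and child now issue queries over the same underlying socket;
    # if one side closes or discards it, the other can hit OperationalError.
    conn.execute(text("SELECT 1"))
finally:
    conn.close()

if pid:
    os.waitpid(pid, 0)
```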

So I think we can try setting celery.worker_autoscale to x,0 (based on the environment class), where 0 means no Session copies are created initially; this could help get tasks executed without running into database connection failures. I would recommend testing this out, at least on one environment where you are seeing frequent failures, and seeing how it goes. Please let me know your thoughts on this.
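One way to apply this override to an MWAA environment is through the UpdateEnvironment API, for example via boto3. This is only a sketch: "my-mwaa-env" is a placeholder name, and the maximum value (20 here, matching the example above) should be adjusted to your environment class.

```python
# Sketch: override celery.worker_autoscale on an existing MWAA environment.
import boto3

mwaa = boto3.client("mwaa")

mwaa.update_environment(
    Name="my-mwaa-env",  # placeholder environment name
    AirflowConfigurationOptions={
        # up to 20 worker processes, but start with 0 so none are pre-forked
        "celery.worker_autoscale": "20,0",
    },
)
```

The environment restarts its workers after the update, so it is worth applying this during a quiet window and then watching the worker logs for sqlalchemy.exc.OperationalError over the following weeks.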

AWS
answered 6 days ago
