How do I troubleshoot tasks that are stuck in the Queued state in my Amazon MWAA environment?
I'm running workflows in Amazon Managed Workflows for Apache Airflow (Amazon MWAA), but my tasks are stuck in the Queued state. The tasks don't progress to the Running state.
Short description
Tasks in Amazon MWAA can remain stuck in the Queued state for the following reasons:
- The environment reached the maximum number of concurrent tasks.
- The Airflow configuration options are set incorrectly in your Amazon MWAA environment.
- There's insufficient memory or CPU for the tasks on the worker.
A task gets stuck in the Queued state when there's a breakdown in the normal workflow of running a task. The Apache Airflow worker can become overwhelmed and fail to respond within the specified time. When this occurs, the task remains in the Amazon Simple Queue Service (Amazon SQS) queue until the default visibility timeout of 12 hours is reached. If you configured retries, then the Apache Airflow scheduler retries the task.
Resolution
Before you troubleshoot, determine whether your environment resources are reaching maximum load or experiencing worker-related issues. Use Amazon CloudWatch to check your environment's worker logs and the CPUUtilization and MemoryUtilization metrics.
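For example, a minimal boto3 sketch like the following pulls recent CPUUtilization and MemoryUtilization datapoints so that you can see how loaded the workers are. The AWS/MWAA namespace used here is an assumption; confirm the namespace and dimensions against what the CloudWatch console shows for your environment.

```python
import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Look back over the last six hours of 5-minute datapoints.
end = datetime.now(timezone.utc)
start = end - timedelta(hours=6)

for metric_name in ("CPUUtilization", "MemoryUtilization"):
    # The "AWS/MWAA" namespace is an assumption; confirm the namespace and
    # dimensions that the CloudWatch console shows for your environment.
    for metric in cloudwatch.list_metrics(
        Namespace="AWS/MWAA", MetricName=metric_name
    )["Metrics"]:
        stats = cloudwatch.get_metric_statistics(
            Namespace=metric["Namespace"],
            MetricName=metric["MetricName"],
            Dimensions=metric["Dimensions"],
            StartTime=start,
            EndTime=end,
            Period=300,
            Statistics=["Average", "Maximum"],
        )
        for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
            print(metric_name, metric["Dimensions"], point["Timestamp"],
                  f'avg={point["Average"]:.1f}', f'max={point["Maximum"]:.1f}')
```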
Check if the environment reached the maximum number of concurrent tasks
Your environment reaches the maximum number of concurrent tasks when the Amazon MWAA pool is full and the environment adds more tasks to the queue. To resolve this issue, increase the worker count for your environment or change the environment class size.
To determine whether you must increase worker count on your environment, complete the following steps:
- Open the CloudWatch console.
- In the navigation pane, choose Metrics, and then choose All metrics.
- Choose the Browse tab, select the AWS Region that your environment is in, and then search for the name of your environment.
- In the AWS Namespaces section, choose MWAA > Queue.
- Select QueuedTasks and RunningTasks.
- In the graph, find the time period with the most activity, and then add the total count of both metrics.
Note: The sum is the total number of tasks for this time period.
- Determine your environment's default level of concurrency.
Note: For example, the mw1.small environment class runs five concurrent tasks for each worker.
- Divide the total number of tasks by the default level of concurrency.
- Subtract the Maximum worker count that you set for your environment from that result.
Note: If the result is a positive number, then you must add that many workers to fulfill the current number of concurrent tasks. A minimal sketch of this arithmetic, with illustrative numbers, follows this list.
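```python
import math

# Illustrative numbers only; replace them with the values from your graph and environment.
peak_queued_tasks = 60      # QueuedTasks at the busiest point in the graph
peak_running_tasks = 40     # RunningTasks at the same point
tasks_per_worker = 5        # default level of concurrency (for example, mw1.small)
current_max_workers = 15    # Maximum worker count that you set for your environment

total_tasks = peak_queued_tasks + peak_running_tasks          # add both metrics
workers_needed = math.ceil(total_tasks / tasks_per_worker)    # divide by the concurrency
shortfall = workers_needed - current_max_workers              # subtract the max worker count

if shortfall > 0:
    print(f"Increase Maximum worker count by at least {shortfall} (to {workers_needed}).")
else:
    print("The current Maximum worker count can absorb the observed peak.")
```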
To increase the worker count for your environment or change the environment class size, complete the following steps:
- Open the Amazon MWAA console.
- Select your environment, choose Edit, and then choose Next.
- In the Environment Class section, take the following actions:
Increase the Maximum worker count to the value that you determined in the previous procedure.
Also set the Minimum worker count to a value that your workload requires for periods of least activity.
Note: You can add a maximum of 25 workers for your environment. If you require more than 25 workers, then under Environment class, choose a larger size.
- If you increase the environment class size, then also set the maximum and minimum worker counts that your workload requires. To make the same change programmatically, see the sketch after this list.
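The following boto3 sketch is one option for applying the same worker settings programmatically. The environment name and counts are placeholders, and the environment goes into an updating state for a while after the call.

```python
import boto3

mwaa = boto3.client("mwaa")

# Placeholder values; substitute the counts that you derived in the previous procedure.
mwaa.update_environment(
    Name="my-mwaa-environment",
    MinWorkers=2,                      # floor for periods of least activity
    MaxWorkers=20,                     # ceiling derived from the concurrency math above
    # EnvironmentClass="mw1.medium",   # uncomment to move to a larger class instead
)
```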
If you optimize the worker count and it still isn't sufficient for your workload, then take the following actions:
- Use deferrable operators in place of Apache Airflow sensors. For more information, see Deferrable operators & triggers on the Apache Airflow website.
- Stagger the execution start times, and keep small gaps of time between the schedule_interval values of your Directed Acyclic Graphs (DAGs). Schedule DAGs in blocks (see the sketch after this list).
- If you use custom code that invokes and monitors a specific external function, then split the task into two tasks. Create one task for the invocation and the other as a deferrable operator to monitor the function.
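As an illustration of staggered scheduling, the following sketch offsets two hypothetical DAGs by 15 minutes so that they don't queue all of their tasks at the same minute. The DAG IDs, cron schedules, and EmptyOperator placeholder (Apache Airflow 2.3 and later) are illustrative only.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Offset the cron schedules so that the two DAGs don't start at the same minute.
with DAG(
    dag_id="hourly_ingest",
    start_date=datetime(2024, 1, 1),
    schedule_interval="0 * * * *",   # on the hour
    catchup=False,
) as ingest_dag:
    EmptyOperator(task_id="start")

with DAG(
    dag_id="hourly_reporting",
    start_date=datetime(2024, 1, 1),
    schedule_interval="15 * * * *",  # 15 minutes past the hour
    catchup=False,
) as reporting_dag:
    EmptyOperator(task_id="start")
```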
Check if the Airflow configuration options are incorrectly set
To check your Airflow configuration options, complete the following steps:
- Open the Amazon MWAA console.
- Choose Environments, and then select your MWAA environment.
- In the Airflow Configuration options section, check core.parallelism and celery.worker_autoscale.
If core.parallelism is set, then remove the manually set core.parallelism option so that Amazon MWAA can set the configuration dynamically. Amazon MWAA calculates the dynamic default as (maxWorkers * maxCeleryWorkers) / schedulers * 1.5. If you use auto scaling and manually set the value, then the environment can be underutilized during maximum load.
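With illustrative numbers only, that calculation looks like the following; the values are placeholders, not recommendations.

```python
# Illustrative values only; substitute your environment's settings.
max_workers = 10          # Maximum worker count on the environment
max_celery_workers = 5    # concurrent tasks for each worker (for example, mw1.small)
schedulers = 2            # scheduler count

dynamic_default = (max_workers * max_celery_workers) / schedulers * 1.5
print(dynamic_default)    # 37.5 -> the dynamic value that Amazon MWAA derives for core.parallelism
```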
Compare the value of your celery.worker_autoscale configuration option with the default level of concurrency. If you didn't modify the celery.worker_autoscale configuration option, then multiply the default level of concurrency by the maximum worker count that you set for your environment.
If the celery.worker_autoscale value is unintentionally lower than the default value, then use CloudWatch metrics to monitor your workers' CPU and memory usage. If the resource values are 20–60% during maximum load, then increase the celery.worker_autoscale value. Use small increments so that you don't overuse the worker containers.
If you didn't set the celery.worker_autoscale value or you kept the default value, then monitor your workers' CPU and memory usage. If the metrics for your environment are too high, then lower the celery.worker_autoscale value. If the environment is at 20–60% during maximum load, then you can increase the maximum value.
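The following sketch shows the comparison with illustrative numbers only: it multiplies the default level of concurrency by the maximum worker count and compares the result with the capacity that a manually set celery.worker_autoscale value allows.

```python
# Illustrative values only; substitute your environment's settings.
max_workers = 10
default_tasks_per_worker = 5          # default level of concurrency for the environment class
configured_autoscale_max = 3          # first number in a manually set celery.worker_autoscale value

default_capacity = max_workers * default_tasks_per_worker     # 50 concurrent tasks
configured_capacity = max_workers * configured_autoscale_max  # 30 concurrent tasks

if configured_capacity < default_capacity:
    print("celery.worker_autoscale is below the default capacity; tasks queue sooner.")
```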
Check if the workers are failing because of overuse
Celery workers on an Amazon MWAA worker container poll for tasks when they aren't currently in use. Depending on the complexity of the running tasks and the code that defines them, workers can become overused and potentially crash. This occurs when every Celery worker on a worker container has a task and the container is under maximum load.
To determine whether the workers are overused and failing, complete the following steps:
- Open the CloudWatch console.
- In the navigation pane, choose Metrics, and then choose All metrics.
- Choose the Browse tab, select the AWS Region your environment is in, and then search for the name of your environment.
- In the AWS Namespaces section, choose MWAA > Queue, and then select ApproximateAgeOfOldestTask.
- Expand the time range to include a 4–6 week period.
Note: Peaks of 40,000 or more seconds show that tasks are stuck in the Amazon SQS queue and that the workers are failing from overuse. Also, the Celery worker can't write the failure to the event buffer because the system forcefully terminated it.
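This article checks the graph manually, but if you also want an automatic notification, one option that isn't part of this procedure is a CloudWatch metric alarm on ApproximateAgeOfOldestTask. In the following boto3 sketch, the namespace, dimension, and SNS topic ARN are assumptions and placeholders; copy the exact values that the CloudWatch console shows for this metric in your environment.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# The namespace and dimension below are assumptions; copy the exact values that
# the CloudWatch console shows for ApproximateAgeOfOldestTask in your environment.
# The SNS topic ARN is a placeholder.
cloudwatch.put_metric_alarm(
    AlarmName="mwaa-oldest-queued-task-age",
    Namespace="AWS/MWAA",
    MetricName="ApproximateAgeOfOldestTask",
    Dimensions=[{"Name": "Queue", "Value": "my-mwaa-environment"}],
    Statistic="Maximum",
    Period=300,
    EvaluationPeriods=3,
    Threshold=40000,                    # seconds, per the note above
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:mwaa-alerts"],
)
```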
You can also use CloudWatch Logs Insights to identify when tasks are stuck in the Amazon SQS queue.
To run the query, complete the following steps:
- Open the CloudWatch console.
- In the navigation pane, choose Logs, and then choose Logs Insights.
- Specify a time range of 4–6 weeks.
- In the Selection criteria menu, select the scheduler log group for your MWAA environment.
- Enter the following query into the query section:
fields @timestamp, @message, @logStream, @log | filter @message like /Was the task terminated externally?/ | sort @timestamp desc | limit 10000
The following is an example log that the scheduler sends when it receives a previously queued task:
[2024-01-17T11:30:18.936+0000] {scheduler_job_runner.py:771} ERROR - Executor reports task instance <TaskInstance: dag_name.task_name manual__202X-XX-XXTXX:XX:XX.758774+00:00 [queued]> finished (failed) although the task says it's queued. (Info: None) Was the task terminated externally?
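If you prefer to run the same query from a script instead of the console, the following boto3 sketch is one way to do it. The log group name is a placeholder; confirm the exact scheduler log group name for your environment in the CloudWatch console.

```python
import time
from datetime import datetime, timedelta, timezone

import boto3

logs = boto3.client("logs")

# Query the last 6 weeks of scheduler logs for the externally terminated task message.
end = datetime.now(timezone.utc)
start = end - timedelta(weeks=6)

query_id = logs.start_query(
    logGroupName="airflow-my-mwaa-environment-Scheduler",   # placeholder log group name
    startTime=int(start.timestamp()),
    endTime=int(end.timestamp()),
    queryString=(
        "fields @timestamp, @message, @logStream, @log "
        "| filter @message like /Was the task terminated externally?/ "
        "| sort @timestamp desc | limit 10000"
    ),
)["queryId"]

# Poll until the query finishes, and then print the matching scheduler messages.
while True:
    results = logs.get_query_results(queryId=query_id)
    if results["status"] in ("Complete", "Failed", "Cancelled", "Timeout"):
        break
    time.sleep(2)

for row in results.get("results", []):
    print({field["field"]: field["value"] for field in row})
```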
Reduce compute or memory-intensive workloads
Note: Carefully consider the following list. Not all factors are applicable to every use case. If you need more assistance, then contact AWS Support.
To reduce compute or memory-intensive workloads in your environment, take the following actions:
- Make sure that your DAG code doesn't contain extract, transform, and load (ETL) scripts, data movement instructions, AI or ML pipelines, or other compute or memory-intensive workloads.
- Follow Apache Airflow best practices when you write DAG code. Make sure that top-level code is minimal and imports only what's needed (a small sketch follows this list). For more information, see Best practices on the Apache Airflow website.
- Optimize the DAG code. Profile the memory footprint of any sensors, hooks, or custom, extended, or inherited operators to find potential problem areas.
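The following sketch shows the top-level code practice from the list above: keep heavy imports and calls inside the task callable so that they don't run every time the scheduler parses the DAG file. The DAG ID, task, and import are illustrative only.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def transform_orders():
    # Heavy imports and API calls belong here, inside the task,
    # not at module level where every DAG parse pays for them.
    import pandas as pd  # illustrative heavy import
    ...


with DAG(
    dag_id="orders_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="transform_orders", python_callable=transform_orders)
```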
If your resources are still overused, then take the following actions:
- Reduce celery.worker_autoscale from its default value. Decrease the celery.worker_autoscale value in small steps, and then monitor the environment for 24–48 hours. Continue to decrease the celery.worker_autoscale value until you reach an optimal level.
Note: When you decrease the celery.worker_autoscale value, the overall task pool shrinks, and more items remain in the Queued state for longer. To counteract this, you must also increase the minimum worker count. To apply the change programmatically, see the sketch after this list.
- Also, complete the steps in the Check if the environment reached the maximum number of concurrent tasks section again to reduce the concurrent tasks per worker.
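If you adjust celery.worker_autoscale programmatically instead of through the console, the following boto3 sketch is one option. The value format is "max,min" tasks for each worker, the numbers are placeholders to tune in small steps as described above, and to be safe, pass the full set of Airflow configuration options that you rely on in the same call.

```python
import boto3

mwaa = boto3.client("mwaa")

# Placeholder values; celery.worker_autoscale uses a "max,min" tasks-per-worker format.
# To be safe, include every Airflow configuration option that you rely on in this map.
mwaa.update_environment(
    Name="my-mwaa-environment",
    MinWorkers=3,    # raise the floor to offset the smaller per-worker task pool
    AirflowConfigurationOptions={
        "celery.worker_autoscale": "4,4",
    },
)
```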
Related Information
Performance tuning for Apache Airflow on Amazon MWAA
Configuration reference on the Apache Airflow website
