How do I resolve common issues with my scheduler in Amazon MWAA?


I want to resolve common issues with my scheduler in Amazon Managed Workflows for Apache Airflow (Amazon MWAA).

Short description

The scheduler runs in an infinite loop, monitors all tasks and DAGs, and triggers task instances after their dependencies are complete. To check the health of the scheduler, Apache Airflow checks the scheduler health endpoint for a heartbeat. If there's no heartbeat within the scheduler_health_check_threshold period, then the scheduler is in an unhealthy state. The default value for scheduler_health_check_threshold is 30 seconds. You can modify this value when you configure your Amazon MWAA environment, as shown in the following example.
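The following is a minimal sketch of how you might override this value as an Apache Airflow configuration option with the AWS SDK for Python (Boto3). The environment name and the threshold value are placeholders.

import boto3

mwaa = boto3.client("mwaa")

# Override scheduler.scheduler_health_check_threshold on an existing environment.
# "my-mwaa-environment" and "60" are placeholder values.
mwaa.update_environment(
    Name="my-mwaa-environment",
    AirflowConfigurationOptions={
        "scheduler.scheduler_health_check_threshold": "60"
    }
)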

If you have an issue with the scheduler, then you might receive the following error message or similar:

"The scheduler does not appear to be running. Last heartbeat was received (example time period) ago. The DAGs list may not update and new tasks will not be scheduled."

Common reasons that this error occurs are:

  • Networking issues
  • Incompatible modules
  • Overwhelmed scheduler
  • Broken DAGs

Resolution

Networking issues

To resolve scheduler issues that might be caused by networking issues, check the following items:

  • Check the security groups. Confirm that one of your security groups has a self-referencing rule on port 443 and port 5432. This rule is required for the security groups to allow traffic on these ports. If you don't have the self-referencing rule on one of these ports, then create a new inbound rule for your Amazon MWAA security group, and then create a self-referencing rule for the required ports (see the example check after this list). Also, check that your network access control lists (network ACLs) don't block traffic to port 443 and port 5432.
  • Check the Amazon MWAA endpoints. When you create an Amazon MWAA environment, Amazon Virtual Private Cloud (Amazon VPC) interface endpoints are created for your Apache Airflow web server and your Amazon Aurora PostgreSQL-Compatible Edition metadata database. If one of these endpoints is deleted, then the Amazon MWAA environment breaks and your scheduler receives an error. Because these endpoints can't be recreated, you must create a new Amazon MWAA environment to resolve this issue. For more information, see Managing your own Amazon VPC endpoints on Amazon MWAA.
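The following is a minimal sketch, using the AWS SDK for Python (Boto3), of one way to confirm that a security group has self-referencing inbound rules on ports 443 and 5432. The security group ID is a placeholder, and a rule that allows all traffic also covers these ports.

import boto3

ec2 = boto3.client("ec2")
security_group_id = "sg-0123456789abcdef0"  # placeholder: your Amazon MWAA security group

group = ec2.describe_security_groups(GroupIds=[security_group_id])["SecurityGroups"][0]

for port in (443, 5432):
    has_self_reference = any(
        rule.get("FromPort") == port
        and rule.get("ToPort") == port
        and any(
            pair.get("GroupId") == security_group_id
            for pair in rule.get("UserIdGroupPairs", [])
        )
        for rule in group["IpPermissions"]
    )
    print(f"Port {port}: self-referencing rule {'found' if has_self_reference else 'missing'}")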

Incompatible modules

If you need additional Python modules that aren't already installed in your Amazon MWAA environment, then add the modules, along with a constraints file, to your requirements.txt file. The modules are then installed when the Amazon MWAA components are provisioned. If you install a version of a Python module that's incompatible with your Apache Airflow version, then the environment can break and cause a scheduler error.

To check for incompatible modules and versions in your Amazon MWAA environment, check the requirements_install logs for the workers and schedulers. Check whether the following or similar errors appear:

"ERROR: You need to upgrade the database. Please run "airflow db upgrade"."

"ERROR: Cannot install (example-package-name and example-version) because these package versions have conflicting dependencies.
The conflict is caused by:
The user requested (example-package-name and example-version).
The user requested (constraint-example-package-name and constraint-example-version)."

To resolve the preceding errors, complete the following actions:

  • If you receive an error message that states that you must upgrade the database, then your metadata database is corrupted. You must create a new environment with the correct module versions.
  • If you receive an error message that states that your package versions have conflicting dependencies, then update the environment with the correct module versions (see the example after this list).
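For example, after you upload a corrected requirements.txt file to your environment's Amazon S3 bucket, you can point the environment at the new object version. The following is a minimal sketch with the AWS SDK for Python (Boto3); the environment name and object version ID are placeholders.

import boto3

mwaa = boto3.client("mwaa")

# Point the environment at a corrected requirements.txt object version.
mwaa.update_environment(
    Name="my-mwaa-environment",                       # placeholder
    RequirementsS3Path="requirements.txt",            # key in the environment's S3 bucket
    RequirementsS3ObjectVersion="example-version-id"  # placeholder S3 object version ID
)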

Note: As of Apache Airflow version 2.7.2, you must include the constraints file in your requirements.txt file. It's a best practice to use aws-mwaa-local-runner with the correct Apache Airflow version to test your requirements. For more information, see aws-mwaa-local-runner on the GitHub website.
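The following is an example of what a requirements.txt file with a constraints file might look like. The Apache Airflow and Python versions in the constraints URL, and the package pin, are placeholders that you must match to your environment.

--constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.7.2/constraints-3.11.txt"
apache-airflow-providers-snowflake==5.1.0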

Overwhelmed scheduler

To make sure that your scheduler isn't overwhelmed, check that the scheduler CPU and memory utilization don't exceed 90%. It's a best practice to use an .airflowignore file in your DAGs folder to list the folders and files that don't create DAG objects, so that the scheduler doesn't parse them (see the example file that follows). For more information, see .airflowignore on the Apache Airflow website. Also, make sure that your scheduler configurations don't exceed the maximum values for your environment. For example, if you set scheduler.parsing_processes to a value that's higher than the number of available vCPU cores, then scheduler performance decreases or the scheduler breaks.
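The following is an example of what an .airflowignore file in your DAGs folder might contain. The folder and file names are placeholders, and by default Apache Airflow treats each line as a regular expression pattern.

common/helpers/
sql_scripts/
.*_test\.py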

Broken DAGs

If you have a broken DAG, then the scheduler breaks. To resolve this issue, delete all new or recently added DAGs, and then add them back one at a time until you identify the faulty DAG. To test your DAGs, use aws-mwaa-local-runner. For more information, see aws-mwaa-local-runner on the GitHub website. You can also parse your DAG files locally to surface import errors, as shown in the following example.
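The following is a minimal sketch that parses a local DAGs folder with the Apache Airflow DagBag class and prints any import errors. The folder path is a placeholder, and the sketch assumes that Apache Airflow is installed in your local Python environment.

from airflow.models import DagBag

# Parse the local DAGs folder without running any tasks.
dag_bag = DagBag(dag_folder="dags/", include_examples=False)

# Any file that fails to parse appears here with its traceback.
for file_path, error in dag_bag.import_errors.items():
    print(f"Broken DAG file: {file_path}\n{error}\n")

print(f"Parsed {len(dag_bag.dags)} DAG(s) successfully.")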

Related information

VPC security groups
