Why is my AWS Batch job stuck in RUNNABLE status?

My AWS Batch job is stuck in RUNNABLE status.

Short description

AWS Batch moves a job to RUNNABLE status when the job has no outstanding dependencies and can be scheduled to a host. RUNNABLE jobs start as soon as sufficient resources are available in one of the compute environments that map to the job's queue.

If the required resources to run a job aren't available, then the job might remain in RUNNABLE status indefinitely. For more information, see Jobs stuck in a RUNNABLE status.

To troubleshoot the AWS Batch job that's stuck in RUNNABLE status, use the AWSSupport-TroubleshootAWSBatchJob runbook. Then, refer to the Outputs section to determine the possible cause of the issue and steps to fix it.

Note: This article covers troubleshooting for Amazon Elastic Container Service (Amazon ECS) on Amazon Elastic Compute Cloud (Amazon EC2) and Amazon ECS on AWS Fargate. To troubleshoot AWS Batch on Amazon Elastic Kubernetes Service (Amazon EKS), see AWS Batch on Amazon EKS.

Resolution

Use the AWSSupport-TroubleshootAWSBatchJob SAW runbook

Use AWS Support Automation Workflows (AWS SAW) to automate this troubleshooting process. To use the AWSSupport-TroubleshootAWSBatchJob runbook, see How can I use a SAW runbook to troubleshoot my AWS Batch job stuck in the RUNNABLE status?

If this runbook doesn't help you identify the issue, then refer to the following sections to manually troubleshoot your stuck job.

Verify that your compute environment has enough resources to run your job

1.    Open the AWS Batch console.

2.    Choose Dashboard.

3.    In the Job queue overview pane, in the RUNNABLE column, choose the job that's stuck in RUNNABLE status. The Job details page appears.

4.    On the Job details page, in the Container section, review the values for vCPUs, Memory, and GPUs. You need these values to complete steps 9-10.

5.    On the Job queues page, select the job's queue and review its associated compute environments, because any of them might run your job. Then, repeat steps 6-10 for each compute environment.

6.    On the Compute Environments page, select a compute environment to review its details.

7.    Verify that the compute environment's Status column is set to VALID. Also make sure that the service role that's associated with the environment has all the necessary permissions.

Note: When there are intermittent or transient errors, it might take a few minutes for the compute environment Status to change from VALID to INVALID.

8.    Verify that the State column is set to ENABLED.

9.    Verify that the Max vCPUs value is high enough to allow AWS Batch to increase the number of Desired vCPUs to run jobs.

Note: If you're using an AWS Fargate compute environment, then see the Verify the network and security settings of the compute environment section.

10.    Verify that the Desired vCPUs value is the same as or higher than the number of vCPUs that the job requires.

If Desired vCPUs is 0, then check the amount of memory and CPU resources that are available for your Amazon EC2 instance type.

-or-

If Desired vCPUs is higher than 0, or your job is still in RUNNABLE status, then complete the steps in the next section.

Important: At least one of the instance types for your compute environment must have more memory than what your job specifies. Also, the instance type must have CPU resources that are equal to or more than what your job specifies. If none of the instance types has enough memory or CPU resources to run your job, then cancel the job. Then, either submit a new job that requires less CPU or memory, or create a new compute environment with enough resources and submit the job to a job queue that's mapped to that compute environment.
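The checks in this section can also be expressed as code. The following Python sketch is illustrative only: it assumes that you already fetched the job with the AWS Batch DescribeJobs API, the compute environment with DescribeComputeEnvironments, and the instance type details with the Amazon EC2 DescribeInstanceTypes API. The function names (job_requirements, environment_blockers, instance_type_fits) are hypothetical helpers, not part of any AWS SDK.

```python
def job_requirements(job):
    """Return (vcpus, memory_mib) for one entry of a DescribeJobs "jobs" list."""
    container = job.get("container", {})
    vcpus = float(container.get("vcpus") or 0)
    memory_mib = int(container.get("memory") or 0)
    # resourceRequirements, when present, carries the values as strings.
    for req in container.get("resourceRequirements", []):
        if req["type"] == "VCPU":
            vcpus = float(req["value"])
        elif req["type"] == "MEMORY":
            memory_mib = int(req["value"])
    return vcpus, memory_mib


def environment_blockers(env, job_vcpus):
    """List reasons why a compute environment can't scale up for the job."""
    problems = []
    if env.get("status") != "VALID":
        problems.append("status is %s" % env.get("status"))
    if env.get("state") != "ENABLED":
        problems.append("state is %s" % env.get("state"))
    max_vcpus = env.get("computeResources", {}).get("maxvCpus", 0)
    if max_vcpus < job_vcpus:
        problems.append(
            "maxvCpus %s is below the %s vCPUs the job needs" % (max_vcpus, job_vcpus)
        )
    return problems


def instance_type_fits(instance_type, job_vcpus, job_memory_mib):
    """Apply the rule above: more memory than the job, and at least as many vCPUs."""
    return (
        instance_type["MemoryInfo"]["SizeInMiB"] > job_memory_mib
        and instance_type["VCpuInfo"]["DefaultVCpus"] >= job_vcpus
    )
```

An empty list from environment_blockers means that the status, state, and Max vCPUs checks passed for that compute environment. Note that a compute environment can list instance families (or optimal) instead of specific instance types, so you might need to expand those into concrete types before you call instance_type_fits.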

Verify that your compute environment has instances and the instances are available to run your job

For the compute environment you identified as the one that must run your job, complete the following steps:

1.    Open the Amazon ECS console.

2.    In the navigation pane, choose Clusters. Then, choose the cluster that contains your job.

For general ECS troubleshooting instructions, see Amazon ECS troubleshooting.

Note: The name of the cluster starts with the name of the compute environment. This is followed by _Batch_ and a random hash of numbers and letters.

3.    Choose the ECS Instances view. Then, verify that container instances are available to run your job.

4.    If the cluster has a container instance available to run your job, then check the status of the Docker daemon. Then, check the status of the Amazon ECS container agent.

Note: For more information, see How do I troubleshoot a disconnected Amazon ECS agent?
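To check container instance availability outside the console, you can apply the same test to the output of the Amazon ECS DescribeContainerInstances API. The following Python sketch is an illustration under that assumption. The helper names are hypothetical, and the input is the containerInstances list from the API response.

```python
CPU_UNITS_PER_VCPU = 1024  # Amazon ECS expresses CPU in units of 1/1024 vCPU


def remaining(instance, name):
    """Return the remaining integer amount of a resource such as CPU or MEMORY."""
    for resource in instance.get("remainingResources", []):
        if resource["name"] == name:
            return resource.get("integerValue", 0)
    return 0


def usable_instances(container_instances, job_vcpus, job_memory_mib):
    """Return EC2 instance IDs that are active, connected, and have room for the job."""
    usable = []
    for instance in container_instances:
        # An instance with a disconnected agent can't accept new tasks.
        if instance.get("status") != "ACTIVE" or not instance.get("agentConnected"):
            continue
        has_cpu = remaining(instance, "CPU") >= job_vcpus * CPU_UNITS_PER_VCPU
        has_memory = remaining(instance, "MEMORY") >= job_memory_mib
        if has_cpu and has_memory:
            usable.append(instance.get("ec2InstanceId"))
    return usable
```

If the cluster has instances but this list is empty, then look at the agentConnected and status values first. A value of False for agentConnected points to the Docker daemon or the Amazon ECS container agent on that instance.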

If there are no instances in the Amazon ECS cluster, then verify that instances can be created in your compute environment. To verify that your instances can be created, complete one of the following procedures based on your compute environment.

To verify that your instances can be created in an On-Demand compute environment:

1.    Open the Amazon EC2 console.

2.    In the left navigation pane, choose Auto Scaling Groups.

3.    For Filter, enter the name of your compute environment.

Note: Amazon EC2 can create more than one Auto Scaling group for the same compute environment.

4.    For each Auto Scaling group, choose the Activity History view. Then, look for any blocking issues.

The Status column shows Unsuccessful if there are any issues blocking the instances from launching.

For example, if your account reaches the maximum number of instances, then Amazon EC2 might return a message similar to the following example:

Launching a new EC2 instance. Status Reason: Your quota allows for 0 more running instance(s). You requested at least 1. Launching EC2 instance failed.

The event includes a timestamp in UTC from when you submitted the job:

At 2018-09-03T05:54:30Z a user request update of AutoScalingGroup constraints to min: 0, max: 1, desired: 1 changing the desired capacity from 0 to 1.
At 2018-09-03T05:54:52Z an instance was started in response to a difference between desired and actual capacity, increasing the capacity from 0 to 1.

Note: AWS Batch requests instances on your behalf. If you modify the Auto Scaling groups manually, then your compute environment might be invalidated. For more information about instance limits and how to request a limit increase, see Amazon EC2 service quotas.

5.    If the Auto Scaling group's recent activity shows only successful events, then continue to the Verify the container instance IAM role section.

Important: Certain permissions must be set for the service-linked AWS Identity and Access Management (IAM) role AWSServiceRoleForAutoScaling. The IAM role AWSServiceRoleForAutoScaling must have user access to the customer managed AWS Key Management Service (AWS KMS) key at a minimum. This is necessary in environments with custom Amazon Machine Images (AMIs), encrypted Amazon Elastic Block Store (Amazon EBS) volumes, and customer managed AWS KMS keys. For more information, see Key policy sections that allow access to the customer managed key.
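The Activity History review for the On-Demand procedure can also be scripted against the output of the Amazon EC2 Auto Scaling DescribeScalingActivities API. The following Python sketch assumes that you pass in the Activities list from that response. The function name is a hypothetical helper, and treating the Failed and Cancelled status codes as the API equivalent of the console's Unsuccessful status is an assumption.

```python
def blocked_activities(activities):
    """Return the scaling activities that did not complete, newest data preserved."""
    blocked = []
    for activity in activities:
        # Assumption: these two API status codes map to "Unsuccessful" in the console.
        if activity.get("StatusCode") in ("Failed", "Cancelled"):
            blocked.append(
                {
                    "start_time": activity.get("StartTime"),
                    # StatusMessage holds the reason, for example a quota error.
                    "reason": activity.get("StatusMessage") or activity.get("Description", ""),
                    "cause": activity.get("Cause", ""),
                }
            )
    return blocked
```

With boto3, the input would come from a call such as describe_scaling_activities(AutoScalingGroupName=...) on an autoscaling client, repeated for each Auto Scaling group that belongs to the compute environment.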

To verify that your instances can be created in a Spot compute environment:

1.    Open the Amazon EC2 console.

2.    In the navigation pane, choose Instances. Then, choose Spot Requests.

3.    In the filter, for Request type, choose fleet.

4.    For Status, choose active.

5.    Choose Description. Then, review the Total target capacity value to see if the Spot Instance request was fulfilled. If no instance was created, then check the History view to see a message that explains why. For example, a request with a bid price that's lower than the current Spot market price returns a message similar to the following example:

m4.large, ami-aff65ad2, Linux/UNIX (Amazon VPC), us-east-1a, Spot bid price is less than Spot market price $0.0324

6.    Choose an appropriate bid percentage for your compute environment. Make sure that you create a new compute environment if you change the bid price. For more information, see Spot Instance pricing history.

Note: AWS Batch creates Spot Fleet requests on your behalf. Avoid modifying Spot Fleet requests manually, or your compute environment might be invalidated.

7.    If the History view shows only successful events, then continue to the Verify the container instance IAM role section.
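The History review for the Spot procedure can be sketched in code as well. The following Python example assumes that the compute environment is backed by a Spot Fleet request and that you pass in the HistoryRecords list from the Amazon EC2 DescribeSpotFleetRequestHistory API. The function name is hypothetical, and which event types are worth reviewing is a judgment call rather than an AWS rule.

```python
def spot_fleet_findings(history_records):
    """Return (sub_type, description) pairs for events that can explain unfulfilled capacity."""
    findings = []
    for record in history_records:
        # "error" and "information" events carry explanatory messages;
        # "instanceChange" and "fleetRequestChange" events are skipped here.
        if record.get("EventType") in ("error", "information"):
            info = record.get("EventInformation", {})
            findings.append((info.get("EventSubType", ""), info.get("EventDescription", "")))
    return findings
```

An empty result doesn't prove that the request was fulfilled. Compare it with the Total target capacity value from step 5.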

Verify the container instance IAM role

1.    Open the AWS Batch console.

2.    In the navigation pane, choose Compute environments. Then, choose your compute environment.

3.    In the Compute environment details section, copy the Instance role name.

4.    Open the IAM console.

5.    In the search box, enter the Instance role name. Then, choose your instance role from the results.

6.    Choose the Permissions view. Then, confirm that the AmazonEC2ContainerServiceforEC2Role managed policy is attached to the role. If the policy is attached, then the role's permissions are properly configured and you can skip to step 10.

7.    Choose Attach Policies.

8.    In the search box, enter AmazonEC2ContainerServiceforEC2Role.

9.    For the AmazonEC2ContainerServiceforEC2Role policy, select the check box. Then, choose Attach Policy.

10.    Choose the Trust Relationships view. Then, choose Edit trust relationship.

11.    Confirm that the trust relationship contains the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

12.    If the trust relationship matches the policy in the preceding example, then choose Cancel.

-or-

If the trust relationship doesn't match the policy in the preceding example, then copy the policy into the Policy Document editor. Then, choose Update Trust Policy.

If your instance still doesn't join the Amazon ECS cluster, then complete the steps in the next section.
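The two instance role checks in this section (the attached managed policy and the trust relationship) can be verified programmatically. The following Python sketch assumes that you fetched the trust policy as a dictionary, for example from the AssumeRolePolicyDocument field of the IAM GetRole response, and the attached policies from ListAttachedRolePolicies. The function names are hypothetical helpers.

```python
def as_list(value):
    """IAM policy fields can hold a single value or a list; normalize to a list."""
    return value if isinstance(value, list) else [value]


def trusts_ec2(trust_policy):
    """Return True if a statement allows ec2.amazonaws.com to call sts:AssumeRole."""
    for statement in as_list(trust_policy.get("Statement", [])):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        # Principal can also be the string "*", which has no Service key.
        services = as_list(principal.get("Service", [])) if isinstance(principal, dict) else []
        actions = as_list(statement.get("Action", []))
        if "ec2.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False


def has_ecs_instance_policy(attached_policies):
    """Return True if AmazonEC2ContainerServiceforEC2Role is among the attached policies."""
    return any(
        policy.get("PolicyName") == "AmazonEC2ContainerServiceforEC2Role"
        for policy in attached_policies
    )
```

This sketch checks only for the presence of the expected statement. It doesn't evaluate Deny statements or conditions, so treat a True result as a quick sanity check and not as a full policy evaluation.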

Verify the network and security settings of the compute environment

1.    Open the AWS Batch console.

2.    In the navigation pane, choose Compute environments. Then, choose your compute environment.

3.    In the Compute resources section, copy the Subnets and Security groups values.

4.    Open the Amazon Virtual Private Cloud (Amazon VPC) console.

5.    In the navigation pane, choose Subnets.

6.    For each subnet in the compute environment, choose Description. Then, review the Auto-assign public IPv4 address values.

If the Auto-assign public IPv4 address value is Yes, then confirm that the instances that launch in the subnet have the following:

  • A public IPv4 address
  • A route table with a route destination of 0.0.0.0/0
  • An internet gateway set to Target (for example: igw-1a2b3c4d)

If the Auto-assign public IPv4 address value is No, then confirm that the instances that launch in the subnet have the following:

  • A private IPv4 address
  • A route table with a route destination of 0.0.0.0/0
  • A NAT gateway set to Target (for example: nat-12345678901234567)

Note: For more information, see the Routing section in Example: VPC with servers in private subnets and NAT.

7.    In the navigation pane, choose Security Groups.

8.    For each security group specified in the compute environment, choose the Outbound Rules view. Then, verify that a rule with the following settings exists:

  • For Type, choose ALL Traffic.
  • For Protocol, choose ALL.
  • For Port Range, choose ALL.
  • For Destination, choose 0.0.0.0/0.

Important: If the rule doesn't exist, then choose Edit and create the rule. For a more restrictive rule for outbound traffic, choose HTTPS (443) for Type and 0.0.0.0/0 for Destination.

9.    In the navigation pane, choose Network ACLs.

10.    Choose the VPC's network access control list (network ACL).

11.    Confirm that the default network ACL is configured to allow all traffic to flow in and out of associated subnets.

Important: If you modified the network ACL, then add a rule that allows outbound IPv4 HTTPS traffic from the subnet to the internet. Because network ACLs are stateless, also make sure that an inbound rule allows the return traffic on ephemeral ports. For more information, see Control traffic to EC2 instances with security groups and Control traffic to subnets using network ACLs. To change the VPC, subnets, or security groups, create a new compute environment.
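The subnet routing and security group checks in this section can be sketched as follows. This Python example assumes that you fetched each subnet with the Amazon EC2 DescribeSubnets API, its route table with DescribeRouteTables, and each security group with DescribeSecurityGroups. The function names are hypothetical, and the sketch considers only internet gateway and NAT gateway targets, so other valid designs (for example, VPC endpoints) aren't recognized.

```python
def allows_https_egress(security_group):
    """Return True if an outbound rule permits HTTPS (or all traffic) to 0.0.0.0/0."""
    for rule in security_group.get("IpPermissionsEgress", []):
        to_internet = any(
            ip_range.get("CidrIp") == "0.0.0.0/0" for ip_range in rule.get("IpRanges", [])
        )
        if not to_internet:
            continue
        protocol = rule.get("IpProtocol")
        if protocol == "-1":  # "-1" means all protocols and all ports
            return True
        if protocol == "tcp" and rule.get("FromPort", 0) <= 443 <= rule.get("ToPort", 65535):
            return True
    return False


def default_route_target(route_table):
    """Return the gateway ID that the 0.0.0.0/0 route points to, or None."""
    for route in route_table.get("Routes", []):
        if route.get("DestinationCidrBlock") == "0.0.0.0/0":
            return route.get("GatewayId") or route.get("NatGatewayId")
    return None


def subnet_route_ok(subnet, route_table):
    """Public subnets need an internet gateway; private subnets need a NAT gateway."""
    target = default_route_target(route_table) or ""
    if subnet.get("MapPublicIpOnLaunch"):
        return target.startswith("igw-")
    return target.startswith("nat-")
```

A subnet without an explicit route table association uses the VPC's main route table, so pass that table to subnet_route_ok in that case.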

If your instance still doesn't join the Amazon ECS cluster, then connect to your instance. Check the status of the Docker daemon and the Amazon ECS container agent.

Note: The procedures in this article don't cover all possible root causes and the ways to troubleshoot them. For additional troubleshooting for an AWS Batch job that's stuck in RUNNABLE status, use AWS CloudTrail. Look up events with the Username attribute set to aws-batch to investigate errors that occur during scheduled tasks.
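As an illustration of the CloudTrail lookup, the following Python sketch filters events for errors. It assumes that you retrieved the events with the CloudTrail LookupEvents API and a lookup attribute with the key Username and the value aws-batch, and that you pass in the Events list from the response. The function name is a hypothetical helper.

```python
import json


def batch_errors(events):
    """Return (event_name, error_code, error_message) for events that recorded an error."""
    errors = []
    for event in events:
        # CloudTrailEvent is a JSON string that holds the full event record.
        detail = json.loads(event.get("CloudTrailEvent") or "{}")
        if "errorCode" in detail:
            errors.append(
                (event.get("EventName"), detail["errorCode"], detail.get("errorMessage", ""))
            )
    return errors
```

With boto3, the input would come from a call such as lookup_events(LookupAttributes=[{"AttributeKey": "Username", "AttributeValue": "aws-batch"}]) on a cloudtrail client. The response is paginated, so a single call might not return every event.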

Related information

Connect to your Linux instance

Connect to your Windows instance

AWS OFFICIAL
Updated 8 months ago