How do I resolve the "DockerTimeoutError" error in AWS Batch?


The jobs in my AWS Batch compute environment are failing and returning the following error: "DockerTimeoutError: Could not transition to created; timed out after waiting 4m0s." I want to troubleshoot the error.

Short description

If your Docker start and Docker create API calls take longer than 4 minutes, then AWS Batch returns a DockerTimeoutError.

Note: The default timeout limit that the Amazon Elastic Container Service (Amazon ECS) container agent sets is 4 minutes.

The following reasons most commonly cause this error:

  • The ECS instance volumes of the AWS Batch compute environment are under high I/O pressure from all the other jobs in your queue. These jobs can deplete the burst balance.
  • Stopped ECS containers aren't being cleaned up quickly enough to free up the Docker daemon. If you use a customized Amazon Machine Image (AMI) instead of the default AMI that AWS Batch provides, then you can experience Docker issues.

If neither of these issues is causing the error, then take the following actions to further troubleshoot the issue:

  • Check your Docker logs to identify the source of the error.
  • Run the Amazon ECS logs collector script on the ECS instances in the ECS cluster that's associated with your AWS Batch compute environment.
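
The two checks above can be sketched as shell commands that you run on the container instance. This is a minimal sketch, not an official procedure: the function name count_docker_timeouts is hypothetical, and the log path, the journalctl unit name, and the download URL for the logs collector are assumptions based on an Amazon Linux 2 ECS optimized AMI.

```shell
#!/usr/bin/env bash
# Sketch only: the function name and the paths below are assumptions.

# Print how many lines in a log file mention DockerTimeoutError.
count_docker_timeouts() {
  local log_file="$1"
  # grep -c exits non-zero when the count is 0, so mask the exit status.
  grep -c 'DockerTimeoutError' "$log_file" || true
}

# Example usage on a container instance (run manually):
#   count_docker_timeouts /var/log/ecs/ecs-agent.log
#   sudo journalctl -u docker --since "1 hour ago"
#   curl -O https://raw.githubusercontent.com/awslabs/ecs-logs-collector/master/ecs-logs-collector.sh
#   sudo bash ./ecs-logs-collector.sh
```

A nonzero count only confirms that the agent recorded the timeout. The Docker daemon log around the same timestamps is usually where the underlying cause appears.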

Resolution

Resolve burst balance issues

Check the burst balance of your ECS instance

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

Complete the following steps:

  1. Open the Amazon ECS console.
  2. In the navigation pane, choose Clusters. Then, select the cluster that contains your job.
    Note: The name of the cluster starts with the name of the compute environment, followed by _Batch_ and a random hash of numbers and letters.
  3. Choose the Infrastructure tab.
  4. From the Infrastructure column, below the Container instances row, choose your instance ID.
    Note: To find the failed job's instance ID, run the AWS Batch describe-jobs AWS CLI command. The instance ID appears in the output for containerInstanceArn.
  5. In the Amazon EC2 console, make sure that the instance is still selected. Then in the Storage section, choose the link for your volume ID.
  6. On the block device pop-up window, for Volume ID, select your volume.
  7. Choose the Monitoring tab. Then, choose Burst Balance to check your burst balance metrics. If your burst balance drops to 0, then your burst balance is depleted.
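
If you prefer the AWS CLI to the console, the same lookup can be sketched as follows. The function names are hypothetical, the 1-hour window and 5-minute period are arbitrary choices, and the date invocation assumes GNU date. Only the aws subcommands and the AWS/EBS BurstBalance metric are real.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical; pass your own IDs.

# Print the container instance ARN that ran a given AWS Batch job.
job_container_instance_arn() {
  local job_id="$1"
  aws batch describe-jobs --jobs "$job_id" \
    --query 'jobs[0].container.containerInstanceArn' --output text
}

# Print the EC2 instance ID behind a container instance ARN.
container_instance_ec2_id() {
  local cluster="$1" arn="$2"
  aws ecs describe-container-instances --cluster "$cluster" \
    --container-instances "$arn" \
    --query 'containerInstances[0].ec2InstanceId' --output text
}

# Print the IDs of the EBS volumes attached to an instance.
instance_volume_ids() {
  local instance_id="$1"
  aws ec2 describe-volumes \
    --filters "Name=attachment.instance-id,Values=${instance_id}" \
    --query 'Volumes[].VolumeId' --output text
}

# Print the minimum BurstBalance (percent) per 5-minute period for the last hour.
volume_burst_balance() {
  local volume_id="$1"
  aws cloudwatch get-metric-statistics \
    --namespace AWS/EBS --metric-name BurstBalance \
    --dimensions "Name=VolumeId,Value=${volume_id}" \
    --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 300 --statistics Minimum \
    --query 'sort_by(Datapoints,&Timestamp)[].[Timestamp,Minimum]' --output text
}
```

Values at or near 0 in the output of volume_burst_balance indicate a depleted burst balance. Volume types that don't use burst credits are expected to return no datapoints for this metric.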

Create a launch template for your managed compute environment

Note: If you change the launch template, then you must create a new compute environment.

Complete the following steps:

  1. Open the Amazon EC2 console, and then choose Launch Templates.
  2. Choose Create launch template.
  3. For AMI ID, select the default Amazon ECS optimized AMI.
  4. In the Storage (Volumes) section, choose a volume type in the Volume type column. Then, enter an integer value in the Size(GiB) column.
    Note: If you choose Provisioned IOPS SSD (io1) for your volume type, then enter an integer value that's permitted for IOPS.
  5. Choose Create launch template.
  6. Use your new launch template to create a new managed compute environment.
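
The storage part of these steps can also be sketched with the AWS CLI. The function names, the template name, the gp3 volume type, and the /dev/xvda device name are assumptions for illustration; check the root device name of your AMI before you use them. The sketch leaves out the AMI ID and sets only the volume.

```shell
#!/usr/bin/env bash
# Sketch only: names, the device name, and the volume values are assumptions.

# Print launch template data that sets the volume type and size.
# For io1, you would also need an "Iops" key inside "Ebs".
launch_template_data() {
  local volume_type="$1" size_gib="$2"
  cat <<EOF
{"BlockDeviceMappings":[{"DeviceName":"/dev/xvda","Ebs":{"VolumeType":"${volume_type}","VolumeSize":${size_gib}}}]}
EOF
}

# Create the launch template with the AWS CLI.
create_batch_launch_template() {
  local name="$1" volume_type="$2" size_gib="$3"
  aws ec2 create-launch-template --launch-template-name "$name" \
    --launch-template-data "$(launch_template_data "$volume_type" "$size_gib")"
}

# Example usage (run manually):
#   create_batch_launch_template my-batch-template gp3 100
```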

Create an AWS Batch compute environment with your AMI

Note: If you change the AMI, then you must create a new compute environment because you can't update the AMI ID parameter.

Complete the following steps:

  1. Open the Amazon EC2 console.
  2. Choose Launch instance.
  3. Follow the steps in the setup wizard to create your instance.
    Important: On the Add Storage page, modify the volume type or size of your instance. For General Purpose SSD (gp2) volumes, the larger the volume size, the higher the baseline performance is and the faster the burst balance replenishes. To get better performance for high I/O loads, change the volume type to io1.
  4. Create a compute resource AMI from your instance.
  5. Create a compute environment for AWS Batch that includes your AMI ID.
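
Steps 4 and 5 can be sketched with the AWS CLI as follows. All function names and example values are hypothetical, and the compute resource settings (a single subnet and security group, 0 to 16 vCPUs, the optimal instance type family) are placeholder assumptions that you would replace with your own.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and all resource values are assumptions.

# Create an AMI from a configured instance and print the new AMI ID.
create_ami_from_instance() {
  local instance_id="$1" ami_name="$2"
  aws ec2 create-image --instance-id "$instance_id" --name "$ami_name" \
    --query 'ImageId' --output text
}

# Print a computeResources JSON document that references the AMI.
compute_resources_json() {
  local ami_id="$1" subnet_id="$2" security_group_id="$3" instance_role="$4"
  printf '{"type":"EC2","minvCpus":0,"maxvCpus":16,"instanceTypes":["optimal"],"imageId":"%s","subnets":["%s"],"securityGroupIds":["%s"],"instanceRole":"%s"}\n' \
    "$ami_id" "$subnet_id" "$security_group_id" "$instance_role"
}

# Create a managed compute environment that uses the AMI.
create_compute_environment_with_ami() {
  local env_name="$1"
  shift
  aws batch create-compute-environment \
    --compute-environment-name "$env_name" --type MANAGED --state ENABLED \
    --compute-resources "$(compute_resources_json "$@")"
}

# Example usage (run manually):
#   ami_id="$(create_ami_from_instance i-0123456789abcdef0 my-batch-ami)"
#   create_compute_environment_with_ami my-env "$ami_id" subnet-0abc sg-0abc ecsInstanceRole
```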

Resolve Docker issues

By default, the Amazon ECS container agent automatically cleans up stopped tasks and Docker images that your container instances aren't using. If you run new jobs with new images, then your container storage might fill with Docker images that you don't use. The default AMI for AWS Batch optimizes your Amazon ECS cleanup settings.

Complete the following steps:

  1. Use SSH to connect to the container instance for your AWS Batch compute environment.
  2. To inspect the Amazon ECS container agent, run the docker inspect ecs-agent command. Then, review the Env section in the output.
    Note: To speed up task and image cleanup, reduce the values of the following variables:
    ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION
    ECS_IMAGE_CLEANUP_INTERVAL
    ECS_IMAGE_MINIMUM_CLEANUP_AGE
    ECS_NUM_IMAGES_DELETE_PER_CYCLE
    You can also use tunable parameters for automated task and image cleanup.
  3. Create a new AMI with updated values.
    -or-
    Create a launch template with the user data that includes your new environment variables.
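
As a rough sketch of step 2, the following filter narrows the agent's environment down to the four cleanup variables. The function name cleanup_settings is hypothetical, and the sketch assumes that the agent container is named ecs-agent, as in the step above.

```shell
#!/usr/bin/env bash
# Sketch only: the function name is hypothetical.

# Read KEY=VALUE lines on stdin and keep only the cleanup-related variables.
cleanup_settings() {
  grep -E '^ECS_(ENGINE_TASK_CLEANUP_WAIT_DURATION|IMAGE_CLEANUP_INTERVAL|IMAGE_MINIMUM_CLEANUP_AGE|NUM_IMAGES_DELETE_PER_CYCLE)=' || true
}

# Example usage on the container instance (run manually):
#   sudo docker inspect ecs-agent \
#     --format '{{range .Config.Env}}{{println .}}{{end}}' | cleanup_settings
```

A variable that doesn't appear in the output isn't set explicitly, so the agent uses its built-in default for it.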

Create a new AMI with updated values

Complete the following steps:

  1. Set your agent configuration parameters in the /etc/ecs/ecs.config file.
  2. Restart your container agent.
  3. Create a compute resource AMI from your instance.
  4. Create a compute environment for AWS Batch that includes your AMI ID.
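
Steps 1 and 2 can be sketched as follows. The helper set_ecs_config is hypothetical, the example values are arbitrary, sed -i assumes GNU sed, and the restart command assumes an Amazon Linux 2 based AMI where the agent runs as the ecs systemd service.

```shell
#!/usr/bin/env bash
# Sketch only: the helper name and example values are assumptions.

# Set KEY=VALUE in an ECS agent config file, replacing any existing entry.
set_ecs_config() {
  local key="$1" value="$2" file="${3:-/etc/ecs/ecs.config}"
  touch "$file"
  if grep -q "^${key}=" "$file"; then
    # Overwrite the existing line in place.
    sed -i "s|^${key}=.*|${key}=${value}|" "$file"
  else
    # Append a new line when the key isn't present yet.
    echo "${key}=${value}" >> "$file"
  fi
}

# Example usage as root on the instance (run manually):
#   set_ecs_config ECS_ENGINE_TASK_CLEANUP_WAIT_DURATION 30m
#   set_ecs_config ECS_IMAGE_CLEANUP_INTERVAL 15m
#   systemctl restart ecs
```

Replacing the existing entry instead of always appending keeps the file free of duplicate keys when you run the helper more than once.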

Create a launch template with the user data that includes your new environment variables

Complete the following steps:

  1. Create a launch template with user data.

    For example, the user data in the following MIME multi-part file overrides the default Docker image cleanup settings for a compute resource:

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary="==MYBOUNDARY=="

    --==MYBOUNDARY==
    Content-Type: text/x-shellscript; charset="us-ascii"

    #!/bin/bash
    echo ECS_IMAGE_CLEANUP_INTERVAL=60m >> /etc/ecs/ecs.config
    echo ECS_IMAGE_MINIMUM_CLEANUP_AGE=60m >> /etc/ecs/ecs.config

    --==MYBOUNDARY==--

    For more information about MIME multi-part files, see MIME multi-part file on the cloud-init website.

  2. Use your new launch template to create a managed compute environment.
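
When you create the launch template with the AWS CLI instead of the console, the user data must be base64 encoded. The following is a minimal sketch under that assumption: the function names and the userdata.txt file name are hypothetical, and base64 -w 0 assumes GNU coreutils.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and file names are assumptions.

# Print launch template data that carries a base64-encoded user data file.
user_data_json() {
  local user_data_file="$1"
  # -w 0 turns off line wrapping so that the value stays on one line.
  printf '{"UserData":"%s"}\n' "$(base64 -w 0 "$user_data_file")"
}

# Create a launch template from a MIME multi-part user data file.
create_cleanup_launch_template() {
  local name="$1" user_data_file="$2"
  aws ec2 create-launch-template --launch-template-name "$name" \
    --launch-template-data "$(user_data_json "$user_data_file")"
}

# Example usage (run manually), where userdata.txt holds the MIME file:
#   create_cleanup_launch_template my-cleanup-template userdata.txt
```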

Related information

AWS services that publish CloudWatch metrics

Compute resource AMIs

amazon-ecs-agent on the GitHub website

AWS OFFICIAL
Updated 22 days ago