
Why is my Amazon ECS task stopped?


I want to troubleshoot why my Amazon Elastic Container Service (Amazon ECS) task stopped.

Short description

Your Amazon ECS tasks might stop for one of the following reasons:

  • Essential container in task exited
  • Failed Elastic Load Balancing (ELB) health checks
  • Failed container health checks
  • Unhealthy container instance
  • Underlying infrastructure maintenance
  • Service scaling event triggered
  • ResourceInitializationError
  • CannotPullContainerError
  • Task stopped by user

Resolution

You can use the DescribeTasks API to view the details of a stopped task. However, the details for the stopped task appear only for one hour in the returned results. To allow more time to view stopped task details, use an AWS CloudFormation template from the GitHub website. Use the template to store Amazon CloudWatch Logs from an EventBridge event that is triggered when a task is stopped.
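For example, the following AWS CLI call retrieves the stop code and stopped reason for a task. The cluster name and task ID are placeholders; replace them with your own values:

```shell
# Sketch: replace my-cluster and the task ID with your own values.
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks 0123456789abcdef0 \
  --query 'tasks[0].{stopCode:stopCode,stoppedReason:stoppedReason}' \
  --output table
```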

Stopped reasons

The following are common reasons that your Amazon ECS task might stop.

Essential container in task exited

All tasks must have at least one essential container. If a container that has its essential parameter set to true fails or stops, then all other containers in the task are stopped. To understand why a task exited with this reason, use the DescribeTasks API to identify the exit code. Then, complete the steps in the Common exit codes section of this article.
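The exit code and reason for each container in the stopped task can be pulled out with a query like the following (cluster name and task ID are placeholders):

```shell
# Sketch: replace my-cluster and the task ID with your own values.
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks 0123456789abcdef0 \
  --query 'tasks[0].containers[*].{name:name,exitCode:exitCode,reason:reason}' \
  --output table
```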

Task failed ELB health checks

When a task fails because of ELB health checks, confirm that your container security group allows traffic that originates from the load balancer. Complete the following tasks:

  • Define a minimum health check grace period. The grace period instructs the service scheduler to ignore Elastic Load Balancing health checks for a predefined time period after a task was instantiated.
  • Use slow start mode. By default, a target receives its requests as soon as it's registered with a target group and passes an initial health check. The slow start mode lets targets warm up before the load balancer sends the targets a full share of requests.
  • Monitor the CPU and memory metrics of the service. For example, high CPU can make your application unresponsive and result in a 502 error.
  • Check your application logs for application errors.
  • Check that the ping port and the health check path are correctly configured.
  • Curl the health check path from within Amazon Elastic Compute Cloud (Amazon EC2), and then confirm the response code.
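For the last check, a curl call similar to the following prints the HTTP status code that the health check path returns. The IP address, port, and path are placeholders for your target's values:

```shell
# Sketch: replace the IP address, port, and path with your target's values.
curl -sS -o /dev/null -w "HTTP status: %{http_code}\n" \
  http://10.0.1.23:8080/health
```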

Failed container health checks

You can define health checks in the TaskDefinition API. Or, you can define health checks in the Dockerfile. For more information, see Healthcheck on the Docker website.

To view the health status of both individual containers and the task, use the DescribeTasks API operation.

The health check command exit status must indicate that the container is healthy. To check your container logs for application errors, use the log driver settings specified in the task definition. The following are the possible values for your health check status:

  • 0 (success): The container is healthy and ready for use.
  • 1 (unhealthy): The container isn't working correctly.
  • 2 (reserved): Don't use this exit code.
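A DescribeTasks query similar to the following returns the task-level health status together with the per-container statuses (cluster name and task ID are placeholders):

```shell
# Sketch: replace my-cluster and the task ID with your own values.
aws ecs describe-tasks \
  --cluster my-cluster \
  --tasks 0123456789abcdef0 \
  --query 'tasks[0].{taskHealth:healthStatus,containers:containers[*].{name:name,health:healthStatus}}'
```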

(instance i-##) (port #) is unhealthy in (reason Health checks failed)

This error message indicates that the container status is unhealthy. To troubleshoot this issue, complete the following tasks:

  • Verify that the security group attached to the container instance permits traffic.
  • Confirm that there's a successful response without delay from the backend.
  • Update the health check response timeout to an appropriate value.
  • Check the access logs of your load balancer for more information.

Service ABCService: ECS is performing maintenance on the underlying infrastructure hosting the task

This error message indicates that the task was stopped because of a task maintenance issue. For more information, see AWS Fargate task maintenance on Amazon ECS FAQs.

If the container instance is part of an Auto Scaling group, you must launch a new container instance, and then place the tasks. For more information, see Verify a scaling activity for an Auto Scaling group.

ECS service scaling event triggered

This error message is a standard service message. Amazon ECS uses the Application Auto Scaling service to automatically increase or decrease the desired count of tasks in your service. To resolve this error message, complete the following tasks:

  • Review CloudWatch alarms for any changes in your tasks.
  • Review for any deployments that are scheduled and that might affect your tasks.
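To see the scaling activities that changed the desired count, you can query Application Auto Scaling directly. The cluster and service names are placeholders:

```shell
# Sketch: replace my-cluster and my-service with your own values.
aws application-autoscaling describe-scaling-activities \
  --service-namespace ecs \
  --resource-id service/my-cluster/my-service \
  --max-results 10
```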

ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed

To troubleshoot this error message, see How do I troubleshoot the error message "unable to pull secrets or registry auth" in Amazon ECS?

CannotPullContainerError

This error message indicates that the task execution role that's used doesn't have the permissions to communicate with Amazon Elastic Container Registry (Amazon ECR). To troubleshoot this issue, complete the following tasks:

  • Verify that the task execution role has the needed permissions. Amazon ECS provides the managed policy named AmazonECSTaskExecutionRolePolicy that contains the permissions for most use cases.
  • Verify that the Amazon Elastic Container Registry (Amazon ECR) service endpoints are accessible from the task's network: ecr.region.amazonaws.com and dkr.ecr.region.amazonaws.com.
  • For private images that need authentication, confirm that the repositoryCredentials and credentialsParameter are defined with the correct information. For more information, see Using non-AWS container images in Amazon ECS.
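One way to confirm endpoint reachability is to call the ECR endpoints from the task's subnet, for example from an EC2 instance. The Region below is a placeholder; any HTTP status code in the response shows that DNS resolution and network connectivity work:

```shell
# Sketch: replace us-east-1 with your Region.
curl -sS -o /dev/null -w "%{http_code}\n" https://ecr.us-east-1.amazonaws.com
curl -sS -o /dev/null -w "%{http_code}\n" https://dkr.ecr.us-east-1.amazonaws.com
```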

Task stopped by user

This error message indicates that the task received a StopTask API call. To identify who initiated the call, view the StopTask event in AWS CloudTrail, and then check the userIdentity information.
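For example, the following AWS CLI call lists recent StopTask events along with the user name that made each call:

```shell
# List the five most recent StopTask calls recorded by CloudTrail.
aws cloudtrail lookup-events \
  --lookup-attributes AttributeKey=EventName,AttributeValue=StopTask \
  --max-results 5 \
  --query 'Events[*].{time:EventTime,user:Username}'
```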

Common exit codes

The following are common exit codes:

  • 0: The ENTRYPOINT or CMD completed its execution successfully, and the container stopped.
  • 1: Refers to an application error. For more information, review your application logs.
  • 137: Occurs when the container was forced to exit with SIGKILL.
    If the container doesn't respond to a SIGTERM within the default 30-second period, then a SIGKILL is sent and the container is forcibly stopped. You can configure the default 30-second period on the Amazon ECS container agent with the ECS_CONTAINER_STOP_TIMEOUT parameter. This exit code can also occur in an Out-of-Memory (OOM) situation. To verify whether OOM occurred, review your CloudWatch metrics.
  • 139: Occurs when the container received a segmentation fault. This usually happens when the application tries to access a memory region that isn't available, or when an environment variable is unset or not valid.
  • 255: Occurs when the ENTRYPOINT or CMD command in your container failed because of an error. To confirm that this is the cause, review your CloudWatch metrics.
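Exit codes above 128 follow the common shell convention of 128 plus the signal number, which is why SIGKILL (signal 9) produces 137 and SIGSEGV (signal 11) produces 139. You can verify this convention locally:

```shell
# 128 + 9 (SIGKILL) = 137: the subshell is killed, and the
# parent shell reports the signal-encoded exit code.
sh -c 'kill -KILL $$'
echo "exit code: $?"   # prints: exit code: 137
```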

Common error messages

The following are common error messages:

No Container Instances were found in your cluster

To resolve this error message, review the container instances section for your cluster. If needed, launch a container instance.

InvalidParameterException

To resolve this error message, review your TaskDefinition parameters. Any parameters that are defined in TaskDefinition must be present and the Amazon Resource Name (ARN) must be correct. Verify that the task role and task execution role have sufficient permissions.

You've reached the limit of the number of tasks that you can run concurrently

To resolve this error message, review your quotas. For more information, see Amazon ECS service quotas.
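You can list your current Amazon ECS quotas with the Service Quotas CLI (for AWS Fargate task limits, use the service code fargate instead):

```shell
# List the applied Amazon ECS quotas for your account.
aws service-quotas list-service-quotas \
  --service-code ecs \
  --query 'Quotas[*].{name:QuotaName,value:Value}' \
  --output table
```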

For all other quota increase requests, create a case in the AWS Support console, and then choose Service limit increase.
