I want to troubleshoot why my Amazon Elastic Container Service (Amazon ECS) task stopped.
Short description
Your Amazon ECS tasks might stop for one of the following reasons:
- Essential container in task exited
- Failed Elastic Load Balancing (ELB) health checks
- Failed container health checks
- Unhealthy container instance
- Underlying infrastructure maintenance
- Service scaling event triggered
- ResourceInitializationError
- CannotPullContainerError
- Task stopped by user
Resolution
You can use the DescribeTasks API to view the details of a stopped task. However, the details of a stopped task appear in the returned results for only one hour. To allow more time to view stopped task details, use an AWS CloudFormation template from the GitHub website. Use the template to store Amazon CloudWatch Logs from an Amazon EventBridge event that's triggered when a task is stopped.
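For example, the following is a minimal sketch that uses the AWS SDK for Python (Boto3) to retrieve the stop code and stopped reason for a task. The cluster name and task ID are placeholders.

import boto3

# Placeholder cluster name and task ID: replace with your own values.
ecs = boto3.client("ecs")
response = ecs.describe_tasks(cluster="example-cluster", tasks=["example-task-id"])

for task in response["tasks"]:
    # stopCode and stoppedReason describe why Amazon ECS stopped the task.
    print(task.get("stopCode"), task.get("stoppedReason"))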
Stopped reasons
The following are common reasons that your Amazon ECS task might stop.
Essential container in task exited
All tasks must have at least one essential container. If a container that has its essential parameter set to true fails or stops, then all other containers in the task are stopped. To understand why a task exited with this reason, use the DescribeTasks API to identify the exit code. Then, complete the steps in the Common exit codes section of this article.
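As a sketch, you can extend the previous Boto3 call to print the exit code of each container in the stopped task. The cluster name and task ID are placeholders.

import boto3

ecs = boto3.client("ecs")
response = ecs.describe_tasks(cluster="example-cluster", tasks=["example-task-id"])

for task in response["tasks"]:
    for container in task["containers"]:
        # exitCode and reason are populated only after the container stops.
        print(container["name"], container.get("exitCode"), container.get("reason"))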
Task failed ELB health checks
When a task fails because of ELB health checks, confirm that your container security group allows traffic that originates from the load balancer. Complete the following tasks:
- Define a minimum health check grace period. The grace period instructs the service scheduler to ignore Elastic Load Balancing health checks for a predefined time period after a task is instantiated.
- Use slow start mode. By default, a target receives its requests as soon as it's registered with a target group and passes an initial health check. The slow start mode lets targets warm up before the load balancer sends the targets a full share of requests.
- Monitor the CPU and memory metrics of the service. For example, high CPU utilization can make your application unresponsive and result in a 502 error.
- Check your application logs for application errors.
- Check that the ping port and the health check path are correctly configured.
- Curl the health check path from within an Amazon Elastic Compute Cloud (Amazon EC2) instance, and then confirm the response code, as shown in the sketch that follows this list.
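For example, the following sketch uses the Python standard library to send a request to the health check path and print the response code. The address and path are placeholder assumptions; run the check from a host that your network configuration allows to reach the container.

import urllib.request

# Placeholder address and health check path: replace with your container's values.
url = "http://10.0.0.10:8080/health"

try:
    with urllib.request.urlopen(url, timeout=5) as response:
        # A 200 response code indicates that the health check path responds successfully.
        print(response.status)
except Exception as error:
    print(f"Health check request failed: {error}")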
Failed container health checks
You can define health checks in the TaskDefinition API. Or, you can define health checks in the Dockerfile. For more information, see Healthcheck on the Docker website.
To view the health status of both individual containers and the task, use the DescribeTasks API operation.
The exit status of the health check command must indicate whether the container is healthy. To check your container logs for application errors, use the log driver settings that are specified in the task definition. The following are the possible exit values for your health check command (a sketch of a container-level health check follows this list):
- 0 (success): The container is healthy and ready for use.
- 1 (unhealthy): The container isn't working correctly.
- 2 (reserved): Don't use this exit code.
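The following is a sketch of a container health check as you might define it in a task definition with the Boto3 register_task_definition call. The family, image, command, and timing values are placeholder assumptions.

import boto3

ecs = boto3.client("ecs")

# Placeholder task definition: the family, image, and health check values are examples only.
ecs.register_task_definition(
    family="example-web-app",
    containerDefinitions=[
        {
            "name": "web",
            "image": "111122223333.dkr.ecr.us-east-1.amazonaws.com/example-web:latest",
            "essential": True,
            "memory": 512,
            "healthCheck": {
                # The command must exit with 0 (healthy) or 1 (unhealthy).
                "command": ["CMD-SHELL", "curl -f http://localhost/ || exit 1"],
                "interval": 30,
                "timeout": 5,
                "retries": 3,
                "startPeriod": 60,
            },
        }
    ],
)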
(instance i-##) (port #) is unhealthy in (reason Health checks failed)
This error message indicates that the container status is unhealthy. To troubleshoot this issue, complete the following tasks (a sketch that checks the target health status follows this list):
- Verify that the security group attached to the container instance permits traffic.
- Confirm that there's a successful response without delay from the backend.
- If needed, update the health check response timeout to a correct value.
- Check the access logs of your load balancer for more information.
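As a sketch, you can use Boto3 to check the health state and reason for each target in the target group that the service uses. The target group ARN is a placeholder.

import boto3

elbv2 = boto3.client("elbv2")

# Placeholder target group ARN: replace with the target group that your service uses.
response = elbv2.describe_target_health(
    TargetGroupArn="arn:aws:elasticloadbalancing:us-east-1:111122223333:targetgroup/example/0123456789abcdef"
)

for target in response["TargetHealthDescriptions"]:
    health = target["TargetHealth"]
    # Reason and Description explain why a target is unhealthy.
    print(target["Target"]["Id"], health["State"], health.get("Reason"), health.get("Description"))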
Service ABCService: ECS is performing maintenance on the underlying infrastructure hosting the task
This error message indicates that the task was stopped because of a task maintenance issue. For more information, see AWS Fargate task maintenance on Amazon ECS FAQs.
If the container instance is part of an Auto Scaling group, you must launch a new container instance, and then place the tasks. For more information, see Verify a scaling activity for an Auto Scaling group.
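For example, the following is a sketch that lists recent scaling activities for the Auto Scaling group so that you can confirm that a replacement container instance launched. The group name is a placeholder.

import boto3

autoscaling = boto3.client("autoscaling")

# Placeholder Auto Scaling group name: replace with the group that hosts your container instances.
response = autoscaling.describe_scaling_activities(
    AutoScalingGroupName="example-ecs-asg", MaxRecords=10
)

for activity in response["Activities"]:
    # StatusCode and Cause describe each launch or terminate action.
    print(activity["StartTime"], activity["StatusCode"], activity["Cause"])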
ECS service scaling event triggered
This error message is a standard service message. Amazon ECS uses the Application Auto Scaling service to provide this functionality, and the service can automatically increase or decrease the desired count of tasks. To resolve this error message, complete the following tasks (a sketch that reviews the service event log follows this list):
- Review CloudWatch alarms for any changes in your tasks.
- Review any scheduled deployments that might affect your tasks.
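As a sketch, you can use Boto3 to review the recent event log of the service for scaling-related messages, such as changes to the desired count. The cluster and service names are placeholders.

import boto3

ecs = boto3.client("ecs")

# Placeholder cluster and service names.
response = ecs.describe_services(cluster="example-cluster", services=["example-service"])

for event in response["services"][0]["events"][:10]:
    # Service events record scaling actions and task placement messages.
    print(event["createdAt"], event["message"])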
ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed
To troubleshoot this error message, see How do I troubleshoot the error message "unable to pull secrets or registry auth" in Amazon ECS?
CannotPullContainerError
This error message indicates that the task can't pull the container image, often because the task execution role doesn't have permission to communicate with Amazon Elastic Container Registry (Amazon ECR). To troubleshoot this issue, complete the following tasks:
- Verify that the task execution role has the needed permissions (see the sketch after this list). Amazon ECS provides the managed policy named AmazonECSTaskExecutionRolePolicy that contains the permissions for most use cases.
- Verify that the Amazon ECR service endpoints are accessible: ecr.region.amazonaws.com and dkr.ecr.region.amazonaws.com.
- For private images that need authentication, confirm that the repositoryCredentials and credentialsParameter are defined with the correct information. For more information, see Using non-AWS container images in Amazon ECS.
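For example, the following sketch lists the managed policies that are attached to the task execution role so that you can confirm that AmazonECSTaskExecutionRolePolicy, or an equivalent policy, is attached. The role name is a placeholder.

import boto3

iam = boto3.client("iam")

# Placeholder role name: replace with the task execution role from your task definition.
response = iam.list_attached_role_policies(RoleName="ecsTaskExecutionRole")

for policy in response["AttachedPolicies"]:
    print(policy["PolicyName"], policy["PolicyArn"])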
Task stopped by user
This error message indicates that the task received a StopTask API call. To identify who initiated the call, view the StopTask event in AWS CloudTrail, and then check the userIdentity information.
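As a sketch, you can use Boto3 to look up recent StopTask events in CloudTrail and print the identity that made each call.

import json
import boto3

cloudtrail = boto3.client("cloudtrail")

# Look up recent StopTask calls that CloudTrail recorded in the current Region.
response = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "StopTask"}],
    MaxResults=10,
)

for event in response["Events"]:
    detail = json.loads(event["CloudTrailEvent"])
    # userIdentity identifies the principal that made the StopTask call.
    print(event["EventTime"], detail["userIdentity"].get("arn"))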
Common exit codes
The following are common exit codes:
- 0: Indicates success. The ENTRYPOINT or CMD command completed its execution, and the container stopped.
- 1: Indicates an application error. For more information, review your application logs.
- 137: Occurs when the container was forced to exit (SIGKILL).
If the container doesn't respond to a SIGTERM signal within the default 30-second period, then a SIGKILL signal is sent and the container is forcibly stopped. You can configure the default 30-second period on the Amazon ECS container agent with the ECS_CONTAINER_STOP_TIMEOUT parameter. This exit code can also occur in an out-of-memory (OOM) situation. To verify whether OOM occurred, review your CloudWatch metrics (a sketch that checks the service memory metrics follows this list).
- 139: Occurs when a segmentation fault occurs. This usually happens when the application tries to access a memory region that isn't available, or when there's an unset environment variable or an environment variable that's not valid.
- 255: Occurs when the ENTRYPOINT or CMD command in your container failed because of an error. To confirm that this is the cause, review your CloudWatch metrics.
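For example, the following is a sketch that retrieves the recent memory utilization metrics for the service so that you can check for an OOM condition. The cluster and service names are placeholders.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch")

# Placeholder cluster and service names: check memory utilization around the time the task stopped.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/ECS",
    MetricName="MemoryUtilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "example-cluster"},
        {"Name": "ServiceName", "Value": "example-service"},
    ],
    StartTime=datetime.now(timezone.utc) - timedelta(hours=1),
    EndTime=datetime.now(timezone.utc),
    Period=60,
    Statistics=["Maximum"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    # Values near 100 percent suggest that the container ran out of memory.
    print(point["Timestamp"], point["Maximum"])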
Common error messages
The following are common error messages:
No Container Instances were found in your cluster
To resolve this error message, review the container instances section for your cluster. If needed, launch a container instance.
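As a sketch, you can list the container instances that are registered in the cluster to confirm that at least one active instance is available. The cluster name is a placeholder.

import boto3

ecs = boto3.client("ecs")

# Placeholder cluster name: an empty list means that no active container instances are registered.
response = ecs.list_container_instances(cluster="example-cluster", status="ACTIVE")
print(response["containerInstanceArns"])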
InvalidParameterException
To resolve this error message, review your TaskDefinition parameters. Any parameters that are defined in the TaskDefinition must be present, and any Amazon Resource Names (ARNs) must be correct. Verify that the task role and task execution role have sufficient permissions.
You've reached the limit of the number of tasks that you can run concurrently
To resolve this error message, review your limits. For more information about limits, see Amazon ECS service quotas.
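For example, the following sketch lists the Amazon ECS quotas that apply to your account so that you can compare them against your current usage. The quota names and values in the output depend on your account.

import boto3

quotas = boto3.client("service-quotas")

# List the Amazon ECS quotas that apply to the account in the current Region.
response = quotas.list_service_quotas(ServiceCode="ecs")

for quota in response["Quotas"]:
    print(quota["QuotaName"], quota["Value"])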
For all other quota increase requests, create a case in the AWS Support console, and then choose Service limit increase.