ECS unhealthy task is stopped but replacement task is launched after 60+ minutes

0

I have an ECS service that performs some CPU-intensive work, which drives the CPU to 100% for a couple of minutes. A few times the target group has marked the task as unhealthy because the health check endpoint did not respond in time. Once that happens, ECS drains, stops, and deregisters the unhealthy task and then starts a new task to replace it.

Aside from the fact that the task's behavior should be "fixed" so it stops getting marked unhealthy so frequently, I'm seeing odd behavior in ECS: most of the time, when the sequence described above happens, the replacement task is started less than 1 minute after the unhealthy task is deregistered, which I believe is the expected behavior. But in some cases ECS starts the replacement task 60 or more minutes after the unhealthy task is deregistered, and I can't find any events or logs in the window between the deregistration and the replacement task starting.

What could be the reason ECS is taking so long to start the replacement task? I'm not sure how to troubleshoot this.

Thanks in advance

Leandro.-

asked a year ago · 1,029 views
4 Answers
0

The delay in launching the replacement task in Amazon ECS could be due to several reasons. Here are some potential reasons and troubleshooting tips:

  1. Resource Shortages: If your cluster doesn't have enough resources (CPU, Memory, etc.) to launch a new task, ECS will not be able to start the replacement task until resources become available. This could happen if there are other tasks running in the cluster that are consuming a lot of resources. You can monitor your cluster's resource utilization in the ECS console to see if this might be the issue.

  2. Task Placement Constraints or Strategies: If you have task placement constraints or strategies that cannot be satisfied, this could also delay the launch of the replacement task. For example, if you have a constraint that the task must be placed on a specific type of instance, and no such instances are available, the task launch could be delayed.

  3. Service Throttling: AWS has rate limiting in place to protect the service from being overwhelmed. If you've been starting and stopping tasks very frequently, you might hit these limits, which could cause delays. You can review the service quotas (previously known as limits) in the ECS documentation and request an increase if necessary.

  4. Task Definition Issues: If there are issues with the task definition, such as problems with the Docker image or the container parameters, ECS might not be able to launch the task. Check the task definition to make sure everything is correct.

  5. Issues with ECS Service: Sometimes, there could be delays due to issues on the ECS service side. Check the AWS Service Health Dashboard to see if there are any ongoing issues with ECS in your region.

To troubleshoot this further, you can look at the ECS service events in the AWS Management Console. These events can give you more information about why the task is not being started. If you have CloudWatch Logs set up for your tasks, you can also check those logs for any error messages or other information that might help you diagnose the problem.
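One way to make a multi-minute hole in the service event log visible is to pull the events and look for unusually large gaps between consecutive entries. The sketch below is a minimal helper, assuming the event dicts have the `createdAt` (datetime) and `message` keys that the ECS `DescribeServices` API returns; the threshold is an illustrative choice, not a recommendation:

```python
from datetime import datetime, timedelta

def find_event_gaps(events, threshold=timedelta(minutes=5)):
    """Return (earlier_message, later_message, gap) tuples wherever two
    consecutive service events are separated by more than `threshold`.

    `events` follows the shape of the ECS DescribeServices response:
    a list of dicts with "createdAt" (datetime) and "message" keys.
    """
    # DescribeServices returns events newest-first; sort oldest-first.
    ordered = sorted(events, key=lambda e: e["createdAt"])
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        gap = later["createdAt"] - earlier["createdAt"]
        if gap > threshold:
            gaps.append((earlier["message"], later["message"], gap))
    return gaps
```

You could feed it `boto3.client("ecs").describe_services(cluster=..., services=[...])["services"][0]["events"]`; an event gap of ~75 minutes with no messages in between would point at something outside the service scheduler (for example, capacity or agent connectivity) rather than at the service configuration itself.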

answered a year ago
  • Thanks for your answer, Yusuf. I checked, and none of those 5 reasons seems to be the cause of the delay in starting the replacement task.

0

Some other topics to check:

  • Are you using the Amazon ECS-optimized AMI or a self-baked one?
  • If you use a custom ECS AMI, I would check the /etc/ecs/ecs.config file for custom timeouts that might delay stopping or starting the task (ECS_CONTAINER_STOP_TIMEOUT, for example): https://github.com/aws/amazon-ecs-agent/blob/master/README.md
  • What are the service Events telling you?
  • Check the ecs-agent logs if you have them enabled, as well as syslog.
  • I would also check the instance metrics for CPU, Network I/O and if you have CloudWatch Agent configured to send memory and swap data check the memory stats.
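For reference, the agent settings mentioned above live in /etc/ecs/ecs.config on the container instance. A hypothetical example with the timeout-related variables worth looking for (the values shown are illustrative, not recommendations):

```
# /etc/ecs/ecs.config -- illustrative values only
ECS_CLUSTER=my-cluster            # hypothetical cluster name
ECS_CONTAINER_STOP_TIMEOUT=30s    # how long the agent waits for a container to exit on stop
ECS_CONTAINER_START_TIMEOUT=3m    # how long the agent waits for a container to reach RUNNING
ECS_IMAGE_PULL_BEHAVIOR=default   # 'prefer-cached' can speed up replacement launches
```

An unusually large stop or start timeout here would delay the drain/replace cycle, though on the stock ECS-optimized AMI these are normally left at their defaults.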

Thanks!

answered a year ago
  • Thanks for your answer Sauerkraut.

    I'm using an optimized AMI: al2023-ami-ecs-hvm-2023.0.20230530-kernel-6.1-x86_64 ami-0a685aaa06c4fb0bd.

    In the ECS service events I just see the unhealthy task stopped, drained, and deregistered, and then almost 75 minutes later the replacement task was launched. It's worth noting that at other times of the day the same thing happened and the replacement task was launched about 1 minute after the unhealthy one was deregistered.

    After digging a bit more I found that the EC2 instance where the replacement task is running has roughly the same launch time as the replacement task's start time, so I'm fairly sure the ECS agent on the instance where the unhealthy task was running got disconnected and the instance took some time to be replaced. ECS was most probably unable to start the replacement task because no EC2 instance was available until the disconnected one was terminated and a new instance was launched. I haven't confirmed this yet, but at the moment it looks like the most probable hypothesis.
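One way to test this hypothesis is to check the `agentConnected` flag on the cluster's container instances. Below is a minimal sketch of a helper that filters a `DescribeContainerInstances` response; the response shape assumed here (a `containerInstances` list with `ec2InstanceId` and `agentConnected` fields) matches the ECS API:

```python
def disconnected_instances(describe_response):
    """Given an ECS DescribeContainerInstances response dict, return the
    EC2 instance IDs whose ECS agent is not currently connected."""
    return [
        ci["ec2InstanceId"]
        for ci in describe_response.get("containerInstances", [])
        if not ci.get("agentConnected", False)
    ]
```

You could pass it the output of `boto3.client("ecs").describe_container_instances(cluster=..., containerInstances=[...])`. A disconnected agent would explain the gap: the scheduler has nowhere to place the replacement task until the instance is replaced.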

0

I have found three references that may be related to your problem:

  • https://stackoverflow.com/questions/44436048/amazon-aws-ecs-task-delay
  • https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/task.html
  • https://repost.aws/knowledge-center/ecs-tasks-container-exit-issues

As a summary:

If your task starts on a container instance that doesn't have your base image already downloaded, this can result in a delay. You can help alleviate this by pre-loading your instances, improving the networking throughput of your instances, or reducing your image size.

If you are running Amazon ECS on Amazon EC2, you can configure the Amazon ECS container agent to cache previously used container images to reduce image pull time for subsequent launches. Using the binpack placement strategy can further enhance this effect by increasing task density on your container instances.

The choice of network mode and instance type can significantly influence task launch latency. For instance, the awsvpc network mode may add several seconds of overhead to your task launches, and choosing an optimal instance type based on your task's resource reservation can help better utilize the instance's resources.

Using the Amazon ECS service scheduler to concurrently launch services can speed up the overall deployment process. Designing your applications as smaller services with fewer tasks, rather than one large service with a large number of tasks, can result in faster deployment speed.

To further troubleshoot this issue, check the diagnostic information in the service event log, the stopped tasks for errors, and your application logs for application issues. If the awslogs log driver is configured in your task definition, check the logs in CloudWatch Logs. You can also consider tracking your task launch lifecycle to find optimization opportunities and understand how your application contributes to the total launch time.
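To check the stopped tasks for errors, as suggested above, one approach is to tally the stop reasons across recently stopped tasks. A minimal sketch, assuming the task dicts carry the `stopCode` and `stoppedReason` fields that the ECS `DescribeTasks` API returns:

```python
from collections import Counter

def stop_reason_summary(tasks):
    """Count (stopCode, stoppedReason) pairs across a list of task dicts
    in the shape returned by the ECS DescribeTasks API."""
    return Counter(
        (t.get("stopCode", "UNKNOWN"), t.get("stoppedReason", ""))
        for t in tasks
    )
```

You could feed it the `tasks` list from `describe_tasks` after a `list_tasks(desiredStatus="STOPPED")` call; a cluster where most stops share one reason (for example, failed ELB health checks) narrows the investigation quickly.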

EXPERT
answered a year ago
0

Leandro,

You can allow your application the opportunity to recover from the high CPU usage burst by adjusting the health check settings in the task definition as well as in the service configuration. Please see the 3 links below.

[1] https://docs.aws.amazon.com/AmazonECS/latest/bestpracticesguide/load-balancer-healthcheck.html

[2] https://docs.aws.amazon.com/AmazonECS/latest/userguide/task_definition_parameters.html#container_definition_healthcheck

[3] https://docs.aws.amazon.com/AmazonECS/latest/userguide/service_definition_parameters.html#sd-deploymentconfiguration
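As a concrete illustration of [2], a container-level health check in the task definition can be given a longer interval, timeout, and start period so a short CPU burst doesn't immediately mark the task unhealthy. The fragment below is illustrative only; the values are not recommendations:

```json
{
  "containerDefinitions": [
    {
      "name": "app",
      "healthCheck": {
        "command": ["CMD-SHELL", "curl -f http://localhost:8080/health || exit 1"],
        "interval": 30,
        "timeout": 10,
        "retries": 5,
        "startPeriod": 60
      }
    }
  ]
}
```

On the service side ([3]), `healthCheckGracePeriodSeconds` tells ECS to ignore load balancer health checks for a window after a task starts, and relaxing the target group's unhealthy threshold ([1]) gives the application more failed checks before deregistration.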

AWS
answered 10 months ago
