When ECS agents disconnect on EC2 hosts, it can prevent tasks from running properly while the underlying EC2 instances still pass basic health checks. Here are recommended approaches to automatically replace EC2 hosts with disconnected ECS agents:
- Use Amazon ECS-managed health checks: Configure your Auto Scaling group to use ECS-managed health checks instead of just EC2 status checks. This allows Amazon ECS to mark instances with disconnected agents as unhealthy, triggering replacement through the ASG.
- Implement custom health checks: Create a custom script that runs on your EC2 instances to check the ECS agent status periodically. This script can monitor the agent's connectivity and trigger instance replacement if issues are detected. You can use AWS Systems Manager Run Command or include this in your instance user data.
- Adjust ASG health check grace period: While you mentioned having a 0-second grace period, consider setting an appropriate grace period that allows enough time for the ECS agent to initialize properly, especially for GPU-based or resource-intensive workloads.
- Use CloudWatch alarms: Set up CloudWatch alarms to monitor ECS agent metrics and trigger actions when agents disconnect. These alarms can be configured to perform auto-scaling actions or send notifications.
- Implement a Lambda function: Create a Lambda function that periodically checks for disconnected ECS agents in your cluster and terminates the corresponding EC2 instances, allowing the ASG to launch replacements.
- Review IAM roles and security groups: Ensure your EC2 instances have the correct IAM roles with necessary permissions to communicate with ECS, and that security groups and network ACLs allow proper communication between EC2 instances and ECS service endpoints.
- Check VPC configuration: Verify that your VPC route tables, DNS resolution, and instance metadata access are correctly configured, as these can prevent the ECS agent from functioning properly.
For troubleshooting existing issues, examine the ECS agent logs (typically at "/var/log/ecs/ecs-agent.log") for detailed error messages and compare configurations between working and non-working instances.
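The Lambda approach in the list above could be sketched roughly as follows. This is a minimal sketch, not a tested production function. It assumes the function runs on a schedule, that the cluster name is supplied in a hypothetical `ECS_CLUSTER` environment variable, and that the execution role allows `ecs:ListContainerInstances`, `ecs:DescribeContainerInstances`, and `autoscaling:SetInstanceHealth`. Instead of terminating instances directly, it marks them unhealthy in the Auto Scaling group so that the ASG performs the replacement; that is a substitution for the termination step described above.

```python
import os


def find_disconnected_instances(ecs_client, cluster):
    """Return EC2 instance IDs whose ECS agent is reported as disconnected."""
    instance_ids = []
    paginator = ecs_client.get_paginator("list_container_instances")
    # Cluster query language filter: only instances with a disconnected agent.
    for page in paginator.paginate(cluster=cluster, filter="agentConnected == false"):
        arns = page.get("containerInstanceArns", [])
        if not arns:
            continue
        described = ecs_client.describe_container_instances(
            cluster=cluster, containerInstances=arns
        )
        for instance in described["containerInstances"]:
            # Re-check the flag in case the agent reconnected between the two calls.
            if not instance.get("agentConnected", True):
                instance_ids.append(instance["ec2InstanceId"])
    return instance_ids


def mark_unhealthy(autoscaling_client, instance_ids):
    """Ask the Auto Scaling group to replace each instance."""
    for instance_id in instance_ids:
        autoscaling_client.set_instance_health(
            InstanceId=instance_id,
            HealthStatus="Unhealthy",
            ShouldRespectGracePeriod=False,
        )


def lambda_handler(event, context):
    # boto3 is imported here so the helpers above can be exercised without it.
    import boto3

    cluster = os.environ["ECS_CLUSTER"]  # hypothetical variable name
    instance_ids = find_disconnected_instances(boto3.client("ecs"), cluster)
    mark_unhealthy(boto3.client("autoscaling"), instance_ids)
    return {"markedUnhealthy": instance_ids}
```

Because a brief disconnect during an agent restart or update would also match the filter, it may be worth requiring that an instance show up as disconnected on two consecutive runs before acting on it.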
Sources
Troubleshoot schedule issues for Amazon ECS tasks on EC2 instances | AWS re:Post
EC2 scale from auto scaling group not start ecs agent | AWS re:Post
Troubleshoot unhealthy instances in Amazon EC2 Auto Scaling - Amazon EC2 Auto Scaling
Hello.
Does your EC2 instance have the SSM Agent installed?
If it is, the instance can be managed with Systems Manager, so one option is to build a check that uses SSM Run Command to verify periodically that the ECS agent process is running.
https://docs.aws.amazon.com/systems-manager/latest/userguide/running-commands-console.html
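A periodic check like this could look roughly like the sketch below, which could be invoked on the instance through Run Command. It queries the ECS agent's local introspection endpoint on port 51678 and treats the agent as responding only if the metadata includes a container instance ARN. The function names and the 5-second timeout are assumptions for illustration. Note that a local response shows the agent process is up; it does not by itself prove the agent is connected to the ECS service.

```python
import http.client
import json
import urllib.request

# Local introspection endpoint exposed by the ECS agent on the container instance.
INTROSPECTION_URL = "http://localhost:51678/v1/metadata"


def metadata_looks_healthy(body):
    """Return True if the metadata JSON reports a container instance ARN."""
    try:
        data = json.loads(body)
    except (ValueError, TypeError):
        return False
    return isinstance(data, dict) and bool(data.get("ContainerInstanceArn"))


def agent_is_responding(url=INTROSPECTION_URL, timeout=5.0):
    """Return True if the agent answers locally with plausible metadata."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200 and metadata_looks_healthy(response.read())
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, or a malformed reply: treat as not responding.
        return False
```

A small wrapper could exit with a non-zero status when `agent_is_responding()` returns `False`, so that the Run Command invocation is reported as failed and can drive an alert or an instance replacement.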
Hi, my team has run into this exact same issue. Did you manage to find a solution?
