
ECS Agent Disconnection & Host Replacement Issues


Hi,

We run a multi-region metadata aggregation service on ECS-EC2 with Auto Scaling Groups, and we are facing issues where ECS agents disconnect on EC2 hosts, leaving those hosts unable to run ECS tasks. Our ASG is configured with the default EC2 health checks (grace period 0 seconds), and those checks still passed, so the unhealthy hosts were never replaced automatically.

We tried container health checks, which work for task-level issues but do not detect host-level ECS agent problems. We also tried ASG ELB health check integration, but hit an architectural blocker: our NetworkTargetGroup uses TargetType.IP, which is required for our ECS setup, while ASG health checks need TargetType.INSTANCE.

What's the recommended pattern for automatically replacing EC2 hosts when the ECS agent becomes unresponsive? What health checks should we implement to detect ECS agent connectivity issues at the host level? What other approaches haven't we considered?

(Ideally in TypeScript, because we're using TypeScript APIs to manage our ECS service and cluster, NLB, etc.)

Thank you so much for any help!

3 Answers

When ECS agents disconnect on EC2 hosts, it can prevent tasks from running properly while the underlying EC2 instances still pass basic health checks. Here are recommended approaches to automatically replace EC2 hosts with disconnected ECS agents:

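  1. Report agent health to the ASG as a custom health check: Auto Scaling groups have no built-in ECS health check type (the built-in types are EC2, ELB, VPC Lattice, and EBS), so a disconnected agent is not detected by default. A process that detects the disconnected agent can call the Auto Scaling SetInstanceHealth API to mark the instance as Unhealthy, which triggers replacement through the ASG.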

  2. Implement custom health checks: Create a custom script that runs on your EC2 instances and periodically checks the ECS agent status. The script can monitor the agent's connectivity and trigger instance replacement if it detects a problem. You can run it on demand with AWS Systems Manager Run Command, or install it through your instance user data so that it runs on a schedule.

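  3. Adjust ASG health check grace period: A 0-second grace period means health checks count from the moment an instance enters service. A longer grace period does not detect a disconnected agent by itself, but once agent-aware checks are in place it gives the ECS agent time to initialize, so new instances are not replaced prematurely, especially for GPU-based or resource-intensive workloads.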

  4. Use CloudWatch alarms: Set up CloudWatch alarms on a metric that reflects agent connectivity, for example a custom metric published by the check in item 2 or the Lambda function in item 5. The alarms can send notifications or trigger automated actions when agents disconnect.

  5. Implement a Lambda function: Create a Lambda function that periodically checks for disconnected ECS agents in your cluster and terminates the corresponding EC2 instances, allowing the ASG to launch replacements.

  6. Review IAM roles and security groups: Ensure your EC2 instances have the correct IAM roles with necessary permissions to communicate with ECS, and that security groups and network ACLs allow proper communication between EC2 instances and ECS service endpoints.

  7. Check VPC configuration: Verify that your VPC route tables, DNS resolution, and instance metadata access are correctly configured, as these can prevent the ECS agent from functioning properly.
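
Approach 5 could be sketched roughly as follows in TypeScript. This is a minimal sketch under stated assumptions, not a tested implementation: `EcsLike` and `AsgLike` are hypothetical wrapper interfaces that you would implement with the AWS SDK (ECS ListContainerInstances and DescribeContainerInstances, Auto Scaling SetInstanceHealth), and `findDisconnected` and `replaceDisconnectedHosts` are illustrative names. The fields `ec2InstanceId`, `agentConnected`, and `status` follow the DescribeContainerInstances response. The sketch marks an instance unhealthy instead of terminating it directly; in both cases the ASG launches a replacement.

```typescript
// Subset of the fields returned by ECS DescribeContainerInstances.
interface ContainerInstanceInfo {
  ec2InstanceId?: string;
  agentConnected?: boolean;
  status?: string; // e.g. "ACTIVE" or "DRAINING"
}

// Hypothetical wrappers; in a real Lambda these would call the AWS SDK.
interface EcsLike {
  describeAllContainerInstances(cluster: string): Promise<ContainerInstanceInfo[]>;
}
interface AsgLike {
  setInstanceUnhealthy(instanceId: string): Promise<void>;
}

// Pure selection logic: ACTIVE instances whose agent reports as disconnected.
function findDisconnected(instances: ContainerInstanceInfo[]): string[] {
  return instances
    .filter((ci) => ci.status === "ACTIVE" && ci.agentConnected === false)
    .map((ci) => ci.ec2InstanceId)
    .filter((id): id is string => typeof id === "string" && id.length > 0);
}

// Cap replacements per run (an assumed safeguard) so that a wider
// connectivity problem does not replace every host at once.
async function replaceDisconnectedHosts(
  ecs: EcsLike,
  asg: AsgLike,
  cluster: string,
  maxPerRun = 1,
): Promise<string[]> {
  const all = await ecs.describeAllContainerInstances(cluster);
  const targets = findDisconnected(all).slice(0, maxPerRun);
  for (const id of targets) {
    await asg.setInstanceUnhealthy(id);
  }
  return targets;
}
```

An agent can report as disconnected briefly, for example while it restarts, so a scheduled job would probably want to act only after several consecutive disconnected readings.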

For troubleshooting existing issues, examine the ECS agent logs (typically at "/var/log/ecs/ecs-agent.log") for detailed error messages and compare configurations between working and non-working instances.
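
When comparing hosts, the agent's local introspection endpoint (`http://localhost:51678/v1/metadata` by default) can complement the log file. Below is a rough TypeScript sketch, assuming Node.js 18+ for the global `fetch`; `parseAgentMetadata` and `checkLocalAgent` are illustrative names, not part of any AWS library. A successful response only shows that the agent process is up and has registered; it does not prove that the agent's connection to the ECS control plane is currently healthy.

```typescript
interface AgentMetadata {
  cluster: string;
  containerInstanceArn: string;
  version: string;
}

// Returns null when the body is not valid JSON or has no registration ARN.
function parseAgentMetadata(body: string): AgentMetadata | null {
  try {
    const raw = JSON.parse(body) as Record<string, unknown>;
    const arn = raw["ContainerInstanceArn"];
    if (typeof arn !== "string" || arn.length === 0) return null;
    return {
      cluster: String(raw["Cluster"] ?? ""),
      containerInstanceArn: arn,
      version: String(raw["Version"] ?? ""),
    };
  } catch {
    return null;
  }
}

// Queries the introspection endpoint on the host, with a short timeout.
async function checkLocalAgent(
  url = "http://localhost:51678/v1/metadata",
  timeoutMs = 2000,
): Promise<AgentMetadata | null> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return res.ok ? parseAgentMetadata(await res.text()) : null;
  } catch {
    return null; // connection refused or timed out: agent likely not running
  } finally {
    clearTimeout(timer);
  }
}
```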
Sources
Troubleshoot schedule issues for Amazon ECS tasks on EC2 instances | AWS re:Post
EC2 scale from auto scaling group not start ecs agent | AWS re:Post
Troubleshoot unhealthy instances in Amazon EC2 Auto Scaling - Amazon EC2 Auto Scaling

answered 4 months ago

Hello.

Do your EC2 instances have the SSM Agent installed?
If so, they can be managed with Systems Manager, so it's a good idea to set up a job that uses SSM Run Command to periodically check whether the ECS agent process is running.
https://docs.aws.amazon.com/systems-manager/latest/userguide/running-commands-console.html
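
Such a periodic check could be started from TypeScript along these lines. This is only a sketch: `buildAgentCheckCommand` is a hypothetical helper that builds the input for the SSM SendCommand API using the `AWS-RunShellScript` document. The `systemctl is-active ecs` command assumes an ECS-optimized Amazon Linux AMI where the agent runs as the `ecs` systemd unit, and the tag key and value are placeholders for however you tag your cluster hosts.

```typescript
// Shape of the SendCommand request fields used here.
interface SendCommandInputLike {
  DocumentName: string;
  Targets: { Key: string; Values: string[] }[];
  Parameters: Record<string, string[]>;
  TimeoutSeconds: number;
  Comment: string;
}

// Builds a Run Command request that fails on hosts where the agent is down.
function buildAgentCheckCommand(tagKey: string, tagValue: string): SendCommandInputLike {
  return {
    DocumentName: "AWS-RunShellScript",
    // Target instances by tag so new ASG instances are covered automatically.
    Targets: [{ Key: `tag:${tagKey}`, Values: [tagValue] }],
    Parameters: {
      commands: [
        // Assumes the agent runs as the "ecs" systemd unit.
        "systemctl is-active --quiet ecs || { echo 'ecs agent not active'; exit 1; }",
        // The agent introspection endpoint listens on port 51678 by default.
        "curl -sf --max-time 5 http://localhost:51678/v1/metadata > /dev/null || { echo 'introspection endpoint unreachable'; exit 1; }",
      ],
      executionTimeout: ["60"],
    },
    TimeoutSeconds: 120,
    Comment: "ECS agent liveness check",
  };
}
```

The returned object can be passed to `SendCommandCommand` from `@aws-sdk/client-ssm`. A non-zero exit code marks the invocation as failed for that instance, which a scheduled job could then use to decide whether to replace the host.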

EXPERT
answered 4 months ago

Hi, my team has encountered this exact same issue. Did you manage to find a solution?

AWS
answered a day ago
