How do I troubleshoot a disconnected Amazon ECS agent?

My container instances for Amazon Elastic Container Service (Amazon ECS) are disconnected.

Short description

During normal operation, the Amazon ECS container agent disconnects and reconnects multiple times an hour. Agent connection change events that last for only a few minutes are expected and don't necessarily indicate issues with the container agent or your container instance.

However, if the container agent remains in the disconnected state for an extended period, then the container instance can't operate as part of your Amazon ECS cluster. This issue might be caused by one of the following reasons:

  • Networking issues prevent communication between the instance and Amazon ECS.
  • The container agent doesn't have the required AWS Identity and Access Management (IAM) permissions to communicate with Amazon ECS endpoints.
  • There are problems with the host or Docker daemon inside the container instance.
  • There's resource contention in the underlying host.

It's a best practice to use the latest version of the Amazon ECS container agent.

Resolution

Note: The following resolution applies to Amazon ECS-optimized Amazon Linux 2023 AMIs.

You can use SSH keys to connect to your Amazon EC2 instances. If you don't have the SSH keys generated, then you can use Session Manager, a capability of AWS Systems Manager, to connect to your instance. By default, Systems Manager Agent is installed on Amazon Linux 2023 AMIs and the Amazon Linux 2023 ECS-optimized base AMI.

Verify that the container agent is running on the container instance

To verify the status and connectivity of the Amazon ECS container agent, run one of the following commands on your container instance:

sudo systemctl status ecs
sudo docker ps -f name=ecs-agent

The output shows that the service is active (running) and that the agent container is up, and looks similar to the following:

ecs.service - Amazon Elastic Container Service - container agent
        Loaded: loaded (/usr/lib/systemd/system/ecs.service; enabled; preset: disabled)
        Active: active (running) since Thu 2026-02-19 08:42:39 UTC; 1min 17s ago
          Docs: https://aws.amazon.com/documentation/ecs/
     Main PID: 2578 (amazon-ecs-init)
        Tasks: 5 (limit: 9497)
       Memory: 136.5M
          CPU: 214ms
       CGroup: /system.slice/ecs.service
               └─2578 /usr/libexec/amazon-ecs-init start

CONTAINER ID  IMAGE                           COMMAND   CREATED         STATUS
8ab4e7c372d7  amazon/amazon-ecs-agent:latest  "/agent"  2 minutes ago   Up 2 minutes (healthy)

If the issue is caused by a disconnected agent, then run the following command to restart the ECS agent:

sudo systemctl restart ecs

Note: There's no output returned after you run the preceding command.

To verify that the agent is running, run the following command:

sudo systemctl status ecs
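
The STATUS column of the docker ps output shown earlier in this section can also be read by a script. The following is a minimal sketch, assuming the agent container is named ecs-agent and reports a health status as in the example output. The function name agent_container_health is hypothetical and isn't part of Docker or any AWS tooling.

```shell
#!/usr/bin/env bash
# Classify the STATUS text that `docker ps` prints for the agent container.
# agent_container_health is a hypothetical helper name used for illustration.
agent_container_health() {
  local status="$1"
  case "$status" in
    "")              echo "not running" ;;   # docker ps listed no matching container
    *"(unhealthy)"*) echo "unhealthy" ;;
    *"(healthy)"*)   echo "healthy" ;;
    Up*)             echo "running, health unknown" ;;
    *)               echo "not running" ;;
  esac
}

# Example use on a container instance (commented out so that defining the
# function has no side effects):
#   status="$(sudo docker ps -f name=ecs-agent --format '{{.Status}}')"
#   agent_container_health "$status"
```

A healthy container doesn't prove that the agent is connected to Amazon ECS. It only rules out a stopped or failing agent container as the cause.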

Verify that the Docker service is running on the container instance

To verify that the Docker service is running on the affected container instance, run the following command:

sudo systemctl status docker

The output shows that the service is active (running) and looks similar to the following:

docker.service - Docker Application Container Engine
        Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; preset: disabled)
        Active: active (running) since Thu 2026-02-19 08:42:37 UTC; 3min 46s ago
   TriggeredBy: docker.socket
          Docs: https://docs.docker.com
     Main PID: 2314 (dockerd)
        Tasks: 13
       Memory: 485.3M
          CPU: 4.846s
       CGroup: /system.slice/docker.service
               └─2314 /usr/bin/dockerd -H fd:// --containerd=/run/containerd/containerd.sock --default-ulimit nofile=32768:65536

If the Docker service is inactive, then run the following command to restart the Docker service:

sudo systemctl restart docker

Note: There's no output returned after you run the preceding command.

To verify that the Docker service has restarted, run the following command:

sudo systemctl status docker
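
The check-then-restart sequence in this section can be sketched as a small script. This is a sketch under the assumption that systemd manages the docker unit, as on the Amazon ECS-optimized Amazon Linux 2023 AMI. The function name restart_decision is hypothetical.

```shell
#!/usr/bin/env bash
# Decide whether a systemd unit should be restarted, based on the state string
# that `systemctl is-active` prints (for example: active, inactive, failed).
# restart_decision is a hypothetical helper name used for illustration.
restart_decision() {
  local state="$1"
  case "$state" in
    active|activating) echo "skip" ;;      # running or already starting
    *)                 echo "restart" ;;   # inactive, failed, or unknown
  esac
}

# Example use on a container instance (commented out so that defining the
# function has no side effects):
#   state="$(systemctl is-active docker || true)"
#   if [ "$(restart_decision "$state")" = "restart" ]; then
#     sudo systemctl restart docker
#     sudo systemctl status docker --no-pager
#   fi
```

Restarting Docker stops the containers that run on the instance, so the sketch restarts the service only when it isn't already running.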

Review log files for the container agent and Docker

If your container instance is still disconnected, then review the log files on the container host for the container agent and Docker.

Check the following log files for keywords, such as "error", "warn", or "agent transition state":

  • View the Amazon ECS container agent's latest logs at /var/log/ecs/ecs-agent.log. Rotated logs are stored at /var/log/ecs/ecs-agent.log.timestamp
  • View the Amazon ECS init log at /var/log/ecs/ecs-init.log
  • View the user data execution logs at /var/log/cloud-init.log. The output of user data scripts is written to /var/log/cloud-init-output.log
  • View the Docker daemon logs with the command sudo journalctl -u docker
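
The keyword search described above can be sketched with grep. This is a minimal example that assumes the default log locations listed in this section. The function name scan_log is hypothetical.

```shell
#!/usr/bin/env bash
# Print the lines of a log file that contain any of the keywords named above,
# case-insensitively, each prefixed with its line number.
# scan_log is a hypothetical helper name used for illustration.
scan_log() {
  local file="$1"
  grep -inE 'error|warn|agent transition state' "$file"
}

# Example use on a container instance (commented out so that defining the
# function has no side effects):
#   scan_log /var/log/ecs/ecs-agent.log | tail -n 50
#   scan_log /var/log/ecs/ecs-init.log | tail -n 50
#   sudo journalctl -u docker --no-pager | grep -iE 'error|warn'
```

The tail filter keeps only the most recent matches, which is usually where a current disconnection shows up.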

If you use Linux, then you can also review the exit codes for more information on the stopped agent container.

To get the exit code, run the following command:

sudo docker inspect <your container ID>

Note: Replace <your container ID> with the ID of the stopped container. To list stopped containers and their IDs, run sudo docker ps -a.
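
The full docker inspect output is long, so it can help to extract only the exit code and interpret it. The following sketch relies on the common convention that an exit code above 128 means the process was terminated by signal number (code - 128). The function name describe_exit_code is hypothetical.

```shell
#!/usr/bin/env bash
# Turn a container exit code into a short description.
# Convention: codes above 128 mean "terminated by signal (code - 128)".
# describe_exit_code is a hypothetical helper name used for illustration.
describe_exit_code() {
  local code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited normally"
  elif [ "$code" -gt 128 ]; then
    echo "terminated by signal $((code - 128))"
  else
    echo "exited with error code $code"
  fi
}

# Example use on a container instance (commented out so that defining the
# function has no side effects):
#   code="$(sudo docker inspect --format '{{.State.ExitCode}}' <your container ID>)"
#   describe_exit_code "$code"
#   sudo docker inspect --format '{{.State.OOMKilled}}' <your container ID>
```

For example, exit code 137 corresponds to signal 9 (SIGKILL). Together with an OOMKilled value of true, that result suggests the container ran out of memory.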

You can use the Amazon ECS logs collector to collect general operating system logs, Docker logs, and container agent logs for Amazon ECS.

Verify that the IAM instance profile has the necessary permissions

If the container agent is still disconnected, then verify that the IAM instance profile associated with the container instance has the necessary IAM permissions:

  1. Use SSH or Session Manager to connect to the instance.

  2. To view the instance metadata about the instance profile associated with the instance, run the following commands:

    TOKEN=$(curl -X PUT "http://169.254.169.254/latest/api/token" -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
    curl -H "X-aws-ec2-metadata-token: $TOKEN" http://169.254.169.254/latest/meta-data/iam/info
    

    The output looks similar to the following:

    {
       "Code" : "Success",
       "LastUpdated" : "2026-02-19T08:42:14Z",
       "InstanceProfileArn" : "arn:aws:iam::111122223333:instance-profile/ecsInstanceRole",
       "InstanceProfileId" : "AIPA4VIZXOFF55F72XIZN"
    }
    
  3. Verify that the IAM role contains the correct permissions for your container instances.

  4. To check for specific credential errors, run the following command to view the container agent log:

    cat /var/log/ecs/ecs-agent.log.YYYY-MM-DD-##
    

    Note: Replace YYYY-MM-DD-## with the relevant timestamp.

    The container agent log is rotated every hour. The suffix automatically changes to reflect the current date and time. Update the command to include the date range and log ID for when the issue occurred.
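
Steps 2 and 3 can be partly scripted by extracting the instance profile name from the metadata response. This is a minimal sketch that assumes the JSON layout shown in the example output. The function name instance_profile_name is hypothetical.

```shell
#!/usr/bin/env bash
# Read the JSON returned by the /latest/meta-data/iam/info metadata path on
# stdin and print the instance profile name (the last segment of the ARN).
# instance_profile_name is a hypothetical helper name used for illustration.
instance_profile_name() {
  sed -n 's/.*"InstanceProfileArn"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' \
    | sed 's|.*/||'
}

# Example use on a container instance (commented out so that defining the
# function has no side effects):
#   TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
#     -H "X-aws-ec2-metadata-token-ttl-seconds: 21600")
#   curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
#     http://169.254.169.254/latest/meta-data/iam/info | instance_profile_name
```

With the name in hand, you can run aws iam get-instance-profile --instance-profile-name NAME to see the role in the profile, and aws iam list-attached-role-policies --role-name ROLE to see its managed policies. Both commands assume that the AWS CLI is installed and that your own credentials allow IAM read access, so they are typically run from a workstation rather than from the container instance.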

Verify that your container instance has enough resources to run the ECS agent

If your tasks have high memory or CPU utilization, then your container instance might not have enough resources to run the ECS agent.

The Amazon ECS container agent uses the Docker ReadMemInfo() function to query the amount of memory available for the operating system.

Run the following command on your container instance to view the total memory recognized by the operating system:

free -b

Example output for a t2.large instance running the Amazon ECS-optimized Amazon Linux 2023 AMI:

               total        used        free      shared  buff/cache   available
Mem:     8327938048   337494016  6402355200      557056  1588088832  7745118208
Swap:             0           0           0

You can reserve some memory for the Amazon ECS container agent and other critical system processes on your container instances. Reserving this memory helps ensure that your tasks' containers don't contend with those processes for the same memory. For more information, see Reserving Amazon ECS Linux container instance memory.
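
To make the arithmetic concrete, the following sketch converts the free -b total to MiB and subtracts a reservation. It assumes that you reserve memory with the ECS_RESERVED_MEMORY agent configuration parameter, which takes a value in MiB. The 256 MiB figure is only an illustration, and the function names are hypothetical.

```shell
#!/usr/bin/env bash
# Read `free -b` output on stdin and print total memory in MiB, rounded down.
# total_mem_mib and schedulable_mib are hypothetical helper names.
total_mem_mib() {
  awk '/^Mem:/ { printf "%d\n", $2 / 1048576 }'
}

# Print the MiB left for tasks after a reservation.
# Arguments: total MiB, reserved MiB.
schedulable_mib() {
  echo $(( $1 - $2 ))
}

# Example use on a container instance (commented out so that defining the
# functions has no side effects):
#   total="$(free -b | total_mem_mib)"
#   schedulable_mib "$total" 256
```

For the example output above, 8327938048 bytes is 7942 MiB, so a 256 MiB reservation would leave 7686 MiB for tasks. The value that Amazon ECS registers for the instance might differ slightly from this estimate.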

Verify that the environment variable ECS_CLUSTER has the correct cluster name

If the Amazon ECS container agent configuration parameter ECS_CLUSTER has the incorrect cluster name, then the container instance can't join the cluster. To check the contents of the /etc/ecs/ecs.config file and verify this parameter, run the following command:

cat /etc/ecs/ecs.config
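
Instead of reading the whole file, you can extract only the cluster name. This is a minimal sketch that assumes the file uses plain KEY=value lines without quotes. The function name configured_cluster is hypothetical.

```shell
#!/usr/bin/env bash
# Print the cluster name set in an agent configuration file.
# Falls back to "default", the cluster the agent joins when ECS_CLUSTER is unset.
# configured_cluster is a hypothetical helper name used for illustration.
configured_cluster() {
  local file="$1" value
  # If the key appears more than once, keep the last occurrence.
  value="$(sed -n 's/^[[:space:]]*ECS_CLUSTER=//p' "$file" | tail -n 1)"
  echo "${value:-default}"
}

# Example use on a container instance (commented out so that defining the
# function has no side effects):
#   configured_cluster /etc/ecs/ecs.config
#   curl -s http://localhost:51678/v1/metadata
```

The second command queries the agent introspection endpoint, which reports the cluster that the running agent registered with. If the two values differ, then the agent likely started before the file was changed, and a restart of the ecs service is needed.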

Verify that the ECS agent can communicate with ECS endpoints

To connect with ECS endpoints, the network access control lists and container instance security group must allow outbound connections on port 443 (HTTPS).

If your container instance is in a public subnet, then verify that the instance has a public IP address and that the subnet's route table has a route to an internet gateway.

If your container instance is in a private subnet, then verify that the subnet's route table has a route to a NAT gateway, or that you have configured VPC endpoints for Amazon ECS.

To check the outbound connections to ECS endpoints (ACS/TCS), run one of the following commands on your container instance:

sudo yum install telnet -y
telnet ecs.REGION.amazonaws.com 443

or

curl https://ecs.REGION.amazonaws.com

Note: Replace REGION with your AWS Region.
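
The commands above test only the ecs hostname, while the agent also opens connections to separate agent (ACS) and telemetry (TCS) hostnames. The following sketch checks all three. The ecs-a and ecs-t hostname patterns are an assumption taken from the agent log excerpt in the comments on this article, and the hostnames that your agent actually uses might differ. The function name ecs_endpoints is hypothetical.

```shell
#!/usr/bin/env bash
# Print the three hostnames to test for a Region: the Amazon ECS API (ecs),
# the agent service (ecs-a), and the telemetry service (ecs-t).
# The ecs-a and ecs-t patterns are assumptions; check your agent log for the
# hostnames that your agent really dials.
# ecs_endpoints is a hypothetical helper name used for illustration.
ecs_endpoints() {
  local region="$1" prefix
  for prefix in ecs ecs-a ecs-t; do
    echo "${prefix}.${region}.amazonaws.com"
  done
}

# Example use on a container instance (commented out so that defining the
# function has no side effects). Any HTTP response, even an error status,
# shows that the TCP and TLS connection succeeded:
#   for host in $(ecs_endpoints us-west-2); do
#     if curl -s -o /dev/null --connect-timeout 5 "https://${host}"; then
#       echo "${host}: reachable"
#     else
#       echo "${host}: NOT reachable"
#     fi
#   done
```

In a private subnet without a NAT gateway, a failure for only the ecs-a or ecs-t hostname suggests that the ecs-agent or ecs-telemetry interface VPC endpoint is missing in addition to the ecs endpoint.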

Review the following best practices:

  • Unless your application requires a specific operating system or a Docker version that's not available in the Amazon ECS-optimized AMI, use the Amazon ECS-optimized Linux AMIs to run your ECS workloads.
  • Use the latest version of the Amazon ECS container agent. The latest version includes enhanced features and provides important updates.
  • Configure tasks with CPU and memory limits.

Related information

Amazon ECS troubleshooting

Amazon ECS container instance IAM role

Viewing Amazon ECS container agent logs

AWS OFFICIAL - Updated a month ago
3 Comments

I have an Agent Disconnected error on my container instance. My instance is in a private VPC, but I have enabled a VPC endpoint for ECS and added a SG rule for incoming traffic from all IPs for now. After following the steps above, this is what I found: curl https://ecs.us-west-2.amazonaws.com gives me an HTTP 400 error, but the connection goes through with TLS connected. ecs.us-west-2.amazonaws.com resolves to an internal IP, 10.2.25.146.

However, ecs-t.us-west-2.amazonaws.com and ecs-a.us-west-2.amazonaws.com resolve to public IPs. Because I have disabled public access for my VPC, connections to these domains fail.

This is from my ecs_agent.log files on the EC2 instance - Error creating a websocket client: dial tcp 52.119.163.142:443: i/o timeout" URL="https://ecs-t.us-west-2.amazonaws.com/tcs/5/ws?agentHash=... "Error connecting to TCS" error="websocket client: unable to dial ecs-t.us-west-2.amazonaws.com response: : dial tcp 52.119.163.142:443: i/o timeout" msg="Failed to connect to ACS" containerInstanceARN="arn:aws:ecs:us-west-2::container-instance//*" error="websocket client: unable to dial ecs-a.us-west-2.amazonaws.com response: : dial tcp 54.71.31.148:443: i/o timeout"

The /etc/resolv.conf has this entry - search us-west-2.compute.internal nameserver 10.2.16.2

Perhaps the nameserver is not able to resolve the ecs-*.us-west-2.amazonaws.com domains internally but resolves ecs.us-west-2.amazonaws.com properly?

replied a year ago

Turns out I needed a VPC Endpoint for ecs-agent and ecs-telemetry.

replied a year ago

This article was reviewed and updated on 2026-03-05.

AWS
MODERATOR
replied a month ago