As far as I understand, the issue is a race condition between the Fargate ENI's public IP propagation and the boto3 credential resolution chain.
1. The "Why":
- Static Credentials: boto3 has the keys in memory instantly, so the only network dependency is the call to SSM itself.
- Task Role: boto3 must first fetch credentials from the Task Metadata Endpoint (169.254.170.2). If the container starts and tries to fetch these before the internal routing/ENI is fully initialized, the credential provider fails or hangs, leading to the ConnectTimeoutError when it finally tries to call SSM with incomplete or missing context.
2. Metadata vs. Public IP: The Task Metadata Endpoint (169.254.170.2) is internal and usually available immediately, but if your code hits SSM before the Public IP is routable, the call will time out. Because boto3 caches the failure or the "environment" state during the first few seconds, subsequent retries in the same client session can behave inconsistently.
Try the following:
- Best Practice (PrivateLink): Create VPC Interface Endpoints for SSM (com.amazonaws.ap-northeast-1.ssm). This removes the dependency on the Public IP and the Internet Gateway entirely. The task will connect to SSM instantly via the internal AWS backbone.
- Boto3 Configuration: Increase the specific connection timeouts and retries in your client config to account for the "warm-up" time of the Fargate ENI:
    import boto3
    from botocore.config import Config

    # Allow more time to connect while the ENI "warms up", and retry more often
    config = Config(connect_timeout=30, read_timeout=30, retries={'max_attempts': 10})
    ssm = boto3.client('ssm', config=config)

- IAM Trust Policy: Double-check that your Task Role (not just the Execution Role) has a Trust Relationship allowing ecs-tasks.amazonaws.com to sts:AssumeRole. Without this, the metadata service cannot return valid tokens.
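To make the trust policy check concrete, here is a minimal sketch that inspects a trust policy document in the standard IAM JSON shape. The function name `allows_ecs_tasks` and the example policy are my own illustration, not part of any AWS library, and the check deliberately ignores wildcards and `Condition` blocks, so treat a negative result as a prompt to look closer rather than as proof.

```python
import json


def allows_ecs_tasks(trust_policy: dict) -> bool:
    """Return True if an Allow statement lets ecs-tasks.amazonaws.com call sts:AssumeRole."""
    statements = trust_policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear without a list
        statements = [statements]
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        principal = stmt.get("Principal", {})
        services = principal.get("Service", []) if isinstance(principal, dict) else []
        actions = stmt.get("Action", [])
        # Both fields may be a single string or a list of strings.
        if isinstance(services, str):
            services = [services]
        if isinstance(actions, str):
            actions = [actions]
        if "ecs-tasks.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False


# Example trust policy (illustrative) that a task role would normally carry.
EXAMPLE_POLICY = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "ecs-tasks.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
""")

if __name__ == "__main__":
    print(allows_ecs_tasks(EXAMPLE_POLICY))
```

The real document for your role can be read with `aws iam get-role --role-name <your-task-role> --query Role.AssumeRolePolicyDocument` and passed to the function after `json.loads`.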
Why Static Credentials Work but Task Role Fails
Both need internet to reach ssm.ap-northeast-1.amazonaws.com. The difference:
- Static credentials: boto3 already has the keys, calls SSM directly over HTTPS
- Task role: boto3 first fetches temporary creds from 169.254.170.2 (instant), then calls SSM over HTTPS
The metadata endpoint works immediately. The problem is reaching the SSM API endpoint over the public internet.
Root Cause
Your public subnet ENI takes 60-90s to become routable, but you said retries over 2.5 minutes still fail. That points to one of:
- Subnet NACLs blocking return traffic (NACLs are stateless, unlike security groups)
- Route table missing 0.0.0.0/0 → igw
- DNS resolution failing inside the container
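For the route table item, a rough sketch of the check is below. It assumes you already have one route table as a dict shaped like an entry of `RouteTables` from the EC2 DescribeRouteTables response; the function name and the sample IDs are hypothetical.

```python
def has_igw_default_route(route_table: dict) -> bool:
    """True if the table has an active 0.0.0.0/0 route that targets an internet gateway."""
    for route in route_table.get("Routes", []):
        if (
            route.get("DestinationCidrBlock") == "0.0.0.0/0"
            and route.get("GatewayId", "").startswith("igw-")
            and route.get("State") == "active"
        ):
            return True
    return False


# Hypothetical sample shaped like one entry of describe_route_tables()["RouteTables"].
SAMPLE_TABLE = {
    "RouteTableId": "rtb-0123456789abcdef0",
    "Routes": [
        {"DestinationCidrBlock": "10.0.0.0/16", "GatewayId": "local", "State": "active"},
        {"DestinationCidrBlock": "0.0.0.0/0", "GatewayId": "igw-0123456789abcdef0", "State": "active"},
    ],
}

if __name__ == "__main__":
    print(has_igw_default_route(SAMPLE_TABLE))
```

Remember that a subnet with no explicit route table association uses the VPC's main route table, so check that one if the subnet does not appear in any association.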
Quick Diagnosis
Run inside the container during the failure:
    import os
    import socket

    import requests

    # DNS resolution: prints the (ip, port) the SSM hostname resolves to
    print(socket.getaddrinfo("ssm.ap-northeast-1.amazonaws.com", 443)[0][4])

    # Metadata endpoint (should work); the variable is only set when a task role is attached
    uri = os.environ.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI", "")
    print(requests.get(f"http://169.254.170.2{uri}", timeout=5).status_code)

    # Public internet (will it time out?); any HTTP status code means the endpoint is reachable
    print(requests.get("https://ssm.ap-northeast-1.amazonaws.com", timeout=10).status_code)
If the metadata request succeeds but the public internet request times out, that confirms a network issue, not a credentials issue.
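If the image does not ship `requests`, a standard-library probe can answer the same question and also show how long each attempt takes. This is only a sketch: `tcp_probe`, `report`, and `DEFAULT_TARGETS` are names I made up, and a successful TCP connection shows reachability only, not that the API call itself would succeed.

```python
import socket
import time

# Targets taken from the discussion above: the credentials endpoint and the SSM API endpoint.
DEFAULT_TARGETS = [("169.254.170.2", 80), ("ssm.ap-northeast-1.amazonaws.com", 443)]


def tcp_probe(host: str, port: int = 443, timeout: float = 5.0):
    """Try one TCP connection; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:  # covers timeouts, refused connections, and DNS failures
        return False, time.monotonic() - start, f"{type(exc).__name__}: {exc}"


def report(targets=DEFAULT_TARGETS, timeout: float = 5.0) -> None:
    """Print one line per target so the results end up in the task's logs."""
    for host, port in targets:
        ok, elapsed, err = tcp_probe(host, port, timeout)
        print(f"{host}:{port} ok={ok} elapsed={elapsed:.2f}s {err}")
```

Calling `report()` at the top of the entrypoint, and again after each failed SSM attempt, would show whether the SSM endpoint ever becomes reachable during the retry window.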
Fix
Add a VPC Interface Endpoint for SSM:
com.amazonaws.ap-northeast-1.ssm
This routes SSM traffic privately within the VPC, so there is no internet dependency and it works as soon as the container starts. Cost is about $7.50/month per AZ.
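As a rough sketch of what creating that endpoint involves, the helper below builds the arguments for the EC2 `create_vpc_endpoint` call. The helper name and all resource IDs are placeholders of mine; the actual boto3 call is left as a comment so nothing is created by accident.

```python
def ssm_endpoint_params(vpc_id, subnet_ids, security_group_ids, region="ap-northeast-1"):
    """Build keyword arguments for EC2 create_vpc_endpoint (SSM interface endpoint)."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.ssm",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Private DNS makes ssm.<region>.amazonaws.com resolve to the endpoint's private IPs.
        "PrivateDnsEnabled": True,
    }


# Placeholder IDs for illustration only; substitute the task's VPC, subnets, and security group.
params = ssm_endpoint_params(
    "vpc-0123456789abcdef0",
    ["subnet-0123456789abcdef0"],
    ["sg-0123456789abcdef0"],
)
print(params["ServiceName"])

# With boto3 installed and credentials available, the call would be:
#   boto3.client("ec2", region_name="ap-northeast-1").create_vpc_endpoint(**params)
```

Two things to check afterwards: the endpoint's security group must allow inbound HTTPS (443) from the task, and private DNS only works if DNS support and DNS hostnames are enabled on the VPC.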
Answers
- Why static creds work: They skip the metadata fetch, and the slightly different timing/code path may land after the public IP is routable. Or the real issue is NACLs/routing that you haven't hit with static creds due to timing.
- Task role credential availability: Credentials from 169.254.170.2 are available immediately. The timeout is reaching SSM, not getting creds.
- 169.254.169.254 (EC2 IMDS): Not used by Fargate. boto3 won't fall back to it.
I switched to Secrets Manager and it works, so this means the network is not the issue: I did not touch the NACLs, and the security group allows all inbound and outbound traffic. Here is the error from the log:
May 9, 2026, 13:40 [entrypoint] SSM attempt 2/5... flyway-container
May 9, 2026, 13:40 [entrypoint] SSM not reachable yet, retrying in 5s... flyway-container
May 9, 2026, 13:40 aws: [ERROR]: Connect timeout on endpoint URL: "https://ssm.ap-northeast-1.amazonaws.com/"
