
ECS Fargate — SSM GetParameter ConnectTimeoutError with task role, works fine with static credentials

1

I have an ECS Fargate task in a public subnet that needs to load a TLS certificate from SSM Parameter Store (SecureString) at container startup via Python/boto3.

Setup:

  • Launch type: Fargate, public subnet, assignPublicIp: ENABLED
  • Security group: outbound All traffic allowed
  • No VPC Interface Endpoints for SSM
  • Region: ap-northeast-1

The problem:

When the container calls ssm.get_parameter() using the IAM task role, it always times out:

urllib3.exceptions.ConnectTimeoutError: Connection to ssm.ap-northeast-1.amazonaws.com timed out. (connect timeout=10)

I added a retry loop (15 attempts × 10s = 2.5 min), but it still fails on every attempt.

Key observation: When I set AWS_ACCESS_KEY_ID + AWS_SECRET_ACCESS_KEY as environment variables in the task (static IAM user credentials), SSM works perfectly. Same subnet, same security group, same code.

Another observation: The task's public IP does not appear immediately at container start — it takes about 60–90 seconds for the ENI public IP to become routable. I suspected this was the cause and added the retry loop, but it still fails even after waiting past that window.

IAM Policy Simulator shows ssm:GetParameter is allowed — but I may have simulated against the wrong principal (my IAM user instead of the task role).

Questions:

  1. Why would static credentials succeed but task role credentials fail for the same SSM endpoint, given identical network config?
  2. Is there a known timing issue with ECS Fargate task role credential availability (AWS_CONTAINER_CREDENTIALS_RELATIVE_URI) vs ENI public IP assignment?
  3. Is the ECS task metadata endpoint (169.254.170.2) available before the public IP is routed? Could boto3 credential resolution be failing silently and falling back to a provider that requires internet (EC2 IMDS 169.254.169.254)?
3 Answers
3

As far as I understand, the issue is a race condition between the Fargate ENI's public IP propagation and the boto3 credential resolution chain.

1. The "Why":

  • Static Credentials: boto3 has the keys in memory instantly. It waits for the network to be ready and then calls SSM.
  • Task Role: boto3 must first fetch credentials from the Task Metadata Endpoint (169.254.170.2). If the container starts and tries to fetch these before the internal routing/ENI is fully initialized, the credential provider fails or hangs, leading to the ConnectTimeoutError when it finally tries to call SSM with incomplete or missing context.

2. Metadata vs. Public IP: The Task Metadata Endpoint (169.254.170.2) is internal and usually available immediately, but if your code hits SSM before the Public IP is routable, the call will time out. Because boto3 caches the failure or the "environment" state during the first few seconds, subsequent retries in the same client session can behave inconsistently.
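One way to check the claim that the credential endpoint answers right away, independently of boto3, is to query it directly with the standard library. The following is only a sketch: `credentials_url` and `fetch_credential_summary` are hypothetical helper names, and it assumes the task exposes AWS_CONTAINER_CREDENTIALS_RELATIVE_URI as described in the question. It returns only non-secret fields so the output is safe to log.

```python
import json
import os
import urllib.request

ECS_CREDENTIALS_HOST = "http://169.254.170.2"  # task credential endpoint from the question


def credentials_url(environ=os.environ, host=ECS_CREDENTIALS_HOST):
    """Build the credential URL, or return None if the task role variable is absent."""
    relative_uri = environ.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
    if not relative_uri:
        return None
    return host + relative_uri


def fetch_credential_summary(url, timeout=2.0):
    """Fetch the credentials document and return only non-secret fields."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        payload = json.load(response)
    return {
        "RoleArn": payload.get("RoleArn"),
        "Expiration": payload.get("Expiration"),
        "HasAccessKey": bool(payload.get("AccessKeyId")),
    }


if __name__ == "__main__":
    url = credentials_url()
    if url is None:
        print("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set in this process")
    else:
        print(fetch_credential_summary(url))
```

If this prints a role ARN and an expiration at container start while the SSM call still times out, credential retrieval is probably not the failing step. If the variable is missing, the entrypoint may be dropping the environment before Python runs.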

Try the following:

  • Best Practice (PrivateLink): Create VPC Interface Endpoints for SSM (com.amazonaws.ap-northeast-1.ssm). This removes the dependency on the Public IP and the Internet Gateway entirely. The task will connect to SSM instantly via the internal AWS backbone.
  • Boto3 Configuration: Increase the specific connection timeouts and retries in your client config to account for the "warm-up" time of the Fargate ENI:
    import boto3
    from botocore.config import Config

    # Longer connect/read timeouts and more retries to ride out ENI warm-up
    config = Config(connect_timeout=30, read_timeout=30, retries={'max_attempts': 10})
    ssm = boto3.client('ssm', config=config)
    
  • IAM Trust Policy: Double-check that your Task Role (not just the Execution Role) has a Trust Relationship allowing ecs-tasks.amazonaws.com to sts:AssumeRole. Without this, the metadata service cannot return valid tokens.
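For the trust policy check in the last bullet, a rough sketch of what "allows ecs-tasks.amazonaws.com to sts:AssumeRole" means in the policy document follows. `trusts_ecs_tasks` is a hypothetical helper, the example policy is illustrative rather than taken from the asker's account, and the check deliberately ignores Condition blocks and wildcards.

```python
import json


def _as_list(value):
    """IAM accepts either a single value or a list in these fields."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]


def trusts_ecs_tasks(trust_policy):
    """Return True if an Allow statement lets ecs-tasks.amazonaws.com call sts:AssumeRole."""
    for statement in _as_list(trust_policy.get("Statement")):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal")
        services = _as_list(principal.get("Service")) if isinstance(principal, dict) else []
        actions = _as_list(statement.get("Action"))
        if "ecs-tasks.amazonaws.com" in services and "sts:AssumeRole" in actions:
            return True
    return False


# Illustrative trust policy; compare it with the one attached to the task role.
example_policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"Service": "ecs-tasks.amazonaws.com"},
      "Action": "sts:AssumeRole"
    }
  ]
}
""")

print(trusts_ecs_tasks(example_policy))  # True
```

The same function can be run against the trust relationship JSON copied from the IAM console for the task role.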
EXPERT
answered a month ago
2

Why Static Credentials Work but Task Role Fails

Both need internet to reach ssm.ap-northeast-1.amazonaws.com. The difference:

  • Static credentials: boto3 already has the keys, calls SSM directly over HTTPS
  • Task role: boto3 first fetches temporary creds from 169.254.170.2 (instant), then calls SSM over HTTPS

The metadata endpoint works immediately. The problem is reaching the SSM API endpoint over the public internet.

Root Cause

Your public subnet ENI takes 60-90s to become routable — but you said retries for 2.5 minutes still fail. That points to one of:

  1. Subnet NACLs blocking return traffic (NACLs are stateless, unlike security groups)
  2. Route table missing 0.0.0.0/0 → igw
  3. DNS resolution failing inside the container

Quick Diagnosis

Run inside the container during the failure:

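import os
import socket

import requests  # third-party; install it if the image lacks it

# DNS resolution: prints the resolved (ip, port) for the SSM endpoint
print(socket.getaddrinfo("ssm.ap-northeast-1.amazonaws.com", 443)[0][4])

# Metadata endpoint (should work)
uri = os.environ.get("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI")
if uri:
    print(requests.get(f"http://169.254.170.2{uri}", timeout=5).status_code)
else:
    print("AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is not set")

# Public internet path to SSM: any HTTP status means reachable, a timeout means it is not
print(requests.get("https://ssm.ap-northeast-1.amazonaws.com", timeout=10).status_code)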

If metadata succeeds but public internet times out — confirmed network issue, not credentials.
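If the image does not ship `requests`, a standard-library probe can separate the candidate causes listed under Root Cause. This is a sketch under assumptions: `probe_tcp` is a hypothetical helper, and it only tries the first address returned by DNS. A DNS failure points at resolution, while a timeout on connect points at routing or NACLs.

```python
import socket


def probe_tcp(host, port=443, timeout=5.0):
    """Classify a TCP connection attempt as 'ok', 'dns_failure', 'timeout', or 'error: ...'."""
    try:
        addresses = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns_failure"
    # Only the first resolved address is tried in this sketch.
    family, socktype, proto, _, sockaddr = addresses[0]
    with socket.socket(family, socktype, proto) as sock:
        sock.settimeout(timeout)
        try:
            sock.connect(sockaddr)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    return "ok"
```

Calling `probe_tcp("ssm.ap-northeast-1.amazonaws.com")` and then the same function against another regional AWS endpoint from inside the container shows whether the timeout is specific to the SSM hostname or affects all outbound HTTPS.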

Fix

Add a VPC Interface Endpoint for SSM:

com.amazonaws.ap-northeast-1.ssm

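This routes SSM traffic privately within the VPC, so there is no internet dependency and it works at container start. Enable private DNS on the endpoint and allow inbound HTTPS (443) from the task's security group on the endpoint's security group. Cost is roughly $7.50/month per AZ plus data processing charges; the hourly rate varies by region.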
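A minimal sketch of the request such an endpoint needs, written as a plain function so the parameters are visible. `ssm_endpoint_request` is a hypothetical helper, and every ID below is a placeholder, not a value from the question. The keys match the parameters of the EC2 CreateVpcEndpoint call.

```python
def ssm_endpoint_request(region, vpc_id, subnet_ids, security_group_ids):
    """Build keyword arguments for EC2 CreateVpcEndpoint for an SSM interface endpoint."""
    return {
        "VpcEndpointType": "Interface",
        "VpcId": vpc_id,
        "ServiceName": f"com.amazonaws.{region}.ssm",
        "SubnetIds": list(subnet_ids),
        "SecurityGroupIds": list(security_group_ids),
        # Private DNS makes ssm.<region>.amazonaws.com resolve to the endpoint's
        # private IPs, so existing client code needs no endpoint_url change.
        "PrivateDnsEnabled": True,
    }


# Placeholder IDs for illustration only; substitute real values.
request = ssm_endpoint_request(
    region="ap-northeast-1",
    vpc_id="vpc-EXAMPLE",
    subnet_ids=["subnet-EXAMPLE"],
    security_group_ids=["sg-EXAMPLE"],
)
print(request["ServiceName"])  # com.amazonaws.ap-northeast-1.ssm
```

With boto3 installed, the dictionary can be passed as `boto3.client("ec2", region_name="ap-northeast-1").create_vpc_endpoint(**request)`. Private DNS requires DNS support and DNS hostnames to be enabled on the VPC, and the endpoint's security group must accept port 443 from the task.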

Answers

  • Why static creds work: They skip the metadata fetch, and the slightly different timing/code path may land after the public IP is routable. Or the real issue is NACLs/routing that you haven't hit with static creds due to timing.
  • Task role credential availability: Credentials from 169.254.170.2 are available immediately. The timeout is reaching SSM, not getting creds.
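  • 169.254.169.254 (EC2 IMDS): Not available on Fargate. As long as AWS_CONTAINER_CREDENTIALS_RELATIVE_URI is set in the process environment, boto3 uses the container credentials provider and does not fall through to IMDS.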
answered a month ago
EXPERT
reviewed a month ago
2

I switched to Secrets Manager and it works, so this means the network is not the issue, because I did not touch the NACLs and the security group allows all inbound and outbound traffic. Here is the error from the log:

May 9, 2026, 13:40 [entrypoint] SSM attempt 2/5... flyway-container
May 9, 2026, 13:40 [entrypoint] SSM not reachable yet, retrying in 5s... flyway-container
May 9, 2026, 13:40 aws: [ERROR]: Connect timeout on endpoint URL: "https://ssm.ap-northeast-1.amazonaws.com/"

answered a month ago
EXPERT
reviewed a month ago
