How do I troubleshoot the error message "unable to pull secrets or registry auth" in Amazon ECS?

I receive an "unable to pull secrets or registry auth" error message when I launch an Amazon Elastic Container Service (Amazon ECS) task.

Short description

When you launch an Amazon ECS task, you receive one of the following error messages:

  • "ResourceInitializationError: unable to pull secrets or registry auth: pull command failed: : signal: killed"
  • "ResourceInitializationError: unable to pull secrets or registry auth: execution resource retrieval failed: unable to retrieve secret from asm: service call has been retried."
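To confirm which of these messages a stopped task reported, you can read the task's stop reason with the AWS CLI. The following is a minimal sketch, not an official procedure. The function names are hypothetical helpers, and the cluster name and task ID in the example are placeholders.

```shell
# Hypothetical helper: print the stop code and stopped reason for one task.
# Arguments: $1 = cluster name, $2 = task ID or task ARN.
task_stop_reason() {
  aws ecs describe-tasks \
    --cluster "$1" \
    --tasks "$2" \
    --query 'tasks[0].[stopCode,stoppedReason]' \
    --output text
}

# Hypothetical helper: succeed if the text contains the error that this
# article covers.
is_secrets_or_auth_error() {
  case "$1" in
    *"unable to pull secrets or registry auth"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (placeholder names):
#   reason=$(task_stop_reason my-cluster 0123456789abcdef0123456789abcdef)
#   is_secrets_or_auth_error "$reason" && echo "This article applies"
```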

AWS Fargate platform version 1.4.0 or later uses the task elastic network interface to pull the image and secrets. All network traffic flows through the elastic network interface within your Amazon Virtual Private Cloud (Amazon VPC), so you can view this traffic in your Amazon VPC Flow Logs. Because the elastic network interface is placed within your Amazon VPC, the task also uses your network configuration.

The Amazon ECS container agent uses the task execution AWS Identity and Access Management (IAM) role to get information from the following services:

  • AWS Systems Manager Parameter Store
  • AWS Secrets Manager

For data that's encrypted with a customer managed AWS Key Management Service (AWS KMS) key, grant the following permissions to the task execution IAM role:

  • ssm:GetParameters
  • secretsmanager:GetSecretValue
  • kms:Decrypt
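As an illustration, the three permissions above could be granted with an inline policy similar to the following sketch. The helper name, the role name ecsTaskExecutionRole, and the policy name EcsSecretsAccess are assumptions for the example, and the three ARNs are placeholders that you supply. Scope the Resource list to the parameters, secrets, and key that your task actually references.

```shell
# Hypothetical helper: print an inline IAM policy document.
# Arguments: $1 = parameter ARN, $2 = secret ARN, $3 = KMS key ARN.
render_secrets_policy() {
  cat <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ssm:GetParameters",
        "secretsmanager:GetSecretValue",
        "kms:Decrypt"
      ],
      "Resource": [
        "$1",
        "$2",
        "$3"
      ]
    }
  ]
}
EOF
}

# Example (role and policy names are placeholders):
#   render_secrets_policy "$PARAM_ARN" "$SECRET_ARN" "$KEY_ARN" > policy.json
#   aws iam put-role-policy \
#     --role-name ecsTaskExecutionRole \
#     --policy-name EcsSecretsAccess \
#     --policy-document file://policy.json
```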

Resolution

Use the AWSSupport-TroubleshootECSTaskFailedToStart runbook to troubleshoot the Amazon ECS tasks that fail to start. If the runbook's output doesn't provide recommendations, then use the manual troubleshooting steps in the sections that follow.

Important:

  • Use the runbook in the same AWS Region where your ECS cluster resources are located.
  • When you use the runbook, use the most recently failed task ID so that the task state cleanup doesn't interrupt the analysis during automation. If the failed task is part of the Amazon ECS service, then use the most recently failed task in the service. The failed task must be visible in ECS:DescribeTasks during automation execution. By default, stopped ECS tasks are visible for 1 hour after the tasks enter the Stopped state.

Use the TroubleshootECSTaskFailedToStart runbook

To run the AWSSupport-TroubleshootECSTaskFailedToStart runbook, complete the following steps:

  1. Open the AWS Systems Manager console.
  2. In the navigation pane, under Change Management, choose Automation.
  3. Choose Execute automation.
  4. Choose the Owned by Amazon tab.
  5. Under Automation document, search for TroubleshootECSTaskFailedToStart.
  6. Select the AWSSupport-TroubleshootECSTaskFailedToStart card.
    Note: Make sure that you select the radio button on the card and not the hyperlinked automation name.
  7. Choose Next.
    Note: After execution, analysis results are populated in the Global output section. However, wait for the document status to move to Success. Also, watch for any exceptions in the Output section.
  8. For Execute automation document, choose Simple execution.
  9. In the Input parameters section, for AutomationAssumeRole, enter the ARN of the role that allows Systems Manager Automation to perform actions.
    Note: Be sure that either the AutomationAssumeRole or the IAM user or role has the required IAM permissions to run the AWSSupport-TroubleshootECSTaskFailedToStart runbook. If you don't specify an IAM role, then Systems Manager Automation uses the permissions of the IAM user or role that runs the runbook. For more information about how to create the assume role for Systems Manager Automation, see Task 1: Create a service role for Automation.
  10. For ClusterName, enter the name of the cluster where the task failed to start.
  11. For TaskId, enter the identification for the task that most recently failed.
  12. Choose Execute.
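If you prefer the AWS CLI to the console, the same runbook can likely be started as in the following sketch. The function names are hypothetical. The input parameter names (AutomationAssumeRole, ClusterName, TaskId) come from the steps above, and the values in the example are placeholders.

```shell
# Hypothetical helper: start the runbook and print the execution ID.
# Arguments: $1 = Region, $2 = cluster name, $3 = task ID,
#            $4 = automation role ARN (optional).
# The body runs in a subshell so that "params" stays local.
start_ecs_troubleshooting() (
  params="ClusterName=$2,TaskId=$3"
  if [ -n "${4:-}" ]; then
    params="AutomationAssumeRole=$4,$params"
  fi
  aws ssm start-automation-execution \
    --region "$1" \
    --document-name "AWSSupport-TroubleshootECSTaskFailedToStart" \
    --parameters "$params" \
    --query 'AutomationExecutionId' \
    --output text
)

# Hypothetical helper: print the current status of an execution.
# Arguments: $1 = Region, $2 = execution ID.
automation_status() {
  aws ssm get-automation-execution \
    --region "$1" \
    --automation-execution-id "$2" \
    --query 'AutomationExecution.AutomationExecutionStatus' \
    --output text
}

# Example (placeholder values):
#   id=$(start_ecs_troubleshooting us-east-1 my-cluster 0123456789abcdef)
#   automation_status us-east-1 "$id"
```

Run the commands in the same Region as the cluster, and check the status until the execution reports Success before you read the outputs.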

Based on the output of the automation, use one of the following manual troubleshooting steps.

Check the routes from your subnets to the internet

If your Fargate task is in a public subnet, then verify that your task has an assigned public IP address. Also, confirm that the subnet's route table has a default route (0.0.0.0/0) to an internet gateway. When you launch a new task or create a new service, turn on Auto-assign public IP.

If you use the following configurations, then don't use the internet gateway in the public subnet to reach the Secrets Manager or Systems Manager:

  • The Secrets Manager or Systems Manager VPC endpoints are in a public subnet.
  • You turned on AmazonProvidedDNS in your Amazon VPC DHCP settings.

Instead, use an Amazon VPC endpoint.

Note: You can't turn on Auto-assign public IP for existing tasks. To reconfigure existing services, use the AWS Command Line Interface (AWS CLI). Don't use the AWS Management Console. If you used an AWS CloudFormation stack to create your Amazon ECS service, then modify the NetworkConfiguration property for AWS::ECS::Service to update the service.
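For an existing service, the AWS CLI change that the preceding note describes might look like the following sketch. The function name is hypothetical, and the cluster, service, subnet, and security group values are placeholders. The command replaces the service's whole awsvpc configuration, so pass every subnet and security group that the service must keep.

```shell
# Hypothetical helper: turn on public IP assignment for an existing service.
# Arguments: $1 = cluster, $2 = service,
#            $3 = comma-separated subnet IDs,
#            $4 = comma-separated security group IDs.
enable_public_ip() {
  aws ecs update-service \
    --cluster "$1" \
    --service "$2" \
    --network-configuration \
      "awsvpcConfiguration={subnets=[$3],securityGroups=[$4],assignPublicIp=ENABLED}"
}

# Example (placeholder values):
#   enable_public_ip my-cluster my-service subnet-0abc,subnet-0def sg-0abc
```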

If your Fargate task is in a private subnet, then verify that your task has a default route (0.0.0.0/0) to the internet connectivity source.

The internet connectivity source can be a NAT gateway, AWS PrivateLink, or other source:

  • If you use a NAT gateway, then place your NAT gateway in a public subnet. For more information, see Architecture with an internet gateway and a NAT gateway.
  • If you use PrivateLink, then be sure that your Fargate infrastructure can use the security groups for your Amazon VPC endpoints.
  • If you use a custom domain name server, then confirm the settings for DNS queries. The task must have outbound access on port 53 over both UDP and TCP to send DNS queries. It must also have HTTPS access on port 443.
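To see where a subnet's default route points, a check along the following lines might help. This is a sketch under stated assumptions: the function names are hypothetical, and the lookup only finds route tables that are explicitly associated with the subnet. If the subnet relies on the VPC's main route table, then the command prints nothing and you must inspect the main route table separately.

```shell
# Hypothetical helper: print the gateway and NAT gateway IDs of the
# default route (0.0.0.0/0) for a subnet's explicitly associated route table.
# Argument: $1 = subnet ID. Missing values are printed as "None".
default_route_targets() {
  aws ec2 describe-route-tables \
    --filters "Name=association.subnet-id,Values=$1" \
    --query "RouteTables[].Routes[?DestinationCidrBlock=='0.0.0.0/0'][].[GatewayId,NatGatewayId]" \
    --output text
}

# Hypothetical helper: name the kind of target from its ID prefix.
classify_route_target() {
  case "$1" in
    igw-*) echo "internet gateway" ;;
    nat-*) echo "NAT gateway" ;;
    *) echo "other" ;;
  esac
}

# Example (placeholder subnet ID):
#   default_route_targets subnet-0abc | tr '\t' '\n' | grep -v '^None$' |
#     while read -r target; do classify_route_target "$target"; done
```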

Check your network ACL and security group settings

Verify that your network access control list (network ACL) and security groups don't block outbound access to port 443 from the subnet. For more information, see Control traffic to your AWS resources using security groups.

Note: Fargate tasks must have outbound access to port 443 to allow outgoing traffic and access Amazon ECS endpoints.
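One way to review a security group's outbound rules for port 443 is sketched below. The function names are hypothetical, and the check is deliberately narrow: it looks only at the protocol and port range of each rule, not at the rule's destination, and it doesn't evaluate network ACLs.

```shell
# Hypothetical helper: print protocol, from port, and to port for each
# outbound rule of a security group. Argument: $1 = security group ID.
list_egress_rules() {
  aws ec2 describe-security-groups \
    --group-ids "$1" \
    --query 'SecurityGroups[].IpPermissionsEgress[].[IpProtocol,FromPort,ToPort]' \
    --output text
}

# Hypothetical helper: succeed if a rule's protocol and port range cover
# TCP port 443. Arguments: $1 = protocol, $2 = from port, $3 = to port.
# A protocol of -1 means all protocols and all ports.
rule_allows_tcp_443() {
  case "$1" in
    -1) return 0 ;;
    tcp|6) [ "$2" -le 443 ] && [ "$3" -ge 443 ] ;;
    *) return 1 ;;
  esac
}

# Example (placeholder security group ID):
#   list_egress_rules sg-0abc | while read -r proto from to; do
#     rule_allows_tcp_443 "$proto" "$from" "$to" && echo "443 allowed by: $proto $from-$to"
#   done
```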

Check your Amazon VPC endpoints

If you use PrivateLink, then you must create the required endpoints. The following endpoints are required for Fargate platform versions 1.4.0 or later:

  • com.amazonaws.region.ecr.dkr
  • com.amazonaws.region.ecr.api
  • S3 gateway endpoint
  • com.amazonaws.region.logs

For more information, see Considerations for Amazon Elastic Container Registry (Amazon ECR) VPC endpoints.
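To compare the preceding list with the endpoints that exist in your VPC, you could use a check like the following sketch. The function names are hypothetical. The sketch assumes that the S3 gateway endpoint uses the service name com.amazonaws.region.s3, and it compares service names only. It doesn't verify the endpoint type, state, or policy.

```shell
# Hypothetical helper: print the endpoint service names from the list above.
# Argument: $1 = Region.
required_endpoints() {
  printf '%s\n' \
    "com.amazonaws.$1.ecr.dkr" \
    "com.amazonaws.$1.ecr.api" \
    "com.amazonaws.$1.s3" \
    "com.amazonaws.$1.logs"
}

# Hypothetical helper: print the required service names that have no
# endpoint in the VPC. Arguments: $1 = Region, $2 = VPC ID.
# The body runs in a subshell so that "existing" stays local.
missing_endpoints() (
  existing=$(aws ec2 describe-vpc-endpoints \
    --region "$1" \
    --filters "Name=vpc-id,Values=$2" \
    --query 'VpcEndpoints[].ServiceName' \
    --output text | tr '\t' '\n')
  required_endpoints "$1" | while read -r name; do
    printf '%s\n' "$existing" | grep -qxF "$name" || printf '%s\n' "$name"
  done
)

# Example (placeholder values):
#   missing_endpoints us-east-1 vpc-0abc
```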

Note: If your task definition uses Secrets Manager, Systems Manager parameters, or Amazon CloudWatch Logs, then you might need to define VPC endpoints for those services as well. For more information, see the VPC endpoint documentation for each service.

For PrivateLink, check that the security group for your Amazon VPC endpoints allows traffic from the Fargate task security group or the Fargate task VPC CIDR range on TCP port 443.

To confirm that the Fargate infrastructure has service access, check the VPC endpoint policies and endpoint policies for Amazon Simple Storage Service (Amazon S3).

Check your IAM roles and permissions

The task execution role grants the required permissions to the Amazon ECS container and Fargate agents to make API calls for the task. Fargate requires this role when you take the following actions:

  • Pull a container image from Amazon ECR.
  • Use the awslogs log driver.
  • Use private registry authentication.
  • Use Secrets Manager secrets or Systems Manager Parameter Store parameters to reference sensitive data.

If your use case involves any of the preceding scenarios, then define the required permissions in your task execution role. For a complete list of required permissions, see Amazon ECS task execution IAM role.
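To test whether a task execution role is allowed to retrieve secrets, the IAM policy simulator can be called from the AWS CLI, as in the following sketch. The function name is hypothetical and the role ARN in the example is a placeholder. The three actions are the ones listed in the short description. Without the --resource-arns option, the simulation evaluates the actions against all resources, so the result is only a first indication. The caller also needs permission to run the simulation.

```shell
# Hypothetical helper: print the secrets-related actions that the role's
# policies don't allow. Argument: $1 = task execution role ARN.
denied_secret_actions() {
  aws iam simulate-principal-policy \
    --policy-source-arn "$1" \
    --action-names ssm:GetParameters secretsmanager:GetSecretValue kms:Decrypt \
    --query 'EvaluationResults[].[EvalActionName,EvalDecision]' \
    --output text | awk '$2 != "allowed" { print $1 }'
}

# Example (placeholder role ARN):
#   denied_secret_actions arn:aws:iam::111122223333:role/ecsTaskExecutionRole
```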

Check the referenced sensitive information in the Amazon ECS task definition

Note: If you receive errors when you run AWS CLI commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

Check if the secret and parameter names match the referenced names in your Amazon ECS task definition. Then, check if the values in the container definition match the values in your Amazon ECS task definition. For more information, see How can I pass secrets or sensitive information securely to containers in an Amazon ECS task?

If the Systems Manager Parameter Store parameter and the task are in the same Region, then use either the full ARN or the name of the parameter. If the parameter exists in a different Region, then you must specify the full ARN.

To check the Systems Manager parameter name and ARN, complete the following steps:

  1. Open the AWS Systems Manager console.
  2. In the navigation pane, choose Parameter Store, and then confirm your Parameter Store name.
  3. To get the parameter's ARN, use the AWS CLI to run the following command. Replace name_of_parameter_store_secret with your Parameter Store secret name:
    $ aws ssm get-parameter --name <name_of_parameter_store_secret> --with-decryption
    Note: Parameters that reference Secrets Manager secrets can't use the Parameter Store versioning or history features. For more information, see Restrictions.
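The comparison described in this section can also be scripted. The following sketch lists the secret references in a task definition and prints only a parameter's ARN, so that you can compare the two. The function names are hypothetical, and the task definition and parameter names in the example are placeholders.

```shell
# Hypothetical helper: print each secret's name and valueFrom from a task
# definition. Argument: $1 = task definition family, family:revision, or ARN.
task_definition_secrets() {
  aws ecs describe-task-definition \
    --task-definition "$1" \
    --query 'taskDefinition.containerDefinitions[].secrets[].[name,valueFrom]' \
    --output text
}

# Hypothetical helper: print only the ARN of a parameter.
# Argument: $1 = parameter name.
parameter_arn() {
  aws ssm get-parameter \
    --name "$1" \
    --query 'Parameter.ARN' \
    --output text
}

# Hypothetical helper: succeed if a valueFrom reference ($1) equals the
# parameter's name ($2) or its full ARN ($3).
reference_matches() {
  [ "$1" = "$2" ] || [ "$1" = "$3" ]
}

# Example (placeholder names):
#   task_definition_secrets my-task-family
#   parameter_arn my_parameter
```

The match is case sensitive, so a reference such as DB_Password doesn't match a parameter that's named db_password.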

Related information

Checking stopped tasks for errors

AWS OFFICIAL
Updated 8 months ago
4 Comments

It amazes me how poor the product-market-fit for Fargate ECS (and Batch) tasks is. The concept is to provide users with a lightweight ability to run a docker container, without the need for an EC2 fleet, infrastructure, networking etc. (I quote from the AWS landing page: "Deploy and manage your applications, not infrastructure.")

Now, when using the service, the user must go through a cascade of infrastructure configurations: ECR, ECS clusters, ECS task definitions, ECS task create/run definition, VPCs, subnets, security groups... None of them are optional, but all are core to making Fargate run. If the user does not manipulate the routing table, you can't pull your private docker image from ECR - unless using a "globally public IP". Really? And things get even more complex when running "Batch" jobs.

It's astonishing how the value proposition has grown into a user-unfriendly journey that requires customers to hire network admins. :( Now, being tech-savvy, I understand that the above complexity might be required, even desired, in advanced use cases. But you have abandoned the entry-level users: If running a "Hello, world" example in a private (secure) environment from a private docker image takes 100 configuration steps, rather than following a 3-step launch wizard, you lose me. (Maybe a friendly analogy would be: I currently feel like installing a printer on Windows 3.1, troubleshooting whether the serial port has been connected while the BIOS loaded...)

replied a year ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied a year ago

The "AWSSupport-TroubleshootECSTaskFailedToStart" link is broken. https://awssupport-troubleshootecstaskfailedtostart/ is not a valid URL. So frustrating that something so complicated as getting a simple (in concept) ECS task running only to have docs that fail. +1 to the first comment above about the overwhelming complexity of doing something conceptually very simple.

jimmbo
replied 10 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 10 months ago