How do I troubleshoot scaling issues with my Amazon ECS capacity provider?


I set up a capacity provider for my Amazon Elastic Container Service (Amazon ECS) cluster. However, the capacity provider doesn't scale as expected.

Short description

The capacity provider for your Amazon ECS cluster doesn't automatically scale in or scale out because of one or more of the following reasons:

  • The Amazon ECS service isn't associated with the capacity provider.
  • The scaling policies related to the capacity provider aren't attached to the Auto Scaling group.
  • The target capacity percentage isn't correctly configured in the capacity provider.
  • You're using managed scaling for the capacity provider, and the Auto Scaling group has custom scaling policies attached to it.
  • The Auto Scaling group has launched the container instance, but it can't join the cluster.
  • Your container instances are protected from scaling in.
  • The capacity provider is stuck in a failed state.
  • The Auto Scaling group is stuck in a loop of scaling out and scaling in.

Resolution

Note: If you receive errors when you run AWS Command Line Interface (AWS CLI) commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

To scale the capacity providers, complete the following tasks.

The Amazon ECS service isn't associated with the capacity provider

To check that the Amazon ECS service is associated with the capacity provider, run the describe-services command:

aws ecs describe-services --cluster example-cluster --services example-service --region example-region --query "services[].capacityProviderStrategy"

If your Amazon ECS service is associated with the capacity provider, then the output looks similar to the following example:

[
  [
    {
      "capacityProvider": "example-capacity-provider",
      "weight": 1,
      "base": 1
    }
  ]
]

Be sure that the capacityProviderStrategy field isn't null in the output. To view the configuration of the service, review AWS CloudTrail events for CreateService and UpdateService API calls.

To resolve this issue, run the update-service command to update the Amazon ECS service. You can also use the Amazon ECS console to update the service.
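As a minimal sketch, the check above can be scripted against the JSON that the describe-services query returns. The helper name has_capacity_provider_strategy is hypothetical, and the sample string mirrors the example output shown earlier. The sketch assumes that a service without a strategy appears as an empty or null result for this query.

```python
import json


def has_capacity_provider_strategy(query_output: str) -> bool:
    """Return True if every service in the query output lists at least one capacity provider."""
    strategies = json.loads(query_output)
    # An empty or null result means that no service reported a strategy.
    if not strategies:
        return False
    # Each inner list is the strategy of one service; it must not be empty.
    return all(bool(strategy) for strategy in strategies)


# Sample that mirrors the example output of the describe-services query.
sample_output = '[[{"capacityProvider": "example-capacity-provider", "weight": 1, "base": 1}]]'

print(has_capacity_provider_strategy(sample_output))  # True
print(has_capacity_provider_strategy("[]"))           # False
```

If the function returns False, then the service likely uses a launch type instead of a capacity provider strategy, and the service needs an update.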

The scaling policies related to the capacity provider aren't attached to the Auto Scaling group

When you create a capacity provider and you associate it with an Auto Scaling group, the Auto Scaling group creates a scaling policy. That scaling policy uses target tracking to modify the capacity so that it can accommodate cluster loads.

To troubleshoot this issue, review CloudTrail events for UpdateAutoScalingGroup, CreateCapacityProvider, and UpdateCapacityProvider APIs.

To verify that the Auto Scaling group is created as a cluster attachment, run the describe-clusters command:

aws ecs describe-clusters --clusters example-cluster --include ATTACHMENTS --region example-region --query "clusters[].attachments[]"

The output of the command looks similar to the following example:

[
  {
    "id": "100a23456-5f0b-4abc-b998-d6789d111a",
    "type": "as_policy",
    "status": "CREATED",
    "details": [
      {
        "name": "capacityProviderName",
        "value": "example-capacityProvider"
      },
      {
        "name": "scalingPlanName",
        "value": "ECSManagedAutoScalingPlan-bb60c8fa-3ed7-4808-b39c-abcdef2345"
      }
    ]
  }
]
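To read this output programmatically, the following sketch extracts the scaling plan name for each capacity provider from the attachments list. The function name managed_scaling_plans is hypothetical, and the sample data is copied from the example output above. The sketch only considers attachments of type as_policy with status CREATED.

```python
import json


def managed_scaling_plans(attachments_json: str) -> dict:
    """Map each capacity provider name to its scaling plan name for CREATED as_policy attachments."""
    plans = {}
    for attachment in json.loads(attachments_json):
        # Skip attachments that aren't scaling policies or aren't fully created.
        if attachment.get("type") != "as_policy" or attachment.get("status") != "CREATED":
            continue
        # Flatten the name/value pairs into a dictionary for easy lookup.
        details = {d["name"]: d["value"] for d in attachment.get("details", [])}
        if "capacityProviderName" in details:
            plans[details["capacityProviderName"]] = details.get("scalingPlanName")
    return plans


sample_attachments = """
[
  {
    "id": "100a23456-5f0b-4abc-b998-d6789d111a",
    "type": "as_policy",
    "status": "CREATED",
    "details": [
      {"name": "capacityProviderName", "value": "example-capacityProvider"},
      {"name": "scalingPlanName", "value": "ECSManagedAutoScalingPlan-bb60c8fa-3ed7-4808-b39c-abcdef2345"}
    ]
  }
]
"""

print(managed_scaling_plans(sample_attachments))
```

A capacity provider that's missing from the returned dictionary has no created scaling policy attachment in the cluster.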

If you use a managed scaling policy, then complete the following steps to check whether the policy is attached to the Auto Scaling group:

  1. Open the Amazon ECS console.
  2. Open the cluster that you want to check.
  3. Choose the Infrastructure tab.
  4. Under the Capacity providers tab, choose the Auto Scaling group for the capacity provider that you want to check. The Auto Scaling groups page in the Amazon EC2 console appears.
  5. Choose the Automatic Scaling tab.
  6. For Auto Scaling group, confirm that the scaling policy uses the metric CapacityProviderReservation.

The target capacity percentage isn't correctly configured in the capacity provider

Amazon ECS uses the target capacity value as the target for the CloudWatch metric in the Amazon ECS managed target tracking scaling policy. This target capacity value is matched on a best effort basis. The allowed values are integers from 1 to 100. For example, if you set the target capacity to 100%, then all instances are utilized, and any instances that aren't running tasks are scaled in. To keep spare capacity, set the target capacity to a value that's lower than 100%, based on your requirements.

To update the capacity provider with the correct target capacity percentage, see Updating an Amazon ECS capacity provider.
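To make the effect of the target capacity concrete, the following sketch uses the relationship described in the Deep dive on Amazon ECS cluster auto scaling post, where CapacityProviderReservation is approximately M / N × 100. M is the number of instances needed to run all tasks, and N is the number of instances that are currently running. The function names are hypothetical, and the calculation is a simplification that ignores the special cases for zero running instances and the managed scaling step limits.

```python
def capacity_provider_reservation(needed: int, running: int) -> float:
    """Approximate the CapacityProviderReservation metric as M / N * 100."""
    if running <= 0:
        raise ValueError("running must be positive in this simplified sketch")
    return needed / running * 100


def instances_for_target(needed: int, target_capacity: int) -> int:
    """Estimate how many instances keep the metric at or below the target capacity."""
    if not 0 < target_capacity <= 100:
        raise ValueError("target_capacity must be an integer from 1 to 100")
    # Integer ceiling of needed * 100 / target_capacity.
    return (needed * 100 + target_capacity - 1) // target_capacity


print(capacity_provider_reservation(4, 4))  # 100.0
print(instances_for_target(4, 100))         # 4, no spare capacity
print(instances_for_target(4, 50))          # 8, half of the instances stay spare
```

In this simplified model, a target capacity of 50% keeps about twice as many instances running as the tasks need.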

The Auto Scaling group has launched the container instance, but it can't join the cluster

Complete the following steps:

Your container instances are protected from scaling in

For capacity providers that use managed termination protection, Amazon ECS prevents the termination of Amazon EC2 instances with tasks during a scale-in action. For more information, see Control the instances Amazon ECS terminates.

To make sure that the Auto Scaling group can terminate old instances when you change the desired capacity, complete the following tasks:

For more information, see How do I resolve the error "The managed termination protection setting for the capacity provider is invalid" in Amazon ECS?
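To see which instances are currently protected, you can inspect the JSON that the aws autoscaling describe-auto-scaling-groups command returns. The following sketch is a hypothetical helper that lists instances with the ProtectedFromScaleIn flag set. The group name and instance IDs in the sample are made up for illustration.

```python
import json


def protected_instances(describe_output: str) -> dict:
    """Map each Auto Scaling group name to the IDs of its scale-in protected instances."""
    result = {}
    for group in json.loads(describe_output).get("AutoScalingGroups", []):
        protected = [
            instance["InstanceId"]
            for instance in group.get("Instances", [])
            if instance.get("ProtectedFromScaleIn")
        ]
        result[group["AutoScalingGroupName"]] = protected
    return result


# Hypothetical sample in the shape of the describe-auto-scaling-groups output.
sample_groups = """
{
  "AutoScalingGroups": [
    {
      "AutoScalingGroupName": "example-asg",
      "Instances": [
        {"InstanceId": "i-0123456789abcdef0", "ProtectedFromScaleIn": true},
        {"InstanceId": "i-0fedcba9876543210", "ProtectedFromScaleIn": false}
      ]
    }
  ]
}
"""

print(protected_instances(sample_groups))
```

Instances in the returned lists aren't terminated by a scale-in action until their protection is removed.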

The capacity provider is stuck in a failed state

When you use a capacity provider, it's a best practice to create a new Auto Scaling group and not reuse an existing group. Instances in the existing group that were already running and registered to an Amazon ECS cluster might not register correctly with the capacity provider.

To see the status of the capacity provider, run the describe-capacity-providers command. Also, review CloudTrail events, and check for errors related to the CreateCapacityProvider API.
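As a sketch of how to spot a failed capacity provider, the following hypothetical helper scans the describe-capacity-providers JSON for an updateStatus value that ends in _FAILED and returns the reported reason. The provider names and the reason text in the sample are made up for illustration.

```python
import json


def failed_capacity_providers(describe_output: str) -> list:
    """Return (name, updateStatus, updateStatusReason) for providers whose last update failed."""
    failures = []
    for provider in json.loads(describe_output).get("capacityProviders", []):
        update_status = provider.get("updateStatus", "")
        # Failed values include UPDATE_FAILED and DELETE_FAILED.
        if update_status.endswith("_FAILED"):
            failures.append(
                (provider["name"], update_status, provider.get("updateStatusReason", ""))
            )
    return failures


# Hypothetical sample in the shape of the describe-capacity-providers output.
sample_providers = """
{
  "capacityProviders": [
    {
      "name": "example-capacity-provider",
      "status": "ACTIVE",
      "updateStatus": "UPDATE_FAILED",
      "updateStatusReason": "example reason"
    },
    {
      "name": "example-healthy-provider",
      "status": "ACTIVE",
      "updateStatus": "UPDATE_COMPLETE"
    }
  ]
}
"""

print(failed_capacity_providers(sample_providers))
```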

The Auto Scaling group is stuck in a loop of scaling out and scaling in

When the metric value that's specified in your Amazon ECS service scaling policy spikes, the Auto Scaling group scales out and launches instances. However, if the metric value drops after the sudden spike, then the Auto Scaling group tries to scale in the instances. If the metric value fluctuates several times within a short time, then the Auto Scaling group gets stuck in a scaling loop. To avoid this issue, define the threshold value of the metric in the scaling policy to match your workload.
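To judge whether a threshold matches your workload, it can help to count how often the metric crosses that threshold over a short window. The following sketch is a hypothetical illustration with made-up data points, not part of any AWS API. A high crossing count suggests that the threshold sits inside the normal fluctuation range of the metric.

```python
def count_threshold_crossings(values, threshold):
    """Count how many times consecutive data points move across the threshold."""
    crossings = 0
    previous_above = None
    for value in values:
        above = value > threshold
        # A crossing occurs when this point is on the other side of the threshold.
        if previous_above is not None and above != previous_above:
            crossings += 1
        previous_above = above
    return crossings


# Made-up CPU utilization samples for illustration.
fluctuating = [40, 80, 45, 85, 50, 90]
steady_growth = [40, 45, 50, 80, 85, 90]

print(count_threshold_crossings(fluctuating, 70))    # 5
print(count_threshold_crossings(steady_growth, 70))  # 1
```

In the first series, each crossing could trigger an opposite scaling action, which is the loop that this section describes.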

Related information

Deep dive on Amazon ECS cluster auto scaling

How do I resolve the DELETE_FAILED error when I delete the capacity provider in Amazon ECS?

Amazon ECS clusters for the Fargate launch type

Amazon ECS capacity providers for the EC2 launch type

AWS OFFICIAL
Updated 12 days ago