How do I troubleshoot issues when I set up Cluster Autoscaler on an Amazon EKS cluster?
I want to troubleshoot issues when I launch Cluster Autoscaler on my Amazon Elastic Kubernetes Service (Amazon EKS) cluster.
Resolution
Prerequisite
Install or update eksctl to the latest version.
Note: The --region variable isn't always defined in the commands because the default value for your AWS Region is used. To check the default value, run the AWS Command Line Interface (AWS CLI) configure command. If you need to change the AWS Region, then use the --region flag. If you receive errors when you run AWS CLI commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.
Cluster Autoscaler pod is in a CrashLoopBackOff status
Note: Replace the placeholder values in code snippets with your own values.
- To check the Cluster Autoscaler pod status, run the following command:
kubectl get pods -n kube-system | grep cluster-autoscaler
The following is an example of a Cluster Autoscaler pod that has a CrashLoopBackOff status:
NAME                            READY   STATUS             RESTARTS      AGE
cluster-autoscaler-xxxx-xxxxx   0/1     CrashLoopBackOff   3 (20s ago)   99s
- To describe the cluster-autoscaler pod, run the following command:
kubectl describe pod cluster-autoscaler-xxxx-xxxxx -n kube-system
The following is an example of the output:
Name:           cluster-autoscaler-xxxx-xxxxx
Namespace:      kube-system
State:          Waiting
  Reason:       CrashLoopBackOff
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
...
- If the output shows an OOMKilled issue, then increase the memory resource requests and limits of the cluster-autoscaler deployment.
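For example, you can raise the memory values with kubectl set resources. This is a sketch: the 600Mi and 1Gi values are assumptions to tune for your cluster size, and the deployment and container are assumed to use the default name cluster-autoscaler.

```shell
# Hypothetical values: adjust the request and limit for your environment.
kubectl -n kube-system set resources deployment cluster-autoscaler \
  --containers=cluster-autoscaler \
  --requests=memory=600Mi \
  --limits=memory=1Gi
```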
- To view the Cluster Autoscaler pod logs, run the following command:
kubectl logs -f -n kube-system -l app=cluster-autoscaler
The following is an example of a log that shows AWS Identity and Access Management (IAM) permissions issues:
Failed to create AWS Manager: cannot autodiscover ASGs: AccessDenied: User: xxx is not authorized to perform: autoscaling: DescribeTags because no identity-based policy allows the autoscaling:DescribeTags action status code: 403, request id: xxxxxxxx
If the logs show that there are IAM permissions issues, then complete the following steps:
Check that an OIDC provider is associated with the EKS cluster
- To check whether you already have an IAM OpenID Connect (OIDC) provider for your cluster, run the following command:
oidc_id=$(aws eks describe-cluster --name example-cluster --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
- To check whether there is already an IAM OIDC provider with your cluster's ID in your account, run the following command:
aws iam list-open-id-connect-providers | grep $oidc_id | cut -d "/" -f4
Note: If an output is returned, then you already have an IAM OIDC provider for your cluster and you can skip the next step. If no output is returned, then proceed to the next step.
- To create an IAM OIDC identity provider for your cluster, run the following command:
eksctl utils associate-iam-oidc-provider --cluster example-cluster --approve
Check that the Cluster Autoscaler service account is annotated with the IAM role
- Run the following command:
kubectl get serviceaccount cluster-autoscaler -n kube-system -o yaml
The following is the expected outcome:
apiVersion: v1
kind: ServiceAccount
metadata:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::012345678912:role/<cluster_auto_scaler_iam_role>
  name: cluster-autoscaler
  namespace: kube-system
Check the IAM policy
Make sure that the correct IAM policy is attached to the preceding IAM role. For more information, see IAM policy on the GitHub website.
Check that the trust relationship is configured correctly
See the following example:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<example_awsaccountid>:oidc-provider/oidc.eks.<example_region>.amazonaws.com/id/<example_oidcid>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<example_region>.amazonaws.com/id/<example_oidcid>:aud": "sts.amazonaws.com",
          "oidc.eks.<example_region>.amazonaws.com/id/<example_oidcid>:sub": "system:serviceaccount:kube-system:cluster-autoscaler"
        }
      }
    }
  ]
}
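To view the trust policy that's actually attached to your role so that you can compare it against the example, you can query the role with the AWS CLI. The role name is an assumption; substitute your own.

```shell
# Replace with your Cluster Autoscaler role name (hypothetical placeholder).
aws iam get-role \
  --role-name <cluster_auto_scaler_iam_role> \
  --query 'Role.AssumeRolePolicyDocument'
```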
Restart the Cluster Autoscaler pod each time a change is made to the service account role or IAM policy.
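For example, you can restart the pods with a rollout restart. This assumes the default deployment name cluster-autoscaler in the kube-system namespace.

```shell
kubectl -n kube-system rollout restart deployment cluster-autoscaler
```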
If the logs show networking issues, such as an I/O timeout, then do the following:
Note: The following is an example of a log that shows networking issues:
Failed to create AWS Manager: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials caused by: RequestError: send request failed caused by: Post https://sts.region.amazonaws.com/: dial tcp: i/o timeout
- Check that the Amazon EKS cluster has the required networking setup. Verify that the worker node subnet's route table can route traffic to the following endpoints, on either global or Regional endpoints:
- Amazon Elastic Compute Cloud (Amazon EC2)
- AWS Auto Scaling
- AWS Security Token Service (AWS STS)
- Make sure that the subnet's network access control list (network ACL) and the worker node security group don't block traffic to these endpoints.
- If the Amazon EKS cluster is private, then check the setup of the relevant Amazon Virtual Private Cloud (VPC) endpoints. For example, Amazon EC2, AWS Auto Scaling, and AWS STS.
Note: The security group of each VPC endpoint must allow traffic from the Amazon EKS worker node security group. It must also allow the Amazon EKS VPC CIDR block on port 443 for ingress traffic.
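To confirm which interface endpoints exist in the cluster's VPC, you can list them with the AWS CLI. This is a sketch; the VPC ID is a placeholder for your own.

```shell
# Replace with your cluster's VPC ID (hypothetical placeholder).
aws ec2 describe-vpc-endpoints \
  --filters "Name=vpc-id,Values=<example_vpcid>" \
  --query 'VpcEndpoints[].ServiceName'
```

Check the output for the ec2, autoscaling, and sts service names.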
Cluster Autoscaler doesn't scale in or scale out nodes
Check the Cluster Autoscaler pod logs
Run the following command:
kubectl logs -f -n kube-system -l app=cluster-autoscaler
To check whether the pod that's in a Pending status contains any scheduling rules, such as the affinity rule, run the following describe pod command:
kubectl describe pod <example_podname> -n <example_namespace>
For more information, see Affinity and anti-affinity on the Kubernetes website.
Check the Events section of the output. This section explains why a pod is in a Pending status.
Note: Cluster Autoscaler respects nodeSelector and requiredDuringSchedulingIgnoredDuringExecution in nodeAffinity. Make sure that your node groups are labeled with these values. If a pod can't be scheduled with nodeSelector or requiredDuringSchedulingIgnoredDuringExecution, then Cluster Autoscaler considers only node groups that meet those requirements for expansion. Modify the scheduling rules defined on pods or nodes so that a pod is scheduled on a node.
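To confirm that your node groups carry the labels that a pod's nodeSelector or nodeAffinity expects, you can inspect the node labels. The label key and value are hypothetical examples.

```shell
# List all nodes with their labels:
kubectl get nodes --show-labels

# Or filter for nodes that match a specific label (hypothetical key/value):
kubectl get nodes -l <example-label-key>=<example-label-value>
```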
Check the Auto Scaling group tagging for the Cluster Autoscaler
Cluster Autoscaler can't discover the Auto Scaling group unless the node group's corresponding Auto Scaling group is tagged as follows:
Tag 1:
- key: k8s.io/cluster-autoscaler/example-cluster
- value: owned
Tag 2:
- key: k8s.io/cluster-autoscaler/enabled
- value: true
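You can verify these tags, and add them if they're missing, with the AWS CLI. This is a sketch: the Auto Scaling group name is a placeholder, and example-cluster stands in for your cluster name.

```shell
# List the Cluster Autoscaler tags on the Auto Scaling group (hypothetical name):
aws autoscaling describe-tags \
  --filters "Name=auto-scaling-group,Values=<example_asgname>"

# Add the required tags if they're missing:
aws autoscaling create-or-update-tags --tags \
  "ResourceId=<example_asgname>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/example-cluster,Value=owned,PropagateAtLaunch=false" \
  "ResourceId=<example_asgname>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=false"
```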
Check the configuration of the deployment manifest
- Run the following command:
kubectl -n kube-system edit deployment.apps/cluster-autoscaler
- Check whether the manifest is configured with the correct --node-group-auto-discovery argument.
containers:
- command:
  - ./cluster-autoscaler
  - --v=4
  - --stderrthreshold=info
  - --cloud-provider=aws
  - --skip-nodes-with-local-storage=false
  - --expander=least-waste
  - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/example-cluster
  - --balance-similar-node-groups
  - --skip-nodes-with-system-pods=false
Check the current number of nodes
- To check whether the current number of nodes has reached the managed node group's minimum or maximum values, run the following command:
aws eks describe-nodegroup --cluster-name <example-cluster> --nodegroup-name <example-nodegroup>
- If the minimum or maximum values are reached, then update them to match the new workload requirements.
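For example, you can update the managed node group's scaling limits with update-nodegroup-config. The minSize, maxSize, and desiredSize values are assumptions to adjust for your workload.

```shell
# Hypothetical scaling values: substitute your own requirements.
aws eks update-nodegroup-config \
  --cluster-name <example-cluster> \
  --nodegroup-name <example-nodegroup> \
  --scaling-config minSize=2,maxSize=10,desiredSize=2
```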
Check the pod resource request
- To check whether the pod resource request can be fulfilled by the current node instance types, run the following command:
kubectl -n <example_namespace> get pod <example_podname> -o yaml | grep resources -A6
- If the pod resource request can't be fulfilled, then either modify the pod resource requests or create a new node group. When you create a new node group, make sure that the nodes' instance type can fulfill the resource requirement for pods.
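If you use eksctl, a new node group with a larger instance type might be created as follows. This is a sketch: the node group name, instance type, and sizes are assumptions.

```shell
# Hypothetical node group: choose an instance type that fits your pods' requests.
eksctl create nodegroup \
  --cluster example-cluster \
  --name example-large-ng \
  --node-type m5.2xlarge \
  --nodes-min 1 \
  --nodes-max 5
```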
Check the taint configuration for the node in the node group
- To check whether taints are configured for the node and whether the pod can tolerate them, run the following command:
kubectl describe node <example_nodename> | grep taint -A2
- If taints are configured, then either remove them from the node, or, if the pod can't tolerate the taints, define tolerations on the pod so that the pod can be scheduled on the node. For more information, see Taints and tolerations on the Kubernetes website.
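For example, you can remove a taint from a node with kubectl. The taint key and effect here are hypothetical; the trailing hyphen tells kubectl to remove the taint.

```shell
# Remove a NoSchedule taint (hypothetical key); note the trailing "-":
kubectl taint node <example_nodename> <example-key>:NoSchedule-
```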
Check whether the node is annotated with scale-down-disabled
- Run the following command:
kubectl describe node <example_nodename> | grep scale-down-disable
The following is the expected outcome:
cluster-autoscaler.kubernetes.io/scale-down-disabled: true
- If scale-down-disabled is set to true, then run the following command to remove the annotation so that the node can scale down:
kubectl annotate node <example_nodename> cluster-autoscaler.kubernetes.io/scale-down-disabled-
For more information on troubleshooting, see Cluster Autoscaler FAQ on the GitHub website.
