How do I troubleshoot issues when I set up Cluster Autoscaler on an Amazon EKS cluster?

I want to troubleshoot issues when I launch Cluster Autoscaler on my Amazon Elastic Kubernetes Service (Amazon EKS) cluster.

Resolution

Prerequisite 

Install or update eksctl to the latest version.

Note: The --region variable isn't always defined in the commands because the default value for your AWS Region is used. To check the default value, run the AWS Command Line Interface (AWS CLI) configure command. If you change the AWS Region, then use the --region flag. If you receive errors when you run AWS CLI commands, then see Troubleshoot AWS CLI errors. Also, make sure that you're using the most recent AWS CLI version.

Cluster Autoscaler pod is in a CrashLoopBackOff status

Note: Replace the placeholder values in code snippets with your own values.

  1. To check the Cluster Autoscaler pod status, run the following command:

    kubectl get pods -n kube-system | grep cluster-autoscaler

    The following is an example of a Cluster Autoscaler pod that has a CrashLoopBackOff status:

    NAME                            READY   STATUS             RESTARTS      AGE
    cluster-autoscaler-xxxx-xxxxx   0/1     CrashLoopBackOff   3 (20s ago)   99s
  2. To describe the cluster-autoscaler pod, run the following command:

    kubectl describe pod cluster-autoscaler-xxxx-xxxxx -n kube-system

    The following is an example of the output:

    Name:               cluster-autoscaler-xxxx-xxxxx
    Namespace:          kube-system
    State:              Waiting
    Reason:             CrashLoopBackOff
    Last State:         Terminated
    Reason:             OOMKilled
    Exit Code:          137
    ...
  3. If the output shows an OOMKilled issue, then increase the memory resource requests and limits of the cluster-autoscaler deployment.
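
    For example, the following command sets the memory request and limit to 1Gi. The 1Gi value is an example only; size the values for your cluster:

    kubectl -n kube-system set resources deployment cluster-autoscaler --requests=memory=1Gi --limits=memory=1Gi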

  4. To view the Cluster Autoscaler pod logs, run the following command:

    kubectl logs -f -n kube-system -l app=cluster-autoscaler

    The following is an example of a log that shows AWS Identity and Access Management (IAM) permissions issues:

    Failed to create AWS Manager: cannot autodiscover ASGs: AccessDenied: User: xxx is not authorized to perform: autoscaling: DescribeTags because no identity-based policy allows the autoscaling:DescribeTags action status code: 403, request id: xxxxxxxx

If the logs show that there are IAM permissions issues, then complete the following steps:

Check that an OIDC provider is associated with the EKS cluster

  1. To retrieve your cluster's OpenID Connect (OIDC) issuer ID, run the following command:

    oidc_id=$(aws eks describe-cluster --name example-cluster --query "cluster.identity.oidc.issuer" --output text | cut -d '/' -f 5)
  2. To check whether there is already an IAM OIDC provider with your cluster's ID in your account, run the following command:

    aws iam list-open-id-connect-providers | grep $oidc_id | cut -d "/" -f4

    Note: If an output is returned, then you already have an IAM OIDC provider for your cluster, and you can skip the next step. If no output is returned, then proceed to the next step.

  3. To create an IAM OIDC identity provider for your cluster, run the following command:

    eksctl utils associate-iam-oidc-provider --cluster example-cluster --approve

Check that the Cluster Autoscaler service account is annotated with the IAM role

  1. Run the following command:

    kubectl get serviceaccount cluster-autoscaler -n kube-system -o yaml

    The following is the expected outcome:

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      annotations:
        eks.amazonaws.com/role-arn: arn:aws:iam::012345678912:role/<cluster_auto_scaler_iam_role>
      name: cluster-autoscaler
      namespace: kube-system
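
    If the annotation is missing, then you can add it with a command that's similar to the following. The role name is a placeholder, and you can add the --overwrite flag if you need to replace an existing annotation:

    kubectl annotate serviceaccount cluster-autoscaler -n kube-system eks.amazonaws.com/role-arn=arn:aws:iam::<example_awsaccountid>:role/<cluster_auto_scaler_iam_role>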

Check the IAM policy

Make sure that the correct IAM policy is attached to the preceding IAM role. For more information, see IAM policy on the GitHub website.
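
The following is a minimal example policy that's based on the example in the Cluster Autoscaler documentation. Treat it as a starting point, and use the linked IAM policy page as the authoritative reference:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:DescribeAutoScalingGroups",
        "autoscaling:DescribeAutoScalingInstances",
        "autoscaling:DescribeLaunchConfigurations",
        "autoscaling:DescribeScalingActivities",
        "autoscaling:DescribeTags",
        "ec2:DescribeInstanceTypes",
        "ec2:DescribeLaunchTemplateVersions"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "autoscaling:SetDesiredCapacity",
        "autoscaling:TerminateInstanceInAutoScalingGroup"
      ],
      "Resource": "*"
    }
  ]
}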

Check that the trust relationship is configured correctly

See the following example:

{  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<example_awsaccountid>:oidc-provider/oidc.eks.<example_region>.amazonaws.com/id/<example_oidcid>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<example_region>.amazonaws.com/id/<example_oidcid>:aud": "sts.amazonaws.com",
          "oidc.eks.<example_region>.amazonaws.com/id/<example_oidcid>:sub": "system:serviceaccount:kube-system:cluster-autoscaler"
        }
      }
    }
  ]
}

Restart the Cluster Autoscaler pod each time a change is made to the service account role or IAM policy.
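
For example, run the following command to restart the Cluster Autoscaler deployment:

kubectl -n kube-system rollout restart deployment cluster-autoscaler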

If the logs show any networking issues such as I/O timeout, then do the following:

Note: The following is an example of a log that shows networking issues:

Failed to create AWS Manager: cannot autodiscover ASGs: WebIdentityErr: failed to retrieve credentials caused by: RequestError: send request failed caused by: Post https://sts.region.amazonaws.com/: dial tcp: i/o timeout
  1. Check that the Amazon EKS cluster is configured with the required networking setup. Verify that the worker node subnet has a route table that can route traffic to the following endpoints, through either global or Regional endpoints:
    • Amazon Elastic Compute Cloud (Amazon EC2)
    • AWS Auto Scaling
    • AWS Security Token Service (AWS STS)
  2. Make sure that the subnet network access control list (network ACL) and the worker node security group don't block traffic to these endpoints.
  3. If the Amazon EKS cluster is private, then check the setup of the relevant Amazon Virtual Private Cloud (Amazon VPC) endpoints, such as the Amazon EC2, AWS Auto Scaling, and AWS STS endpoints.

Note: The security group of each VPC endpoint must allow ingress traffic on port 443 from the Amazon EKS worker node security group and from the Amazon EKS VPC CIDR block.
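
To review the VPC endpoints that exist in the cluster's VPC, you can run a command that's similar to the following. The VPC ID is a placeholder:

aws ec2 describe-vpc-endpoints --filters Name=vpc-id,Values=<example_vpcid> --query "VpcEndpoints[*].ServiceName"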

Cluster Autoscaler doesn't scale in or scale out nodes

Check the Cluster Autoscaler pod logs

Run the following command:

kubectl logs -f -n kube-system -l app=cluster-autoscaler

To check whether a pod that's in a Pending status has scheduling constraints, such as affinity rules, run the following describe pod command:

kubectl describe pod <example_podname> -n <example_namespace>

For more information, see Affinity and anti-affinity on the Kubernetes website. 

Check the Events section of the output. This section shows information that explains why a pod is in a Pending status.

Note: Cluster Autoscaler respects nodeSelector and requiredDuringSchedulingIgnoredDuringExecution in nodeAffinity. If a pod can't be scheduled because of a nodeSelector or requiredDuringSchedulingIgnoredDuringExecution rule, then Cluster Autoscaler considers only node groups that satisfy those requirements for expansion. Make sure that your node groups are labeled with the values that your pods require, or modify the scheduling rules that are defined on the pods or nodes so that the pods can be scheduled.
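
For example, if a pending pod specifies a nodeSelector that's similar to the following, then at least one node group must have the matching label before Cluster Autoscaler scales out for the pod. The label is a placeholder:

nodeSelector:
  workload-type: example-workload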

Check the Auto Scaling group tagging for the Cluster Autoscaler

Cluster Autoscaler can't discover the Auto Scaling group unless the node group's corresponding Auto Scaling group is tagged as follows:

Tag 1:

  • key: k8s.io/cluster-autoscaler/example-cluster
  • value: owned

Tag 2:

  • key: k8s.io/cluster-autoscaler/enabled
  • value: true
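
Note: Amazon EKS managed node groups apply these tags to their Auto Scaling groups automatically. For self-managed node groups, you can add the tags with a command that's similar to the following. The Auto Scaling group name is a placeholder:

aws autoscaling create-or-update-tags --tags "ResourceId=<example-asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/example-cluster,Value=owned,PropagateAtLaunch=true" "ResourceId=<example-asg-name>,ResourceType=auto-scaling-group,Key=k8s.io/cluster-autoscaler/enabled,Value=true,PropagateAtLaunch=true"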

Check the configuration of the deployment manifest

  1. Run the following command:

    kubectl -n kube-system edit deployment.apps/cluster-autoscaler
  2. Check whether the manifest is configured with the correct node-group-auto-discovery argument.

    containers:
    - command:
        - ./cluster-autoscaler
        - --v=4
        - --stderrthreshold=info
        - --cloud-provider=aws
        - --skip-nodes-with-local-storage=false
        - --expander=least-waste
        - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/example-cluster
        - --balance-similar-node-groups
        - --skip-nodes-with-system-pods=false

Check the current number of nodes

  1. To check whether the current number of nodes has reached the managed node group's minimum or maximum values, run the following command:

    aws eks describe-nodegroup --cluster-name <example-cluster> --nodegroup-name <example-nodegroup>
  2. If the minimum or maximum values are reached, then modify the values with the new workload requirements.
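
    For example, the following command updates the scaling configuration of a managed node group. The values are placeholders:

    aws eks update-nodegroup-config --cluster-name <example-cluster> --nodegroup-name <example-nodegroup> --scaling-config minSize=1,maxSize=10,desiredSize=5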

Check the pod resource request

  1. To check whether the pod resource request can be fulfilled by the current node instance types, run the following command:

    kubectl -n <example_namespace> get pod <example_podname> -o yaml | grep resources -A6
  2. If the pod resource request can't be fulfilled, then either modify the pod resource requests or create a new node group. When you create a new node group, make sure that the nodes' instance type can fulfill the resource requirement for pods.
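
    The following is an example resources section in a pod spec. The values are placeholders:

    resources:
      requests:
        cpu: "500m"
        memory: "512Mi"
      limits:
        memory: "512Mi"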

Check the taint configuration for the node in the node group

  1. To check whether taints are configured for the node and the pod can tolerate the taints, run the following command:

    kubectl describe node <example_nodename> | grep taint -A2
  2. If the taints are configured, then remove the taints defined on the node. If the pod can't tolerate taints, then define tolerations on the pod so that the pod can be scheduled on the node with the taints. For more information, see Taints and tolerations on the Kubernetes website. 
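
    For example, the following command removes a taint with the NoSchedule effect from a node. The taint key is a placeholder:

    kubectl taint nodes <example_nodename> <example_key>:NoSchedule-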

Check whether the node is annotated with scale-down-disabled

  1. Run the following command:

    kubectl describe node <example_nodename> | grep scale-down-disabled

    The following is the expected outcome:

    cluster-autoscaler.kubernetes.io/scale-down-disabled: true
  2. If scale-down-disabled is set to true, then run the following command to remove the annotation so that Cluster Autoscaler can scale down the node:

    kubectl annotate node <example_nodename> cluster-autoscaler.kubernetes.io/scale-down-disabled-

For more information on troubleshooting, see Cluster Autoscaler FAQ on the GitHub website.
