How do I troubleshoot custom nodepool and nodeclass provisioning issues in Amazon EKS Auto Mode?

6 minute read
Content level: Intermediate

I'm facing issues with custom nodepool and nodeclass creation while using EKS Auto Mode.

Short description

When you use Amazon Elastic Kubernetes Service (Amazon EKS) Auto Mode, custom nodepools or nodeclasses can get stuck in an "Unknown", "NotReady", or "Empty" state. The cause is usually a configuration error in either the nodepool or the nodeclass. This article covers common causes of these provisioning failures and the troubleshooting steps to resolve them.

Issue - Custom NodePool is not ready.

# kubectl get nodepool
NAME              NODECLASS   NODES   READY   AGE
custom-nodepool   default1    0       False   39s
general-purpose   default     0       True    15m
system            default     1       True    15m

Causes -

1. A NodePool can stay in a "not ready" state when the NodeClass referenced in the NodePool configuration doesn't exist or its name is misspelled.

Troubleshooting - To troubleshoot the issue, describe the nodepool to check the events.

# kubectl describe nodepool {nodepool-name}
Events:
  Type     Reason               Age                  From       Message
  ----     ------               ----                 ----       -------
  Normal   ValidationSucceeded  2m27s                karpenter  Status condition transitioned, Type: ValidationSucceeded, Status: Unknown -> True, Reason: ValidationSucceeded
  Normal   NodeClassReady       2m27s                karpenter  Status condition transitioned, Type: NodeClassReady, Status: Unknown -> False, Reason: NodeClassNotFound, Message: NodeClass not found on cluster
  Normal   Ready                2m27s                karpenter  Status condition transitioned, Type: Ready, Status: Unknown -> False, Reason: UnhealthyDependents, Message: NodeClassReady=False
  Warning                       20s (x2 over 2m23s)  karpenter  Failed resolving NodeClass

Resolution - Correct the NodeClass reference, either by editing the NodePool directly in the cluster or by updating the manifest and reapplying it.

To edit the configuration directly in the cluster, run the following command:

# kubectl edit nodepool {nodepool-name} 

Update the nodeclass reference:

  nodeClassRef:
    group: eks.amazonaws.com
    kind: NodeClass
    name: {nodeclass-name}   # update the name here
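
Before editing, it can help to confirm that the referenced name really is missing from the cluster. The following is a minimal sketch, not an official procedure: the helper name nodeclass_exists is hypothetical, and the commented wiring assumes the NodePool uses the Karpenter v1 field path spec.template.spec.nodeClassRef.

```shell
# Hypothetical helper: succeeds when the name in $1 exactly matches
# one of the NodeClass names supplied on stdin (one name per line).
nodeclass_exists() {
  grep -Fxq -- "$1"
}

# Example wiring against a live cluster (commented out; assumes the
# Karpenter v1 field path spec.template.spec.nodeClassRef):
#   ref=$(kubectl get nodepool custom-nodepool \
#           -o jsonpath='{.spec.template.spec.nodeClassRef.name}')
#   kubectl get nodeclass \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' \
#     | nodeclass_exists "$ref" || echo "NodeClass '$ref' not found"
```

The match is on the whole line, so a reference to default1 is not satisfied by a NodeClass named default.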

2. EKS Auto Mode uses different labels than Karpenter. Labels related to EC2 managed instances in Auto Mode start with eks.amazonaws.com. If the nodepool configuration contains an unsupported label, the nodepool stays in a not ready state.

Troubleshooting - To troubleshoot the issue, describe the nodepool and check the events for errors (sample below):

# kubectl describe nodepool {nodepool-name}
Events:
  Type    Reason               Age    From       Message
  ----    ------               ----   ----       -------
  Normal  ValidationSucceeded  9m48s  karpenter  Status condition transitioned,  Type: ValidationSucceeded,Status: Unknown -> False, Reason: NodePoolValidationFailed, 
  Message: invalid value: label is restricted; specify a well known label or a custom label that does not use a restricted domain (label=karpenter.k8s.aws/instance-category, 
   restricted-labels=[k8s.io karpenter.k8s.aws karpenter.sh kubernetes.io], well-known-labels=[app.kubernetes.io/managed-by eks.amazonaws.com/capacity-reservation-id 
  eks.amazonaws.com/capacity-reservation-type eks.amazonaws.com/compute-type eks.amazonaws.com/instance-accelerator-count eks.amazonaws.com/instance-accelerator-manufacturer 
  eks.amazonaws.com/instance-accelerator-name eks.amazonaws.com/instance-category  eks.amazonaws.com/instance-cpu eks.amazonaws.com/instance-cpu-manufacturer 
  eks.amazonaws.com/instance-cpu-sustained-clock-speed-mhz eks.amazonaws.com/instance-ebs-bandwidth eks.amazonaws.com/instance-encryption-in-transit-supported eks.amazonaws.com/instance-family 
  eks.amazonaws.com/instance-generation eks.amazonaws.com/instance-gpu-count eks.amazonaws.com/instance-gpu-manufacturer eks.amazonaws.com/instance-gpu-memory 
  eks.amazonaws.com/instance-gpu-name eks.amazonaws.com/instance-hypervisor eks.amazonaws.com/instance-local-nvme eks.amazonaws.com/instance-memory 
  eks.amazonaws.com/instance-network-bandwidth eks.amazonaws.com/instance-size karpenter.sh/capacity-type karpenter.sh/nodepool kubernetes.io/arch kubernetes.io/os 
  node.kubernetes.io/instance-type node.kubernetes.io/windows-build topology.k8s.aws/zone-id topology.kubernetes.io/region topology.kubernetes.io/zone]) 
  in requirements, restricted

Resolution - Replace the restricted labels with their EKS Auto Mode equivalents, which start with eks.amazonaws.com. In the sample above, karpenter.k8s.aws/instance-category is rejected, while eks.amazonaws.com/instance-category appears in the list of well-known labels.

To fix the issue, either edit the nodepool in place with kubectl edit, or update the manifest file with the correct labels and re-apply it.

The list of labels that EKS Auto Mode supports can be found here.
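
To locate the offending keys in a manifest before re-applying it, a simple text search is usually enough. The sketch below assumes only what the validation message above shows, namely that the karpenter.k8s.aws domain is restricted and that none of the well-known labels use it. The helper name find_karpenter_labels and the file name nodepool.yaml are hypothetical.

```shell
# Hypothetical helper: prints every line (with its line number) that
# uses a label key from the karpenter.k8s.aws domain. Reads the files
# given as arguments, or stdin when no file is given.
# Exit status is non-zero when no such label is found.
find_karpenter_labels() {
  grep -n 'karpenter\.k8s\.aws/' "$@"
}

# Example (commented out; nodepool.yaml is a placeholder file name):
#   find_karpenter_labels nodepool.yaml
```

Each reported line is a candidate for replacement with the matching eks.amazonaws.com label, provided that label appears in the supported list.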

Issue - Custom NodePool status is empty.

# kubectl get nodepool
NAME              NODECLASS   NODES   READY   AGE
eks-auto-mode     default                     8s
general-purpose   default     0       True    25m

Troubleshooting - To troubleshoot the issue, describe the nodepool and check the Group and Kind in the nodeclass reference:

# kubectl describe nodepool {nodepool-name}

Node Class Ref:
    Group:  karpenter.k8s.aws
    Kind:   EC2NodeClass
    Name:   default

Cause - Incorrect Group or Kind in node class reference.

Resolution - In EKS Auto Mode, the NodeClassRef group must be "eks.amazonaws.com" and the Kind must be "NodeClass".

To fix the issue, update the Group and Kind in the manifest file and re-create the nodepool. Alternatively, to update the nodepool directly on the cluster, follow these steps:

  1. Run the following command to open the nodepool in edit mode:
# kubectl edit nodepool eks-auto-mode
  2. Update the Group and Kind in the NodeClassRef section and save the file. kubectl stores a copy of your changes in a temporary file (sample output below):
→ A copy of your changes has been stored to 
/var/folders/7z/3l7pdd_96td5hyx68fwk6qmh0000gr/T/kubectl-edit-25******89.yaml
  3. Apply the temporary file from the previous step with the --force option to update the nodepool configuration:
# kubectl apply -f {file name from output of step 2} --force
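
Once the nodepool has been updated, the reference can be re-checked without reading the full describe output. This is a minimal sketch: the helper name is_auto_mode_ref is hypothetical, and the commented wiring assumes the Karpenter v1 field path spec.template.spec.nodeClassRef.

```shell
# Hypothetical helper: takes a "group/kind" string and succeeds only
# for the pair that EKS Auto Mode expects.
is_auto_mode_ref() {
  [ "$1" = "eks.amazonaws.com/NodeClass" ]
}

# Example wiring against a live cluster (commented out):
#   ref=$(kubectl get nodepool eks-auto-mode -o \
#     jsonpath='{.spec.template.spec.nodeClassRef.group}/{.spec.template.spec.nodeClassRef.kind}')
#   is_auto_mode_ref "$ref" || echo "unexpected nodeClassRef: $ref"
```

For the nodepool shown in the describe output above, the string would be karpenter.k8s.aws/EC2NodeClass, which the check rejects.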

Issue - Custom NodeClass is in NotReady State.

# kubectl get nodeclass
NAME           ROLE                    READY   AGE
customclass    AutoModeCustomRole      False   6m29s
default        AmazonEKSAutoNodeRole   True    41h

Cause - A NotReady status for custom nodeclass can be caused by missing Kubernetes permissions on the Auto Mode Node IAM role.

Troubleshooting -

  1. To troubleshoot the issue, describe the NodeClass and look for an Unauthorized error in the output:
# kubectl describe nodeclass customclass
 Events:
  Type    Reason                Age   From       Message
  ----    ------                ----  ----       -------
  Normal  SubnetsReady          50m   karpenter  Status condition transitioned, Type: SubnetsReady, Status: Unknown -> True, Reason: SubnetsReady
  Normal  SecurityGroupsReady   50m   karpenter  Status condition transitioned, Type: SecurityGroupsReady, Status: Unknown -> True, Reason: SecurityGroupsReady
  Normal  InstanceProfileReady  50m   karpenter  Status condition transitioned, Type: InstanceProfileReady, Status: Unknown -> False, Reason: UnauthorizedNodeRole, Message: Role AutoModeCustomRole is unauthorized to join nodes to the cluster
  Normal  ValidationSucceeded   50m   karpenter  Status condition transitioned, Type: ValidationSucceeded, Status: Unknown -> False, Reason: DependenciesNotReady, Message: Awaiting Instance Profile, Security Group, and Subnet resolution
  Normal  Ready                 50m   karpenter  Status condition transitioned, Type: Ready, Status: Unknown -> False, Reason: UnhealthyDependents, Message: ValidationSucceeded=False, InstanceProfileReady=False  

The following error in the output confirms a permissions issue:

Type: InstanceProfileReady, Status: Unknown → False, Reason: UnauthorizedNodeRole, 
Message: Role AutoModeCustomRole is unauthorized to join nodes to the cluster
  2. Check the CloudTrail event history:

Filter the CloudTrail logs to show events only from eks.amazonaws.com, and examine these records for any Unauthorized errors.
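
The same filter can be applied from the command line. The sketch below is one possible approach under stated assumptions: the helper name only_unauthorized is hypothetical, the search string simply mirrors the "Unauthorized" wording used above, and the commented AWS CLI call requires credentials that are allowed to read CloudTrail.

```shell
# Hypothetical helper: keeps only the lines on stdin that mention
# "unauthorized" (case-insensitive).
only_unauthorized() {
  grep -i 'unauthorized'
}

# Example wiring (commented out; requires AWS CLI credentials):
#   aws cloudtrail lookup-events \
#     --lookup-attributes AttributeKey=EventSource,AttributeValue=eks.amazonaws.com \
#     --max-results 50 --output json | only_unauthorized
```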

Resolution - The node IAM role must have the minimum required IAM permissions and the AmazonEKSAutoNodePolicy EKS access policy.

EKS automatically creates an access entry for the default node IAM role during cluster creation, or when you create the built-in nodeclass and nodepools in scenarios such as a migration to Auto Mode. If you change the node IAM role associated with a NodeClass, you must ensure that the new role has the minimum required IAM permissions, and you must create a new access entry for it. For more information on EKS access entries, see Grant IAM users access to Kubernetes with EKS access entries. For details on the IAM permissions needed for the node role, see Amazon EKS Auto Mode node IAM role.
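
Creating the access entry for a custom node role might look like the following sketch. It is not taken from this article: the function names run and grant_node_role_access are hypothetical, and the --type EC2 value and cluster-wide access scope are assumptions that should be checked against the linked EKS documentation before use. The cluster name and role ARN in the example are placeholders.

```shell
# Hypothetical wrapper: prints the command when DRY_RUN=1,
# otherwise executes it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical helper: $1 = cluster name, $2 = node IAM role ARN.
# Creates an access entry for the role, then associates the
# AmazonEKSAutoNodePolicy access policy with it.
grant_node_role_access() {
  run aws eks create-access-entry \
    --cluster-name "$1" --principal-arn "$2" --type EC2 &&
  run aws eks associate-access-policy \
    --cluster-name "$1" --principal-arn "$2" \
    --policy-arn arn:aws:eks::aws:cluster-access-policy/AmazonEKSAutoNodePolicy \
    --access-scope type=cluster
}

# Example (commented out; prints the commands without running them):
#   DRY_RUN=1
#   grant_node_role_access my-cluster arn:aws:iam::111122223333:role/AutoModeCustomRole
```

Running with DRY_RUN=1 first makes it easy to review the exact commands before anything is changed in the account.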

For more troubleshooting scenarios for EKS Auto Mode, see the following links:

Troubleshoot EKS Auto Mode

How to debug Auto-Mode custom NodePool

How do I troubleshoot EKS Auto Mode built-in node pools with Unknown Status

AWS
EXPERT
published 6 months ago