How do I troubleshoot Amazon EKS managed node group creation failures?

My Amazon Elastic Kubernetes Service (Amazon EKS) managed node group failed to create. Nodes can't join the cluster and I received an error similar to the following: "Instances failed to join the kubernetes cluster."

Resolution

Use the automation runbook to identify common issues

Use the AWSSupport-TroubleshootEKSWorkerNode runbook to find common issues.

Important: For the automation to work, your worker nodes must have permission to access AWS Systems Manager and have Systems Manager running. To grant permission, attach the AmazonSSMManagedInstanceCore AWS managed policy to the AWS Identity and Access Management (IAM) role that corresponds to your Amazon Elastic Compute Cloud (Amazon EC2) instance profile. This is the default configuration for Amazon EKS managed node groups that are created through eksctl.

Complete the following steps:

  1. Open the runbook.
  2. Check that the AWS Region in the AWS Management Console is set to the same Region as your cluster.
    Note: Review the Runbook details section of the runbook for more information about the runbook.
  3. In the Input parameters section, specify the name of your cluster in the ClusterName field and instance ID in the WorkerID field.
  4. (Optional) In the AutomationAssumeRole field, specify the IAM role to allow Systems Manager to perform actions. If not specified, then the IAM permissions of your current IAM entity are used to perform the actions in the runbook.
  5. Choose Execute.
  6. Check the Outputs section to see why your worker node can't join your cluster and steps that you can take to resolve the error.
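
The same runbook can also be started from the AWS CLI. The following is a minimal sketch, assuming that the AWS CLI is configured with permission to run Systems Manager automation. The function names, cluster name, instance ID, and Region are hypothetical placeholders.

```shell
# Build the --parameters value from a cluster name and an instance ID.
# ClusterName and WorkerID are the input fields named in the steps above.
runbook_parameters() {
  printf 'ClusterName=%s,WorkerID=%s' "$1" "$2"
}

# Start the runbook and print the automation execution ID (hypothetical helper).
# $1 = cluster name, $2 = worker node instance ID, $3 = AWS Region
start_worker_node_runbook() {
  aws ssm start-automation-execution \
    --document-name "AWSSupport-TroubleshootEKSWorkerNode" \
    --parameters "$(runbook_parameters "$1" "$2")" \
    --region "$3" \
    --query "AutomationExecutionId" --output text
}

# Example (placeholder values):
#   id=$(start_worker_node_runbook my-cluster i-0123456789abcdef0 eu-north-1)
#   aws ssm get-automation-execution --automation-execution-id "$id" \
#     --region eu-north-1 --query "AutomationExecution.Outputs"
```

Note: Replace the placeholder values with your own cluster name, instance ID, and Region.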

Confirm worker node security group traffic requirements

Confirm that your control plane's security group and worker node security group are configured with the requirements for inbound and outbound traffic. By default, Amazon EKS applies the cluster security group to the instances in your node group to facilitate communication between nodes and the control plane. If you specify custom security groups in the launch template for your managed node group, then Amazon EKS doesn't add the cluster security group.
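
To check whether the cluster security group is attached to a node, you can compare the two sets of security group IDs. The following is a sketch under the assumption that the AWS CLI is configured for the cluster's Region. The helper names are hypothetical.

```shell
# Print the cluster security group ID that Amazon EKS created for the cluster.
cluster_security_group() {
  aws eks describe-cluster --name "$1" \
    --query "cluster.resourcesVpcConfig.clusterSecurityGroupId" --output text
}

# Print the security group IDs that are attached to a worker node instance.
instance_security_groups() {
  aws ec2 describe-instances --instance-ids "$1" \
    --query "Reservations[].Instances[].SecurityGroups[].GroupId" --output text
}

# Succeed if the group ID in $1 appears in the whitespace-separated list in $2.
has_security_group() {
  case " $(printf '%s' "$2" | tr '\t\n' '  ') " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example (placeholder values):
#   has_security_group "$(cluster_security_group my-cluster)" \
#     "$(instance_security_groups i-0123456789abcdef0)" \
#     || echo "Cluster security group is not attached to the node"
```

If the cluster security group isn't attached because you use custom security groups, then review the rules of the custom groups against the traffic requirements.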

Verify worker node IAM permissions

Make sure that the IAM instance role that's associated with the worker node has the AmazonEKSWorkerNodePolicy and AmazonEC2ContainerRegistryReadOnly policies attached.

Note: You must attach the Amazon managed policy AmazonEKS_CNI_Policy to an IAM role. You can attach it to the node instance role. However, it's a best practice to attach the policy to a role that's associated with the aws-node Kubernetes service account in the kube-system namespace. For more information, see Configure Amazon VPC CNI plugin to use IRSA.
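
One way to check the attached policies is to list them with the AWS CLI and compare them against the two required policy names. This is a minimal sketch. The function names and the role name are hypothetical.

```shell
# Print the managed policy names that are attached to the node role.
attached_policy_names() {
  aws iam list-attached-role-policies --role-name "$1" \
    --query "AttachedPolicies[].PolicyName" --output text
}

# Print each required policy that is missing from the list in $1.
# $1 is a whitespace-separated list of attached policy names.
missing_node_policies() {
  local required
  for required in AmazonEKSWorkerNodePolicy AmazonEC2ContainerRegistryReadOnly; do
    case " $(printf '%s' "$1" | tr '\t\n' '  ') " in
      *" $required "*) ;;
      *) printf '%s\n' "$required" ;;
    esac
  done
}

# Example (placeholder role name):
#   missing_node_policies "$(attached_policy_names my-node-role)"
```

An empty result means that both required policies are attached to the role.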

Confirm that the Amazon VPC for your cluster supports DNS hostnames and DNS resolution

If you configure private access for your EKS cluster endpoint, then you must turn on DNS hostnames and DNS resolution for your Amazon Virtual Private Cloud (Amazon VPC). When you activate endpoint private access, Amazon EKS creates an Amazon Route 53 private hosted zone for you. Then, Amazon EKS associates the hosted zone with your cluster's Amazon VPC. For more information, see Control network access to cluster API server endpoint.
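
The two VPC attributes can be checked and turned on from the AWS CLI. The following sketch assumes that you have permission to describe and modify the VPC. The function names and VPC ID are hypothetical.

```shell
# Print the current value (True or False) of both DNS attributes for a VPC.
check_vpc_dns() {
  aws ec2 describe-vpc-attribute --vpc-id "$1" --attribute enableDnsHostnames \
    --query "EnableDnsHostnames.Value" --output text
  aws ec2 describe-vpc-attribute --vpc-id "$1" --attribute enableDnsSupport \
    --query "EnableDnsSupport.Value" --output text
}

# Turn on both attributes. Each call can modify only one attribute.
enable_vpc_dns() {
  aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-support '{"Value":true}'
  aws ec2 modify-vpc-attribute --vpc-id "$1" --enable-dns-hostnames '{"Value":true}'
}

# Example (placeholder VPC ID):
#   check_vpc_dns vpc-0123456789abcdef0
```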

Update the aws-auth ConfigMap with the NodeInstanceRole of your worker nodes

Verify that the aws-auth ConfigMap is configured correctly with the IAM role of your worker nodes, not the instance profile.
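
To inspect the ConfigMap, you can use kubectl. The sketch below also includes a small hypothetical helper that tells a role ARN apart from an instance profile ARN. It assumes that kubectl is configured for your cluster.

```shell
# Print the aws-auth ConfigMap so that you can review the mapRoles entries.
show_aws_auth() {
  kubectl get configmap aws-auth -n kube-system -o yaml
}

# Succeed if $1 looks like an IAM role ARN rather than an instance profile ARN.
is_role_arn() {
  case "$1" in
    arn:*:iam::*:role/*) return 0 ;;
    *) return 1 ;;
  esac
}

# A typical mapRoles entry for worker nodes looks similar to the following:
#   - rolearn: arn:aws:iam::111122223333:role/my-node-role   (placeholder ARN)
#     username: system:node:{{EC2PrivateDNSName}}
#     groups:
#       - system:bootstrappers
#       - system:nodes
```

Note: The account ID and role name in the comment are placeholders.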

Set the tags for your worker nodes

For the Tag property of your worker nodes, set the key to kubernetes.io/cluster/clusterName and set the value to owned. Replace clusterName with the name of your cluster.
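
The tag can be checked or applied with the AWS CLI. This is a minimal sketch. The function names, instance ID, and cluster name are hypothetical.

```shell
# Build the tag key for a cluster name.
cluster_tag_key() {
  printf 'kubernetes.io/cluster/%s' "$1"
}

# Print the value of the cluster tag on an instance (empty if the tag is missing).
# $1 = instance ID, $2 = cluster name
show_cluster_tag() {
  aws ec2 describe-tags --filters \
    "Name=resource-id,Values=$1" \
    "Name=key,Values=$(cluster_tag_key "$2")" \
    --query "Tags[].Value" --output text
}

# Set the cluster tag on an instance to owned.
set_cluster_tag() {
  aws ec2 create-tags --resources "$1" \
    --tags "Key=$(cluster_tag_key "$2"),Value=owned"
}

# Example (placeholder values):
#   show_cluster_tag i-0123456789abcdef0 my-cluster
```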

Confirm that the Amazon VPC subnets for the worker node have available IP addresses

If your Amazon VPC runs out of IP addresses, then associate a secondary CIDR to your existing Amazon VPC. For more information, see View Amazon EKS networking requirements for VPC and subnets.
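
To see how many addresses remain in each subnet, you can query the subnets with the AWS CLI. The sketch below uses hypothetical function names, and the threshold of 10 addresses in the example is an arbitrary assumption.

```shell
# Print "subnet-id<TAB>available-ip-count" for each subnet ID that you pass in.
subnet_free_ips() {
  aws ec2 describe-subnets --subnet-ids "$@" \
    --query "Subnets[].[SubnetId,AvailableIpAddressCount]" --output text
}

# Read that output on stdin and print the subnets with fewer than $1 free addresses.
low_ip_subnets() {
  awk -v min="$1" '$2 < min { print $1 }'
}

# Example (placeholder subnet IDs):
#   subnet_free_ips subnet-0aaa subnet-0bbb | low_ip_subnets 10
#
# To find the subnets of a managed node group (placeholder names):
#   aws eks describe-nodegroup --cluster-name my-cluster \
#     --nodegroup-name my-nodegroup --query "nodegroup.subnets"
```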

Confirm that your Amazon EKS worker nodes can reach the API server endpoint for your cluster

You can launch worker nodes in any subnet within your cluster VPC or peered subnet if there's an internet route through the following gateways:

  • NAT
  • Internet
  • Transit

If your worker nodes are launched in a restricted private network, then confirm that your worker nodes can reach the Amazon EKS API server endpoint. For more information, see the requirements to run Amazon EKS in a private cluster without outbound internet access.

Note: For nodes in a private subnet that routes traffic through a NAT gateway, it's a best practice to create the NAT gateway in a public subnet.

If you aren't using AWS PrivateLink endpoints, then verify access to API endpoints through a proxy server for the following AWS services:

  • Amazon EC2
  • Amazon ECR
  • Amazon S3

To verify that the worker node has access to the API server, use SSH to connect and then run the following netcat command:

nc -vz 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com 443

Note: Replace 9FCF4EA77D81408ED82517B9B7E60D52.yl4.eu-north-1.eks.amazonaws.com with your API server endpoint.
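
If you don't have the endpoint at hand, then you can look it up with the AWS CLI. The value that's returned includes the https:// scheme, which nc doesn't accept. The following sketch strips the scheme. The function names and cluster name are hypothetical, and the lookup assumes that the AWS CLI is available and is allowed to describe the cluster.

```shell
# Print the API server endpoint URL of a cluster.
api_server_endpoint() {
  aws eks describe-cluster --name "$1" --query "cluster.endpoint" --output text
}

# Strip the scheme and any path from a URL so that nc receives a bare hostname.
endpoint_host() {
  local url="$1"
  url="${url#https://}"
  printf '%s' "${url%%/*}"
}

# Example (placeholder cluster name):
#   nc -vz "$(endpoint_host "$(api_server_endpoint my-cluster)")" 443
```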

Check the kubelet logs while still connected to your instance:

journalctl -f -u kubelet

If the kubelet logs don't provide information on the source of the issue, then check the status of the kubelet on the worker node:

sudo systemctl status kubelet

Collect the Amazon EKS logs and the operating system logs for further troubleshooting.

Verify that your worker nodes can reach the API endpoints for your AWS Region

Use SSH to connect to one of the worker nodes, and then run the following commands for each service:

nc -vz ec2.region.amazonaws.com 443
nc -vz ecr.region.amazonaws.com 443
nc -vz s3.region.amazonaws.com 443

Note: Replace region with the AWS Region for your worker node.
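
To run the three checks in one pass, you can wrap them in a loop. This is a sketch with hypothetical function names. The 5-second timeout is an assumed value, and the -w option is assumed to be supported by the nc variant on your node.

```shell
# Print the three service endpoint hostnames for a Region, one per line.
service_endpoints() {
  local region="$1" svc
  for svc in ec2 ecr s3; do
    printf '%s.%s.amazonaws.com\n' "$svc" "$region"
  done
}

# Test port 443 on each endpoint and print OK or FAIL for each host.
check_service_endpoints() {
  local host
  for host in $(service_endpoints "$1"); do
    if nc -vz -w 5 "$host" 443; then
      echo "OK   $host"
    else
      echo "FAIL $host"
    fi
  done
}

# Example (placeholder Region):
#   check_service_endpoints eu-north-1
```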

Configure the user data for your worker node

For managed node group launch templates with a specified AMI, you must supply bootstrap commands for worker nodes to join your cluster. Amazon EKS doesn't merge the default bootstrap commands into your user data. For more information, see Introducing launch template and custom AMI support in Amazon EKS Managed Node Groups and Specifying an AMI.

Example launch template with bootstrap commands:

#!/bin/bash
set -o xtrace
/etc/eks/bootstrap.sh ${ClusterName} ${BootstrapArguments}

Note: Replace ${ClusterName} with the name of your Amazon EKS cluster. Replace ${BootstrapArguments} with additional bootstrap values, if needed.

Related information

Troubleshoot problems with Amazon EKS clusters and nodes

How do I get my worker nodes to join my Amazon EKS cluster?