
EKS cluster update from 1.27 to 1.28: worker node group doesn't join the cluster


I've tried to update my EKS cluster with two worker nodes from 1.27 to 1.28. Since the worker nodes should be updated before the cluster update, one of the two worker nodes (the CPU node) was updated successfully to Kubernetes version 1.28. However, the GPU node update failed with the following error:

"NodeCreationFailure - Couldn't proceed with upgrade process as new nodes are not joining node group"

During the inspection I also noticed that previous cluster updates were carried out without updating the worker nodes in the first place; the node group versions were kept at 1.25 until now.

Below are the steps taken so far, following the troubleshooting options provided by AWS. Example commands for some of these checks are shown after the list.

1. VPC and Networking Checks:

  • IP Address Availability: Confirmed that the Virtual Private Cloud (VPC) has an adequate number of IP addresses available.
  • Subnet Inspection: Verified that the subnets linked to the node group have a sufficient number of free IP addresses.
  • Security Group Configuration: Ensured that the security groups are set up to permit the necessary traffic between the nodes and the control plane.

2. IAM Roles and Policies Verification:

  • Node IAM Role Permissions: Checked and confirmed that the node IAM role includes all required permissions for joining the cluster.
  • Policy Attachments: Verified that the AmazonEKSWorkerNodePolicy, AmazonEKS_CNI_Policy, and AmazonEC2ContainerRegistryReadOnly policies are correctly attached to the node role.

3. CloudWatch Logs Examination:

  • Error Message Review: Analyzed the CloudWatch logs for any specific error messages that occurred during the node creation process.
  • Issue Identification: Discovered an issue in the logs: unmounting the EFS volume on the GPU node fails because the mount point is not empty.

4. AWS Support Runbook Utilization:

  • Runbook Execution: Implemented the AWSSupport-TroubleshootEKSWorkerNode runbook to conduct a detailed analysis of the EC2 worker nodes and the EKS cluster.

5. Kubernetes Version Compatibility Check:

  • Compatibility Assurance: Ensured that the Kubernetes version targeted for the upgrade is compatible with our existing setup, including any custom configurations or extensions in use.
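
For reference, these checks map to commands roughly like the following (a sketch; subnet IDs, role names, and instance IDs are placeholders, and the runbook parameter names should be confirmed against the runbook's input list in the Systems Manager console):
# Free IP addresses in the node group subnets
aws ec2 describe-subnets --subnet-ids <subnet-id> --query 'Subnets[].{SubnetId:SubnetId,FreeIPs:AvailableIpAddressCount}'
# Policies attached to the node IAM role
aws iam list-attached-role-policies --role-name <node-instance-role-name>
# Run the AWS Support runbook against the failing GPU worker node
aws ssm start-automation-execution --document-name "AWSSupport-TroubleshootEKSWorkerNode" --parameters "ClusterName=<cluster-name>,WorkerID=<ec2-instance-id>"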

Cluster Check

  1. Checking if the cluster Security Group is allowing traffic from the worker node:
  • Passed
  2. Checking DHCP options of the cluster VPC:
  • Passed
  3. Checking cluster IAM role arn: <Account> for the required permissions:
  • Passed
  4. Checking control plane Elastic Network Interfaces (ENIs) in the cluster VPC:
  • Passed
  5. Checking cluster VPC for required DNS attributes (see the example command after this list):
  • Passed
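
The DNS attributes result in particular can be confirmed manually; both attributes should come back enabled so that worker nodes can resolve and reach the cluster endpoint (the VPC ID is a placeholder):
aws ec2 describe-vpc-attribute --vpc-id <vpc-id> --attribute enableDnsSupport
aws ec2 describe-vpc-attribute --vpc-id <vpc-id> --attribute enableDnsHostnames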

Worker Node Check

  1. Checking if the EC2 instance family is supported:
  • Passed
  2. Checking the worker node network configuration:
  • No issue
  3. Checking the IAM instance profile of the worker node:
  • Passed
  4. Checking the worker node <name> tags:
  • Passed
  5. Checking the AMI version for EC2 instance <name>:
  • [WARNING]: Worker node AMI version 1.25 must be at the same cluster version 1.28 or one minor version behind
  6. Checking worker node Elastic Network Interfaces (ENIs) and private IP addresses to check if CNI is running:
  • Passed
  7. Checking the outbound SG rules for the worker node:
  • Passed
  8. Checking if the worker node is running in an AWS Outposts subnet, basic NACL rules, STS regional endpoint availability, whether the Instance Metadata HTTP endpoint is enabled on the worker node, and whether the SSM agent is running and reachable on the worker node:
  • Passed
  9. Consideration of force upgrade: As a last option I also carried out the force upgrade method (command shown below), which failed as well with:

"Error Upgrading nodegroup Stack: waiter state transitioned to failure"

Are there any other checks I have missed? I still can't update the GPU node. Please support.

asked a month ago · 92 views
3 Answers

Based on the information provided, it appears that you've encountered a significant issue while attempting to update your EKS cluster from version 1.27 to 1.28, particularly with the GPU worker node. The problem seems to stem from a mismatch between the worker node and cluster versions, which has persisted through previous updates.

The most critical issue identified is the AMI version mismatch. The warning message indicates that the worker node AMI version is still at 1.25, while the cluster is attempting to update to 1.28. This version gap is likely the primary reason for the node creation failure and the inability of new nodes to join the cluster.

To resolve this issue, you should focus on the following steps:

  1. Update Worker Node AMIs: Before attempting to update the cluster version, ensure that all worker nodes, including both CPU and GPU nodes, are running on an AMI version that is compatible with Kubernetes 1.28. Ideally, you should update the worker nodes to version 1.27 or 1.28.

  2. Staged Update Process: Instead of jumping directly from 1.25 to 1.28 for the worker nodes, consider a staged update process. Update the worker nodes to 1.26, then 1.27, and finally to 1.28 (see the example commands after this list). This gradual approach can help mitigate compatibility issues.

  3. GPU Node Specific Checks: Since the GPU node is failing to update, verify that you're using a GPU-compatible AMI and that all necessary GPU drivers and software are up-to-date and compatible with the target Kubernetes version.

  4. Add-on Compatibility: Ensure that all cluster add-ons, especially those related to networking (like VPC CNI) and GPU support, are updated to versions compatible with Kubernetes 1.28.

  5. Node Group Update Strategy: Review and possibly adjust your node group update strategy. Consider using a blue/green deployment strategy for worker nodes to minimize downtime and reduce the risk of update failures.

  6. Cluster Upgrade Process: In the future, always ensure that worker nodes are updated to a compatible version before upgrading the cluster control plane. This helps maintain version alignment and reduces the risk of compatibility issues.

  7. CloudWatch Logs Deep Dive: Conduct a more thorough analysis of the CloudWatch logs, focusing specifically on the GPU node's startup process and any errors it encounters when attempting to join the cluster.

  8. IAM Role Review: Double-check the IAM roles associated with the GPU nodes to ensure they have all necessary permissions, including any GPU-specific requirements.
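
As a rough sketch of steps 2 and 4 (names are placeholders; if the Kubernetes version flag differs in your CLI version, check aws eks update-nodegroup-version help):
# Step the managed node group up one minor version at a time (1.26, then 1.27, then 1.28)
aws eks update-nodegroup-version --cluster-name <cluster-name> --nodegroup-name <gpu-nodegroup-name> --kubernetes-version 1.26
# List add-on versions that are compatible with the target Kubernetes version
aws eks describe-addon-versions --kubernetes-version 1.28 --addon-name vpc-cni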

If these steps do not resolve the issue, you may need to consider creating a new node group with the correct AMI version and gradually migrating workloads to the new group. In extreme cases, you might need to create a new cluster with the desired version and migrate your workloads to ensure a clean, up-to-date environment.
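
If you take the new node group route, a minimal migration sketch could look like this (all names, subnets, role ARN, and instance types are placeholders; pick the AMI type that matches your GPU instance family):
# Create a replacement GPU node group on a current GPU-enabled AMI
aws eks create-nodegroup --cluster-name <cluster-name> --nodegroup-name <new-gpu-nodegroup-name> --node-role <node-role-arn> --subnets <subnet-id-1> <subnet-id-2> --ami-type AL2_x86_64_GPU --instance-types p3.2xlarge --scaling-config minSize=1,maxSize=2,desiredSize=1
# Drain the old GPU node so workloads reschedule, then remove the old node group
kubectl drain <old-gpu-node-name> --ignore-daemonsets --delete-emptydir-data
aws eks delete-nodegroup --cluster-name <cluster-name> --nodegroup-name <old-gpu-nodegroup-name>
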
Sources
Update existing cluster to new Kubernetes version - Amazon EKS
UpdateClusterVersion - Amazon EKS
Updating an Amazon EKS add-on - Amazon EKS
Community | Amazon EKS Upgrade Guidance (v1.25 to 1.26)

answered a month ago

I will add the following to the above troubleshooting steps:

  1. Review EFS Volume Configuration: Since you identified an issue with unmounting the EFS volume, ensure that:
  • EFS is not in use: If there are still running pods that mount the EFS volume, they may be preventing the unmount. Check for any active pods using that volume.
  • Unmount the EFS volume: If it's safe to do so, manually unmount the EFS from the node or terminate any pods that are still using it. You can find the pods that reference the EFS-backed PersistentVolumeClaim with the following command:
kubectl get pods --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.volumes[*].persistentVolumeClaim.claimName}{"\n"}{end}' | grep <your-efs-pvc-name>

  • Cleanup: Make sure any leftover resources related to the EFS are cleaned up to allow a clean node update.

  2. Node Group Configuration: Check that the GPU node is configured correctly. Ensure that the correct AMI with GPU support is being used for your GPU nodes; the warning message indicates that the AMI version should match the cluster version or be one minor version behind. You might need to launch a new node group specifically for 1.28 with the appropriate GPU AMI if the existing one is not compatible.

  3. Scaling and Health Checks:
  • Scale down: Temporarily scale down your node group and then scale it back up to see if this resolves the issue (see the scaling example after this list). This can sometimes trigger re-evaluation and re-creation of the nodes.
  • Health checks: Check the health status of the node group and ensure it is in a healthy state. You can do this through the EKS console or via the CLI:

aws eks describe-nodegroup --cluster-name <your-cluster-name> --nodegroup-name <your-nodegroup-name>
  4. Cluster Events and Logs:
  • Check Kubernetes events: Run the following command to list cluster events, which might provide more insight into the issue:
kubectl get events --all-namespaces

  • EC2 and EKS logs: Review both EC2 instance logs and EKS control plane logs in CloudWatch for any additional error messages that might indicate what went wrong during node creation.

  5. Manual Node Creation: If the automated process is failing, consider manually creating a new GPU node to see if that succeeds. This can show whether there is a more fundamental issue with the auto scaling or the node group configuration.

  6. Check IAM Role and Policy Updates: Ensure that there have not been any recent changes to IAM roles or policies that may affect the worker nodes' ability to join the cluster. Policies can change in ways that affect the permissions required for EKS operations.

  7. Network ACLs and Route Tables: Although you mentioned that the network checks passed, double-check the network ACLs and route tables associated with your subnets. Any misconfiguration here can cause communication issues between the worker nodes and the control plane.

  8. Force Upgrade with Specific Parameters: If you haven't already, consider using specific parameters with your force upgrade command, such as pinning an explicit AMI release version (see the sketch after this list), which might help troubleshoot further.
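
A minimal sketch of the scale-down/scale-up cycle from step 3 and a parameterized force upgrade from step 8 (names and sizes are placeholders; --release-version pins a specific EKS-optimized AMI release and is optional):
# Scale the node group down, then back up
aws eks update-nodegroup-config --cluster-name <cluster-name> --nodegroup-name <gpu-nodegroup-name> --scaling-config minSize=0,maxSize=2,desiredSize=0
aws eks update-nodegroup-config --cluster-name <cluster-name> --nodegroup-name <gpu-nodegroup-name> --scaling-config minSize=1,maxSize=2,desiredSize=1
# Force the upgrade while pinning an explicit AMI release version
aws eks update-nodegroup-version --cluster-name <cluster-name> --nodegroup-name <gpu-nodegroup-name> --release-version <ami-release-version> --force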

Focusing on the EFS issue and ensuring that the AMI and node configurations are correct is critical. If the problem persists, involving AWS Support could be the best path forward, especially if there’s a deeper issue at play.

EXPERT
answered a month ago

Hello,

If your cluster has node groups with GPU support (for example, p3.2xlarge), you must update the NVIDIA device plugin for Kubernetes DaemonSet on your cluster. Replace vX.X.X with your desired NVIDIA/k8s-device-plugin version before running the following command.

kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml

More info: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html#step4
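
After applying the manifest, you can confirm the plugin is running and that the GPU node advertises GPU capacity; the DaemonSet name below matches the static manifest but may differ in your setup:
kubectl get daemonset -n kube-system nvidia-device-plugin-daemonset
kubectl get nodes -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'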

AWS SUPPORT ENGINEER
answered a month ago
