Based on the information provided, it appears that you've encountered a significant issue while attempting to update your EKS cluster from version 1.27 to 1.28, particularly with the GPU worker node. The problem seems to stem from a mismatch between the worker node and cluster versions, which has persisted through previous updates.
The most critical issue identified is the AMI version mismatch. The warning message indicates that the worker node AMI version is still at 1.25, while the cluster is attempting to update to 1.28. This version gap is likely the primary reason for the node creation failure and the inability of new nodes to join the cluster.
To resolve this issue, you should focus on the following steps:
- Update Worker Node AMIs: Before attempting to update the cluster version, make sure that all worker nodes, both CPU and GPU, are running an AMI version that is compatible with Kubernetes 1.28. Ideally, update the worker nodes to 1.27 or 1.28 first (see the example commands after this list).
- Staged Update Process: Instead of jumping directly from 1.25 to 1.28 for the worker nodes, consider a staged update: move them to 1.26, then 1.27, and finally 1.28. This gradual approach helps mitigate compatibility issues.
- GPU Node Specific Checks: Since the GPU node is failing to update, verify that you're using a GPU-compatible AMI and that all necessary GPU drivers and software are up to date and compatible with the target Kubernetes version.
- Add-on Compatibility: Ensure that all cluster add-ons, especially those related to networking (like VPC CNI) and GPU support, are updated to versions compatible with Kubernetes 1.28 (see the example commands after this list).
- Node Group Update Strategy: Review and possibly adjust your node group update strategy. Consider using a blue/green deployment strategy for worker nodes to minimize downtime and reduce the risk of update failures.
- Cluster Upgrade Process: In the future, always make sure worker nodes are within the supported version skew of the control plane before upgrading the control plane. This keeps versions aligned and reduces the risk of compatibility issues.
- CloudWatch Logs Deep Dive: Conduct a more thorough analysis of the CloudWatch logs, focusing specifically on the GPU node's startup process and any errors it encounters when attempting to join the cluster.
- IAM Role Review: Double-check the IAM roles associated with the GPU nodes to ensure they have all necessary permissions, including any GPU-specific requirements.
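As referenced in the worker node AMI and add-on items above, a minimal sketch of the corresponding CLI calls is below; the cluster name, node group name, and add-on version are placeholders for your own values, and the exact versions to step through depend on where your control plane currently sits.
# Move the managed node group up one minor version at a time (a node group cannot be newer than the control plane)
aws eks update-nodegroup-version --cluster-name <your-cluster-name> --nodegroup-name <your-gpu-nodegroup-name> --kubernetes-version 1.26
# Check which VPC CNI add-on versions are published for Kubernetes 1.28
aws eks describe-addon-versions --kubernetes-version 1.28 --addon-name vpc-cni
# Update the add-on to a compatible version
aws eks update-addon --cluster-name <your-cluster-name> --addon-name vpc-cni --addon-version <compatible-version>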
If these steps do not resolve the issue, you may need to consider creating a new node group with the correct AMI version and gradually migrating workloads to the new group. In extreme cases, you might need to create a new cluster with the desired version and migrate your workloads to ensure a clean, up-to-date environment.
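If you do end up creating a fresh GPU node group on the current version, a sketch using the managed GPU AMI type might look like the following; the node group name, instance type, node role ARN, and subnet IDs are placeholders.
# Create a new managed node group with the EKS-optimized GPU AMI type
aws eks create-nodegroup --cluster-name <your-cluster-name> --nodegroup-name <new-gpu-nodegroup> --ami-type AL2_x86_64_GPU --instance-types p3.2xlarge --node-role <your-node-role-arn> --subnets <subnet-id-1> <subnet-id-2> --scaling-config minSize=1,maxSize=2,desiredSize=1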
Sources
Update existing cluster to new Kubernetes version - Amazon EKS
UpdateClusterVersion - Amazon EKS
Updating an Amazon EKS add-on - Amazon EKS
Community | Amazon EKS Upgrade Guidance (v1.25 to 1.26)
I will add the following to the troubleshooting steps above:
- Review EFS Volume Configuration: Since you identified an issue with unmounting the EFS volume, first make sure the EFS is not in use: pods that still have the volume mounted can prevent it from unmounting, so check for any active pods using that volume. If it is safe to do so, manually unmount the EFS from the node or terminate the pods that are still using it. You can find the pods with the following command:
kubectl get pods --all-namespaces -o wide | grep <your-efs-volume-name>
Cleanup: Make sure any leftover resources related to the EFS are cleaned up to allow for a clean node update.
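If it helps, one way to trace which workloads still hold the EFS-backed volume is through the bound PVC; the PVC name and namespace below are placeholders.
kubectl get pv    # the CLAIM column shows the PVC bound to the EFS-backed volume
kubectl describe pvc <your-efs-pvc-name> -n <your-namespace>    # the "Used By" field should list any pods still mounting it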
- Node Group Configuration: Check the GPU node configuration and ensure that the correct AMI with GPU support is being used for your GPU nodes. The warning message indicates that the AMI version should be the same as the cluster version or one minor version behind. You might need to launch a new node group specifically for 1.28 with the appropriate GPU AMI if the existing one is not compatible (see the AMI lookup example below).
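As a starting point for the AMI check, you can look up the current EKS-optimized GPU AMI for the target version from the public SSM parameter; the region is a placeholder, and the version in the parameter path should match the Kubernetes version you are targeting.
# Retrieve the recommended EKS-optimized GPU AMI ID for Kubernetes 1.28
aws ssm get-parameter --name /aws/service/eks/optimized-ami/1.28/amazon-linux-2-gpu/recommended/image_id --region <your-region> --query "Parameter.Value" --output text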
- Scaling and Health Checks: Temporarily scale the node group down and then back up to see if this resolves the issue; this can sometimes trigger re-evaluation and re-creation of the nodes. Also check that the node group is in a healthy state, either in the EKS console or via the CLI:
aws eks describe-nodegroup --cluster-name <your-cluster-name> --nodegroup-name <your-nodegroup-name>
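If you want to try the scale-down/scale-up from the CLI, a rough sketch is below; adjust the sizes to your setup, and wait for the first update to finish before issuing the second, since a node group only accepts one update at a time.
# Scale the node group down to zero, then back up
aws eks update-nodegroup-config --cluster-name <your-cluster-name> --nodegroup-name <your-nodegroup-name> --scaling-config minSize=0,maxSize=2,desiredSize=0
aws eks update-nodegroup-config --cluster-name <your-cluster-name> --nodegroup-name <your-nodegroup-name> --scaling-config minSize=1,maxSize=2,desiredSize=1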
- Cluster Events and Logs: Run the following command to get cluster events, which might provide more insight into the issue:
kubectl get events --all-namespaces
EC2 and EKS Logs: Review both EC2 instance logs and EKS control plane logs in CloudWatch for any additional error messages that might indicate what went wrong during the node creation process.
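If control plane logging is not already enabled, a sketch of turning it on and tailing the log group is below (the tail subcommand assumes AWS CLI v2); the cluster name is a placeholder.
# Enable all control plane log types so they are delivered to CloudWatch
aws eks update-cluster-config --cluster-name <your-cluster-name> --logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
# Tail the control plane log group for the last hour
aws logs tail /aws/eks/<your-cluster-name>/cluster --since 1h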
- Manual Node Creation: If the automated process keeps failing, consider manually creating a new GPU node to see if that succeeds. This can provide insight into whether there's a more fundamental issue with the auto scaling or the node group configuration.
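For a managed node group, the Auto Scaling group's recent activity often shows exactly why instances failed to launch or register; a sketch of pulling that up is below, with the node group and ASG names as placeholders.
# Find the Auto Scaling group behind the managed node group
aws eks describe-nodegroup --cluster-name <your-cluster-name> --nodegroup-name <your-nodegroup-name> --query "nodegroup.resources.autoScalingGroups[].name" --output text
# Review recent scaling activities and their error messages
aws autoscaling describe-scaling-activities --auto-scaling-group-name <asg-name-from-previous-command> --max-items 10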
- Check IAM Role and Policy Updates: Ensure that there have not been any recent changes to IAM roles or policies that may affect the worker nodes' ability to join the cluster. Policies sometimes change in ways that remove permissions required for EKS operations.
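A quick way to review the policies attached to the node instance role is below; the role name is a placeholder. For EKS worker nodes you would normally expect AmazonEKSWorkerNodePolicy, AmazonEC2ContainerRegistryReadOnly, and AmazonEKS_CNI_Policy (or equivalent custom policies).
# List the managed policies attached to the node instance role
aws iam list-attached-role-policies --role-name <your-node-instance-role>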
- Network ACLs and Route Tables: Although you mentioned that the network configuration checks passed, double-check the network ACLs and route tables associated with your subnets. Any misconfiguration here can cause communication issues between the worker nodes and the control plane.
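To review these quickly from the CLI, something like the following could work; the subnet ID is a placeholder for each of your worker node subnets.
# Inspect the network ACL and route table associated with a worker node subnet
aws ec2 describe-network-acls --filters Name=association.subnet-id,Values=<your-subnet-id>
aws ec2 describe-route-tables --filters Name=association.subnet-id,Values=<your-subnet-id>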
- Force Upgrade with Specific Parameters: If you haven't already, consider re-running the force upgrade with explicit parameters, which can help troubleshoot further (see the example command below).
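For reference, a forced managed node group update might look like the following; the --force flag continues the update even when pods cannot be drained because of pod disruption budgets, so use it with care.
# Force the node group version update even if pods cannot be drained
aws eks update-nodegroup-version --cluster-name <your-cluster-name> --nodegroup-name <your-nodegroup-name> --kubernetes-version 1.28 --force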
Focusing on the EFS issue and ensuring that the AMI and node configurations are correct is critical. If the problem persists, involving AWS Support could be the best path forward, especially if there’s a deeper issue at play.
Hello,
If your cluster has node groups with GPU support (for example, p3.2xlarge), you must update the NVIDIA device plugin for Kubernetes DaemonSet on your cluster. Replace vX.X.X with your desired NVIDIA/k8s-device-plugin version before running the following command.
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/vX.X.X/deployments/static/nvidia-device-plugin.yml
More info: https://docs.aws.amazon.com/eks/latest/userguide/update-cluster.html#step4
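After applying the manifest, you can sanity-check that the plugin is running and that the GPU node advertises GPU capacity; the DaemonSet name below assumes the default from the static manifest, and the node name is a placeholder.
# Confirm the device plugin DaemonSet is running
kubectl get daemonset nvidia-device-plugin-daemonset -n kube-system
# Confirm the GPU node exposes nvidia.com/gpu resources
kubectl describe node <your-gpu-node-name> | grep -i "nvidia.com/gpu"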