EKS 1.33 with NVIDIA GPU Node - failed to provision g5.2xlarge


Hi, I am trying to provision an NVIDIA GPU worker node in EKS 1.33. I have also increased my Spot and On-Demand instance quotas to 12. I tried the configuration below for both Spot and On-Demand, but it failed to launch the instances in both cases:

Instance Type: g5.2xlarge
AMI ID: ami-0106255e3dc6be482 (used amazon-linux-2023/x86_64/nvidia for the latest approved AL2023 NVIDIA x86-based instances)
AMI Location: /aws/service/eks/optimized-ami/1.33/amazon-linux-2023/x86_64/nvidia/recommended/image_id
EKS Cluster version: 1.33

I used a launch template to specify the image ID (retrieved from the SSM parameter above) and a 100 GB EBS volume.
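
For reference, this is roughly how the image ID can be read from that SSM parameter (a minimal sketch; the region is a placeholder, substitute your own):

    # Resolve the recommended EKS 1.33 AL2023 NVIDIA AMI ID from SSM
    aws ssm get-parameter \
      --name /aws/service/eks/optimized-ami/1.33/amazon-linux-2023/x86_64/nvidia/recommended/image_id \
      --region us-east-1 \
      --query "Parameter.Value" \
      --output text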

Errors:

NodeCreationFailure: Instances failed to join the kubernetes cluster

AsgInstanceLaunchFailures: Could not launch Spot Instances. UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration. Please adjust your request and try again. Launching EC2 instance failed.

Why am I getting the EC2 launch error, and why are the EC2 instances not able to join the EKS cluster?

Any help would be highly appreciated. Thanks.

1 Answer
Accepted Answer

Based on the error messages you're encountering, there are two main issues:

  1. Insufficient Capacity Error: The error "UnfulfillableCapacity - Unable to fulfill capacity due to your request configuration" indicates that AWS doesn't currently have enough available capacity of g5.2xlarge instances to fulfill your request in the region/availability zone you're trying to use. This is a common issue with specialized GPU instance types like g5.2xlarge, which have more limited availability compared to general-purpose instances.

  2. Instances Failed to Join Kubernetes Cluster: This secondary error would only be relevant if the instances could actually launch, which they currently cannot due to the capacity issue.

Here are some recommendations to resolve these issues:

For the capacity issue:

  • Try launching the instance in a different Availability Zone within your region (see the sketch after this list for checking which zones offer g5.2xlarge)
  • Consider using an alternative GPU instance type that might have more available capacity, such as g4dn instances
  • If your workload allows, try using a different region where capacity might be available
  • For critical workloads, you might want to explore using Capacity Reservations
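
As a quick sanity check before switching zones or regions, you can list which Availability Zones in your region offer g5.2xlarge at all (a minimal sketch; the region is a placeholder):

    # List the AZs in the region that offer the g5.2xlarge instance type
    aws ec2 describe-instance-type-offerings \
      --location-type availability-zone \
      --filters Name=instance-type,Values=g5.2xlarge \
      --region us-east-1 \
      --query "InstanceTypeOfferings[].Location" \
      --output text

Note that an AZ offering the instance type does not guarantee available capacity at any given moment, especially for Spot; UnfulfillableCapacity can still occur.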

For GPU setup in EKS:

  • Ensure you're using the correct AMI (which you are - the EKS optimized accelerated AMI)
  • When using the AL2023 accelerated AMI, note that while it includes the NVIDIA GPU driver, you still need to install the NVIDIA Kubernetes device plugin separately, typically via a DaemonSet (see the sketch after this list)
  • If you create the node group with eksctl, it detects GPU instance types and installs the NVIDIA device plugin for you automatically
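
A rough sketch of installing the device plugin as a DaemonSet (the version tag and manifest path are assumptions; check the NVIDIA/k8s-device-plugin releases for the current ones, as the manifest location has moved between releases):

    # Deploy the NVIDIA device plugin DaemonSet so GPUs are advertised to Kubernetes
    kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

    # Verify that nodes now report allocatable GPUs
    kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"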

The primary issue you need to solve first is the capacity limitation. Once you can successfully launch the instances, you can address any remaining configuration issues related to the GPU setup in your EKS cluster.
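
If the instances do launch but still fail to join the cluster, one common cause is the node group's IAM role not being mapped into the cluster, for example via the aws-auth ConfigMap when you are not using EKS access entries. A quick check, assuming you have kubectl access to the cluster:

    # Inspect the aws-auth ConfigMap; the node IAM role ARN should appear
    # under mapRoles with the system:bootstrappers and system:nodes groups
    kubectl get configmap aws-auth -n kube-system -o yaml

Also verify that the node subnets and security groups allow the nodes to reach the cluster's API endpoint, since nodes that cannot reach the control plane fail to join with this same NodeCreationFailure error.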
Sources
aws ap-northeast-1 gpu instance is not work!!! help | AWS re:Post
Compute and Autoscaling - Amazon EKS
GPU Support - Eksctl User Guide

answered 3 months ago by an AWS Support Engineer
reviewed 3 months ago
  • I tried with On-Demand but am still having the issue: NodeCreationFailure - Instances failed to join the kubernetes cluster
