The issue you're experiencing is not caused by broken cluster nodes but by insufficient capacity for the specific instance type (hpc7g.8xlarge) in the Ireland region. This is a known situation that occurs when AWS doesn't have enough available capacity to provision the requested instance type in a particular region or Availability Zone.
Insufficient capacity errors can persist for extended periods, especially for specialized or high-demand instance types such as the HPC-optimized instances. A shortage that lasts 5 days is not unusual in such situations.
Here are some points to consider and potential solutions:
- The issue is specific to the hpc7g.8xlarge instance type, which is why other instance types are working fine.
- Restarting nodes or resubmitting jobs won't resolve the issue as it's a capacity problem on AWS's end.
- To address this, you could try the following:
  a. Use a different instance type: Consider using an equivalent instance type that might have more available capacity.
  b. Try a different Availability Zone within the Ireland region if possible.
  c. Wait and retry: Sometimes, capacity issues resolve themselves as resources become available.
  d. If this is a critical workload, you might want to consider temporarily moving your workload to a different AWS region where capacity is available.
- For long-term solutions, you could:
  a. Request a capacity increase from AWS Support for your preferred instance type in the Ireland region.
  b. Implement a multi-region strategy for your HPC workloads to mitigate regional capacity issues.
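The short-term options above (alternative instance type, alternative Availability Zone, wait and retry) can be combined into one fallback loop. The following is only a minimal sketch, not an AWS API: `launch_with_fallback`, the `InsufficientCapacity` exception, and the `launch` callback are hypothetical names introduced for illustration. In a real setup, `launch` would wrap whatever call starts your nodes and would raise `InsufficientCapacity` when AWS reports an insufficient capacity error.

```python
import time
from itertools import product


class InsufficientCapacity(Exception):
    """Hypothetical stand-in for an insufficient capacity error from AWS."""


def launch_with_fallback(launch, instance_types, zones,
                         rounds=3, base_delay=60.0, sleep=time.sleep):
    """Try each (instance type, zone) pair in order of preference.

    `launch` is a caller-supplied function (an assumption of this sketch)
    that takes an instance type and a zone, returns on success, and raises
    InsufficientCapacity when no capacity is available.
    """
    for attempt in range(rounds):
        # The preferred instance type is tried in every zone before
        # the loop falls back to the next instance type.
        for instance_type, zone in product(instance_types, zones):
            try:
                return launch(instance_type, zone)
            except InsufficientCapacity:
                continue  # no capacity here, try the next combination
        # Every combination failed: wait with exponential backoff,
        # except after the final round.
        if attempt < rounds - 1:
            sleep(base_delay * 2 ** attempt)
    raise InsufficientCapacity(
        "no capacity in any candidate after %d rounds" % rounds
    )
```

For example, calling `launch_with_fallback(my_launch, ["hpc7g.8xlarge", "c7gn.16xlarge"], ["eu-west-1a", "eu-west-1b"])` would try hpc7g.8xlarge in both zones before falling back. The second instance type and the zone names are placeholders, so check which types suit your workload and which zones actually offer them.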
Remember, this is not an issue with your cluster configuration or nodes, but rather a temporary limitation in AWS's available capacity for the specific instance type you're requesting.
Sources
Insufficient-capacity when trying to add a Reader to ap-southeast-2a Aurora Postgres of Size db.r8g.large | AWS re:Post
Jobs stuck in a RUNNABLE status - AWS Batch
Slurm cluster fast insufficient capacity fail-over - AWS ParallelCluster
