Hello AWS team!
I am trying to run a suite of inference recommendation jobs leveraging NVIDIA Triton Inference Server on a set of GPU instances (ml.g5.12xlarge, ml.g5.8xlarge, ml.g5.16xlarge) as well as AWS Inferentia machines (ml.inf2.2xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge).
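For reference, here is a minimal boto3 sketch of the kind of call that creates such a recommendation job; all names and ARNs below are placeholders, not my actual resources:

```python
import boto3

sm = boto3.client("sagemaker")

# Minimal sketch -- every name and ARN below is a placeholder, not my actual resource.
sm.create_inference_recommendations_job(
    JobName="llm-triton-recommender",
    JobType="Default",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputConfig={
        # Versioned model package wrapping the Triton image and the model artifacts;
        # its InferenceSpecification lists the g5/inf2 instance types above under
        # SupportedRealtimeInferenceInstanceTypes.
        "ModelPackageVersionArn": "arn:aws:sagemaker:us-east-1:123456789012:model-package/llm-triton/1",
        "JobDurationInSeconds": 7200,
    },
)
```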
Each job is customized with its own set of parameters. Multiple jobs are spawned for each instance type (as shown in the Inference Recommender panel in SageMaker):
- ml.g5.8xlarge, ml.g5.16xlarge, ml.inf2.2xlarge - 1 job each
  - All fail with error: Image size 12399514599 is greater than supported size 10737418240
- ml.inf2.24xlarge - 2 jobs
  - 1 job fails with error: Image size 12399514599 is greater than supported size 10737418240
  - 1 job fails with "Benchmark failed to finish within job duration"
- ml.inf2.8xlarge - 3 jobs
  - 2 jobs fail with error: Image size 12399514599 is greater than supported size 10737418240
  - 1 job fails with "Benchmark failed to finish within job duration"
- ml.g5.12xlarge - 4 jobs
  - 3 jobs fail with error: Image size 12399514599 is greater than supported size 10737418240
  - 1 job successfully completes!!
Since the models I am experimenting with are LLMs, their size combined with that of the associated image exceeds the 10 GB threshold discussed in this community question.
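For reference, the per-benchmark status and failure reason (including the image size error above) can also be pulled programmatically; a minimal sketch, with the job name as a placeholder:

```python
import boto3

sm = boto3.client("sagemaker")
job_name = "llm-triton-recommender"  # placeholder, not my actual job name

# Overall job status and top-level failure reason
job = sm.describe_inference_recommendations_job(JobName=job_name)
print(job["Status"], job.get("FailureReason"))

# Per-benchmark detail -- one step per benchmark shown in the console panel
steps = sm.list_inference_recommendations_job_steps(JobName=job_name)
for step in steps["Steps"]:
    bench = step.get("InferenceBenchmark", {})
    instance = bench.get("EndpointConfiguration", {}).get("InstanceType")
    print(step["Status"], instance, bench.get("FailureReason"))
```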
CloudWatch deep dive:
When I look into the logs associated with the Inferentia jobs, the following messages surface repeatedly:
- The NVIDIA Driver was not detected. GPU functionality will not be available. Use the NVIDIA Container Toolkit to start this container with GPU support
- [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
- Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
- CUDA memory pool disabled
- Failed to load '/opt/ml/model/::router' version 1: Invalid argument: instance group router_0 of model router specifies invalid or unsupported gpu id 0. GPUs with at least the minimum required CUDA compute compatibility of 6.000000
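For completeness, a minimal sketch of how these messages can be pulled programmatically; the log group name below is an assumption on my side, and the stream prefix is a placeholder:

```python
import boto3

logs = boto3.client("logs")

# The log group name is an assumption; the stream prefix is a placeholder job name.
resp = logs.filter_log_events(
    logGroupName="/aws/sagemaker/InferenceRecommendationsJobs",
    logStreamNamePrefix="llm-triton-recommender",
    filterPattern="CUDA",  # narrow the output to the GPU/driver-related messages
)
for event in resp["events"]:
    print(event["message"])
```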
My questions are:
- What determines the number of jobs spawned for each instance type (GPU count, number of Inferentia cores, something else)?
- How can one use the Inference Recommender service for LLMs, considering they routinely exceed the 10 GB AWS Lambda threshold?
- Why does one job complete successfully on the ml.g5.12xlarge while the remaining jobs (for this and the other instance types) fail with the image size error?
- How does one avoid the "Benchmark failed to finish within job duration" error? (The sketch after this list shows the settings I currently understand to be relevant.)
- Are there specific settings that one must account for when running recommendation jobs on Inferentia machines?
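Regarding the benchmark duration question above, below is a sketch of the knobs I currently understand to be relevant: an Advanced job with an explicit JobDurationInSeconds, TrafficPattern, ResourceLimit, and StoppingConditions. All names, ARNs, and values are placeholders, and I would appreciate confirmation that these are the right levers:

```python
import boto3

sm = boto3.client("sagemaker")

# Sketch with placeholder names/ARNs/values -- my current understanding of the
# relevant knobs, not a configuration I have verified to work.
sm.create_inference_recommendations_job(
    JobName="llm-triton-recommender-advanced",
    JobType="Advanced",
    RoleArn="arn:aws:iam::123456789012:role/SageMakerExecutionRole",
    InputConfig={
        "ModelPackageVersionArn": "arn:aws:sagemaker:us-east-1:123456789012:model-package/llm-triton/1",
        "JobDurationInSeconds": 10800,  # give the benchmark more time to finish
        "EndpointConfigurations": [
            {"InstanceType": "ml.inf2.8xlarge"},
            {"InstanceType": "ml.inf2.24xlarge"},
        ],
        "TrafficPattern": {
            "TrafficType": "PHASES",
            "Phases": [
                {"InitialNumberOfUsers": 1, "SpawnRate": 1, "DurationInSeconds": 600},
            ],
        },
        "ResourceLimit": {"MaxNumberOfTests": 4, "MaxParallelOfTests": 2},
    },
    StoppingConditions={
        "MaxInvocations": 500,
        "ModelLatencyThresholds": [{"Percentile": "P95", "ValueInMilliseconds": 10000}],
    },
)
```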
Hello Giovanni,
This question is an upgraded repost of the questions I raised in the first link you attached to this answer (reposted because of the superficial treatment the topic received there). Unfortunately, none of the resources attached to your answer address the questions I am raising here.