Inference Recommender jobs fail due to image size error


Hello AWS team!

I am trying to run a suite of Inference Recommender jobs using NVIDIA Triton Inference Server on a set of GPU instances (ml.g5.12xlarge, ml.g5.8xlarge, ml.g5.16xlarge) as well as AWS Inferentia instances (ml.inf2.2xlarge, ml.inf2.8xlarge, ml.inf2.24xlarge).

The following parameters customize each job (a sketch of how they can be passed is shown after the list):

  • SAGEMAKER_MODEL_SERVER_WORKERS = 1

  • OMP_NUM_THREADS = 3

  • JobType = Default (not Advanced)
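
For context, here is a minimal sketch of how these parameters can be wired into a Default recommendation job with boto3. The model package layout, names, ARNs, and S3 URIs are placeholders rather than my exact setup; the environment variables are assumed to be passed through the model package container.

```python
import boto3

sm = boto3.client("sagemaker")

# Placeholder names, ARNs, and S3 URIs. The model package wraps the Triton image
# plus the LLM artifacts, with the two environment variables from the list above.
sm.create_model_package(
    ModelPackageName="triton-llm-pkg",
    SamplePayloadUrl="s3://<bucket>/sample-payload.tar.gz",
    InferenceSpecification={
        "Containers": [
            {
                "Image": "<account>.dkr.ecr.<region>.amazonaws.com/triton-llm:latest",
                "ModelDataUrl": "s3://<bucket>/model.tar.gz",
                "Environment": {
                    "SAGEMAKER_MODEL_SERVER_WORKERS": "1",
                    "OMP_NUM_THREADS": "3",
                },
            }
        ],
        "SupportedRealtimeInferenceInstanceTypes": [
            "ml.g5.8xlarge", "ml.g5.12xlarge", "ml.g5.16xlarge",
            "ml.inf2.2xlarge", "ml.inf2.8xlarge", "ml.inf2.24xlarge",
        ],
        "SupportedContentTypes": ["application/octet-stream"],
        "SupportedResponseMIMETypes": ["application/json"],
    },
)

# JobType = Default, i.e. no custom traffic pattern or per-endpoint configuration.
sm.create_inference_recommendations_job(
    JobName="llm-triton-recommendation",
    JobType="Default",
    RoleArn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    InputConfig={
        "ModelPackageVersionArn": "arn:aws:sagemaker:<region>:<account>:model-package/triton-llm-pkg"
    },
)
```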

A varying number of jobs is spawned for each instance type (as shown in the Inference Recommender panel in SageMaker):

  • ml.g5.8xlarge, ml.g5.16xlarge, ml.inf2.2xlarge - 1 job each

    • All fail with error: Image size 12399514599 is greater than supported size 10737418240
  • ml.inf2.24xlarge - 2 jobs

    • 1 job fails with error: Image size 12399514599 is greater than supported size 10737418240
    • 1 job fails with "Benchmark failed to finish within job duration"
  • ml.inf2.8xlarge - 3 jobs

    • 2 jobs fail with error: Image size 12399514599 is greater than supported size 10737418240
    • 1 job fails with "Benchmark failed to finish within job duration"
  • ml.g5.12xlarge - 4 jobs

    • 3 jobs fail with error: Image size 12399514599 is greater than supported size 10737418240
    • 1 job successfully completes!!

Since the models I am experimenting with are LLMs, their size combined with the associated image exceeds the 10 GB threshold discussed in this community question.
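
For reference, the compressed image size that ECR reports can be checked directly; the repository name and tag below are placeholders, and I am assuming this is the size the error message refers to.

```python
import boto3

ecr = boto3.client("ecr")

# Placeholder repository/tag; imageSizeInBytes is the compressed size ECR reports.
resp = ecr.describe_images(
    repositoryName="triton-llm",
    imageIds=[{"imageTag": "latest"}],
)
size_bytes = resp["imageDetails"][0]["imageSizeInBytes"]
print(f"{size_bytes} bytes (~{size_bytes / 1024**3:.1f} GiB) vs. the 10737418240-byte limit")
```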

CloudWatch deep dive: Looking into the logs associated with the Inferentia jobs, I found the following messages surfacing repeatedly (a sketch of one way to pull them follows the list):

  • The NVIDIA Driver was not detected. GPU functionality will not be available. Use the NVIDIA Container Toolkit to start this container with GPU support
  • [Torch-TensorRT] - Unable to read CUDA capable devices. Return status: 35
  • Unable to allocate pinned system memory, pinned memory pool will not be available: CUDA driver version is insufficient for CUDA runtime version
  • CUDA memory pool disabled
  • Failed to load '/opt/ml/model/::router' version 1: Invalid argument: instance group router_0 of model router specifies invalid or unsupported gpu id 0. GPUs with at least the minimum required CUDA compute compatibility of 6.000000
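
For completeness, here is a sketch of one way these messages can be pulled out of CloudWatch with boto3, assuming the benchmark endpoints write to the usual /aws/sagemaker/Endpoints/ log groups; the filter terms are just the ones relevant to the messages above.

```python
import boto3

logs = boto3.client("logs")

# The recommender's benchmark endpoints are assumed to log like regular endpoints:
# iterate over the endpoint log groups and surface driver / compute-capability messages.
groups = logs.describe_log_groups(logGroupNamePrefix="/aws/sagemaker/Endpoints/")

for group in groups["logGroups"]:
    events = logs.filter_log_events(
        logGroupName=group["logGroupName"],
        filterPattern="?CUDA ?NVIDIA ?gpu",  # match any of these terms
    )
    for event in events["events"]:
        print(group["logGroupName"], event["message"].strip())
```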

My questions are:

  • What determines the number of jobs spawned for each instance type (GPU count, Inferentia core count)?
  • How can one use the Inference Recommender service for LLMs, considering they routinely exceed the 10 GB AWS Lambda threshold?
  • Why does one job complete successfully on ml.g5.12xlarge when the remaining jobs (for this and the other instance types) fail with the image size error?
  • How does one avoid the "Benchmark failed to finish within job duration" error? (See the sketch after this list.)
  • Are there specific settings that one must account for when running recommendation jobs on Inferentia instances?
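
Regarding the duration error, the only related knob I am aware of is InputConfig.JobDurationInSeconds on the recommendation job; whether raising it is the intended fix is part of what I am asking. A sketch with placeholder values:

```python
import boto3

sm = boto3.client("sagemaker")

# Same Default job as above, with an explicit (placeholder) duration budget.
sm.create_inference_recommendations_job(
    JobName="llm-triton-recommendation-longer",
    JobType="Default",
    RoleArn="arn:aws:iam::<account>:role/<sagemaker-execution-role>",
    InputConfig={
        "ModelPackageVersionArn": "arn:aws:sagemaker:<region>:<account>:model-package/triton-llm-pkg",
        "JobDurationInSeconds": 7200,  # assumption: a larger budget for LLM-sized benchmarks
    },
)
```
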
1 Answer
Hello Giovanni,

This question is an upgraded repost of the questions I addressed in the first link you attached to this answer (a consequence of the superficial treatment of the topic there). Unfortunately, none of the resources attached in your answer address the questions I am raising.
