Error using SageMaker, Custom Triton Container, and Hugging Face/PyTorch SageMaker Pre-Built Docker Image


I'm using SageMaker to host a multi-container endpoint that includes a multi-model Triton container and a post-processing single-model container. I'm setting it up as follows:

import boto3

sm_client = boto3.client("sagemaker")

# Multi-model Triton container (first container in the pipeline)
mme_container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT": "0.8",
    },
}

# Post-processing single-model container (second container in the pipeline)
torch_container = {
    "Image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04",
    "ModelDataUrl": f"{bucket_url}/post_process.tar.gz",  # bucket_url is our S3 bucket URI
}

instance_type = "ml.g5.xlarge"

# Register both containers as a single serial (multi-container) model
response = sm_client.create_model(
    ModelName=serial_model_name,
    ExecutionRoleArn=role,
    Containers=[mme_container, torch_container],
)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": serial_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)
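
The endpoint itself is then created from this configuration and polled until it leaves the Creating status. That step isn't shown above; a minimal sketch, assuming an endpoint_name variable defined alongside the other names:

# Sketch only: create the endpoint from the config above and wait on it.
# `endpoint_name` is an assumed variable; this step is not shown in the original setup.
sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Blocks until the endpoint is InService; raises WaiterError if creation fails or times out.
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(status)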

The Dockerfile for our extension of the pre-built SageMaker Triton container:

# Pre-built SageMaker Triton Inference Server image (us-east-1)
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:22.07-py3
# us-west-2 equivalent:
# FROM 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.07-py3

LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true

ENV SAGEMAKER_MULTI_MODEL=true
ENV SAGEMAKER_BIND_TO_PORT=8080

EXPOSE 8080

RUN pip install -U pip

RUN pip install --upgrade diffusers==0.25.0 transformers==4.36.1 accelerate numpy xformers scipy omegaconf torch torchvision pytorch_lightning pynvml

RUN pip install git+https://github.com/sberbank-ai/Real-ESRGAN.git

RUN apt-get update && apt-get install ffmpeg libsm6 libxext6  -y

The Errors: The endpoint stays in the Creating status for about 1-2 hours, and during that time it follows this pattern (the container logs land in CloudWatch; a sketch of how to inspect them follows this list):

  • There are no logs from either container_1 or container_2 for the first ~15-30 minutes
  • When logs finally do appear, they come only from container_2, all the way until the endpoint fails
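
For reference, the per-container logs referred to above live in CloudWatch under the endpoint's log group; a minimal sketch of inspecting them with boto3, assuming the same endpoint_name as before (the exact log-stream naming per container may vary):

import boto3

logs_client = boto3.client("logs")

# SageMaker endpoint containers write to this log group.
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

# List whatever log streams exist so far; with the failing setup,
# only streams for container_2 ever show up here.
streams = logs_client.describe_log_streams(logGroupName=log_group)["logStreams"]
for stream in streams:
    print(stream["logStreamName"])
    events = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=stream["logStreamName"],
        limit=50,
    )["events"]
    for event in events:
        print(event["message"])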

Interestingly, when using older PyTorch or Hugging Face Docker images, both containers load successfully.

We've tried various things, such as:

  • Increasing the instance_type to 4xlarge
  • Adding various environment variables to the MME container, such as "SAGEMAKER_PROGRAM": "", "SAGEMAKER_SUBMIT_DIRECTORY": "", "SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT": "0.8", "SAGEMAKER_MULTI_MODEL": "true", "SM_LOG_LEVEL": "10" (see the sketch after this list)
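
For concreteness, the MME container definition with those extra environment variables looked roughly like this (a sketch of one configuration we tried, not a known fix):

# Variant of the MME container definition with the extra environment variables listed above.
mme_container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_PROGRAM": "",
        "SAGEMAKER_SUBMIT_DIRECTORY": "",
        "SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT": "0.8",
        "SAGEMAKER_MULTI_MODEL": "true",
        "SM_LOG_LEVEL": "10",
    },
}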

Of everything we've tried, the only way we've managed to get logs from container_1 was by using one of the following Docker images:

  • 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.7.1-transformers4.6.1-gpu-py36-cu110-ubuntu18.04
  • 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.3-gpu-py3

And while those pre-built images worked together with our custom extended sagemaker-tritonserver image, they were too old to handle the requirements of our model.

Any help with debugging this issue would be greatly appreciated.
