Error using SageMaker, Custom Triton Container, and Hugging Face/PyTorch SageMaker Pre-Built Docker Image


I'm using SageMaker to host a multi-container endpoint that includes a multi-model Triton container and a post-processing single-model container. I'm setting it up as follows:

import boto3

sm_client = boto3.client("sagemaker")

# Multi-model Triton container (first container in the pipeline)
mme_container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT": "0.8",
    },
}

# Post-processing single-model container (second container in the pipeline)
torch_container = {
    "Image": "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.0.0-transformers4.28.1-gpu-py310-cu118-ubuntu20.04",
    "ModelDataUrl": f"{bucket_url}/post_process.tar.gz",  # bucket_url is our S3 bucket URI
}

instance_type = "ml.g5.xlarge"

# Register both containers as a single serial (multi-container) model
response = sm_client.create_model(
    ModelName=serial_model_name,
    ExecutionRoleArn=role,
    Containers=[mme_container, torch_container],
)

create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            "InstanceType": instance_type,
            "InitialVariantWeight": 1,
            "InitialInstanceCount": 1,
            "ModelName": serial_model_name,
            "VariantName": "AllTraffic",
        }
    ],
)
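
The endpoint itself is then created from this configuration and polled until it leaves the Creating status. That step isn't shown above; a minimal sketch, assuming an endpoint_name variable defined alongside the other names:

# Sketch only: create the endpoint from the config above and wait on it.
# `endpoint_name` is an assumed variable; this step is not shown in the original setup.
sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

# Blocks until the endpoint is InService; raises WaiterError if creation fails or times out.
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

status = sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"]
print(status)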

The Dockerfile for our extension of the pre-built SageMaker Triton container:

# Pre-built SageMaker Triton Inference Server image (us-east-1)
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/sagemaker-tritonserver:22.07-py3
# us-west-2 equivalent:
# FROM 301217895009.dkr.ecr.us-west-2.amazonaws.com/sagemaker-tritonserver:22.07-py3

LABEL com.amazonaws.sagemaker.capabilities.accept-bind-to-port=true
LABEL com.amazonaws.sagemaker.capabilities.multi-models=true

ENV SAGEMAKER_MULTI_MODEL=true
ENV SAGEMAKER_BIND_TO_PORT=8080

EXPOSE 8080

RUN pip install -U pip

RUN pip install --upgrade diffusers==0.25.0 transformers==4.36.1 accelerate numpy xformers scipy omegaconf torch torchvision pytorch_lightning pynvml

RUN pip install git+https://github.com/sberbank-ai/Real-ESRGAN.git

RUN apt-get update && apt-get install ffmpeg libsm6 libxext6  -y

The Errors: The endpoint stays in the Creating status for about 1-2 hours, and during that time it follows this pattern (the container logs land in CloudWatch; a sketch of how to inspect them follows this list):

  • There are no logs from either container_1 or container_2 for the first ~15-30 minutes
  • When logs finally do appear, they come only from container_2, all the way until the endpoint fails
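
For reference, the per-container logs referred to above live in CloudWatch under the endpoint's log group; a minimal sketch of inspecting them with boto3, assuming the same endpoint_name as before (the exact log-stream naming per container may vary):

import boto3

logs_client = boto3.client("logs")

# SageMaker endpoint containers write to this log group.
log_group = f"/aws/sagemaker/Endpoints/{endpoint_name}"

# List whatever log streams exist so far; with the failing setup,
# only streams for container_2 ever show up here.
streams = logs_client.describe_log_streams(logGroupName=log_group)["logStreams"]
for stream in streams:
    print(stream["logStreamName"])
    events = logs_client.get_log_events(
        logGroupName=log_group,
        logStreamName=stream["logStreamName"],
        limit=50,
    )["events"]
    for event in events:
        print(event["message"])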

Interestingly, when using older PyTorch or Hugging Face Docker images, both containers load successfully.

We've tried various things, such as:

  • Increasing the instance_type to 4xlarge
  • Adding various environment variables to the MME container, such as "SAGEMAKER_PROGRAM": "", "SAGEMAKER_SUBMIT_DIRECTORY": "", "SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT": "0.8", "SAGEMAKER_MULTI_MODEL": "true", "SM_LOG_LEVEL": "10" (see the sketch after this list)
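
For concreteness, the MME container definition with those extra environment variables looked roughly like this (a sketch of one configuration we tried, not a known fix):

# Variant of the MME container definition with the extra environment variables listed above.
mme_container = {
    "Image": mme_triton_image_uri,
    "ModelDataUrl": model_data_url,
    "Mode": "MultiModel",
    "Environment": {
        "SAGEMAKER_PROGRAM": "",
        "SAGEMAKER_SUBMIT_DIRECTORY": "",
        "SAGEMAKER_TRITON_MODEL_LOAD_GPU_LIMIT": "0.8",
        "SAGEMAKER_MULTI_MODEL": "true",
        "SM_LOG_LEVEL": "10",
    },
}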

Of everything we've tried, the only way we've managed to get logs from container_1 was by using one of the following Docker images:

  • 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:1.7.1-transformers4.6.1-gpu-py36-cu110-ubuntu18.04
  • 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-inference:1.3-gpu-py3

And while those pre-built images worked together with our custom extended sagemaker-tritonserver image, they were too old to handle the requirements of our model.

Any help with debugging this issue would be greatly appreciated.
