How do I troubleshoot issues when I bring my custom container to SageMaker for training or inference?

I want to troubleshoot issues when I bring my custom container to Amazon SageMaker for training or inference.

Short description

Customize your container images in SageMaker using one of the following approaches:

  • Extend a prebuilt SageMaker container.
  • Bring your own container.
  • Build a container image from scratch.

With any of these approaches, you might get errors that are related to an incorrectly built container image. Be sure that you configure the container correctly.

Resolution

Extend a prebuilt SageMaker container

Use this approach to customize your environment or framework by adding functionality to a prebuilt image. With this approach, you don't build the container image from scratch because the deep learning libraries are already defined in the image.

Be sure that the environment variables SAGEMAKER_SUBMIT_DIRECTORY and SAGEMAKER_PROGRAM are set in the Dockerfile. Then, install the required additional libraries in your Dockerfile.

To install the required additional libraries, use a Dockerfile similar to the following:

# SageMaker PyTorch image
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:1.5.1-cpu-py36-ubuntu16.04

ENV PATH="/opt/ml/code:${PATH}"

# This environment variable is used by the SageMaker PyTorch container to determine the user code directory
ENV SAGEMAKER_SUBMIT_DIRECTORY /opt/ml/code

# Install the additional libraries using pip
COPY requirements.txt ./requirements.txt
RUN pip install -r requirements.txt

# /opt/ml and all subdirectories are used by SageMaker; store your user code in the /code subdirectory
COPY cifar10.py /opt/ml/code/cifar10.py

# Defines cifar10.py as the script entry point
ENV SAGEMAKER_PROGRAM cifar10.py

After the image builds successfully, run the container in local mode to make sure that the image works as expected.
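For example, you can start a training job in local mode through the SageMaker Python SDK. The following is a minimal sketch; the image URI, role ARN, and data path are placeholder values that you must replace with your own:

from sagemaker.pytorch import PyTorch

# Placeholder values: replace the image URI, role ARN, and data location with your own
estimator = PyTorch(
    entry_point="cifar10.py",
    image_uri="123456789012.dkr.ecr.us-east-1.amazonaws.com/pytorch-extended:latest",
    role="arn:aws:iam::123456789012:role/MySageMakerExecutionRole",
    instance_count=1,
    instance_type="local",  # run the container locally instead of on a managed instance
)

estimator.fit({"training": "file:///tmp/cifar10-data"})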

For more information, see Extend a pre-built container.

Bring your own container

Use this approach when you have an image for processing data, model training, or real-time inference with features and safety requirements that prebuilt SageMaker images don't support.

Be sure that you installed the respective SageMaker Toolkit libraries for training or inference. These toolkits define the location for code and other resources. They also define the entry point that contains the code to run when the container is started. When you create a SageMaker training job or inference endpoint, SageMaker creates the following directories:

/opt/ml  
    ├── input
    │
    ├── model
    │
    ├── code
    │
    └── output
          └── failure

When you run a training job, the /opt/ml/input directory contains information about the data channel that's used to access the data stored in Amazon Simple Storage Service (Amazon S3). The training script (train.py), along with its dependencies, is stored in /opt/ml/code. Be sure that the script writes the final model to the /opt/ml/model directory after the training job completes.

When you host a trained model on SageMaker to make inferences, the model is stored in /opt/ml/model. The inference code (inference.py) is stored in /opt/ml/code.
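As a minimal sketch, a bring-your-own training image might install the SageMaker Training Toolkit (the sagemaker-training package) so that the container follows these conventions. The base image and script name below are example values:

FROM python:3.8

# The SageMaker Training Toolkit sets up the /opt/ml directory structure
# and runs the script that's named in SAGEMAKER_PROGRAM
RUN pip install sagemaker-training

# Copy the training script to the location that SageMaker expects
COPY train.py /opt/ml/code/train.py

# Defines train.py as the script entry point
ENV SAGEMAKER_PROGRAM train.py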

For more information, see Adapting your own Docker container to work with SageMaker.

Build a container from scratch

If you have a custom algorithm and don't have a custom container image, then it's a best practice to use this approach.

To make sure that the container runs as an executable, use the exec form of the ENTRYPOINT instruction in your Dockerfile:

ENTRYPOINT ["python", "cifar10.py"]

If the training job is successful, then the training script must exit with an exit code of 0. If the training job is unsuccessful, then the script must exit with a non-zero exit code.

Be sure that the final model is written to /opt/ml/model, and all the dependencies and artifacts are stored in /opt/ml/output. If a training job fails, then the script must write the failure information to /opt/ml/output/failure.
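The following minimal sketch shows a training script that follows this contract. The train() function is a placeholder for your own training logic:

import sys
import traceback

def train():
    # Placeholder: run your training logic and write the final model to /opt/ml/model
    pass

if __name__ == "__main__":
    try:
        train()
        sys.exit(0)  # a zero exit code marks the training job as Completed
    except Exception:
        # SageMaker surfaces the contents of this file as the failure reason
        with open("/opt/ml/output/failure", "w") as f:
            f.write(traceback.format_exc())
        sys.exit(1)  # a non-zero exit code marks the training job as Failed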

When you create an inference endpoint, save the model in the FILENAME.tar.gz format. The container must respond to HTTP POST requests on /invocations for inference and to HTTP GET requests on /ping for endpoint health checks. For more information, see Create a container with your own algorithms and models.
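The following minimal sketch implements this HTTP contract with Flask. The web framework and handler logic are assumptions for illustration; SageMaker requires only that the container responds on port 8080:

from flask import Flask, Response, request

app = Flask(__name__)

@app.route("/ping", methods=["GET"])
def ping():
    # Return 200 when the container is healthy and the model is loaded
    return Response(status=200)

@app.route("/invocations", methods=["POST"])
def invocations():
    payload = request.get_data()
    # Placeholder: run inference on the payload with your loaded model
    result = b"..."
    return Response(result, status=200, mimetype="application/octet-stream")

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)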

Related information

Use the Amazon SageMaker local mode to train on your notebook instance
