Why does my SageMaker endpoint go into the failed state when I create or update an endpoint?


I want to troubleshoot why the creation or update of my Amazon SageMaker endpoint has failed.

Resolution

When the creation or update of your SageMaker endpoint fails, SageMaker provides the reason for the failure. Use either of the following options to review this reason:

  • Check the endpoint in the SageMaker console. The console reports the reason for the failure.
  • Run the AWS Command Line Interface (AWS CLI) command describe-endpoint. Check the FailureReason field to see the reason for the failure.

Note: If you receive errors when you run AWS CLI commands, then be sure that you have the most recent version of the AWS CLI.
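
If you prefer the AWS SDK for Python (Boto3) over the AWS CLI, the same field can be read from the DescribeEndpoint response. The following is a minimal sketch, not an official snippet: the helper names are hypothetical, and it assumes that Boto3 is installed and that credentials and a Region are already configured.

```python
from typing import Optional


def failure_reason(description: dict) -> Optional[str]:
    """Return the FailureReason from a DescribeEndpoint response, or None if the endpoint has not failed."""
    if description.get("EndpointStatus") != "Failed":
        return None
    return description.get("FailureReason", "No FailureReason was reported")


def fetch_failure_reason(endpoint_name: str) -> Optional[str]:
    """Call DescribeEndpoint through Boto3 and extract the failure reason."""
    import boto3  # imported lazily so failure_reason() works without Boto3 installed

    client = boto3.client("sagemaker")
    return failure_reason(client.describe_endpoint(EndpointName=endpoint_name))
```

For example, print(fetch_failure_reason("my-endpoint")) prints the reason for a failed endpoint, where my-endpoint is a placeholder name.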

The following are some of the failure reasons and their resolution methods:

Unable to provision requested ML compute capacity due to InsufficientInstanceCapacity error

You might get the following error when you try to create an endpoint:

Unable to provision requested ML compute capacity due to InsufficientInstanceCapacity error

This error occurs when AWS doesn't have sufficient capacity to provision the instances requested for your endpoint.

You can resolve this error with the following options:

  • Wait for a few minutes and try again because capacity can shift frequently.
  • If you use multiple instances for your endpoint, then try to create the endpoint with a smaller number of instances. If you have Auto Scaling configured, then SageMaker scales up or down as required and as capacity permits.
  • Try a different instance type that supports your workload. After you create the endpoint, you can update it to use the desired instance type. To maximize availability, SageMaker uses a blue/green deployment method, so you can transition to a new instance type without affecting your current production workloads.
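
The options above can be combined into a simple retry loop. The following sketch is illustrative only: deploy_with_fallback and the deploy callback are hypothetical names, and the callback is assumed to create or update the endpoint with the given instance type, wait for it to settle, and return None on success or the FailureReason string on failure.

```python
import time
from typing import Callable, Iterable, Optional

CAPACITY_MARKER = "InsufficientInstanceCapacity"


def deploy_with_fallback(
    deploy: Callable[[str], Optional[str]],
    instance_types: Iterable[str],
    retries_per_type: int = 3,
    wait_seconds: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try each instance type in order and return the first one that deploys, or None."""
    for instance_type in instance_types:
        for attempt in range(retries_per_type):
            reason = deploy(instance_type)
            if reason is None:
                return instance_type
            if CAPACITY_MARKER not in reason:
                # A different failure is not fixed by waiting or by switching types.
                raise RuntimeError(reason)
            if attempt < retries_per_type - 1:
                sleep(wait_seconds)  # capacity can shift, so pause before retrying
    return None
```

The wait time and retry count are arbitrary defaults. Tune them for your workload, and list the instance types in order of preference.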

The container for production variant <variant> did not pass the ping health check. Please check CloudWatch logs for this endpoint.

Containers used for SageMaker endpoints must implement a web server that responds to the /invocations and /ping endpoints. When you create an endpoint, SageMaker starts sending periodic GET requests to the /ping endpoint after the container starts.

A container must respond with an HTTP 200 OK status code and an empty body to indicate that the container can accept inference requests. This error occurs when SageMaker doesn't get consistent responses from the container within four minutes after the container starts up. Because the container doesn't respond to the health check, SageMaker doesn't consider the endpoint healthy and marks the endpoint as Failed.
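
To illustrate the contract described above, the following sketch uses only the Python standard library to answer /ping with 200 OK and an empty body, and to accept POST requests on /invocations. It is a simplified illustration rather than a production server: the class and function names are hypothetical, the /invocations handler echoes the request instead of running a model, and port 8080 is assumed as the listening port.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class InferenceHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/ping":
            # Healthy: 200 OK with an empty body.
            self.send_response(200)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            self.send_error(404)

    def do_POST(self):
        if self.path == "/invocations":
            length = int(self.headers.get("Content-Length", 0))
            payload = self.rfile.read(length)
            body = payload  # placeholder: echo the request instead of running a model
            self.send_response(200)
            self.send_header(
                "Content-Type",
                self.headers.get("Content-Type", "application/octet-stream"),
            )
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_error(404)


def make_server(port: int = 8080) -> ThreadingHTTPServer:
    """Build the server. Call serve_forever() on the result to start it."""
    return ThreadingHTTPServer(("0.0.0.0", port), InferenceHandler)
```

In a real container, /ping should return 200 only after the model is loaded, so that the health check reflects whether the container can serve requests.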

The health check might fail when you use one of the AWS Deep Learning Containers images. These images use either TorchServe or Multi Model Server to serve the models, and these frameworks implement the HTTP endpoints for inference and health checks. The frameworks check whether the model is loaded before they respond to SageMaker with a 200 OK response. If the server can't confirm that the model is loaded, then the health check fails. A model might fail to load for many reasons, including memory usage. The corresponding error messages are logged in Amazon CloudWatch Logs for the endpoint. If the code loaded into the endpoint caused the failure, then the errors are logged in AWS CloudTrail. For example, the code might fail in the model_fn function for PyTorch. To increase the verbosity of these logs, set the SAGEMAKER_CONTAINER_LOG_LEVEL environment variable for the model to one of the Python logging levels.
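
As a small illustration, the environment entry for the model can be built from the constants in the Python logging module. This sketch assumes that the container expects the numeric logging level as a string (for example, "10" for DEBUG), which is how the SageMaker Python SDK framework models appear to set it. The function name is hypothetical.

```python
import logging


def log_level_environment(level: int = logging.DEBUG) -> dict:
    """Build the model environment entry that sets the container log level.

    The value is the numeric Python logging level as a string.
    """
    return {"SAGEMAKER_CONTAINER_LOG_LEVEL": str(level)}
```

You can merge the returned dictionary into the environment variables of the model's container definition, and then redeploy the endpoint to get the more detailed logs.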

A health check request must receive a response within two seconds to be successful. Start your model container locally and send a GET request to the container to check the response.
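
The local check can be scripted as follows. This is a sketch under the assumption that the container is already running locally with its port published on localhost:8080. The check_ping name is hypothetical, and the two-second timeout mirrors the limit described above.

```python
import time
import urllib.error
import urllib.request


def check_ping(base_url: str = "http://localhost:8080", timeout: float = 2.0):
    """Send GET /ping and return (status_code, elapsed_seconds).

    status_code is None when no response arrives within `timeout`
    or when the connection fails.
    """
    start = time.monotonic()
    try:
        with urllib.request.urlopen(f"{base_url}/ping", timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as error:
        status = error.code  # the container answered, but with an error status
    except (urllib.error.URLError, TimeoutError, OSError):
        status = None
    return status, time.monotonic() - start
```

A result other than 200, or a status of None, suggests that the container would also fail the SageMaker health check.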

Failed to extract model data archive for container

SageMaker expects a TAR file with the model data for use in your endpoint. After SageMaker downloads the TAR file, the data archive is extracted. This error might occur if SageMaker can't extract this data archive. For example, SageMaker can't extract the data archive if the model artifact contains symbolic links for files located in the TAR file.

When you create an endpoint, be sure that the model artifacts don't include symbolic or hard links within the TAR file. To check whether the TAR file includes symbolic links, extract the model data, and then run the following command inside the directory that contains the extracted artifacts:

find . -type l -ls

This command returns all the symbolic links found in the current directory and any of its subdirectories. To list files that have more than one hard link, run the following command in the same directory:

find . -type f -links +1 -ls

Replace any link that's returned with an actual copy of the file.
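
You can also inspect the archive without extracting it. The following sketch uses the Python tarfile module to list both symbolic and hard links in a model archive. The find_links name is hypothetical, and the archive path is whatever local copy of the model data you have downloaded.

```python
import tarfile


def find_links(archive_path: str) -> list:
    """Return (member_name, kind, target) for every symbolic or hard link in the archive."""
    links = []
    with tarfile.open(archive_path, "r:*") as archive:
        for member in archive:
            if member.issym():
                links.append((member.name, "symlink", member.linkname))
            elif member.islnk():
                links.append((member.name, "hardlink", member.linkname))
    return links
```

An empty list means that the archive contains no links. Otherwise, rebuild the archive with real copies of the listed files.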

CannotStartContainerError

This error occurs when SageMaker fails to start the container while preparing it for inference.

When SageMaker starts the endpoint, your container is started with the following command:

docker run <image-id> serve

When this command runs, your container must start the serving process.

To resolve this error, use local mode for the SageMaker Python SDK. Or, try to run your inference image with the docker run command. The SageMaker Python SDK loads your model in a way that's similar to a SageMaker endpoint. However, Docker doesn't load the model unless you configure the command or container to do so. You can use a command similar to the following to load your model locally:

docker run -v /path/to/your/untarred/model/directory:/opt/ml/model -p 8080:8080 --rm ${image} serve
AWS OFFICIAL | Updated 13 days ago
3 Comments

The article needs an update in the section "Failed to extract model data archive for container". In addition to soft (symbolic) links, the model's tarball can't contain hard links:

Error: {
  ErrorCode: "INVALID_MODEL_DATA_ARCHIVE",
  Message: "error: Invalid archive s3://xxxxxxxxxxxxx/ccccccccccc.tar.gz. Model data archive cannot contain hard links"
},
Status: "DownloadFailedClientError",

After removing hard links from the model's tarball, the SageMaker endpoint was deployed with no issues.

AWS
replied 7 months ago

Thank you for your comment. We'll review and update the Knowledge Center article as needed.

AWS
MODERATOR
replied 7 months ago

SageMaker also fails with the error "Model data archive cannot be uncompressed" if SageMaker can't access or decrypt the S3 artifact. This can happen if the IAM policies aren't correct. Check the S3 bucket policy, the KMS key used to encrypt the model in S3, the resource policy on the KMS key, and the policy of the SageMaker execution role associated with the model.

replied 5 months ago