Best practice for stable Docker images: SageMaker BYOC async inference images built via CI/CD stopped working

Context

  • Base image (AWS DLC): 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference:2.6.0-transformers4.49.0-gpu-py312-cu124-ubuntu22.04
  • This had been stable for months.

Endpoints started failing health checks with:

ImportError: /opt/conda/lib/python3.12/site-packages/torch/lib/libtorch_cuda.so:
undefined symbol: ncclCommShrink

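What that means

The dynamic loader could not find ncclCommShrink in the NCCL library it loaded at runtime. ncclCommShrink is a relatively recent addition to NCCL (introduced in 2.27, as far as I can tell), so the error indicates that libtorch_cuda.so was built against a newer NCCL than the one the container actually loaded.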
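A check like the following could run as a smoke test at image build time, so a missing symbol fails the build instead of the endpoint health check. It is only a sketch: has_symbol is a hypothetical helper, and the path to the NCCL shared library inside the image is an assumption you would need to fill in.

```python
import ctypes
from typing import Optional


def has_symbol(library_path: Optional[str], symbol: str) -> bool:
    """Return True if the shared library at `library_path` exports `symbol`.

    Passing None inspects the libraries already loaded into this process.
    Raises OSError if the library itself cannot be loaded.
    """
    library = ctypes.CDLL(library_path)
    # ctypes resolves symbols lazily on attribute access and raises
    # AttributeError when the symbol is absent, which hasattr() turns into False.
    return hasattr(library, symbol)


# Example use in a build step (the path below is hypothetical):
#   if not has_symbol("/path/to/libnccl.so.2", "ncclCommShrink"):
#       raise SystemExit("NCCL in this image is older than PyTorch expects")
```

Running the import that failed in production (import torch) as a build step would surface the same ImportError with less code; the symbol check is mainly useful when you want a more specific failure message.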

Investigation so far

  • My Dockerfile used a floating tag (above).
  • Successful images from different dates showed identical first ~31 filesystem layers (so they “felt” the same), but I can’t retrieve the exact base image digest each one was built from.

Fix I applied: I rebuilt with a new Dockerfile based on a freshly pulled base image, and that got things running again.

Please help me understand what actually happened and how I can avoid it in the future.

Should I pin by image digest, not just by tag?

1 Answer
Yes, you should absolutely pin by image digest rather than just by tag to avoid this issue in the future.

What Actually Happened:

Even though the tag looks specific (2.6.0-transformers4.49.0-gpu-py312-cu124-ubuntu22.04), it is mutable: AWS can republish updated image content under the same tag. That is the most likely explanation here. At some point a rebuild pulled a newer revision of the Deep Learning Container, and the PyTorch and NCCL libraries that ended up in your image no longer matched, while the tag itself stayed the same. Your CI/CD pipeline was pulling what it thought was the "same" base image, but its contents had changed.

Best Practices to Prevent This:

  1. Pin by Image Digest: Instead of using just the tag, specify the exact image digest in your Dockerfile. A digest identifies one immutable image, so you always build from the exact same base. Your FROM statement should look like:
FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-inference@sha256:<digest>

  2. Retrieve and Store Digests: Whenever you build an image, record the digest of the base image it was built from. You can look up the digest a tag currently points to in the Amazon Elastic Container Registry repository with the aws ecr describe-images command.

  3. Control Your Update Cycle: By pinning to a digest, you control when to update your base image rather than having changes forced upon you. When you're ready to update, you can pull the latest version, test it thoroughly, and then update your Dockerfile with the new digest.

  4. Consider Tag Immutability: For your own custom images in ECR, enable tag immutability to prevent tags from being overwritten, ensuring consistency across deployments.
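The first two practices can be sketched as a small script: look up the digest a tag currently points to, then rewrite the FROM line to use it. This is a minimal sketch, not a definitive implementation. resolve_digest and pin_base_image are hypothetical helper names, the registry ID, region, and repository are taken from the question, and it assumes boto3 is installed and that your credentials are allowed to call DescribeImages on that registry.

```python
import re

REGISTRY_ID = "763104351884"  # AWS DLC registry account from the question
REGION = "us-east-1"
REPOSITORY = "huggingface-pytorch-inference"
REPOSITORY_URI = f"{REGISTRY_ID}.dkr.ecr.{REGION}.amazonaws.com/{REPOSITORY}"


def resolve_digest(tag: str, ecr_client=None) -> str:
    """Return the sha256 digest that `tag` currently points to."""
    if ecr_client is None:
        import boto3  # imported lazily so pin_base_image works without boto3

        ecr_client = boto3.client("ecr", region_name=REGION)
    response = ecr_client.describe_images(
        registryId=REGISTRY_ID,
        repositoryName=REPOSITORY,
        imageIds=[{"imageTag": tag}],
    )
    return response["imageDetails"][0]["imageDigest"]


def pin_base_image(dockerfile_text: str, repository_uri: str, digest: str) -> str:
    """Rewrite FROM lines that use `repository_uri` to reference `digest`."""
    if not re.fullmatch(r"sha256:[0-9a-f]{64}", digest):
        raise ValueError(f"not a sha256 digest: {digest!r}")
    pinned = f"{repository_uri}@{digest}"
    rewritten = []
    for line in dockerfile_text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].upper() == "FROM":
            index = 1
            # Skip FROM flags such as --platform=linux/amd64.
            while index < len(parts) and parts[index].startswith("--"):
                index += 1
            if index < len(parts):
                name = parts[index].split("@", 1)[0]
                if name == repository_uri or name.startswith(repository_uri + ":"):
                    parts[index] = pinned
                    line = " ".join(parts)
        rewritten.append(line)
    result = "\n".join(rewritten)
    # splitlines() drops a trailing newline, so restore it if it was there.
    return result + "\n" if dockerfile_text.endswith("\n") else result
```

Committing the rewritten Dockerfile (or storing the resolved digest alongside each build) gives you a record of exactly which base image every deployed image came from, which is the information that was missing during the investigation above.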

This approach gives you stability and predictability in your CI/CD pipeline while still allowing you to update base images on your own schedule after proper testing.
Sources
Deploy models with DJL Serving - Amazon SageMaker AI
Use latest container image digest during Amazon ECS deployments | AWS re:Post

answered 4 months ago
EXPERT
reviewed 4 months ago
