Seeking Advice to Optimize Cold Start Time for AWS Serverless Inference Endpoint with S3 Hosted HuggingFace Model


I'm currently leveraging a custom HuggingFace model, stored in an S3 bucket, for my serverless inference endpoint. The model size is approximately 750MB. However, I'm encountering significant cold start delays of over 30 seconds whenever the endpoint has been idle for more than about 5 minutes.

I'm exploring potential solutions to reduce this cold start time. I've come across suggestions such as hosting the model on Amazon ElastiCache or directly on an EBS volume instead of S3, which are said to potentially minimize these delays. Before I proceed, I’d love to gather insights or recommendations from this community on the best practices for this scenario.

Additionally, here’s a snippet of the endpoint's code for context:

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    model_data=s3_location,       # path to your model and script
    role=role,                    # IAM role with permissions to create an endpoint
    transformers_version="4.26",  # Transformers version used
    pytorch_version="1.13",       # PyTorch version used
    py_version="py39",            # Python version used
)

# Specify MemorySizeInMB and MaxConcurrency in the serverless config object
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=3072, max_concurrency=10,
)

# deploy the endpoint
predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config
)

Does anyone have experience with optimizing such serverless environments, particularly for ML model inference? Any advice on whether moving the model storage from S3 to another AWS service like ElastiCache or EBS, or pursuing any other strategy, would be beneficial?

Thank you in advance for your help and suggestions!

1 Answer

Hi,

From your question, I understand that you are running your inference endpoint with the on-demand Serverless Inference option.

If your on-demand Serverless Inference endpoint does not receive traffic for a while and then suddenly receives new requests, it takes some time to spin up the compute resources to process them. To overcome this cold start problem, you can choose one of the following options:

  1. You can use Provisioned Concurrency. SageMaker keeps the endpoint warm and ready to respond within milliseconds for the number of concurrent invocations you provision (see the first sketch after this list).
  2. You can schedule a keep-warm Lambda function that periodically sends a dummy invocation to the endpoint, and watch the CloudWatch metrics Invocations and OverheadLatency to confirm it stays warm. This maintains a minimum invocation rate while the endpoint is idle, and you can stop sending the dummy requests once real traffic resumes (see the second sketch after this list).
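
For option 1, here is a minimal sketch of how the deployment code from the question could be adapted. The provisioned_concurrency value shown is an illustrative assumption, not a recommendation; it must be tuned (and is billed) according to your traffic pattern.

from sagemaker.huggingface.model import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

# Same model definition as in the question
huggingface_model = HuggingFaceModel(
    model_data=s3_location,
    role=role,
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
)

# Serverless config with Provisioned Concurrency: SageMaker keeps this many
# concurrent instances initialized, so requests up to that level avoid the
# cold start. provisioned_concurrency=2 is an example value only.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=3072,
    max_concurrency=10,
    provisioned_concurrency=2,
)

predictor = huggingface_model.deploy(
    serverless_inference_config=serverless_config
)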
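
For option 2, below is a minimal keep-warm Lambda sketch, assuming it is triggered by an EventBridge (CloudWatch Events) schedule such as rate(4 minutes). The endpoint name and request payload are placeholders; the payload shape depends on your inference script.

import json
import boto3

runtime = boto3.client("sagemaker-runtime")

# Placeholder name; replace with your serverless endpoint's name.
ENDPOINT_NAME = "my-huggingface-serverless-endpoint"

def lambda_handler(event, context):
    # Send a lightweight dummy request so the endpoint stays warm.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=json.dumps({"inputs": "ping"}),
    )
    return {"statusCode": response["ResponseMetadata"]["HTTPStatusCode"]}

Once real traffic becomes steady, you can disable or delete the schedule rule so you are not paying for unnecessary invocations.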

Also, the alternative options you mentioned in the question (hosting the model on ElastiCache or an EBS volume) are viable, but they bring significant operational pain because you have to manage that infrastructure yourself.

You can compare all three options on the basis of operational overhead and cost, and decide which one is best for your use case.

Thank you

AWS
answered 5 months ago
