
Serverless inference for Hugging Face vision model memory limitation

0

Hi,

I am trying to deploy a serverless endpoint for inference with a Hugging Face vision model. Most vision models that perform well are 7B parameters and up and require close to 16 GB of memory to run. Is serverless inference even a practical possibility for this use case?

If not, then what are my options? Deploying a dedicated instance would be overkill for my use case; we need to run inference only a few times a day, and auto-scaling for a dedicated endpoint requires a minimum instance count of 1.

Thanks

asked a month ago · 83 views
4 Answers
3
Accepted Answer

As noted in the SageMaker documentation, the maximum memory allocation currently supported by SageMaker Serverless Inference is 6 GB. These endpoints (like AWS Lambda) also don't support GPU acceleration, so inference is likely to be pretty slow.
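For reference, this is roughly what that ceiling looks like in the SageMaker Python SDK. This is only a minimal sketch: the Hugging Face model ID, execution role, and container versions below are placeholder assumptions, not a tested configuration.

```python
# Minimal sketch of a SageMaker Serverless Inference deployment.
# Model ID, role, and framework versions are placeholders for illustration.
import sagemaker
from sagemaker.huggingface import HuggingFaceModel
from sagemaker.serverless import ServerlessInferenceConfig

role = sagemaker.get_execution_role()  # assumes a SageMaker execution role is available

model = HuggingFaceModel(
    role=role,
    env={"HF_MODEL_ID": "some-org/some-vision-model"},  # hypothetical model ID
    transformers_version="4.37",  # example versions; check the supported container matrix
    pytorch_version="2.1",
    py_version="py310",
)

# Serverless endpoints top out at 6144 MB of memory and run CPU-only,
# which is why a ~7B-parameter vision model won't fit here.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,  # current maximum
    max_concurrency=1,
)

predictor = model.deploy(serverless_inference_config=serverless_config)
```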

For responding in real time to volume on the scale of "a few times a day" without reserved compute, you'd pretty much expect to scale up from zero for every request. In that context, 7 billion parameters is a very large model: something, somewhere would have to download several GB of weights, load them into memory, and then compute through them to produce your output.

If you really want a large Foundation Model, I'd probably suggest just using one of the vision-capable models available in Amazon Bedrock, like Llama 3.2 or Anthropic Claude, since you'll be able to pay per token and leverage the fact that the service already has the models up and running. If you're using custom models that are still based on the Llama architecture, check out the documentation and pricing details to see whether Bedrock Custom Model Import could be a viable option.
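If you go the Bedrock route, a call against a vision-capable model can be as simple as the sketch below, using the boto3 Converse API. The model ID shown is only an example; the exact Llama 3.2 / Claude vision model or inference profile IDs available depend on your region.

```python
# Rough sketch: send one image plus a prompt to a vision-capable Bedrock model.
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("photo.jpg", "rb") as f:
    image_bytes = f.read()

response = bedrock.converse(
    modelId="us.meta.llama3-2-11b-instruct-v1:0",  # example ID; verify for your region
    messages=[
        {
            "role": "user",
            "content": [
                {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
                {"text": "Describe what is in this image."},
            ],
        }
    ],
)

print(response["output"]["message"]["content"][0]["text"])
```

You pay per input/output token, and there's nothing to keep warm between your few daily requests.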

Otherwise, for typical computer vision tasks, there are plenty of more classical models that provide very strong performance at much smaller sizes: YOLO variants (including YOLO-World for open-vocabulary detection at ~100M parameters), SAM 2, and so on.
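As a rough illustration of that last point, a small detector runs comfortably on CPU well within a serverless memory budget. This sketch assumes the `ultralytics` package and a local test image, neither of which is part of the thread above.

```python
# Small classical detector for comparison (assumes `pip install ultralytics`).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")            # nano variant, a few million parameters
results = model("street_scene.jpg")   # path to a local test image

# Print detected class names and confidences
for box in results[0].boxes:
    print(model.names[int(box.cls)], float(box.conf))
```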

AWS
EXPERT
answered a month ago
EXPERT
reviewed a month ago
3

It should work. A few additional points: when you import a custom model into Bedrock, the pricing model changes and you pay for the actual compute used to run the model rather than per token. As an alternative, you can also use batch inference (Batch Transform) in SageMaker. You mentioned that you only need to run inference a few times a day, so it can be a good candidate: a batch job spins up the compute, executes the work, and shuts itself down.
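A rough sketch of that batch option is below, using the SageMaker Python SDK. The model ID, role ARN, bucket paths, and instance type are placeholders, not a tested setup; a GPU instance of some kind would be needed for a 7B-parameter vision model.

```python
# Sketch of SageMaker Batch Transform: compute exists only for the job's duration.
from sagemaker.huggingface import HuggingFaceModel

model = HuggingFaceModel(
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder role ARN
    env={"HF_MODEL_ID": "some-org/some-vision-model"},    # hypothetical model ID
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

transformer = model.transformer(
    instance_count=1,
    instance_type="ml.g5.xlarge",                          # example GPU instance
    output_path="s3://my-bucket/vision-outputs/",          # placeholder bucket
)

# Provisions the instance, processes every object under the input prefix,
# writes results to the output path, then tears the compute down.
transformer.transform(
    data="s3://my-bucket/vision-inputs/",
    content_type="application/json",
)
transformer.wait()
```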

AWS
answered a month ago
EXPERT
reviewed a month ago
0

This is very helpful, thank you!

Given the limitations of serverless inference for vision LLMs, I'm leaning towards using something like Meta Llama 3.2 out of the box from Bedrock. We also want to keep our options open in case we need to fine-tune the model for our use case; in that case, it seems fine-tuning Meta Llama 3.2 from Hugging Face and importing it as a custom model should work. Is that right?

Thanks

answered a month ago
0

ok, that is good to know. Thank you!

answered a month ago
