As noted here, the maximum RAM allocation currently supported by SageMaker Serverless Inference is 6 GB. These endpoints (like AWS Lambda) also don't support GPU acceleration, so inference is likely to be pretty slow.
To respond in real time to traffic on the scale of "a few times a day" without reserved compute, you'd essentially expect to scale up from zero for every request. In that context, 7 billion parameters is a very large model: something, somewhere would have to download several GB of weights, load them into memory, and then run a forward pass through them to get your output.
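For reference, this is roughly what that ceiling looks like in the SageMaker Python SDK (a minimal sketch; the model and endpoint wiring around it is omitted):

```python
from sagemaker.serverless import ServerlessInferenceConfig

# 6144 MB (6 GB) is the current documented maximum memory for a
# serverless endpoint, and there is no GPU option for this endpoint type.
serverless_config = ServerlessInferenceConfig(
    memory_size_in_mb=6144,
    max_concurrency=1,
)

# serverless_config would then be passed to
# model.deploy(serverless_inference_config=serverless_config)
```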
If you really want a large Foundation Model, I'd suggest just using one of the vision-capable models available in Amazon Bedrock, like Llama 3.2 or Anthropic Claude: you pay per token and take advantage of the fact that the service already has the models up and running. If you're using custom models that are still based on the Llama architecture, check the documentation and pricing details to see whether Bedrock Custom Model Import could be a viable option.
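For example, invoking a vision-capable Bedrock model through the Converse API could look something like the sketch below. The model ID and region are assumptions; check which models (or cross-region inference profiles) are enabled in your account and region:

```python
import boto3

# Assumed model ID for Llama 3.2 11B Vision Instruct via an inference profile;
# substitute whatever is available in your region.
MODEL_ID = "us.meta.llama3-2-11b-instruct-v1:0"

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

with open("example.jpg", "rb") as f:
    image_bytes = f.read()

response = bedrock.converse(
    modelId=MODEL_ID,
    messages=[{
        "role": "user",
        "content": [
            {"image": {"format": "jpeg", "source": {"bytes": image_bytes}}},
            {"text": "Describe what is in this image."},
        ],
    }],
)

print(response["output"]["message"]["content"][0]["text"])
```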
Otherwise, for typical computer vision tasks, there are plenty of more classical models that deliver very strong performance at much smaller sizes: YOLO variants, YOLO-World for open-vocabulary detection at ~100M parameters, SAM 2, and so on.
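As a rough illustration of the open-vocabulary option, YOLO-World can be tried locally through the Ultralytics package. The checkpoint name and class list below are placeholders; see the Ultralytics docs for the available YOLO-World weights:

```python
from ultralytics import YOLOWorld  # pip install ultralytics

# Placeholder checkpoint name - pick one of the published YOLO-World weights.
model = YOLOWorld("yolov8s-world.pt")

# Open-vocabulary detection: define the classes you care about at inference time.
model.set_classes(["forklift", "safety vest", "pallet"])

results = model.predict("example.jpg")
results[0].show()  # visualise the detections
```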
It should work. A couple of additional points: when you import a custom model to Bedrock, your pricing model changes; you pay for the actual compute used to run the model rather than per token. As an alternative, you can also use batch inference in SageMaker. You mentioned that you only need to run inference for several hours per day, so it could be a good candidate: a batch inference job spins up the compute, executes the job, and shuts itself down.
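A minimal sketch of that lifecycle with the SageMaker Python SDK's batch transform, assuming the model artifact is already packaged in S3 (the paths, role ARN, and container versions below are placeholders):

```python
from sagemaker.huggingface import HuggingFaceModel

# Placeholder artifact location, IAM role and DLC versions - use a
# container version combination that the SageMaker SDK actually supports.
model = HuggingFaceModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",
    transformers_version="4.37",
    pytorch_version="2.1",
    py_version="py310",
)

# The transform job provisions the instance, processes every object under
# the input prefix, writes results to the output prefix, then shuts down.
transformer = model.transformer(
    instance_count=1,
    instance_type="ml.g5.xlarge",
    output_path="s3://my-bucket/batch-output/",
)
transformer.transform(
    data="s3://my-bucket/batch-input/",
    content_type="application/json",
    wait=True,
)
```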
This is very helpful, thank you!
Given the limitations of serverless inference for vision LLMs, I'm leaning more towards using something like Meta Llama 3.2 out of the box from Bedrock. We also want to keep our options open in case we need to fine-tune the model for our use case, in which case it seems fine-tuning Meta Llama 3.2 from Hugging Face and importing it as a custom model should work. Is that right?
Thanks
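For reference, a rough sketch of what that import step could look like with boto3, assuming the fine-tuned weights are saved to S3 in a supported Hugging Face format and the architecture is supported by Custom Model Import (all names, ARNs, and paths below are placeholders):

```python
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

# Placeholder job name, model name, role ARN, and S3 URI.
job = bedrock.create_model_import_job(
    jobName="llama32-finetune-import",
    importedModelName="llama32-finetuned",
    roleArn="arn:aws:iam::123456789012:role/BedrockImportRole",
    modelDataSource={
        "s3DataSource": {"s3Uri": "s3://my-bucket/finetuned-llama32/"}
    },
)

print(job["jobArn"])  # poll the job status, then invoke the imported model by its ARN
```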