Hi,
From your question, I understand that you are running your inference endpoint with the on-demand Serverless Inference option.
If your on-demand Serverless Inference endpoint does not receive traffic for a while, SageMaker scales its compute down; when new requests suddenly arrive, it takes some time to spin the compute resources back up. To work around this cold start problem, you can choose one of the following options:
- You can use Provisioned Concurrency. SageMaker keeps the endpoint warm and able to respond within milliseconds, up to the amount of Provisioned Concurrency that you allocate (see the first sketch after this list).
- You can schedule a Lambda function that sends periodic keep-alive ("fake") invocations to the endpoint, for example on an EventBridge schedule, and use CloudWatch alarms on the Invocations and OverheadLatency metrics to monitor it. This maintains a minimum invocation rate while the endpoint is idle, and you can stop sending the synthetic requests once real traffic resumes (see the second sketch after this list).
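
Here is a minimal sketch of the Provisioned Concurrency option using boto3. The endpoint, model, and config names are placeholders, and the memory/concurrency values are assumptions you should adjust for your model:

```python
import boto3

sm = boto3.client("sagemaker")

# Serverless endpoint config with Provisioned Concurrency: SageMaker keeps
# that many workers initialized, so requests up to that level skip the cold start.
sm.create_endpoint_config(
    EndpointConfigName="my-serverless-config",  # placeholder name
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": "my-model",  # an existing SageMaker model (placeholder)
            "ServerlessConfig": {
                "MemorySizeInMB": 2048,       # 1024-6144, in 1 GB increments
                "MaxConcurrency": 10,         # cap on concurrent invocations
                "ProvisionedConcurrency": 2,  # pre-warmed capacity (<= MaxConcurrency)
            },
        }
    ],
)

sm.create_endpoint(
    EndpointName="my-serverless-endpoint",  # placeholder name
    EndpointConfigName="my-serverless-config",
)
```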
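
And a sketch of the keep-alive Lambda for the second option, assuming a JSON model input; the endpoint name and payload are placeholders. Schedule the handler with an EventBridge rule (for example rate(5 minutes)) and disable the rule once real traffic resumes:

```python
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

ENDPOINT_NAME = "my-serverless-endpoint"         # placeholder name
WARMUP_PAYLOAD = json.dumps({"inputs": "ping"})  # adjust to your model's input format

def lambda_handler(event, context):
    # Send a lightweight synthetic request so the serverless endpoint keeps
    # a warm instance available.
    response = runtime.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="application/json",
        Body=WARMUP_PAYLOAD,
    )
    # Read and discard the result; the call exists only to keep the endpoint warm.
    _ = response["Body"].read()
    return {"statusCode": 200}
```

Note that each ping only warms the capacity that serves it, so this approach helps most with low, bursty traffic rather than high concurrency.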
The alternative option you mentioned in your question is also viable, but it comes with significant operational burden, since you would need to manage the entire infrastructure yourself.
You can compare all three options on the basis of operational overhead and cost, and decide which one is best for your use case.
Thank you