You may need to check where this acceleration comes from to determine the warm-up process. In CloudWatch metrics, you have ModelLatency and OverheadLatency.

A SageMaker endpoint has a front-end router that maintains caches for metadata and credentials. If requests are frequent enough, the cache is retained and auto-renewed, which reduces OverheadLatency.
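To see which of the two metrics your warm-up actually improves, you can pull both from CloudWatch. A minimal sketch, assuming a hypothetical endpoint name `my-endpoint` and a one-hour window (both are assumptions; adjust for your setup):

```python
# Sketch: compare ModelLatency vs OverheadLatency for a SageMaker endpoint.
# "my-endpoint" and the 1-hour lookback are placeholder assumptions.
import datetime


def metric_query(metric_name, endpoint_name, start, end):
    """Build kwargs for CloudWatch get_metric_statistics on an endpoint metric."""
    return {
        "Namespace": "AWS/SageMaker",
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": "AllTraffic"},
        ],
        "StartTime": start,
        "EndTime": end,
        "Period": 300,  # 5-minute buckets
        "Statistics": ["Average"],
    }


if __name__ == "__main__":
    import boto3  # imported here so the helper stays testable offline

    cw = boto3.client("cloudwatch")
    end = datetime.datetime.utcnow()
    start = end - datetime.timedelta(hours=1)
    for name in ("ModelLatency", "OverheadLatency"):
        resp = cw.get_metric_statistics(**metric_query(name, "my-endpoint", start, end))
        points = sorted(resp["Datapoints"], key=lambda d: d["Timestamp"])
        print(name, [round(p["Average"]) for p in points])
```

If warm-up requests mainly shrink OverheadLatency, the router cache explanation above fits; a drop in ModelLatency points at the container itself.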
If you see a big drop in ModelLatency with warm-up requests, it may mean your algorithm container is configured to retain some temporary data longer.
Normally, you could schedule an invocation Lambda, combined with CloudWatch Alarms target tracking on the metric InvocationsPerInstance. This ensures you always maintain a certain invocation rate when the endpoint is idle, and those fake requests settle down when real requests pick up.
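The scheduled warm-up Lambda can be sketched as below. This is a minimal sketch, not a definitive implementation: `ENDPOINT_NAME` and the ping payload are hypothetical placeholders, and you would trigger the handler on a schedule (e.g. an EventBridge rule every few minutes):

```python
# Sketch of a warm-up Lambda that sends a cheap invocation to keep the
# endpoint's router caches warm. Endpoint name and payload are assumptions;
# use a minimal valid request body for your own model.
import json

ENDPOINT_NAME = "my-endpoint"                       # hypothetical endpoint name
PING_PAYLOAD = json.dumps({"instances": [[0.0]]})   # hypothetical minimal input


def warm_up(runtime_client, endpoint_name=ENDPOINT_NAME, payload=PING_PAYLOAD):
    """Send one lightweight invocation and return the HTTP status code."""
    resp = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=payload,
    )
    return resp["ResponseMetadata"]["HTTPStatusCode"]


def lambda_handler(event, context):
    import boto3  # imported here so the module stays importable without AWS deps

    return {"status": warm_up(boto3.client("sagemaker-runtime"))}
```

Keep the ping payload as small as your model accepts, since each warm-up call is billed like a normal invocation and shows up in InvocationsPerInstance.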
The issue with warm-up is that it interferes with the endpoint's normal auto-scaling: the endpoint may not scale down properly.
To better understand, are you using SageMaker serverless inference (https://docs.aws.amazon.com/sagemaker/latest/dg/serverless-endpoints.html)?