Hi User-8002955,
Have you tried SageMaker's managed autoscaling capability? Instead of running 4 endpoints and implementing the load balancing yourself, you can let SageMaker apply an autoscaling policy and do the load balancing for you (at no additional cost). With an autoscaling policy, SageMaker will scale out your endpoint (add instances) during peak hours and scale it back in (remove excess instances) when traffic drops.
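As a rough sketch of how that autoscaling policy is wired up: SageMaker endpoints are scaled through Application Auto Scaling, by registering the variant as a scalable target and attaching a target-tracking policy on invocations per instance. The endpoint/variant names, capacities, and cooldowns below are placeholder assumptions, not from your setup.

```python
# Hypothetical sketch: request parameters for Application Auto Scaling to
# autoscale a SageMaker endpoint variant. All names/values are placeholders.

def scalable_target_request(endpoint_name, variant_name, min_capacity, max_capacity):
    """Parameters for application-autoscaling register_scalable_target."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "MinCapacity": min_capacity,
        "MaxCapacity": max_capacity,
    }

def target_tracking_policy_request(endpoint_name, variant_name, invocations_per_instance):
    """Parameters for put_scaling_policy: track invocations per instance."""
    return {
        "PolicyName": f"{endpoint_name}-target-tracking",
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
        "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
        "PolicyType": "TargetTrackingScaling",
        "TargetTrackingScalingPolicyConfiguration": {
            "TargetValue": float(invocations_per_instance),
            "PredefinedMetricSpecification": {
                "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
            },
            "ScaleInCooldown": 300,  # wait before removing instances
            "ScaleOutCooldown": 60,  # react quickly to rising traffic
        },
    }

# With boto3, these dicts would be passed as keyword arguments:
#   aas = boto3.client("application-autoscaling")
#   aas.register_scalable_target(**scalable_target_request("my-endpoint", "AllTraffic", 1, 4))
#   aas.put_scaling_policy(**target_tracking_policy_request("my-endpoint", "AllTraffic", 100))
```

Once this is in place, SageMaker adds and removes instances behind the single endpoint name, so your clients never need to know how many instances are serving.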
If your access pattern is purely batch (you have all the data you want to run inference on available at the same time), you can use SageMaker Batch Transform: https://docs.aws.amazon.com/sagemaker/latest/dg/batch-transform.html . If this fits your needs, it will probably be the most cost-efficient option, since you are only charged for the seconds the transform job was running (not for idle availability time).
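A minimal sketch of kicking off such a job, assuming a model already registered in SageMaker; the job name, model name, S3 URIs, and instance type are placeholders:

```python
# Hypothetical sketch: request parameters for sagemaker create_transform_job.
# All names and S3 URIs below are placeholders.

def transform_job_request(job_name, model_name, input_s3_uri, output_s3_uri):
    """Parameters for boto3 sagemaker.create_transform_job."""
    return {
        "TransformJobName": job_name,
        "ModelName": model_name,
        "TransformInput": {
            "DataSource": {
                "S3DataSource": {"S3DataType": "S3Prefix", "S3Uri": input_s3_uri}
            },
            "ContentType": "text/csv",
            "SplitType": "Line",  # treat each line as one record
        },
        "TransformOutput": {"S3OutputPath": output_s3_uri},
        "TransformResources": {"InstanceType": "ml.m5.xlarge", "InstanceCount": 1},
    }

# With boto3:
#   sm = boto3.client("sagemaker")
#   sm.create_transform_job(**transform_job_request(
#       "nightly-batch", "my-model", "s3://my-bucket/input/", "s3://my-bucket/output/"))
# Instances are provisioned for the job and released when it finishes,
# which is where the cost saving comes from.
```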
The 3rd option to look at is Async Endpoints: https://docs.aws.amazon.com/sagemaker/latest/dg/async-inference.html . Async endpoints are similar to real-time in the sense that you pass in the data for inference one request at a time as it comes in. You can also leverage the autoscaling capability mentioned above, with the addition that if there is no traffic at all for a given period, the endpoint can scale in to 0 instances. When you make an inference request, the endpoint replies immediately that it "got the request", queues it for processing, writes the result to the defined S3 location for you to access, and can optionally notify you that the result is ready.
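Sketching that flow: the async behaviour is configured via an `AsyncInferenceConfig` on the endpoint config, and requests go through `invoke_endpoint_async`, which returns the S3 location of the eventual result rather than the result itself. Bucket paths and SNS topic ARNs below are placeholders.

```python
# Hypothetical sketch: AsyncInferenceConfig for sagemaker create_endpoint_config,
# plus the async invoke call (shown as comments). All names are placeholders.

def async_inference_config(output_s3_uri, success_topic=None, error_topic=None):
    """AsyncInferenceConfig block; results land in S3, with optional SNS notification."""
    output_config = {"S3OutputPath": output_s3_uri}
    if success_topic or error_topic:
        notification = {}
        if success_topic:
            notification["SuccessTopic"] = success_topic  # SNS topic ARN
        if error_topic:
            notification["ErrorTopic"] = error_topic
        output_config["NotificationConfig"] = notification
    return {"OutputConfig": output_config}

# With boto3, invocation returns immediately; you poll (or get notified about)
# the S3 output location instead of waiting on the response:
#   smr = boto3.client("sagemaker-runtime")
#   resp = smr.invoke_endpoint_async(
#       EndpointName="my-async-endpoint",
#       InputLocation="s3://my-bucket/requests/payload-1.json")
#   resp["OutputLocation"]  # S3 key where the finished inference will appear
```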
Please note that the 2nd and 3rd options above do not implement multi-model endpoints, so your application would need to decide which of the 2 models to use and either trigger a different Batch Transform job or call a different async endpoint.
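That application-side routing can be as small as a lookup from a model key to an endpoint name; the routing key and endpoint names here are made-up placeholders:

```python
# Hypothetical sketch of application-side routing between the 2 models, since
# each async endpoint serves one model. Keys and endpoint names are placeholders.

ASYNC_ENDPOINT_FOR_MODEL = {
    "model_a": "model-a-async-endpoint",
    "model_b": "model-b-async-endpoint",
}

def pick_endpoint(request):
    """Route an inference request to the async endpoint of its model."""
    model = request.get("model")
    if model not in ASYNC_ENDPOINT_FOR_MODEL:
        raise ValueError(f"unknown model: {model!r}")
    return ASYNC_ENDPOINT_FOR_MODEL[model]

# The returned name would then be passed as EndpointName to invoke_endpoint_async
# (or used to launch the matching Batch Transform job).
```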
Based on your application needs, all 3 of the above options can scale to cover your requirement of 200 TPS or above. It all comes down to which access pattern fits your use case. Hope the above helped!