2 Answers
The maximum concurrency for a single endpoint is the limit on simultaneous invocations that one particular endpoint can handle; it can be set as high as 200. The maximum total concurrency across all serverless endpoints is the sum of concurrent invocations that all endpoints combined can handle, and it is capped at 10 for the entire account. So even though a single endpoint can be configured for up to 200 invocations, the account-wide limit of 10 concurrent invocations is the overriding constraint.
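A minimal sketch of the interaction described above (the function and constant names are mine, not an AWS API): the concurrency an endpoint can actually use is bounded both by its own per-endpoint maximum and by whatever remains of the account-wide total after other endpoints' usage.

```python
PER_ENDPOINT_MAX = 200  # per-endpoint quota from the documentation
ACCOUNT_TOTAL = 10      # account-wide quota reported in Service Quotas


def effective_concurrency(endpoint_max, other_endpoints_total,
                          account_total=ACCOUNT_TOTAL):
    """Return the concurrency one endpoint can actually reach.

    The account-wide pool is shared, so the endpoint gets at most
    whatever is left after other endpoints, even if its own
    configured maximum is higher.
    """
    remaining = max(account_total - other_endpoints_total, 0)
    return min(endpoint_max, remaining)


print(effective_concurrency(PER_ENDPOINT_MAX, 0))  # 10: account cap wins
print(effective_concurrency(PER_ENDPOINT_MAX, 7))  # 3: only 3 left in the pool
```

With a 10-invocation account quota, even a single endpoint configured for 200 is throttled to 10.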
I can confirm that the Service Quotas value for total concurrency of Serverless Inference has also dropped from 200 to 10!
answered 7 months ago
Thanks Marte. The AWS documentation says: "You can set the maximum concurrency for a single endpoint up to 200, and the total number of serverless endpoint variants you can host in a Region is 50. The total concurrency you can share between all serverless endpoints per Region in your account is 200."
So each endpoint can handle up to 200 concurrent invocations, and all endpoints together share a total of 200 concurrent invocations (not 10). I think AWS should update their developer documentation to avoid this confusion.
https://sagemaker-examples.readthedocs.io/en/latest/serverless-inference/huggingface-serverless-inference/huggingface-text-classification-serverless-inference.html#:~:text=You%20can%20set%20the%20maximum,in%20your%20account%20is%20200
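For reference, the per-endpoint maximum quoted above is configured via the `ServerlessConfig` block of a SageMaker endpoint configuration. A minimal sketch of building that request payload (the helper function and model name are hypothetical; the dict shape follows the `boto3` `create_endpoint_config` API, with `MaxConcurrency` bounded at 200 per the quoted documentation):

```python
def serverless_variant(model_name, memory_mb=2048, max_concurrency=200):
    """Build a ProductionVariants entry for a serverless endpoint config.

    MaxConcurrency must be between 1 and 200 per the documented
    per-endpoint limit; MemorySizeInMB accepts 1024-6144 in 1 GB steps.
    """
    if not 1 <= max_concurrency <= 200:
        raise ValueError("MaxConcurrency must be between 1 and 200")
    return {
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "ServerlessConfig": {
            "MemorySizeInMB": memory_mb,
            "MaxConcurrency": max_concurrency,
        },
    }


variant = serverless_variant("my-model")  # hypothetical model name
print(variant["ServerlessConfig"]["MaxConcurrency"])  # 200
```

This dict would then be passed as `ProductionVariants=[variant]` to `boto3.client("sagemaker").create_endpoint_config(...)`; whether all 200 slots are usable still depends on the account-wide Service Quotas value discussed above.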