Rate Limits and Statistics for Llama 3 8B on Bedrock


I want to understand the following benchmarks for Llama 3 8B on AWS Bedrock:

  1. Max concurrency for a single user/API
  2. Latency
  3. Tokens per second
  4. Rate limits, if any (number of requests per minute/day, or number of tokens per minute/day)

We have a use case where we plan to scale to 30 concurrent users, with a significant amount of token and request usage.
1 Answer

Hi. It's hard to give a precise answer because it depends on many factors. To give a few examples, latency can depend on the region where the model is deployed, the prompt itself, and so on. However, I'd suggest you use the benchmarking tooling that you can find on GitHub. Here is the repo link:

https://github.com/aws-samples/foundation-model-benchmarking-tool
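If you just want a quick read on latency and tokens per second before running the full benchmarking tool, a minimal sketch like the one below can help. It assumes boto3 with configured AWS credentials, the Bedrock Converse API, and the `meta.llama3-8b-instruct-v1:0` model ID in `us-east-1` (adjust both for your account and region); the `tokens_per_second` helper and `benchmark_once` function names are my own.

```python
import time

def tokens_per_second(output_tokens, elapsed_s):
    """Throughput from an output token count and elapsed wall-clock time."""
    return output_tokens / elapsed_s if elapsed_s > 0 else 0.0

def benchmark_once(prompt, model_id="meta.llama3-8b-instruct-v1:0",
                   region="us-east-1", max_tokens=256):
    """Send one request via the Bedrock Converse API and report
    (latency in seconds, output tokens per second)."""
    import boto3  # AWS SDK; requires configured credentials
    client = boto3.client("bedrock-runtime", region_name=region)
    start = time.perf_counter()
    resp = client.converse(
        modelId=model_id,
        messages=[{"role": "user", "content": [{"text": prompt}]}],
        inferenceConfig={"maxTokens": max_tokens},
    )
    elapsed = time.perf_counter() - start
    out_tokens = resp["usage"]["outputTokens"]
    return elapsed, tokens_per_second(out_tokens, elapsed)
```

Run `benchmark_once` in a loop with your real prompts and take percentiles rather than a single sample, since latency varies per request.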

The quotas can be found here:

https://docs.aws.amazon.com/bedrock/latest/userguide/quotas.html
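The quotas page lists defaults, but you can also inspect the values applied to your own account programmatically through the Service Quotas API. A sketch, assuming boto3 with credentials and the `bedrock` service code (the `filter_quotas` / `bedrock_quotas` helpers are my own names, and quota names may differ by region):

```python
def filter_quotas(quotas, keyword):
    """Keep quota records whose name contains the keyword (case-insensitive)."""
    return [q for q in quotas if keyword.lower() in q["QuotaName"].lower()]

def bedrock_quotas(keyword="llama", region="us-east-1"):
    """List the account's Bedrock quotas matching a keyword."""
    import boto3  # AWS SDK; requires configured credentials
    client = boto3.client("service-quotas", region_name=region)
    quotas = []
    paginator = client.get_paginator("list_service_quotas")
    for page in paginator.paginate(ServiceCode="bedrock"):
        quotas.extend(page["Quotas"])
    return filter_quotas(quotas, keyword)
```

The matching records include the quota value and whether it is adjustable, which tells you if you can request an increase for your 30-concurrent-user target.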

Answered a month ago by AWS
Reviewed a month ago by an AWS EXPERT
  • Can I just hit the API for the LLM, while all my agents/knowledge bases/context history are stored separately in another cloud? E.g., the way we hit bare OpenAI APIs, with LangChain, the knowledge base, and everything else custom-built on our side.
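Yes, that pattern works: Bedrock's runtime API is stateless per request, so conversation history can live anywhere and be replayed into each call. A minimal sketch, assuming boto3, the Converse API, and the `meta.llama3-8b-instruct-v1:0` model ID (the `to_bedrock_messages` / `chat` helpers and the `(role, text)` history format are my own illustration, not a Bedrock requirement):

```python
def to_bedrock_messages(history):
    """Convert a [(role, text), ...] history, stored wherever you like,
    into the Converse API message format."""
    return [{"role": role, "content": [{"text": text}]}
            for role, text in history]

def chat(history, model_id="meta.llama3-8b-instruct-v1:0",
         region="us-east-1"):
    """Replay externally stored history into a stateless Bedrock call."""
    import boto3  # AWS SDK; requires configured credentials
    client = boto3.client("bedrock-runtime", region_name=region)
    resp = client.converse(modelId=model_id,
                           messages=to_bedrock_messages(history))
    return resp["output"]["message"]["content"][0]["text"]
```

LangChain's Bedrock integrations wrap this same runtime API, so a custom-built retrieval layer on another cloud can feed context into the prompt exactly as with bare OpenAI APIs.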
