This article demonstrates how to build a quota monitoring dashboard for AWS developers and ML engineers working with Amazon Bedrock across multiple models and regions. As organizations scale their generative AI workloads, understanding quota consumption becomes critical for preventing throttling, optimizing costs, and ensuring reliable application performance.
This sample is available at: https://github.com/aws-samples/sample-quota-dashboard-for-amazon-bedrock

The Challenge: TPM Quota Calculation Complexity
While Amazon Bedrock provides excellent CloudWatch metrics for monitoring model usage, calculating actual TPM (Tokens Per Minute) quota consumption requires navigating Amazon Bedrock's token calculation system. Amazon Bedrock calculates token quota consumption in three stages:
1. At Request Start - Reserves quota based on: Total Input Tokens + max_tokens
2. During Processing - Periodically adjusts quota based on actual output generation
3. At Request End - Final calculation using: InputTokenCount + CacheWriteInputTokens + (OutputTokenCount × BurndownRate)
Why Throttling Occurs
The most common cause of unexpected throttling is the quota reservation at request start. Even if your actual token usage is low, Bedrock reserves quota based on your max_tokens parameter, which can be dramatically larger than actual output.
If max_tokens is not set, it defaults to the model's maximum output capacity (64,000 tokens for Claude Sonnet). This default is one of the most common sources of unexpected throttling.
Example with Claude Sonnet (64K max_tokens default):
- Request: 1,000 input tokens, max_tokens: 64,000 (default if not set)
- Reserved at start: 65,000 tokens (1,000 + 64,000)
- Actual output: 100 tokens
- Final consumption: Calculated using model's burndown rate
- Difference: 63,900 tokens were temporarily held but not consumed
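The arithmetic above can be sketched as two small helpers. This is a minimal illustration of the documented formulas, not Bedrock's internal implementation; the function names are ours, and the 5x burndown rate assumed for the final calculation is the one documented for the Claude 3.7/4 series.

```python
def initial_reservation(input_tokens: int, max_tokens: int) -> int:
    """Quota reserved at request start: total input tokens + max_tokens."""
    return input_tokens + max_tokens

def actual_consumption(input_tokens: int, cache_write_tokens: int,
                       output_tokens: int, burndown_rate: float) -> float:
    """Final quota consumption at request end, per the documented formula."""
    return input_tokens + cache_write_tokens + output_tokens * burndown_rate

# Claude Sonnet example from the text: 1,000 input tokens, 64,000 default max_tokens
reserved = initial_reservation(1_000, 64_000)    # 65,000 tokens reserved up front
consumed = actual_consumption(1_000, 0, 100, 5)  # 1,500 tokens, assuming a 5x rate
```

The gap between `reserved` and `consumed` is exactly the "temporarily held but not consumed" quota described above.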
This massive gap between reserved and actual consumption explains why applications experience throttling even when actual token usage appears low. Without tracking the max_tokens parameter, it's essentially impossible to understand your true quota reservation.
Teams working with Bedrock quotas face:
- Manual formula application - The calculations must be applied manually for each model with correct burndown rates
- Burndown rate research - Different models have different rates (1x for most models, 5x for Claude 3.7/4 series) as documented in the quota token burndown documentation
- No direct TPM Quota Usage metrics - CloudWatch provides individual token metrics, but not calculated TPM Quota consumption or peak reservation tracking
Implementation Overview
In this post, we demonstrate how to deploy a serverless Amazon Bedrock TPM/RPM Quota Dashboard that automatically performs both TPM quota calculations in a CloudWatch dashboard: the initial reservation (Total Input Tokens + max_tokens) and the actual consumption (InputTokenCount + CacheWriteInputTokens + (OutputTokenCount × BurndownRate)). Stage 2, where quota is periodically re-adjusted while the request is processed, cannot be measured externally because it happens inside Amazon Bedrock. This sample implementation eliminates manual computation by applying the correct burndown rates and formulas for each model, while providing visibility into both the initial quota reservation and the final quota usage.
Understanding Quota Usage Estimates
The dashboard displays two different quota metrics, not real-time actual usage. Amazon Bedrock dynamically adjusts quota consumption throughout output generation. As tokens are produced, the platform progressively releases the reserved quota. The two metrics serve different purposes:
- Initial Reservation shows what Bedrock reserves when your request arrives. This determines whether you get throttled at request start.
- Actual Consumption shows what you actually consumed after the request completes.
For models with 1x burndown rates, Actual Consumption will always be less than or equal to Initial Reservation. For models with 5x burndown rates, Actual Consumption can exceed Initial Reservation if the model generates substantial output.
Practical implications:
- If Initial Reservation exceeds the limit but you're not seeing throttling, this is expected. The quota reservation is being released as output generates.
- If you're experiencing throttling, consider reducing your max_tokens parameter to lower the Initial Reservation.
- The gap between the two lines shows how much "buffer" your max_tokens setting creates.
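The 1x versus 5x behavior described above can be checked with a few lines of arithmetic. The token counts here are illustrative, not from any real workload:

```python
# Same request shape against a 1x-burndown model and a 5x-burndown model
input_tokens, max_tokens, output_tokens = 2_000, 8_000, 8_000

reserved = input_tokens + max_tokens             # 10,000 reserved for both models

consumed_1x = input_tokens + output_tokens * 1   # 10,000: never exceeds reserved
consumed_5x = input_tokens + output_tokens * 5   # 42,000: far above the reservation
```

With a 1x rate, output is capped at max_tokens, so consumption can never exceed the reservation; with a 5x rate, substantial output overshoots it, which is why the two dashboard lines can cross for Claude 3.7/4-series models.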
Key Components & Features
This serverless solution provides comprehensive quota monitoring through several integrated components:
Automated Quota Calculations
- QuotaFetcher Lambda: Retrieves Service Quota limits and publishes them as CloudWatch custom metrics, refreshed every 2.9 hours via EventBridge
- Type-Safe Registry: Pre-configured burndown rates and quota codes for 80+ models (Amazon Nova, Claude, Llama, Mistral, Titan) with compile-time validation
- Custom Metrics Integration: Tracks max_tokens parameter values to understand initial quota reservations
Dual Quota Tracking Dashboard
- Initial Reservation: InputTokens + CacheWriteTokens + MaxTokens - shows what causes throttling at request start
- Actual Consumption: InputTokens + CacheWriteTokens + (OutputTokens × BurndownRate) - shows final usage after completion
- Direct Quota Visualization: Calculated TPM usage displayed against Service Quota limits with red warning lines
Flexible Architecture
- Region-Specific Deployment: Easy deployment to different AWS regions with region-specific quota codes
- Multi-Endpoint Support: Regional, cross-region, and global-cross-region endpoints
- Real-Time Updates: Quota usage updates in real-time while quota limits refresh automatically
Architecture Overview

This serverless monitoring solution tracks Amazon Bedrock model usage against Service Quotas via Amazon CloudWatch dashboards, providing complete visibility into both what causes throttling (peak consumption) and actual final usage.
Implementation Guide
Prerequisites
Before deploying, ensure you have:
- AWS CLI configured with appropriate permissions
- Node.js 18+ and npm installed
- AWS CDK CLI: npm install -g aws-cdk
- Permissions for Amazon CloudWatch, AWS Lambda, AWS IAM, Service Quotas, and Amazon EventBridge
Step 1: Clone and Setup
git clone https://github.com/aws-samples/sample-quota-dashboard-for-amazon-bedrock
cd sample-quota-dashboard-for-amazon-bedrock
npm install
Step 2: Region Configuration
This sample uses a type-safe, region-specific registry architecture. Before deploying, configure your target region:
// In lib/bedrock-registries.ts
export * from './bedrock-registries/us-east-1'; // For US East 1
// export * from './bedrock-registries/us-west-2'; // For US West 2
Critical: The registry import must match your deployment region, or you'll get incorrect quota codes.
Note: This sample dashboard currently supports a subset of Bedrock models in US East 1 and US West 2 regions. To add support for additional models or regions, follow the registry pattern demonstrated in lib/bedrock-registries/us-east-1.ts and lib/bedrock-registries/us-west-2.ts. This is described further in the Adding New Models section of this article.
Step 3: Deploy the Stack
# Bootstrap CDK (first time only)
AWS_DEFAULT_REGION=us-east-1 npx cdk bootstrap
# Build and deploy
npm run build
AWS_DEFAULT_REGION=us-east-1 npx cdk deploy
Step 4: Configure Custom Metrics
To track initial reservation that causes throttling, implement custom metrics publishing in your application. This tracks the max_tokens parameter value for each Bedrock API call.
Implementation options:
- Boto3 Client Wrapper: Automatically publishes max_tokens metrics while preserving all original Bedrock functionality
- Strands Agent Integration: Uses hooks to automatically publish metrics after each agent invocation
- Manual Integration: Add CloudWatch PutMetricData calls to your existing Bedrock integration
For detailed code examples and implementation patterns, see the repository's README.
Step 5: Verify Deployment
After deployment, the stack outputs:
- DashboardURL: Direct link to your CloudWatch dashboard
- DashboardName: Name of the created dashboard
Navigate to the dashboard URL to see your quota monitoring in action.
The dashboard provides immediate visibility into your Bedrock quota consumption across all configured models, with dual tracking showing:
- Initial Reservation: What causes throttling when your request arrives
- Actual Consumption: Final consumption after requests complete
This demonstrates the comprehensive monitoring capabilities you can build for your own production workloads.
Adding New Models
As Amazon Bedrock releases new models, you can extend this sample dashboard by following this pattern:
1. Add Model Configuration
In your region-specific registry file (e.g., us-east-1.ts):
YOUR_NEW_MODEL: createModelConfig({
  modelId: 'provider.model-name-v1:0',
  outputTokenBurndownRate: 1, // Check AWS burndown documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/quotas-token-burndown.html
  supportedEndpoints: ['regional', 'cross-region'],
  regional: {
    tokenQuotaCode: 'L-<tokenquota1>',
    requestQuotaCode: 'L-<requestquota1>'
  },
  crossRegion: {
    tokenQuotaCode: 'L-<tokenquota2>',
    requestQuotaCode: 'L-<requestquota2>'
  }
})
2. Find Quota Codes
Use the included script to discover quota codes for your region:
AWS_DEFAULT_REGION=us-east-1 npx ts-node scripts/get-quota-codes.ts
This generates a timestamped file with all available Bedrock quota codes.
3. Configure Dashboard
Add the new model to your dashboard configuration:
const allDashboardConfigs: DashboardConfig[] = [
  // ... existing configs
  { modelConfig: BEDROCK_MODELS.YOUR_PROVIDER.YOUR_NEW_MODEL, endpointType: 'regional' }
];
Cost Considerations
Monthly Operating Costs (~$5.73)
- CloudWatch Dashboard: $3.00
- Custom Metrics (9 metrics): $2.70
- Lambda + EventBridge: ~$0.03
Cost Scaling
Cost scales directly with monitored models. Using the default configuration as an example:
- 3 active models × 3 metrics per model (TokenQuota + RequestQuota + MaxTokens) = 9 custom metrics
- Each additional model adds $0.90/month (3 metrics × $0.30)
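The scaling rule above reduces to a one-line estimate. Prices are the ones quoted in this section; real bills also depend on PutMetricData request volume, covered below.

```python
DASHBOARD_USD = 3.00           # CloudWatch dashboard, per month
METRIC_USD = 0.30              # per custom metric, per month
LAMBDA_EVENTBRIDGE_USD = 0.03  # approximate Lambda + EventBridge cost

def monthly_cost_usd(models: int, metrics_per_model: int = 3) -> float:
    """Estimated monthly cost for a given number of monitored models."""
    return round(DASHBOARD_USD + models * metrics_per_model * METRIC_USD
                 + LAMBDA_EVENTBRIDGE_USD, 2)
```

`monthly_cost_usd(3)` reproduces the ~$5.73 default-configuration figure, and each extra model adds $0.90.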
Important Cost Notes
- Custom metric data is retained by CloudWatch for up to 15 months after the last data point, even after the stack is deleted
- The MaxTokens metric is published with each Bedrock API call, potentially generating high-frequency data points
- API Request Costs: First 1,000,000 PutMetricData API requests are free monthly; high-volume applications (>1M Bedrock calls/month) incur $0.01 per 1,000 additional requests
- When you run npx cdk destroy, resources stop immediately except CloudWatch custom metrics
- Metric charges stop once no new data points are published to the "Bedrock/Quotas" namespace; CloudWatch does not support deleting metrics manually, and the retained data expires automatically
Multi-Region Deployment Strategy
For organizations operating across multiple AWS regions:
1. Deploy Per Region
Deploy separate stacks in each region where you use Bedrock:
# US East 1
AWS_DEFAULT_REGION=us-east-1 npx cdk deploy
# US West 2
AWS_DEFAULT_REGION=us-west-2 npx cdk deploy
# EU West 1
AWS_DEFAULT_REGION=eu-west-1 npx cdk deploy
2. Update Registry Imports
Before each regional deployment, update the registry import:
// For US East 1
export * from './bedrock-registries/us-east-1';
// For US West 2
export * from './bedrock-registries/us-west-2';
Conclusion
This Amazon Bedrock TPM/RPM Quota Dashboard sample demonstrates how to eliminate the manual complexity of calculating TPM quota consumption by automatically applying model-specific burndown rates. Instead of manually researching burndown rates and computing token calculations for each model, this sample provides a type-safe, automated approach that handles 80+ models across multiple regions.
Use this foundation to build production monitoring systems that provide your teams immediate insight into quota consumption patterns, throttling causes, and optimization opportunities across your generative AI workloads.