
TPM & RPM Quota Monitoring Dashboard for Amazon Bedrock

9 minute read
Content level: Intermediate

This article shows how to automatically calculate Amazon Bedrock TPM quota usage against your Amazon Bedrock service quotas, so you can tell when to request limit increases and debug potential throttling errors.

This article demonstrates how to build a quota monitoring dashboard for AWS developers and ML engineers working with Amazon Bedrock across multiple models and regions. As organizations scale their generative AI workloads, understanding quota consumption becomes critical for preventing throttling, optimizing costs, and ensuring reliable application performance.

This sample is available at: https://github.com/aws-samples/sample-quota-dashboard-for-amazon-bedrock

Dashboard for Amazon Bedrock Quota Calculations

The Challenge: TPM Quota Calculation Complexity

While Amazon Bedrock provides excellent CloudWatch metrics for monitoring model usage, calculating actual TPM (Tokens Per Minute) quota consumption requires navigating Amazon Bedrock's token calculation system. Amazon Bedrock calculates token quota consumption in three stages:

  1. At Request Start - Reserves quota based on:

    Total Input Tokens + max_tokens
    
  2. During Processing - Periodically adjusts quota based on actual output generation

  3. At Request End - Final calculation using:

    InputTokenCount + CacheWriteInputTokens + (OutputTokenCount × BurndownRate)
    

Why Throttling Occurs

The most common cause of unexpected throttling is the quota reservation at request start. Even if your actual token usage is low, Bedrock reserves quota based on your max_tokens parameter, which can be dramatically larger than actual output.

If max_tokens is not set, it defaults to the model's maximum output capacity: for Claude Sonnet, that is 64,000 tokens. This default is one of the most common sources of unexpected throttling.

Example with Claude Sonnet (64K max_tokens default):

  • Request: 1,000 input tokens, max_tokens: 64,000 (default if not set)
  • Reserved at start: 65,000 tokens (1,000 + 64,000)
  • Actual output: 100 tokens
  • Final consumption: Calculated using model's burndown rate
  • Difference: 63,900 tokens were temporarily held but not consumed

This massive gap between reserved and actual consumption explains why applications experience throttling even when actual token usage appears low. Without tracking the max_tokens parameter, it's essentially impossible to understand your true quota reservation.
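The arithmetic above can be sketched with two small helpers (an illustrative example, not code from the sample; the 1x burndown rate here applies to most models, while the Claude 3.7/4 series burns down at 5x):

```python
# Illustrative helpers for Bedrock's two quota formulas described above.
def initial_reservation(input_tokens: int, max_tokens: int) -> int:
    """Quota reserved when the request arrives (what can trigger throttling)."""
    return input_tokens + max_tokens

def actual_consumption(input_tokens: int, cache_write_tokens: int,
                       output_tokens: int, burndown_rate: int) -> int:
    """Final quota consumption after the request completes."""
    return input_tokens + cache_write_tokens + output_tokens * burndown_rate

reserved = initial_reservation(1_000, 64_000)    # 65,000 tokens held at start
consumed = actual_consumption(1_000, 0, 100, 1)  # 1,100 tokens at a 1x rate
released = reserved - consumed                   # 63,900 tokens released
```

With a 5x burndown rate, actual_consumption(1_000, 0, 100, 5) would instead come to 1,500 tokens, so for those models the final consumption can exceed what the raw output token count suggests.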

Teams working with Bedrock quotas face:

  • Manual formula application - The calculations must be applied manually for each model with correct burndown rates
  • Burndown rate research - Different models have different rates (1x for most models, 5x for Claude 3.7/4 series) as documented in the quota token burndown documentation
  • No direct TPM Quota Usage metrics - CloudWatch provides individual token metrics, but not calculated TPM Quota consumption or peak reservation tracking

Implementation Overview

In this post, we demonstrate how to deploy a serverless Amazon Bedrock TPM/RPM Quota Dashboard that automatically calculates TPM quota usage in a CloudWatch dashboard, covering both the initial reservation (Total Input Tokens + max_tokens) and the actual consumption (InputTokenCount + CacheWriteInputTokens + (OutputTokenCount × BurndownRate)). Stage 2, in which Amazon Bedrock periodically re-calculates token usage during processing, cannot be measured externally because it occurs inside Amazon Bedrock. This sample implementation eliminates manual computation by applying the correct burndown rates and formulas for each model, providing visibility into both the initial quota reservation and the final quota usage.

Understanding Quota Usage Estimates

The dashboard displays two different quota metrics, not real-time actual usage. Amazon Bedrock dynamically adjusts quota consumption throughout output generation. As tokens are produced, the platform progressively releases the reserved quota. The two metrics serve different purposes:

  • Initial Reservation shows what Bedrock reserves when your request arrives. This determines whether you get throttled at request start.
  • Actual Consumption shows what you actually consumed after the request completes.

For models with 1x burndown rates, Actual Consumption will always be less than or equal to Initial Reservation. For models with 5x burndown rates, Actual Consumption can exceed Initial Reservation if the model generates substantial output.

Practical implications:

  • If Initial Reservation exceeds the limit but you're not seeing throttling, this is expected. The quota reservation is being released as output generates.
  • If you're experiencing throttling, consider reducing your max_tokens parameter to lower the Initial Reservation.
  • The gap between the two lines shows how much "buffer" your max_tokens setting creates.

Key Components & Features

This serverless solution provides comprehensive quota monitoring through several integrated components:

Automated Quota Calculations

  • QuotaFetcher Lambda: Retrieves Service Quota limits and publishes them as CloudWatch custom metrics, refreshed every 2.9 hours via EventBridge
  • Type-Safe Registry: Pre-configured burndown rates and quota codes for 80+ models (Amazon Nova, Claude, Llama, Mistral, Titan) with compile-time validation
  • Custom Metrics Integration: Tracks max_tokens parameter values to understand initial quota reservations

Dual Quota Tracking Dashboard

  • Initial Reservation: InputTokens + CacheWriteTokens + MaxTokens - shows what causes throttling at request start
  • Actual Consumption: InputTokens + CacheWriteTokens + (OutputTokens × BurndownRate) - shows final usage after completion
  • Direct Quota Visualization: Calculated TPM usage displayed against Service Quota limits with red warning lines

Flexible Architecture

  • Region-Specific Deployment: Easy deployment to different AWS regions with region-specific quota codes
  • Multi-Endpoint Support: Regional, cross-region, and global-cross-region endpoints
  • Real-Time Updates: Quota usage updates in real time, while quota limits refresh automatically

Architecture Overview

Architecture Diagram

This serverless monitoring solution tracks Amazon Bedrock model usage against Service Quotas via Amazon CloudWatch dashboards, providing complete visibility into both what causes throttling (the initial reservation) and the actual final usage.

Implementation Guide

Prerequisites

Before deploying, ensure you have:

  • AWS CLI configured with appropriate permissions
  • Node.js 18+ and npm installed
  • AWS CDK CLI: npm install -g aws-cdk
  • Permissions for Amazon CloudWatch, AWS Lambda, AWS IAM, Service Quotas, and Amazon EventBridge

Step 1: Clone and Setup

git clone https://github.com/aws-samples/sample-quota-dashboard-for-amazon-bedrock
cd sample-quota-dashboard-for-amazon-bedrock
npm install

Step 2: Region Configuration

This sample uses a type-safe, region-specific registry architecture. Before deploying, configure your target region:

// In lib/bedrock-registries.ts
export * from './bedrock-registries/us-east-1';  // For US East 1
// export * from './bedrock-registries/us-west-2';  // For US West 2

Critical: The registry import must match your deployment region, or you'll get incorrect quota codes.

Note: This sample dashboard currently supports a subset of Bedrock models in US East 1 and US West 2 regions. To add support for additional models or regions, follow the registry pattern demonstrated in lib/bedrock-registries/us-east-1.ts and lib/bedrock-registries/us-west-2.ts. This is described further in the Adding New Models section of this article.

Step 3: Deploy the Stack

# Bootstrap CDK (first time only)
AWS_DEFAULT_REGION=us-east-1 npx cdk bootstrap

# Build and deploy
npm run build
AWS_DEFAULT_REGION=us-east-1 npx cdk deploy

Step 4: Configure Custom Metrics

To track initial reservation that causes throttling, implement custom metrics publishing in your application. This tracks the max_tokens parameter value for each Bedrock API call.

Implementation options:

  • Boto3 Client Wrapper: Automatically publishes max_tokens metrics while preserving all original Bedrock functionality
  • Strands Agent Integration: Uses hooks to automatically publish metrics after each agent invocation
  • Manual Integration: Add CloudWatch PutMetricData calls to your existing Bedrock integration

For detailed code examples and implementation patterns, see the repository's README.

Step 5: Verify Deployment

After deployment, the stack outputs:

  • DashboardURL: Direct link to your CloudWatch dashboard
  • DashboardName: Name of the created dashboard

Navigate to the dashboard URL to see your quota monitoring in action.

The dashboard provides immediate visibility into your Bedrock quota consumption across all configured models, with dual tracking showing:

  • Initial Reservation: What causes throttling when your request arrives
  • Actual Consumption: Final consumption after requests complete

This demonstrates the comprehensive monitoring capabilities you can build for your own production workloads.

Adding New Models

As Amazon Bedrock releases new models, you can extend this sample dashboard by following this pattern:

1. Add Model Configuration

In your region-specific registry file (e.g., us-east-1.ts):

YOUR_NEW_MODEL: createModelConfig({
  modelId: 'provider.model-name-v1:0',
  outputTokenBurndownRate: 1, // Check AWS burndown documentation: https://docs.aws.amazon.com/bedrock/latest/userguide/quotas-token-burndown.html
  supportedEndpoints: ['regional', 'cross-region'],
  regional: { 
    tokenQuotaCode: 'L-<tokenquota1>', 
    requestQuotaCode: 'L-<requestquota1>' 
  },
  crossRegion: { 
    tokenQuotaCode: 'L-<tokenquota2>', 
    requestQuotaCode: 'L-<requestquota2>' 
  }
})

2. Find Quota Codes

Use the included script to discover quota codes for your region:

AWS_DEFAULT_REGION=us-east-1 npx ts-node scripts/get-quota-codes.ts

This generates a timestamped file with all available Bedrock quota codes.

3. Configure Dashboard

Add the new model to your dashboard configuration:

const allDashboardConfigs: DashboardConfig[] = [
  // ... existing configs
  { modelConfig: BEDROCK_MODELS.YOUR_PROVIDER.YOUR_NEW_MODEL, endpointType: 'regional' }
];

Cost Considerations

Monthly Operating Costs (~$5.73)

  • CloudWatch Dashboard: $3.00
  • Custom Metrics (9 metrics): $2.70
  • Lambda + EventBridge: ~$0.03

Cost Scaling

Cost scales directly with monitored models. Using the default configuration as an example:

  • 3 active models × 3 metrics per model (TokenQuota + RequestQuota + MaxTokens) = 9 custom metrics
  • Each additional model adds $0.90/month (3 metrics × $0.30)
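The scaling above reduces to a back-of-the-envelope formula (using only the per-item figures quoted in this section):

```python
# Cost model from the figures above: one dashboard, three custom metrics
# per monitored model at $0.30 each, plus a small Lambda/EventBridge charge.
DASHBOARD_COST = 3.00
METRIC_COST = 0.30
METRICS_PER_MODEL = 3
LAMBDA_EVENTBRIDGE = 0.03

def monthly_cost(models: int) -> float:
    """Estimated monthly cost in USD for a given number of monitored models."""
    return DASHBOARD_COST + models * METRICS_PER_MODEL * METRIC_COST + LAMBDA_EVENTBRIDGE

round(monthly_cost(3), 2)  # 5.73 for the default three-model configuration
```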

Important Cost Notes

  • Custom metric data persists in CloudWatch for up to 15 months after the last data point is published
  • The MaxTokens metric is published with each Bedrock API call, potentially generating high-frequency data points
  • API Request Costs: First 1,000,000 PutMetricData API requests are free monthly; high-volume applications (>1M Bedrock calls/month) incur $0.01 per 1,000 additional requests
  • When you run npx cdk destroy, resources are removed and new metric data stops immediately; existing CloudWatch custom metrics cannot be deleted manually and age out over the 15-month retention period
  • Custom metric charges stop once no new data points are published to the "Bedrock/Quotas" namespace

Multi-Region Deployment Strategy

For organizations operating across multiple AWS regions:

1. Deploy Per Region

Deploy separate stacks in each region where you use Bedrock:

# US East 1
AWS_DEFAULT_REGION=us-east-1 npx cdk deploy

# US West 2  
AWS_DEFAULT_REGION=us-west-2 npx cdk deploy

# EU West 1
AWS_DEFAULT_REGION=eu-west-1 npx cdk deploy

2. Update Registry Imports

Before each regional deployment, update the registry import:

// For US East 1
export * from './bedrock-registries/us-east-1';

// For US West 2
export * from './bedrock-registries/us-west-2';

Conclusion

This Amazon Bedrock TPM/RPM Quota Dashboard sample demonstrates how to eliminate the manual complexity of calculating TPM quota consumption by automatically applying model-specific burndown rates. Instead of manually researching burndown rates and computing token calculations for each model, this sample provides a type-safe, automated approach that handles 80+ models across multiple regions.

Use this foundation to build production monitoring systems that provide your teams immediate insight into quota consumption patterns, throttling causes, and optimization opportunities across your generative AI workloads.

2 Comments

Fantastic dashboard solution for Bedrock TPM/RPM monitoring. This will save tons of time managing agent quotas in production. Great use case.

A few questions:

Does this dashboard auto-alert on 80% TPM quota thresholds? Any Slack/Teams integration examples?

How does it handle multi-model quotas (e.g., Claude + Llama)? Separate metrics or aggregated view?

For serverless Bedrock agents, what's the best way to track RPM bursts during peak RAG traffic?

replied 2 months ago

Thank you! I'm glad this dashboard will help with managing Service Quotas for Amazon Bedrock!

  1. Alerts: The dashboard currently doesn't include alarms, but they can be easily added since all required metrics are already available. The sample was designed to demonstrate how to calculate the metrics and provide a strong foundation for building custom alerting based on your specific requirements.

  2. Multi-model support: Each model is displayed in its own dedicated TPM and RPM tile. When you add multiple models (e.g., Claude + Llama), each appears as a separate tile showing its individual quota usage against the Service Quota limit. There is no aggregation, which provides clear per-model visibility.

  3. RPM burst tracking: RPM data is already available through CloudWatch (Namespace: AWS/Bedrock, Metric name: Invocations, ModelId: <insert-desired-model-id>). This metric has a minimum period of 1 minute, so bursts are grouped into 1-minute intervals. Since Service Quotas for Invocations are also tracked per-minute, you can easily compare your usage against quota limits directly.

AWS EXPERT replied 2 months ago