Skip to content

Bedrock InvokeAgent intermittent internalServerException (Sonnet 4.6, us-west-2)

-1

Severity: Production system impaired (recommend urgent or high). Account: 082845737731 (orchestrator) / 563403381411 (Bedrock agents). Region: us-west-2. Service: Bedrock Agent Runtime → InvokeAgent. Model / inference profile: us.anthropic.claude-sonnet-4-6.

Summary For ~10 days we have observed intermittent internalServerException from InvokeAgent on long-running Sonnet 4.6 invocations (~600s). The error is not deterministic to a single agent, prompt, or token volume. SFN-level retry currently absorbs the failure end-to-end, but each occurrence burns several minutes of Bedrock compute before the retry succeeds.

Reproducible evidence Failed request ID (most recent, off-peak window): be220290-ae2c-486d-b22d-6bd6a4c62fba Timestamp: 2026-05-30T14:33:08Z (10:33 EDT) Caller: Lambda invoke-specialist at handler.py line 354 (EventStream consumer) Error: EventStreamError → An error occurred (internalServerException) when calling the InvokeAgent operation: Internal Server Exception Outer SFN execution: arn:aws:states:us-west-2:082845737731:execution:echo1-kb-gold-standard-orchestrator:00b92452-347b-40ff-b929-65105b991bb6 Outer status: SUCCEEDED (state-machine retry recovered after one failed attempt) Earlier on-peak observations consistently failed both SFN retry attempts at ~617s and ~620s, leading to outright execution failures. Same error reproduces across multiple agents (Channel Strategist, Discovery, Risk Auditor) — not agent-specific. What we have ruled out Not a Channel-Strategist-specific token volume issue (verified by E2E showing Discovery + Channel + Risk Auditor all failing in the same shape). Not a model-effort-tier issue (occurs at default Sonnet effort; switching effort tiers does not eliminate it). Not a client-side timeout (Lambda read_timeout widened to 850s; the EventStream is terminated server-side before that). Off-peak (00:00–13:00 UTC window) reduces frequency but does not eliminate it. Asks Is us.anthropic.claude-sonnet-4-6 in us-west-2 capacity-throttling at the inference-profile or model-tier level during peak (13:00–23:00 UTC)? If so, what is the current SLO? Can you confirm whether request be220290-ae2c-486d-b22d-6bd6a4c62fba (above) was throttled, dropped, or hit an internal-stack failure? Server-side reason code would help us decide between staying on retry-mask vs. switching strategies. Is there a path to provisioned throughput or a different inference profile that eliminates the internalServerException class without the cost step of a full PT commit? Any guidance on Bedrock invocation logging or X-Ray instrumentation that would surface the upstream cause of the EventStream termination in our own CloudWatch?

2 Answers
0

Kindly reach out to the Bedrock AWS Support Team from the production account that you are getting the error. As mentioned by our re:Post Agent, AWS Support should be able to investigate the internal logs to determine the root cause and provide specific guidance for your queries.

AWS
answered 6 days ago
-1

The InternalServerException you're experiencing is a temporary condition caused by an unexpected internal error within the AWS service. According to AWS documentation, this error should resolve itself with retries, and implementing exponential backoff retry logic is the recommended approach—which you've already done at the Step Functions level.

Understanding the Error: When the Bedrock Agent Runtime service returns an InternalServerException (HTTP 500), it indicates a server-side issue rather than a problem with your configuration or request structure. The error can include a reason field; specifically, if the reason is BEDROCK_MODEL_INVOCATION_SERVICE_UNAVAILABLE, it means the model invocation service is temporarily unavailable and you should retry your request.

Regarding Your Specific Questions:

  1. Capacity and Throttling: While the documentation confirms that InternalServerException is distinct from throttling exceptions (which return HTTP 429), I cannot provide real-time capacity information, SLOs, or confirm whether specific request IDs were throttled versus experiencing internal failures. These details require direct investigation by AWS Support with access to internal service logs.

  2. Request-Specific Analysis: The specific failure reason for request ID be220290-ae2c-486d-b22d-6bd6a4c62fba would need to be investigated by AWS Support, as they have access to internal service telemetry that isn't exposed through public APIs or documentation.

  3. Provisioned Throughput: I cannot confirm whether provisioned throughput would eliminate this specific error class without reviewing your exact use case with AWS Support. However, provisioned throughput generally provides more predictable performance characteristics.

  4. Logging and Instrumentation: For the InvokeAgent API, you can enable trace information by setting enableTrace to true in your request. This will include agent reasoning processes, action group invocations, and knowledge base queries in the response. However, this won't expose the internal service failure reasons that cause the InternalServerException.

Recommended Next Steps: Given the production impact and the specific account/request IDs you've provided, I strongly recommend opening an AWS Support case with your current support plan. Include all the details you've shared here—request IDs, timestamps, account numbers, and the pattern analysis you've already completed. AWS Support can investigate the server-side logs to determine the root cause and provide specific guidance for your workload.
Sources
InternalServerException - botocore 1.42.50 documentation
InvokeAgent - Amazon Bedrock
InvokeAgentRuntimeCommandStreamOutput - Amazon Bedrock AgentCore Data Plane

answered 10 days ago

You are not logged in. Log in to post an answer.

A good answer clearly answers the question and provides constructive feedback and encourages professional growth in the question asker.