Request for Support: Intermittent 4xx and 5xx Errors on OpenSearch Serverless

We are using AWS OpenSearch Serverless for our search workloads and have observed intermittent HTTP 4xx and 5xx errors over the past few months. While the scale of the issue is low, we are reaching out to better understand the root cause and determine if there are any recommended actions or configurations to improve resilience or visibility.


Details:

  • Observed Errors:

    • 401 Unauthorized
    • 403 Forbidden
    • 503 Service Unavailable
  • Frequency:

    • ~20 events in the past 3 months
    • No clear or consistent pattern to the occurrences
    • Happens randomly, even during off-peak traffic
  • Traffic Volume:

    • ~100 queries per second (QPS) during peak hours
  • Notes:

    • We are using the serverless flavor of OpenSearch, and due to the abstracted infrastructure, we have limited observability into backend metrics or throttling behavior.

    • The errors are infrequent but notable, especially the 5xx responses which may indicate a service-side problem.

    • We are particularly curious about:

      • Why a 401 might occur randomly if credentials are unchanged
      • Whether 503s are expected under certain load scenarios despite moderate QPS
      • If there are configuration recommendations or debugging tools available for better insight

Request:

  1. Investigate logs or telemetry on your end for our OpenSearch Serverless collection to help us identify the cause of these errors.
  2. Advise whether this behavior is expected and under what conditions it might occur.
  3. Provide any best practices or monitoring recommendations (e.g., CloudWatch insights, retry logic tuning, limits awareness, etc.)
  4. Let us know if we should be using a particular client SDK version or configuration to avoid known issues.

Impact:

This issue is not currently causing outages or major user impact, but understanding its cause will help us improve reliability and confidence in our system.

1 Answer
Based on the information provided, I can offer some insights into the intermittent errors you're experiencing with AWS OpenSearch Serverless.

For the 401 Unauthorized errors, these typically indicate network policy issues. If you're accessing OpenSearch Serverless from the public internet, ensure that public access is enabled in your network policy. If you're using private access, verify that your VPC endpoint configuration is correct. These errors can occur randomly if there are intermittent network connectivity issues or if your network policy doesn't consistently match your access patterns.

The 403 Forbidden errors generally point to permission issues. These can happen when:

  • Your IAM permissions are insufficient or incorrectly configured
  • The data access policy for your collection doesn't contain the correct permissions for specific indices
  • The Principal section in your policy doesn't correctly identify the IAM identity making the request
  • A request is incorrectly signed
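As a rough local sanity check for the policy-related causes above, the sketch below tests whether a data access policy document names a given principal and grants a permission on an index resource. It is only an approximation for catching obvious mismatches, not how the service evaluates policies. The function name `grants`, the account ID, role name, and collection name are all hypothetical; the policy shape (`Rules`, `ResourceType`, `Resource`, `Permission`, `Principal`) follows the documented data access policy JSON format.

```python
from fnmatch import fnmatchcase


def grants(policy, principal_arn, index_resource, permission):
    """Return True if any statement lists the principal and has an index rule
    matching the resource and permission. Simplified, local check only."""
    for statement in policy:
        if principal_arn not in statement.get("Principal", []):
            continue
        for rule in statement.get("Rules", []):
            if rule.get("ResourceType") != "index":
                continue
            permissions = rule.get("Permission", [])
            if permission not in permissions and "aoss:*" not in permissions:
                continue
            # Resource entries may contain "*" wildcards, e.g. "index/products/*".
            if any(fnmatchcase(index_resource, pattern)
                   for pattern in rule.get("Resource", [])):
                return True
    return False


# Hypothetical policy for illustration only.
example_policy = [
    {
        "Rules": [
            {
                "ResourceType": "index",
                "Resource": ["index/products/*"],
                "Permission": ["aoss:ReadDocument"],
            }
        ],
        "Principal": ["arn:aws:iam::123456789012:role/search-api"],
    }
]
```

Running this against the ARN your client actually signs with can surface a Principal mismatch quickly, but a passing check here does not rule out IAM or signing problems.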

The 503 Service Unavailable errors could be related to service limitations or backend issues. One known behavior with OpenSearch Serverless is a delay (up to 30 seconds) after index creation before the index becomes fully operational. If you create indices and immediately send data to them, this could cause intermittent 5xx errors.
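One way to absorb the post-creation delay is to poll until the new index responds before sending data. The sketch below is a minimal, generic version of that idea: `wait_until_ready` and its `probe` callable are hypothetical names, and the 30-second default timeout simply mirrors the delay mentioned above rather than any documented guarantee.

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 30.0,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Call probe() until it returns True or timeout_s has elapsed.

    Returns True as soon as the probe succeeds, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Here `probe` would be something you supply, for example a function that issues a lightweight request against the new index and returns True on a 2xx response. Ingestion starts only after `wait_until_ready` returns True.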

Recommendations for improving resilience and visibility:

  1. Implement retry mechanisms with exponential backoff in your application code, especially for index creation and data ingestion operations.

  2. Monitor relevant CloudWatch metrics for your collection, such as IngestionRequestErrors and IngestionRequestSuccess, to gain insights into error patterns.

  3. For troubleshooting specific issues:

     • Create HAR files when reproducing errors in the browser to determine the exact cause
     • Review your data access policies to ensure they contain the correct permissions
     • Check that your IAM identity has the necessary permissions (including 'aoss:DashboardsAccessAll' and 'aoss:APIAccessAll' for dashboard access)

  4. If you're creating multiple indices in quick succession, consider implementing a queue system to manage these operations and avoid overwhelming the service.

  5. For errors related to OpenSearch Serverless, check CloudWatch Logs if you're using Amazon Data Firehose for data ingestion, as it provides detailed error codes and messages.
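The retry recommendation above can be sketched roughly as follows, using exponential backoff with full jitter. This is an illustration under assumptions, not a tuned configuration: `call_with_retry`, `backoff_delays`, and the `send` callable are hypothetical names, and treating only 503 as retryable is an assumption you may want to adjust (401 and 403 usually need a policy or signing fix rather than a retry).

```python
import random
import time
from typing import Callable, Iterator, Tuple

# Assumption: only 503 is treated as transient here.
RETRYABLE_STATUSES = frozenset({503})


def backoff_delays(max_retries: int, base_s: float = 0.2, cap_s: float = 5.0,
                   rng: Callable[[], float] = random.random) -> Iterator[float]:
    """Yield one jittered delay per retry: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(max_retries):
        yield rng() * min(cap_s, base_s * (2 ** attempt))


def call_with_retry(send: Callable[[], Tuple[int, object]],
                    max_retries: int = 4,
                    base_s: float = 0.2,
                    cap_s: float = 5.0,
                    sleep: Callable[[float], None] = time.sleep,
                    retryable=RETRYABLE_STATUSES) -> Tuple[int, object]:
    """Call send() and retry while it returns a retryable status code.

    send() must return a (status_code, body) tuple. The last result is
    returned whether or not it succeeded, so callers can inspect it.
    """
    status, body = send()
    for delay in backoff_delays(max_retries, base_s, cap_s):
        if status not in retryable:
            break
        sleep(delay)
        status, body = send()
    return status, body
```

Wrapping the request in a zero-argument `send` callable keeps the helper independent of whichever client library issues the call. Many SDKs and HTTP clients also ship their own retry settings, which may be preferable to a hand-rolled loop.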

The random nature of these errors, especially at your moderate traffic levels, suggests they might be related to specific operations (like index creation) or intermittent network/permission issues rather than general service capacity problems.
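To check whether error spikes line up with specific operations, the CloudWatch metrics mentioned earlier can be pulled for the collection and compared against your own request logs. The sketch below only builds the query parameters. The helper name and the collection, account, and ID values are hypothetical, and the `AWS/AOSS` namespace and the `ClientId`/`CollectionId`/`CollectionName` dimension set are assumptions to verify against the metrics shown in your CloudWatch console.

```python
from datetime import datetime, timedelta, timezone


def ingestion_metric_query(collection_id, collection_name, account_id,
                           metric_name, hours=24, period_s=300):
    """Build keyword arguments for CloudWatch GetMetricStatistics.

    Sums the metric over period_s-second buckets for the last `hours` hours.
    """
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/AOSS",  # assumption: verify in your console
        "MetricName": metric_name,
        "Dimensions": [  # assumption: verify the dimension set
            {"Name": "ClientId", "Value": account_id},
            {"Name": "CollectionId", "Value": collection_id},
            {"Name": "CollectionName", "Value": collection_name},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": period_s,
        "Statistics": ["Sum"],
    }
```

With boto3 installed and credentials configured, the result can be passed as `boto3.client("cloudwatch").get_metric_statistics(**query)`, once for `IngestionRequestErrors` and once for `IngestionRequestSuccess`.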
Sources
Troubleshoot OpenSearch Serverless errors | AWS re:Post
AWS OpenSearch Serverless index creation delay | AWS re:Post
Monitor Amazon Data Firehose Using CloudWatch Logs - Amazon Data Firehose
Troubleshoot access to OpenSearch Serverless dashboard | AWS re:Post

answered a year ago