How do I troubleshoot search latency spikes in my OpenSearch Service cluster?

I have search latency spikes in my Amazon OpenSearch Service cluster.

Short description

For search requests, OpenSearch Service calculates the round trip time with the following formula:

Round trip = Time the query spends in the query phase + Time in the fetch phase + Time in the queue + Network latency

The SearchLatency OpenSearch Service metric in Amazon CloudWatch shows the time that the query spent in the query phase.

To troubleshoot search latency spikes in your OpenSearch Service cluster, take the following actions:

  • Check infrastructure metrics, such as CPU usage, disk usage, and memory for both Java Virtual Machine (JVM) memory pressure and garbage collection.
  • Check for spikes in the SearchRate metric.
  • Use the ThreadpoolSearchRejected metric to check for search rejections.
  • Use slow logs to identify long-running queries.
  • Resolve "504 gateway timeout" errors.
  • Optimize your configuration to reduce latency.

Resolution

Check infrastructure metrics

Frequent and long garbage collection on OpenSearch Service data nodes occurs when your resource usage is high. If you don't provision enough resources on your cluster, then you might experience search latency spikes.

Troubleshoot high resource usage

Make sure that the CPUUtilization and JVMMemoryPressure metrics for your cluster are below 80%. If the metric values in CloudWatch are higher than 80%, then troubleshoot the high CPU usage or high JVM memory pressure.

To proactively monitor resource usage, set up CloudWatch alarms for OpenSearch Service.

To get node-level statistics on your cluster, run the following query multiple times at intervals of 5 minutes:

GET /_nodes/stats

In the output, look for significant changes in the cache usage, fielddata memory, and JVM heap values between runs. Consistent values show normal operation. Sudden spikes or drops might occur when there are issues.
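
To reduce the amount of output that you must review, you can limit node stats to specific metric groups. The following request is a minimal sketch that returns only the JVM, OS, and indices sections:

GET /_nodes/stats/jvm,os,indices?human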

Check your cache settings

OpenSearch Service uses the following caches to improve its performance and request response time:

  • The file system cache, or page cache, that exists on the OS level
  • The shard-level request cache and query cache that both exist on the OpenSearch Service level

The file system cache is managed at the operating system level and isn't exposed through OpenSearch APIs. To view information for the shard-level request cache, run the following query:

GET /_nodes/stats/indices/request_cache?human

To view information for the query cache, run the following query:

GET /_nodes/stats/indices/query_cache?human

In the output, check for cache evictions. A high number of cache evictions means that the cache is too small to serve the request. To reduce your evictions, use bigger nodes with more memory. For more information about the pricing of node sizes, see OpenSearch Service pricing. For more information about OpenSearch caches, see Elasticsearch caching deep dive: Boosting query speed one cache at a time on the Elastic website.

To clear your cache, see Clear cache API on the OpenSearch website.
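
For example, the following request is a sketch that clears the fielddata, query, and request caches for a single index with the placeholder name my-index:

POST /my-index/_cache/clear?fielddata=true&query=true&request=true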

Aggregations on fields that contain highly unique values might cause an increase in heap usage. Search operations for aggregation queries use fielddata, and scripts that sort or access field values also use fielddata. Fielddata evictions depend on the indices.fielddata.cache.size setting, which accounts for 20% of the JVM heap.

To check how much memory fielddata uses across all nodes in the cluster, run the following query:

GET /_nodes/stats/indices/fielddata?human
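
To see which fields consume the most fielddata memory on each node, you can also use the cat fielddata API. The following request is a sketch that limits the output to the node, field, and size columns:

GET /_cat/fielddata?v&h=node,field,size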

Check for spikes in the SearchRate metric

A large number of search requests in a short period can strain a cluster's resources, delay query processing, and slow response times for individual searches. If the SearchRate metric spikes, then check whether the spikes occur at the same time as the search latency spikes. If they occur at the same time, then add more resources to your cluster or optimize your queries to manage the search load.
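
To correlate the CloudWatch metric with per-node activity, you can review the search statistics in node stats. The following request is a sketch that returns counters such as query_current, query_total, and query_time for each node:

GET /_nodes/stats/indices/search?human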

Check for search rejections

Use the ThreadpoolSearchRejected metric to identify and resolve search rejections.
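
In addition to the CloudWatch metric, you can check the search thread pool directly. The following request is a sketch that shows the active threads, queued requests, and rejection count for each node:

GET /_cat/thread_pool/search?v&h=node_name,active,queue,rejected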

Use slow logs to identify long-running queries

To identify long-running queries and the time that a query spends on a specific shard, use slow logs. You can set thresholds for both the query phase and the fetch phase for each index.
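
The following request is a sketch that sets example thresholds on a placeholder index named my_index. Choose threshold values that match your workload, and make sure that slow log publishing is turned on for your domain:

PUT /my_index/_settings
{
  "index.search.slowlog.threshold.query.warn": "5s",
  "index.search.slowlog.threshold.fetch.warn": "1s"
}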

For a detailed summary of the time that your query spends in the query phase, set profile to true in your search query.

Example query:

GET /my_index/_search
{
  "profile": true,
  "query": {
    "match": {
      "field": "value"
    }
  }
}

Note: If you set the logging threshold too low, then your JVM memory pressure and cluster latency might increase. When you log more queries, you also increase your costs. A large output for a query with profile set to true adds overhead to other search queries. As a result, other searches temporarily slow down.

Resolve 504 gateway timeout errors

Prerequisite: Activate error logs to identify specific HTTP error codes.

Use the application logs of your OpenSearch Service cluster to check the specific HTTP error codes for individual requests. To resolve HTTP 504 gateway timeout errors, see How can I prevent HTTP 504 gateway timeout errors in OpenSearch Service?

Optimize your configuration

Manage your garbage collection activity

Frequent or long running garbage collection activity might cause search performance issues, pause threads, or increase search latency. For best practices to reduce garbage collection time, see A heap of trouble: Managing Elasticsearch's managed heap on the Elastic website.

Optimize your instance storage

Your Amazon Elastic Compute Cloud (Amazon EC2) instance type can use either Amazon Elastic Block Store (Amazon EBS) optimized storage or instance store volumes. Instance store volumes can help address I/O bottlenecks because they offer directly attached storage and higher IOPS capabilities. However, Amazon EBS-optimized instances offer persistent storage with consistent performance. Choose a storage type that aligns with your configuration requirements based on I/O, data persistence, and costs.

Before you change your instance type, it's a best practice to test performance between different instance types to verify that they meet your workload requirements. For a list of available OpenSearch Service instance types, see Free Tier and On-Demand Instance pricing on OpenSearch Service Pricing.

Note: If your cluster is in a virtual private cloud (VPC), then it's a best practice to run your applications within the same VPC.

Simplify your shard and segment configuration

A cluster with too many shards might increase resource usage, even when the cluster is inactive. Too many shards also slow down query performance. Although a larger replica shard count can speed up searches, don't use more than 1,000 shards on a single node. Also, make sure that shard sizes are between 10 GiB and 50 GiB. It's a best practice to keep no more than 20 shards per GiB of JVM heap on each node. For information about how to reindex and change your shard strategy, see Optimize OpenSearch index shard sizes on the OpenSearch website.
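
To review your current shard counts and sizes before you change your shard strategy, list the shards in your cluster. The following request is a sketch that shows the index, shard number, primary or replica role, on-disk size, and node for each shard:

GET /_cat/shards?v&h=index,shard,prirep,store,node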

Too many segments or too many deleted documents can also affect search performance. To improve performance, use force merge on read-only indexes and increase the refresh interval on active indexes. For more information, see Force merge API and Optimize OpenSearch refresh interval on the OpenSearch website.
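
The following requests are sketches that use placeholder index names and an example refresh interval. The first request force merges a read-only index down to one segment, and the second request increases the refresh interval on an active index:

POST /my-read-only-index/_forcemerge?max_num_segments=1

PUT /my-active-index/_settings
{
  "index.refresh_interval": "30s"
}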

Before you add replica shards to all nodes, evaluate your application's requirements. If your application must search all data from any node, then increase the number of replica shards to increase data availability. Otherwise, you might not need replica shards on every node.

Note: Replica shards allow clusters to use parallel processing and distribute search requests across multiple copies of the data. As a result, search performance improves. However, indexing operations become slower and you require additional storage for each complete data copy.
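
If you decide to change your replica count, you can update it dynamically. The following request is a sketch that sets two replicas on a placeholder index named my_index:

PUT /my_index/_settings
{
  "index.number_of_replicas": 2
}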

For indexes with many shards, use custom routing to improve search performance. With custom routing, you query only the shards that hold your data instead of all shards. To configure custom routing, see Customizing your document routing on the Elastic website.
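
The following requests are sketches that use a placeholder routing key named customer-123. The first request indexes a document with the routing key, and the second request searches only the shard that holds documents with that key:

PUT /my_index/_doc/1?routing=customer-123
{
  "field": "value"
}

GET /my_index/_search?routing=customer-123
{
  "query": {
    "match": {
      "field": "value"
    }
  }
}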

Use UltraWarm storage for read-only data

Hot storage provides the fastest performance to index and search new data. However, UltraWarm nodes offer a cost-effective way to store large amounts of read-only data on your cluster. For read-only indexes that don't require high performance, use UltraWarm instead of hot data storage.
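
OpenSearch Service provides a migration API to move an index to UltraWarm storage. The following request is a sketch that assumes a domain with UltraWarm turned on and a read-only index with the placeholder name my-index:

POST /_ultrawarm/migration/my-index/_warm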

Increase your search speed

Search as few fields as possible, and avoid scripts and wildcard queries. For more information, see Tune for search speed on the Elastic website.
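
For example, instead of a wildcard query across all fields, target only the fields that you need. The following request is a sketch that uses placeholder index and field names:

GET /my_index/_search
{
  "query": {
    "multi_match": {
      "query": "value",
      "fields": ["title", "description"]
    }
  }
}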
