How do I troubleshoot high CPU utilization on my OpenSearch Service cluster?
My data nodes show high CPU usage on my Amazon OpenSearch Service cluster.
Short description
To troubleshoot high CPU utilization on your cluster, take the following actions:
- Use an automated runbook to identify the cause of high CPU usage.
- Use anomaly detection to identify patterns.
- Use the nodes hot threads API to understand your resource usage.
- Check the write operation or bulk API thread pool.
- Check the search thread pool.
- Check the Apache Lucene merge thread pool.
- Check the Java virtual machine (JVM) memory pressure.
- Review your shard strategy.
- Scale your cluster.
- Optimize your queries.
It's a best practice to keep your CPU usage low enough for OpenSearch Service to perform its tasks. A cluster that consistently has high CPU usage can experience performance issues. When a cluster is overloaded, OpenSearch Service might stop responding, and your requests time out.
Resolution
Use an automated runbook
Prerequisite: Make sure that you have the required AWS Identity and Access Management (IAM) permissions to run the runbook. For more information, see Required IAM permissions in AWSSupport-TroubleshootOpenSearchHighCPU.
Run the AWSSupport-TroubleshootOpenSearchHighCPU AWS Systems Manager automation runbook to troubleshoot the high CPU usage in OpenSearch Service.
The output displays the following information:
- Hot threads
- Running tasks
- Thread pool statistics for each node in the domain
- Information about the nodes in the domain sorted by their CPU usage
- Shard allocation to each data node and its disk space
- Health status and information about the health of the OpenSearch Service domain
Use the output of the runbook to identify the cause of the high CPU usage.
Use anomaly detection to identify patterns
To identify potential issues before they cause outages, use anomaly detection in OpenSearch Service to automatically detect unusual patterns in metrics such as CPU usage. For more information, see Tutorial: Detect high CPU usage with anomaly detection.
Use the nodes hot threads API
If there are constant CPU spikes in your OpenSearch Service cluster, then run the following command to view information for all nodes in the cluster:
GET _nodes/hot_threads
The length of your output depends on how many nodes are running in your OpenSearch Service cluster. For more information about the node hot threads API, see Nodes hot threads API on the OpenSearch website.
Example output:
GET _nodes/hot_threads
100.0% (131ms out of 500ms) cpu usage by thread 'opensearch[abc][search][T#62]'
  10/10 snapshots sharing following 10 elements
    sun.misc.Unsafe.park(Native Method)
    java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
    java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
    java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
    org.opensearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
    java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    java.lang.Thread.run(Thread.java:745)
You can also use the cat nodes API to view the current breakdown of resource usage. To view nodes with the highest CPU usage, run the following command:
GET _cat/nodes?v&s=cpu:desc
The last column in the output displays the node name. For more information about the cat nodes API, see CAT nodes API on the OpenSearch website.
Then, run the following command for nodes with high CPU usage:
GET _nodes/node_id/hot_threads
Note: Replace node_id with the node ID.
The output shows the OpenSearch Service processes in the node that use the most CPU. If you see an Apache Lucene merge thread in the output, then see Check the Apache Lucene merge thread pool to troubleshoot.
Example output:
percentage of cpu usage by thread 'opensearch[nodeName][thread-name]'
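When the hot threads output is long, it can help to rank the threads by CPU share. The following Python sketch parses header lines in the format shown in the preceding examples. The function names are hypothetical, and the pool lookup assumes that thread names follow the 'opensearch[node][pool][T#n]' shape from the example. Other thread names, such as merge threads, might not match that shape.

```python
import re

# Matches header lines such as:
#   100.0% (131ms out of 500ms) cpu usage by thread 'opensearch[abc][search][T#62]'
HOT_THREAD_HEADER = re.compile(
    r"(?P<pct>\d+(?:\.\d+)?)% \((?P<busy>[^)]*)\) "
    r"cpu usage by thread '(?P<thread>[^']+)'"
)


def parse_hot_threads(text):
    """Return (cpu_percent, thread_name) pairs, busiest thread first."""
    found = [
        (float(m.group("pct")), m.group("thread"))
        for m in HOT_THREAD_HEADER.finditer(text)
    ]
    return sorted(found, reverse=True)


def thread_pool_of(thread_name):
    """Return the second bracketed part of the name, or None.

    Assumes the 'opensearch[node][pool][T#n]' naming shape.
    """
    parts = re.findall(r"\[([^\]]+)\]", thread_name)
    return parts[1] if len(parts) >= 2 else None
```

To use the sketch, pass the raw text that the hot threads API returns to parse_hot_threads, and then call thread_pool_of on each thread name to see which thread pool the busiest threads belong to.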
Check the write operation or bulk API thread pool
If you receive a "429" error message, then the cluster might have too many bulk index requests. When there are constant CPU spikes in your cluster, OpenSearch Service rejects bulk index requests.
The write thread pool manages index requests and includes Bulk API operations. To check whether your domain is under strain because of too many bulk index requests, check the IndexingRate Amazon CloudWatch metric.
If your cluster has too many bulk index requests, then take the following actions:
- Reduce the number of bulk requests on your cluster.
- Reduce the size of each bulk request so that your nodes can process them more efficiently.
- If you use Logstash to upload data into your OpenSearch Service cluster, then reduce the batch size or the number of workers.
- If your cluster's ingestion rate slows down, then horizontally or vertically scale your cluster.
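One way to reduce the size of each bulk request is to split the documents into smaller newline-delimited batches on the client side before you send them. The following Python sketch shows the idea. The function name and the default limits (500 documents and 5 MiB for each batch) are assumptions for illustration, not recommended values. Tune the limits for your workload.

```python
import json


def bulk_batches(actions, max_docs=500, max_bytes=5 * 1024 * 1024):
    """Yield newline-delimited bulk bodies that stay within the limits.

    `actions` is an iterable of (action_metadata, document) dict pairs.
    The default limits are illustrative assumptions.
    """
    lines, size, docs = [], 0, 0
    for meta, doc in actions:
        # Each bulk item is an action line followed by a source line.
        pair = json.dumps(meta) + "\n" + json.dumps(doc) + "\n"
        pair_size = len(pair.encode("utf-8"))
        # Close the current batch before it exceeds either limit.
        if docs and (docs >= max_docs or size + pair_size > max_bytes):
            yield "".join(lines)
            lines, size, docs = [], 0, 0
        lines.append(pair)
        size += pair_size
        docs += 1
    if lines:
        yield "".join(lines)
```

Each yielded string can be sent as the body of one bulk request. Because the function is a generator, the client holds only one batch in memory at a time.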
Check the search thread pool
A search thread pool that uses high CPU shows that search queries overwhelm your OpenSearch Service cluster. A single long-running query can overwhelm your cluster. An increase in the queries that your cluster performs can also affect your search thread pool.
To check whether a single query increases your CPU usage, run the following command:
GET _tasks?actions=*search&detailed
The task management API shows all active search queries that run on your cluster. For more information, see List tasks API on the OpenSearch website.
In the output, check the description field to identify the query that's running. The running_time_in_nanos field shows how long the query has been running, in nanoseconds.
Example output:
{
  "nodes": {
    "U4M_p_x2Rg6YqLujeInPOw": {
      "name": "U4M_p_x",
      "roles": [
        "data",
        "ingest"
      ],
      "tasks": {
        "U4M_p_x2Rg6YqLujeInPOw:53506997": {
          "node": "U4M_p_x2Rg6YqLujeInPOw",
          "id": 53506997,
          "type": "transport",
          "action": "indices:data/read/search",
          "description": """indices[*], types[], search_type[QUERY_THEN_FETCH], source[{"size":10000,"query":{"match_all":{"boost":1.0}}}]""",
          "start_time_in_millis": 1541423217801,
          "running_time_in_nanos": 1549433628,
          "cancellable": true,
          "headers": {}
        }
      }
    }
  }
}
Note: For search tasks, the task management API output includes only the description field.
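If the task list is long, then you can scan the response for the longest-running queries. The following Python sketch assumes that the response is already parsed into a dictionary with the structure shown in the example output. The function name and the one-second threshold are hypothetical.

```python
def long_running_searches(tasks_response, min_seconds=1.0):
    """Return (task_id, seconds, description) tuples, longest-running first.

    `tasks_response` is the parsed JSON of a task list response.
    The `min_seconds` threshold is an illustrative assumption.
    """
    hits = []
    for node in tasks_response.get("nodes", {}).values():
        for task_id, task in node.get("tasks", {}).items():
            # Convert nanoseconds to seconds for readability.
            seconds = task.get("running_time_in_nanos", 0) / 1e9
            if seconds >= min_seconds:
                hits.append((task_id, seconds, task.get("description", "")))
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

For the example output, the function returns the single task U4M_p_x2Rg6YqLujeInPOw:53506997 with a running time of about 1.55 seconds.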
To decrease your CPU usage, run the following command to cancel the search query that has high CPU usage:
POST _tasks/U4M_p_x2Rg6YqLujeInPOw:53506997/_cancel
Note: Replace U4M_p_x2Rg6YqLujeInPOw:53506997 with your task ID.
The preceding command marks the task as canceled, and then releases the dependent AWS resources. If multiple queries run on your cluster, then use the POST command to cancel each query until your cluster returns to a normal state.
To prevent CPU spikes, it's a best practice to set a timeout value in the query body. For more information, see Search settings on the OpenSearch website. To verify that the number of active queries decreased, check the SearchRate CloudWatch metric.
Note: When you cancel all active search queries at the same time in your OpenSearch Service cluster, errors might occur on the client application side.
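To illustrate the timeout best practice, the following Python sketch adds a top-level timeout field to a search request body before the client sends it. The helper name and the 30s default are assumptions for illustration. Choose a value that fits your latency requirements.

```python
import copy
import json


def with_timeout(search_body, timeout="30s"):
    """Return a copy of the search body with a top-level timeout.

    An existing timeout in the body is left unchanged.
    The default value is an illustrative assumption.
    """
    body = copy.deepcopy(search_body)
    body.setdefault("timeout", timeout)
    return body


body = with_timeout({"size": 100, "query": {"match_all": {}}}, timeout="10s")
print(json.dumps(body, indent=2))
```

The helper copies the request body so that the caller's original dictionary isn't modified.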
Check the Apache Lucene merge thread pool
OpenSearch Service uses Apache Lucene to index and search documents on your cluster. When you create new shard segments, Apache Lucene runs merge operations to reduce the effective number of segments for each shard and remove deleted documents. For more information, see Merge settings on the Elastic website.
If an Apache Lucene merge thread affects your CPU usage, then run the following command to increase the refresh_interval setting of your indexes:
PUT /your-index-name/_settings
{
  "index": {
    "refresh_interval": "value"
  }
}
Note: Replace value with your new refresh interval (for example, 30s). A longer refresh interval slows down segment creation on the cluster. For more information, see Refresh index API on the OpenSearch website.
When a cluster migrates indexes to UltraWarm storage, your CPU usage might increase. An UltraWarm migration typically uses a force merge API operation that can be CPU intensive. For more information, see Force merge API on the OpenSearch website.
To check for UltraWarm migrations, run the following command:
GET _ultrawarm/migration/_status?v
Check the JVM memory pressure
Check the JVM memory pressure percentage of the Java heap in a cluster node. In the following example log, the JVM is within the recommended range, but a long-running garbage collection affects the cluster:
[2022-06-28T10:08:12,066][WARN ][o.o.m.j.JvmGcMonitorService] [515f8f06f23327e6df3aad7b2863bb1f] [gc][6447732] overhead, spent [9.3s]collecting in the last [10.2s]
To resolve high JVM memory pressure issues, see How do I troubleshoot high JVM memory pressure on my OpenSearch Service cluster?
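To quantify garbage collection overhead from log lines such as the preceding example, you can compute the share of the monitoring window that the JVM spent collecting. The following Python sketch assumes the "overhead, spent [...] collecting in the last [...]" wording from the example log and time values in ms, s, or m. The function name is hypothetical.

```python
import re

# Matches "overhead, spent [9.3s]collecting in the last [10.2s]".
GC_OVERHEAD = re.compile(
    r"overhead, spent \[(?P<spent>[\d.]+)(?P<spent_unit>ms|s|m)\]\s*"
    r"collecting in the last \[(?P<window>[\d.]+)(?P<window_unit>ms|s|m)\]"
)

# Conversion factors to seconds for the units that the pattern accepts.
_SECONDS = {"ms": 0.001, "s": 1.0, "m": 60.0}


def gc_overhead_ratio(log_line):
    """Return the fraction of the window spent in GC, or None if no match."""
    m = GC_OVERHEAD.search(log_line)
    if m is None:
        return None
    spent = float(m.group("spent")) * _SECONDS[m.group("spent_unit")]
    window = float(m.group("window")) * _SECONDS[m.group("window_unit")]
    return spent / window if window else None
```

For the example log line, 9.3 seconds out of 10.2 seconds is a ratio of about 0.91. This means that the node spent roughly 91% of that window on garbage collection.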
Review your shard strategy
Depending on the cluster size, too many shards might degrade your cluster's performance. It's a best practice to have no more than 25 shards for each GiB of Java heap.
By default, OpenSearch Service has a shard strategy of 5:1, where each index has five primary shards. Within each index, each primary shard has its own replica. OpenSearch Service automatically assigns primary shards and replica shards to separate data nodes and makes sure that there's a backup in case of a failure.
To redistribute your shards, see How do I rebalance the uneven shard distribution in my OpenSearch Service cluster?
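The shard guidance above reduces to simple arithmetic. The following Python sketch compares the total shard count against the limit of 25 shards for each GiB of Java heap. The function names and the example figures (100 indexes, three data nodes, 32 GiB of heap for each node) are hypothetical.

```python
def max_shards_for_heap(heap_gib, shards_per_gib=25):
    """Upper bound on shards for a node with the given Java heap size."""
    return int(heap_gib * shards_per_gib)


def total_shards(index_count, primaries=5, replicas=1):
    """Total shards for indexes that share the same primary:replica setup."""
    return index_count * primaries * (1 + replicas)


# Hypothetical cluster: 100 indexes with the 5:1 strategy,
# on 3 data nodes that each have 32 GiB of Java heap.
needed = total_shards(100)
capacity = 3 * max_shards_for_heap(32)
print(f"shards needed: {needed}, best-practice ceiling: {capacity}")
```

In this hypothetical case, each index has 5 primary shards and 5 replica shards, so 100 indexes produce 1,000 shards. The three nodes have a combined ceiling of 2,400 shards, so the cluster stays within the best practice.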
Scale the cluster
Make sure that the cluster has enough CPU, memory, and disk space for your requirements. If your application performs large queries or frequent writes, then resize your cluster or nodes to meet the performance demands.
Also, use dedicated master nodes to improve cluster stability and resilience, especially in larger deployments. This configuration removes cluster management responsibilities from data nodes.
Optimize your queries
Heavy aggregations, wildcard queries (especially queries with leading wildcards), and regular expression (regex) queries might cause CPU usage spikes. To diagnose these queries, check your OpenSearch slow logs.
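After you find a slow query, you can check its body for the patterns described above. The following Python sketch walks a query DSL dictionary and reports wildcard clauses whose pattern starts with * or ?, and all regexp clauses. The function name is hypothetical, and the check is a heuristic that assumes the standard wildcard and regexp clause shapes. It doesn't measure the actual cost of a query.

```python
def find_expensive_clauses(query, path="query"):
    """Return (path, field, reason) tuples for potentially costly clauses."""
    findings = []
    if isinstance(query, dict):
        for key, value in query.items():
            here = f"{path}.{key}"
            if key == "wildcard" and isinstance(value, dict):
                for field, spec in value.items():
                    # Supports {"field": {"value": "*x"}} and {"field": "*x"}.
                    pattern = spec.get("value", "") if isinstance(spec, dict) else spec
                    if isinstance(pattern, str) and pattern[:1] in ("*", "?"):
                        findings.append((here, field, "leading wildcard"))
            elif key == "regexp" and isinstance(value, dict):
                for field in value:
                    findings.append((here, field, "regexp"))
            else:
                # Recurse into compound queries such as bool.
                findings.extend(find_expensive_clauses(value, here))
    elif isinstance(query, list):
        for i, item in enumerate(query):
            findings.extend(find_expensive_clauses(item, f"{path}[{i}]"))
    return findings
```

Pass the value of the query field from the search body to the function. An empty list means that the sketch found none of the patterns that it checks for.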
Related information
How can I improve the indexing performance on my OpenSearch Service cluster?
How do I resolve search or write rejections in OpenSearch Service?