How do I troubleshoot high CPU utilization on my Amazon OpenSearch Service cluster?
My data nodes show high CPU usage on my Amazon OpenSearch Service cluster.
Short description
It's a best practice to maintain CPU utilization at a level that leaves OpenSearch Service enough resources to perform its tasks. Consistently high CPU utilization can degrade cluster performance. When your cluster is overloaded, OpenSearch Service stops responding and requests time out.
To troubleshoot high CPU utilization on your cluster, take the following actions:
- Use an automated runbook.
- Use the nodes hot threads API.
- Check the write operation or bulk API thread pool.
- Check the search thread pool.
- Check the Apache Lucene merge thread pool.
- Check the JVM memory pressure.
- Review your sharding strategy.
- Optimize your queries.
Resolution
Use an automated runbook
Use the AWSSupport-TroubleshootOpenSearchHighCPU AWS Systems Manager automation runbook to troubleshoot high CPU utilization in OpenSearch Service.
Note: Before you use the runbook, review the Required AWS Identity and Access Management (IAM) permissions and Instructions sections in AWSSupport-TroubleshootOpenSearchHighCPU.
The output displays the following information:
- Hot threads.
- Currently running tasks.
- Thread pool statistics for each node in the domain.
- Information about the nodes in the domain sorted by their CPU usage.
- Shard allocation to each data node and their disk space.
- Health status and information about the health of the OpenSearch Service domain.
Use the output of the runbook to identify the cause of high CPU utilization.
Use the nodes hot threads API
If there are constant CPU spikes in your OpenSearch Service cluster, then use the nodes hot threads API. For more information, see Nodes hot threads API on the Elastic website.
Example output:
GET _nodes/hot_threads

100.0% (131ms out of 500ms) cpu usage by thread 'opensearch[xxx][search][T#62]'
10/10 snapshots sharing following 10 elements
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
java.util.concurrent.LinkedTransferQueue.awaitMatch(LinkedTransferQueue.java:737)
java.util.concurrent.LinkedTransferQueue.xfer(LinkedTransferQueue.java:647)
java.util.concurrent.LinkedTransferQueue.take(LinkedTransferQueue.java:1269)
org.opensearch.common.util.concurrent.SizeBlockingQueue.take(SizeBlockingQueue.java:162)
java.util.concurrent.ThreadPoolExecutor.getTask(ThreadPoolExecutor.java:1067)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1127)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
java.lang.Thread.run(Thread.java:745)
Note: The nodes hot threads output lists information for each node. The length of your output depends on how many nodes are running in your OpenSearch Service cluster.
You can also use the cat nodes API to view the current breakdown of resource utilization. To sort the nodes by CPU utilization in descending order, run the following command:
GET _cat/nodes?v&s=cpu:desc
The last column in your output displays your node name. For more information, see cat nodes API on the Elastic website.
Pass the relevant node name to the hot threads API:
GET _nodes/<node-name>/hot_threads
For more information, see Hot threads API on the Elastic website.
Example output:
<percentage> of cpu usage by thread 'opensearch[<nodeName>][<thread-name>]'
The thread name indicates the OpenSearch Service processes that are using high CPU.
Check the write operation or bulk API thread pool
A 429 error in OpenSearch Service might indicate that your cluster is handling too many bulk indexing requests. When there are constant CPU spikes in your cluster, OpenSearch Service rejects the bulk indexing requests.
The write thread pool handles indexing requests, which include Bulk API operations. To check whether your cluster is handling too many bulk indexing requests, check the IndexingRate metric in Amazon CloudWatch.
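You can also check write thread pool activity and rejections directly on the cluster with the cat thread pool API. The following is a minimal example, and the column list after h= is optional:

GET _cat/thread_pool/write?v&h=node_name,name,active,queue,rejected

A growing queue or a nonzero rejected count indicates that the write thread pool can't keep up with the indexing load.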
If your cluster is handling too many bulk indexing requests, then take the following actions:
- Reduce the number of bulk requests on your cluster.
- Reduce the size of each bulk request so that your nodes can process them more efficiently.
- If you use Logstash to push data into your OpenSearch Service cluster, then reduce the batch size or the number of workers.
- If your cluster can't keep up with the ingestion rate, then scale your cluster horizontally or vertically. To scale your cluster, add nodes or increase the instance type so that OpenSearch Service can process the incoming requests.
For more information, see Bulk API on the Elastic website.
Check the search thread pool
A search thread pool that uses high CPU indicates that search queries are overwhelming your OpenSearch Service cluster. A single long-running query can overwhelm your cluster. An increase in the queries that your cluster performs can also affect your search thread pool.
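To see the current activity of the search thread pool on each node, you can query the cat thread pool API for the search pool. This is a minimal example, and the column list after h= is optional:

GET _cat/thread_pool/search?v&h=node_name,name,active,queue,rejected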
To check whether a single query is increasing your CPU usage, use the task management API:
GET _tasks?actions=*search&detailed
The task management API gets all active search queries that are running on your cluster. For more information, see Task management API on the Elastic website.
Note: The output includes the description field only when the task management API lists a search task.
Example output:
{ "nodes": { "U4M_p_x2Rg6YqLujeInPOw": { "name": "U4M_p_x", "roles": [ "data", "ingest" ], "tasks": { "U4M_p_x2Rg6YqLujeInPOw:53506997": { "node": "U4M_p_x2Rg6YqLujeInPOw", "id": 53506997, "type": "transport", "action": "indices:data/read/search", "description": """indices[*], types[], search_type[QUERY_THEN_FETCH], source[{"size":10000,"query":{"match_all":{"boost":1.0}}}]""", "start_time_in_millis": 1541423217801, "running_time_in_nanos": 1549433628, "cancellable": true, "headers": {} } } } } }
Check the description field to identify the query that's running. The running_time_in_nanos field indicates the amount of time a query is running. To decrease your CPU usage, cancel the search query that's using high CPU. The task management API also supports a _cancel call.
Note: To cancel a task, record the task ID from your output. In the following example, the task ID is U4M_p_x2Rg6YqLujeInPOw:53506997.
Example call:
POST _tasks/U4M_p_x2Rg6YqLujeInPOw:53506997/_cancel
The task management POST call marks the task as "cancelled" and releases any associated resources. If multiple queries are running on your cluster, then use the POST call to cancel each query until your cluster returns to a normal state.
To verify that the number of active queries decreased, check the SearchRate metric in CloudWatch. For more information, see Thread pools on the Elastic website. To prevent high CPU spikes, it's also a best practice to set a timeout value in the query body, as shown in the following example. For more information, see Parameters on the Elastic website.
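The following is a minimal sketch of a search request that sets a timeout in the query body. The index name my-index and the 10-second value are placeholders. When the timeout elapses, OpenSearch Service returns the partial results that it collected up to that point:

GET my-index/_search
{
  "timeout": "10s",
  "query": {
    "match_all": {}
  }
}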
Note: When you cancel all active search queries at the same time in your OpenSearch Service cluster, errors can occur on the client application side.
Check the Apache Lucene merge thread pool
OpenSearch Service uses Apache Lucene to index and search documents on your cluster. As new shard segments are created, Apache Lucene runs merge operations to reduce the effective number of segments for each shard and to remove deleted documents.
If Apache Lucene merge operations drive up CPU usage, then increase the refresh_interval setting on your OpenSearch Service indices. A higher refresh_interval slows down segment creation. For more information, see index.refresh_interval on the Elastic website.
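For example, the following request raises the refresh interval on a hypothetical index named my-index to 30 seconds. Choose a value that matches how quickly new documents must become searchable:

PUT my-index/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}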
Note: A cluster that's migrating indices to UltraWarm storage can increase your CPU utilization. An UltraWarm migration usually involves a force merge API operation that can be CPU intensive. For more information, see force merge API on the Elastic website.
To check for UltraWarm migrations, run the following command:
GET _ultrawarm/migration/_status?v
For more information, see Merge on the Elastic website.
Check the JVM memory pressure
Review the JVM memory pressure, which is the percentage of the Java heap that's in use on each cluster node. If JVM memory pressure reaches 75%, then Amazon OpenSearch Service initiates the Concurrent Mark Sweep (CMS) garbage collector. If JVM memory pressure reaches 100%, then the OpenSearch Service JVM exits with an OutOfMemory (OOM) error and eventually restarts.
In the following example log, the JVM is within the recommended range, but a long-running garbage collection is affecting the cluster:
[2022-06-28T10:08:12,066][WARN ][o.o.m.j.JvmGcMonitorService] [515f8f06f23327e6df3aad7b2863bb1f] [gc][6447732] overhead, spent [9.3s] collecting in the last [10.2s]
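You can also check the current heap usage on each node directly from the cluster with the nodes stats API. The filter_path parameter is optional and narrows the response to the heap usage percentage:

GET _nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_used_percent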
For more information, see How do I troubleshoot high JVM memory pressure on my Amazon OpenSearch Service cluster?
Review your sharding strategy
Too many shards for your cluster size can degrade cluster performance. It's a best practice to have no more than 25 shards per GiB of Java heap.
By default, OpenSearch Service uses a 5:1 sharding strategy, where each index is divided into five primary shards. Each primary shard also has its own replica. OpenSearch Service assigns primary shards and replica shards to separate data nodes to make sure that there's a backup in case of a failure.
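To review how many shards each data node currently holds and how much disk space the shards use, run the cat allocation API:

GET _cat/allocation?v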
For more information, see How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster?
Optimize your queries
Heavy aggregations, wildcard queries (especially leading wildcards), and regex queries might cause CPU utilization spikes. To diagnose these queries, review your search slow logs and index slow logs. For more information, see Monitoring OpenSearch logs with Amazon CloudWatch Logs.
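After you turn on slow log publishing for your domain, you must also set slow log thresholds on the indices that you want to monitor. The following is a minimal sketch, and the index name my-index and the threshold values are placeholders to adjust for your workload:

PUT my-index/_settings
{
  "index.search.slowlog.threshold.query.warn": "10s",
  "index.search.slowlog.threshold.fetch.warn": "1s",
  "index.indexing.slowlog.threshold.index.warn": "10s"
}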
Related information
How can I improve the indexing performance on my Amazon OpenSearch Service cluster?
How do I resolve search or write rejections in Amazon OpenSearch Service?