Short description
The Monitoring tab in your OpenSearch Service console shows the status of the least healthy index in your cluster. A red status occurs when OpenSearch Service hasn't allocated one or more primary shards and their replicas. The yellow status occurs when OpenSearch Service allocated all the primary shards but hasn't allocated one or more replica shards.
Important: A red cluster status indicates partial data unavailability. Although a yellow status doesn't indicate data loss, the yellow status means that your cluster lacks full redundancy. If a node fails, then you might experience data loss.
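The red and yellow rules above can be sketched as a small function. This is a simplified model for illustration only: the `cluster_status` helper and its record layout (`prirep`, `state`) are assumptions that mirror the columns of the `_cat/shards` output, and the sketch ignores shards that are still initializing.

```python
def cluster_status(shards):
    """Return 'red', 'yellow', or 'green' for a list of shard records.

    Each record is a dict with 'prirep' ('p' for primary, 'r' for replica)
    and 'state' keys. Hypothetical helper, not part of any OpenSearch client.
    """
    unassigned = [s for s in shards if s["state"] == "UNASSIGNED"]
    # Any unassigned primary shard means part of the data is unavailable.
    if any(s["prirep"] == "p" for s in unassigned):
        return "red"
    # Only replicas are unassigned: data is available but not fully redundant.
    if unassigned:
        return "yellow"
    return "green"


if __name__ == "__main__":
    shards = [
        {"prirep": "p", "state": "STARTED"},
        {"prirep": "r", "state": "UNASSIGNED"},
    ]
    print(cluster_status(shards))
```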
Resolution
Important: To reconfigure a domain, you must first resolve the red cluster status. If you try to reconfigure a domain that's in the red status, then it might get stuck in the "Modifying" state.
Identify the cause for your unassigned shards
To identify and troubleshoot the root cause of the unassigned shards, use the AWSSupport-TroubleshootOpenSearchRedYellowCluster runbook. For instructions, see Instructions on AWSSupport-TroubleshootOpenSearchRedYellowCluster.
Or, to manually identify the unassigned shards, run the following command:
curl -XGET 'domain-endpoint/_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
Note: Replace domain-endpoint with your domain endpoint. In the output, note the shard ID.
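If you save the command's output instead of filtering it with grep, a short script can turn it into structured records. The following is a minimal sketch that assumes the five columns requested above, in that order, separated by whitespace. The `parse_unassigned` name and the sample index names are hypothetical.

```python
def parse_unassigned(cat_shards_text):
    """Extract unassigned shards from _cat/shards output.

    Assumes the columns index, shard, prirep, state, unassigned.reason
    and no header row (the command above doesn't pass the v parameter).
    """
    rows = []
    for line in cat_shards_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[3] != "UNASSIGNED":
            continue
        rows.append({
            "index": fields[0],
            "shard": int(fields[1]),
            "primary": fields[2] == "p",
            # The reason column is empty for shards that are assigned.
            "reason": fields[4] if len(fields) > 4 else None,
        })
    return rows


# Illustrative output only; the index names are made up.
SAMPLE = """\
logs-2024 0 p STARTED
logs-2024 0 r UNASSIGNED NODE_LEFT
metrics 1 p UNASSIGNED ALLOCATION_FAILED
"""

if __name__ == "__main__":
    for row in parse_unassigned(SAMPLE):
        print(row)
```

The `shard` and `primary` values of each record are the inputs that the allocation explain request needs.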
Then, run the following command to get details for why the shard is unassigned:
curl -XGET 'domain-endpoint/_cluster/allocation/explain?pretty' -H 'Content-Type:application/json' -d'{
"index": "index-name",
"shard": shardID,
"primary": false
}'
Note: Replace domain-endpoint with your domain endpoint, index-name with your index name, and shardID with the unassigned shard ID. If the shard is a primary shard, then replace false with true.
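To avoid editing the JSON by hand for each shard, you can generate the request body. This sketch only builds the body shown above; the `explain_body` name and the example index name are assumptions, and sending the request is left to the curl command.

```python
import json


def explain_body(index, shard, primary):
    """Build the JSON body for the _cluster/allocation/explain request."""
    return json.dumps({"index": index, "shard": shard, "primary": primary})


if __name__ == "__main__":
    # Pass the printed string to curl with the -d option.
    print(explain_body("logs-2024", 0, False))
```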
Troubleshoot your red or yellow status
To identify why the cluster status is yellow or red, take the following actions:
- Check the ClusterStatus.yellow, ClusterStatus.red, Shards.unassigned, CPUUtilization, JVMMemoryPressure, and FreeStorageSpace Amazon CloudWatch metrics.
- Run the following query to identify the affected indexes:
GET /_cat/indices?v&health=yellow
GET /_cat/indices?v&health=red
- Run the following query to understand why a shard is unassigned without specifying the shard:
GET /_cluster/allocation/explain
Note: When you send this request without a body, the output explains the first unassigned shard that OpenSearch finds, not every unassigned shard in the cluster. Use the output for a general overview of allocation issues, and then query specific shards for details.
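The allocation explain response lists, for each node, the deciders that evaluated the shard. The following sketch pulls out only the deciders that refused the allocation. The `blocking_deciders` name is hypothetical, and the sample response is an abbreviated illustration, not captured output.

```python
def blocking_deciders(explain_response):
    """Map each node name to the (decider, explanation) pairs that said NO."""
    blocked = {}
    for node in explain_response.get("node_allocation_decisions", []):
        reasons = [
            (d["decider"], d["explanation"])
            for d in node.get("deciders", [])
            if d.get("decision") == "NO"
        ]
        if reasons:
            name = node.get("node_name", node.get("node_id"))
            blocked[name] = reasons
    return blocked


# Abbreviated, made-up response for illustration.
SAMPLE_RESPONSE = {
    "index": "logs-2024",
    "shard": 0,
    "primary": False,
    "current_state": "unassigned",
    "node_allocation_decisions": [
        {
            "node_name": "node-a",
            "node_decision": "no",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "a copy of this shard is already allocated to this node",
                }
            ],
        }
    ],
}

if __name__ == "__main__":
    for node, reasons in blocking_deciders(SAMPLE_RESPONSE).items():
        for decider, explanation in reasons:
            print(f"{node}: {decider}: {explanation}")
```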
To resolve a red cluster status, run the following command to delete the red indexes:
curl -XDELETE 'domain-endpoint/index-names'
Note: Replace domain-endpoint with your domain endpoint and index-names with your index name.
Then, restore your indexes from a snapshot.
If your yellow cluster status doesn't self-resolve, then use the information about why the shard is unassigned to address the root cause.
Not enough nodes to allocate to the shards
Primary and replica shards must reside on different nodes. As a result, single-node clusters with replica shards always initialize with a yellow status because OpenSearch Service can't allocate replica shards.
OpenSearch Service versions 7.x and later have a default quota of 1000 for cluster.max_shards_per_node. It's a best practice to use the default value for cluster.max_shards_per_node. For more information, see Cluster-level shard, block, and task settings on the OpenSearch website.
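The two constraints above reduce to simple arithmetic. The sketch below assumes that each copy of a shard needs its own node, and that the shard quota applies cluster-wide as cluster.max_shards_per_node multiplied by the number of data nodes. Both function names are hypothetical.

```python
def unassignable_replicas(data_nodes, replicas_per_primary):
    """Replica copies per primary shard that can't be placed.

    A primary and each of its replicas must sit on different nodes,
    so a shard with r replicas needs r + 1 nodes.
    """
    return max(0, replicas_per_primary + 1 - data_nodes)


def shard_headroom(data_nodes, open_shards, max_shards_per_node=1000):
    """Shards that can still be created before the quota is reached.

    Assumes the cluster-wide quota is max_shards_per_node * data_nodes.
    A negative result means the cluster is already over the quota.
    """
    return data_nodes * max_shards_per_node - open_shards


if __name__ == "__main__":
    # A single-node cluster with one replica per primary stays yellow.
    print(unassignable_replicas(data_nodes=1, replicas_per_primary=1))
    print(shard_headroom(data_nodes=3, open_shards=2500))
```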
If you set shard allocation filters, then the shard can become unassigned because it doesn't have enough filtered nodes. For more information about shard allocation filters, see Index-level index settings on the OpenSearch website.
To avoid this issue, take the following actions:
For more information, see Sizing OpenSearch Service domains and Demystifying OpenSearch Service shard allocation.
Storage space issues
If there isn't enough disk space, then your cluster can enter a red or yellow health status. Your node must have enough disk space to accommodate shards before OpenSearch Service distributes the shards.
To check how much storage space is available for each node in your cluster, run the following command:
curl -XGET 'domain-endpoint/_cat/allocation?v'
Note: Replace domain-endpoint with your domain endpoint.
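To flag nodes that are close to full, you can parse the saved output of this command. The sketch below relies on the header row that the v parameter adds and reads the disk.percent and node columns. The `nodes_above_disk_percent` name, the sample rows, and the threshold of 85 are illustrative assumptions, so choose a threshold that matches your domain.

```python
def nodes_above_disk_percent(cat_allocation_text, threshold=85):
    """Return (node, disk_percent) pairs at or above the threshold."""
    lines = [l for l in cat_allocation_text.splitlines() if l.strip()]
    if not lines:
        return []
    header = lines[0].split()
    result = []
    for line in lines[1:]:
        fields = line.split()
        # The UNASSIGNED row has no disk columns, so skip short rows.
        if len(fields) != len(header):
            continue
        row = dict(zip(header, fields))
        percent = int(row["disk.percent"])
        if percent >= threshold:
            result.append((row["node"], percent))
    return result


# Made-up output for illustration.
SAMPLE = """\
shards disk.indices disk.used disk.avail disk.total disk.percent host     ip       node
    12       40.1gb    45.0gb      5.0gb     50.0gb           90 10.0.0.1 10.0.0.1 node-a
    10       20.0gb    25.0gb     25.0gb     50.0gb           50 10.0.0.2 10.0.0.2 node-b
     2                                                                              UNASSIGNED
"""

if __name__ == "__main__":
    print(nodes_above_disk_percent(SAMPLE))
```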
If shards are unevenly distributed, then some nodes might run out of space while others still have capacity. This can cause issues during shard reallocation, where OpenSearch Service can't assign new shards during the rebalance process.
To check your shard distribution settings, run the following command:
curl -XGET 'domain-endpoint/_cluster/settings?include_defaults=true&flat_settings=true'
Note: Replace domain-endpoint with your domain endpoint.
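The settings response is large, so it helps to reduce it to the values that affect shard routing. The following sketch assumes that the response contains the defaults, persistent, and transient sections with flat keys, and that transient values override persistent values, which override defaults. The `effective_settings` name and the sample values are hypothetical.

```python
def effective_settings(settings_response, prefix="cluster.routing."):
    """Merge the settings scopes and keep keys that start with prefix."""
    merged = {}
    # Later scopes overwrite earlier ones: defaults < persistent < transient.
    for scope in ("defaults", "persistent", "transient"):
        for key, value in settings_response.get(scope, {}).items():
            if key.startswith(prefix):
                merged[key] = value
    return merged


# Made-up, abbreviated response for illustration.
SAMPLE_RESPONSE = {
    "persistent": {"cluster.routing.allocation.disk.watermark.low": "80%"},
    "transient": {},
    "defaults": {
        "cluster.routing.allocation.disk.watermark.low": "85%",
        "cluster.routing.rebalance.enable": "all",
        "cluster.name": "example",
    },
}

if __name__ == "__main__":
    for key, value in sorted(effective_settings(SAMPLE_RESPONSE).items()):
        print(f"{key} = {value}")
```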
It's a best practice to regularly monitor disk space and proactively address disk skew issues to maintain cluster health.
For more information, see How do I troubleshoot low storage space in my OpenSearch Service domain? and How do I rebalance the uneven shard distribution in my OpenSearch Service cluster?
High JVM memory pressure
Shard allocation is a resource-intensive process that consumes CPU, heap space, disk, and network resources. Consistently high Java Virtual Machine (JVM) memory pressure can interfere with successful shard allocation. To resolve this issue, troubleshoot the high JVM memory pressure. After you reduce JVM memory pressure, take the following actions to restore the cluster to a green status:
Node failures
Node failures cause their allocated shards to become unassigned. Without replica shards, even a single node failure can cause a red health status. However, when you configure indexes with replica shards, a node failure typically results in a temporary yellow status. This yellow status occurs as OpenSearch Service automatically recovers. The yellow status ends when the failed node returns to health or when OpenSearch Service reassigns shards to other nodes.
To protect against hardware failures, take the following actions:
For more information about how to identify a node failure, see Failed cluster nodes.
Recurring yellow cluster health
Your clusters might frequently be in the yellow health status for the following reasons:
- Transient node failures or restarts that occur when nodes fail temporarily and replica shards go unassigned.
Note: The cluster might recover on its own when that node comes back or when OpenSearch Service rebalances shards.
- You exceed the shard allocation failure or retry quota because of resource constraints or configuration issues.
- Scheduled maintenance, backup jobs, or heavy load spikes occur on clusters with high resource usage, so nodes fluctuate or reject shard allocations.
- A recurring upgrade or an automatically created index creates new replicas that exceed the cluster's capacity.
To prevent and troubleshoot the recurring yellow health status, take the following actions:
- For single-node clusters, make sure that all indexes have 0 replicas.
Note: For single-node clusters, OpenSearch Service automatically manages and configures system indexes such as opendistro_security. You can't modify settings for system indexes.
- For multi-node clusters, keep at least one replica shard for each index. For higher redundancy, increase your node and replica count.
- Configure a Multi-AZ domain for high availability and fault tolerance.
Note: If shard allocation fails, then verify that the number of nodes in your cluster, Availability Zones, and standby configuration are correct for your cluster requirements.
- If the shard failed to get an in-memory lock, then increase the index.allocation.max_retries value.
- To avoid resource exhaustion, scale up your domain during high load.
- To proactively monitor changes in resource needs, create a CloudWatch alarm for the ClusterStatus.yellow, ClusterStatus.red, JVMMemoryPressure, AutomatedSnapshotFailure, and FreeStorageSpace metrics.
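Two of the actions above are index settings updates. The sketch below builds the corresponding requests as (method, path, body) tuples that you can send with curl or the console. The helper names and the example index name are assumptions, and the sketch doesn't apply to system indexes, whose settings you can't modify.

```python
import json


def replicas_request(index, replicas):
    """Request that sets the replica count for an index."""
    body = json.dumps({"index.number_of_replicas": replicas})
    return ("PUT", f"/{index}/_settings", body)


def max_retries_request(index, retries):
    """Request that raises the allocation retry limit for an index."""
    body = json.dumps({"index.allocation.max_retries": retries})
    return ("PUT", f"/{index}/_settings", body)


if __name__ == "__main__":
    # Example: remove replicas from an index on a single-node cluster.
    method, path, body = replicas_request("logs-2024", 0)
    print(method, path, body)
    method, path, body = max_retries_request("logs-2024", 10)
    print(method, path, body)
```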
For more information, see Operational best practices for OpenSearch Service.