One of the nodes in my Amazon OpenSearch Service cluster is down, or my OpenSearch Service nodes keep crashing.
Resolution
Common causes of failed cluster nodes include high Java Virtual Machine (JVM) memory pressure and hardware failure.
Check for failed nodes
Complete the following steps:
- Open the OpenSearch Service console.
- In the navigation pane, under Managed clusters, choose Domains.
- Select your OpenSearch Service domain.
- Choose the Cluster health tab, and then choose Nodes. If the number of nodes is fewer than the number that you configured for your cluster, then a node is down.
Note: The Nodes metric can be inaccurate during changes to your cluster configuration or routine maintenance for the service. This behavior is expected.
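The same comparison can be scripted against the Nodes metric in Amazon CloudWatch. The following is a minimal sketch, not an official procedure. It assumes that boto3 is installed and that credentials are configured. The function names, the one-hour lookback window, and the 5-minute period are illustrative choices. The domain name and account ID are values that you supply.

```python
from datetime import datetime, timedelta, timezone


def missing_nodes(expected_nodes, datapoints):
    """Return how many nodes were missing at the lowest observed datapoint.

    Returns None when there are no datapoints, because nothing can be
    concluded without data.
    """
    if not datapoints:
        return None
    lowest = min(int(dp["Minimum"]) for dp in datapoints)
    return max(expected_nodes - lowest, 0)


def fetch_nodes_datapoints(domain_name, account_id, hours=1):
    """Fetch the Nodes metric for a domain from CloudWatch (hypothetical helper)."""
    import boto3  # imported here so missing_nodes() works without boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName="Nodes",
        Dimensions=[
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": account_id},
        ],
        StartTime=end - timedelta(hours=hours),
        EndTime=end,
        Period=300,
        Statistics=["Minimum"],
    )
    return response["Datapoints"]
```

For example, missing_nodes(3, fetch_nodes_datapoints("my-domain", "111122223333")) compares a configured count of three nodes with the lowest value reported in the last hour. A nonzero result during a configuration change or service maintenance can be expected, as the preceding note explains.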
Identify and troubleshoot high JVM memory pressure
Check and reduce JVM memory pressure on your OpenSearch Service cluster.
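To see which nodes are under pressure, one option is to read heap usage for each node from the response of the GET _nodes/stats/jvm API. The following is a rough sketch that only parses a response that you already retrieved. The function name and the default threshold of 75 percent are assumptions for illustration, not service limits. The heap_used_percent field is a point-in-time reading and isn't the same measurement as the CloudWatch JVMMemoryPressure metric.

```python
def nodes_over_heap_threshold(nodes_stats, threshold_percent=75):
    """List (node name, heap percent) pairs at or above the threshold.

    nodes_stats is the parsed JSON body of GET _nodes/stats/jvm.
    threshold_percent is an assumed example value; tune it for your domain.
    """
    flagged = []
    for node_id, node in nodes_stats.get("nodes", {}).items():
        heap = node["jvm"]["mem"]["heap_used_percent"]
        if heap >= threshold_percent:
            # Fall back to the node ID when the response has no name field.
            flagged.append((node.get("name", node_id), heap))
    # Highest heap usage first.
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

Nodes that repeatedly appear in this list are candidates for the memory pressure troubleshooting that's described in this section.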
Identify and troubleshoot hardware failure issues
Hardware failures can also affect cluster node availability. To limit the effect of hardware failures, take the following actions.
Use replication to reduce the risk of data loss
Use more than one node in your cluster. A single-node cluster is a single point of failure. In a single-node cluster, replica shards can't protect your data because a primary shard and its replica can't be assigned to the same node. If the node goes down, then you can restore data only from a snapshot, and you can't recover data that wasn't captured in the last snapshot. For more information, see Sizing Amazon OpenSearch Service domains and Creating and managing Amazon OpenSearch Service domains.
Set up at least one replica. A multi-node cluster can still experience data loss when there aren't any replica shards.
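The replica count is an index setting that you can change with a PUT request to the _settings endpoint of an index. The following sketch builds the request body and checks that the cluster has enough data nodes to host the replicas. The function name and the validation rules are illustrative assumptions. Sending the request to your domain endpoint is left out.

```python
import json


def replica_settings_body(replica_count, data_node_count):
    """Build the JSON body for PUT /<index>/_settings (hypothetical helper).

    A primary shard and its replicas must be on different nodes, so a
    replica count of N needs at least N + 1 data nodes to be fully assigned.
    """
    if replica_count < 1:
        raise ValueError("use at least one replica to protect against node loss")
    if data_node_count < replica_count + 1:
        raise ValueError(
            f"{replica_count} replica(s) need at least "
            f"{replica_count + 1} data nodes, but only {data_node_count} exist"
        )
    return json.dumps({"index": {"number_of_replicas": replica_count}})
```

For example, replica_settings_body(1, 3) returns the body for one replica on a three-node cluster, and replica_settings_body(1, 1) raises an error because a single node can't hold both copies of a shard.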
Turn on zone awareness
When you turn on zone awareness, OpenSearch Service launches data nodes in multiple Availability Zones. OpenSearch Service distributes primary shards and their corresponding replica shards to different Availability Zones. If there's a failure in one node or zone, your data is still available. For more information, see Configuring a Multi-AZ domain in Amazon OpenSearch Service.
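Zone awareness can also be turned on programmatically. The following is a minimal sketch that uses the boto3 OpenSearch Service client and its update_domain_config operation. It assumes that boto3 is installed and that credentials are configured. The function names and the default of three Availability Zones are illustrative choices. Review the Multi-AZ documentation for instance count requirements before you apply a change like this.

```python
def zone_awareness_config(az_count=3):
    """Build the ClusterConfig fragment that turns on zone awareness."""
    if az_count not in (2, 3):
        raise ValueError("AvailabilityZoneCount must be 2 or 3")
    return {
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": az_count},
    }


def enable_zone_awareness(domain_name, az_count=3):
    """Apply the configuration to a domain (hypothetical helper)."""
    import boto3  # imported here so zone_awareness_config() works without boto3

    client = boto3.client("opensearch")
    return client.update_domain_config(
        DomainName=domain_name,
        ClusterConfig=zone_awareness_config(az_count),
    )
```

Keeping the configuration builder separate from the API call makes it possible to review the exact settings before they are sent to the service.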
Related information
Operational best practices for Amazon OpenSearch Service
How do I improve the fault tolerance of my OpenSearch Service domain?
How can I scale up or scale out an OpenSearch Service domain?
Why is my Amazon OpenSearch Service domain stuck in the "Processing" state?