How do I recognize and resolve cluster health issues caused by too many shards?

I see a degradation in my Amazon OpenSearch Service cluster health and performance.

Short Description

Each shard consumes CPU and JVM memory just to be maintained. When you have too many shards, cluster performance can degrade severely. In some cases, the entire cluster can become unresponsive.

To make sure that your clusters are healthy and perform as expected, follow OpenSearch Service best practices.

Symptoms of too many shards

If your cluster has one or more of the following symptoms, then complete the resolution steps. To verify whether your domain is affected by too many shards, review trends in your cluster's health metrics over time. You can also query the domain directly, as shown in the sketch after this list.

  • More than 1,000 shards per node.
  • Indices with many small shards (less than 10 GiB each).
  • Frequent node drops.
  • High JVM memory pressure and CPU utilization.
  • Complications during blue/green deployments.
  • A cluster state that's too large for the elected master node to manage efficiently.
  • Burstable (T-family) or other small instance types, such as c5.large, are in use.
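
To see how shards are distributed across nodes and to spot undersized shards, you can query the _cat APIs. The following is a minimal Python sketch; the domain endpoint and basic-authentication credentials are placeholders, and you must adapt the authentication to how your domain is secured (for example, IAM-signed requests instead of a fine-grained access control master user).

import requests

# Placeholder endpoint and credentials; replace with your domain's values.
ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"
AUTH = ("master-user", "master-password")

# Shards and disk use per node: more than 1,000 shards on a node is a warning sign.
allocation = requests.get(f"{ENDPOINT}/_cat/allocation?v", auth=AUTH)
print(allocation.text)

# Every shard with its store size, sorted by size: look for many shards well under 10 GiB.
shards = requests.get(f"{ENDPOINT}/_cat/shards?v&bytes=gb&s=store", auth=AUTH)
print(shards.text)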

Resolution

Review your domain trends

Review your domain's Amazon CloudWatch metrics over a long time range, such as 3, 6, or 12 months. If shard creation occurs at regular intervals, then widen the graph's time window. Otherwise, you won't see the full health trend history. Note: You must change the metric period to 1 hour for the metrics to load properly over longer time ranges.

The following occurs in a domain with too many shards:

  • The Shards.active count increases. After the count exceeds 1,000 shards per node, cluster health is at risk whenever traffic increases. The risk grows when search requests fan out across the larger number of shards.
  • Node drops appear as gaps in the cluster's Nodes metric, so the min/max/avg statistic isn't a solid line. When JVM memory limits are reached, the OpenSearch Service process fails on the node, and the managed service automatically restarts it.
  • JVMMemoryPressure increases. The min/avg/max values converge as G1 garbage collection (G1GC) becomes more frequent and less effective. In a healthy state, this metric oscillates between 0-75% in a sawtooth pattern. Impact begins when the maximum JVMMemoryPressure breaches 75% under load and the average JVMMemoryPressure also breaches 75%. For more information about JVM and garbage collection in OpenSearch Service, see Understanding the JVMMemoryPressure metric changes in Amazon OpenSearch Service. See also How do I troubleshoot high JVM memory pressure on my OpenSearch Service cluster?
  • CPUUtilization increases. The cluster spends more CPU time on garbage collection for the shards that it maintains.
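
To review these trends programmatically, you can pull the same metrics from CloudWatch with a 1-hour period. The following is a minimal boto3 sketch; the Region, domain name, and account ID are placeholders, and the 30-day window is only an example. Repeat the calls month by month for a longer history, because GetMetricStatistics returns at most 1,440 datapoints per call.

import boto3
from datetime import datetime, timedelta, timezone

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# Placeholder values; OpenSearch Service publishes domain metrics in the AWS/ES namespace.
DOMAIN_NAME = "my-domain"
ACCOUNT_ID = "111122223333"

end_time = datetime.now(timezone.utc)
start_time = end_time - timedelta(days=30)

for metric in ("Shards.active", "Nodes", "JVMMemoryPressure", "CPUUtilization"):
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ES",
        MetricName=metric,
        Dimensions=[
            {"Name": "DomainName", "Value": DOMAIN_NAME},
            {"Name": "ClientId", "Value": ACCOUNT_ID},
        ],
        StartTime=start_time,
        EndTime=end_time,
        Period=3600,  # 1-hour period so that long time ranges load
        Statistics=["Minimum", "Average", "Maximum"],
    )
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    print(metric, f"{len(datapoints)} datapoints")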

Troubleshoot too many shards

To troubleshoot a too-many-shards issue, use one of the following resolution methods:

Reduce the shard count

It's an OpenSearch best practice to limit the number of shards per node based on the available JVM heap memory. Make sure that you have no more than 20-25 shards per GiB of JVM heap. In OpenSearch Service, the JVM heap size is capped at 32 GiB, so each data node can hold a maximum of about 640-800 shards. OpenSearch also enforces a default limit of 1,000 shards per node that you can modify through the cluster settings.
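
For reference, you can read the current per-node shard limit, and raise it if you must, through the cluster settings API. The following is a hedged sketch (the endpoint and credentials are placeholders, and the value 1,200 is only an example); raising cluster.max_shards_per_node treats the symptom, so prefer reducing the shard count instead.

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
AUTH = ("master-user", "master-password")                   # placeholder credentials

# Read the default per-node shard limit (1,000 unless it was overridden).
settings = requests.get(
    f"{ENDPOINT}/_cluster/settings?include_defaults=true&flat_settings=true", auth=AUTH
)
print(settings.json().get("defaults", {}).get("cluster.max_shards_per_node"))

# Raising the limit only defers the problem; reduce the shard count where possible.
requests.put(
    f"{ENDPOINT}/_cluster/settings",
    json={"persistent": {"cluster.max_shards_per_node": 1200}},
    auth=AUTH,
)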

If you no longer need the data, then delete the older indices. Take manual snapshots of the older index data before you delete it.
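
The following is a minimal sketch of that sequence, assuming that a manual snapshot repository named snapshot-repo is already registered for the domain; the repository, snapshot, and index names are placeholders.

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
AUTH = ("master-user", "master-password")                   # placeholder credentials

# Take a manual snapshot of the older indices before you delete them.
requests.put(
    f"{ENDPOINT}/_snapshot/snapshot-repo/logs-2023-archive",
    json={"indices": "logs-2023-*", "include_global_state": False},
    auth=AUTH,
)

# After the snapshot completes (GET _snapshot/snapshot-repo/logs-2023-archive), delete
# each older index. Repeat for every index that you archived.
requests.delete(f"{ENDPOINT}/logs-2023-01", auth=AUTH)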

Note the following when you use rotating indices with Index State Management (ISM):

  • Confirm that your shard strategy keeps indices in the best-practice shard size range of 10-50 GiB. For example, suppose that you use daily ISM indices with the default of five primary shards, each with one replica. This configuration produces 10 shards per day and up to 300 shards per month.
  • Use an index template to reduce the shard count of smaller daily indices, as shown in the sketch after this list. For more information, see Index templates on the OpenSearch website. Note that a template affects only newly created indices in the cluster and doesn't immediately improve cluster health.
  • Use the OpenSearch reindex API to combine older daily and weekly indices into a monthly index, also shown in the sketch after this list. For more information, see Combine one or more indexes on the OpenSearch website. Then, change the rotation period of your ISM indices to avoid the creation of indices that are too small.
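
A hedged sketch of both steps follows. The index pattern, template name, and shard counts are illustrative only; choose values so that each shard lands in the 10-50 GiB range, and note that the reindex call assumes the destination index doesn't already contain the documents.

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint
AUTH = ("master-user", "master-password")                   # placeholder credentials

# 1. Index template: future indices that match the pattern get 1 primary shard and
#    1 replica instead of the 5:1 default.
requests.put(
    f"{ENDPOINT}/_index_template/daily-logs",
    json={
        "index_patterns": ["logs-*"],
        "template": {"settings": {"number_of_shards": 1, "number_of_replicas": 1}},
    },
    auth=AUTH,
)

# 2. Reindex: fold the existing daily indices for one month into a single monthly index.
requests.post(
    f"{ENDPOINT}/_reindex?wait_for_completion=false",
    json={
        "source": {"index": "logs-2024-01-*"},
        "dest": {"index": "logs-2024-01"},
    },
    auth=AUTH,
)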

Scale your instance

Note the following:

  • Scaling up or scaling out isn't a viable long-term solution on its own. Be sure to follow OpenSearch Service best practices.
  • When you scale up to a larger instance type, the available resources per node and the total number of shards that each node can handle increase.
  • When you scale out by adding data nodes, the number of shards per node decreases. This decrease reduces the load on the other nodes. For an example configuration change, see the sketch after this list.
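
If you decide to scale, you can change the data node instance type or count through the configuration API. The following is a minimal boto3 sketch; the Region, domain name, instance type, and node count are placeholders, and the change starts a blue/green deployment.

import boto3

opensearch = boto3.client("opensearch", region_name="us-east-1")

# Placeholder values: a larger instance type scales up (more resources per node),
# and a higher instance count scales out (fewer shards per node).
opensearch.update_domain_config(
    DomainName="my-domain",
    ClusterConfig={
        "InstanceType": "r6g.xlarge.search",
        "InstanceCount": 6,
    },
)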

Related information

Optimize OpenSearch index shard sizes on the OpenSearch website
