
Why is my OpenSearch Service domain stuck in the "Modifying" state?


I want to troubleshoot my Amazon OpenSearch Service cluster that's stuck in the "Modifying" state.

Resolution

To troubleshoot a domain that's stuck in the Modifying state, take the following actions based on the issue that you encounter.

A validation check fails with errors

When you initiate a configuration change, OpenSearch Service performs validation checks to make sure that your domain is eligible for an upgrade. If the validation fails, then your domain remains in the Modifying state. To resolve this issue, complete the troubleshooting steps for the error that you receive. Then, retry your configuration change.
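To surface validation errors before you commit a change, you can also run the change as a dry run. The following Python (boto3) sketch submits a sample change in Verbose dry-run mode and then checks the results; the domain name and instance count are placeholders.

import time
import boto3

client = boto3.client("opensearch")

# Submit the change as a verbose dry run so that the validation checks
# run without applying the change. The values are placeholders.
response = client.update_domain_config(
    DomainName="my-domain",
    ClusterConfig={"InstanceCount": 3},
    DryRun=True,
    DryRunMode="Verbose",
)
dry_run_id = response["DryRunProgressStatus"]["DryRunId"]

# Give the validation checks time to run, then print any failures.
time.sleep(30)
status = client.describe_dry_run_progress(
    DomainName="my-domain", DryRunId=dry_run_id
)["DryRunProgressStatus"]
print("Dry run status:", status["DryRunStatus"])
for failure in status.get("ValidationFailures", []):
    print(failure["Code"], "->", failure["Message"])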

You launched multiple configuration changes

You can't apply a new configuration change when there's an existing configuration change in progress. To make multiple configuration updates, include all changes in a single request. If you submit simultaneous changes, then you receive the "A change is already in progress" error message.
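For example, the following Python (boto3) sketch scales the data nodes and resizes the EBS volumes in a single update_domain_config call instead of two separate requests. The domain name and values are placeholders.

import boto3

client = boto3.client("opensearch")

# Bundle both changes into a single request so that the second change
# isn't rejected with "A change is already in progress".
client.update_domain_config(
    DomainName="my-domain",               # placeholder domain name
    ClusterConfig={"InstanceCount": 6},   # change 1: scale the data nodes
    EBSOptions={                          # change 2: resize the EBS volumes
        "EBSEnabled": True,
        "VolumeType": "gp3",
        "VolumeSize": 100,
    },
)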

Validation checks remain valid for the duration of the configuration change. If your configuration passes the Validation stage, then don't modify the resources that your domain requires until the initial change completes. For example, don't deactivate the AWS Key Management Service (AWS KMS) key that you use for encryption.

There aren't available IP addresses in the subnets in the VPC

If there aren't enough available IP addresses, then free up or add new IP addresses in the virtual private cloud (VPC) subnet CIDR blocks.
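To check how many free IP addresses each subnet has, you can query the subnets' AvailableIpAddressCount value. A minimal Python (boto3) sketch, with placeholder subnet IDs:

import boto3

ec2 = boto3.client("ec2")

# Placeholder subnet IDs: use the subnets that your domain is deployed in.
response = ec2.describe_subnets(SubnetIds=["subnet-0abc1234", "subnet-0def5678"])

for subnet in response["Subnets"]:
    print(subnet["SubnetId"], subnet["CidrBlock"],
          "free IPs:", subnet["AvailableIpAddressCount"])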

Shard migration to the new set of data nodes doesn't complete

Check your shard migration progress

After OpenSearch Service creates the new resources, it begins to migrate shards to the new data nodes. This process can take several minutes to several hours based on the cluster load and size.

To monitor the shard migration status, run the following command:

GET /DOMAIN_ENDPOINT/_cat/recovery?active_only=true&v

Note: Replace DOMAIN_ENDPOINT with your domain endpoint. If you use OpenSearch Dashboards to run the preceding command, then remove /DOMAIN_ENDPOINT/.
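To poll the migration from a script instead, you can call the same API with Python's requests library. The following sketch assumes fine-grained access control with basic authentication; the endpoint and credentials are placeholders.

import time
import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder

# Poll active shard recoveries until the migration finishes.
while True:
    response = requests.get(f"{ENDPOINT}/_cat/recovery",
                            params={"active_only": "true"}, auth=AUTH)
    response.raise_for_status()
    if not response.text.strip():
        print("No active recoveries: shard migration is complete.")
        break
    print(response.text)
    time.sleep(60)  # check again in a minute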

If your OpenSearch Service cluster is in red status, then shard migration fails. To troubleshoot this issue, see Why is my OpenSearch Service cluster in a red or yellow status?

To view the size of your shards, run the following command:

GET /_cat/shards?v

Then, run the following command to view each node's number of assigned shards:

GET /_cat/allocation?v

If the new nodes don't have all of the required shards, then run the following command to identify the cause:

GET /_cluster/allocation/explain?pretty

For more information, see CAT shards API, CAT allocation API, and Cluster allocation explain API on the OpenSearch website.
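The following sketch scripts the last check and prints the deciders that block allocation of the first unassigned shard. The endpoint and credentials are placeholders, as before.

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder

response = requests.get(f"{ENDPOINT}/_cluster/allocation/explain", auth=AUTH)
if response.status_code == 400:
    # The API returns a 400 error when there's no unassigned shard to explain.
    print("No unassigned shards to explain.")
else:
    response.raise_for_status()
    explanation = response.json()
    print("Unassigned shard:", explanation["index"], explanation["shard"])
    for node in explanation.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            print(decider["decider"], "->", decider["explanation"])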

Use OpenSearch Service best practices

To speed up shard migration, adhere to the following best practices:

  • Use a shard strategy that aligns with your needs.
  • Plan for growth and workload type when you choose the number of shards for your index.
  • Make sure that the cluster's CPU and Java Virtual Machine (JVM) memory pressure aren't too high.
  • Make sure that there's enough free storage space in the new set of nodes. To free storage space, delete indexes that you no longer need, as shown in the sketch after this list. For instructions, see Delete index API on the OpenSearch website.
    Note: Storage space issues can occur if you add new data to the cluster during the blue/green deployment process. Or, they occur if previous nodes have large shards that OpenSearch Service can't allocate to the new nodes.
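The following sketch shows one way to script that cleanup. The endpoint, credentials, and index names are placeholders; confirm that each index is safe to delete first.

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder

# Placeholder index names: delete only indexes that you no longer need.
for index in ["logs-2023.01", "logs-2023.02"]:
    response = requests.delete(f"{ENDPOINT}/{index}", auth=AUTH)
    response.raise_for_status()
    print(f"Deleted {index}")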

Update the allocation retry value

If a shard exceeds the maximum number of allocation retries and remains unassigned, then retry the allocation. By default, the cluster attempts to allocate a shard a maximum of 5 consecutive times.

To increase the retry number for the shard, run the following command:

PUT INDEX_NAME/_settings
{
    "index.allocation.max_retries": 10
}

Note: Replace INDEX_NAME with your index name and 10 with the number of retries.
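To apply the setting and trigger the retry from a script, you can pair the settings update with the _cluster/reroute API's retry_failed option, which retries shards that exhausted their allocation retries. A sketch with placeholder values:

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder
INDEX = "my-index"                                         # placeholder

# Raise the per-index retry limit from the default of 5 to 10.
response = requests.put(f"{ENDPOINT}/{INDEX}/_settings",
                        json={"index.allocation.max_retries": 10}, auth=AUTH)
response.raise_for_status()

# Ask the cluster to retry shards that previously hit the retry limit.
response = requests.post(f"{ENDPOINT}/_cluster/reroute",
                         params={"retry_failed": "true"}, auth=AUTH)
response.raise_for_status()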

Check for issues in your index settings

Internal hardware failures can cause shards on existing data nodes to get stuck during migration. Based on your hardware issue, OpenSearch Service runs scripts to automatically return the nodes to a healthy state. If you pin shards to an existing set of nodes, then shard migration can get stuck.

To make sure that you don't have shards pinned to any nodes, run the following commands to check the index settings:

GET /DOMAIN_ENDPOINT/_cluster/allocation/explain?pretty
GET /DOMAIN_ENDPOINT/INDEX_NAME/_settings?pretty

Note: Replace DOMAIN_ENDPOINT with your domain endpoint and INDEX_NAME with your index. If you use OpenSearch Dashboards to run the preceding command, then remove /DOMAIN_ENDPOINT/.

In the output, check for the following settings to identify shards that are pinned to nodes:

"index.routing.allocation.require._name": "NODE_NAME"
"index.blocks.write": true

Note: Replace NODE_NAME with your node name.
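To check every index at once, you can pull all of the index settings and flag the two settings above. A sketch with placeholder values:

import requests

ENDPOINT = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder
AUTH = ("master-user", "master-password")                  # placeholder

response = requests.get(f"{ENDPOINT}/_all/_settings", auth=AUTH)
response.raise_for_status()

for index, config in response.json().items():
    settings = config["settings"]["index"]
    allocation = settings.get("routing", {}).get("allocation", {})
    pinned = allocation.get("require", {}).get("_name")
    if pinned:
        print(f"{index}: pinned to node {pinned}")
    if settings.get("blocks", {}).get("write") == "true":
        print(f"{index}: write block is set")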

If you see "index.routing.allocation.require._name": "NODE_NAME" in your index settings, then run the following command to reset the setting:

PUT INDEX_NAME/_settings
{
    "index.routing.allocation.require._name": null
}

Note: Replace INDEX_NAME with your index name.

For more information about shard settings in your index, see Index-level shard allocation on the Elastic website.

If you see "index.blocks.write": true in your index settings, then your index has a write block. This write block issue can occur because of a a ClusterBlockException error. To troubleshoot this issue, see How do I resolve the 403 "index_create_block_exception" or "cluster_block_exception" error in OpenSearch Service?

To monitor the progress of your configuration change, call the DescribeDomainChangeProgress API operation.
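For example, the following Python (boto3) sketch prints the status of each stage of the active change; the domain name is a placeholder.

import boto3

client = boto3.client("opensearch")

progress = client.describe_domain_change_progress(
    DomainName="my-domain"  # placeholder domain name
)["ChangeProgressStatus"]

print("Overall status:", progress["Status"])
for stage in progress.get("ChangeProgressStages", []):
    print(stage["Name"], "->", stage["Status"])

For clusters that are stuck in the Modifying state or domains that are stuck in the Deleting older resources state for more than 24 hours, create an AWS Support case.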
