Why is my OpenSearch Service domain upgrade taking so long?

I'm trying to upgrade my Amazon OpenSearch Service domain, but the upgrade is taking a long time.

Short description

When you upgrade your OpenSearch Service domain version, the configuration change starts a blue/green deployment. A blue/green deployment runs two production environments: one is live, and the other is idle. After the software update is applied to the idle environment, the two environments switch roles. For OpenSearch Service, a new environment is created during domain updates, and users are routed to the new environment after the updates are complete. This approach minimizes downtime and preserves the original environment in case the deployment is unsuccessful.

The OpenSearch Service upgrade process consists of pre-upgrade checks for issues and a cluster snapshot to restore the cluster if the upgrade fails.

The following issues might occur with an OpenSearch Service upgrade:

  • Pre-upgrade check failures
  • Upgrade process taking too long to complete
  • Upgrade succeeded with issues

For more information, see Upgrading Amazon OpenSearch Service domains.


Pre-upgrade checks

The upgrade process is irreversible. You can't pause or cancel it. During an upgrade, you can't make configuration changes to the domain. Before starting an upgrade, it's a best practice to double-check eligibility. Your domain might be ineligible for an upgrade, or fail to upgrade.

To check for the most common upgrade issues, see Troubleshooting an upgrade.

Check the snapshot status

Before an upgrade, OpenSearch Service takes an automated snapshot of your cluster after the domain passes the eligibility checks. While the snapshot is in progress, the progress status might show Null or 0%. After OpenSearch Service takes the snapshot, the percentage value is updated. The time it takes to complete a snapshot varies with the amount of data stored. OpenSearch Service takes snapshots incrementally, so if your data has changed significantly since the previous automated snapshot, then the snapshot can take longer to complete.

The following _snapshot request retrieves all currently running snapshots, with detailed status information:

GET /_snapshot/_status

For more information about the snapshot APIs, see Monitor a snapshot on the Elasticsearch website.
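
As a rough illustration of how to read the output of that request, the following Python sketch computes the percentage of shards already snapshotted. It assumes the response has been parsed into a dictionary with a snapshots list whose entries carry a shards_stats object with done and total counts. The function name and the sample data are hypothetical.

```python
def snapshot_progress(status_response):
    """Map each running snapshot name to the percentage of shards done."""
    progress = {}
    for snap in status_response.get("snapshots", []):
        stats = snap.get("shards_stats", {})
        total = stats.get("total", 0)
        done = stats.get("done", 0)
        # Guard against division by zero if no shards are reported yet.
        progress[snap["snapshot"]] = round(100.0 * done / total, 1) if total else 0.0
    return progress


# Hypothetical sample shaped like a GET /_snapshot/_status response.
sample_status = {
    "snapshots": [
        {
            "snapshot": "example-snapshot",
            "state": "STARTED",
            "shards_stats": {"done": 12, "failed": 0, "total": 48},
        }
    ]
}
print(snapshot_progress(sample_status))
```

An empty snapshots list means that no snapshot is currently running.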

Retrieve all cluster snapshots and node IDs

To retrieve all currently running snapshots in a specific repository, use the _current parameter:

GET /_snapshot/<snapshot-repository>/_current

To obtain the IDs of all nodes in the cluster, run the cat nodes API and request the id column, which isn't part of the default output:

GET _cat/nodes?v&h=id,name,node.role

You can use the node IDs to identify the nodes that are old or new. An increasing number of shards on the new nodes indicates a smooth migration. Eventually, all the shards move to the new nodes, and the old nodes become empty.
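
One way to tell old nodes from new ones is to record the node IDs before the upgrade starts and compare them with the current list. The following Python sketch shows that comparison. It assumes you have already collected both lists of IDs yourself, and the function name and sample IDs are hypothetical.

```python
def split_nodes(ids_before_upgrade, ids_now):
    """Classify node IDs by comparing a pre-upgrade list with the current list."""
    before, now = set(ids_before_upgrade), set(ids_now)
    return {
        "old_remaining": sorted(before & now),  # blue nodes still in the cluster
        "new": sorted(now - before),            # green nodes added by the deployment
        "removed": sorted(before - now),        # blue nodes already terminated
    }


# Hypothetical node IDs captured before and during the upgrade.
result = split_nodes(["a1", "b2", "c3"], ["b2", "c3", "d4", "e5"])
print(result)
```

When old_remaining is empty, every original node has left the cluster.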

Monitor the blue/green deployment process

When your cluster enters the blue/green deployment process, the new nodes in the green environment appear. The shards are then migrated from the old nodes in the blue environment. After the data migration or shard reallocation is complete, your old nodes are terminated.

You can monitor the blue/green deployment process in its three stages: new nodes, data migration, and removal of old nodes.

Stage 1: Creation of new nodes

You can monitor the Nodes cluster metric in Amazon CloudWatch to get the node count. Or, you can use the cat nodes API to list all the nodes in your cluster:

GET /_cat/nodes?v&pretty

During this stage of the blue/green deployment process, the new nodes appear in the API output, and the node count increases.

Stage 2: Data migration

As soon as the first stage is complete, shard migration begins. During the data migration, the shard count for older nodes decreases, and the shard count for newer nodes increases. To see how many shards are allocated to each node, use the cat allocation API (from the OpenSearch website):

GET /_cat/allocation
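
To turn the allocation output into a single progress figure, you could request it with format=json and sum the shard counts per node. The following Python sketch assumes that each row has a node field and a shards field returned as a string, and that unassigned shards appear under the node name UNASSIGNED. The function names, node names, and counts are hypothetical.

```python
def shards_per_node(allocation_rows):
    """Convert cat allocation rows into a {node_name: shard_count} mapping."""
    return {row["node"]: int(row["shards"]) for row in allocation_rows}


def percent_on_new_nodes(counts, old_nodes):
    """Percentage of assigned shards that sit on nodes not listed in old_nodes."""
    assigned = {n: c for n, c in counts.items() if n != "UNASSIGNED"}
    total = sum(assigned.values())
    on_new = sum(c for n, c in assigned.items() if n not in old_nodes)
    return round(100.0 * on_new / total, 1) if total else 0.0


# Hypothetical rows shaped like GET /_cat/allocation?format=json output.
rows = [
    {"node": "old-1", "shards": "6"},
    {"node": "old-2", "shards": "4"},
    {"node": "new-1", "shards": "10"},
    {"node": "UNASSIGNED", "shards": "2"},
]
counts = shards_per_node(rows)
print(percent_on_new_nodes(counts, old_nodes={"old-1", "old-2"}))
```

A value that rises toward 100 between checks suggests that the migration is progressing.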

To check whether each shard is in the STARTED, RELOCATING, or UNASSIGNED state, and why a shard is unassigned, run the cat shards API:

GET _cat/shards?h=index,shard,prirep,state,unassigned.reason
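
On a cluster with many shards, a per-state tally is easier to scan than the full listing. The following Python sketch assumes that the cat shards output was requested with format=json and parsed into a list of rows that each carry a state field. The function name and the sample rows are hypothetical.

```python
from collections import Counter


def shard_state_counts(shard_rows):
    """Count how many shards are in each state."""
    return Counter(row["state"] for row in shard_rows)


# Hypothetical rows shaped like GET _cat/shards?format=json output.
shard_rows = [
    {"index": "logs-1", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "logs-1", "shard": "0", "prirep": "r", "state": "RELOCATING"},
    {"index": "logs-1", "shard": "1", "prirep": "p", "state": "STARTED"},
    {"index": "logs-1", "shard": "1", "prirep": "r", "state": "UNASSIGNED"},
]
print(shard_state_counts(shard_rows))
```

If the RELOCATING count stays the same across several checks, the migration might be stalled.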

To check the recovery status of the shards in the cluster, run the cat recovery API (from the Elasticsearch website):

GET _cat/recovery?active_only=true

During this stage, the data migration might take additional time to complete because of an overloaded cluster, unbalanced shards, or backend issues.

Overloaded cluster

Make sure that you upgrade the version when the cluster traffic isn't high. Before you begin the upgrade, check the CPUUtilization and JVMMemoryPressure cluster metrics to make sure that these metrics have optimal values.

For more information, see How do I troubleshoot high CPU utilization on my Amazon OpenSearch Service cluster?

Unbalanced shards

By default, OpenSearch Service has a sharding strategy of 5:1, where each index is divided into five primary shards with one replica each. Choose a shard count so that each shard is between 10 and 30 GiB for search workloads, or between 30 and 50 GiB for log workloads.

OpenSearch and Elasticsearch 7.x and later have a limit of 1,000 shards per node. It's a best practice to have no more than 25 shards per GiB of Java heap.
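
The sizing guidance above reduces to simple arithmetic. The following Python sketch works through it with hypothetical index sizes and heap sizes. The function names are illustrative, and the 25 shards per GiB and 1,000 shards per node figures are taken from this article.

```python
import math


def primary_shard_count(index_size_gib, target_shard_size_gib):
    """Smallest number of primary shards that keeps each shard at or under the target size."""
    return max(1, math.ceil(index_size_gib / target_shard_size_gib))


def max_recommended_shards(heap_gib, shards_per_gib_heap=25, per_node_limit=1000):
    """Recommended shard ceiling for one node, capped at the per-node limit."""
    return min(heap_gib * shards_per_gib_heap, per_node_limit)


# Hypothetical 300 GiB log index with a 50 GiB target shard size.
print(primary_shard_count(300, 50))
# Hypothetical data node with 16 GiB of Java heap.
print(max_recommended_shards(16))
```

With these inputs, the log index needs six primary shards, and a node with 16 GiB of heap should hold no more than 400 shards.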

For more information, see How do I rebalance the uneven shard distribution in my Amazon OpenSearch Service cluster?

Backend issues

During this stage, shard migration can get stuck because of backend issues. If there's no progress with the migration and the issue doesn't self-resolve, then contact AWS Support.

Stage 3: Removal of old nodes

After all the shards are migrated to the new nodes, older nodes are removed from your cluster. The node count then returns to the original node count that you configured. At this stage, the blue/green deployment and update processes are complete.

Upgrade succeeded with issues

The "upgrade succeeded with issues" message occurs when the cluster is blocking incoming write requests. Check the OpenSearch Service ClusterIndexWritesBlocked metric. A value of 1 means that the cluster is blocking write requests. To resolve this issue, add more disk space, or scale your cluster.

For more information, see Operational best practices for Amazon OpenSearch Service.
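
If you retrieve ClusterIndexWritesBlocked datapoints programmatically, the check comes down to inspecting the most recent value. The following Python sketch assumes a list of datapoints that each have a Timestamp and a Maximum field, which is the shape you would get when requesting the Maximum statistic from CloudWatch. The function name and the sample values are hypothetical.

```python
from datetime import datetime, timezone


def writes_blocked(datapoints):
    """Return True if the most recent datapoint reports a value of 1 or more."""
    if not datapoints:
        return False
    latest = max(datapoints, key=lambda d: d["Timestamp"])
    return latest["Maximum"] >= 1


# Hypothetical datapoints for the ClusterIndexWritesBlocked metric.
points = [
    {"Timestamp": datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc), "Maximum": 0.0},
    {"Timestamp": datetime(2024, 1, 1, 12, 5, tzinfo=timezone.utc), "Maximum": 1.0},
]
print(writes_blocked(points))
```

The function returns False for an empty list. An empty list only means that no datapoints were returned for the period, not that writes are confirmed to be unblocked.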

AWS OFFICIAL - Updated a year ago