What is a blue/green deployment for an Amazon OpenSearch Service cluster?

I want to know what a blue/green deployment is for my Amazon OpenSearch Service cluster.

Short description

When you change your OpenSearch Service cluster configuration, the change can start a blue/green deployment. During a blue/green deployment, the cluster state changes to "Processing" while a new OpenSearch Service environment is created. When your new OpenSearch environment is created, the following occurs:

  • The total number of nodes typically doubles. More precisely, the total equals the combined node count of the old and new environments.
  • After the new nodes are provisioned and the cluster state returns to "Active", the data migration from the old nodes to the new nodes begins.
  • After the data migration is completed, the old nodes are terminated.

OpenSearch Service performs a series of validation checks to confirm that your domain is eligible for an update during configuration changes or version upgrades. If any checks fail, you receive a notification in the OpenSearch Service console that lists the issues that you must fix before you update your domain. The blue/green deployment doesn't start until the validation checks are complete, even though the OpenSearch Service console shows the status as "Processing". After you fix the validation check issues, you can retry the configuration change from the OpenSearch Service console. For more information, see Troubleshooting validation errors.

  • Blue/green deployment minimizes downtime and maintains the original environment in the event that deployment to the new environment is unsuccessful.
  • It's a best practice to schedule blue/green deployments during your domain's off-peak window. For more information, see Defining off-peak windows for Amazon OpenSearch Service.


What initiates a blue/green deployment in an Amazon OpenSearch Service cluster

For a list of configuration changes that initiate a blue/green deployment, see Changes that usually cause blue/green deployments. To test whether your planned domain configuration changes initiate a blue/green deployment, perform a dry run from the AWS console or through the API. For instructions, see Determining whether a change will cause a blue/green deployment.
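The dry run described above can be sketched with the AWS CLI. This is a minimal sketch, not a definitive procedure: it assumes that AWS CLI v2 is configured with permission to call UpdateDomainConfig, and the domain name and instance type are hypothetical placeholders. The helper function names are made up for illustration.

```shell
# Sketch: ask OpenSearch Service which deployment type a change would cause.
# Assumption: AWS CLI v2 with credentials; nothing is applied during a dry run.

# Submit a (hypothetical) instance type change as a basic dry run and
# print only the reported deployment type.
dry_run_instance_type_change() {
  local domain="$1" instance_type="$2"
  aws opensearch update-domain-config \
    --domain-name "$domain" \
    --cluster-config "InstanceType=${instance_type}" \
    --dry-run \
    --dry-run-mode Basic \
    --query 'DryRunResults.DeploymentType' \
    --output text
}

# Translate the deployment type into a short, readable message.
explain_deployment_type() {
  case "$1" in
    Blue/Green)    echo "blue/green deployment expected" ;;
    DynamicUpdate) echo "in-place update, no blue/green deployment" ;;
    None)          echo "no configuration change detected" ;;
    *)             echo "undetermined, retry later" ;;
  esac
}

# Usage (placeholder values):
#   explain_deployment_type "$(dry_run_instance_type_change my-domain r6g.large.search)"
```

Keeping the CLI call and the interpretation in separate functions makes it possible to check the message logic without touching a real domain.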

  • The doubled number of nodes during a blue/green deployment doesn't count against your Amazon OpenSearch Service quotas. For example, if you have the default quota of 80 instances per domain and your OpenSearch cluster has 70 instances, then your cluster can use 140 instances during the blue/green deployment.
  • If you make changes that cause a blue/green deployment, OpenSearch Service automatically updates your domain to the latest available software update. For more information, see Service software updates in OpenSearch Service.

Performance impact of blue/green deployments

During blue/green deployments, your Amazon OpenSearch Service cluster remains available for incoming search and indexing requests. However, you might experience the following performance issues:

  • Temporary increase in usage on leader nodes as clusters have more nodes to manage.
  • Increased search and indexing latency as OpenSearch Service copies data from old nodes to new nodes.
  • Increased rejections for incoming requests as the cluster load increases during blue/green deployments.

To avoid latency issues and request rejections, it's a best practice to run blue/green deployments when the cluster is healthy and there's low network traffic.

To avoid data loss during blue/green deployments, make sure that you follow the operational best practices for OpenSearch Service.

Checking blue/green deployment activity, audit logs, and notifications

AWS CloudTrail
OpenSearch Service activity is recorded in CloudTrail events along with other AWS service events in Event history. CloudTrail captures all configuration API calls for OpenSearch Service as events. For more information, see Monitoring OpenSearch Service API calls with CloudTrail.

OpenSearch Service audit logs
If your Amazon OpenSearch Service domain uses fine-grained access control, you can turn on audit logs for your data. Audit logs are customizable and let you track user activity on your OpenSearch clusters. OpenSearch Service publishes audit logs to CloudWatch Logs. For more information, see Monitoring audit logs in OpenSearch Service.

OpenSearch Service notifications
Notifications in Amazon OpenSearch Service contain important information about the performance and health of your domains. OpenSearch Service notifies you about service software updates, Auto-Tune enhancements, cluster health events, and domain errors. You can view notifications in the Notifications panel of the OpenSearch Service console. For more information, see Getting started with notifications.

Configuration change duration

Your configuration change can take longer depending on the cluster size, workload, shard size, and shard count. You can check the progress of each configuration change stage under Domain status in the OpenSearch Service console, or with the DescribeDomainChangeProgress API.
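As a rough illustration of the API check mentioned above, the following sketch calls DescribeDomainChangeProgress through the AWS CLI and counts the stages that aren't finished yet. It assumes that AWS CLI v2 is configured, that the domain name is a placeholder, and that finished stages report the status string COMPLETED. The function names are hypothetical.

```shell
# Sketch: list configuration change stages and count the unfinished ones.
# Assumption: AWS CLI v2 with credentials; "my-domain" is a placeholder.

# Print one tab-separated line per stage: stage name, then stage status.
show_change_progress() {
  local domain="$1"
  aws opensearch describe-domain-change-progress \
    --domain-name "$domain" \
    --query 'ChangeProgressStatus.ChangeProgressStages[].[Name,Status]' \
    --output text
}

# Read "name<TAB>status" lines on stdin and count stages that are not
# completed (assumes finished stages use the status string COMPLETED).
count_unfinished_stages() {
  awk -F '\t' '$2 != "COMPLETED" { n++ } END { print n + 0 }'
}

# Usage (placeholder domain name):
#   show_change_progress my-domain | count_unfinished_stages
```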

Use the cat recovery API to monitor the status of your shard relocation. To see which shards are still relocating, use the following command syntax:

curl -X GET "https://<end_point>/_cat/recovery?v=true&pretty" | awk '/peer/ {print $1" "$2" "$3" "$4" "$18}' | grep -v '100\.0%'

To list the shard relocation by byte percentages, use the following command syntax:

curl -X GET "https://<end_point>/_cat/recovery?v=true&pretty" | awk '/peer/ {print $1" "$2" "$3" "$4" "$18}' | tr -d "%" | sort -k 5 -n

Note: The byte percentage is the fifth column of the awk output, so the sort command specifies "5" for -k.

For more information, see the cat recovery API on the Elasticsearch website.
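The commands above depend on the byte percentage being the 18th column of the default cat recovery output. A variation that requests the columns by name with the h parameter is sketched below. It assumes that an ENDPOINT environment variable holds your domain endpoint and that you add whatever authentication your domain requires. The function names are hypothetical.

```shell
# Sketch: track shard relocation with explicitly named cat recovery columns.
# Assumption: ENDPOINT holds the domain endpoint; authentication is omitted.

# Request only the columns that the filter below needs.
fetch_recoveries() {
  curl -s -X GET "https://${ENDPOINT}/_cat/recovery?h=index,shard,type,stage,bytes_percent"
}

# Keep peer recoveries that are not done, strip the "%" sign, and sort by
# byte percentage (fourth output column), lowest first.
unfinished_peer_recoveries() {
  awk '$3 == "peer" && $4 != "done" { sub(/%$/, "", $5); print $1, $2, $4, $5 }' |
    sort -k 4 -n
}

# Usage:
#   fetch_recoveries | unfinished_peer_recoveries
```

Filtering on the stage column instead of on the text "100.0%" also keeps shards in view that have copied all bytes but haven't finished the later recovery stages.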

If you observe minimal progress for the shard relocation, your cluster might be stuck.

Reasons for stuck blue/green deployments

Your blue/green deployment process might get stuck for the following reasons:

  • An unhealthy cluster state from before the configuration change.
  • Consistently high JVM memory pressure. Aim to keep your JVM memory pressure below 75% to avoid out of memory (OOM) issues.
  • Consistently high CPU utilization. Aim to keep your CPU utilization below 80%.
  • Too many shards on a cluster or incorrect shard sizing. It's a best practice to keep your shard size between 10 GiB and 50 GiB. For more information about indexing strategy, see Choosing the number of shards.
  • A configuration that isn't valid, or too many configuration changes submitted at the same time. Verify your configuration settings, and wait until the first configuration change completes before you send another one.
  • Insufficient disk space or capacity for the relocation process or requested instance type.
  • Lack of available IPs on the requested subnet for a cluster inside a virtual private cloud (VPC).
  • Using a volume size that's outside the supported range for the instance type. Your volume size must be within the limit range.
  • Using index settings like "index.routing.allocation.require._name": "NODE_NAME" or "index.blocks.write": true. These settings pin shards to a named node or place a write block on the index. Make sure to remove these settings from your index settings before you proceed.
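Several of the thresholds in the list above can be spot-checked from the command line before you submit a change. The following is a minimal sketch, assuming that an ENDPOINT environment variable holds your domain endpoint and that authentication is handled separately. The heap.percent column of the cat nodes API is used here only as a rough stand-in for JVM memory pressure, and the function names are hypothetical.

```shell
# Sketch: flag nodes and shards that exceed the thresholds listed above.
# Assumption: ENDPOINT holds the domain endpoint; authentication is omitted.

# Per-node heap usage and CPU usage, both in percent.
fetch_node_usage() {
  curl -s -X GET "https://${ENDPOINT}/_cat/nodes?h=name,heap.percent,cpu"
}

# Flag nodes at or above 75% heap or 80% CPU.
flag_busy_nodes() {
  awk '$2 >= 75 || $3 >= 80 { print $1, "heap=" $2 "%", "cpu=" $3 "%" }'
}

# Per-shard store size (assumes bytes=gb returns plain numbers).
fetch_shard_sizes() {
  curl -s -X GET "https://${ENDPOINT}/_cat/shards?h=index,shard,prirep,store&bytes=gb"
}

# Flag assigned shards smaller than 10 or larger than 50; rows without a
# store size (for example, unassigned shards) are skipped.
flag_shards_outside_range() {
  awk 'NF == 4 && ($4 < 10 || $4 > 50) { print $1, $2, $3, $4 "GiB" }'
}

# Usage:
#   fetch_node_usage | flag_busy_nodes
#   fetch_shard_sizes | flag_shards_outside_range
```

A single reading shows only a moment in time. The list above refers to consistently high usage, so repeat the check or compare it with your CloudWatch metrics. Small shards on small indexes are also flagged and are usually not a concern.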

For more information, see Why is my OpenSearch Service domain stuck in the "Processing" state?
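To check for the index settings named in the last item of the list above, you can query the index settings API and then reset the settings that you find. The following is a sketch under the same assumptions as the curl commands earlier in this article: ENDPOINT is a placeholder environment variable for your domain endpoint, authentication is omitted, and the function names are hypothetical. Setting a value to null resets it to its default.

```shell
# Sketch: find and clear allocation filters and blocks on indexes.
# Assumption: ENDPOINT holds the domain endpoint; authentication is omitted.

# Show allocation filter and block settings for all indexes.
show_blocking_settings() {
  curl -s -X GET "https://${ENDPOINT}/_all/_settings/index.routing.allocation.*,index.blocks.*?flat_settings=true&pretty"
}

# Request body that resets both settings to their defaults.
reset_settings_body() {
  printf '%s\n' '{"index.routing.allocation.require._name": null, "index.blocks.write": null}'
}

# Apply the reset to one index (the index name is passed as an argument).
clear_blocking_settings() {
  local index="$1"
  curl -s -X PUT "https://${ENDPOINT}/${index}/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(reset_settings_body)"
}

# Usage (placeholder index name):
#   show_blocking_settings
#   clear_blocking_settings my-index
```

Review the output of the first command before you reset anything, because a write block or an allocation filter might have been set on purpose.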

Related information

Why is my Amazon OpenSearch Service domain upgrade taking so long?

Introducing Auto-tune in OpenSearch Service

AWS OFFICIAL | Updated a year ago