How do I improve the fault tolerance of my OpenSearch Service domain?
I want to protect Amazon OpenSearch Service resources against accidental deletion, application or hardware failures, or outages. I want to know best practices to improve fault tolerance or restore snapshots.
Short description
To improve the fault tolerance of your OpenSearch Service domain, take one or more of the following actions:
- Use at least three master-eligible data nodes in each domain
- Use dedicated master nodes
- Scale your domain horizontally
- Turn on multi-AZ configurations
- Don't use T2 or T3 instance types for production environments
- Monitor your domain metrics with Amazon CloudWatch
- Take regular snapshots
Resolution
Use at least three master-eligible data nodes in each domain
Use at least three master-eligible data nodes to avoid the 503 Service Unavailable error that occurs when a node fails in a one-node or two-node cluster. OpenSearch Service uses a quorum-based election to choose a leader node. With three master-eligible data nodes, an election can't end in a tie, so you avoid the split brain and quorum loss that ties can cause. If one node fails in a three-node cluster, then the remaining two nodes still form a quorum and can elect a new leader node.
To further improve fault tolerance, configure at least one replica for each index so that every shard has both a primary copy and a replica copy on different data nodes.
Each index has 5 primary shards and a replication factor of 1 by default. It's a best practice to scale your data nodes to the same number as your primary shards to verify equal shard distribution. This improves your cluster's resiliency. For example, if you have a 100 GB index with 5 shards, then use 5 data nodes that each store 20 GB of data.
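The following minimal sketch creates an index with 5 primary shards and 1 replica over HTTPS. It assumes fine-grained access control with a master user; the domain endpoint, index name, and credentials are placeholders for your own values.

```python
import requests

host = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder domain endpoint
auth = ("master-user", "master-password")              # placeholder credentials

# Create an index with 5 primary shards and 1 replica so that shards
# distribute evenly across 5 data nodes.
index_settings = {
    "settings": {
        "index": {
            "number_of_shards": 5,
            "number_of_replicas": 1,
        }
    }
}

response = requests.put(f"{host}/my-index", json=index_settings, auth=auth, timeout=30)
print(response.json())
```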
Use dedicated master nodes
Dedicated master nodes help prevent the bottlenecks that occur when overloaded data nodes also handle master-node tasks. Dedicated master nodes manage the cluster's shards and cluster state. Removing these tasks from data nodes frees them to handle search and write traffic more efficiently. Dedicated master nodes also improve the performance of snapshot operations for data recovery on large clusters with many shards and data nodes.
It's a best practice to use dedicated master nodes in the following scenarios:
- Your domain is a production deployment.
- Your index mapping is complex, with many fields defined across types and indices.
- Your cluster has more than 5 to 10 data nodes. The exact threshold depends on your traffic and shard count.
- Your data nodes have more shards than the cluster sizing calculations. For more information, see Shard strategy.
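To add dedicated master nodes to an existing domain, you can update the domain's cluster configuration. The following sketch uses the AWS SDK for Python (boto3); the domain name and instance type are example values.

```python
import boto3

client = boto3.client("opensearch")

# Add three dedicated master nodes so that cluster-management tasks
# move off the data nodes. Use three (or five) so a quorum survives
# the loss of one master node.
client.update_domain_config(
    DomainName="my-domain",  # placeholder domain name
    ClusterConfig={
        "DedicatedMasterEnabled": True,
        "DedicatedMasterType": "m6g.large.search",  # example instance type
        "DedicatedMasterCount": 3,
    },
)
```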
Scale your domain horizontally
To improve the fault tolerance of an OpenSearch Service domain, scale your domain to match the amount of data that it stores. For high availability, verify that you have additional data nodes. If you increase your data nodes and shard replication, then you avoid a single point of failure when a data node in the cluster fails. You also reduce the risk of data loss from a red index. When a data node fails, the affected indexes turn yellow while OpenSearch Service promotes the replica shards to primary shards. This keeps write operations active and prevents data loss.
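After you scale out, you can confirm that shards are assigned by checking the cluster health status. The following sketch queries the _cluster/health API with the same placeholder endpoint and credentials as the earlier example.

```python
import requests

host = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder domain endpoint
auth = ("master-user", "master-password")              # placeholder credentials

health = requests.get(f"{host}/_cluster/health", auth=auth, timeout=30).json()

# "green": all primary and replica shards are assigned.
# "yellow": replicas are unassigned or being promoted; writes still work.
# "red": at least one primary shard is missing; risk of data loss.
print(health["status"], "-", health["unassigned_shards"], "unassigned shards")
```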
Turn on multi-AZ configurations
Zone awareness helps prevent downtime and data loss when a cluster is deployed in a single Availability Zone. If you turn on zone awareness, then OpenSearch Service allocates the nodes and shards across two or three Availability Zones in that AWS Region.
It's a best practice to turn on a multi-AZ configuration, preferably across three Availability Zones, for your OpenSearch Service domain.
For example, if you deploy across three Availability Zones, then configure a replication factor of 2 on your indexes. If there's a zone failure, then the two replicas provide 100% data redundancy. Also, OpenSearch Service promotes a replica shard to primary to continue write operations.
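The following sketch turns on zone awareness across three Availability Zones with boto3; the domain name is a placeholder. You can then set a replication factor of 2 on your indexes with the same index settings API shown earlier.

```python
import boto3

client = boto3.client("opensearch")

# Spread nodes and shards across three Availability Zones. Pair this
# with "number_of_replicas": 2 on your indexes so that each zone holds
# a full copy of the data.
client.update_domain_config(
    DomainName="my-domain",  # placeholder domain name
    ClusterConfig={
        "ZoneAwarenessEnabled": True,
        "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
    },
)
```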
Don't use T instances (T2 and T3) for production environments
For production environments, use the following non-T instance types:
- For scalable workloads, use M instances (general purpose).
- For index-heavy workloads, use C instances (compute optimized).
- For search-heavy workloads, use R instances (memory optimized).
- For workloads that have large indexes with large shards (50-100 GB per shard), use I3 or Im4gn instances (storage optimized).
For more information on instance types, see On-demand instance pricing on the Amazon OpenSearch Service pricing page.
Use small T2 and T3 instances for development, testing, proof-of-concept, and learning environments. Don't use t2.small or t3.small instances as data nodes or dedicated master nodes in important deployments that require high availability. If you use T instances in a deployment, then be aware of the following behaviors:
- T instances are assigned CPU credits. If an instance uses more CPU than its baseline, then it spends from its CPU credit balance. When the balance runs out, the instance throttles the CPU. Under sustained heavy traffic, this throttling can cause nodes to time out and drop out of the cluster. For more information, see Key concepts and definitions for burstable performance instances.
- T3 instance types are more stable and resilient than T2 instances.
- To catch throttling before it happens, monitor the CPUCreditBalance, CPUUtilization, and JVMMemoryPressure metrics of your instances (see the sketch after the following note).
Note: If you need more processing power or memory, then scale up your instance types. If you need more storage for shard management, then add more nodes to scale out your cluster. For more information, see Choosing instance types and testing.
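The following sketch reads the CPUCreditBalance metric for a T-instance domain with boto3. OpenSearch Service publishes metrics to the AWS/ES namespace with DomainName and ClientId (your AWS account ID) dimensions; the domain name and account ID are placeholders.

```python
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/ES",  # OpenSearch Service metrics publish to AWS/ES
    MetricName="CPUCreditBalance",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},       # placeholder
        {"Name": "ClientId", "Value": "111122223333"},      # your account ID
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Minimum"],
)

# A balance trending toward zero means the instance is about to throttle.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Minimum"])
```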
Monitor your domain metrics with CloudWatch
To monitor your domain for performance bottlenecks, use CloudWatch metrics such as CPUUtilization, JVMMemoryPressure, and the Amazon Elastic Block Store (Amazon EBS) related metrics for FreeStorageSpace, IOPS throttling, and throughput throttling. Use these metrics to proactively troubleshoot and correct minor issues before they become complex.
Create CloudWatch alarms for key OpenSearch Service metrics.
For more information, see Get started with Amazon OpenSearch Service: Set CloudWatch alarms on key metrics.
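For example, the following sketch creates an alarm on the FreeStorageSpace metric with boto3. The domain name, account ID, SNS topic, and threshold are placeholders that you should size to your domain.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm when minimum free storage drops to 20480 MB (20 GB) or less.
# Size this threshold to your own domain and workload.
cloudwatch.put_metric_alarm(
    AlarmName="my-domain-free-storage-space",
    Namespace="AWS/ES",
    MetricName="FreeStorageSpace",
    Dimensions=[
        {"Name": "DomainName", "Value": "my-domain"},       # placeholder
        {"Name": "ClientId", "Value": "111122223333"},      # your account ID
    ],
    Statistic="Minimum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=20480,
    ComparisonOperator="LessThanOrEqualToThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:111122223333:my-alerts-topic"],
)
```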
Take regular snapshots
All OpenSearch Service domains have automated snapshots turned on for disaster recovery. However, it's a best practice to also maintain your own snapshot repository where you save and manage backups separately from the AWS-managed repository. Self-managed snapshots also help you migrate data between OpenSearch Service domains or restore data to another OpenSearch Service domain.
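The following sketch registers a self-managed S3 snapshot repository and takes a manual snapshot. It assumes the requests and requests-aws4auth packages and an IAM role that OpenSearch Service can assume to write to your bucket; the endpoint, bucket, Region, and role ARN are placeholders.

```python
import boto3
import requests
from requests_aws4auth import AWS4Auth

host = "https://my-domain.us-east-1.es.amazonaws.com"  # placeholder domain endpoint
region = "us-east-1"

# Sign requests with your current AWS credentials (SigV4, service "es").
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(
    credentials.access_key,
    credentials.secret_key,
    region,
    "es",
    session_token=credentials.token,
)

# Register a repository that stores snapshots in your own S3 bucket.
repo_body = {
    "type": "s3",
    "settings": {
        "bucket": "my-snapshot-bucket",                            # placeholder
        "region": region,
        "role_arn": "arn:aws:iam::111122223333:role/SnapshotRole",  # placeholder
    },
}
requests.put(f"{host}/_snapshot/my-repository", json=repo_body, auth=awsauth, timeout=30)

# Take a manual snapshot of all indexes in the domain.
requests.put(f"{host}/_snapshot/my-repository/snapshot-1", auth=awsauth, timeout=30)
```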
Related information
Get started with Amazon OpenSearch Service: How many data instances do I need?
Configure Amazon OpenSearch Service for high availability