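I want to monitor my Amazon OpenSearch Service cluster for stability issues. How can I monitor my cluster effectively?
Resolution
Important: Different versions of Elasticsearch use different thread pools to process calls to the _index API.
- Elasticsearch versions 1.5 and 2.3 use the index thread pool.
- Elasticsearch versions 5.x, 6.0, and 6.2 use the bulk thread pool. (Currently, the OpenSearch Service console doesn't include a graph for the bulk thread pool.)
- Elasticsearch versions 6.3 and later use the write thread pool.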
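The version-to-thread-pool mapping above can be sketched as a small lookup. This is a minimal illustration, not part of any AWS SDK: the function name `write_thread_pool` is hypothetical, and versions that the list doesn't name explicitly (for example 2.4 or 6.1) are assumed to fall in the nearest listed range.

```python
def write_thread_pool(version: str) -> str:
    """Return the thread pool that handles _index API calls for a version.

    Assumption: versions not named in the article are bucketed by range
    (below 5.0 -> index, 5.0 up to 6.2 -> bulk, 6.3 and later -> write).
    """
    parts = version.split(".")
    major = int(parts[0])
    # Treat a wildcard such as "5.x" or a bare major version as minor 0.
    minor = int(parts[1]) if len(parts) > 1 and parts[1].isdigit() else 0
    if major < 5:
        return "index"
    if (major, minor) < (6, 3):
        return "bulk"
    return "write"


for v in ("1.5", "2.3", "5.x", "6.0", "6.2", "6.3", "7.10"):
    print(v, "->", write_thread_pool(v))
```

Knowing which pool applies helps you pick the matching queue and rejection metrics when you review indexing pressure.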
To monitor the health of your OpenSearch Service cluster, set the recommended Amazon CloudWatch alarms along with alarms for the following OpenSearch Service cluster metrics:
- MasterReachableFromNode
- KibanaHealthyNodes
- DiskQueueDepth
- ThreadpoolIndexQueue
- ThreadpoolSearchQueue
You can configure the OpenSearch Service metric alarms like this:
MasterReachableFromNode:
Statistic = Maximum
Value = '=0'
Frequency = 1 period
Period = 1 minute
Issue: The leader (master) node is down or can't be reached from the node.
KibanaHealthyNodes:
Statistic = Average
Value = '=0'
Frequency = 1 period
Period = 1 minute
Issue: Indicates that the Kibana index is unhealthy.
DiskQueueDepth:
Statistic = Average
Value = '>=100'
Frequency = 1 period
Period = 5 minutes
Issue: Disk Queue Depth is the number of I/O requests that are queued at a time against the storage. This could indicate a surge in requests or Amazon EBS throttling, resulting in increased latency.
ThreadpoolIndexQueue and ThreadpoolSearchQueue:
Statistic = Maximum
Value = '>=20'
Frequency = 1 period
Period = 1 minute
Issue: Requests are queuing up and might be rejected. To verify the request status, check the CPUUtilization metric and the ThreadpoolIndexRejected or ThreadpoolSearchRejected metrics.
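The alarm settings above can also be kept as data, which makes the thresholds easy to review or test. The following is a minimal sketch: the `AlarmSpec` type and the `breaches` helper are hypothetical names for illustration, and the check is a local comparison only, not a call to CloudWatch.

```python
import operator
from typing import Callable, NamedTuple


class AlarmSpec(NamedTuple):
    statistic: str                            # CloudWatch statistic to evaluate
    compare: Callable[[float, float], bool]   # comparison from the "Value" line
    threshold: float
    period_seconds: int


# Values transcribed from the settings listed above.
ALARMS = {
    "MasterReachableFromNode": AlarmSpec("Maximum", operator.eq, 0, 60),
    "KibanaHealthyNodes": AlarmSpec("Average", operator.eq, 0, 60),
    "DiskQueueDepth": AlarmSpec("Average", operator.ge, 100, 300),
    "ThreadpoolIndexQueue": AlarmSpec("Maximum", operator.ge, 20, 60),
    "ThreadpoolSearchQueue": AlarmSpec("Maximum", operator.ge, 20, 60),
}


def breaches(metric: str, value: float) -> bool:
    """Return True if one datapoint would breach the alarm for `metric`."""
    spec = ALARMS[metric]
    return spec.compare(value, spec.threshold)


for name, spec in ALARMS.items():
    print(f"{name}: {spec.statistic} over {spec.period_seconds}s, threshold {spec.threshold}")
```

Because every alarm above uses a frequency of 1 period, a single breaching datapoint is enough to trigger it.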
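To set up an Amazon CloudWatch alarm for your OpenSearch Service cluster, perform the following steps:
1. Open the Amazon CloudWatch console.
2. Go to the Alarms tab.
3. Choose Create alarm.
4. Choose Select metric.
5. Choose ES for your metric.
6. Choose Per-Domain, Per-Client Metrics.
7. Select a metric, and then choose Next.
8. Configure the following settings for your Amazon CloudWatch alarm: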
Statistic = Maximum
Period = 1 minute
Threshold type = Static
Alarm condition = Greater than or equal to
Threshold value = 1
9. Choose the Additional configuration tab.
10. Update the following configuration settings:
Datapoints to alarm = Frequency stated above
Missing data treatment = Treat missing data as ignore (maintain the alarm state)
11. Choose Next.
12. Choose the action that you want your alarm to take, and then choose Next.
13. Set a name for your alarm, and then choose Next.
14. Choose Create alarm.
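The console steps above could also be scripted. The sketch below builds a request for the CloudWatch `put_metric_alarm` API through boto3, using the settings from steps 8 and 10. It assumes that the ES metrics live in the `AWS/ES` namespace with `DomainName` and `ClientId` dimensions; the helper names, the domain name `my-domain`, and the account ID `123456789012` are placeholders.

```python
def build_alarm_request(alarm_name, metric_name, domain_name, client_id,
                        datapoints_to_alarm=1, alarm_actions=()):
    """Build keyword arguments for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ES",  # assumed namespace for per-domain metrics
        "MetricName": metric_name,
        "Dimensions": [
            {"Name": "DomainName", "Value": domain_name},
            {"Name": "ClientId", "Value": client_id},  # AWS account ID
        ],
        "Statistic": "Maximum",
        "Period": 60,  # 1 minute, in seconds
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1,
        "EvaluationPeriods": datapoints_to_alarm,
        "DatapointsToAlarm": datapoints_to_alarm,
        "TreatMissingData": "ignore",  # maintain the alarm state
        "AlarmActions": list(alarm_actions),  # for example, SNS topic ARNs
    }


def create_alarm(request):
    """Send the request to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder runs without boto3
    boto3.client("cloudwatch").put_metric_alarm(**request)


request = build_alarm_request(
    alarm_name="my-domain-ThreadpoolSearchQueue",
    metric_name="ThreadpoolSearchQueue",
    domain_name="my-domain",
    client_id="123456789012",
)
print(request)
```

Calling `create_alarm(request)` would create or update the alarm. Adjust the statistic, period, and threshold per metric to match the values listed earlier.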
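Note: If a CPUUtilization or JVMMemoryPressure alarm is triggered, check your Amazon CloudWatch metrics to determine whether there's a spike in incoming requests. In particular, monitor the following Amazon CloudWatch metrics: IndexingRate, SearchRate, and OpenSearchRequests.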
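One way to check those request metrics programmatically is to pull recent datapoints and compare the latest value with the earlier baseline. This is a rough sketch under several assumptions: the `AWS/ES` namespace and dimensions, the statistic chosen for each metric, and the "twice the baseline" rule are illustrative choices, not AWS recommendations. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Statistic per metric is an assumption made for this sketch.
REQUEST_METRICS = {
    "IndexingRate": "Average",
    "SearchRate": "Average",
    "OpenSearchRequests": "Sum",
}


def build_metric_queries(domain_name, client_id, hours=3, now=None):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return [
        {
            "Namespace": "AWS/ES",
            "MetricName": name,
            "Dimensions": [
                {"Name": "DomainName", "Value": domain_name},
                {"Name": "ClientId", "Value": client_id},
            ],
            "StartTime": start,
            "EndTime": end,
            "Period": 60,
            "Statistics": [stat],
        }
        for name, stat in REQUEST_METRICS.items()
    ]


def is_spike(values, factor=2.0):
    """Return True if the last value is at least `factor` times the mean of the rest."""
    if len(values) < 2:
        return False
    *history, latest = values
    baseline = sum(history) / len(history)
    return baseline > 0 and latest >= factor * baseline


queries = build_metric_queries("my-domain", "123456789012")
print([q["MetricName"] for q in queries])
```

Each query dictionary can be passed to `boto3.client("cloudwatch").get_metric_statistics(**query)`. Sort the returned datapoints by timestamp before passing their values to `is_spike`.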
Related information
ClusterBlockException
Using Amazon CloudWatch alarms