There is a cluster that, due to losing a couple of nodes, has a single shard in UNASSIGNED state.
TL;DR: The shard cannot be rerouted due to AWS limitations, the index cannot be deleted because of a running snapshot (over 18 hours now), the cluster has scaled to double its regular size for no obvious reason, and the snapshot cannot be cancelled because it is one of the automated ones.
What could be done to get the cluster back to green health? Data loss of that single index should not be a problem.
Detailed explanation
Symptom
Cluster in red health status due to a single unassigned shard. A call to /_cluster/allocation/explain
returns the following:
{
"index": "REDACTED",
"shard": 1,
"primary": true,
"current_state": "unassigned",
"unassigned_info": {
"reason": "NODE_LEFT",
"at": "2021-12-01T21:27:04.905Z",
"details": "node_left[REDACTED]",
"last_allocation_status": "no_valid_shard_copy"
},
"can_allocate": "no_valid_shard_copy",
"allocate_explanation": "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",
...
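The key fields in that response are current_state and can_allocate. As a minimal sketch (the helper name is mine, not an Elasticsearch API), the "no surviving copy" condition can be checked programmatically like this:

```python
import json

# Excerpt of the /_cluster/allocation/explain response shown above.
explain = json.loads("""
{
  "index": "REDACTED",
  "shard": 1,
  "primary": true,
  "current_state": "unassigned",
  "unassigned_info": {"reason": "NODE_LEFT",
                      "last_allocation_status": "no_valid_shard_copy"},
  "can_allocate": "no_valid_shard_copy"
}
""")

def copy_is_lost(explain: dict) -> bool:
    """True when no surviving node holds a copy of the primary, i.e. the
    shard can only come back through a path that accepts data loss."""
    return (explain.get("current_state") == "unassigned"
            and explain.get("can_allocate") == "no_valid_shard_copy")

print(copy_is_lost(explain))  # True: only data-loss recovery paths remain
```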
Cluster rerouting
Regular troubleshooting on the matter suggests accepting the data loss by allocating an empty primary, using something like:
$ curl -XPOST '/_cluster/reroute' -d '{"commands": [{ "allocate_empty_primary": { "index": "REDACTED", "shard": 1, "node": "REDACTED", "accept_data_loss": true }}] }'
{"Message":"Your request: '/_cluster/reroute' is not allowed."}
But that endpoint is not available in AWS.
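On a self-managed cluster, where /_cluster/reroute is reachable, the request body from the curl call above could be built like this (a sketch; the helper name is hypothetical):

```python
import json

def allocate_empty_primary(index: str, shard: int, node: str) -> str:
    """Build the /_cluster/reroute body that force-allocates an empty
    primary. accept_data_loss is mandatory: the new primary starts empty."""
    body = {"commands": [{
        "allocate_empty_primary": {
            "index": index,
            "shard": shard,
            "node": node,
            "accept_data_loss": True,
        }
    }]}
    return json.dumps(body)

print(allocate_empty_primary("REDACTED", 1, "REDACTED"))
```

On AWS the managed proxy rejects the endpoint outright, so the body never reaches the cluster regardless of how it is constructed.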
Closing/Deleting the index
Other suggestions include closing the index for operations, but that is not supported by AWS:
$ curl -X POST '/REDACTED/_close'
{"Message":"Your request: '/REDACTED/_close' is not allowed by Amazon Elasticsearch Service."}
Another solution is to delete the index. But, as there is a running snapshot, it cannot be deleted:
$ curl -X DELETE '/REDACTED'
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[REDACTED][indices:admin/delete]"}],"type":"illegal_argument_exception","reason":"Cannot delete indices that are being snapshotted: [[REDACTED]]. Try again after snapshot finishes or cancel the currently running snapshot."},"status":400}
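That 400 is distinguishable from other delete failures by its error type and reason, so a retry loop could wait specifically on it rather than on any error. A sketch of the classification (function name is mine):

```python
import json

DELETE_BLOCKED = "Cannot delete indices that are being snapshotted"

def delete_blocked_by_snapshot(status: int, body: str) -> bool:
    """Detect the 400 Elasticsearch returns while a running snapshot
    still references the index; only this case is worth retrying later."""
    if status != 400:
        return False
    try:
        err = json.loads(body)["error"]
    except (ValueError, KeyError):
        return False
    return (err.get("type") == "illegal_argument_exception"
            and DELETE_BLOCKED in err.get("reason", ""))

sample = ('{"error":{"root_cause":[],"type":"illegal_argument_exception",'
          '"reason":"Cannot delete indices that are being snapshotted: '
          '[[REDACTED]]."},"status":400}')
print(delete_blocked_by_snapshot(400, sample))  # True: retry after the snapshot ends
```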
Cancelling the snapshot
As the previous error message states, you can try cancelling the snapshot:
$ curl -X DELETE '/_snapshot/cs-automated-enc/REDACTED'
{"Message":"Your request: '/_snapshot/cs-automated-enc/REDACTED' is not allowed."}
Apparently that is because the snapshot is one of the automated ones. Had it been a manual snapshot, I would have been able to cancel it.
The problem is that the snapshot has been running for over 10 hours and is still initializing:
$ curl '/_snapshot/cs-automated-enc/REDACTED/_status'
{ "snapshots": [
{
"snapshot": "2021-12-12t20-38-REDACTED",
"repository": "cs-automated-enc",
"uuid": "REDACTED",
"state": "INIT",
"shards_stats": {
"initializing": 0,
"started": 0,
"finalizing": 0,
"done": 0,
"failed": 0,
"total": 0
},
"stats": {
"number_of_files": 0,
"processed_files": 0,
"total_size_in_bytes": 0,
"processed_size_in_bytes": 0,
"start_time_in_millis": 0,
"time_in_millis": 0
},
"indices": {}
}
]}
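A healthy snapshot moves out of INIT quickly and reports a non-zero shard total, so this particular output can be flagged as "stuck" mechanically. A sketch of that check against the _status response above (helper name is mine):

```python
import json

def snapshot_stuck_in_init(status_body: dict) -> bool:
    """A snapshot reporting state INIT with zero total shards has not
    started copying anything; healthy runs reach STARTED within minutes."""
    snap = status_body["snapshots"][0]
    return snap["state"] == "INIT" and snap["shards_stats"]["total"] == 0

status = json.loads("""
{"snapshots": [{
  "snapshot": "REDACTED",
  "repository": "cs-automated-enc",
  "state": "INIT",
  "shards_stats": {"initializing": 0, "started": 0, "finalizing": 0,
                   "done": 0, "failed": 0, "total": 0}
}]}
""")
print(snapshot_stuck_in_init(status))  # True
```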
As can be seen from the timestamp, it has been in that state for almost 20 hours now (for reference, previous snapshots completed in a couple of minutes).
Update: after the latest AWS EC2 outage, the snapshot was cancelled, which allowed us to delete the index with the unallocated shard; the cluster is back in a healthy status :)