EBS snapshot failures are extremely rare, but they do happen, even when snapshots are managed by Amazon Data Lifecycle Manager (DLM). Users are often unaware of a failure until they need to restore from a snapshot and discover that it failed, which can leave no snapshot available for restoration.
EBS snapshot creation might fail because of transient issues, service limits, or other reasons.
EBS snapshot creation is an asynchronous process in which a createSnapshot event reports a result of either succeeded or failed.
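As a minimal sketch of how such a result could be recognized, the snippet below checks an event for a failed createSnapshot result. It assumes the event shape that EventBridge delivers for EBS snapshot notifications (source `aws.ec2`, detail-type `EBS Snapshot Notification`); the function and constant names are hypothetical and not taken from the sample.

```python
# Event pattern for an EventBridge rule, written as a Python dict.
# Assumption: this mirrors the documented EBS snapshot notification format.
FAILED_SNAPSHOT_PATTERN = {
    "source": ["aws.ec2"],
    "detail-type": ["EBS Snapshot Notification"],
    "detail": {"event": ["createSnapshot"], "result": ["failed"]},
}


def is_failed_create_snapshot(event: dict) -> bool:
    """Return True if the event is a createSnapshot notification that failed."""
    detail = event.get("detail", {})
    return (
        event.get("source") == "aws.ec2"
        and event.get("detail-type") == "EBS Snapshot Notification"
        and detail.get("event") == "createSnapshot"
        and detail.get("result") == "failed"
    )
```

The same check can be reused in tests or in the Lambda itself as a guard against unrelated messages.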
Users should monitor these events for their critical workloads so that they are aware of any EBS snapshot failure.
We implemented the automatic-EBS-failed-snapshot-recovery aws-sample project.
In this project, an EventBridge rule sends EBS snapshot failure events to an SQS queue, which triggers a Lambda function that starts another EBS snapshot creation.
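The flow above could be handled inside the Lambda roughly as follows. This is a sketch under assumptions, not the sample's actual code: it assumes each SQS record body is the JSON-encoded EventBridge event and that the `source` field of the event detail holds the volume ARN. The helper names are hypothetical.

```python
import json


def volume_id_from_arn(arn: str) -> str:
    """Return the last path segment of an ARN, e.g. 'vol-0123' from '.../vol-0123'."""
    return arn.rsplit("/", 1)[-1]


def failed_snapshot_volumes(sqs_event: dict) -> list:
    """Collect the volume IDs of failed createSnapshot events from an SQS batch."""
    volume_ids = []
    for record in sqs_event.get("Records", []):
        body = json.loads(record["body"])  # EventBridge event forwarded by SQS
        detail = body.get("detail", {})
        if detail.get("event") == "createSnapshot" and detail.get("result") == "failed":
            volume_ids.append(volume_id_from_arn(detail["source"]))
    return volume_ids

# In a real handler, each volume ID would then be passed to the EC2 API,
# for example boto3's ec2.create_snapshot(VolumeId=volume_id, Description=...).
```

Keeping the parsing separate from the AWS calls makes it possible to test the handler logic without an AWS account.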
The Lambda acts only on snapshots tagged with a specific tag, source_tag_Key=source_tag_Value, which must be defined in the environment variables. The Lambda marks recovery snapshots by appending the tag snapshot_recovery_tag_Key=snapshot_recovery_tag_Value, which must also be defined in the environment variables.
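The tag handling described above might look like the sketch below. It assumes the environment variables are named exactly as written in the text and that tags use the `[{"Key": ..., "Value": ...}]` list format returned by the EC2 API. The default values and function names are hypothetical.

```python
import os

# Assumption: environment variable names match those in the text;
# the fallback values are placeholders for local experimentation only.
SOURCE_KEY = os.environ.get("source_tag_Key", "Backup")
SOURCE_VALUE = os.environ.get("source_tag_Value", "true")
RECOVERY_KEY = os.environ.get("snapshot_recovery_tag_Key", "Recovery")
RECOVERY_VALUE = os.environ.get("snapshot_recovery_tag_Value", "true")


def has_tag(tags: list, key: str, value: str) -> bool:
    """Return True if the tag list contains exactly key=value."""
    return any(t.get("Key") == key and t.get("Value") == value for t in tags)


def with_recovery_tag(tags: list, key: str, value: str) -> list:
    """Copy the tags and append the recovery tag.

    Tags prefixed with 'aws:' are dropped because that prefix is reserved
    and cannot be set by users.
    """
    kept = [
        t for t in tags
        if t.get("Key") != key and not t.get("Key", "").startswith("aws:")
    ]
    return kept + [{"Key": key, "Value": value}]
```

For example, a failed snapshot would be retried only when `has_tag(tags, SOURCE_KEY, SOURCE_VALUE)` is true, and the new snapshot would be created with `with_recovery_tag(tags, RECOVERY_KEY, RECOVERY_VALUE)`.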
The key mechanism is that if the recovery snapshot started by the Lambda also fails, a new createSnapshot failed event is generated and the mechanism described above is triggered again.
The Python code proposed in this sample includes a tag counter that limits the maximum number of attempts. The counter decreases every time a recovery snapshot attempt fails.
When the counter reaches zero, the Lambda sends a notification via SNS.
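The retry limit described above can be sketched as follows. This is an illustration under assumptions rather than the sample's code: the counter tag name, the maximum of three attempts, and the function names are hypothetical. The SNS client is passed in so the logic can be exercised without AWS; with boto3 it would be `boto3.client("sns")`.

```python
def plan_next_step(tags: list, counter_key: str, max_attempts: int) -> tuple:
    """Decide whether to retry or notify, based on the counter tag.

    Returns ("retry", remaining) or ("notify", 0).
    """
    counter = None
    for tag in tags:
        if tag.get("Key") == counter_key:
            counter = int(tag["Value"])
    if counter is None:
        # No counter yet: the original snapshot failed, start with a full budget.
        return ("retry", max_attempts)
    remaining = counter - 1  # a recovery attempt failed, so decrement
    if remaining <= 0:
        return ("notify", 0)
    return ("retry", remaining)


def notify_exhausted(sns_client, topic_arn: str, volume_id: str) -> None:
    """Publish a message once all recovery attempts have been used."""
    sns_client.publish(
        TopicArn=topic_arn,
        Subject="EBS snapshot recovery failed",
        Message=f"All recovery attempts failed for volume {volume_id}.",
    )
```

Storing the counter as a tag on the snapshot keeps the Lambda stateless: each failed attempt carries its own remaining budget into the next event.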
Please note that this is a sample to review and modify according to your needs.
Important: The sample creates recovery EBS snapshots, so we recommend reviewing and cleaning up these snapshots to avoid unexpected charges on your bill.