Automated Cross-Region Cleanup of Orphaned EBS Snapshots
EBS snapshots created by Kubernetes backup tools often become orphaned when backup retention policies expire or backups are manually deleted, and their storage costs accumulate across multiple AWS Regions. Once thousands of orphaned snapshots exist, cleaning them up by hand becomes a significant operational burden.
This article provides a practical, automated solution for identifying and cleaning up orphaned EBS snapshots across all AWS Regions.
The Hidden Cost of Kubernetes Backup Automation
When implementing backup strategies for Amazon EKS workloads, organizations often deploy automated backup solutions that create EBS snapshots to protect persistent volumes. However, many discover an unexpected consequence: while these backup tools efficiently manage retention at the application level, the underlying Amazon EBS snapshots often persist beyond their intended lifecycle.
Security Concerns
- Potential sensitive data exposure
- Outdated snapshots may contain vulnerable configurations
- Compliance risks if snapshots contain regulated data
Why Snapshots Become Orphaned
Backup solutions using Kubernetes CSI drivers create EBS snapshots through the AWS API to protect persistent volumes. The disconnect occurs when:
- Application-level retention policies expire and remove backup metadata
- Manual deletion of backup objects doesn't trigger underlying AWS resource cleanup
- Cross-region backup strategies create snapshots in multiple locations
- Backup tool lifecycle management operates independently of AWS resource lifecycle
The Cost Impact
EBS snapshots follow an incremental storage model, but even incremental costs accumulate significantly:
- Snapshots persist indefinitely without intervention
- Multi-region deployments multiply storage costs
- Lack of visibility makes cost attribution difficult
- Manual cleanup becomes operationally prohibitive at scale
Organizations commonly discover thousands of orphaned snapshots representing hundreds or thousands of dollars in monthly storage costs that can be optimized without impacting backup integrity.
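As a rough illustration of how these costs add up, the sketch below estimates an upper bound on monthly storage cost from the `VolumeSize` field that `describe_snapshots` returns for each snapshot. The `PRICE_PER_GB_MONTH` value and the function name are assumptions for this example; check the EBS pricing page for the actual rate in each Region. Because snapshots are incremental, the real charge is normally lower than this figure.

```python
# Assumed price per GB-month for standard-tier snapshot storage.
# This is an illustrative value; verify it against current EBS pricing.
PRICE_PER_GB_MONTH = 0.05


def estimate_monthly_cost(snapshots, price_per_gb_month=PRICE_PER_GB_MONTH):
    """Return an upper-bound monthly cost in USD for a list of snapshot dicts.

    VolumeSize is the size of the source volume in GiB, not the incremental
    data actually stored, so the result overestimates the real charge.
    """
    total_gib = sum(s.get("VolumeSize", 0) for s in snapshots)
    return total_gib * price_per_gb_month
```

Feeding this the snapshots reported by a dry run gives a ceiling on the savings to expect from cleanup.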
Automated Solution
The solution addresses orphaned snapshot cleanup through a multi-layered approach that prioritizes safety while maximizing cost optimization:
Core Components
- Multi-region scanning: Automatically processes every AWS Region enabled for the account
- Tag-based identification: Targets specific backup tool snapshots
- AMI protection: Prevents deletion of snapshots used by Amazon Machine Images
- Configurable retention: Age-based filtering with customizable policies
- Comprehensive logging: Full audit trail for compliance and troubleshooting
Safety-First Design
The automation includes multiple safeguards:
- Dry-run capability for validation before production deployment
- AMI association checks to protect critical system images
- Tag-based filtering to ensure only intended snapshots are processed
- Error handling that continues processing despite individual failures
- Detailed logging for complete visibility into all operations
This approach ensures organizations can confidently automate cleanup while maintaining the integrity of their backup and system recovery capabilities.
The Implementation
Here's the complete Python script that powers this solution, written as an AWS Lambda handler:
import datetime

import boto3

# ---- CONFIG ----
AGE_DAYS = 7                   # delete snapshots older than this many days
TAG_KEYS = ["<tag>", "<tag>"]
DRY_RUN = True                 # set to False to actually delete
# -----------------------------------------

def lambda_handler(event, context):
    cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=AGE_DAYS)
    ec2_global = boto3.client("ec2")
    deleted = []

    for r in ec2_global.describe_regions()["Regions"]:
        region = r["RegionName"]
        ec2 = boto3.client("ec2", region_name=region)
        snaps = ec2.describe_snapshots(OwnerIds=["self"])["Snapshots"]

        # collect EBS snapshot IDs used by AMIs so we don't delete them
        # (EBS mappings for blank volumes carry no SnapshotId, so skip those)
        ami_snaps = {
            b["Ebs"]["SnapshotId"]
            for img in ec2.describe_images(Owners=["self"])["Images"]
            for b in img.get("BlockDeviceMappings", [])
            if "SnapshotId" in b.get("Ebs", {})
        }

        for s in snaps:
            # skip snapshots newer than the cutoff
            start = s["StartTime"].replace(tzinfo=None)
            if start > cutoff:
                continue

            # skip snapshots referenced in AMI
            if s["SnapshotId"] in ami_snaps:
                continue

            # must contain at least one of the tag keys
            tags = {t["Key"]: t["Value"] for t in s.get("Tags", [])}
            if not any(k in tags for k in TAG_KEYS):
                continue

            msg = f"{s['SnapshotId']} Region={region} Time={start} Tags={tags}"
            if DRY_RUN:
                print(f"DRY_RUN: would delete {msg}")
            else:
                try:
                    ec2.delete_snapshot(SnapshotId=s["SnapshotId"])
                    print(f"DELETED: {msg}")
                    deleted.append(msg)
                except Exception as e:
                    print(f"ERROR deleting {s['SnapshotId']}: {e}")

    return {"deleted": deleted}
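The three filters in the loop (age, AMI usage, tag keys) can also be pulled into one pure function, which makes the targeting logic easy to unit test without AWS access. This is a sketch rather than part of the script itself; the name `should_delete` and its parameters are hypothetical, and `cutoff` is assumed to be a naive UTC datetime, as in the script.

```python
import datetime


def should_delete(snapshot, cutoff, ami_snapshot_ids, tag_keys):
    """Return True if a snapshot dict passes the age, AMI, and tag filters."""
    # Compare as naive UTC, matching how the script computes its cutoff.
    start = snapshot["StartTime"].replace(tzinfo=None)
    if start > cutoff:
        return False  # newer than the retention window
    if snapshot["SnapshotId"] in ami_snapshot_ids:
        return False  # still backing an AMI
    present = {t["Key"] for t in snapshot.get("Tags", [])}
    return any(k in present for k in tag_keys)
```

With this in place, the loop body reduces to one `if not should_delete(...): continue` check followed by the dry-run or delete branch.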
Key Features and Safety Measures
- Multi-region scanning:
for r in ec2_global.describe_regions()["Regions"]:
    region = r["RegionName"]
    ec2 = boto3.client("ec2", region_name=region)
The script calls describe_regions to discover every Region enabled for the account and processes each one, so coverage requires no manual configuration.
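One thing to consider when scanning many Regions is that a failure in a single Region (for example, a permissions error) should not stop the rest of the scan. The sketch below isolates per-Region errors; `scan_regions` and `process_region` are hypothetical names, and `process_region` stands in for whatever work is done for each Region.

```python
def scan_regions(region_names, process_region):
    """Call process_region(name) for each Region, collecting failures
    instead of aborting the whole scan."""
    results, errors = {}, {}
    for name in region_names:
        try:
            results[name] = process_region(name)
        except Exception as exc:  # broad on purpose: keep scanning other Regions
            errors[name] = str(exc)
    return results, errors
```

In the Lambda function, `region_names` would come from `[r["RegionName"] for r in ec2_global.describe_regions()["Regions"]]`, and the returned `errors` mapping can be logged at the end of the run.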
- Tag-based identification:
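TAG_KEYS = ["<tag>", "<tag>"]

tags = {t["Key"]: t["Value"] for t in s.get("Tags", [])}
if not any(k in tags for k in TAG_KEYS):
    continue
Only snapshots that carry at least one of the configured tag keys are considered, so snapshots created by other tools or workflows are left untouched.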
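The tag check can also be pushed to the EC2 API so that only matching snapshots are returned, which helps in accounts with very large snapshot counts. The sketch below uses the `tag-key` filter and the `describe_snapshots` paginator; the helper names are hypothetical, and `ec2` is assumed to be a regional boto3 EC2 client.

```python
def tag_key_filters(tag_keys):
    """Build an EC2 filter that matches resources carrying any of the tag keys."""
    return [{"Name": "tag-key", "Values": list(tag_keys)}]


def list_tagged_snapshots(ec2, tag_keys):
    """Return all self-owned snapshots that have at least one of tag_keys,
    following pagination so large result sets are not truncated."""
    paginator = ec2.get_paginator("describe_snapshots")
    snapshots = []
    pages = paginator.paginate(OwnerIds=["self"], Filters=tag_key_filters(tag_keys))
    for page in pages:
        snapshots.extend(page["Snapshots"])
    return snapshots
```

Keeping the client-side tag check as well is a cheap second safeguard if the filter is ever changed by mistake.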
- AMI protection:
ami_snaps = {
    b["Ebs"]["SnapshotId"]
    for img in ec2.describe_images(Owners=["self"])["Images"]
    for b in img.get("BlockDeviceMappings", [])
    if "SnapshotId" in b.get("Ebs", {})   # blank EBS volumes carry no SnapshotId
}
Before deletion, the script identifies snapshots used by AMIs, preventing accidental removal of critical system images.
- Age-based retention:
AGE_DAYS = 7
cutoff = datetime.datetime.utcnow() - datetime.timedelta(days=AGE_DAYS)

if start > cutoff:
    continue
Configurable retention period protects recent snapshots while cleaning up older, orphaned resources.
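The age check can also be written with timezone-aware datetimes, which avoids stripping `tzinfo` from the `StartTime` values boto3 returns and avoids `datetime.utcnow()`, which is deprecated as of Python 3.12. This is a minimal sketch; `is_older_than` is a hypothetical helper, and it assumes any naive timestamp passed in is already in UTC.

```python
from datetime import datetime, timedelta, timezone


def is_older_than(start_time, age_days, now=None):
    """Return True if start_time is at least age_days before now (UTC)."""
    now = now or datetime.now(timezone.utc)
    if start_time.tzinfo is None:
        # Assumption: naive timestamps are treated as UTC.
        start_time = start_time.replace(tzinfo=timezone.utc)
    return start_time <= now - timedelta(days=age_days)
```

Passing `now` explicitly keeps the function deterministic in tests.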
- Dry-run safety:
DRY_RUN = True

if DRY_RUN:
    print(f"DRY_RUN: would delete {msg}")
else:
    ec2.delete_snapshot(SnapshotId=s["SnapshotId"])
Built-in dry-run capability allows safe testing and validation before actual deletion.
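One way to switch between dry-run and live mode without editing the code is to read the settings from Lambda environment variables. The sketch below defaults to the safe choice when a variable is missing; the variable names `DRY_RUN`, `AGE_DAYS`, and `TAG_KEYS` and the helper `load_config` are assumptions for this example.

```python
import os


def load_config(env=None):
    """Read cleanup settings from environment variables.

    Deletion is enabled only when DRY_RUN is explicitly set to "false";
    any other value, or no value at all, keeps dry-run mode on.
    """
    env = os.environ if env is None else env
    dry_run = env.get("DRY_RUN", "true").strip().lower() != "false"
    age_days = int(env.get("AGE_DAYS", "7"))
    raw_keys = env.get("TAG_KEYS", "")
    tag_keys = [k.strip() for k in raw_keys.split(",") if k.strip()]
    return {"dry_run": dry_run, "age_days": age_days, "tag_keys": tag_keys}
```

Treating every value except an explicit "false" as dry-run means a typo in the configuration cannot enable deletion by accident.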
Required IAM Permissions
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeRegions",
        "ec2:DescribeSnapshots",
        "ec2:DescribeImages",
        "ec2:DeleteSnapshot"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Resource": "arn:aws:logs:*:*:*"
    }
  ]
}
EventBridge Scheduling
{
  "ScheduleExpression": "rate(7 days)",
  "State": "ENABLED",
  "Targets": [
    {
      "Id": "SnapshotCleanupTarget",
      "Arn": "arn:aws:lambda:region:account:function:snapshot-cleanup"
    }
  ]
}
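The same schedule can be created from Python with the EventBridge `put_rule` and `put_targets` calls. This is a sketch under stated assumptions: `events` is a boto3 `events` client, `lambda_arn` is the ARN of the deployed cleanup function, and the rule name is hypothetical.

```python
def schedule_cleanup(events, lambda_arn, rule_name="snapshot-cleanup-weekly"):
    """Create or update a weekly rule and point it at the cleanup function.

    Returns the ARN of the rule.
    """
    rule = events.put_rule(
        Name=rule_name,
        ScheduleExpression="rate(7 days)",
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "SnapshotCleanupTarget", "Arn": lambda_arn}],
    )
    return rule["RuleArn"]
```

The Lambda function also needs a resource-based permission that allows `events.amazonaws.com` to invoke it; that step is not shown here.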
Testing Strategy
- Start with DRY_RUN = True to validate targeting logic
- Test in a non-production account first
- Monitor CloudWatch logs for execution details
- Validate cost impact before full deployment
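To review a dry run before enabling deletion, the log lines the script prints can be summarized per Region. The sketch below parses the `DRY_RUN: would delete ...` message format used by the script; `summarize_dry_run` is a hypothetical helper, and it assumes the log lines have already been exported from CloudWatch Logs as plain strings.

```python
import re
from collections import Counter

# Matches the message the script prints in dry-run mode.
DRY_RUN_LINE = re.compile(r"DRY_RUN: would delete (snap-\w+) Region=(\S+)")


def summarize_dry_run(log_lines):
    """Count the snapshots a dry run would delete, grouped by Region."""
    counts = Counter()
    for line in log_lines:
        match = DRY_RUN_LINE.search(line)
        if match:
            counts[match.group(2)] += 1
    return dict(counts)
```

An unexpectedly large count for one Region is a signal to recheck the tag keys and retention period before setting DRY_RUN to False.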