
Automated Cross-Region Cleanup of Orphaned EBS Snapshots

5 minute read
Content level: Intermediate

EBS snapshots created by Kubernetes backup tools often become orphaned when backup retention policies expire or backups are manually deleted, and their storage costs then accumulate across multiple AWS Regions. Once thousands of orphaned snapshots exist, cleaning them up by hand becomes significant operational overhead.

This article provides a practical, automated solution for identifying and cleaning up orphaned EBS snapshots across every AWS Region enabled for an account.

The Hidden Cost of Kubernetes Backup Automation

When implementing backup strategies for Amazon EKS workloads, organizations often deploy automated backup solutions that create EBS snapshots to protect persistent volumes. However, many discover an unexpected consequence: while these backup tools efficiently manage retention at the application level, the underlying Amazon EBS snapshots often persist beyond their intended lifecycle.

Security Concerns

Beyond cost, snapshots that nobody is tracking carry security and compliance risk:

  • Potential sensitive data exposure
  • Outdated snapshots may contain vulnerable configurations
  • Compliance risks if snapshots contain regulated data

Why Snapshots Become Orphaned

Backup solutions using Kubernetes CSI drivers create EBS snapshots through the AWS API to protect persistent volumes. The disconnect occurs when:

  • Application-level retention policies expire and remove backup metadata
  • Manual deletion of backup objects doesn't trigger underlying AWS resource cleanup
  • Cross-region backup strategies create snapshots in multiple locations
  • Backup tool lifecycle management operates independently of AWS resource lifecycle

The Cost Impact

EBS snapshots follow an incremental storage model, but even incremental costs accumulate significantly:

  • Snapshots persist indefinitely without intervention
  • Multi-region deployments multiply storage costs
  • Lack of visibility makes cost attribution difficult
  • Manual cleanup becomes operationally prohibitive at scale

Organizations commonly discover thousands of orphaned snapshots that cost hundreds or thousands of dollars per month in storage, a cost that can be removed without affecting backup integrity.
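To put a rough number on this, the sketch below sums the `VolumeSize` field (in GiB, as returned by `describe_snapshots`) over a list of snapshot records and multiplies it by a per-GB-month price. Because snapshots are incremental, the full volume size overstates what is actually stored, so the result is only an upper bound. The function name and the default price of 0.05 USD are assumptions for illustration; check current EBS snapshot pricing for your Region.

```python
def estimate_monthly_cost_upper_bound(snapshots, price_per_gb_month=0.05):
    """Upper bound on monthly snapshot storage cost in USD.

    Uses each snapshot's source volume size, which is larger than or equal
    to the incremental data the snapshot actually stores.
    """
    total_gib = sum(s.get("VolumeSize", 0) for s in snapshots)
    return total_gib * price_per_gb_month


# Hand-built records shaped like describe_snapshots output
sample = [
    {"SnapshotId": "snap-aaa", "VolumeSize": 100},
    {"SnapshotId": "snap-bbb", "VolumeSize": 250},
]
print(f"Upper bound: {estimate_monthly_cost_upper_bound(sample):.2f} USD/month")
```

Running this per Region against the dry-run candidates gives a ceiling on the savings to expect, not an exact figure.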

Automated Solution

The solution takes a layered approach to orphaned snapshot cleanup that puts safety first while still recovering as much cost as possible.

Core Components

  • Multi-region scanning: Automatically processes every AWS Region enabled for the account
  • Tag-based identification: Targets specific backup tool snapshots
  • AMI protection: Prevents deletion of snapshots used by Amazon Machine Images
  • Configurable retention: Age-based filtering with customizable policies
  • Comprehensive logging: Full audit trail for compliance and troubleshooting

Safety-First Design

The automation includes multiple safeguards:

  • Dry-run capability for validation before production deployment
  • AMI association checks to protect critical system images
  • Tag-based filtering to ensure only intended snapshots are processed
  • Error handling that continues processing despite individual failures
  • Detailed logging for complete visibility into all operations

This approach ensures organizations can confidently automate cleanup while maintaining the integrity of their backup and system recovery capabilities.

The Implementation

Here's the complete Python script that powers this solution:

import datetime

import boto3

# ---- CONFIG ----
AGE_DAYS = 7                   # delete snapshots older than this many days
TAG_KEYS = ["<tag>", "<tag>"]  # a snapshot must carry at least one of these tag keys
DRY_RUN = True                 # set to False to actually delete
# ----------------


def lambda_handler(event, context):
    # Timezone-aware cutoff; boto3 returns StartTime as a timezone-aware datetime
    now = datetime.datetime.now(datetime.timezone.utc)
    cutoff = now - datetime.timedelta(days=AGE_DAYS)
    ec2_global = boto3.client("ec2")
    deleted = []

    # describe_regions() returns the Regions enabled for this account
    for r in ec2_global.describe_regions()["Regions"]:
        region = r["RegionName"]
        ec2 = boto3.client("ec2", region_name=region)

        # collect EBS snapshot IDs used by AMIs so we don't delete them
        ami_snaps = {
            b["Ebs"]["SnapshotId"]
            for img in ec2.describe_images(Owners=["self"])["Images"]
            for b in img.get("BlockDeviceMappings", [])
            if "SnapshotId" in b.get("Ebs", {})
        }

        # paginate so accounts with thousands of snapshots are read page by page
        paginator = ec2.get_paginator("describe_snapshots")
        for page in paginator.paginate(OwnerIds=["self"]):
            for s in page["Snapshots"]:
                start = s["StartTime"]
                # skip snapshots newer than the retention cutoff
                if start > cutoff:
                    continue
                # skip snapshots referenced by an AMI
                if s["SnapshotId"] in ami_snaps:
                    continue
                # must contain at least one of the tag keys
                tags = {t["Key"]: t["Value"] for t in s.get("Tags", [])}
                if not any(k in tags for k in TAG_KEYS):
                    continue

                msg = f"{s['SnapshotId']}    Region={region}    Time={start}    Tags={tags}"
                if DRY_RUN:
                    print(f"DRY_RUN: would delete {msg}")
                else:
                    try:
                        ec2.delete_snapshot(SnapshotId=s["SnapshotId"])
                        print(f"DELETED: {msg}")
                        deleted.append(msg)
                    except Exception as e:
                        # log and keep going so one failure does not stop the run
                        print(f"ERROR deleting {s['SnapshotId']}: {e}")

    return {"deleted": deleted}

Key Features and Safety Measures

  • Multi-region scanning:
for r in ec2_global.describe_regions()["Regions"]:
    region = r["RegionName"]
    ec2 = boto3.client("ec2", region_name=region)

The script automatically discovers and processes every AWS Region enabled for the account, so no per-Region configuration is needed.

  • Tag-based identification:
TAG_KEYS = ["<tag>", "<tag>"]
tags = {t["Key"]: t["Value"] for t in s.get("Tags", [])}
if not any(k in tags for k in TAG_KEYS):
    continue

Only snapshots that carry at least one of the configured tag keys are considered, so snapshots created by other tools or by hand are left alone.

  • AMI protection:
ami_snaps = {
    b["Ebs"]["SnapshotId"]
    for img in ec2.describe_images(Owners=["self"])["Images"]
    for b in img.get("BlockDeviceMappings", [])
    if "SnapshotId" in b.get("Ebs", {})
}

Before deletion, the script identifies snapshots used by AMIs, preventing accidental removal of critical system images.

  • Age-based retention:
AGE_DAYS = 7
cutoff = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=AGE_DAYS)
start = s["StartTime"]  # timezone-aware datetime returned by boto3
if start > cutoff:
    continue

Configurable retention period protects recent snapshots while cleaning up older, orphaned resources.

  • Dry-run safety:
DRY_RUN = True
if DRY_RUN:
    print(f"DRY_RUN: would delete {msg}")
else:
    ec2.delete_snapshot(SnapshotId=s["SnapshotId"])

Built-in dry-run capability allows safe testing and validation before actual deletion.
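The age, AMI, and tag checks described above can also be collected into one pure function, which makes the targeting logic testable without any AWS calls. The following is a minimal sketch, not part of the original script: the name `is_deletion_candidate` and the tag key `backup-tool` are hypothetical, and the records are hand-built to look like `describe_snapshots` output.

```python
import datetime


def is_deletion_candidate(snapshot, cutoff, ami_snapshot_ids, tag_keys):
    """Return True only if the snapshot passes every safety check."""
    if snapshot["StartTime"] > cutoff:
        return False  # newer than the retention cutoff
    if snapshot["SnapshotId"] in ami_snapshot_ids:
        return False  # still backing an AMI
    present = {t["Key"] for t in snapshot.get("Tags", [])}
    return any(k in present for k in tag_keys)  # must carry a target tag


# Hand-built records; "backup-tool" is a hypothetical tag key
UTC = datetime.timezone.utc
cutoff = datetime.datetime(2024, 1, 8, tzinfo=UTC)
old_tagged = {
    "SnapshotId": "snap-old-tagged",
    "StartTime": datetime.datetime(2024, 1, 1, tzinfo=UTC),
    "Tags": [{"Key": "backup-tool", "Value": "example"}],
}
recent_tagged = {
    "SnapshotId": "snap-recent-tagged",
    "StartTime": datetime.datetime(2024, 1, 10, tzinfo=UTC),
    "Tags": [{"Key": "backup-tool", "Value": "example"}],
}
old_untagged = {
    "SnapshotId": "snap-old-untagged",
    "StartTime": datetime.datetime(2024, 1, 1, tzinfo=UTC),
}

for snap in (old_tagged, recent_tagged, old_untagged):
    verdict = is_deletion_candidate(snap, cutoff, set(), ["backup-tool"])
    print(snap["SnapshotId"], verdict)
```

Keeping the decision separate from the `delete_snapshot` call means a change to the retention or tag rules can be checked with plain assertions before it ever runs against an account.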

Required IAM Permissions

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeRegions",
                "ec2:DescribeSnapshots",
                "ec2:DescribeImages",
                "ec2:DeleteSnapshot"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:*:*:*"
        }
    ]
}
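
One possible tightening of the policy above, offered as a sketch rather than a requirement: scope `ec2:DeleteSnapshot` to snapshots that actually carry one of the target tag keys by using the `aws:ResourceTag/<key>` condition key with the `Null` operator. The helper name `build_cleanup_policy` is hypothetical, and the policy it emits should be validated in a test account before use.

```python
import json


def build_cleanup_policy(tag_keys):
    """Build a policy that allows deletion only for snapshots carrying a target tag."""
    statements = [
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeRegions",
                "ec2:DescribeSnapshots",
                "ec2:DescribeImages",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": "arn:aws:logs:*:*:*",
        },
    ]
    # One statement per tag key: a snapshot matching any statement may be deleted
    for key in tag_keys:
        statements.append(
            {
                "Effect": "Allow",
                "Action": "ec2:DeleteSnapshot",
                "Resource": "arn:aws:ec2:*::snapshot/*",
                # "Null": "false" means the tag key must be present
                "Condition": {"Null": {f"aws:ResourceTag/{key}": "false"}},
            }
        )
    return {"Version": "2012-10-17", "Statement": statements}


# "<tag>" is the same placeholder used in TAG_KEYS
print(json.dumps(build_cleanup_policy(["<tag>"]), indent=4))
```

With this variant, a bug in the script's tag filter results in an access-denied error instead of a deleted snapshot.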

EventBridge Scheduling

{
    "ScheduleExpression": "rate(7 days)",
    "State": "ENABLED",
    "Targets": [
        {
            "Id": "SnapshotCleanupTarget",
            "Arn": "arn:aws:lambda:region:account:function:snapshot-cleanup"
        }
    ]
}
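
The same schedule can be created programmatically. The sketch below assumes the Lambda function already exists and takes the boto3 `events` and `lambda` clients as arguments so that it can be exercised with stubs; the function name `schedule_cleanup` and the rule name `snapshot-cleanup-weekly` are hypothetical. It uses the `put_rule`, `add_permission`, and `put_targets` calls.

```python
def schedule_cleanup(events_client, lambda_client, function_arn,
                     rule_name="snapshot-cleanup-weekly"):
    """Create a weekly EventBridge rule that invokes the cleanup function."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression="rate(7 days)",
        State="ENABLED",
    )
    # Allow EventBridge to invoke the function, restricted to this rule
    lambda_client.add_permission(
        FunctionName=function_arn,
        StatementId=f"{rule_name}-invoke",
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "SnapshotCleanupTarget", "Arn": function_arn}],
    )
    return rule["RuleArn"]


# Usage against a real account (requires boto3 and credentials):
#   import boto3
#   schedule_cleanup(
#       boto3.client("events"),
#       boto3.client("lambda"),
#       "arn:aws:lambda:region:account:function:snapshot-cleanup",
#   )
```

Note that `add_permission` fails if a statement with the same `StatementId` already exists, so a second run needs a different ID or the old statement removed first.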

Testing Strategy

  • Start with DRY_RUN = True to validate targeting logic
  • Test in a non-production account first
  • Monitor CloudWatch Logs for execution details
  • Validate cost impact before full deployment
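
To review a dry run before switching to real deletion, the log output can be summarized per Region. The sketch below assumes the log lines have been exported as plain strings and follow the `DRY_RUN: would delete ...` format printed by the script; the function name `summarize_dry_run` and the sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the script's dry-run message: "<id>    Region=<region>    ..."
DRY_RUN_PATTERN = re.compile(r"DRY_RUN: would delete (snap-\S+)\s+Region=(\S+)")


def summarize_dry_run(log_lines):
    """Count dry-run deletion candidates per Region."""
    per_region = Counter()
    for line in log_lines:
        match = DRY_RUN_PATTERN.search(line)
        if match:
            per_region[match.group(2)] += 1
    return per_region


sample_lines = [
    "DRY_RUN: would delete snap-0a1    Region=us-east-1    Time=2024-01-01 00:00:00+00:00    Tags={'backup-tool': 'x'}",
    "DRY_RUN: would delete snap-0b2    Region=us-east-1    Time=2024-01-02 00:00:00+00:00    Tags={'backup-tool': 'x'}",
    "DRY_RUN: would delete snap-0c3    Region=eu-west-1    Time=2024-01-03 00:00:00+00:00    Tags={'backup-tool': 'x'}",
    "ERROR deleting snap-0d4: example error",
]
for region, count in sorted(summarize_dry_run(sample_lines).items()):
    print(f"{region}: {count} candidate(s)")
```

An unexpectedly high count in one Region is a signal to inspect the tag keys and retention period before setting DRY_RUN to False.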