Orphaned S3 objects are never cleaned up after all backups are deleted #19

@mpryc

Description

Problem

When a user deletes all their backups, the underlying data objects (pack blobs) remain permanently on S3 storage and are never cleaned up. This leads to unnecessary storage costs and stale data accumulating indefinitely.

Why this happens

OADP-VMDP is built on Kopia, which has a multi-stage garbage collection (GC) pipeline to clean up unreferenced storage objects. However, the way the CLI is currently wired, this pipeline can never fully complete after backup deletion.

There are two independent issues at play:

1. Auto-maintenance uses SafetyFull which requires multiple days to complete

After every backup create or backup delete, auto-maintenance runs via cli/app.go:590:

return snapshotmaintenance.Run(ctx, w, maintenance.ModeAuto, false, maintenance.SafetyFull)

SafetyFull is designed for multi-client Kopia servers where many users may be reading/writing concurrently. It enforces conservative timing delays defined in repo/maintenance/maintenance_safety.go:

Parameter                        Value     Effect
-------------------------------  --------  ----------------------------------------------------------------
MinContentAgeSubjectToGC         24 hours  Content must be unreferenced for 24h before GC touches it
PackDeleteMinAge                 24 hours  Orphaned blobs must be 24h old before deletion
MarginBetweenSnapshotGC          4 hours   Must wait 4h between GC cycles
RequireTwoGCCycles               true      Two successful GC cycles required before deleted content is dropped
MinRewriteToOrphanDeletionDelay  1 hour    Must wait 1h after a content rewrite before deleting orphaned blobs

This means that even when auto-maintenance does trigger, it takes multiple days of continued CLI usage before all orphaned S3 data is actually deleted.

2. After deleting the last backup, auto-maintenance has no future trigger

Auto-maintenance only runs as a side-effect of repositoryWriterAction operations (backup create, backup delete, etc.). Once the user deletes their last backup and walks away, no more write operations happen, so auto-maintenance never triggers again. The orphaned blobs remain on S3 permanently.

User scenario that demonstrates the problem

Step 1: User creates files in /home/user/myfiles
Step 2: oadp-vmdp backup create /home/user/myfiles     → data uploaded to S3
Step 3: User updates files
Step 4: oadp-vmdp backup create /home/user/myfiles     → new/changed data uploaded to S3
Step 5: User removes the directory
Step 6: oadp-vmdp backup create /home/user/myfiles     → empty snapshot, no new data
Step 7: oadp-vmdp backup delete <all-ids> --delete     → snapshot manifests deleted

After Step 7:
  - All snapshot manifests are gone
  - All pack blobs from Steps 2, 4, 6 remain on S3 forever
  - Auto-maintenance ran once with SafetyFull → timing gates prevented any actual cleanup
  - No future CLI operation will trigger another maintenance cycle
  - User has no way to clean up (maintenance/blob commands are not wired in the CLI)

Impact

  • Storage cost: Orphaned blobs accumulate and the user pays for S3 storage indefinitely
  • No user recourse: The maintenance run, blob gc, and policy commands are intentionally not wired into the shipped CLI (cli/app.go:298-309), so the user has no way to trigger cleanup manually
  • Applies to both Linux and Windows VMs: The S3 storage layer is platform-agnostic; both platforms are equally affected

Proposed Solution

Force full maintenance with SafetyNone when the last backup is deleted.

After backup delete completes, check if any snapshots remain in the repository. If none remain, run the full maintenance pipeline with SafetyNone instead of SafetyFull. This will clean up all orphaned S3 objects in a single pass.

Why SafetyNone is safe here

SafetyFull timing delays exist to protect against concurrent Kopia clients reading/writing the same repository. When zero snapshots remain:

  • There is nothing for any concurrent process to be reading
  • Every content and blob is by definition unreferenced
  • OADP-VMDP is a single-agent tool running inside one VM, not a multi-client server

SafetyNone (repo/maintenance/maintenance_safety.go:45-54) skips all timing delays and two-cycle requirements, allowing the full GC pipeline to complete immediately.

What to change

File: cli/command_snapshot_delete.go

In the run method, after all snapshot deletions complete:

  1. Call snapshot.ListSnapshotManifests(ctx, rep, nil, nil) to check if any snapshots remain
  2. If the list is empty, run snapshotmaintenance.Run() with maintenance.ModeFull and maintenance.SafetyNone
  3. Log a user-friendly message like: "No backups remaining. Cleaning up storage... freed X MB (N objects removed)"

The backup delete command currently uses repositoryWriterAction, which calls maybeRunMaintenance() (cli/app.go:572-601) after completion. The forced maintenance should run before this automatic call, or the automatic call should be skipped when the forced maintenance has already run.

Pseudocode

func (c *commandSnapshotDelete) run(ctx context.Context, rep repo.RepositoryWriter) error {
    // ... existing deletion logic ...

    // After all deletions, check whether any snapshots remain.
    remaining, err := snapshot.ListSnapshotManifests(ctx, rep, nil, nil)
    if err != nil {
        return errors.Wrap(err, "error listing remaining snapshots")
    }

    // Only force cleanup when deletion was confirmed (--delete) and the
    // repository is now empty.
    if len(remaining) == 0 && c.snapshotDeleteConfirm {
        log(ctx).Infof("No backups remaining. Cleaning up storage...")

        // Full maintenance needs direct repository access.
        if dr, ok := rep.(repo.DirectRepositoryWriter); ok {
            if err := snapshotmaintenance.Run(ctx, dr, maintenance.ModeFull, true, maintenance.SafetyNone); err != nil {
                log(ctx).Warningf("Storage cleanup completed with warnings: %v", err)
            } else {
                log(ctx).Infof("Storage cleanup completed successfully.")
            }
        }
    }

    return nil
}

User experience after the fix

$ oadp-vmdp backup delete <id> --delete
Deleting snapshot abc123 of user@host:/home/user/myfiles at 2025-01-15 10:30:00...
No backups remaining. Cleaning up storage...
Storage cleanup completed successfully.

No extra commands. No configuration. The user deletes their backups and S3 is cleaned up automatically.


Additional Consideration

A standalone backup cleanup subcommand could also be added for edge cases (interrupted cleanup, auditing storage usage), but automatic cleanup on last-backup deletion is the primary fix: it addresses the core problem without requiring users to learn or remember additional commands.
