Incremental archives — Phases 2-5 (naming + --base + auto-defer + chain extract) by Xitee1 · Pull Request #23 · Xitee1/bd-archiver

Xitee1 · 2026-05-12T16:44:36Z

Summary

Phases 2 through 5 of the incremental-archives feature. Stacks on top of #20 (Phase 1: catalog policy + persistence). Once #20 merges, this PR rebases against main automatically.

Phase 2 (naming + label): the dar archive name is now `<-n>-gen<N>`; volume labels become `<truncated_name>_G<NN>_<NNNN>` (gen and disc are visible at a glance; the name truncates to 23 chars in the label only, while filenames inside the ISO keep the full name).
Phase 3 (`--base`): `bd-archive create` accepts `--base <catalog.dar>` pointing at a previous generation's persisted catalog. The run becomes Gen N+1 (N is parsed from the base filename); dar gets `-A` for incremental mode; the preview shows the delta size instead of the full source size.
Phase 4 (`--min-last-disc-fill`): auto-defers newest-by-mtime files until the last-disc fill reaches the threshold. The pool is conservative: for incrementals, only files not already in the base catalog are candidates (no risk of losing files that just had their mtime touched on disk).
Phase 5 (chain-aware extract): `bd-archive extract` restores a full chain in one run. Discs may be inserted in any order; the tool detects each disc's generation from filenames and, at the end, runs one `dar -x` per generation in order (subsequent gens with `-wa` so they overwrite earlier-gen files). Backward-compatible with pre-feature legacy archives (treated as Gen 1).

Design context

Full spec: `docs/superpowers/specs/2026-05-12-incremental-backups-design.md` (committed in #20). All four phases here implement that spec; Phase 5 deviates slightly from the spec's "single dar -x" model by doing one call per generation, which was verified to handle additions, modifications, AND deletions correctly in dar 2.7.17 with `-wa`.

Code review

A comprehensive code review of the full diff caught:

A NameError in the auto-defer path when the pool was empty (incremental where all source files are already in base).
A nonsense fill % display when archive_est was 0 (delta-empty incremental).
Missing range validation on `--min-last-disc-fill`.

All three fixed in the final commit. The reviewer's other flagged item (per-gen extract vs. single-call) was verified empirically to not be a real semantic gap — dar's `-wa` per-gen does honour deletion entries.

Test plan

`ruff check src/bd_archive/` clean
Phase 2 e2e: 2-disc archive, volume labels show `phase2test_G01_0001` / `phase2test_G01_0002`; files inside ISO are `phase2test-gen1.NNNN.dar`; local catalog persists as `phase2test-gen1-catalog.0001.dar`.
Phase 3 e2e: Gen 1 of 50 MiB → 2 discs. Add 15 MiB of new files. Gen 2 with `--base` → single disc, 15 MiB delta only, `phase3test-gen2.0001.dar` and `phase3test-gen2-catalog.0001.dar` on disc.
Phase 4 e2e: Gen 1 50 MiB, add 60 MiB. Gen 2 baseline: 2 discs, last 46%. Gen 2 with `--min-last-disc-fill 50`: 2 files deferred (20 MiB), 1 disc, 94%.
Phase 5 chain extract smoketest: built 2-gen chain (5 files + 1 modified + 1 new in gen 2), staged all slices into one dir, called `extract_sequential` for each gen in order. Result: byte-identical to source. Also verified separately that dar's per-gen `-wa` correctly removes a deleted file from gen 2.
Review fixes: empty-pool incremental + `--min-last-disc-fill 80` no longer crashes — proceeds with a clean info message; `--min-last-disc-fill 150` rejected with a clear error.
Real-disc e2e (manual user verification): extract.py's disc-mount + prompt + verify loop is unchanged in mechanics, just refactored to track per-gen state. Needs a multi-gen chain on real BD-R media to fully exercise.

🤖 Generated with Claude Code

Phase 2 of incremental-archives.

Internal dar archive name is now <name>-gen<N> (every new full is Gen 1; Phase 3 derives higher N from --base). Volume labels switch to <truncated_name>_G<NN>_<NNNN>: the gen suffix lives in the label, and the human-meaningful name truncates to 23 chars if longer. Filenames inside the ISO keep the full name.

Pre-Phase-2 (legacy) Gen 1 archives are unaffected: their old labels and naming stay on the burned discs. New archives produced from this phase onward carry the new scheme.

Why the truncation tradeoff: physically distinguishing Gen 1 Disc 1 from Gen 2 Disc 1 of the same chain is more useful than seeing the last few characters of an already-known archive name. The archive name acts as the chain identity (see project README, updated in a later phase); the user maintains it by keeping `-n` constant across generations.

README on disc gains a Generation line and a CHAIN: hint explaining the name-consistency rule.

Manual e2e verified: phase2test_G01_0001 / phase2test_G01_0002 labels on a 2-disc set; phase2test-gen1.NNNN.dar slices on the discs via UDF; phase2test-gen1-catalog.0001.dar persisted to output_dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
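The naming rules above can be sketched in a few lines. This is only an illustration, not the code from the commit: the helper names `dar_basename` and `volume_label` and the constant values are assumptions, chosen so that a 23-character name plus the 9-character `_G<NN>_<NNNN>` suffix fills the 32-character ISO 9660 volume identifier.

```python
# Hypothetical helpers; names and constant values are assumptions for illustration.
LABEL_TOTAL_MAX = 32                    # ISO 9660 volume identifier length
LABEL_SUFFIX_LEN = len("_G01_0001")     # 9 characters: _G<NN>_<NNNN>
LABEL_NAME_MAX = LABEL_TOTAL_MAX - LABEL_SUFFIX_LEN  # 23 characters left for the name


def dar_basename(name: str, gen: int) -> str:
    """Internal dar archive name: <name>-gen<N>, never truncated."""
    return f"{name}-gen{gen}"


def volume_label(name: str, gen: int, disc: int) -> str:
    """Volume label: <truncated_name>_G<NN>_<NNNN>; only the name is truncated."""
    if not 1 <= gen <= 99:
        raise ValueError("generation must fit in two digits")
    if not 1 <= disc <= 9999:
        raise ValueError("disc number must fit in four digits")
    return f"{name[:LABEL_NAME_MAX]}_G{gen:02d}_{disc:04d}"


if __name__ == "__main__":
    print(dar_basename("phase2test", 1))
    print(volume_label("phase2test", 1, 2))
```

Keeping the suffix at a fixed width means the truncation point for the name never depends on the generation or disc number.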

Phase 3 of incremental-archives.

New `--base <catalog.dar>` flag on `bd-archive create` makes the run produce an incremental archive against the supplied isolated catalog. dar's `-A` flag does the actual work; this commit wires it up end-to-end.

The base catalog filename encodes the predecessor generation (`<name>-gen<N>-catalog.NNNN.dar`), so the new gen number is derived without any sidecar metadata file. Legacy catalogs (pre-Phase-2, filename `<name>-catalog.NNNN.dar`) are treated as Gen 1; the new gen becomes Gen 2.

The pre-archive preview is now base-aware: when --base is given, the estimated archive size reflects only files that are new or modified since the base catalog (tools.dar.list_catalog_paths parses `dar -l` output; mtime > catalog-mtime catches modifications heuristically). Disc-count and last-disc-fill estimates use this delta, not the full source. Without this, an incremental's preview would massively overstate the archive size.

Chain identity is the archive name: a --base whose embedded archive name disagrees with -n fails with a clear error pointing at the mismatch. Keeping the same name across generations is the user's discipline.

archive/dar_archive.py gains parse_dar_filename(), a single regex that handles both Phase-2+ generational filenames and legacy ones. It is used here for --base validation and is reusable by Phase 5's chain detection in extract.

Manual e2e: Gen 1 full of 50 MiB source → 2 discs. Adding 15 MiB of new files and running Gen 2 with --base produced a single-disc incremental containing only the delta (phase3test-gen2.0001.dar of 15 MiB, plus its catalog).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
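A minimal sketch of how a single regex could cover both filename schemes and derive the next generation from a `--base` catalog. The function name `parse_dar_filename` comes from the commit message, but the regex, the `ParsedDarName` fields, and the `next_generation` helper are assumptions made for illustration; the real implementation may differ.

```python
from __future__ import annotations

import os
import re
from dataclasses import dataclass

# Assumed pattern: <name>[-gen<N>][-catalog].<slice>.dar
_DAR_NAME_RE = re.compile(
    r"^(?P<name>.+?)"            # chain name (lazy, so the suffixes below win)
    r"(?:-gen(?P<gen>\d+))?"     # optional generation suffix (absent on legacy files)
    r"(?P<catalog>-catalog)?"    # optional isolated-catalog marker
    r"\.(?P<slice>\d+)\.dar$"    # slice number and extension
)


@dataclass(frozen=True)
class ParsedDarName:
    name: str
    gen: int          # legacy filenames are treated as Gen 1
    is_catalog: bool
    slice_no: int
    legacy: bool


def parse_dar_filename(filename: str) -> ParsedDarName | None:
    """Parse a dar slice or catalog filename; return None if it does not match."""
    m = _DAR_NAME_RE.match(os.path.basename(filename))
    if m is None:
        return None
    gen_text = m.group("gen")
    return ParsedDarName(
        name=m.group("name"),
        gen=int(gen_text) if gen_text else 1,
        is_catalog=m.group("catalog") is not None,
        slice_no=int(m.group("slice")),
        legacy=gen_text is None,
    )


def next_generation(base_catalog: str, archive_name: str) -> int:
    """Hypothetical --base validation: return N+1, or raise on a name mismatch."""
    parsed = parse_dar_filename(base_catalog)
    if parsed is None or not parsed.is_catalog:
        raise ValueError(f"not a catalog filename: {base_catalog}")
    if parsed.name != archive_name:
        raise ValueError(
            f"--base belongs to chain {parsed.name!r}, but -n is {archive_name!r}"
        )
    return parsed.gen + 1
```

Because the generation lives in the filename, no sidecar metadata is needed: `next_generation("photos-catalog.0001.dar", "photos")` yields 2 for a legacy base in this sketch.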

Phase 4 of incremental-archives.

When the projected last-disc fill is below `--min-last-disc-fill PERCENT`, bd-archive automatically defers the newest files until enough has been removed to either drop a disc from the set or empty the candidate pool.

Pool selection is deliberately conservative:

- With `--base`: only files whose relative path is not in the base catalog (truly new). Determined via `dar -l <base>` parse, so files that have merely had their mtime touched on disk stay in the archive (no silent loss across generations).
- Without `--base` (Full): all files are candidates, with a loud warning that deferred files won't be archived until a future incremental run picks them up.

The preview block now shows what would be deferred (file count, byte count, oldest mtime, sample paths) BEFORE the confirm prompt, so the user can abort if the plan looks wrong.

When the threshold is unreachable (entire pool deferred without ever crossing the fill threshold), the run still proceeds with the partial deferral; the user gets a warning, not an abort. The only fatal case is "deferring everything would archive zero bytes", which exits 1.

archive/source_scan.py grows a SourceFile dataclass and list_source_files() walker, separate from scan_source's aggregate view because the defer algorithm needs per-file rel_path/size/mtime.

tools/dar.py::create_sliced grows an `excludes` parameter that turns each entry into a `-P <path>` flag, with dar -P being the relative-subpath exclude operator.

Manual e2e: Gen 1 of 50 MiB, then 60 MiB delta. Without --min-last-disc-fill: 2 discs, last disc 46%. With --min-last-disc-fill 50: 20 MiB deferred (2 files), single disc, last fill 94%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
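The deferral loop described above might look roughly like the following sketch. It is not the code from this commit: the `layout` model ignores catalog and par2 overhead, and `plan_deferral` and `exclude_args` are hypothetical names. Only the `SourceFile` fields and the `-P <path>` mapping are taken from the commit message.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    rel_path: str
    size: int      # bytes
    mtime: float   # seconds since the epoch


def layout(total: int, capacity: int) -> tuple[int, float]:
    """Simplified model: (disc count, last-disc fill in percent), no overhead."""
    if total <= 0:
        return 0, 0.0
    discs = math.ceil(total / capacity)
    last = total - (discs - 1) * capacity
    return discs, 100.0 * last / capacity


def plan_deferral(files, base_paths, capacity, min_fill):
    """Defer newest files until the last-disc fill reaches min_fill or the pool is empty.

    `files` are the files that would be archived; `base_paths` is the set of
    relative paths in the base catalog, or None for a full archive.
    """
    total = sum(f.size for f in files)
    new_n, new_fill = layout(total, capacity)  # pre-defer layout, valid even if the loop never runs
    deferred: list[SourceFile] = []
    if new_fill >= min_fill:
        return deferred, new_n, new_fill
    # Conservative pool: with a base, only files whose path is not in the catalog.
    pool = files if base_paths is None else [f for f in files if f.rel_path not in base_paths]
    for f in sorted(pool, key=lambda f: f.mtime, reverse=True):  # newest first
        deferred.append(f)
        total -= f.size
        new_n, new_fill = layout(total, capacity)
        if new_fill >= min_fill:
            break
    if total <= 0:
        raise ValueError("deferring everything would archive zero bytes")
    return deferred, new_n, new_fill


def exclude_args(deferred) -> list[str]:
    """Turn each deferred file into a `-P <path>` pair for the dar command line."""
    args: list[str] = []
    for f in deferred:
        args += ["-P", f.rel_path]
    return args
```

In this model the fill can only rise when a disc drops out of the set, which is why reaching the threshold and dropping a disc coincide.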

Phase 5 of incremental-archives.

`bd-archive extract` now restores an entire incremental chain (Gen 1 + all subsequent gens) in a single invocation. Previously it could only restore one archive set.

User flow:

- User runs `bd-archive extract -o ./restored`.
- Tool prompts for discs one at a time. Each disc's filenames are parsed (via archive.dar_archive.parse_dar_filename) to detect the chain name and which generation that disc belongs to. Order doesn't matter; discs from any gen, any order, all accepted.
- All slices land in one flat staging dir. Different generations have different dar basenames (photos-gen1, photos-gen2, …), so they coexist without collision.
- When the user says "no more discs", the tool runs `dar -x` once per generation in order. The first gen extracts into the clean output; later gens use dar's -wa flag to overwrite files that earlier gens already wrote (later gens carry the newer content).

The current archive_name variable is replaced with two pieces of state: chain_name (the -n value, identical across all gens) and a gen→dar_basename mapping (because legacy pre-Phase-2 gen 1 archives have basename "photos" while new ones have "photos-gen1").

Per-generation catalog verification: each gen has its own catalog file with a different basename, so the "verified" flag is now a dict keyed by gen number rather than a single bool. A disc that fails its gen's catalog sha512 drops it from staging so the next disc of the same gen can refetch; this is the same convergence logic as before, just generation-scoped. The damage path (par2 repair) is unchanged in mechanics.

tools/dar.py::extract_sequential grows an `overwrite` parameter that toggles dar's `-wa` flag. Required for chain extracts where gen N's data replaces gen N-1's; no effect on the first gen, which extracts into an empty output dir.

Smoketest: built a 2-gen chain (5 original files + 1 sub-dir file, then 1 modified + 1 new in gen 2), invoked extract_sequential for each gen in order against staged slices. diff -rq between source and restored output: byte-identical, no differences.

Disc-mounting flow (prompt/mount/copy/verify) is preserved from the previous implementation; refactored to track per-gen state but the per-disc UX is the same. A full e2e against a real optical drive remains a manual user verification step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
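The final "one `dar -x` per generation, in order" step can be sketched as a small command planner. This is an illustration only: `plan_chain_extract` is a hypothetical name, the argv is deliberately simplified (the real `extract_sequential` wrapper likely passes further options), and only `-x`, `-R`, and `-wa` are used, as described in the commit message.

```python
from __future__ import annotations

from pathlib import PurePosixPath


def plan_chain_extract(
    gen_basenames: dict[int, str],
    staging_dir: str,
    output_dir: str,
) -> list[list[str]]:
    """Build one dar command per generation, oldest generation first.

    `gen_basenames` maps generation number to dar basename, e.g.
    {1: "photos", 2: "photos-gen2"} for a chain with a legacy Gen 1.
    """
    commands: list[list[str]] = []
    for index, gen in enumerate(sorted(gen_basenames)):
        archive = str(PurePosixPath(staging_dir) / gen_basenames[gen])
        cmd = ["dar", "-x", archive, "-R", output_dir]
        if index > 0:
            # Later generations overwrite what earlier ones restored.
            cmd.append("-wa")
        commands.append(cmd)
    return commands


if __name__ == "__main__":
    # Discs may arrive in any order; the plan is always sorted by generation.
    for cmd in plan_chain_extract({2: "photos-gen2", 1: "photos"}, "/stage", "/restored"):
        print(" ".join(cmd))
```

Keying the mapping by generation number rather than by basename is what lets a legacy `photos` Gen 1 and a new-style `photos-gen2` live in the same chain.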

README gains an "Adding an incremental generation" section with the --base workflow and --min-last-disc-fill explanation, plus updates the extract section to describe whole-chain restore. Adds a "Chain identity = archive name" callout near the top to explain the discipline of keeping -n constant across generations.

AGENTS.md create / extract architecture descriptions are rewritten to cover the new naming scheme, --base flow, list_catalog_paths / scan_delta_bytes, auto-defer pool semantics, per-gen catalog state, and dar -x -wa chain restore. Layout section notes the new constants (ISO9660_LABEL_NAME_MAX / _SUFFIX_LEN), new helpers (parse_dar_filename, list_source_files / SourceFile, scan_delta_bytes), and the extended dar wrapper surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses findings from the comprehensive code review of phases 2-5:

- Critical: NameError when the auto-defer pool is empty (incremental with --min-last-disc-fill where every source file is already in the base catalog). The fallthrough branch referenced new_n / new_last / new_fill which only existed if the loop body ran. Initialise them to the pre-defer layout before the loop, and split the "pool was empty" path from the "deferred everything to zero" path with distinct messages.
- Important: _layout(0) returned a nonsense 110% fill for the delta-empty incremental case (no new files; only catalog + par2 overhead on the disc). Add an early return for est==0 computing fill from the fixed overhead so the preview makes sense.
- Minor: range-validate --min-last-disc-fill at the top of cmd_create so 150 surfaces "must be 0-100" rather than silently triggering the threshold-unreachable warning.
- Minor: hoist `from datetime import datetime` to the module-level imports (was inside the auto-defer print block) and split a long f-string to satisfy ruff's line-length rule.

Manual e2e for both regressions verified: empty-pool case now produces a clean info message and proceeds with the original layout, generating a tiny incremental that captures any deletions (which is exactly what dar's incremental does in that scenario). Range validation rejects 150 with a clear error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
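As a rough illustration of the range check, here is a minimal sketch. The function name `validate_min_last_disc_fill` and the exact error wording are assumptions; the commit only states that the check sits at the top of cmd_create and reports "must be 0-100".

```python
from __future__ import annotations


def validate_min_last_disc_fill(value: float | None) -> float | None:
    """Reject thresholds outside 0-100 before any planning work starts.

    None means the flag was not given, so auto-defer stays disabled.
    """
    if value is None:
        return None
    if not 0 <= value <= 100:  # also rejects NaN, which fails every comparison
        raise ValueError(f"--min-last-disc-fill must be 0-100, got {value}")
    return value
```

Failing early keeps an out-of-range value such as 150 from reaching the defer loop, where it would otherwise look like an ordinary unreachable threshold.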

Xitee1 and others added 6 commits May 12, 2026 18:18

Base automatically changed from feat/incremental-phase1-catalog-policy to main May 12, 2026 18:04

Xitee1 added 2 commits May 12, 2026 20:05

lint

62e8340

Merge branch 'main' into feat/incremental-phase2-5

0adc8fa

Xitee1 merged commit 70c1fa2 into main May 13, 2026
1 check passed

Xitee1 deleted the feat/incremental-phase2-5 branch May 19, 2026 21:17

Xitee1 merged 8 commits into main from feat/incremental-phase2-5
