Incremental archives: Phases 2-5 (naming + --base + auto-defer + chain extract) #23
Merged
Phase 2 of incremental-archives.

Internal dar archive name is now `<name>-gen<N>` (every new full is Gen 1; Phase 3 derives higher N from `--base`). Volume labels switch to `<truncated_name>_G<NN>_<NNNN>`: the gen suffix lives in the label, and the human-meaningful name truncates to 23 chars if longer. Filenames inside the ISO keep the full name.

Pre-Phase-2 (legacy) Gen 1 archives are unaffected: their old labels and naming stay on the burned discs. New archives produced from this phase onward carry the new scheme.

Why the truncation tradeoff: physically distinguishing Gen 1 Disc 1 from Gen 2 Disc 1 of the same chain is more useful than seeing the last few characters of an already-known archive name. The archive name acts as the chain identity (see project README, updated in a later phase); the user enforces this by keeping `-n` constant across generations.

README on disc gains a Generation line and a CHAIN: hint explaining the name-consistency rule.

Manual e2e verified: `phase2test_G01_0001` / `phase2test_G01_0002` labels on a 2-disc set; `phase2test-gen1.NNNN.dar` slices on the discs via UDF; `phase2test-gen1-catalog.0001.dar` persisted to output_dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
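The label scheme above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name `make_volume_label` and the constant `LABEL_NAME_MAX` are hypothetical, and the range limits on `gen` and `disc` are an assumption derived from the two-digit and four-digit fields in the label format.

```python
# Hypothetical constant: the commit message states the name truncates to 23 chars.
LABEL_NAME_MAX = 23


def make_volume_label(name: str, gen: int, disc: int) -> str:
    """Build a label of the form <truncated_name>_G<NN>_<NNNN>."""
    # Assumed limits: the label has two digits for the generation
    # and four digits for the disc number.
    if not 1 <= gen <= 99:
        raise ValueError(f"generation out of range: {gen}")
    if not 1 <= disc <= 9999:
        raise ValueError(f"disc number out of range: {disc}")
    # Only the human-meaningful name is truncated; the suffix is always kept whole.
    return f"{name[:LABEL_NAME_MAX]}_G{gen:02d}_{disc:04d}"


print(make_volume_label("phase2test", 1, 2))
```

With a 23-character name limit and the 9-character suffix, the longest label this sketch produces is 32 characters.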
Phase 3 of incremental-archives.

New `--base <catalog.dar>` flag on `bd-archive create` makes the run produce an incremental archive against the supplied isolated catalog. dar's `-A` flag does the actual work; this commit wires it up end-to-end.

The base catalog filename encodes the predecessor generation (`<name>-gen<N>-catalog.NNNN.dar`), so the new gen number is derived without any sidecar metadata file. Legacy catalogs (pre-Phase-2, filename `<name>-catalog.NNNN.dar`) are treated as Gen 1; the new gen becomes Gen 2.

The pre-archive preview is now base-aware: when `--base` is given, the estimated archive size reflects only files that are new or modified since the base catalog (`tools.dar.list_catalog_paths` parses `dar -l` output; mtime > catalog-mtime catches modifications heuristically). Disc-count and last-disc-fill estimates use this delta, not the full source. Without this, an incremental's preview would massively overstate the size.

Chain identity is the archive name: a `--base` whose embedded archive name disagrees with `-n` fails with a clear error pointing at the mismatch. Same name across generations is the user's discipline.

`archive/dar_archive.py` gains `parse_dar_filename()`, a single regex that handles both Phase-2+ generational filenames and legacy ones. Used here for `--base` validation and reusable by Phase 5's chain detection in extract.

Manual e2e: Gen 1 full of 50 MiB source → 2 discs. Adding 15 MiB of new files and running Gen 2 with `--base` produced a single-disc incremental containing only the delta (`phase3test-gen2.0001.dar` of 15 MiB, plus its catalog).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
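A single-regex parser of the kind described above might look like the sketch below. The regex, the `DarFilename` tuple, and the `next_generation` helper are illustrative assumptions; only the filename shapes and the "legacy counts as Gen 1" rule come from the commit message. The real `parse_dar_filename()` may return a different structure.

```python
import re
from typing import NamedTuple, Optional


class DarFilename(NamedTuple):  # hypothetical result type
    name: str
    gen: int
    is_catalog: bool
    slice_no: int
    legacy: bool


# <name>[-gen<N>][-catalog].<NNNN>.dar ; the -gen part is absent on legacy files.
_DAR_RE = re.compile(
    r"^(?P<name>.+?)(?:-gen(?P<gen>\d+))?(?P<catalog>-catalog)?\.(?P<slice>\d+)\.dar$"
)


def parse_dar_filename(filename: str) -> Optional[DarFilename]:
    m = _DAR_RE.match(filename)
    if m is None:
        return None
    legacy = m["gen"] is None
    return DarFilename(
        name=m["name"],
        gen=1 if legacy else int(m["gen"]),  # legacy archives are treated as Gen 1
        is_catalog=m["catalog"] is not None,
        slice_no=int(m["slice"]),
        legacy=legacy,
    )


def next_generation(base_catalog: str, archive_name: str) -> int:
    """Derive the new gen number from a --base catalog filename."""
    parsed = parse_dar_filename(base_catalog)
    if parsed is None or not parsed.is_catalog:
        raise ValueError(f"not a catalog filename: {base_catalog!r}")
    if parsed.name != archive_name:
        raise ValueError(
            f"--base belongs to chain {parsed.name!r}, but -n is {archive_name!r}"
        )
    return parsed.gen + 1


print(next_generation("photos-gen2-catalog.0001.dar", "photos"))
```

One limitation of this sketch: a legacy archive whose own name ends in `-gen<N>` is indistinguishable from a generational one.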
Phase 4 of incremental-archives.

When the projected last-disc fill is below `--min-last-disc-fill PERCENT`, bd-archive automatically defers the newest files until enough has been removed to either drop a disc from the set or empty the candidate pool.

Pool selection is deliberately conservative:

- With `--base`: only files whose relative path is not in the base catalog (truly new). Determined via `dar -l <base>` parse, so files that have merely had their mtime touched on disk stay in the archive (no silent loss across generations).
- Without `--base` (Full): all files are candidates, with a loud warning that deferred files won't be archived until a future incremental run picks them up.

The preview block now shows what would be deferred (file count, byte count, oldest mtime, sample paths) BEFORE the confirm prompt, so the user can abort if the plan looks wrong.

When the threshold is unreachable (entire pool deferred without ever crossing the fill threshold), the run still proceeds with the partial deferral; the user gets a warning, not an abort. The only fatal case is "deferring everything would archive zero bytes", which exits 1.

`archive/source_scan.py` grows a `SourceFile` dataclass and `list_source_files()` walker, separate from `scan_source`'s aggregate view because the defer algorithm needs per-file rel_path/size/mtime.

`tools/dar.py::create_sliced` grows an `excludes` parameter that turns each entry into a `-P <path>` flag, with dar `-P` being the relative-subpath exclude operator.

Manual e2e: Gen 1 of 50 MiB, then 60 MiB delta. Without `--min-last-disc-fill`: 2 discs, last disc 46%. With `--min-last-disc-fill 50`: 20 MiB deferred (2 files), single disc, last fill 94%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
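The defer loop described above can be sketched roughly as follows. Everything here is a simplified assumption: `layout` ignores catalog and par2 overhead, `plan_deferral` and `exclude_args` are hypothetical names, and the real `SourceFile` may carry more fields. The sketch only shows the newest-first ordering, the "stop when a disc drops" rule, and the mapping of deferred paths to `-P` flags.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceFile:
    rel_path: str
    size: int
    mtime: float


def layout(total_bytes: int, disc_capacity: int) -> tuple[int, float]:
    """Return (disc_count, last_disc_fill_percent); overhead is ignored here."""
    if total_bytes <= 0:
        return 0, 0.0
    discs = math.ceil(total_bytes / disc_capacity)
    last = total_bytes - (discs - 1) * disc_capacity
    return discs, 100.0 * last / disc_capacity


def plan_deferral(files, pool, disc_capacity, min_fill):
    """Defer newest pool files until a disc is dropped or the pool is empty."""
    total = sum(f.size for f in files)
    start_discs, start_fill = layout(total, disc_capacity)
    # Initialised before the loop so an empty pool falls through safely.
    discs, fill = start_discs, start_fill
    deferred = []
    if start_fill >= min_fill:
        return deferred, discs, fill
    for f in sorted(pool, key=lambda f: f.mtime, reverse=True):
        deferred.append(f)
        total -= f.size
        discs, fill = layout(total, disc_capacity)
        if discs < start_discs:
            break
    if deferred and total == 0:
        raise ValueError("deferring everything would archive zero bytes")
    return deferred, discs, fill


def exclude_args(deferred):
    """Turn each deferred file into a `-P <path>` pair for the dar command line."""
    args = []
    for f in deferred:
        args += ["-P", f.rel_path]
    return args


files = [
    SourceFile("a.bin", 60, 1.0),
    SourceFile("b.bin", 30, 2.0),
    SourceFile("c.bin", 20, 3.0),
    SourceFile("d.bin", 10, 4.0),
]
deferred, discs, fill = plan_deferral(files, files, disc_capacity=100, min_fill=50)
print([f.rel_path for f in deferred], discs, fill)
```

In the toy run, the two newest files are deferred, which takes the set from two discs to one.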
Phase 5 of incremental-archives.

`bd-archive extract` now restores an entire incremental chain (Gen 1 + all subsequent gens) in a single invocation. Previously it could only restore one archive set.

User flow:

- User runs `bd-archive extract -o ./restored`.
- Tool prompts for discs one at a time. Each disc's filenames are parsed (via `archive.dar_archive.parse_dar_filename`) to detect the chain name and which generation that disc belongs to. Order doesn't matter; discs from any gen, any order, all accepted.
- All slices land in one flat staging dir. Different generations have different dar basenames (photos-gen1, photos-gen2, …), so they coexist without collision.
- When the user says "no more discs", the tool runs `dar -x` once per generation in order. The first gen extracts into the clean output; later gens use dar's `-wa` flag to overwrite files that earlier gens already wrote (later gens carry the newer content).

The current archive_name variable is replaced with two pieces of state: chain_name (the `-n` value, identical across all gens) and a gen→dar_basename mapping (because legacy pre-Phase-2 gen 1 archives have basename "photos" while new ones have "photos-gen1").

Per-generation catalog verification: each gen has its own catalog file with a different basename, so the "verified" flag is now a dict keyed by gen number rather than a single bool. A disc that fails its gen's catalog sha512 drops it from staging so the next disc of the same gen can refetch; this is the same convergence logic as before, just generation-scoped. The damage path (par2 repair) is unchanged in mechanics.

`tools/dar.py::extract_sequential` grows an `overwrite` parameter that toggles dar's `-wa` flag. Required for chain extracts where gen N's data replaces gen N-1's; no effect on the first gen, which extracts into an empty output dir.

Smoketest: built a 2-gen chain (5 original files + 1 sub-dir file, then 1 modified + 1 new in gen 2), invoked extract_sequential for each gen in order against staged slices. `diff -rq` between source and restored output: byte-identical, no differences.

Disc-mounting flow (prompt/mount/copy/verify) is preserved from the previous implementation; refactored to track per-gen state but the per-disc UX is the same. A full e2e against a real optical drive remains a manual user verification step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
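The per-generation restore order can be illustrated with a small planner that only builds the command lines. This is a sketch under assumptions: `plan_chain_extract` is a hypothetical name, the exact argument list used by `extract_sequential` is not shown in this PR text, and `-R` is used here as dar's restore-root option. The point is the ordering and that `-wa` is added only for generations after the first.

```python
from pathlib import Path


def plan_chain_extract(gen_basenames, staging_dir, output_dir):
    """Return one dar argv per generation, lowest generation first.

    gen_basenames maps gen number -> dar basename, e.g. {1: "photos", 2: "photos-gen2"}
    (a legacy Gen 1 has no -gen1 suffix).
    """
    commands = []
    for index, gen in enumerate(sorted(gen_basenames)):
        archive = str(Path(staging_dir) / gen_basenames[gen])
        argv = ["dar", "-x", archive, "-R", str(output_dir)]
        if index > 0:
            # Later generations overwrite what earlier ones restored.
            argv.append("-wa")
        commands.append(argv)
    return commands


for argv in plan_chain_extract({2: "photos-gen2", 1: "photos"}, "staging", "restored"):
    print(" ".join(argv))
```

Each argv could then be handed to `subprocess.run(argv, check=True)` in order; stopping at the first failure keeps a later generation from being applied on top of an incomplete earlier one.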
README gains an "Adding an incremental generation" section with the `--base` workflow and `--min-last-disc-fill` explanation, plus updates the extract section to describe whole-chain restore. Adds a "Chain identity = archive name" callout near the top to explain the discipline of keeping `-n` constant across generations.

AGENTS.md create / extract architecture descriptions are rewritten to cover the new naming scheme, `--base` flow, list_catalog_paths / scan_delta_bytes, auto-defer pool semantics, per-gen catalog state, and `dar -x -wa` chain restore. Layout section notes the new constants (ISO9660_LABEL_NAME_MAX / _SUFFIX_LEN), new helpers (parse_dar_filename, list_source_files / SourceFile, scan_delta_bytes), and the extended dar wrapper surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses findings from the comprehensive code review of phases 2-5:

- Critical: NameError when the auto-defer pool is empty (incremental with `--min-last-disc-fill` where every source file is already in the base catalog). The fallthrough branch referenced new_n / new_last / new_fill, which only existed if the loop body ran. Initialise them to the pre-defer layout before the loop, and split the "pool was empty" path from the "deferred everything to zero" path with distinct messages.
- Important: `_layout(0)` returned a nonsense 110% fill for the delta-empty incremental case (no new files; only catalog + par2 overhead on the disc). Add an early return for est==0 computing fill from the fixed overhead so the preview makes sense.
- Minor: range-validate `--min-last-disc-fill` at the top of cmd_create so 150 surfaces "must be 0-100" rather than silently triggering the threshold-unreachable warning.
- Minor: hoist `from datetime import datetime` to the module-level imports (was inside the auto-defer print block) and split a long f-string to satisfy ruff's line-length rule.

Manual e2e for both regressions verified: empty-pool case now produces a clean info message and proceeds with the original layout, generating a tiny incremental that captures any deletions (which is exactly what dar's incremental does in that scenario). Range validation rejects 150 with a clear error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
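The range check from the first "Minor" item could be as small as the sketch below. The function name and the use of `None` for "flag not given" are assumptions; the real check lives at the top of cmd_create and may report the error differently.

```python
from typing import Optional


def validate_min_last_disc_fill(value: Optional[float]) -> Optional[float]:
    """Reject percentages outside 0-100; None means the flag was not given."""
    if value is None:
        return None
    # The chained comparison is also False for NaN, so NaN is rejected too.
    if not 0 <= value <= 100:
        raise ValueError(f"--min-last-disc-fill must be 0-100, got {value}")
    return value


try:
    validate_min_last_disc_fill(150)
except ValueError as exc:
    print(exc)
```

Validating up front keeps an out-of-range value like 150 from reaching the defer loop, where it would otherwise look like an unreachable threshold.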
Summary
Phases 2 through 5 of the incremental-archives feature. Stacks on top of #20 (Phase 1: catalog policy + persistence). Once #20 merges, this PR rebases against main automatically.
Design context
Full spec: `docs/superpowers/specs/2026-05-12-incremental-backups-design.md` (committed in #20). All four phases here implement that spec; Phase 5 deviates slightly from the spec's "single dar -x" model by doing one call per generation, which was verified to handle additions, modifications, AND deletions correctly in dar 2.7.17 with `-wa`.
Code review
A comprehensive code review of the full diff caught:
All three fixed in the final commit. The reviewer's other flagged item (per-gen extract vs. single-call) was verified empirically to not be a real semantic gap — dar's `-wa` per-gen does honour deletion entries.
Test plan
🤖 Generated with Claude Code