Incremental archives — Phases 2-5 (naming + --base + auto-defer + chain extract) #23

Merged
Xitee1 merged 8 commits into main from feat/incremental-phase2-5 on May 13, 2026

@Xitee1 Xitee1 commented May 12, 2026

Summary

Phases 2 through 5 of the incremental-archives feature. Stacks on top of #20 (Phase 1: catalog policy + persistence). Once #20 merges, this PR rebases against main automatically.

  • Phase 2 — Naming + label: dar archive name is now `<-n>-gen<N>`; volume labels become `<truncated_name>_G<NN>_<NNNN>` (generation and disc number visible at a glance; the name truncates to 23 chars in the label only, and filenames inside the ISO keep the full name).
  • Phase 3 — `--base`: `bd-archive create` accepts `--base <catalog.dar>` pointing at a previous generation's persisted catalog. Run becomes Gen N+1 (parsed from base filename); dar gets `-A` for incremental mode; preview shows delta size instead of full source.
  • Phase 4 — `--min-last-disc-fill`: auto-defers newest-by-mtime files until last-disc fill reaches the threshold. Conservative pool: only files not already in the base catalog for incrementals (no risk of losing files that just had their mtime touched on disk).
  • Phase 5 — Chain-aware extract: `bd-archive extract` restores a full chain in one run. Discs may be inserted in any order; the tool detects each disc's generation from filenames and runs one `dar -x` per generation in order at the end (subsequent gens with `-wa` so they overwrite earlier-gen files). Backward-compatible with pre-feature legacy archives (treated as Gen 1).

Design context

Full spec: `docs/superpowers/specs/2026-05-12-incremental-backups-design.md` (committed in #20). All four phases here implement that spec; Phase 5 deviates slightly from the spec's "single dar -x" model by doing one call per generation, which was verified to handle additions, modifications, AND deletions correctly in dar 2.7.17 with `-wa`.

Code review

A comprehensive code review of the full diff caught:

  • A NameError in the auto-defer path when the pool was empty (incremental where all source files are already in base).
  • A nonsense fill % display when archive_est was 0 (delta-empty incremental).
  • Missing range validation on `--min-last-disc-fill`.

All three fixed in the final commit. The reviewer's other flagged item (per-gen extract vs. single-call) was verified empirically to not be a real semantic gap — dar's `-wa` per-gen does honour deletion entries.

Test plan

  • `ruff check src/bd_archive/` clean
  • Phase 2 e2e: 2-disc archive, volume labels show `phase2test_G01_0001` / `phase2test_G01_0002`; files inside ISO are `phase2test-gen1.NNNN.dar`; local catalog persists as `phase2test-gen1-catalog.0001.dar`.
  • Phase 3 e2e: Gen 1 of 50 MiB → 2 discs. Add 15 MiB of new files. Gen 2 with `--base` → single disc, 15 MiB delta only, `phase3test-gen2.0001.dar` and `phase3test-gen2-catalog.0001.dar` on disc.
  • Phase 4 e2e: Gen 1 50 MiB, add 60 MiB. Gen 2 baseline: 2 discs, last 46%. Gen 2 with `--min-last-disc-fill 50`: 2 files deferred (20 MiB), 1 disc, 94%.
  • Phase 5 chain extract smoketest: built 2-gen chain (5 files + 1 modified + 1 new in gen 2), staged all slices into one dir, called `extract_sequential` for each gen in order. Result: byte-identical to source. Also verified separately that dar's per-gen `-wa` correctly removes a deleted file from gen 2.
  • Review fixes: empty-pool incremental + `--min-last-disc-fill 80` no longer crashes — proceeds with a clean info message; `--min-last-disc-fill 150` rejected with a clear error.
  • Real-disc e2e (manual user verification): extract.py's disc-mount + prompt + verify loop is unchanged in mechanics, just refactored to track per-gen state. Needs a multi-gen chain on real BD-R media to fully exercise.

🤖 Generated with Claude Code

Xitee1 and others added 6 commits May 12, 2026 18:18
Phase 2 of incremental-archives. Internal dar archive name is now
<name>-gen<N> (every new full is Gen 1; Phase 3 derives higher N from
--base). Volume labels switch to <truncated_name>_G<NN>_<NNNN> — the
gen suffix lives in the label, the human-meaningful name truncates to
23 chars if longer. Filenames inside the ISO keep the full name.
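
The naming scheme above can be sketched as follows. This is a minimal illustration, not the project's code: `make_volume_label` and `make_dar_basename` are hypothetical names, and deriving the 23-character limit from a 32-character volume identifier minus the 9-character `_G<NN>_<NNNN>` suffix is an assumption that happens to match the numbers in this message.

```python
# Assumption: 32 is the volume-identifier length the label has to fit into.
LABEL_MAX = 32
SUFFIX_LEN = len("_G01_0001")          # 9 characters: generation + disc number
NAME_MAX = LABEL_MAX - SUFFIX_LEN      # 23, the truncation length quoted above


def make_volume_label(name: str, gen: int, disc: int) -> str:
    """Build <truncated_name>_G<NN>_<NNNN>; only the name part is truncated."""
    return f"{name[:NAME_MAX]}_G{gen:02d}_{disc:04d}"


def make_dar_basename(name: str, gen: int) -> str:
    """Internal dar archive name; keeps the full, untruncated name."""
    return f"{name}-gen{gen}"
```

With these helpers, `make_volume_label("phase2test", 1, 2)` gives the `phase2test_G01_0002` label seen in the e2e run, while a 40-character name would be cut to 23 characters in the label only.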

Pre-Phase-2 (legacy) Gen 1 archives are unaffected: their old labels
and naming stay on the burned discs. New archives produced from this
phase onward carry the new scheme.

Why the truncation tradeoff: physically distinguishing Gen 1 Disc 1
from Gen 2 Disc 1 of the same chain is more useful than seeing the
last few characters of an already-known archive name. The archive
name acts as the chain identity (see project README, updated in a
later phase), a discipline the user enforces by keeping `-n`
constant across generations.

README on disc gains a Generation line and a CHAIN: hint explaining
the name-consistency rule.

Manual e2e verified: phase2test_G01_0001 / phase2test_G01_0002 labels
on a 2-disc set; phase2test-gen1.NNNN.dar slices on the discs via UDF;
phase2test-gen1-catalog.0001.dar persisted to output_dir.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 3 of incremental-archives. New `--base <catalog.dar>` flag on
`bd-archive create` makes the run produce an incremental archive
against the supplied isolated catalog. dar's `-A` flag does the
actual work; this commit wires it up end-to-end.

The base catalog filename encodes the predecessor generation
(`<name>-gen<N>-catalog.NNNN.dar`), so the new gen number is derived
without any sidecar metadata file. Legacy catalogs (pre-Phase-2,
filename `<name>-catalog.NNNN.dar`) are treated as Gen 1; the new
gen becomes Gen 2.

The pre-archive preview is now base-aware: when --base is given, the
estimated archive size reflects only files that are new or modified
since the base catalog (tools.dar.list_catalog_paths parses `dar -l`
output; mtime > catalog-mtime catches modifications heuristically).
Disc-count and last-disc-fill estimates use this delta, not the full
source; without this, an incremental's preview would massively
overstate the archive size.
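
The delta estimate described above can be illustrated roughly like this. The `SourceFile` fields (rel_path, size, mtime) come from the Phase 4 description; the `scan_delta_bytes` signature shown here is an assumption for illustration, and the real helper may take different arguments.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class SourceFile:
    rel_path: str   # path relative to the source root
    size: int       # bytes
    mtime: float    # seconds since the epoch


def scan_delta_bytes(
    files: Iterable[SourceFile],
    catalog_paths: Set[str],
    catalog_mtime: float,
) -> int:
    """Sum the sizes of files that are new or look modified since the base.

    Hypothetical signature: "new" means the path is absent from the base
    catalog; "modified" is the mtime > catalog-mtime heuristic.
    """
    return sum(
        f.size
        for f in files
        if f.rel_path not in catalog_paths or f.mtime > catalog_mtime
    )
```

Because the modification test relies on mtime alone, a file that was merely touched counts toward the estimate, so the preview may overstate slightly rather than understate.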

Chain identity is the archive name: --base whose embedded archive
name disagrees with -n fails with a clear error pointing at the
mismatch. Same name across generations is the user's discipline.

archive/dar_archive.py gains parse_dar_filename(), a single regex
that handles both Phase-2+ generational filenames and legacy ones.
Used here for --base validation and reusable by Phase 5's chain
detection in extract.
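
A single-regex parser along these lines might look like the sketch below. The actual pattern in `archive/dar_archive.py` is not shown in this PR, so the regex, the `ParsedDarName` tuple, and the `next_generation` helper are assumptions that only follow the filename shapes named in this message.

```python
import re
from typing import NamedTuple, Optional

_DAR_RE = re.compile(
    r"^(?P<name>.+?)"            # chain name (lazy, so the optional parts can match)
    r"(?:-gen(?P<gen>\d+))?"     # generation suffix; absent on legacy archives
    r"(?P<catalog>-catalog)?"    # isolated-catalog marker
    r"\.(?P<slice>\d+)\.dar$"    # slice number and extension
)


class ParsedDarName(NamedTuple):
    name: str
    gen: int
    is_catalog: bool
    slice_no: int


def parse_dar_filename(filename: str) -> Optional[ParsedDarName]:
    """Parse a bare filename (no directories); legacy names count as Gen 1."""
    m = _DAR_RE.match(filename)
    if m is None:
        return None
    gen = int(m.group("gen")) if m.group("gen") else 1
    return ParsedDarName(
        m.group("name"), gen, m.group("catalog") is not None, int(m.group("slice"))
    )


def next_generation(base_filename: str, archive_name: str) -> int:
    """Derive the new generation from a --base catalog, enforcing chain identity."""
    parsed = parse_dar_filename(base_filename)
    if parsed is None or not parsed.is_catalog:
        raise ValueError(f"not a catalog filename: {base_filename}")
    if parsed.name != archive_name:
        raise ValueError(
            f"--base belongs to '{parsed.name}' but -n is '{archive_name}'"
        )
    return parsed.gen + 1
```

Under these assumptions, `next_generation("photos-catalog.0001.dar", "photos")` returns 2, matching the legacy rule above.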

Manual e2e: Gen 1 full of 50 MiB source → 2 discs. Adding 15 MiB
of new files and running Gen 2 with --base produced a single-disc
incremental containing only the delta (phase3test-gen2.0001.dar of
15 MiB, plus its catalog).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 4 of incremental-archives. When the projected last-disc fill is
below `--min-last-disc-fill PERCENT`, bd-archive automatically defers
the newest files until enough has been removed to either drop a disc
from the set or empty the candidate pool.

Pool selection is deliberately conservative:

- With `--base`: only files whose relative path is not in the base
  catalog (truly new). Determined via `dar -l <base>` parse, so
  files that have merely had their mtime touched on disk stay in
  the archive (no silent loss across generations).

- Without `--base` (Full): all files are candidates, with a loud
  warning that deferred files won't be archived until a future
  incremental run picks them up.
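
The two pool rules above can be sketched as one small function. The name `defer_pool` and its signature are hypothetical; only the selection rule and the newest-first ordering are taken from the PR description.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class SourceFile:
    rel_path: str
    size: int
    mtime: float


def defer_pool(
    files: Iterable[SourceFile],
    base_catalog_paths: Optional[Set[str]],
) -> List[SourceFile]:
    """Return deferral candidates, newest mtime first.

    base_catalog_paths is None for a full archive (every file is a candidate).
    For an incremental, only paths absent from the base catalog qualify, so a
    file that merely had its mtime touched is never deferred.
    """
    if base_catalog_paths is None:
        pool = list(files)
    else:
        pool = [f for f in files if f.rel_path not in base_catalog_paths]
    return sorted(pool, key=lambda f: f.mtime, reverse=True)
```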

The preview block now shows what would be deferred (file count, byte
count, oldest mtime, sample paths) BEFORE the confirm prompt, so the
user can abort if the plan looks wrong.

When the threshold is unreachable (entire pool deferred without ever
crossing the fill threshold), the run still proceeds with the partial
deferral — the user gets a warning, not an abort. The only fatal case
is "deferring everything would archive zero bytes", which exits 1.
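
One way to picture the deferral loop and its three outcomes (threshold reached, threshold unreachable, zero bytes left) is the simplified model below. It ignores catalog and par2 overhead, and `layout` and `plan_deferral` are illustrative names rather than the functions in this PR.

```python
from typing import List, Tuple


def layout(est_bytes: int, capacity: int) -> Tuple[int, float]:
    """Return (disc_count, last_disc_fill_percent); overhead is ignored here."""
    if est_bytes <= 0:
        return 1, 0.0
    n = -(-est_bytes // capacity)              # integer ceiling division
    last = est_bytes - (n - 1) * capacity
    return n, 100.0 * last / capacity


def plan_deferral(
    total_bytes: int,
    pool_sizes: List[int],                     # candidate sizes, newest first
    capacity: int,
    min_fill: float,
) -> Tuple[List[int], int, float]:
    # Start from the pre-defer layout so an empty pool still yields a result.
    n, fill = layout(total_bytes, capacity)
    remaining = total_bytes
    deferred: List[int] = []
    for size in pool_sizes:
        if fill >= min_fill:
            break                              # threshold reached, stop deferring
        remaining -= size
        deferred.append(size)
        n, fill = layout(remaining, capacity)
    if deferred and remaining <= 0:
        # The only fatal case: nothing would be left to archive.
        raise ValueError("deferring everything would archive zero bytes")
    # If fill is still below min_fill here, the caller warns and proceeds.
    return deferred, n, fill
```

In this model, deferring files first lowers the last-disc fill and only helps once a whole disc drops out of the set, which is why the loop keeps going until the fill crosses the threshold or the pool runs out.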

archive/source_scan.py grows a SourceFile dataclass and
list_source_files() walker — separate from scan_source's aggregate
view because the defer algorithm needs per-file rel_path/size/mtime.

tools/dar.py::create_sliced grows an `excludes` parameter that turns
each entry into a `-P <path>` flag, with dar -P being the
relative-subpath exclude operator.
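
The mapping from `excludes` to dar arguments is small enough to show directly. This is a sketch with a hypothetical helper name; how `create_sliced` actually assembles its command line is not shown in this PR.

```python
from typing import Iterable, List


def exclude_args(excludes: Iterable[str]) -> List[str]:
    """Turn each relative path into a `-P <path>` pair for the dar command line."""
    args: List[str] = []
    for rel_path in excludes:
        args += ["-P", rel_path]
    return args
```

Passing the paths as separate argv entries (rather than one joined string) keeps filenames with spaces intact when the list is handed to `subprocess`.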

Manual e2e: Gen 1 of 50 MiB, then 60 MiB delta. Without --min-last-
disc-fill: 2 discs, last disc 46%. With --min-last-disc-fill 50:
20 MiB deferred (2 files), single disc, last fill 94%.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Phase 5 of incremental-archives. `bd-archive extract` now restores
an entire incremental chain (Gen 1 + all subsequent gens) in a
single invocation. Previously it could only restore one archive set.

User flow:

- User runs `bd-archive extract -o ./restored`.
- Tool prompts for discs one at a time. Each disc's filenames are
  parsed (via archive.dar_archive.parse_dar_filename) to detect the
  chain name and which generation that disc belongs to. Order
  doesn't matter; discs from any gen, any order, all accepted.
- All slices land in one flat staging dir. Different generations
  have different dar basenames (photos-gen1, photos-gen2, …), so
  they coexist without collision.
- When the user says "no more discs", the tool runs `dar -x` once
  per generation in order. The first gen extracts into the clean
  output; later gens use dar's -wa flag to overwrite files that
  earlier gens already wrote (later gens carry the newer content).
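
The final step of that flow, one `dar -x` per generation in ascending order, could be planned as in the sketch below. `plan_chain_extract` is a hypothetical helper, and the exact argument list (including the use of `-R` for the restore root) is an assumption; only `-x` and `-wa` are named in this PR.

```python
from typing import Dict, List


def plan_chain_extract(
    gen_basenames: Dict[int, str],   # gen number -> dar basename in staging
    staging_dir: str,
    output_dir: str,
) -> List[List[str]]:
    """Build one dar command per generation, oldest generation first."""
    commands: List[List[str]] = []
    for index, gen in enumerate(sorted(gen_basenames)):
        cmd = ["dar", "-x", f"{staging_dir}/{gen_basenames[gen]}", "-R", output_dir]
        if index > 0:
            cmd.append("-wa")        # later gens overwrite what earlier gens wrote
        commands.append(cmd)
    return commands
```

Keying the mapping by generation number is what lets a legacy Gen 1 basename such as `photos` sit next to `photos-gen2` in the same staging directory.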

The current archive_name variable is replaced with two pieces of
state: chain_name (the -n value, identical across all gens) and a
gen→dar_basename mapping (because legacy pre-Phase-2 gen 1 archives
have basename "photos" while new ones have "photos-gen1").

Per-generation catalog verification: each gen has its own catalog
file with a different basename, so the "verified" flag is now a
dict keyed by gen number rather than a single bool. A disc that
fails its gen's catalog sha512 drops it from staging so the next
disc of the same gen can refetch — same convergence logic as
before, just generation-scoped.
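
The generation-scoped verification state can be pictured as below. This is a loose sketch: `check_gen_catalog` is a hypothetical name, and where the expected sha512 comes from is not described in this PR, so it is simply passed in.

```python
import hashlib
from pathlib import Path
from typing import Dict


def check_gen_catalog(
    gen: int,
    catalog_path: Path,
    expected_sha512: str,        # assumption: supplied by the caller
    verified: Dict[int, bool],   # per-generation flag instead of a single bool
) -> bool:
    """Verify one generation's staged catalog; drop it from staging on mismatch."""
    digest = hashlib.sha512(catalog_path.read_bytes()).hexdigest()
    ok = digest == expected_sha512
    if not ok:
        catalog_path.unlink()    # lets the next disc of this gen supply a fresh copy
    verified[gen] = ok
    return ok
```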

The damage path (par2 repair) is unchanged in mechanics.

tools/dar.py::extract_sequential grows an `overwrite` parameter
that toggles dar's `-wa` flag. Required for chain extracts where
gen N's data replaces gen N-1's; no effect on the first gen which
extracts into an empty output dir.

Smoketest: built a 2-gen chain (5 original files + 1 sub-dir file,
then 1 modified + 1 new in gen 2), invoked extract_sequential for
each gen in order against staged slices. diff -rq between source
and restored output: byte-identical, no differences.

Disc-mounting flow (prompt/mount/copy/verify) is preserved from
the previous implementation; refactored to track per-gen state but
the per-disc UX is the same. A full e2e against a real optical
drive remains a manual user verification step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

README gains an "Adding an incremental generation" section with the
--base workflow and --min-last-disc-fill explanation, plus updates the
extract section to describe whole-chain restore. Adds a "Chain identity
= archive name" callout near the top to explain the discipline of
keeping -n constant across generations.

AGENTS.md create / extract architecture descriptions are rewritten to
cover the new naming scheme, --base flow, list_catalog_paths /
scan_delta_bytes, auto-defer pool semantics, per-gen catalog state,
and dar -x -wa chain restore. Layout section notes the new
constants (ISO9660_LABEL_NAME_MAX / _SUFFIX_LEN), new helpers
(parse_dar_filename, list_source_files / SourceFile,
scan_delta_bytes), and the extended dar wrapper surface.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses findings from the comprehensive code review of phases 2-5:

- Critical: NameError when the auto-defer pool is empty (incremental
  with --min-last-disc-fill where every source file is already in
  the base catalog). The fallthrough branch referenced new_n /
  new_last / new_fill which only existed if the loop body ran.
  Initialise them to the pre-defer layout before the loop, and
  split the "pool was empty" path from the "deferred everything to
  zero" path with distinct messages.

- Important: _layout(0) returned a nonsense 110% fill for the
  delta-empty incremental case (no new files; only catalog +
  par2 overhead on the disc). Add an early return for est==0
  computing fill from the fixed overhead so the preview makes sense.

- Minor: range-validate --min-last-disc-fill at the top of cmd_create
  so 150 surfaces "must be 0-100" rather than silently triggering
  the threshold-unreachable warning.

- Minor: hoist `from datetime import datetime` to the module-level
  imports (was inside the auto-defer print block) and split a long
  f-string to satisfy ruff's line-length rule.
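
For the range check among the fixes above, a minimal sketch could look like this. The function name is hypothetical and the real check sits inline at the top of `cmd_create`; only the 0-100 bound and the wording of the error come from this message.

```python
def validate_min_last_disc_fill(percent: float) -> float:
    """Reject thresholds outside 0-100 before any layout work starts."""
    if not 0 <= percent <= 100:
        raise ValueError(f"--min-last-disc-fill must be 0-100, got {percent}")
    return percent
```

Failing early means a value such as 150 produces a direct error instead of reaching the deferral loop, where it would only show up as a threshold-unreachable warning.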

Manual e2e for both regressions verified: empty-pool case now
produces a clean info message and proceeds with the original
layout, generating a tiny incremental that captures any deletions
(which is exactly what dar's incremental does in that scenario).
Range validation rejects 150 with a clear error.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Base automatically changed from feat/incremental-phase1-catalog-policy to main May 12, 2026 18:04
@Xitee1 Xitee1 merged commit 70c1fa2 into main May 13, 2026
1 check passed
@Xitee1 Xitee1 deleted the feat/incremental-phase2-5 branch May 19, 2026 21:17