offload: make the durability gate sound (freshness + method provenance + snapshot-pinned advance)#120

Merged: mbertschler merged 14 commits into offload-v1 from fix-durability-soundness on Jun 11, 2026

mbertschler (Owner) commented Jun 11, 2026

Makes the offload durability gate sound. offload is the only command that
deletes user data, so every change here moves the gate strictly toward
"refuse more / claim less": it can only ever start refusing offload, never
start permitting it. The one deliberate exception is the relayed-target
freshness fix below, which un-wedges a workflow the first cut of #115 broke.

Closes #115
Closes #103
Closes #108
Closes #109

A re-audit of the first cut found that the #115 freshness condition, while
sound, wedged the headline workflow for peer-relayed offsites, and that the
#103 fix left the peer leg unpinned. Both are fixed here; see the offsite-wedge
section.

The gate, after this PR

A present row offloads only when, for every required target, all three hold:

  1. Origin vector (existing): the target's component for the content's origin
    node covers the content's origin run.
  2. Freshness (#115, now relay-aware): a successful whole-volume push covers
    the run in which the path last became present. For a target this node pushes
    to directly, the watermark is the last push in local run space
    (vs files.status_changed_run_id); for a relayed target it never pushes
    to, the watermark is the pulled origin-space push-freshness coordinate
    for the content's origin node (vs the content's origin run).
  3. Method (#109): the component is content-verified. blake3 / peer-blake3 /
    kopia-verify pass directly; a presence+size content-addressed component
    passes only once a verified scan-back fingerprint backs the gated object.
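
The three conditions can be read as one predicate evaluated per required target. The following is a minimal Go sketch of that reading, not the code in offload/gate.go: the types Row and Evidence, their fields, and the functions passes and offloadable are hypothetical names chosen for illustration, and treating a pulled freshness of 0 as "no evidence" is an assumption.

```go
package main

import "fmt"

// Row is a hypothetical view of the file row being gated.
type Row struct {
	OriginRun          int64 // run in the origin node's run space
	StatusChangedRunID int64 // local run in which the path last became present
}

// Evidence is a hypothetical bundle of what the gate knows about one target.
type Evidence struct {
	VectorRun       int64 // vector component for the row's origin node
	PushesLocally   bool  // this node syncs to the target directly
	LocalPushRun    int64 // last successful whole-volume push (local run space)
	PulledFreshness int64 // pulled origin-space freshness; 0 means none (assumption)
	ContentVerified bool  // verified method, or fingerprint-backed presence+size
}

// passes reports whether one target satisfies all three conditions.
func passes(r Row, e Evidence) bool {
	if e.VectorRun < r.OriginRun { // 1. origin vector
		return false
	}
	var fresh bool // 2. freshness, branched on who pushes to the target
	if e.PushesLocally {
		fresh = e.LocalPushRun >= r.StatusChangedRunID
	} else {
		fresh = e.PulledFreshness > 0 && e.PulledFreshness >= r.OriginRun
	}
	return fresh && e.ContentVerified // 3. method
}

// offloadable requires every required target to pass. Refusing when no
// targets are configured is an assumption of this sketch.
func offloadable(r Row, targets []Evidence) bool {
	if len(targets) == 0 {
		return false
	}
	for _, e := range targets {
		if !passes(r, e) {
			return false
		}
	}
	return true
}

func main() {
	r := Row{OriginRun: 4, StatusChangedRunID: 9}
	local := Evidence{VectorRun: 4, PushesLocally: true, LocalPushRun: 9, ContentVerified: true}
	relayed := Evidence{VectorRun: 4, PulledFreshness: 4, ContentVerified: true}
	fmt.Println(offloadable(r, []Evidence{local, relayed}))

	relayed.PulledFreshness = 0 // no freshness evidence: refuse
	fmt.Println(offloadable(r, []Evidence{local, relayed}))
}
```

Every branch that lacks evidence returns false, which matches the stated direction of the gate: it may refuse a safe offload but should not permit an unsafe one.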

Offsite wedge fix — relayed freshness travels the wire (BLOCKER)

The wedge. #115's freshness watermark is LastSuccessfulWholeVolumePushRunID
in local run space. For a peer-relayed target — an offsite the NAS pushes to
but the laptop never syncs to directly (named in offload_requires, absent from
the laptop's sync_to) — the laptop has zero local sync runs to it, so the
watermark is always 0 and every file fails freshness forever. Offload became
a permanent silent no-op for the offsite tier — exactly the workflow the feature
exists for. (The first cut's TestOffloadPeerRelayedTargetNeedsLocalPush
documented this as "deliberate"; it was the wedge.)

The fix (per design §3's origin-space model). Carry a whole-volume-push
freshness coordinate over the durability-pull wire, expressed in origin
space: the same coordinate space the durability vector already uses, so it
composes across hops with no translation and reuses the gated content's own
origin coordinate (which the requester already holds).

  • New destination_push_freshness table (migration v20): per
    (volume, destination, origin_node), the per-origin-node maxima of the
    present set captured at the destination's most recent successful
    whole-volume push. It is non-monotonic (overwritten per push), unlike the
    monotonic durability vector, so it answers "is this origin run covered by a
    fresh push", which the vector's all-time max cannot.
  • AdvanceDestinationVectorTo records it from exactly the pinned snapshot it
    advances the vector with, so every gating push (rclone bucket,
    content-addressed, kopia, and the now-pinned peer leg) leaves freshness behind
    for free.
  • Wire: DurabilityResponse gains a Freshness list. The responder (the
    node that actually pushes to the relayed target) reports its recorded
    push-freshness; the puller merges it into its local table monotonically
    (highest coordinate ever observed). Append-only targets make this sound: once
    a push covered origin run R, everything ≤ R of that node was pushed and
    persists, so a stale pull never needs to lower the puller's stored coordinate, and re-acquired
    old content is still durable.
  • Gate: a relayed target (no local push, watermark 0) gates the gated
    content's origin run against the pulled freshness for that origin node. No
    freshness evidence ⇒ refuse (the safe direction). A locally-pushed target
    is unchanged.
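
The two write paths described above differ in how they treat an existing coordinate: the pushing node overwrites, the pulling node keeps the maximum. A small in-memory Go sketch of that difference follows. The key and freshness types and the recordPush and mergePulled methods are hypothetical stand-ins for the destination_push_freshness table and its store methods, and the sketch does not decide what happens to origin nodes absent from a new push snapshot, which the description leaves unstated.

```go
package main

import "fmt"

// key is a hypothetical mirror of the table's primary key.
type key struct{ Volume, Destination, OriginNode string }

// freshness is an in-memory stand-in for destination_push_freshness.
type freshness map[key]int64

// recordPush stores the maxima captured at this push. It overwrites the
// previous value, so a coordinate may go down (latest-push truth).
func (f freshness) recordPush(vol, dest string, maxima map[string]int64) {
	for node, run := range maxima {
		f[key{vol, dest, node}] = run
	}
}

// mergePulled keeps the highest coordinate ever observed, so a stale pull
// never lowers what the puller already holds.
func (f freshness) mergePulled(vol, dest, node string, run int64) {
	k := key{vol, dest, node}
	if run > f[k] {
		f[k] = run
	}
}

func main() {
	pusher := freshness{}
	pusher.recordPush("photos", "offsite", map[string]int64{"laptop": 7})
	pusher.recordPush("photos", "offsite", map[string]int64{"laptop": 5})
	fmt.Println(pusher[key{"photos", "offsite", "laptop"}]) // overwritten by the later push

	puller := freshness{}
	puller.mergePulled("photos", "offsite", "laptop", 9)
	puller.mergePulled("photos", "offsite", "laptop", 4) // stale pull
	fmt.Println(puller[key{"photos", "offsite", "laptop"}])
}
```

The monotonic merge on the pull side relies on the append-only property argued above; on a target that could delete content it would not be a safe cache.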

Why origin-space per-node max is sound here (confirmed against the vector
math): the gating destinations are append-only (offsites are
content-addressed append-only; sync never deletes; kopia retains snapshots), and
advances are snapshot-pinned (#103). So "origin run r of node N was covered
by a fresh push" + the monotonic vector ⇒ that content was pushed and remains.
The only corner case is a relayed file whose origin run exceeds the latest
push's maxima (content the pushing node has since dropped from its present
set): the gate refuses it even though the append-only target still holds it.
That is over-conservative but safe, and outside the common growing-corpus case
(the maxima climb monotonically when the source of truth never loses content).

#115 — local freshness watermark (unchanged from the first cut)

Derived at gate time: LastSuccessfulWholeVolumePushRunID(volume, destination) =
MAX(id) of kind='sync' status='success' rows for the pair. A path present
after the last push fails freshness regardless of vector state — which also
catches the #103 over-advance windows. No new column for this watermark.
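
As an illustration of the derivation, the watermark is a maximum over the successful sync runs for the pair, and the freshness check compares it with the run in which the path last became present. The Go sketch below works on an in-memory slice rather than the real runs table; the Run type and the functions lastPushRunID and fresh are hypothetical names, not the store API.

```go
package main

import "fmt"

// Run is a hypothetical in-memory stand-in for a runs-table row.
type Run struct {
	ID                  int64
	Kind, Status        string
	Volume, Destination string
}

// lastPushRunID mirrors the derivation: MAX(id) over successful sync runs
// for the (volume, destination) pair, or 0 when there is none.
func lastPushRunID(runs []Run, volume, destination string) int64 {
	var best int64
	for _, r := range runs {
		if r.Kind == "sync" && r.Status == "success" &&
			r.Volume == volume && r.Destination == destination && r.ID > best {
			best = r.ID
		}
	}
	return best
}

// fresh reports whether the path became present at or before the last push.
func fresh(statusChangedRunID, watermark int64) bool {
	return watermark >= statusChangedRunID
}

func main() {
	runs := []Run{
		{ID: 3, Kind: "sync", Status: "success", Volume: "photos", Destination: "nas"},
		{ID: 5, Kind: "index", Status: "success", Volume: "photos"},
		{ID: 6, Kind: "sync", Status: "failed", Volume: "photos", Destination: "nas"},
	}
	w := lastPushRunID(runs, "photos", "nas")
	fmt.Println(w)           // the failed sync and the index run are ignored
	fmt.Println(fresh(2, w)) // present before the last push
	fmt.Println(fresh(5, w)) // became present after the last push
}
```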

#103 — advance from the push's own enumeration, not the live table (now CLOSED)

  • Snapshot-pinned advance: handlers capture PresentOriginMaxima before
    the transfer and advance to exactly that snapshot via
    AdvanceDestinationVectorTo(vol, dest, method, components).
  • Peer leg now pinned (re-audit finding 3): sync/node.go phaseClose
    previously advanced from a post-transfer live read
    (AdvanceDestinationVector). It now captures the snapshot before
    phaseTransfer and feeds it to AdvanceDestinationVectorTo(…, peer-blake3, …),
    matching the other three handlers. The live-read AdvanceDestinationVector is
    removed (no remaining caller); its two store tests move onto the snapshot path.
    A peer-path snapshot-pin test proves a row committed mid-transfer is not
    covered. This closes the #103 residual the first cut left open.
  • Cross-kind run exclusion: BeginSyncRunIfClear refuses a new sync while an
    index or audit walks the volume (asymmetry rationale unchanged).
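
The over-advance window can be made concrete with a small simulation: capture the maxima before the transfer, let a row commit mid-transfer, then advance to the captured snapshot. This Go sketch uses hypothetical lowercase stand-ins (row, presentOriginMaxima, advanceTo) operating on in-memory data; the real PresentOriginMaxima and AdvanceDestinationVectorTo work against the store and take different arguments.

```go
package main

import "fmt"

// row is a hypothetical present-set entry.
type row struct {
	OriginNode string
	OriginRun  int64
}

// presentOriginMaxima returns the highest origin run per origin node.
func presentOriginMaxima(present []row) map[string]int64 {
	maxima := make(map[string]int64)
	for _, r := range present {
		if r.OriginRun > maxima[r.OriginNode] {
			maxima[r.OriginNode] = r.OriginRun
		}
	}
	return maxima
}

// advanceTo raises the vector to the supplied components and never lowers it.
func advanceTo(vector, components map[string]int64) {
	for node, run := range components {
		if run > vector[node] {
			vector[node] = run
		}
	}
}

func main() {
	present := []row{{"laptop", 3}}
	snapshot := presentOriginMaxima(present) // captured before the transfer

	present = append(present, row{"laptop", 4}) // committed mid-transfer

	vector := map[string]int64{}
	advanceTo(vector, snapshot)
	fmt.Println(vector["laptop"])                       // pinned advance
	fmt.Println(presentOriginMaxima(present)["laptop"]) // what a live read would claim
}
```

The pinned advance stops at run 3, while a post-transfer live read would have claimed run 4 for a row that was never transferred.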

#108 — kopia verify depth & scope (re-audit finding 2)

#109 — verification-method provenance (unchanged from the first cut)

  • destination_run_ids.verify_method records the method that advanced each
    component; store.ContentVerifiedMethod names the accepted ones.
  • A presence+size component gates only once remote_objects carries both a
    checksum and a verified_at_ns.
  • DurabilityComponent.verify_method carries the method verbatim to the puller.
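
The method condition reduces to a two-step check: an accepted method passes outright, and presence+size passes only with a verified fingerprint behind it. A hedged Go sketch follows. The function and type names are hypothetical, and using the literal strings "blake3", "peer-blake3", "kopia-verify", and "presence+size" as the stored identifiers is an assumption based on how the description names the methods.

```go
package main

import "fmt"

// contentVerifiedMethod names the methods the gate accepts directly.
// The string values are assumed from the PR description.
func contentVerifiedMethod(method string) bool {
	switch method {
	case "blake3", "peer-blake3", "kopia-verify":
		return true
	}
	return false
}

// remoteObject is a hypothetical view of a remote_objects row.
type remoteObject struct {
	Checksum     string
	VerifiedAtNs int64
}

// methodPasses applies the method condition. An empty or unknown method is
// treated as unverified, so the gate refuses.
func methodPasses(method string, obj *remoteObject) bool {
	if contentVerifiedMethod(method) {
		return true
	}
	if method == "presence+size" {
		return obj != nil && obj.Checksum != "" && obj.VerifiedAtNs != 0
	}
	return false
}

func main() {
	fmt.Println(methodPasses("kopia-verify", nil))
	fmt.Println(methodPasses("presence+size", &remoteObject{Checksum: "abc"}))
	fmt.Println(methodPasses("presence+size", &remoteObject{Checksum: "abc", VerifiedAtNs: 1}))
	fmt.Println(methodPasses("", nil)) // e.g. a component from a pre-v19 responder
}
```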

Schema changes

Additive migration v20 (migrateV19ToV20) — no existing migration edited;
store/schema.sql regenerated via go test ./store -update-schema:

  • new table destination_push_freshness(volume_id, destination, origin_node_id, origin_run_id, updated_at_ns), PK (volume_id, destination, origin_node_id).

(v19 from the first cut still adds verify_method to destination_run_ids and
its history.)

Re-audit findings disposition

Tests (this gates deletion)

go vet ./..., go test ./..., and golangci-lint run are clean.

Adds destination_run_ids.verify_method (and the history mirror) so the
offload gate can tell a content-verified durability component apart from
a presence-only one. Both columns are additive and nullable; existing
components backfill to NULL, which the gate reads as not-content-verified
so it refuses rather than over-claims.

AdvanceDestinationVectorTo advances the durability vector to exactly the
caller-supplied origin components (a push's own enumeration snapshot,
captured via the now-exported PresentOriginMaxima) instead of re-reading
the live present set after a transfer — closing the over-advance window
where a row committed mid-push is claimed durable though never
transferred (#103).

Components now carry the verification method that advanced them;
ContentVerifiedMethod names which methods (blake3/peer-blake3/
kopia-verify) the offload gate accepts. The peer-sync close-phase advance
keeps its live-set AdvanceDestinationVector, tagged peer-blake3.
UpsertDestinationRunIDVerified threads a pulled peer's method through.

LastSuccessfulWholeVolumePushRunID derives a destination's last
successful whole-volume push in local run space (the max successful sync
run id), the freshness watermark the offload gate requires to be at or
beyond a path's status_changed_run_id (#115). FileRow now exposes
status_changed_run_id for the gate.

BeginSyncRunIfClear refuses a new sync while an index or audit walks the
volume, so a sync never captures its enumeration snapshot against a
mutating tree (#103). The reverse (sync blocking index) is deliberately
not enforced: the advance is already snapshot-pinned, and blocking index
during sync would wedge the scheduler's index-before-sync invariant.

The gate now passes a present row only when, for every required target,
all three hold: the origin vector covers the content's origin run; the
target's last successful whole-volume push (local run space) is at or
beyond the run the path last became present (status_changed_run_id); and
the component is content-verified — blake3/peer-blake3/kopia-verify
directly, or a presence+size content-addressed component only once a
verified scan-back fingerprint (remote_objects.checksum + verified_at_ns)
backs the gated object.

Freshness closes the re-acquisition hole (#115) and the over-advance
windows (#103); method provenance holds presence+size offsites out of
the gate until verified (#109). The change is strictly stricter — it can
only refuse more, never delete more.

Bucket, content-addressed, and kopia pushes capture the volume's
present-set origin maxima before the transfer and advance the durability
vector to exactly that snapshot via AdvanceDestinationVectorTo, tagged
with the push's verification method (#103). The content-addressed push
records presence+size; kopia records kopia-verify.

kopia snapshot verify now passes an explicit --verify-files-percent
(configurable per destination via verify_files_percent, non-zero default
10) so verification reads a real fraction of file bytes rather than
kopia's bytes-free default, and the advance is scoped to the captured
present set rather than kopia's independent live walk (#108).

VerifyMethod constants alias the canonical identifiers now owned by
store, keeping the reported method and the recorded method in lockstep.
DurabilityComponent gains an additive verify_method field so a peer's
recorded verification method travels verbatim to the puller, whose
offload gate weighs a pulled component exactly as the responder did
(empty for a pre-v19 responder, read as unverified). Without it a peer
relaying a presence+size offsite component would gate like a content-
verified one. The agent durability endpoint populates it from the row.

cmd offload tests seed the freshness push run and a verified method so
the stricter gate's happy paths still pass.

Copilot AI review requested due to automatic review settings June 11, 2026 02:45

A target whose vector arrives only via the peer durability pull (no
local push to it) is held out by the freshness condition — the watermark
is in local run space, and the laptop cannot locally verify a re-acquired
path was re-pushed on a hop it never performs. Strictly stricter; add a
test pinning the behavior and clarify the peer-pulled-evidence test.

mbertschler (Owner, Author) commented:

Follow-up note on a #115 judgment call worth a second look:

The freshness watermark is in local run space and is applied uniformly to every required target. A consequence: a target whose vector arrives only via the peer durability pull (a destination the laptop never pushes to directly, e.g. an offsite only the NAS can reach) has a local-push watermark of 0, so it is held out of the gate until a local whole-volume push covers the path.

This is the mandated strictly-stricter direction (refuse more, never delete more): the laptop genuinely cannot verify, from a local run, that a re-acquired path was re-pushed on a hop it never performs, and the re-acquisition hole exists cross-node too (local-origin content keeps a frozen first_seen/origin run across a revive, so a stale pulled vector would still "cover" it). The design's "trust the NAS's recorded evidence" covers durability presence, not freshness of a local re-acquisition. Pinned by TestOffloadPeerRelayedTargetNeedsLocalPush.

If cross-node offload of peer-relayed-only targets is desired later, the sound way is to carry a freshness coordinate over the durability-pull wire (the responder's last-push run in its own space, mapped through the pulled vector) — out of scope here and strictly an unblock, not a tightening.

Copilot AI left a comment

Pull request overview

This PR hardens offload’s durability gate (the only data-deleting command) by tightening the evidence required to delete local bytes: adding a per-destination freshness watermark, recording per-component verification method provenance, and ensuring post-sync durability vector advances are pinned to the push’s own enumeration snapshot.

Changes:

  • Add freshness gating: require each destination’s last successful whole-volume push run (local run space) to be ≥ the file’s status_changed_run_id.
  • Add verification-method provenance end-to-end (schema v19 + wire field), and enforce method-aware gating (including fingerprint-required upgrade for presence+size).
  • Make durability vector advances snapshot-pinned via PresentOriginMaxima + AdvanceDestinationVectorTo, and add cross-kind run exclusion to prevent sync enumerations racing with index/audit.

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| syncproto/syncproto.go | Adds verify_method to durability components so peers transmit provenance. |
| agent/durability.go | Includes VerifyMethod in the durability response payload. |
| sync/durability.go | Persists pulled durability components with their verification method. |
| store/destination_run_ids.go | Adds verify-method constants, method-aware upsert, snapshot-pinned advance API, and PresentOriginMaxima. |
| store/migrations.go | Introduces schema v19 migration adding verify_method columns. |
| store/schema.sql | Updates flattened schema snapshot to v19 with new columns. |
| store/destination_run_ids_test.go | Adds migration + method-provenance + snapshot-pinned advance tests. |
| store/files.go | Extends FileRow reads to include status_changed_run_id for freshness gating. |
| store/runs.go | Adds cross-kind sync exclusion and LastSuccessfulWholeVolumePushRunID watermark query. |
| store/store_test.go | Adds tests for new cross-kind sync/index run gating behavior. |
| sync/sync.go | Captures enumeration snapshot pre-transfer and advances vector via AdvanceDestinationVectorTo. |
| sync/content_addressed.go | Switches content-addressed advance to snapshot-pinned AdvanceDestinationVectorTo and records presence+size. |
| sync/content_addressed_test.go | Asserts presence+size is recorded and is not content-verified. |
| sync/handler.go | Re-exports verification method identifiers from store as the source of truth. |
| sync/kopia.go | Adds --verify-files-percent (configurable, non-zero default) and pins kopia advance scope to the captured snapshot. |
| sync/kopia_test.go | Updates argv expectations and adds tests for configurable/default verify percent + pinned advance scope. |
| config/destinations.go | Allows kopia destination verify_files_percent param in config schema validation. |
| offload/gate.go | Implements freshness + method-aware durability gate (including fingerprint-backed upgrade for presence+size). |
| offload/offload_test.go | Updates durability seeding to include method + push watermark evidence for the new gate. |
| offload/durability_soundness_test.go | Adds targeted tests for freshness, snapshot-pinning behavior, and method provenance gating. |
| cmd/squirrel/offload_test.go | Updates CLI fixture durability evidence to include method + push watermark. |

Comment thread sync/sync.go
Comment on lines +272 to +276
	if !opts.DryRun {
		if rep.durabilityAdvance, err = captureDurabilityAdvance(ctx, s, volID); err != nil {
			return rep, err
		}
	}

mbertschler (Owner, Author):

Good catch — fixed in 49d48e8. Both the bucket (Sync) and kopia paths now close the runs row as failed (synthetic FatalError result → finishRun/finishHandlerRun) before returning on a capture error, mirroring the existing buildArgs-error path, so a capture failure before rclone/kopia starts can no longer leave the row stuck in 'running'.

Comment on lines 296 to 300
 ON CONFLICT(volume_id, destination, origin_node_id) DO UPDATE SET
     origin_run_id = excluded.origin_run_id,
-    updated_at_ns = excluded.updated_at_ns
+    updated_at_ns = excluded.updated_at_ns,
+    verify_method = excluded.verify_method
 WHERE excluded.origin_run_id >= destination_run_ids.origin_run_id OR ?

mbertschler (Owner, Author):

Fixed in 49d48e8. verify_method is now a CASE: a non-NULL incoming method always wins; an empty method clears it only on a strict origin_run_id advance (a genuinely unverified new coordinate); on an unchanged run an empty method preserves the existing one — so a methodless re-confirmation (pull from a pre-v19 peer) no longer degrades a content-verified component. TestUpsertDestinationRunIDPreservesMethodOnMethodlessReconfirm pins both directions.
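
The CASE described in this reply encodes three rules, which can be written out as a plain function for clarity. This Go sketch covers only the choice of method for an upsert that is actually applied; it does not model the WHERE clause that decides whether the row is updated at all. The function name resolveMethod and the use of an empty string for a NULL method are assumptions of the sketch.

```go
package main

import "fmt"

// resolveMethod returns the verify_method to store when an upsert is applied.
// An empty string stands in for a NULL (unknown) method.
func resolveMethod(existingRun int64, existingMethod string, incomingRun int64, incomingMethod string) string {
	if incomingMethod != "" {
		return incomingMethod // a known incoming method always wins
	}
	if incomingRun > existingRun {
		return "" // strict advance with no method: a genuinely unverified coordinate
	}
	return existingMethod // methodless re-confirmation keeps what was recorded
}

func main() {
	fmt.Printf("%q\n", resolveMethod(5, "blake3", 5, ""))              // preserved
	fmt.Printf("%q\n", resolveMethod(5, "blake3", 6, ""))              // cleared
	fmt.Printf("%q\n", resolveMethod(5, "presence+size", 5, "blake3")) // replaced
}
```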

Comment thread store/runs.go Outdated
Comment on lines +518 to +520
// Cross-kind exclusion: an in-flight index or audit run on the volume
// blocks a sync (and BeginIndexRunIfClear blocks the reverse). A sync
// advances the destination's durability vector from the present set its

mbertschler (Owner, Author):

Fixed in 49d48e8 — the doc now states the one-directional contract explicitly: an in-flight index/audit blocks a new sync, but a running sync does NOT block a new index (cross-referencing BeginIndexRunIfClear for the rationale).

- sync: close the runs row as failed when captureDurabilityAdvance fails
  after the row is allocated, so a capture error before rclone/kopia
  starts can't leave it stuck in 'running' (bucket and kopia paths).
- store: preserve verify_method on a methodless re-confirmation at an
  unchanged origin run, so a pull from a pre-v19 peer never degrades a
  content-verified component to unknown; a methodless strict advance
  still clears it (soundness). Test pins both.
- store: fix the BeginSyncRunIfClear doc to state the one-directional
  cross-kind block (a running sync does not block a new index).

The offload gate's freshness condition compares a target's last
successful whole-volume push against a gated path. For a target the local
node never pushes to (a peer-relayed offsite, named in offload_requires
but absent from sync_to) that watermark is local-run-space and always 0,
so every file fails freshness forever.

Add destination_push_freshness (migration v20): the per-origin-node
maxima of the present set captured at a destination's most recent
successful whole-volume push, in origin-space coordinates the gate
already holds for the gated content. AdvanceDestinationVectorTo records
it from the same pinned snapshot it advances the monotonic vector with,
so every gating push (rclone bucket, content-addressed, kopia, and the
peer leg once snapshot-pinned) leaves freshness behind. The push side
overwrites (latest-push truth); a later pull merges monotonically.

Drop the live-read AdvanceDestinationVector — every production caller now
pins to a captured snapshot via AdvanceDestinationVectorTo; its two store
tests move onto the snapshot path.

DurabilityResponse gains a Freshness list: per (destination, origin
node) the highest origin run a successful whole-volume push to that
destination covered, expressed in origin space. The responder (the node
that actually pushes to the relayed target) reports its recorded
push-freshness alongside the monotonic vector; the puller merges it into
its local destination_push_freshness monotonically, accumulating the
highest coordinate ever observed (append-only targets make that the
soundest cached fact).

A node that never pushes to a relayed offsite can now satisfy its offload
gate's freshness condition from the pushing node's determination, in the
same origin coordinates the gated content carries. The field is additive
and omitted by an older responder; a puller that finds no freshness for a
relayed target refuses to offload.

phaseClose advanced the peer's durability vector from a post-transfer
live read of the present set, so a row indexed between enumeration and
close could be folded into the advance — the #103 over-advance window the
bucket, content-addressed, and kopia handlers already closed by pinning
to a PresentOriginMaxima snapshot captured before the transfer.

Capture that snapshot before phaseTransfer and feed it to
AdvanceDestinationVectorTo with VerifyMethodPeer in phaseClose, matching
the other three handlers. The same call now also records origin-space
push-freshness, so the peer leg feeds the relayed-target freshness
mechanism.

verify_files_percent = "0" was accepted and still tagged the advance
kopia-verify (content-verified), but kopia reads zero file bytes at 0 —
re-opening the #108 hole by config. A kopia component gates offload as
content-verified, so a zero-byte verify would delete the only local copy
on a check that read none of the content.

Require a value in (0, 100]; zero, negative, and out-of-range values are
configuration errors. The end-to-end push fails and advances nothing on a
rejected percent.
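
A range check of this kind might look like the following Go sketch. It is not the code in config/destinations.go or sync/kopia.go: the function name parseVerifyFilesPercent is hypothetical, parsing the value as a float (rather than an integer) is an assumption, and falling back to the default of 10 on an empty string is an assumption about how the default is applied.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultVerifyFilesPercent is the non-zero default named in the PR.
const defaultVerifyFilesPercent = 10.0

// parseVerifyFilesPercent accepts only values in (0, 100].
// The negated range check also rejects NaN, which ParseFloat can return.
func parseVerifyFilesPercent(raw string) (float64, error) {
	if raw == "" {
		return defaultVerifyFilesPercent, nil
	}
	v, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("verify_files_percent %q: %w", raw, err)
	}
	if !(v > 0 && v <= 100) {
		return 0, fmt.Errorf("verify_files_percent %q: must be in (0, 100]", raw)
	}
	return v, nil
}

func main() {
	for _, raw := range []string{"", "25", "0", "-5", "101"} {
		v, err := parseVerifyFilesPercent(raw)
		fmt.Printf("%q -> %v, err=%v\n", raw, v, err)
	}
}
```

Rejecting the value at configuration time, rather than clamping it, keeps a misconfigured destination from silently recording a kopia-verify advance that read no file bytes.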

The #115 freshness condition wedged the headline workflow: it required,
per required target, a LOCAL successful whole-volume push at or beyond a
gated path's became-present run. A peer-relayed offsite — one the NAS
pushes to but the laptop never does — has zero local push runs, so the
watermark is always 0 and every file fails freshness forever. Offload
became a permanent silent no-op for the offsite tier.

Branch the freshness condition by whether this node pushes to the target
directly. A locally-pushed target keeps the local-run-space watermark vs
status_changed_run_id (#115, unchanged). A relayed target (no local push)
gates the gated content's origin run against the pulled origin-space
push-freshness coordinate for that origin node — the pushing node's own
freshness determination, carried over the wire. Absence of freshness
evidence still refuses, the safe direction.

The soundness test that documented the wedge as intended is replaced by
the anti-wedge proof: a relayed target now offloads when pulled freshness
covers the content's origin run, and refuses when it is below it or
absent.
2 participants