offload: make the durability gate sound (freshness + method provenance + snapshot-pinned advance) #120
Adds destination_run_ids.verify_method (and the history mirror) so the offload gate can tell a content-verified durability component apart from a presence-only one. Both columns are additive and nullable; existing components backfill to NULL, which the gate reads as not-content-verified so it refuses rather than over-claims.
AdvanceDestinationVectorTo advances the durability vector to exactly the caller-supplied origin components (a push's own enumeration snapshot, captured via the now-exported PresentOriginMaxima) instead of re-reading the live present set after a transfer, closing the over-advance window where a row committed mid-push is claimed durable though never transferred (#103). Components now carry the verification method that advanced them; ContentVerifiedMethod names which methods (blake3/peer-blake3/kopia-verify) the offload gate accepts. The peer-sync close-phase advance keeps its live-set AdvanceDestinationVector, tagged peer-blake3. UpsertDestinationRunIDVerified threads a pulled peer's method through.
LastSuccessfulWholeVolumePushRunID derives a destination's last successful whole-volume push in local run space (the max successful sync run id), the freshness watermark the offload gate requires to be at or beyond a path's status_changed_run_id (#115). FileRow now exposes status_changed_run_id for the gate. BeginSyncRunIfClear refuses a new sync while an index or audit walks the volume, so a sync never captures its enumeration snapshot against a mutating tree (#103). The reverse (sync blocking index) is deliberately not enforced: the advance is already snapshot-pinned, and blocking index during sync would wedge the scheduler's index-before-sync invariant.
The gate now passes a present row only when, for every required target, all three hold: the origin vector covers the content's origin run; the target's last successful whole-volume push (local run space) is at or beyond the run the path last became present (status_changed_run_id); and the component is content-verified — blake3/peer-blake3/kopia-verify directly, or a presence+size content-addressed component only once a verified scan-back fingerprint (remote_objects.checksum + verified_at_ns) backs the gated object. Freshness closes the re-acquisition hole (#115) and the over-advance windows (#103); method provenance holds presence+size offsites out of the gate until verified (#109). The change is strictly stricter — it can only refuse more, never delete more.
Bucket, content-addressed, and kopia pushes capture the volume's present-set origin maxima before the transfer and advance the durability vector to exactly that snapshot via AdvanceDestinationVectorTo, tagged with the push's verification method (#103). The content-addressed push records presence+size; kopia records kopia-verify. kopia snapshot verify now passes an explicit --verify-files-percent (configurable per destination via verify_files_percent, non-zero default 10) so verification reads a real fraction of file bytes rather than kopia's bytes-free default, and the advance is scoped to the captured present set rather than kopia's independent live walk (#108). VerifyMethod constants alias the canonical identifiers now owned by store, keeping the reported method and the recorded method in lockstep.
DurabilityComponent gains an additive verify_method field so a peer's recorded verification method travels verbatim to the puller, whose offload gate weighs a pulled component exactly as the responder did (empty for a pre-v19 responder, read as unverified). Without it a peer relaying a presence+size offsite component would gate like a content-verified one. The agent durability endpoint populates it from the row. cmd offload tests seed the freshness push run and a verified method so the stricter gate's happy paths still pass.
A target whose vector arrives only via the peer durability pull (no local push to it) is held out by the freshness condition — the watermark is in local run space, and the laptop cannot locally verify a re-acquired path was re-pushed on a hop it never performs. Strictly stricter; add a test pinning the behavior and clarify the peer-pulled-evidence test.
Follow-up note on a #115 judgment call worth a second look: The freshness watermark is in local run space and is applied uniformly to every required target. A consequence: a target whose vector arrives only via the peer durability pull (a destination the laptop never pushes to directly, e.g. an offsite only the NAS can reach) has a local-push watermark of 0, so it is held out of the gate until a local whole-volume push covers the path. This is the mandated strictly-stricter direction (refuse more, never delete more): the laptop genuinely cannot verify, from a local run, that a re-acquired path was re-pushed on a hop it never performs, and the re-acquisition hole exists cross-node too (local-origin content keeps a frozen …).

If cross-node offload of peer-relayed-only targets is desired later, the sound way is to carry a freshness coordinate over the durability-pull wire (the responder's last-push run in its own space, mapped through the pulled vector). That is out of scope here and strictly an unblock, not a tightening.
Pull request overview
This PR hardens offload’s durability gate (the only data-deleting command) by tightening the evidence required to delete local bytes: adding a per-destination freshness watermark, recording per-component verification method provenance, and ensuring post-sync durability vector advances are pinned to the push’s own enumeration snapshot.
Changes:
- Add freshness gating: require each destination’s last successful whole-volume push run (local run space) to be ≥ the file’s status_changed_run_id.
- Add verification-method provenance end-to-end (schema v19 + wire field), and enforce method-aware gating (including fingerprint-required upgrade for presence+size).
- Make durability vector advances snapshot-pinned via PresentOriginMaxima + AdvanceDestinationVectorTo, and add cross-kind run exclusion to prevent sync enumerations racing with index/audit.
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| syncproto/syncproto.go | Adds verify_method to durability components so peers transmit provenance. |
| agent/durability.go | Includes VerifyMethod in the durability response payload. |
| sync/durability.go | Persists pulled durability components with their verification method. |
| store/destination_run_ids.go | Adds verify-method constants, method-aware upsert, snapshot-pinned advance API, and PresentOriginMaxima. |
| store/migrations.go | Introduces schema v19 migration adding verify_method columns. |
| store/schema.sql | Updates flattened schema snapshot to v19 with new columns. |
| store/destination_run_ids_test.go | Adds migration + method-provenance + snapshot-pinned advance tests. |
| store/files.go | Extends FileRow reads to include status_changed_run_id for freshness gating. |
| store/runs.go | Adds cross-kind sync exclusion and LastSuccessfulWholeVolumePushRunID watermark query. |
| store/store_test.go | Adds tests for new cross-kind sync/index run gating behavior. |
| sync/sync.go | Captures enumeration snapshot pre-transfer and advances vector via AdvanceDestinationVectorTo. |
| sync/content_addressed.go | Switches content-addressed advance to snapshot-pinned AdvanceDestinationVectorTo and records presence+size. |
| sync/content_addressed_test.go | Asserts presence+size is recorded and is not content-verified. |
| sync/handler.go | Re-exports verification method identifiers from store as the source of truth. |
| sync/kopia.go | Adds --verify-files-percent (configurable, non-zero default) and pins kopia advance scope to the captured snapshot. |
| sync/kopia_test.go | Updates argv expectations and adds tests for configurable/default verify percent + pinned advance scope. |
| config/destinations.go | Allows kopia destination verify_files_percent param in config schema validation. |
| offload/gate.go | Implements freshness + method-aware durability gate (including fingerprint-backed upgrade for presence+size). |
| offload/offload_test.go | Updates durability seeding to include method + push watermark evidence for the new gate. |
| offload/durability_soundness_test.go | Adds targeted tests for freshness, snapshot-pinning behavior, and method provenance gating. |
| cmd/squirrel/offload_test.go | Updates CLI fixture durability evidence to include method + push watermark. |
    if !opts.DryRun {
        if rep.durabilityAdvance, err = captureDurabilityAdvance(ctx, s, volID); err != nil {
            return rep, err
        }
    }
Good catch — fixed in 49d48e8. Both the bucket (Sync) and kopia paths now close the runs row as failed (synthetic FatalError result → finishRun/finishHandlerRun) before returning on a capture error, mirroring the existing buildArgs-error path, so a capture failure before rclone/kopia starts can no longer leave the row stuck in 'running'.
      ON CONFLICT(volume_id, destination, origin_node_id) DO UPDATE SET
        origin_run_id = excluded.origin_run_id,
    -   updated_at_ns = excluded.updated_at_ns
    +   updated_at_ns = excluded.updated_at_ns,
    +   verify_method = excluded.verify_method
      WHERE excluded.origin_run_id >= destination_run_ids.origin_run_id OR ?
Fixed in 49d48e8. verify_method is now a CASE: a non-NULL incoming method always wins; an empty method clears it only on a strict origin_run_id advance (a genuinely unverified new coordinate); on an unchanged run an empty method preserves the existing one — so a methodless re-confirmation (pull from a pre-v19 peer) no longer degrades a content-verified component. TestUpsertDestinationRunIDPreservesMethodOnMethodlessReconfirm pins both directions.
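The merge rule in this reply can be sketched in Go as a pure function. This is a minimal illustration, not the SQL that shipped: the `component` type and `mergeComponent` are hypothetical names, an empty string models a NULL method, and the trailing `OR ?` override in the WHERE clause is left out.

```go
package main

import "fmt"

// component is a hypothetical stand-in for one destination_run_ids row.
// An empty VerifyMethod models SQL NULL (read as not content-verified).
type component struct {
	OriginRunID  int64
	VerifyMethod string
}

// mergeComponent mirrors the described upsert: an incoming row below the
// stored run is ignored; otherwise the run is taken and the method follows
// the three-way rule from the reply above.
func mergeComponent(existing, incoming component) component {
	if incoming.OriginRunID < existing.OriginRunID {
		return existing // stale coordinate: the WHERE clause rejects it
	}
	out := component{OriginRunID: incoming.OriginRunID}
	switch {
	case incoming.VerifyMethod != "":
		// A non-empty incoming method always wins.
		out.VerifyMethod = incoming.VerifyMethod
	case incoming.OriginRunID > existing.OriginRunID:
		// Methodless strict advance: a genuinely unverified new coordinate.
		out.VerifyMethod = ""
	default:
		// Methodless re-confirmation at an unchanged run: keep the method.
		out.VerifyMethod = existing.VerifyMethod
	}
	return out
}

func main() {
	verified := component{OriginRunID: 5, VerifyMethod: "blake3"}
	fmt.Printf("%+v\n", mergeComponent(verified, component{OriginRunID: 5}))
	fmt.Printf("%+v\n", mergeComponent(verified, component{OriginRunID: 6}))
}
```

The first call keeps `blake3` (a pull from a pre-v19 peer at the same run); the second clears it because the run strictly advanced without a method.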
    // Cross-kind exclusion: an in-flight index or audit run on the volume
    // blocks a sync (and BeginIndexRunIfClear blocks the reverse). A sync
    // advances the destination's durability vector from the present set its
Fixed in 49d48e8 — the doc now states the one-directional contract explicitly: an in-flight index/audit blocks a new sync, but a running sync does NOT block a new index (cross-referencing BeginIndexRunIfClear for the rationale).
- sync: close the runs row as failed when captureDurabilityAdvance fails after the row is allocated, so a capture error before rclone/kopia starts can't leave it stuck in 'running' (bucket and kopia paths).
- store: preserve verify_method on a methodless re-confirmation at an unchanged origin run, so a pull from a pre-v19 peer never degrades a content-verified component to unknown; a methodless strict advance still clears it (soundness). Test pins both.
- store: fix the BeginSyncRunIfClear doc to state the one-directional cross-kind block (a running sync does not block a new index).
The offload gate's freshness condition compares a target's last successful whole-volume push against a gated path. For a target the local node never pushes to (a peer-relayed offsite, named in offload_requires but absent from sync_to) that watermark is local-run-space and always 0, so every file fails freshness forever. Add destination_push_freshness (migration v20): the per-origin-node maxima of the present set captured at a destination's most recent successful whole-volume push, in origin-space coordinates the gate already holds for the gated content. AdvanceDestinationVectorTo records it from the same pinned snapshot it advances the monotonic vector with, so every gating push (rclone bucket, content-addressed, kopia, and the peer leg once snapshot-pinned) leaves freshness behind. The push side overwrites (latest-push truth); a later pull merges monotonically. Drop the live-read AdvanceDestinationVector — every production caller now pins to a captured snapshot via AdvanceDestinationVectorTo; its two store tests move onto the snapshot path.
DurabilityResponse gains a Freshness list — per (destination, origin node) the highest origin run a successful whole-volume push to that destination covered, expressed in origin space. The responder (the node that actually pushes to the relayed target) reports its recorded push-freshness alongside the monotonic vector; the puller merges it into its local destination_push_freshness monotonically, accumulating the highest coordinate ever observed (append-only targets make that the soundest cached fact). A node that never pushes to a relayed offsite can now satisfy its offload gate's freshness condition from the pushing node's determination, in the same origin coordinates the gated content carries. The field is additive and omitted by an older responder; a puller that finds no freshness for a relayed target refuses to offload.
phaseClose advanced the peer's durability vector from a post-transfer live read of the present set, so a row indexed between enumeration and close could be folded into the advance — the #103 over-advance window the bucket, content-addressed, and kopia handlers already closed by pinning to a PresentOriginMaxima snapshot captured before the transfer. Capture that snapshot before phaseTransfer and feed it to AdvanceDestinationVectorTo with VerifyMethodPeer in phaseClose, matching the other three handlers. The same call now also records origin-space push-freshness, so the peer leg feeds the relayed-target freshness mechanism.
verify_files_percent = "0" was accepted and still tagged the advance kopia-verify (content-verified), but kopia reads zero file bytes at 0 — re-opening the #108 hole by config. A kopia component gates offload as content-verified, so a zero-byte verify would delete the only local copy on a check that read none of the content. Require a value in (0, 100]; zero, negative, and out-of-range values are configuration errors. The end-to-end push fails and advances nothing on a rejected percent.
The #115 freshness condition wedged the headline workflow: it required, per required target, a LOCAL successful whole-volume push at or beyond a gated path's became-present run. A peer-relayed offsite (one the NAS pushes to but the laptop never does) has zero local push runs, so the watermark is always 0 and every file fails freshness forever. Offload became a permanent silent no-op for the offsite tier.

Branch the freshness condition by whether this node pushes to the target directly. A locally-pushed target keeps the local-run-space watermark vs status_changed_run_id (#115, unchanged). A relayed target (no local push) gates the gated content's origin run against the pulled origin-space push-freshness coordinate for that origin node: the pushing node's own freshness determination, carried over the wire. Absence of freshness evidence still refuses, the safe direction.

The soundness test that documented the wedge as intended is replaced by the anti-wedge proof: a relayed target now offloads when pulled freshness covers the content's origin run, and refuses when it is below it or absent.
Makes the offload durability gate sound.
offload is the only command that deletes user data, so every change here moves the gate strictly toward
"refuse more / claim less" — it can only ever start refusing offload, never
start permitting it — with one deliberate exception: the relayed-target
freshness fix below, which un-wedges a workflow the first cut of #115 broke.
Closes #115
Closes #103
Closes #108
Closes #109
A re-audit of the first cut found that the #115 freshness condition, while
sound, wedged the headline workflow for peer-relayed offsites, and that the
#103 fix left the peer leg unpinned. Both are fixed here; see the offsite-wedge
section.
The gate, after this PR
A present row offloads only when, for every required target, all three hold:
1. Coverage: the target's durability vector for the content's origin
   node covers the content's origin run.
2. Freshness: the target's last successful whole-volume push is at or beyond
   the run in which the path last became present. For a target this node pushes
   to directly the watermark is the last push in local run space
   (vs files.status_changed_run_id); for a relayed target it never pushes
   to, the watermark is the pulled origin-space push-freshness coordinate
   for the content's origin node (vs the content's origin run).
3. Method provenance: the component is content-verified.
   blake3/peer-blake3/kopia-verify pass directly; a presence+size
   content-addressed component passes only once a verified scan-back fingerprint
   backs the gated object.
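As a rough illustration of how the three conditions compose, here is a minimal Go sketch of the per-target check. Every type, field, and function name is hypothetical (the real gate lives in offload/gate.go and reads the store); refusing an empty target list is an assumption made here for the safe direction, not something the PR states.

```go
package main

import "fmt"

// file and target are hypothetical in-memory stand-ins for the rows the
// gate reads; none of these names come from the PR.
type file struct {
	OriginRun        int64 // origin run of the content
	StatusChangedRun int64 // local run in which the path last became present
}

type target struct {
	PushedLocally      bool   // this node syncs to the target directly
	VectorRun          int64  // durability vector component for the content's origin node
	LastLocalPushRun   int64  // local-run-space watermark (0 when never pushed)
	HasPulledFreshness bool   // a push-freshness coordinate was pulled from a peer
	PulledFreshnessRun int64  // origin-space coordinate for the content's origin node
	VerifyMethod       string // "" models an unknown method
	FingerprintOK      bool   // a verified scan-back fingerprint backs the object
}

// contentVerified accepts the three direct methods, and presence+size only
// with a verified fingerprint.
func contentVerified(t target) bool {
	switch t.VerifyMethod {
	case "blake3", "peer-blake3", "kopia-verify":
		return true
	case "presence+size":
		return t.FingerprintOK
	}
	return false
}

// fresh branches on whether this node pushes to the target directly.
func fresh(f file, t target) bool {
	if t.PushedLocally {
		return t.LastLocalPushRun >= f.StatusChangedRun
	}
	// Relayed target: absence of freshness evidence refuses.
	return t.HasPulledFreshness && t.PulledFreshnessRun >= f.OriginRun
}

// gate passes only when every required target satisfies all three conditions.
func gate(f file, required []target) bool {
	if len(required) == 0 {
		return false // assumption of this sketch: no targets, no offload
	}
	for _, t := range required {
		if !(t.VectorRun >= f.OriginRun && fresh(f, t) && contentVerified(t)) {
			return false
		}
	}
	return true
}

func main() {
	f := file{OriginRun: 4, StatusChangedRun: 7}
	local := target{PushedLocally: true, VectorRun: 5, LastLocalPushRun: 8, VerifyMethod: "blake3"}
	relayed := target{VectorRun: 5, HasPulledFreshness: true, PulledFreshnessRun: 4, VerifyMethod: "presence+size"}
	fmt.Println(gate(f, []target{local, relayed})) // false: presence+size has no fingerprint yet
	relayed.FingerprintOK = true
	fmt.Println(gate(f, []target{local, relayed})) // true
}
```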
Offsite wedge fix — relayed freshness travels the wire (BLOCKER)
The wedge. #115's freshness watermark is LastSuccessfulWholeVolumePushRunID
in local run space. For a peer-relayed target (an offsite the NAS pushes to
but the laptop never syncs to directly, named in offload_requires and absent
from the laptop's sync_to) the laptop has zero local sync runs to it, so the
watermark is always 0 and every file fails freshness forever. Offload became
a permanent silent no-op for the offsite tier, exactly the workflow the feature
exists for. (The first cut's TestOffloadPeerRelayedTargetNeedsLocalPush
documented this as "deliberate"; it was the wedge.)
The fix (per design §3's origin-space model). Carry a whole-volume-push
freshness coordinate over the durability-pull wire, expressed in origin
space — the same coordinate space the durability vector already uses, so it
composes across hops with no translation and reuses the gated content's own
origin coordinate (which the requester already holds).
- destination_push_freshness table (migration v20): per
  (volume, destination, origin_node), the per-origin-node maxima of the
  present set captured at the destination's most recent successful whole-volume
  push. It is non-monotonic (overwritten per push), distinct from the
  monotonic durability vector, so it answers "is this origin run covered by a
  fresh push", which the vector's all-time max cannot.
- AdvanceDestinationVectorTo records it from exactly the pinned snapshot it
  advances the vector with, so every gating push (rclone bucket,
  content-addressed, kopia, and the now-pinned peer leg) leaves freshness behind
  for free.
- DurabilityResponse gains a Freshness list. The responder (the
  node that actually pushes to the relayed target) reports its recorded
  push-freshness; the puller merges it into its local table monotonically
  (highest coordinate ever observed). Append-only targets make this sound: once
  a push covered origin run R, everything ≤ R of that node was pushed and
  persists, so a stale pull never needs to lower the puller, and re-acquired
  old content is still durable.
- For a relayed target, the gate checks the gated
  content's origin run against the pulled freshness for that origin node. No
  freshness evidence ⇒ refuse (the safe direction). A locally-pushed target is
  unchanged.
Why origin-space per-node max is sound here (confirmed against the vector
math): the gating destinations are append-only (offsites are
content-addressed append-only; sync never deletes; kopia retains snapshots), and
advances are snapshot-pinned (#103). So "origin run r of node N was covered
by a fresh push" + the monotonic vector ⇒ that content was pushed and remains.
The only corner is a relayed file whose origin run exceeds the latest push's
maxima (content the pushing node has since dropped from its present set): the
gate refuses it even though the append-only target still holds it — over-conservative
but safe, and outside the common growing-corpus case (the maxima climbs
monotonically when the source of truth never loses content).
#115 — local freshness watermark (unchanged from the first cut)
Derived at gate time:
LastSuccessfulWholeVolumePushRunID(volume, destination) = MAX(id) of
kind='sync' status='success' rows for the pair. A path that became present
after the last push fails freshness regardless of vector state, which also
catches the #103 over-advance windows. No new column for this watermark.
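The watermark derivation amounts to a filtered maximum. A minimal Go sketch over an in-memory slice, assuming a hypothetical `run` struct in place of the runs table (the real query runs in SQL inside store/runs.go):

```go
package main

import "fmt"

// run is a hypothetical stand-in for one row of the runs table.
type run struct {
	ID          int64
	Volume      string
	Destination string
	Kind        string
	Status      string
}

// lastSuccessfulPushRunID returns the highest id among successful sync runs
// for the (volume, destination) pair, or 0 when there is none.
func lastSuccessfulPushRunID(runs []run, volume, destination string) int64 {
	var watermark int64
	for _, r := range runs {
		if r.Volume == volume && r.Destination == destination &&
			r.Kind == "sync" && r.Status == "success" && r.ID > watermark {
			watermark = r.ID
		}
	}
	return watermark
}

func main() {
	runs := []run{
		{ID: 3, Volume: "vol", Destination: "bucket", Kind: "sync", Status: "success"},
		{ID: 4, Volume: "vol", Destination: "bucket", Kind: "index", Status: "success"},
		{ID: 5, Volume: "vol", Destination: "bucket", Kind: "sync", Status: "failed"},
	}
	watermark := lastSuccessfulPushRunID(runs, "vol", "bucket")
	statusChangedRunID := int64(4) // path became present in run 4
	fmt.Println(watermark, watermark >= statusChangedRunID)
}
```

Here the failed sync at run 5 and the index run at 4 do not count, so the watermark stays at 3 and a path that became present in run 4 fails freshness.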
#103 — advance from the push's own enumeration, not the live table (now CLOSED)
- Bucket, content-addressed, and kopia pushes capture PresentOriginMaxima before
  the transfer and advance to exactly that snapshot via
  AdvanceDestinationVectorTo(vol, dest, method, components).
- sync/node.go phaseClose previously advanced from a post-transfer live read
  (AdvanceDestinationVector). It now captures the snapshot before
  phaseTransfer and feeds it to AdvanceDestinationVectorTo(…, peer-blake3, …),
  matching the other three handlers. The live-read AdvanceDestinationVector is
  removed (no remaining caller); its two store tests move onto the snapshot path.
  A peer-path snapshot-pin test proves a row committed mid-transfer is not
  covered. This closes the #103 residual the first cut left open.
- BeginSyncRunIfClear refuses a new sync while an
  index or audit walks the volume (asymmetry rationale unchanged).
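To make the over-advance window concrete, the following Go sketch contrasts a snapshot captured before the transfer with a live read taken afterwards. The lowercase `presentOriginMaxima` and `advanceVectorTo` are simplified, hypothetical stand-ins for the store functions of the same names; they only model the per-node maximum and the monotonic advance.

```go
package main

import "fmt"

// row is a hypothetical present-set entry: content that originated on
// OriginNode during that node's run OriginRun.
type row struct {
	OriginNode string
	OriginRun  int64
}

// presentOriginMaxima returns the highest origin run per origin node. The
// result is a fresh map, so rows committed later cannot change it.
func presentOriginMaxima(rows []row) map[string]int64 {
	maxima := make(map[string]int64)
	for _, r := range rows {
		if r.OriginRun > maxima[r.OriginNode] {
			maxima[r.OriginNode] = r.OriginRun
		}
	}
	return maxima
}

// advanceVectorTo raises the vector to the supplied components and never
// lowers an existing component.
func advanceVectorTo(vector, components map[string]int64) {
	for node, run := range components {
		if run > vector[node] {
			vector[node] = run
		}
	}
}

func main() {
	rows := []row{{"nas", 3}, {"laptop", 5}}
	snapshot := presentOriginMaxima(rows) // captured before the transfer

	// An index run commits a new row while the transfer is in flight;
	// the transfer never sees it.
	rows = append(rows, row{"laptop", 6})

	vector := map[string]int64{}
	advanceVectorTo(vector, snapshot)
	fmt.Println(vector["laptop"])                    // 5: pinned to what was enumerated
	fmt.Println(presentOriginMaxima(rows)["laptop"]) // 6: what a live read would have claimed
}
```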
#108 — kopia verify depth & scope (re-audit finding 2)
- kopia snapshot verify passes an explicit --verify-files-percent,
  configurable via verify_files_percent, default non-zero 10.
- verify_files_percent = "0" was accepted yet still tagged the
  advance kopia-verify despite reading zero file bytes, re-opening #108 by
  config. The value is now required in (0, 100]; zero/negative/out-of-range
  are configuration errors and the push fails (advancing nothing). Pinned by a
  unit test over the parser and an end-to-end push test.
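A range check of this shape might look like the Go sketch below. It is an assumption-laden illustration: `parseVerifyFilesPercent` is a hypothetical name, and treating an empty value as "use the default of 10" and parsing the value as a float are choices made here, not confirmed details of the PR's parser.

```go
package main

import (
	"fmt"
	"strconv"
)

// defaultVerifyFilesPercent is the non-zero default named in the PR.
const defaultVerifyFilesPercent = 10.0

// parseVerifyFilesPercent accepts values in (0, 100] and rejects everything
// else as a configuration error. An empty string selects the default
// (an assumption of this sketch).
func parseVerifyFilesPercent(raw string) (float64, error) {
	if raw == "" {
		return defaultVerifyFilesPercent, nil
	}
	pct, err := strconv.ParseFloat(raw, 64)
	if err != nil {
		return 0, fmt.Errorf("verify_files_percent %q: not a number", raw)
	}
	// The negated form also rejects NaN, which fails every comparison.
	if !(pct > 0 && pct <= 100) {
		return 0, fmt.Errorf("verify_files_percent %q: must be in (0, 100]", raw)
	}
	return pct, nil
}

func main() {
	for _, raw := range []string{"", "25", "0", "-1", "101"} {
		pct, err := parseVerifyFilesPercent(raw)
		fmt.Printf("%q -> %v, err=%v\n", raw, pct, err)
	}
}
```

Rejecting at parse time means the push can fail before anything is advanced, which matches the "advancing nothing" behavior described above.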
#109 — verification-method provenance (unchanged from the first cut)
- destination_run_ids.verify_method records the method that advanced each
  component; store.ContentVerifiedMethod names the accepted ones.
- A presence+size component gates only once remote_objects carries both a
  checksum and a verified_at_ns.
- DurabilityComponent.verify_method carries the method verbatim to the puller.

Schema changes
Additive migration v20 (migrateV19ToV20); no existing migration edited;
store/schema.sql regenerated via go test ./store -update-schema:
- destination_push_freshness(volume_id, destination, origin_node_id,
  origin_run_id, updated_at_ns), PK (volume_id, destination, origin_node_id).

(v19 from the first cut still adds verify_method to destination_run_ids and
its history.)
Re-audit findings disposition
- Offsite wedge (BLOCKER): fixed; relayed freshness travels the
  wire (above).
- Finding 2 (verify_files_percent=0): fixed; rejected for a gating verify.
- Peer leg: phaseClose is now snapshot-pinned; Addresses #103 → Closes #103.
- Deferred to #121 (offload: fail fast when a required target can never
  satisfy the gate): it needs cross-cutting config-capability analysis rather than a
  localized gate change, and is pure fail-fast UX (gate soundness is unaffected;
  the gate already refuses correctly, it just doesn't explain why).
Tests (this gates deletion)
- A relayed target offloads when pulled
  push-freshness covers the content's origin run, and refuses when freshness
  is below it or absent: TestOffloadPeerRelayedTargetGatesOnPulledFreshness.
- A locally-pushed target ignores relayed
  freshness: TestOffloadLocalPushTargetIgnoresRelayedFreshness.
- verify_files_percent=0 rejected (parser + end-to-end push).
- The peer phaseClose advance is snapshot-pinned (a mid-transfer row is not
  covered), tagged peer-blake3, and records push-freshness.
- Push-freshness is recorded on
  advance, listed for the wire; the wire round-trip merges freshness monotonically.

go vet ./..., go test ./..., and golangci-lint run are clean.