offload: time-based durability-evidence staleness policy (#131) #134

Open
mbertschler wants to merge 1 commit into issue-104-durability-provenance from issue-131-offload-staleness

Stacked on #133 — merge #133 first

This PR is stacked on #133 (branch issue-104-durability-provenance, the per-peer provenance tagging). Its base is that branch, so the diff here shows only the #131 work. Merge #133 before this. Once #133 merges to main, GitHub will retarget this PR to main.

Closes #131

What this does

The offload durability gate weighed only version-vector staleness (a component is stale when its covered origin run is lower than the file's origin run). It had no notion of wall-clock freshness, so evidence for a destination that had been dead or unreachable for months still satisfied the gate indefinitely: the laptop kept deleting local bytes on the strength of a "durable" claim that had never been re-confirmed. This is defence-in-depth, not a live data-loss path. Coverage was already correct; only time-based freshness was missing. It pairs with the periodic re-verification story (squirrel verify, scrub).

Store (migration v23)

  • Adds a nullable verified_at_ns to destination_run_ids.
  • updated_at_ns keeps its meaning (last applied write, bumped even by an equal-value re-confirmation). verified_at_ns advances only on a write backed by genuine re-verification — a content-verified method, or a strict run advance — so a no-op touch never makes stale evidence look freshly checked.
  • destination_run_ids_history is unchanged (it already records each advance's at_ns); the gate reads only the live vector. No index (the gate loads the whole vector and filters in Go).
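The stamping rule described above can be sketched as follows. This is a minimal illustration, not the PR's code: `runRow` and `applyAdvance` are hypothetical names, a nil pointer stands in for SQL NULL, and ignoring a regressing write is an assumption the description does not confirm.

```go
package main

import "fmt"

// runRow is a hypothetical stand-in for one destination_run_ids row.
type runRow struct {
	Run          int64  // covered origin run
	UpdatedAtNs  int64  // last applied write
	VerifiedAtNs *int64 // nil models SQL NULL (verification time unknown)
}

// applyAdvance applies a write of newRun at nowNs. contentVerified is true
// when the write is backed by a content-verified method.
func applyAdvance(row runRow, newRun int64, contentVerified bool, nowNs int64) runRow {
	if newRun < row.Run {
		return row // assumption: a regressing write is ignored
	}
	strictAdvance := newRun > row.Run
	row.Run = newRun
	row.UpdatedAtNs = nowNs // bumped by every applied write, even a no-op touch
	if contentVerified || strictAdvance {
		v := nowNs
		row.VerifiedAtNs = &v // stamped only on genuine re-verification
	}
	return row
}

func main() {
	row := runRow{Run: 5, UpdatedAtNs: 100}

	// Equal-value, methodless re-confirmation: updated_at_ns moves,
	// verified_at_ns stays NULL.
	row = applyAdvance(row, 5, false, 200)
	fmt.Println(row.UpdatedAtNs, row.VerifiedAtNs == nil)

	// Strict methodless advance: both timestamps move.
	row = applyAdvance(row, 6, false, 300)
	fmt.Println(row.UpdatedAtNs, *row.VerifiedAtNs)
}
```

Keeping the two timestamps separate means existing readers of updated_at_ns see no behaviour change, while the gate gets a value that only moves when evidence was actually re-checked.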

Gate (offload/gate.go)

  • Injected nowNs (captured once per invocation in Offload, so all candidates are judged against one instant and tests are deterministic) + a maxEvidenceAge.
  • New staleEvidenceFailure check, applied per target after coverage passes: refuses when verified_at_ns is unknown or older than now − maxEvidenceAge, fail-closed, naming the age and provenance.
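A rough sketch of the per-target age check may help when reviewing the gate. It is written under assumptions: the `evidence` struct, the `checkEvidenceAge` name and signature, and the refusal wording are all hypothetical, and the real staleEvidenceFailure in offload/gate.go may differ. Only the rules themselves come from the description above: one shared instant, fail-closed on an unknown time, a no-op when the policy is disabled.

```go
package main

import (
	"fmt"
	"time"
)

// evidence is a hypothetical view of one required target's live component.
type evidence struct {
	Target       string
	SourceNodeID string // asserting peer; empty means observed locally
	VerifiedAtNs *int64 // nil models SQL NULL
}

// checkEvidenceAge returns a non-empty refusal reason when the evidence is
// unverified or older than maxAge. A zero maxAge disables the policy.
func checkEvidenceAge(e evidence, nowNs int64, maxAge time.Duration) string {
	if maxAge <= 0 {
		return "" // policy disabled
	}
	who := "local node"
	if e.SourceNodeID != "" {
		who = "peer " + e.SourceNodeID
	}
	if e.VerifiedAtNs == nil {
		// Fail-closed: an unknown verification time reads as infinitely stale.
		return fmt.Sprintf("%s: verification time unknown (asserted by %s)", e.Target, who)
	}
	age := time.Duration(nowNs - *e.VerifiedAtNs)
	if age > maxAge {
		return fmt.Sprintf("%s: evidence is %s old, max %s (asserted by %s)",
			e.Target, age.Round(time.Second), maxAge, who)
	}
	return ""
}

func main() {
	nowNs := time.Now().UnixNano() // captured once, shared by all candidates
	fresh := nowNs - int64(time.Hour)
	old := nowNs - int64(45*24*time.Hour)

	candidates := []evidence{
		{Target: "nas", VerifiedAtNs: &fresh},
		{Target: "offsite", SourceNodeID: "peer-b", VerifiedAtNs: &old},
		{Target: "legacy"}, // pre-migration row, verified_at_ns is NULL
	}
	for _, e := range candidates {
		if reason := checkEvidenceAge(e, nowNs, 720*time.Hour); reason != "" {
			fmt.Println("refuse:", reason)
			continue
		}
		fmt.Println("ok:", e.Target)
	}
}
```

Passing nowNs in as a parameter, rather than calling time.Now inside the check, is what lets tests pin the clock and compare every candidate against the same instant.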

Config + CLI

  • New per-volume offload_max_evidence_age knob (e.g. "720h"), plumbed through Options.MaxEvidenceAge. Default disabled (zero = no max age).
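For reviewers wondering how the knob's value might be validated, here is one possible sketch using the standard library's `time.ParseDuration`. The function name `parseMaxEvidenceAge` is hypothetical, and the one-second floor is an assumption derived from the "sub-second rejected" test case; the PR's actual parser may be stricter or looser.

```go
package main

import (
	"fmt"
	"time"
)

// parseMaxEvidenceAge converts the raw offload_max_evidence_age value into a
// duration. An absent (empty) value means the policy is disabled.
func parseMaxEvidenceAge(raw string) (time.Duration, error) {
	if raw == "" {
		return 0, nil // absent: zero, no max age
	}
	// time.ParseDuration rejects garbage and unitless numbers such as "720".
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("offload_max_evidence_age %q: %w", raw, err)
	}
	// Assumption: anything below one second (including an explicit "0" or a
	// negative value) is treated as a configuration mistake.
	if d < time.Second {
		return 0, fmt.Errorf("offload_max_evidence_age %q: must be at least 1s", raw)
	}
	return d, nil
}

func main() {
	for _, raw := range []string{"720h", "", "720", "500ms", "soon"} {
		d, err := parseMaxEvidenceAge(raw)
		if err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Printf("%q -> %s\n", raw, d)
	}
}
```

Note that `time.ParseDuration` has no day or month units, so a 30-day window is written in hours, as in the "720h" example.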

Decisions (need your sign-off)

  1. Refuse vs warn → refuse (fail-closed). A staleness policy must refuse to offload when evidence is stale, never delete/expire evidence rows. The policy only blocks the delete; no rows are touched.
  2. Default → disabled (zero). Matches the existing "explicit opt-in, no default" philosophy of offload_requires; existing configs are unaffected and no one is surprised by new refusals. Opt-in via the knob.
  3. New verified_at_ns column vs reuse updated_at_ns. I added a separate column. Reusing updated_at_ns is unsound because an equal-value re-advance bumps it without any re-verification (the issue's own concern), so a recent timestamp would not imply "recently checked." verified_at_ns advances only on genuine re-verification, leaving updated_at_ns semantics intact for any existing reader. Tradeoff: one nullable column plus a migration vs. zero schema change; I judged the soundness worth it. NULL (pre-v23 rows, or rows with only equal-value methodless writes) is fail-closed.
  4. Peer-relayed freshness. For a peer-asserted component (built on #133's provenance), verified_at_ns records when this node last pulled a fresh assertion; the peer's own verification instant never travels the wire. So the policy bounds how long this node trusts relayed evidence without hearing from the peer again (the dead-peer defence). The refusal names the asserting peer, not the local node. A live peer's pull carrying a verified method re-stamps verified_at_ns, keeping its relayed evidence fresh.

#104 / #133 regression check

No change to #133's provenance logic; I only read VerifiedAtNs alongside SourceNodeID in the gate and SELECTs. The fail-closed freshness behavior (push-watermark / push-freshness) is untouched — the new check is an additional, independent refusal. The wire protocol (syncproto) is unchanged.

Tests (deterministic, injected clock)

  • store: v22→v23 migration adds verified_at_ns NULL on carried rows; verified advance stamps it; an equal-value methodless re-confirm bumps updated_at_ns but not verified_at_ns; a strict methodless advance re-stamps it.
  • offload: stale evidence refuses; fresh passes; disabled policy is a no-op; NULL is fail-closed; peer-relayed staleness ages out and names the peer; end-to-end Offload with a max age deletes a freshly-verified file.
  • config: knob parses; absent defaults to zero; garbage/sub-second/unitless rejected.

go vet ./..., go test ./..., and golangci-lint run are all green; store/schema.sql is regenerated at v23 (TestSchemaSnapshot passes).

The offload gate weighed only version-vector staleness — a component was
stale when its covered origin run fell below the file's origin run. It had
no notion of freshness in wall-clock time, so evidence for a destination
dead or unreachable for months still gated offload indefinitely, letting
the laptop delete local bytes on the strength of a claim never since
re-confirmed (issue #131). This is defence-in-depth, not a live data-loss
path: coverage was already correct, only freshness was missing.

Migration v23 adds a nullable verified_at_ns to destination_run_ids.
updated_at_ns keeps its meaning (the last applied write, bumped even by an
equal-value re-confirmation); verified_at_ns advances only on a write
backed by genuine re-verification — a content-verified method or a strict
run advance — so a no-op touch never makes stale evidence look freshly
checked. The history table is unchanged (it already records each advance's
at_ns) and the gate reads only the live vector. NULL is fail-closed: an
unknown verification time reads as infinitely stale.

The gate gains an injected nowNs (captured once per invocation) and an
opt-in offload_max_evidence_age knob (per volume, default disabled so
existing configs are unaffected). When set, a required target whose
verified_at_ns is unknown or older than the max age is refused, fail-closed,
naming the age and provenance. For a peer-asserted component verified_at_ns
records when this node last pulled a fresh assertion — the peer's own
verification instant never travels the wire — so the policy bounds how long
relayed evidence is trusted without hearing from the peer again, the
dead-peer defence the issue calls for. The staleness policy only refuses to
offload; it never deletes or expires any evidence row.

Closes #131