Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
137 changes: 137 additions & 0 deletions SAFETY-AUDIT.md
Original file line number Diff line number Diff line change
Expand Up @@ -1022,6 +1022,143 @@ when restoreFromNode lands.

---

## Durability evidence & offsite verification (offload-v1)

Findings specific to the offload-v1 feature set — the durability version
vectors that gate `offload`, the peer durability pull that carries
evidence between nodes, and the content-addressed offsite push. These
are framed for the intended deployment: a **single operator** whose
nodes (laptop, NAS) are all machines they control and hold the only
credentials to. Under that model the adversary is overwhelmingly
*entropy and bugs*, not a hostile peer; the findings are sized
accordingly.

### D1: Durability-pull trust boundary (relayed offsite evidence)

**Severity:** Medium (defence-in-depth) • **Likelihood:** N/A
(documented assumption, not a live defect).

**Where**

- `sync/durability.go` — `PullDurability` / `pullDurability`,
`validateComponent`, `validateFreshness`.
- `offload/gate.go` — `check`, `methodVerified`, `freshnessFailure`.
- `store/nodes.go` — `GetOrCreateOriginNode`.

**The boundary**

A durability component is recorded one of two ways, and they differ in
what they trust:

- **Direct, self-verified.** When this node pushes to a target itself —
the NAS via peer sync (`sync/node.go`, tagged `peer-blake3` after the
receiver re-hashes every path) or a bucket via rclone (`sync/sync.go`)
— it writes the component into its **own** store from its **own**
confirmed transfer (`AdvanceDestinationVectorTo`). No peer is trusted.
- **Relayed, peer-asserted.** For a target this node never pushes to (an
offsite only the NAS reaches), the only evidence is what the NAS
reports over the durability pull (`UpsertDestinationRunIDVerified`,
reached only from `pullDurability`). Putting such a target in a
volume's `offload_requires` means the local delete decision trusts the
NAS's recorded `(origin, run, method)` assertion. The pull validates
shape (positive run, valid origin name, recognised method) and is
monotonic, but carries **no proof of possession** — a peer that
asserts an inflated run for a destination in the accepted set would be
believed.

**Decision (intended):** the relayed-evidence trust is **accepted**. The
NAS is in the same trust domain as the laptop; a NAS that lies about
durability is a compromised-or-broken NAS, in which case the archive it
holds is already in question and `offload` is a footnote. The gate fails
*closed* on absent evidence, so a peer can only ever *withhold*
offload-eligibility, and the redundancy decision (gate on **all** copies
via `offload_requires`, not the fewest-trusted subset) is what protects
against data loss — see the offload section of `README.md`.

**Defence-in-depth implemented in this branch** (cheap; turns *bugs*
into loud failures, not a security control):

- **Verify-method allow-list at the pull boundary** — `validateComponent`
refuses a non-empty `verify_method` that isn't one this build defines
(`store.KnownVerifyMethod`). Previously an unknown method was stored
and then silently treated as unverified by the gate; now a peer bug or
version-skew string is rejected at receipt. Empty (legitimately
"unverified") still passes.
- **Origin-node creation cap** — `pullDurability` refuses a pull that
names more than `maxOriginNodesPerPull` (256) distinct origins, so a
runaway peer cannot grow the local `nodes` table without bound via
`GetOrCreateOriginNode`. A real volume references a handful of origins;
the cap only converts a flood into an observable refusal.

**Not done (deliberately):** no proof-of-possession protocol, no
laptop-side independent verification of relayed offsites. Those defend
against a malicious NAS, which is out of model. The random per-file
nonce in the rclone crypt overlay also makes "the NAS proves the stored
ciphertext decrypts to the right content" impractical without either a
content-derived nonce or the laptop downloading and decrypting the
object — neither warranted here.

**Issue:** `durability: document the relayed-evidence trust boundary; add verify-method allow-list and origin-node cap as defence-in-depth` (implemented in this branch)

### D2: Content-addressed offsite push proves presence+size, not decrypt-correctness

**Severity:** Medium • **Likelihood:** Low (requires a transfer-time
corruption that preserves decrypted size, or the documented
re-hash→read TOCTOU window to fire).

**Where**

- `sync/content_addressed.go` — `uploadOneObject` (re-hash → `copyTo` →
`statRemote` size check), `captureFingerprints`.
- `sync/verify_remote.go` — `VerifyRemote` (scan-back re-check).

**What it does / doesn't establish**

At upload the push: (1) re-hashes the **local plaintext** and refuses on
drift, so the encryption input is the right content; (2) confirms the
object is **present** and its **decrypted size** (stat is through the
crypt overlay) matches the index; (3) records the provider's checksum of
the ciphertext as the scan-back baseline. The underlying backend's own
transfer integrity (e.g. S3 Content-MD5 on PUT) covers "the ciphertext
rclone sent is the ciphertext stored."

It does **not** confirm that the stored ciphertext *decrypts back* to the
indexed hash — there is no post-upload decrypt-and-rehash. The
unguarded slivers are the documented fork/exec window between the
re-hash and rclone's open (`uploadOneObject` "Residual:" comment) and a
hypothetical crypt bug that produces a right-decrypted-size, wrong
content object. Ongoing bitrot is caught by the scan-back re-verify; a
*wrong-at-upload* object is the gap.

**Proposed mitigation (opt-in, NAS-local — sketch, not built):**

- Add a `--verify` mode to the content-addressed push that, after an
object lands, downloads it back **through the crypt overlay**,
BLAKE3s the plaintext, and compares to the indexed hash. This is the
only check that closes decrypt-correctness, and it lives entirely on
the pushing node (which holds the plaintext) with no protocol or
laptop change.
- Scope it to the **initial upload** of each content hash (or a sampled
subset), not every run — the object is append-only and immutable, so
one read-back per object is sufficient. Cost is one download per
verified object (egress), so it must be opt-in and never run against
cold-tier targets (e.g. Glacier Deep Archive, where a read needs a
restore).
- Tightening the re-hash→read TOCTOU window further (snapshot/lock the
source) is **not** recommended — the window is one fork/exec, the
indexer and scrub already surface drift, and chasing it is
disproportionate.

**Acceptance**

- With `--verify`, a seeded object whose stored ciphertext decrypts to
the wrong content is caught and the run fails before the durability
vector advances; without it, behaviour is unchanged.

**Issue:** `sync: optional read-back-decrypt-rehash verification for content-addressed uploads`

---

## Cross-cutting recommendations

### Tests we should add now
Expand Down
18 changes: 18 additions & 0 deletions store/destination_run_ids.go
Original file line number Diff line number Diff line change
Expand Up @@ -48,6 +48,24 @@ func ContentVerifiedMethod(method string) bool {
}
}

// KnownVerifyMethod reports whether method is one of the defined
// verification-method identifiers above. It exists for the wire
// boundary: a pulled durability component carries its origin's
// VerifyMethod verbatim, and the offload gate later switches on it
// (ContentVerifiedMethod), so an unrecognised non-empty method should be
// refused on receipt rather than stored and silently ignored. The empty
// method (a pre-v19 row, or one whose provenance is unknown) is a
// legitimate "unverified" state and is deliberately excluded here;
// callers that accept it test for "" explicitly.
func KnownVerifyMethod(method string) bool {
switch method {
case VerifyMethodBlake3, VerifyMethodSizeMtime, VerifyMethodPeer, VerifyMethodKopia, VerifyMethodPresenceSize:
return true
default:
return false
}
}

// DestinationRunID is one component of a destination's durability
// version vector: the highest origin-space run id of OriginNodeID's
// content known durable on Destination for VolumeID. Destination is the
Expand Down
28 changes: 27 additions & 1 deletion sync/durability.go
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,13 @@ import (
// destinations cannot blow up the report or the output that renders it.
const maxDurabilityDropSamples = 16

// maxOriginNodesPerPull bounds how many distinct origin-node names one
// durability pull will resolve (and so, for novel names, create as local
// nodes rows). A real volume's content originates on a handful of nodes;
// the cap is deliberately generous and exists only to convert a runaway
// peer into a loud refusal rather than to police a legitimate topology.
const maxOriginNodesPerPull = 256

// DurabilityPullReport summarises one durability metadata pull from a
// peer: how many entries (vector components and freshness coordinates)
// were fetched, how many components landed in the local
Expand Down Expand Up @@ -135,6 +142,15 @@ func pullDurability(ctx context.Context, s *store.Store, client *nodeClient, vol
if id, ok := originIDs[name]; ok {
return id, nil
}
// GetOrCreateOriginNode creates a local nodes row for any name
// not seen before. Bound how many distinct origins one pull will
// resolve so a peer bug (or hostile peer) flooding novel names
// cannot grow the local nodes table without limit — a real
// volume references only a handful of origins. Fails the pull
// rather than truncating, so the cap is observable.
if len(originIDs) >= maxOriginNodesPerPull {
return 0, fmt.Errorf("durability pull names more than %d distinct origin nodes; refusing to create unbounded node rows from one pull", maxOriginNodesPerPull)
}
node, err := s.GetOrCreateOriginNode(ctx, name)
if err != nil {
return 0, fmt.Errorf("resolve origin node %q: %w", name, err)
Expand Down Expand Up @@ -192,7 +208,14 @@ func pullDurability(ctx context.Context, s *store.Store, client *nodeClient, vol

// validateComponent guards the wire-supplied component before it
// touches the local vector: destination and origin names are
// identities, the run id must be a positive origin-space id.
// identities, the run id must be a positive origin-space id, and the
// verify method must be empty or a method this build recognises. The
// method check is defence-in-depth, not a trust boundary (the peer is
// trusted to assert its own durability — see SAFETY-AUDIT.md D1): the
// gate already refuses to offload on an unrecognised method, so the
// only effect of an unknown non-empty method reaching the store is a
// silently-inert row. Refusing it here turns a peer bug or a
// version-skew method string into a loud error at the pull instead.
func validateComponent(c syncproto.DurabilityComponent) error {
if c.Destination == "" {
return errors.New("destination must be non-empty")
Expand All @@ -203,6 +226,9 @@ func validateComponent(c syncproto.DurabilityComponent) error {
if c.OriginRun <= 0 {
return fmt.Errorf("origin_run %d must be positive", c.OriginRun)
}
if c.VerifyMethod != "" && !store.KnownVerifyMethod(c.VerifyMethod) {
return fmt.Errorf("verify_method %q is not a recognised verification method", c.VerifyMethod)
}
return nil
}

Expand Down
65 changes: 65 additions & 0 deletions sync/durability_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -5,6 +5,9 @@ import (
"fmt"
"strings"
"testing"

"github.com/mbertschler/squirrel/store"
"github.com/mbertschler/squirrel/syncproto"
)

// seedReceiverDurability records vector components on the receiver so
Expand Down Expand Up @@ -313,3 +316,65 @@ func TestPullDurabilityRequiresLocalVolume(t *testing.T) {
t.Fatalf("err = %v, want the no-local-index-row guard", err)
}
}

// TestValidateComponentVerifyMethod: the wire-boundary guard accepts the
// empty method (a legitimate "unverified" state) and every method this
// build defines, and refuses an unrecognised non-empty method so a peer
// bug or version-skew string is loud at the pull rather than a
// silently-inert local row (SAFETY-AUDIT.md D1). Freshness carries no
// method and is unaffected.
func TestValidateComponentVerifyMethod(t *testing.T) {
base := syncproto.DurabilityComponent{Destination: "offsite-a", OriginNode: "laptop", OriginRun: 5}
for _, method := range []string{
"",
store.VerifyMethodBlake3,
store.VerifyMethodSizeMtime,
store.VerifyMethodPeer,
store.VerifyMethodKopia,
store.VerifyMethodPresenceSize,
} {
c := base
c.VerifyMethod = method
if err := validateComponent(c); err != nil {
t.Errorf("validateComponent(method=%q) = %v, want nil", method, err)
}
}
c := base
c.VerifyMethod = "totally-bogus"
if err := validateComponent(c); err == nil || !strings.Contains(err.Error(), "recognised verification method") {
t.Fatalf("validateComponent(unknown method) = %v, want the unrecognised-method refusal", err)
}
}

// TestPullDurabilityCapsOriginNodeCreation: a pull that names more than
// maxOriginNodesPerPull distinct origins is refused before it grows the
// local nodes table without bound (SAFETY-AUDIT.md D1). Seeds cap+1
// distinct-origin components on the receiver, all on one accepted
// destination, and asserts the pull fails with the cap message.
func TestPullDurabilityCapsOriginNodeCreation(t *testing.T) {
f := setupNodeFixtureNoRclone(t)
ctx := context.Background()
f.initVol.OffloadRequires = []string{"offsite-a"}

rv, err := f.recvStore.CreateVolume(ctx, f.recvVol.Name, f.recvVol.Path)
if err != nil {
t.Fatalf("CreateVolume on receiver: %v", err)
}
for i := 0; i <= maxOriginNodesPerPull; i++ {
origin, err := f.recvStore.GetOrCreateOriginNode(ctx, fmt.Sprintf("origin-%04d", i))
if err != nil {
t.Fatalf("seed origin %d: %v", i, err)
}
if err := f.recvStore.UpsertDestinationRunID(ctx, rv.ID, "offsite-a", origin.ID, int64(i+1), false); err != nil {
t.Fatalf("seed component %d: %v", i, err)
}
}

if _, err := f.initStore.CreateVolume(ctx, f.initVol.Name, f.initVol.Path); err != nil {
t.Fatalf("CreateVolume on initiator: %v", err)
}
_, err = PullDurability(ctx, f.initStore, f.initVol, f.node, false)
if err == nil || !strings.Contains(err.Error(), "distinct origin nodes") {
t.Fatalf("PullDurability = %v, want the origin-node cap refusal", err)
}
}
Loading