Skip to content

index, sync: stop binding wrong bytes to a content hash (#107)#124

Merged
mbertschler merged 3 commits into
offload-v1from
fix-upload-integrity
Jun 11, 2026
Merged

index, sync: stop binding wrong bytes to a content hash (#107)#124
mbertschler merged 3 commits into
offload-v1from
fix-upload-integrity

Conversation

@mbertschler

Copy link
Copy Markdown
Owner

Fixes #107.

The content-addressed upload path and the indexer could each permanently bind the wrong bytes to a content hash. This PR closes both.

1. Indexer stat/hash TOCTOU (index/index.go)

The indexer captured size/mtime at directory-walk time, then hashed later from a fresh open with no re-stat. A file appended-to between the walk and the hash bound the new BLAKE3 digest to the old size in an immutable contents row. Because contents is append-only, every later observation of those true bytes then failed the size_bytes cross-check in lookupContentTx and aborted the whole ApplyIndexBatch — refusing the content-addressed push forever.

Fix: hashFile now Stats the open handle after hashing and returns size+mtime alongside the digest; rowFor builds the row from those. Hash and metadata now describe the same inode state. The live handle is used deliberately — re-opening by path would reintroduce the race.

2. Content-addressed upload could record unverified bytes (sync/content_addressed.go)

uploadOneObject stat-guarded the source on size+mtime before rclone read it (a stat→copy race) and confirmed only size after upload. A size+mtime-preserving in-place edit could therefore upload bytes that don't match the recorded hash; the resulting remote_objects row then made HasRemoteObject suppress any future re-upload of the honest content.

Upload-hash approach chosen: re-hash-then-copy. The pre-transfer guard now re-hashes the source with BLAKE3 immediately before copyto and refuses (errContentDrift) when the digest no longer matches the indexed hash — catching exactly the size+mtime-preserving edit the metadata stat alone cannot. A refusal is surfaced as a warning and fails the run without recording an object or writing the manifest segment, so the watermark holds and the next run re-offers the object. The honest bytes land once restored; the drifted bytes never bind to the hash.

A true stream-while-uploading hash isn't practical here: rclone reads the source in a separate child process, so squirrel can't tee the bytes as they're sent.

Residual: a writer that edits the file in the window between the re-hash and rclone's read could still upload drifted bytes. That window is the fork/exec of one rclone invocation rather than the whole walk-to-push span, and the scan-back fingerprint pass (#109) re-reads the landed object to upgrade the durability vector before it counts as content-verified — catching anything that slips through.

Tests

  • index: TestHashStatPinnedToHashedBytes — a process call given a stale (smaller) walk size records the row with the size of the hashed bytes, not the walk size. TestReindexStableBytesNoSizeMismatch — re-indexing stable bytes does not hit the size-mismatch abort.
  • sync: TestContentAddressedDriftRefusesObject — a same-length, mtime-preserving in-place edit is not recorded in remote_objects (run refused + warned, watermark held, no segment), and a later run lands+records the honest bytes normally.
  • Existing content-addressed + index tests stay green.

go vet ./..., go test ./..., and golangci-lint run are clean.

The indexer captured size and mtime at directory-walk time, then hashed
later from a fresh open with no re-stat. A file appended-to between the
walk and the hash bound the new BLAKE3 digest to the stale walk size in
an immutable contents row; every later observation of those true bytes
then failed the size cross-check in lookupContentTx and aborted the
whole ApplyIndexBatch repeatedly.

hashFile now Stats the open handle after hashing and returns the size
and mtime alongside the digest, so the row's metadata describes the same
inode state as the bytes that produced the hash. Re-opening by path
would reintroduce the race, so the live handle is used.

Refs #107
uploadOneObject stat-guarded the source on size+mtime before rclone
read it (a stat->copy race) and confirmed only size after upload, so a
size+mtime-preserving in-place edit could upload bytes that did not
match the recorded hash; the recorded remote_objects row then suppressed
any future re-upload of the honest content forever.

The pre-transfer guard now re-hashes the source with BLAKE3 immediately
before copyto and refuses (errContentDrift) when the digest no longer
matches the indexed hash. A refusal is surfaced as a warning and fails
the run without recording an object or writing the manifest segment, so
the watermark holds and the next run re-offers the object — the honest
bytes land once restored, and the drifted bytes never bind to the hash.

Residual: rclone reads the file in a child process after the re-hash, so
an edit in that fork/exec window could still upload drifted bytes; the
window is one rclone invocation rather than the whole walk-to-push span,
and the scan-back fingerprint pass re-reads the landed object before it
counts as content-verified.

Refs #107
Copilot AI review requested due to automatic review settings June 11, 2026 03:45

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses a correctness issue where (1) the indexer could bind inconsistent (blake3, size_bytes) pairs into the immutable contents table due to a TOCTOU window, and (2) the content-addressed upload path could record remote objects whose uploaded bytes no longer match the indexed hash due to a stat→copy race.

Changes:

  • Indexer now derives SizeBytes/MtimeNs from a stat of the open file handle after hashing, and uses that to build FileRow updates.
  • Content-addressed sync now re-hashes the local source immediately before upload and refuses uploads when the digest no longer matches the indexed hash (surfaced as warnings, run fails, watermark does not advance).
  • Adds targeted regression tests covering both the indexer TOCTOU and content-addressed drift refusal behavior.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

File Description
sync/content_addressed.go Re-hash-before-upload drift guard; introduce errContentDrift; fail run without recording drifted objects.
sync/content_addressed_test.go New test ensuring drifted bytes are refused (no object record, no segment, watermark holds) and succeed after restoring honest bytes.
index/index.go hashFile returns digest + post-hash metadata; rowFor now uses hashed result to avoid stale walk-time size/mtime.
index/index_test.go New tests validating size recorded matches hashed bytes and re-index does not trip size mismatch abort.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread index/index.go
Comment on lines 704 to +708
h := blake3.New()
if _, err := io.CopyBuffer(h, f, buf); err != nil {
return nil, err
return hashedFile{}, err
}
fi, err := f.Stat()
@mbertschler mbertschler merged commit eb4c785 into offload-v1 Jun 11, 2026
2 checks passed
@mbertschler mbertschler deleted the fix-upload-integrity branch June 11, 2026 03:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants