index, sync: stop binding wrong bytes to a content hash (#107)#124
Merged
Conversation
The indexer captured size and mtime at directory-walk time, then hashed later from a fresh open with no re-stat. A file appended-to between the walk and the hash bound the new BLAKE3 digest to the stale walk size in an immutable contents row; every later observation of those true bytes then failed the size cross-check in lookupContentTx and aborted the whole ApplyIndexBatch repeatedly. hashFile now Stats the open handle after hashing and returns the size and mtime alongside the digest, so the row's metadata describes the same inode state as the bytes that produced the hash. Re-opening by path would reintroduce the race, so the live handle is used. Refs #107
uploadOneObject stat-guarded the source on size+mtime before rclone read it (a stat->copy race) and confirmed only size after upload, so a size+mtime-preserving in-place edit could upload bytes that did not match the recorded hash; the recorded remote_objects row then suppressed any future re-upload of the honest content forever. The pre-transfer guard now re-hashes the source with BLAKE3 immediately before copyto and refuses (errContentDrift) when the digest no longer matches the indexed hash. A refusal is surfaced as a warning and fails the run without recording an object or writing the manifest segment, so the watermark holds and the next run re-offers the object — the honest bytes land once restored, and the drifted bytes never bind to the hash. Residual: rclone reads the file in a child process after the re-hash, so an edit in that fork/exec window could still upload drifted bytes; the window is one rclone invocation rather than the whole walk-to-push span, and the scan-back fingerprint pass re-reads the landed object before it counts as content-verified. Refs #107
There was a problem hiding this comment.
Pull request overview
This PR addresses a correctness issue where (1) the indexer could bind inconsistent (blake3, size_bytes) pairs into the immutable contents table due to a TOCTOU window, and (2) the content-addressed upload path could record remote objects whose uploaded bytes no longer match the indexed hash due to a stat→copy race.
Changes:
- Indexer now derives
SizeBytes/MtimeNsfrom a stat of the open file handle after hashing, and uses that to buildFileRowupdates. - Content-addressed sync now re-hashes the local source immediately before upload and refuses uploads when the digest no longer matches the indexed hash (surfaced as warnings, run fails, watermark does not advance).
- Adds targeted regression tests covering both the indexer TOCTOU and content-addressed drift refusal behavior.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
sync/content_addressed.go |
Re-hash-before-upload drift guard; introduce errContentDrift; fail run without recording drifted objects. |
sync/content_addressed_test.go |
New test ensuring drifted bytes are refused (no object record, no segment, watermark holds) and succeed after restoring honest bytes. |
index/index.go |
hashFile returns digest + post-hash metadata; rowFor now uses hashed result to avoid stale walk-time size/mtime. |
index/index_test.go |
New tests validating size recorded matches hashed bytes and re-index does not trip size mismatch abort. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
704
to
+708
| h := blake3.New() | ||
| if _, err := io.CopyBuffer(h, f, buf); err != nil { | ||
| return nil, err | ||
| return hashedFile{}, err | ||
| } | ||
| fi, err := f.Stat() |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #107.
The content-addressed upload path and the indexer could each permanently bind the wrong bytes to a content hash. This PR closes both.
1. Indexer stat/hash TOCTOU (
index/index.go)The indexer captured size/mtime at directory-walk time, then hashed later from a fresh open with no re-stat. A file appended-to between the walk and the hash bound the new BLAKE3 digest to the old size in an immutable
contentsrow. Becausecontentsis append-only, every later observation of those true bytes then failed thesize_bytescross-check inlookupContentTxand aborted the wholeApplyIndexBatch— refusing the content-addressed push forever.Fix:
hashFilenowStats the open handle after hashing and returns size+mtime alongside the digest;rowForbuilds the row from those. Hash and metadata now describe the same inode state. The live handle is used deliberately — re-opening by path would reintroduce the race.2. Content-addressed upload could record unverified bytes (
sync/content_addressed.go)uploadOneObjectstat-guarded the source on size+mtime before rclone read it (a stat→copy race) and confirmed only size after upload. A size+mtime-preserving in-place edit could therefore upload bytes that don't match the recorded hash; the resultingremote_objectsrow then madeHasRemoteObjectsuppress any future re-upload of the honest content.Upload-hash approach chosen: re-hash-then-copy. The pre-transfer guard now re-hashes the source with BLAKE3 immediately before
copytoand refuses (errContentDrift) when the digest no longer matches the indexed hash — catching exactly the size+mtime-preserving edit the metadata stat alone cannot. A refusal is surfaced as a warning and fails the run without recording an object or writing the manifest segment, so the watermark holds and the next run re-offers the object. The honest bytes land once restored; the drifted bytes never bind to the hash.A true stream-while-uploading hash isn't practical here: rclone reads the source in a separate child process, so squirrel can't tee the bytes as they're sent.
Residual: a writer that edits the file in the window between the re-hash and rclone's read could still upload drifted bytes. That window is the fork/exec of one rclone invocation rather than the whole walk-to-push span, and the scan-back fingerprint pass (#109) re-reads the landed object to upgrade the durability vector before it counts as content-verified — catching anything that slips through.
Tests
index:TestHashStatPinnedToHashedBytes— aprocesscall given a stale (smaller) walk size records the row with the size of the hashed bytes, not the walk size.TestReindexStableBytesNoSizeMismatch— re-indexing stable bytes does not hit the size-mismatch abort.sync:TestContentAddressedDriftRefusesObject— a same-length, mtime-preserving in-place edit is not recorded inremote_objects(run refused + warned, watermark held, no segment), and a later run lands+records the honest bytes normally.go vet ./...,go test ./..., andgolangci-lint runare clean.