Add checksum verification feature #72
Implements feature request tarka#71 for optional checksum verification during file copies to detect hardware errors (memory/storage issues).

Implementation:
- Uses xxHash64 for fast non-cryptographic checksumming
- Calculates the checksum during the copy (zero overhead on the source read)
- Verifies by re-reading the destination file only
- Returns an immediate error on mismatch (no retry)
- Works with both parfile and parblock drivers
- Thread-safe, using a Mutex for final checksum storage

Changes:
- Add `--verify-checksum` CLI flag
- Add `Config.verify_checksum` field
- Add `XcpError::ChecksumMismatch` error type
- Update `CopyHandle` to compute and verify checksums
- Add xxhash-rust dependency

Tests: 28 test cases covering:
- Empty, small, and large files
- Binary patterns (zeros, random, alternating)
- Recursive directory copies
- Multiple files
- Both drivers (parfile/parblock)
- Various block sizes and worker counts
- Sparse files
- File overwriting

Performance:
- ~2x overhead due to the destination re-read (e.g., 34ms → 70ms for 50MB)
- An acceptable trade-off for critical data integrity

Addresses feedback from @tarka, @Kalinda-Myriad, and @OndrikB in issue tarka#71.
Why not xxh3? It's superior in basically every way...
xxHash3 is superior to xxHash64 in every way:
- Faster (~1.5-3x depending on data size)
- Better hash quality
- Optimized for modern architectures

Thanks to @lespea for the suggestion.
Because .......... I hadn't thought of that...
Thanks so much for this! This is exactly what I was hoping for.
Hmmm, might be getting some false negatives?

    ❯ xcp -r ./A001A20M/ /projects/deletme/ --verify-checksum --no-perms

but checking those directories against each other with rclone:

    ❯ rclone check "/moredata/197 Moon Fever Cage Video" "/projects/deleteme" --one-way --log-level INFO

Let me know how I can help figure out what's going on here!
Trying one of the problem files on its own gives an error:

    21:06:02 [ERROR] Error during finalising copy operation File { fd: 3, path: "/moredata/197 Moon Fever Cage Video/media/card b/XDROOT/General/Sony/SALVAGE.TMP", read: true, write: false } -> File { fd: 4, path: "/projects/deleteme/SALVAGE.TMP", read: false, write: true }: Checksum verification failed for /projects/deleteme/SALVAGE.TMP: expected 2d06800538d394c2, got 998e…

File in question, if it's helpful: https://drive.google.com/file/d/14-w8aU_9JZqu1i5KrgWcVfjzR-BaVl0O/view?usp=sharing
Bug: Checksum verification was failing with false positives when used with the --no-perms flag, reporting mismatches even though files were identical (verified with rclone).

Root cause: The destination file wasn't being synced to disk before re-reading for verification. BufWriter::flush() only writes to the file descriptor, not to disk. Without fsync(), the kernel's page cache could return stale data when re-opening the file for checksum verification.

Solution: Always call sync() on the destination file descriptor before verifying checksums, regardless of whether the --fsync flag is set. This ensures all data is written to disk before we re-read for verification.

Fixes issue reported by @Duckfromearth in PR tarka#72.
Bug: Checksum verification was failing for sparse files because:
- During copy: Only data blocks were hashed (holes were skipped)
- During verification: The entire file was hashed (including zero-filled holes)
- Result: Checksum mismatch even though the files were functionally identical

Root cause: The sparse file optimization in both the parfile (copy_sparse) and parblock (map_extents) drivers skips holes during copy, but the verification re-reads the entire destination file including all holes.

Solution: Disable the sparse file optimization when --verify-checksum is enabled. This ensures consistent hashing by copying and hashing all file content including holes. The trade-off is acceptable: users wanting checksum verification prioritize data integrity over sparse file space savings.

Changes:
- operations.rs: Skip copy_sparse() when verify_checksum is enabled
- parblock.rs: Skip extent mapping when verify_checksum is enabled
- checksum.rs test: Update assertion to expect a non-sparse destination

Also added fsync before verification to ensure data is written to disk.

Fixes issue reported by @Duckfromearth in PR tarka#72.
Found the issue! The problem was with sparse files.

Fix: Disable the sparse file optimization when --verify-checksum is enabled. This ensures consistent hashing by copying all file content including holes. The trade-off is acceptable: users wanting checksum verification prioritize data integrity over sparse file space savings.

Note: This only affects behavior when --verify-checksum is explicitly used. Normal copies (without checksum) still use the sparse optimization for maximum performance.

Apologies for the initial issue; I'm new to this codebase and didn't fully explore all the internal mechanisms before implementing the feature. Thanks @Duckfromearth
@Duckfromearth I'll give you time to test it on your side and give me feedback if necessary... thanks in advance!
@felix068 thanks for the fix!! That's working beautifully now. Did some more testing today and here is what I noticed.

Copying to an SSD works flawlessly with no hiccups. Copying to a single hard drive or ZFS array, it tends to hang intermittently. It does eventually finish, but it takes a while, and if you try to cancel the command it can take 15-30 seconds to cancel. As far as I can tell this is because it's reading and writing to the disk in parallel, and the disk eventually gets overwhelmed with commands, forcing the process to wait?

Seems tricky to address, because I think the current implementation, where it writes and does the verification in parallel, is ideal for fast SSDs. I think the ideal setup for HDDs would be to write an entire file, then verify that entire file, before moving on to the next file. But of course xcp doesn't know what kind of disk it's using. Maybe an additional flag? I tried reducing the workers to one, but it doesn't seem to make a difference. Curious if you have any thoughts.

This could also be an issue with my specific hardware or OS; I tested with a couple of different drives and two different SATA controllers. Thanks again for making this happen!!
Issue: Checksum verification was causing hangs on mechanical hard drives (HDDs) and ZFS arrays due to a forced fsync() after each file write. This was particularly problematic when copying many files, as each fsync() blocks waiting for physical disk writes to complete.

Analysis:
- SSDs: Fast fsync, no issues
- HDDs: Slow fsync due to mechanical seeks, causes intermittent hangs
- ZFS: Similar issues with sync performance

Solution: Make fsync optional rather than automatic with checksum verification.

Changes:
- Removed the automatic fsync() when verify_checksum is enabled
- Users can now choose:
  - `--verify-checksum` alone: Fast, works well on most systems
  - `--verify-checksum --fsync`: Maximum integrity, forces disk sync (slower on HDDs)
- Updated the README with HDD performance notes and the --fsync recommendation

Trade-off: Without fsync, there's a theoretical risk of false positives in rare cache coherency scenarios, but in practice this is extremely rare on modern systems. Users who need absolute certainty can use --fsync.

Addresses performance issue reported by @Duckfromearth in PR tarka#72.
Sorry for the delay (I was sick).

About the HDD performance issue: I found the problem. When using --verify-checksum, the code was automatically calling fsync after every file write. On mechanical hard drives and ZFS arrays, fsync blocks waiting for physical writes. I've removed the automatic fsync. Now you have two options:

- `--verify-checksum` alone: Fast, no hangs, works on most systems including HDDs
- `--verify-checksum --fsync`: Maximum integrity mode, forces a disk sync before verification. Slower on HDDs, but guarantees correctness even in rare cache scenarios.

For your HDD and ZFS setup, I recommend using --verify-checksum without --fsync. The fsync flag is optional and only needed if you want absolute certainty for mission-critical data.

Let me know if this fixes the hang issue on your drives.
I forgot to reply here, but I just wanted to say this is working perfectly! Thanks so much for making it happen.
Hi all. Sorry I haven't replied to this; I'm focusing on something else at the moment. I'll take a look when I get a chance.
> …performance on HDDs, the verification should still work correctly but may be slower.
> For maximum data integrity assurance, add `--fsync` to force data to disk before
> verification (slower but guarantees correct checksums even in rare cache coherency
> scenarios):
This will also be very slow on NFS mounts; copy_file_range() allows the copy to happen server-side, whereas checksumming will require the data to be copied back to the client for verification.
Tests are failing for me.
```rust
if let Some(h) = hasher {
    *self.src_checksum.lock().unwrap() = Some(h.digest());
}
```
This shouldn't be unwrapped; it can be mapped to an error type.
```
@@ -119,7 +139,9 @@ impl CopyHandle {
    if self.try_reflink()? {
```
This setting will be silently ignored if on a reflink-capable filesystem (btrfs, XFS). Is this deliberate?