Add checksum verification feature #72
Implements feature request tarka#71 for optional checksum verification during file copies to detect hardware errors (memory/storage issues).

Implementation:
- Uses xxHash64 for fast non-cryptographic checksumming
- Calculates the checksum during the copy (zero overhead on the source read)
- Verifies by re-reading the destination file only
- Returns an immediate error on mismatch (no retry)
- Works with both parfile and parblock drivers
- Thread-safe, using a Mutex for final checksum storage

Changes:
- Add `--verify-checksum` CLI flag
- Add `Config.verify_checksum` field
- Add `XcpError::ChecksumMismatch` error type
- Update `CopyHandle` to compute and verify checksums
- Add xxhash-rust dependency

Tests: 28 test cases covering:
- Empty, small, and large files
- Binary patterns (zeros, random, alternating)
- Recursive directory copies
- Multiple files
- Both drivers (parfile/parblock)
- Various block sizes and worker counts
- Sparse files
- File overwriting

Performance:
- ~2x overhead due to the destination re-read (e.g., 34ms → 70ms for 50MB)
- An acceptable trade-off for critical data integrity

Addresses feedback from @tarka, @Kalinda-Myriad, and @OndrikB in issue tarka#71.
Why not xxh3? It's superior in basically every way...
xxHash3 is superior to xxHash64 in every way:
- Faster (~1.5-3x depending on data size)
- Better hash quality
- Optimized for modern architectures

Thanks to @lespea for the suggestion.
Because .......... I hadn't thought of that...
Thanks so much for this! This is exactly what I was hoping for.
Hmmm, might be getting some false negatives?

    ❯ xcp -r ./A001A20M/ /projects/deletme/ --verify-checksum --no-perms

but checking those directories against each other with rclone:

    ❯ rclone check "/moredata/197 Moon Fever Cage Video" "/projects/deleteme" --one-way --log-level INFO

Let me know how I can help figure out what's going on here!
Trying one of the problem files on its own gives an error:

    21:06:02 [ERROR] Error during finalising copy operation File { fd: 3, path: "/moredata/197 Moon Fever Cage Video/media/card b/XDROOT/General/Sony/SALVAGE.TMP", read: true, write: false } -> File { fd: 4, path: "/projects/deleteme/SALVAGE.TMP", read: false, write: true }: Checksum verification failed for /projects/deleteme/SALVAGE.TMP: expected 2d06800538d394c2, got 998e…

File in question, if it's helpful: https://drive.google.com/file/d/14-w8aU_9JZqu1i5KrgWcVfjzR-BaVl0O/view?usp=sharing
Bug: Checksum verification was failing with false positives when used with the --no-perms flag, reporting mismatches even though files were identical (verified with rclone).

Root cause: The destination file wasn't being synced to disk before re-reading for verification. BufWriter::flush() only writes to the file descriptor, not to disk. Without fsync(), the kernel's page cache could return stale data when re-opening the file for checksum verification.

Solution: Always call sync() on the destination file descriptor before verifying checksums, regardless of whether the --fsync flag is set. This ensures all data is written to disk before we re-read for verification.

Fixes issue reported by @Duckfromearth in PR tarka#72.
Bug: Checksum verification was failing for sparse files because:
- During copy: Only data blocks were hashed (holes were skipped)
- During verification: The entire file was hashed (including zero-filled holes)
- Result: Checksum mismatch even though the files were functionally identical

Root cause: The sparse file optimization in both the parfile (copy_sparse) and parblock (map_extents) drivers skips holes during copy, but the verification re-reads the entire destination file including all holes.

Solution: Disable the sparse file optimization when --verify-checksum is enabled. This ensures consistent hashing by copying and hashing all file content including holes. The trade-off is acceptable: users wanting checksum verification prioritize data integrity over sparse file space savings.

Changes:
- operations.rs: Skip copy_sparse() when verify_checksum is enabled
- parblock.rs: Skip extent mapping when verify_checksum is enabled
- checksum.rs test: Update assertion to expect a non-sparse destination

Also added fsync before verification to ensure data is written to disk.

Fixes issue reported by @Duckfromearth in PR tarka#72.
Found the issue! The problem was with sparse files.

Fix: Disable the sparse file optimization when --verify-checksum is enabled. This ensures consistent hashing by copying all file content including holes. The trade-off is acceptable: users wanting checksum verification prioritize data integrity over sparse file space savings.

Note: This only affects behavior when --verify-checksum is explicitly used. Normal copies (without checksum) still use the sparse optimization for maximum performance.

Apologies for the initial issue; I'm new to this codebase and didn't fully explore all the internal mechanisms before implementing the feature. Thanks @Duckfromearth
@Duckfromearth I'll give you time to test it on your side and give me feedback if necessary... thanks in advance!
@felix068 thanks for the fix!! That's working beautifully now. Did some more testing today and here is what I noticed.

Copying to an SSD works flawlessly with no hiccups. Copying to a single hard drive or ZFS array, it tends to hang intermittently. It does eventually finish, but it takes a while, and if you try to cancel the command it can take 15-30 seconds to cancel. As far as I can tell this is because it's reading and writing to the disk in parallel, and the disk eventually gets overwhelmed with commands, forcing the process to wait?

Seems tricky to address, because I think the current implementation, where it writes and does the verification in parallel, is ideal for fast SSDs. I think the ideal setup for HDDs would be to write an entire file, then verify that entire file, before moving on to the next file. But of course xcp doesn't know what kind of disk it's using. Maybe an additional flag? I tried reducing the workers to one, but it doesn't seem to make a difference. Curious if you have any thoughts.

This could also be an issue with my specific hardware or OS; I tested with a couple of different drives and two different SATA controllers. Thanks again for making this happen!!
Issue: Checksum verification was causing hangs on mechanical hard drives (HDDs) and ZFS arrays due to a forced fsync() after each file write. This was particularly problematic when copying many files, as each fsync() blocks waiting for physical disk writes to complete.

Analysis:
- SSDs: Fast fsync, no issues
- HDDs: Slow fsync due to mechanical seeks, causes intermittent hangs
- ZFS: Similar issues with sync performance

Solution: Make fsync optional rather than automatic with checksum verification.

Changes:
- Removed the automatic fsync() when verify_checksum is enabled
- Users can now choose:
  - `--verify-checksum` alone: Fast, works well on most systems
  - `--verify-checksum --fsync`: Maximum integrity, forces disk sync (slower on HDDs)
- Updated the README with HDD performance notes and the --fsync recommendation

Trade-off: Without fsync, there's a theoretical risk of false positives in rare cache coherency scenarios, but in practice this is extremely rare on modern systems. Users who need absolute certainty can use --fsync.

Addresses performance issue reported by @Duckfromearth in PR tarka#72.
Sorry for the delay (I was sick).

About the HDD performance issue: I found the problem. When using --verify-checksum, the code was automatically calling fsync after every file write. On mechanical hard drives and ZFS arrays, fsync blocks waiting for physical writes. I've removed the automatic fsync. Now you have two options:

- `--verify-checksum` alone: Fast, no hangs, works on most systems including HDDs
- `--verify-checksum --fsync`: Maximum integrity mode, forces a disk sync before verification. Slower on HDDs, but guarantees correctness even in rare cache scenarios.

For your HDD and ZFS setup, I recommend using --verify-checksum without --fsync. The fsync flag is optional and only needed if you want absolute certainty for mission-critical data.

Let me know if this fixes the hang issue on your drives.
I forgot to reply here, but I just wanted to say this is working perfectly! Thanks so much for making it happen.
Hi all. Sorry I haven't replied to this; I'm focusing on something else at the moment. I'll take a look when I get a chance.
> …performance on HDDs, the verification should still work correctly but may be slower.
> For maximum data integrity assurance, add `--fsync` to force data to disk before
> verification (slower but guarantees correct checksums even in rare cache coherency
> scenarios):
This will also be very slow on NFS mounts; copy_file_range() allows the copy to happen server-side, whereas checksumming will require the data to be copied back to the client for verification.
Tests are failing for me.
```rust
if let Some(h) = hasher {
    *self.src_checksum.lock().unwrap() = Some(h.digest());
}
```
This shouldn't be unwrapped; it can be mapped to an error type.
```
@@ -119,7 +139,9 @@ impl CopyHandle {
    if self.try_reflink()? {
```
This setting will be silently ignored if on a reflink-capable filesystem (btrfs, XFS). Is this deliberate?