Conversation

felix068 commented Oct 6, 2025

Implements feature request #71 for optional checksum verification during file copies to detect hardware errors (memory/storage issues).

Implementation:

  • Uses xxHash64 for fast non-cryptographic checksumming
  • Calculates checksum during copy (zero overhead on source read; see the sketch after this list)
  • Verifies by re-reading destination file only
  • Returns immediate error on mismatch (no retry)
  • Works with both parfile and parblock drivers
  • Thread-safe using Mutex for final checksum storage
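
For illustration, a minimal sketch of the copy-side hashing flow as originally implemented with xxHash64 (the function name, buffer size, and error handling are assumptions for the example, not xcp's actual internals; requires the xxhash-rust crate with the xxh64 feature):

```rust
use std::io::{Read, Write};
use xxhash_rust::xxh64::Xxh64;

/// Copy `src` to `dst`, hashing each block as it passes through.
fn copy_and_hash<R: Read, W: Write>(mut src: R, mut dst: W) -> std::io::Result<u64> {
    let mut hasher = Xxh64::new(0); // seed 0
    let mut buf = [0u8; 64 * 1024];
    loop {
        let n = src.read(&mut buf)?;
        if n == 0 {
            break; // EOF
        }
        hasher.update(&buf[..n]);  // hash the block already in memory...
        dst.write_all(&buf[..n])?; // ...then write it; no extra source read
    }
    Ok(hasher.digest())
}
```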

Changes:

  • Add --verify-checksum CLI flag
  • Add Config.verify_checksum field
  • Add XcpError::ChecksumMismatch error type
  • Update CopyHandle to compute and verify checksums
  • Add xxhash-rust dependency

Tests:

  • 28 comprehensive test cases covering:
    • Empty, small, and large files
    • Binary patterns (zeros, random, alternating)
    • Recursive directory copies
    • Multiple files
    • Both drivers (parfile/parblock)
    • Various block sizes and worker counts
    • Sparse files
    • File overwriting

Performance:

  • ~2x overhead due to destination re-read (e.g., 34ms → 70ms for 50MB)
  • Acceptable trade-off for critical data integrity

Addresses feedback from @tarka, @Kalinda-Myriad, and @OndrikB in issue #71.


lespea commented Oct 6, 2025

Why not xxh3, it's superior in basically every way...

xxHash3 is superior to xxHash64 in every way:
- Faster (~1.5-3x depending on data size)
- Better hash quality
- Optimized for modern architectures

Thanks to @lespea for the suggestion.
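
For illustration, the switch is essentially a type change in the xxhash-rust crate (a sketch, not the actual diff; the unseeded XXH3 streaming hasher still yields a 64-bit digest, so the ChecksumMismatch plumbing is unchanged):

```rust
use xxhash_rust::xxh3::Xxh3;

fn main() {
    let mut hasher = Xxh3::new(); // was Xxh64::new(0) in the sketch above
    hasher.update(b"some data block");
    let digest: u64 = hasher.digest(); // still 64 bits wide
    println!("{digest:016x}");
}
```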

felix068 commented Oct 6, 2025

Because... I hadn't thought of that. Yeah, fair point, xxh3 makes more sense. Will fix.
Thanks!

@Duckfromearth

Thanks so much for this! This is exactly what I was hoping for.


Duckfromearth commented Oct 6, 2025

Hmmm, might be getting some false positives?

❯ xcp -r ./A001A20M/ /projects/deletme/ --verify-checksum --no-perms
222.42 GiB / 222.42 GiB | 100% | 10.32 TiB/s | 00:00:00 remaining

❯ xcp -r 197\ Moon\ Fever\ Cage\ Video/ /projects/deleteme --no-perms --verify-checksum
20:48:29 [ERROR] Error during finalising copy operation File { fd: 9, path: "/moredata/197 Moon Fever Cage Video/media/card c/private/DATABASE/DATABASE.BIN", read: true, write: false } -> File { fd: 8, path: "/projects/deleteme/media/card c/private/DATABASE/DATABASE.BIN", read: false, write: true }: Checksum verification failed for /projects/deleteme/media/c
20:48:33 [ERROR] Error during finalising copy operation File { fd: 5, path: "/moredata/197 Moon Fever Cage Video/media/card c/M4ROOT/CLIP/C1697.MP4", read: true, write: false } -> File { fd: 6, path: "/projects/deleteme/media/card c/M4ROOT/CLIP/C1697.MP4", read: false, write: true }: Checksum verification failed for /projects/deleteme/media/card c/M4ROOT/CLI
20:48:37 [ERROR] Error during finalising copy operation File { fd: 5, path: "/moredata/197 Moon Fever Cage Video/media/card c/M4ROOT/CLIP/C1696.MP4", read: true, write: false } -> File { fd: 6, path: "/projects/deleteme/media/card c/M4ROOT/CLIP/C1696.MP4", read: false, write: true }: Checksum verification failed for /projects/deleteme/media/card c/M4ROOT/CLI
20:49:11 [ERROR] Error during finalising copy operation File { fd: 7, path: "/moredata/197 Moon Fever Cage Video/media/card c/M4ROOT/CLIP/C1701.MP4", read: true, write: false } -> File { fd: 10, path: "/projects/deleteme/media/card c/M4ROOT/CLIP/C1701.MP4", read: false, write: true }: Checksum verification failed for /projects/deleteme/media/card c/M4ROOT/CLIP/C1701.MP4
20:49:11 [ERROR] Error during finalising copy operation File { fd: 3, path: "/moredata/197 Moon Fever Cage Video/media/card c/M4ROOT/CLIP/C1699.MP4", read: true, write: false } -> File { fd: 4, path: "/projects/deleteme/media/card c/M4ROOT/CLIP/C1699.MP4", read: false, write: true }: Checksum verification failed for /projects/deleteme/media/card c/M4ROOT/CLIP/C1699.MP4:
20:49:13 [ERROR] Error during finalising copy operation File { fd: 5, path: "/moredata/197 Moon Fever Cage Video/media/card c/M4ROOT/CLIP/C1700.MP4", read: true, write: false } -> File { fd: 6, path: "/projects/deleteme/media/card c/M4ROOT/CLIP/C1700.MP4", read: false, write: true }: Checksum verification failed for /projects/deleteme/media/card c/M4ROOT/CLIP/C1700.MP4:
20:49:15 [ERROR] Error during finalising copy operation File { fd: 8, path: "/moredata/197 Moon Fever Cage Video/media/card c/M4ROOT/CLIP/C1698.MP4", read: true, write: false } -> File { fd: 9, path: "/projects/deleteme/media/card c/M4ROOT/CLIP/C1698.MP4", read: false, write: true }: Checksum verification failed for /projects/deleteme/media/card c/M4ROOT/CLIP/C1698.MP4:
20:52:07 [ERROR] Error during finalising copy operation File { fd: 5, path: "/moredata/197 Moon Fever Cage Video/media/card b/XDROOT/General/Sony/SALVAGE.TMP", read: true, write: false } -> File { fd: 6, path: "/projects/deleteme/media/card b/XDROOT/General/Sony/SALVAGE.TMP", read: false, write: true }: Checksum verification failed for /projects/deleteme/media/card b/XD
149.31 GiB / 149.31 GiB | 100% | 647.41 MiB/s | 00:00:00 remaining

but checking those directories against each other with rclone:

❯ rclone check "/moredata/197 Moon Fever Cage Video" "/projects/deleteme" --one-way --log-level INFO
2025/10/06 20:56:00 NOTICE: Config file "/home/hex/.config/rclone/rclone.conf" not found - using defaults
2025/10/06 20:56:00 INFO : Using md5 for hash comparisons
2025/10/06 21:00:34 NOTICE: Local file system at /projects/deleteme: 0 differences found
2025/10/06 21:00:34 NOTICE: Local file system at /projects/deleteme: 148 matching files
2025/10/06 21:00:34 INFO :
Transferred: 0 B / 0 B, -, 0 B/s, ETA -
Checks: 148 / 148, 100%
Elapsed time: 4m33.7s

Let me know how I can help figure out what's going on here!

@Duckfromearth

Trying one of the problem files on its own gives this error:

21:06:02 [ERROR] Error during finalising copy operation File { fd: 3, path: "/moredata/197 Moon Fever Cage Video/media/card b/XDROOT/General/Sony/SALVAGE.TMP", read: true, write: false } -> File { fd: 4, path: "/projects/deleteme/SALVAGE.TMP", read: false, write: true }: Checksum verification failed for /projects/deleteme/SALVAGE.TMP: expected 2d06800538d394c2, got 998e

The file in question, if it's helpful:

https://drive.google.com/file/d/14-w8aU_9JZqu1i5KrgWcVfjzR-BaVl0O/view?usp=sharing

Bug: Checksum verification was failing with false positives when used
with --no-perms flag, reporting mismatches even though files were identical
(verified with rclone).

Root cause: The destination file wasn't being synced to disk before
re-reading for verification. BufWriter::flush() only writes to the file
descriptor, not to disk. Without fsync(), the kernel's page cache could
return stale data when re-opening the file for checksum verification.

Solution: Always call sync() on the destination file descriptor before
verifying checksums, regardless of whether --fsync flag is set. This
ensures all data is written to disk before we re-read for verification.

Fixes issue reported by @Duckfromearth in PR tarka#72.
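
The flush-versus-fsync distinction this commit relies on, as a minimal std-only sketch (not xcp's actual finalisation code):

```rust
use std::fs::File;
use std::io::{BufWriter, Write};

fn finish_writes(writer: &mut BufWriter<File>) -> std::io::Result<()> {
    writer.flush()?;              // drains BufWriter's buffer to the fd: write(2)
    writer.get_ref().sync_all()?; // fsync(2): asks the kernel to commit data to disk
    Ok(())
}
```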

Bug: Checksum verification was failing for sparse files because:
- During copy: Only data blocks were hashed (holes were skipped)
- During verification: Entire file was hashed (including zero-filled holes)
- Result: Checksum mismatch even though files were functionally identical

Root cause: Sparse file optimization in both parfile (copy_sparse) and
parblock (map_extents) drivers skips holes during copy, but the verification
re-reads the entire destination file including all holes.

Solution: Disable sparse file optimization when --verify-checksum is enabled.
This ensures consistent hashing by copying and hashing all file content
including holes. Trade-off is acceptable: users wanting checksum verification
prioritize data integrity over sparse file space savings.

Changes:
- operations.rs: Skip copy_sparse() when verify_checksum is enabled
- parblock.rs: Skip extent mapping when verify_checksum is enabled
- checksum.rs test: Update assertion to expect non-sparse destination

Also added fsync before verification to ensure data is written to disk.

Fixes issue reported by @Duckfromearth in PR tarka#72.

felix068 commented Oct 7, 2025

Found the issue!

The problem was with sparse files.
Root cause: SALVAGE.TMP is a sparse file (256 MB logical size, but only 4 KB actually on disk). During copy, xcp's sparse optimization only hashes the actual data blocks (4 KB), but during verification, it re-reads and hashes the entire file (256 MB including zero-filled holes). Result: checksum mismatch even though files are functionally identical.

Fix: Disable sparse file optimization when --verify-checksum is enabled. This ensures consistent hashing by copying all file content including holes. The trade-off is acceptable: users wanting checksum verification prioritize data integrity over sparse file space savings.

Note: This only affects behavior when --verify-checksum is explicitly used. Normal copies (without checksum) still use sparse optimization for maximum performance.
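
As an illustration, the guard amounts to something like the following (type and field names are assumptions, not xcp's actual code):

```rust
struct Opts {
    verify_checksum: bool, // hypothetical field mirroring --verify-checksum
}

fn should_use_sparse_copy(opts: &Opts, src_is_sparse: bool) -> bool {
    // Hole-skipping hashes only the data blocks, while verification re-reads
    // every byte of the destination; keeping the two byte streams identical
    // means disabling the sparse path whenever verification is requested.
    src_is_sparse && !opts.verify_checksum
}
```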

Apologies for the initial issue; I'm new to this codebase and didn't fully explore all the internal mechanisms before implementing the feature. Thanks @Duckfromearth!


felix068 commented Oct 7, 2025

@Duckfromearth I'll give you time to test it on your side and give me feedback if necessary... thanks in advance!

@Duckfromearth

@felix068 thanks for the fix!! That's working beautifully now.

Did some more testing today and here is what I noticed.

Copying to an SSD works flawlessly with no hiccups.

Copying to a single hard drive or ZFS array, it tends to hang intermittently. It does eventually finish, but it takes a while, and if you try to cancel the command it can take 15-30 seconds to cancel.

As far as I can tell this is because it's reading and writing to the disk in parallel, and the disk eventually gets overwhelmed with commands, forcing the process to wait?

This seems tricky to address, because I think the current implementation, where it writes and does the verification in parallel, is ideal for fast SSDs. I think the ideal setup for HDDs would be to write an entire file, then verify that entire file, before moving on to the next file. But of course xcp doesn't know what kind of disk it's using. Maybe an additional flag?

I tried reducing the workers to one, but it doesn't seem to make a difference.

Curious if you have any thoughts

This could also be an issue with my specific hardware or OS; I tested with a couple of different drives and two different SATA controllers.

Thanks again for making this happen!!

Issue: Checksum verification was causing hangs on mechanical hard drives
(HDDs) and ZFS arrays due to forced fsync() after each file write. This was
particularly problematic when copying many files, as each fsync() blocks
waiting for physical disk writes to complete.

Analysis:
- SSDs: Fast fsync, no issues
- HDDs: Slow fsync due to mechanical seeks, causes intermittent hangs
- ZFS: Similar issues with sync performance

Solution: Make fsync optional rather than automatic with checksum verification.

Changes:
- Removed automatic fsync() when verify_checksum is enabled
- Users can now choose:
  * --verify-checksum alone: Fast, works well on most systems
  * --verify-checksum --fsync: Maximum integrity, forces disk sync (slower on HDD)
- Updated README with HDD performance notes and --fsync recommendation

Trade-off: Without fsync, there's a theoretical risk of false positives in
rare cache coherency scenarios, but in practice this is extremely rare on
modern systems. Users who need absolute certainty can use --fsync.

Addresses performance issue reported by @Duckfromearth in PR tarka#72.
@felix068

Sorry for the delay (I was sick).

About the HDD performance issue: I found the problem. When using --verify-checksum, the code was automatically calling fsync after every file write. On mechanical hard drives and ZFS arrays, fsync blocks waiting for physical disk writes, which caused the hangs you experienced.

I've removed the automatic fsync. Now you have two options:

--verify-checksum alone: Fast, no hangs, works on most systems including HDDs

--verify-checksum --fsync: Maximum integrity mode, forces disk sync before verification. Slower on HDDs but guarantees correctness even in rare cache scenarios.

For your HDD and ZFS setup, I recommend using --verify-checksum without --fsync. The fsync flag is optional and only needed if you want absolute certainty for mission-critical data.
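
For illustration, the resulting finalisation logic amounts to something like this (names assumed, not the actual xcp code):

```rust
use std::fs::File;

// Sync to disk only when the user explicitly passed --fsync.
fn finalize(dest: &File, fsync_requested: bool) -> std::io::Result<()> {
    if fsync_requested {
        dest.sync_all()?; // blocks until the device acknowledges the write
    }
    Ok(())
}
```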

Let me know if this fixes the hang issue on your drives.

@Duckfromearth

I forgot to reply here, but I just wanted to say this is working perfectly! Thanks so much for making it happen.


tarka commented Oct 21, 2025

Hi all, sorry I haven't replied to this; I'm focusing on something else at the moment. I'll take a look when I get a chance.

Review comment from tarka (Owner) on the README text:

    performance on HDDs, the verification should still work correctly but may be slower.
    For maximum data integrity assurance, add `--fsync` to force data to disk before
    verification (slower but guarantees correct checksums even in rare cache coherency
    scenarios):

This will also be very slow on NFS mounts; copy_file_range() allows the copy to happen server-side; checksumming will require that data to be copied back to the client for verification.


tarka commented Jan 19, 2026

Tests are failing for me:

---- checksum_sparse_file::test_with_parallel_block_driver stdout ----
STDOUT: 
STDERR: 
Checking: "/home/ssmith/programming/xcp/target/.tmpIVwNL0/sparse.bin"

thread 'checksum_sparse_file::test_with_parallel_block_driver' (62073) panicked at tests/checksum.rs:191:5:
assertion failed: !probably_sparse(&dest).unwrap()
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- checksum_sparse_file::test_with_parallel_file_driver stdout ----
STDOUT: 
STDERR: 
Checking: "/home/ssmith/programming/xcp/target/.tmp6q3KRv/sparse.bin"

thread 'checksum_sparse_file::test_with_parallel_file_driver' (62074) panicked at tests/checksum.rs:191:5:
assertion failed: !probably_sparse(&dest).unwrap()


failures:
    checksum_sparse_file::test_with_parallel_block_driver
    checksum_sparse_file::test_with_parallel_file_driver

test result: FAILED. 26 passed; 2 failed; 0 ignored; 0 measured; 0 filtered out; finished in 1.32s


tarka pushed commits referencing this pull request on Jan 19, 2026 (the three fix commits above).

Review comment from tarka (Owner) on this hunk:

    }

    if let Some(h) = hasher {
        *self.src_checksum.lock().unwrap() = Some(h.digest());

This shouldn't be unwrapped; it can be mapped to an error type.
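
One way to address this, as a self-contained sketch (the error type here is a hypothetical stand-in for an XcpError variant):

```rust
use std::sync::Mutex;

#[derive(Debug)]
enum CopyError {
    LockPoisoned(String), // hypothetical stand-in for an XcpError variant
}

fn store_digest(slot: &Mutex<Option<u64>>, digest: u64) -> Result<(), CopyError> {
    let mut guard = slot
        .lock()
        .map_err(|e| CopyError::LockPoisoned(e.to_string()))?; // no unwrap()
    *guard = Some(digest);
    Ok(())
}
```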

Review comment from tarka (Owner) on this hunk:

    @@ -119,7 +139,9 @@ impl CopyHandle {
        if self.try_reflink()? {

This setting will be silently ignored if on a reflink-capable filesystem (btrfs, XFS). Is this deliberate?
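
If it isn't deliberate, one possible resolution is to gate the reflink fast path on the new flag, sketched here with hypothetical names (not xcp's actual API):

```rust
struct Handle {
    verify_checksum: bool, // hypothetical field
}

impl Handle {
    fn try_reflink(&self) -> std::io::Result<bool> {
        Ok(false) // stub: pretend reflink is unavailable
    }

    fn copy(&self) -> std::io::Result<()> {
        // A reflink clones extents without moving any data, so nothing would
        // ever pass through the hasher; skip it when verification is on.
        if !self.verify_checksum && self.try_reflink()? {
            return Ok(());
        }
        // ...fall through to the buffered copy that feeds the checksum...
        Ok(())
    }
}
```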
