Summary
squirrel's scan-back fingerprint reads the ciphertext checksum from the underlying remote with rclone lsjson --hash. For the s3 backend it treats the md5 hash slot as the object ETag. Per rclone's S3 documentation, this is reliable only for single-part uploads (and for multipart objects that carry an MD5 in metadata). A plain multipart object's ETag has the form <hex>-<parts>, and rclone does not surface it in the md5 hash slot: it returns the metadata MD5 or an empty hash. So large objects (those the backend splits into multiple parts, which is exactly the multi-GB media case) may leave their fingerprint pending rather than capturing the ETag.
The current code is safe about this — an empty hash leaves the remote_objects pair NULL with a "fingerprint stays pending" warning, never a fake value — but the ETag-capture coverage for multipart objects is unconfirmed against a live S3-compatible backend.
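The safe behavior described above can be sketched roughly as follows. This is not squirrel's actual code: the function name fingerprint_pair, the (algorithm, value) shape of the pair, and the use of Python's logging module are all assumptions made for illustration. The only point it demonstrates is that an empty or non-MD5 value in the md5 slot yields no pair at all, never a substitute value.

```python
import logging
import re
from typing import Optional, Tuple

log = logging.getLogger("scanback-sketch")

# A plain MD5 digest is exactly 32 hex characters.
_MD5_HEX = re.compile(r"[0-9a-f]{32}")


def fingerprint_pair(path: str, md5_slot: Optional[str]) -> Optional[Tuple[str, str]]:
    """Return a hypothetical (algorithm, value) pair, or None to stay pending.

    `md5_slot` is whatever the listing reported in the md5 hash slot.
    Anything that is not a well-formed MD5 (empty, missing, or a
    composite "<hex>-<parts>" string) is treated as "no fingerprint".
    """
    value = (md5_slot or "").strip().lower()
    if _MD5_HEX.fullmatch(value):
        return ("md5", value)
    # Never invent a value: leave the pair unset and say so.
    log.warning("%s: no usable md5 hash; fingerprint stays pending", path)
    return None
```

A caller would write the returned pair to storage only when it is not None, which leaves the stored columns NULL in the pending case.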
What to confirm / decide
- Against a real S3-compatible backend with multipart uploads: what does rclone lsjson --hash actually return for a multipart object's md5 hash? Empty, the metadata MD5, or the composite ETag?
- If the composite ETag is needed as the fingerprint, the right rclone surface is likely lsf --format / --metadata (the ETag is metadata, not a hash), not lsjson --hash. Decide whether to switch the s3 capture path to read the ETag via metadata, or to set an MD5 in object metadata at upload time so single-hash capture works.
- The design doc deferred S3 additional checksums (rclone can't set them); this issue is the read-side counterpart.
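One way to run the first check is to save the output of rclone lsjson --hash for a prefix that holds known multipart objects and classify what each md5 slot contains. The sketch below is a hypothetical helper, not part of squirrel. It assumes the listing is a JSON array whose entries carry Path, IsDir, and a Hashes mapping, and it matches the hash name case-insensitively because the exact key casing is an assumption here.

```python
import json
import re

MD5_RE = re.compile(r"[0-9a-f]{32}")
COMPOSITE_RE = re.compile(r"[0-9a-f]{32}-[0-9]+")  # "<hex>-<parts>"


def classify_md5_slot(entry: dict) -> str:
    """Classify one listing entry as 'empty', 'md5', 'composite-etag', or 'other'."""
    hashes = entry.get("Hashes") or {}
    value = ""
    for name, digest in hashes.items():
        if name.lower() == "md5":  # key casing is an assumption
            value = (digest or "").strip().lower()
    if not value:
        return "empty"
    if MD5_RE.fullmatch(value):
        return "md5"
    if COMPOSITE_RE.fullmatch(value):
        return "composite-etag"
    return "other"


def classify_listing(lsjson_text: str) -> dict:
    """Map each non-directory Path in a saved lsjson listing to its class."""
    entries = json.loads(lsjson_text)
    return {
        e["Path"]: classify_md5_slot(e)
        for e in entries
        if not e.get("IsDir")
    }
```

A result of "md5" for a multipart object does not by itself say where the value came from (single-part ETag or metadata MD5), so the object's upload method still has to be known from the test setup.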
Behavior is correct today (pending + warning); this tracks closing the multipart ETag coverage gap so content-addressed offsite objects on s3 can actually gate offload via a recorded fingerprint.
Raised during the standards review of the scan-back fingerprint PR (#116).