Conversation

@a-ba
Contributor

@a-ba a-ba commented Jan 22, 2026

This PR drastically improves the throughput of computing delta signatures on large files.

To test it:

mkdir rootfs
fallocate --length 4G rootfs/blob
time dar -c archive -R rootfs --delta sig:fixed:4096

before:

real    1m21,098s
user    0m40,691s
sys     0m39,587s

after:

real    0m8,752s
user    0m6,676s
sys     0m1,685s

Rationale

During archive creation, dar stores the signature in memory using the memoryfile class.

It appears that memoryfile has terrible scaling properties when subjected to short writes, because:

  1. the storage class stores each newly written block in a separate heap-allocated cellule object and these objects are collected in an internal linked list
  2. each call to memoryfile::inherited_write() implies at least one call to storage::size(), which fully traverses the linked list.

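As the file grows, throughput keeps dropping: every write walks a list that gains one node per write, so the total cost grows quadratically with the number of writes. Rough measurements: 50 MB/s at t0, 3.3 MB/s at t0+0h20, 1.5 MB/s at t0+1h30, 1.2 MB/s at t0+2h30. At some point the CPU spends most of its time on page faults.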
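The scaling problem described above can be reproduced with a toy model. The following is only a sketch under stated assumptions: `toy_chunk_store`, its `visited` counter and `visits_for()` are hypothetical names invented for illustration and are not libdar's storage or memoryfile classes. It only shows why "one node per write" combined with "one full traversal per write" leads to quadratic work.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <vector>

// Hypothetical stand-in for a chunked store; NOT libdar's storage class.
struct toy_chunk_store
{
    std::list<std::vector<char>> chunks; // one heap-allocated chunk per write
    std::size_t visited = 0;             // counts chunks walked by size()

    // Walks the whole list, like a size() that keeps no running total.
    std::size_t size()
    {
        std::size_t total = 0;
        for (const auto &c : chunks)
        {
            total += c.size();
            ++visited;
        }
        return total;
    }

    // Each write first asks for the current size, then appends a new chunk.
    void write(const char *data, std::size_t len)
    {
        assert(data != nullptr || len == 0);
        (void)size();
        chunks.emplace_back(data, data + len);
    }
};

// Number of chunks visited while performing `writes` short writes.
std::size_t visits_for(std::size_t writes)
{
    toy_chunk_store s;
    const char block[16] = {};
    for (std::size_t i = 0; i < writes; ++i)
        s.write(block, sizeof block);
    return s.visited;
}
```

In this model, `visits_for(n)` equals n(n-1)/2, so multiplying the number of short writes by 10 multiplies the traversal work by roughly 100.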

This patch modifies generic_rsync::inherited_read() so that its internal buffer is flushed only when full (rather than on every call). This reduces the size of the linked list by two orders of magnitude. Hashing a 48 GB file with a 4 kB block size now takes 4 minutes (instead of 8 hours previously).
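The "flush only when full" idea can be sketched in isolation. This is a minimal illustration, not the code of generic_rsync::inherited_read(): the class `coalescing_writer`, the helper `downstream_writes_for()` and the in-memory `sink` are all assumptions made up for the example. It counts how many writes reach the downstream store when short pieces are accumulated in a fixed-size buffer first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical sketch (not generic_rsync): coalesce short writes into
// buffer-sized writes before handing them to a downstream store.
class coalescing_writer
{
public:
    explicit coalescing_writer(std::size_t capacity)
        : buf(capacity), used(0), flushes(0)
    {
        assert(capacity > 0);
    }

    void write(const char *data, std::size_t len)
    {
        while (len > 0)
        {
            const std::size_t room = buf.size() - used;
            const std::size_t n = len < room ? len : room;
            std::memcpy(buf.data() + used, data, n);
            used += n;
            data += n;
            len -= n;
            if (used == buf.size())
                flush(); // forward only once the buffer is full
        }
    }

    // Forwards whatever is pending; must be called once at end of stream.
    void flush()
    {
        if (used == 0)
            return;
        sink.insert(sink.end(), buf.begin(),
                    buf.begin() + static_cast<std::ptrdiff_t>(used));
        ++flushes;
        used = 0;
    }

    std::size_t downstream_writes() const { return flushes; }
    std::size_t bytes_out() const { return sink.size(); }

private:
    std::vector<char> buf;
    std::size_t used;
    std::size_t flushes;
    std::vector<char> sink; // stands in for the downstream store
};

// Downstream writes needed for `pieces` short writes of `piece_len` bytes.
std::size_t downstream_writes_for(std::size_t pieces, std::size_t piece_len,
                                  std::size_t capacity)
{
    coalescing_writer w(capacity);
    const std::vector<char> piece(piece_len, 'x');
    for (std::size_t i = 0; i < pieces; ++i)
        w.write(piece.data(), piece.size());
    w.flush();
    return w.downstream_writes();
}
```

With 1000 pieces of 20 bytes, a 20-byte buffer produces 1000 downstream writes while a 2000-byte buffer produces 10. The sizes are chosen for illustration only and are not the ones dar uses.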

A longer-term solution would be to refactor memoryfile. Perhaps a naive std::vector would yield better results than the storage class: is the optimisation for arbitrary insertions really relevant?

The PR also fixes a possible null-pointer dereference when generic_rsync::send_eof() is called on delta creation. send_eof() has no use when creating a delta (it is only relevant for creating signatures), and it may dereference x_output, which is null when creating a delta.
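The shape of such a guard can be sketched as follows. This is an assumption-laden illustration, not the actual patch: `toy_rsync`, `toy_output` and the `mode` enum are hypothetical names, and the real fix may simply avoid calling send_eof() in delta mode rather than checking inside it.

```cpp
#include <cassert>
#include <string>

// Hypothetical output target; stands in for whatever x_output points to.
struct toy_output
{
    std::string data;
    bool eof_sent = false;
};

// Hypothetical sketch, NOT generic_rsync: only signature mode has an output.
class toy_rsync
{
public:
    enum class mode { signature, delta };

    toy_rsync(mode m, toy_output *out) : m_(m), out_(out)
    {
        // Signature creation needs somewhere to write; delta creation does not.
        assert(m != mode::signature || out != nullptr);
    }

    // Returns true if an end-of-file marker was forwarded to the output.
    bool send_eof()
    {
        if (m_ != mode::signature || out_ == nullptr)
            return false; // nothing to do, and nothing to dereference
        out_->eof_sent = true;
        return true;
    }

private:
    mode m_;
    toy_output *out_;
};
```

In delta mode the call becomes a harmless no-op instead of touching a null pointer.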

@Edrusb Edrusb self-assigned this Jan 22, 2026
@Edrusb
Owner

Edrusb commented Jan 22, 2026

Thanks for this feedback and proposal.

The storage class is probably one of the oldest classes of dar, and had the purpose of storing the bits of arbitrary long integers (infinint class)... this was the year 2001, and the fear of the bug of that time was still warm... :)

And I have not reviewed this code for decades, but I extended its use a few years later for class memory_file, as you have noted...

Well there is clearly something here to review and I thank you for pointing me to it!

For the short term, I will review your pull request, and delay the 2.8.3 release to see how to include your proposal in it.

If you can rebase your pull request on branch_2.8.x, that would help me :) Thanks

@Edrusb Edrusb added the enhancement behavor/feature enhancement label Jan 22, 2026
a-ba added 2 commits January 22, 2026 23:42
send_eof() has no use when creating a delta (it is only relevant for
creating signatures)

furthermore it may dereference x_output (which is null when creating a delta)
@a-ba a-ba changed the base branch from master to branch_2.8.x January 23, 2026 00:03
@a-ba
Contributor Author

a-ba commented Jan 23, 2026

had the purpose of storing the bits of arbitrary long integers

OK, I see ;)

The branch is rebased!

@Edrusb Edrusb merged commit 49fd7f1 into Edrusb:branch_2.8.x Jan 23, 2026
@Edrusb
Owner

Edrusb commented Jan 23, 2026

Thanks for the patch!

I will make two tiny changes:

  • in the coding style
  • and probably reorder things to avoid calling step_forward() with tmp equal to zero.

@Edrusb
Owner

Edrusb commented Jan 23, 2026

I have the bad habit of updating a THANKS file at each release. Can I use your real name (according to the email address in the commit)?
Also, forget the reordering I mentioned above... I'll keep your code as is, just fixing the coding style.

Last, I'm looking for references. Would you mind describing your use case in these simple terms:

  • company/username/or stay anonymous if you prefer
  • when this use case started (just the year is fine) and, if applicable, when it ended
  • use case description
  • typical size created for backups/archives
  • media used (disk/tape/cloud...)
  • key dar/libdar features/differentiators that led you to choose dar for this use case
  • everything else that does not fit in the previous items

Something I could use to consolidate other use cases...
You can contact me directly by email for the follow-up.

@a-ba
Contributor Author

a-ba commented Jan 26, 2026

Feedback sent!

Thanks for the great responsiveness!
