PoC: Add fdeflate as a new backend #545
Conversation
Implement the fdeflate crate as an optional pure-Rust backend, selectable via the `fdeflate` Cargo feature. fdeflate is a fast DEFLATE implementation that uses only safe Rust code.

Key implementation details:

- The decompressor maintains an internal window buffer, since fdeflate uses the output buffer as a lookback window for back-references but flate2 passes fresh output slices on each call.
- The compressor buffers all output internally and emits it on Finish, since fdeflate writes a single deflate block.
- Accounts for fdeflate's bit-buffer over-read via the new `Decompressor::unconsumed_bytes()` API, ensuring accurate `total_in` tracking.
- Mid-stream flush tests (Partial/Sync/Full) are gated behind `cfg(any(feature = "any_zlib", feature = "miniz_oxide"))`, since fdeflate does not support mid-stream flushing.

Known issue: roundtrip through `write::DeflateEncoder` piped into `write::DeflateDecoder` fails for some inputs (e.g. the single byte `[0]`).

Co-Authored-By: Claude <noreply@anthropic.com>
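The window-buffering idea from the commit message above can be sketched roughly as follows. This is an illustrative, stdlib-only sketch (the type and method names are mine, not fdeflate's or this branch's actual code): the decompressor keeps the last 32 KiB of decompressed output around so back-references can still be resolved even though flate2 hands it a fresh output slice on every call.

```rust
// DEFLATE back-references can reach at most 32 KiB backwards, so the
// window only ever needs to hold the last 32 KiB of output.
const WINDOW_SIZE: usize = 32 * 1024;

struct Window {
    buf: Vec<u8>,
}

impl Window {
    fn new() -> Self {
        Window { buf: Vec::new() }
    }

    // Append the bytes just written into the caller's output slice,
    // then trim the front so only the last WINDOW_SIZE bytes remain.
    fn push(&mut self, produced: &[u8]) {
        self.buf.extend_from_slice(produced);
        if self.buf.len() > WINDOW_SIZE {
            let excess = self.buf.len() - WINDOW_SIZE;
            self.buf.drain(..excess);
        }
    }
}

fn main() {
    let mut w = Window::new();
    // Pretend we decompressed 40 KiB across several calls with fresh
    // output slices; the window still caps out at 32 KiB.
    w.push(&[7u8; 40 * 1024]);
    assert_eq!(w.buf.len(), WINDOW_SIZE);
    println!("window holds {} bytes", w.buf.len());
}
```

The actual backend would consult this buffer when resolving back-references that point before the start of the current output slice.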
Instead of buffering all compressed output in a `Vec<u8>` and draining it later, drain the compressor's inner writer directly after each `write_data()` call. This avoids unbounded memory growth during compression and reduces unnecessary copying.

Uses fdeflate's new `get_writer_mut()` API to access the inner `Vec<u8>` writer and drain produced bytes incrementally into the caller's output slice.

Co-Authored-By: Claude <noreply@anthropic.com>
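The incremental-drain pattern described above can be sketched in isolation like this. Note this is a hedged, stdlib-only illustration: `inner` stands in for the `Vec<u8>` writer that fdeflate's `get_writer_mut()` exposes, and `drain_into` is a hypothetical helper name, not an API of either crate.

```rust
// Move as many buffered bytes as fit into the caller's output slice,
// keeping any leftover bytes buffered for the next call. This is what
// bounds memory use: the inner Vec never accumulates the whole stream.
fn drain_into(inner: &mut Vec<u8>, output: &mut [u8]) -> usize {
    let n = inner.len().min(output.len());
    output[..n].copy_from_slice(&inner[..n]);
    inner.drain(..n); // discard the bytes we handed out
    n
}

fn main() {
    // Pretend the compressor just produced five bytes but the caller
    // only gave us room for three.
    let mut inner = vec![1u8, 2, 3, 4, 5];
    let mut out = [0u8; 3];
    let n = drain_into(&mut inner, &mut out);
    assert_eq!(n, 3);
    assert_eq!(out, [1, 2, 3]);
    assert_eq!(inner, vec![4, 5]); // leftovers await the next call
    println!("drained {} bytes, {} left over", n, inner.len());
}
```

In the backend, this drain would run after every `write_data()` call, which is why the old grow-then-drain `Vec<u8>` approach and its extra copy go away.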
I've adapted the benchmarking harness @folkertdev shared to also measure fdeflate using this PoC: https://github.com/Shnatsel/flate2_bench/tree/fdeflate

The results are shockingly good.

run.sh output with fdeflate on desktop Zen4:
Ubuntu clang version 14.0.0-1ubuntu1.1
rustc 1.97.0-nightly (14196dbfa 2026-04-12)
[benchmark tables for inflate and deflate levels 1/6/9, in chunks of 4096 bytes, not preserved in this extraction]

run.sh output with fdeflate on Apple M4:
Apple clang version 21.0.0 (clang-2100.0.123.102)
rustc 1.97.0-nightly (14196dbfa 2026-04-12)
Compression is not yet as optimized as decompression in fdeflate, so there's still low-hanging fruit. A two-line change in image-rs/fdeflate#74 takes compression performance at level 6 from 23 MB/s to 27 MB/s, double the miniz_oxide performance, and at level 9 from 18 MB/s to 22 MB/s.

Compression ratios on Silesia are slightly inferior to the other backends. The compression heuristics have been tuned for PNG data and might have to be adjusted to better handle other kinds of inputs.
I've tried a corpus other than Silesia. At compression level 1, however, fdeflate performance collapses, dipping even below its own level 6 compression speed, which suggests that level 1 is overfitted to PNG data and will require changes. You can find all the details at image-rs/fdeflate#75.
I think this is just fundamental to flate2's API. The copying needs to happen somewhere. Other backends just do it internally while fdeflate outsources the responsibility.
Started working on support here: image-rs/fdeflate#72. The unit tests in flate2 also assume fixed/stored blocks so image-rs/fdeflate#73 will also be needed to get them to work.
I think that miniz_oxide doesn't support this either. AFAIK dictionaries aren't really used much.
The apparent slowdown at compression level 1 on some corpora turned out to be entirely due to the jank in this PR and not fdeflate's fault. With up to 80% of the computation time being wasted on the aforementioned jank, a proper implementation should show much higher compression performance than the early benchmarks conducted on this PR.
Looking at the inflate profile, both appear to compute checksums, so I don't think fdeflate is doing any less work here. It really does seem to be just faster than zlib-rs, which also mirrors our findings for the
Flushing was added to fdeflate. The remaining required changes to … This was only ever meant as a proof of concept, and the concept seems sufficiently proven, so I'll go ahead and close this.
This is a very early proof-of-concept, just to get a sense for performance and API gaps. It is mostly vibe-coded and most definitely should not be merged.
Notable findings:
- Some new APIs had to be added to fdeflate, but nothing too crazy. This branch depends on my fork with the required APIs exposed (jankily).
- Tests pass, minus the `[0]` input edge case (fails) and the mid-stream flushing added in #498 (unsupported?). CI is erroneously green because I didn't add CI jobs for this backend yet.
- crc32fast is also not the fastest CRC32 around (see #523), so performance could conceivably be pushed further, either via #523 or by adapting the zlib-rs implementation.
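For context on the checksum work mentioned above: the zlib wrapper uses Adler-32, while gzip and raw-CRC paths use the standard reflected CRC-32 (polynomial 0xEDB88320), which is what crc32fast computes. A minimal bitwise reference, shown here only to illustrate what the fast table- and SIMD-based implementations are racing against, looks like this:

```rust
// Bitwise reflected CRC-32 (IEEE), one bit at a time. Real
// implementations like crc32fast process whole tables or SIMD lanes
// per step; this naive loop does 8 conditional shifts per byte.
fn crc32(data: &[u8]) -> u32 {
    let mut crc = 0xFFFF_FFFFu32;
    for &b in data {
        crc ^= b as u32;
        for _ in 0..8 {
            if crc & 1 != 0 {
                crc = (crc >> 1) ^ 0xEDB8_8320;
            } else {
                crc >>= 1;
            }
        }
    }
    !crc
}

fn main() {
    // "123456789" is the standard CRC-32 check input; its checksum
    // is the well-known check value 0xCBF43926.
    assert_eq!(crc32(b"123456789"), 0xCBF4_3926);
    println!("crc32 = {:#010x}", crc32(b"123456789"));
}
```

The gap between this bit-at-a-time loop and a slice-by-8 table or hardware-accelerated implementation is exactly the kind of headroom #523 is about.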
@fintelia FYI