
PoC: Add fdeflate as a new backend#545

Closed
Shnatsel wants to merge 2 commits into rust-lang:main from Shnatsel:fdeflate-poc-2


@Shnatsel
Member

@Shnatsel Shnatsel commented Apr 13, 2026

This is a very early proof-of-concept, just to get a sense for performance and API gaps. It is mostly vibe-coded and most definitely should not be merged.

Notable findings:

  • Some new APIs for private state need to be exposed from fdeflate, but nothing too crazy. This branch depends on my fork with the required APIs exposed (jankily).
  • The flate2 wrapper needs to maintain an internal window buffer since fdeflate uses the output buffer as a lookback window for back-references, but flate2 passes fresh output slices on each call. This introduces an extra in-memory copy in the inflate path.
  • Mid-stream flushing doesn't seem to be supported by fdeflate
  • Dictionary functionality also seems to be missing
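The window-buffer point above can be sketched roughly as follows. This is a minimal illustration, not the code in this branch: `Window`, `push_literals`, `push_match`, and `drain_into` are hypothetical names, and no fdeflate or flate2 API is used. It shows why the wrapper has to own the history (back-references must resolve against bytes already handed to the caller) and where the extra copy comes from.

```rust
/// Hypothetical helper for illustration; not part of flate2 or fdeflate.
/// DEFLATE back-references reach at most 32 KiB behind the current position.
const WINDOW_SIZE: usize = 32 * 1024;

pub struct Window {
    /// Recent history followed by bytes not yet handed to the caller.
    buf: Vec<u8>,
    /// Index in `buf` of the first byte not yet copied out.
    delivered: usize,
}

impl Window {
    pub fn new() -> Self {
        Window { buf: Vec::new(), delivered: 0 }
    }

    /// Append decoded literal bytes.
    pub fn push_literals(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
    }

    /// Resolve a back-reference against the retained history.
    /// Returns false if the distance points outside the window.
    pub fn push_match(&mut self, distance: usize, length: usize) -> bool {
        if distance == 0 || distance > WINDOW_SIZE || distance > self.buf.len() {
            return false;
        }
        let start = self.buf.len() - distance;
        // Byte-by-byte so that overlapping matches (length > distance) work.
        for i in 0..length {
            let byte = self.buf[start + i];
            self.buf.push(byte);
        }
        true
    }

    /// Copy pending bytes into the caller's fresh slice (the extra copy),
    /// then trim delivered history down to WINDOW_SIZE bytes.
    pub fn drain_into(&mut self, out: &mut [u8]) -> usize {
        let pending = &self.buf[self.delivered..];
        let n = pending.len().min(out.len());
        out[..n].copy_from_slice(&pending[..n]);
        self.delivered += n;
        if self.delivered > WINDOW_SIZE {
            let excess = self.delivered - WINDOW_SIZE;
            self.buf.drain(..excess);
            self.delivered -= excess;
        }
        n
    }
}
```

The caller's output slice can be any size on each call; a match pushed after a partial drain still sees the earlier bytes because they remain in `buf`.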

Tests pass, minus the [0] input edge case (fails) and mid-stream flushing added in #498 (unsupported?). CI is erroneously green because I didn't add CI jobs for this backend yet.

crc32fast is also not the fastest CRC32 around (see #523) so performance could conceivably be pushed further, either via #523 or by adapting the zlib-rs implementation.
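For context on what is being accelerated, here is a minimal bitwise sketch of the CRC-32 used by the gzip format. It is the slow reference form, written only for illustration; it is not how crc32fast, #523, or zlib-rs compute it, and the function name `crc32_bitwise` is made up for this example.

```rust
/// Bitwise CRC-32 with the reflected polynomial 0xEDB88320.
/// Illustrative only: processes one bit at a time, so it is far slower
/// than table-driven or SIMD implementations.
pub fn crc32_bitwise(data: &[u8]) -> u32 {
    let mut crc: u32 = 0xFFFF_FFFF;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0xEDB8_8320 & mask);
        }
    }
    !crc
}
```

The standard check value applies: `crc32_bitwise(b"123456789")` is `0xCBF43926`. Faster implementations produce the same result while processing many bytes per step.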

@fintelia FYI

Shnatsel and others added 2 commits April 13, 2026 23:25
Implement the fdeflate crate as an optional pure-Rust backend, selectable
via the `fdeflate` Cargo feature. fdeflate is a fast DEFLATE implementation
that uses only safe Rust code.

Key implementation details:
- The decompressor maintains an internal window buffer since fdeflate uses
  the output buffer as a lookback window for back-references, but flate2
  passes fresh output slices on each call.
- The compressor buffers all output internally and emits it on Finish,
  since fdeflate writes a single deflate block.
- Accounts for fdeflate's bit-buffer over-read via the new
  Decompressor::unconsumed_bytes() API, ensuring accurate total_in tracking.
- Mid-stream flush tests (Partial/Sync/Full) are gated behind
  cfg(any(feature = "any_zlib", feature = "miniz_oxide")) since fdeflate
  does not support mid-stream flushing.

Known issue: roundtrip through write::DeflateEncoder piped into
write::DeflateDecoder fails for some inputs (e.g. single byte [0]).

Co-Authored-By: Claude <noreply@anthropic.com>
Instead of buffering all compressed output in a Vec<u8> and draining it
later, drain the compressor's inner writer directly after each
write_data() call. This avoids unbounded memory growth during
compression and reduces unnecessary copying.

Uses fdeflate's new get_writer_mut() API to access the inner Vec<u8>
writer and drain produced bytes incrementally into the caller's output
slice.

Co-Authored-By: Claude <noreply@anthropic.com>
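The incremental draining described in this commit can be sketched like this. It is a simplified stand-in, not the actual code: a plain `Vec<u8>` plays the role of the inner writer reached via `get_writer_mut()`, and `drain_into` is a hypothetical helper name.

```rust
/// Move as many bytes as fit from the front of `produced` into `out`
/// and return how many were copied. Leftover bytes stay queued for the
/// next call, so the queue only holds what the caller has not yet taken.
pub fn drain_into(produced: &mut Vec<u8>, out: &mut [u8]) -> usize {
    let n = produced.len().min(out.len());
    out[..n].copy_from_slice(&produced[..n]);
    // Removing the prefix shifts the remainder down; acceptable here
    // because the remainder is expected to be small.
    produced.drain(..n);
    n
}
```

Calling this after every write keeps the queue bounded by the output of a single write plus whatever did not fit in the caller's slice, instead of growing with the whole stream.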
@Shnatsel Shnatsel changed the title PoC: Add fdeflate as a new decompression backend PoC: Add fdeflate as a new backend Apr 13, 2026
@Shnatsel
Member Author

Shnatsel commented Apr 13, 2026

I've adapted the benchmarking harness @folkertdev shared to also measure fdeflate using this PoC: https://github.com/Shnatsel/flate2_bench/tree/fdeflate

The results are shockingly good, with fdeflate outperforming even zlib-rs at decompression. It's possible that fdeflate is doing less work in this PoC, e.g. by skipping checksums; but still, this is very promising!

run.sh output with fdeflate on desktop Zen4

Ubuntu clang version 14.0.0-1ubuntu1.1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

rustc 1.97.0-nightly (14196dbfa 2026-04-12)
binary: rustc
commit-hash: 14196dbfa3eb7c30195251eac092b1b86c8a2d84
commit-date: 2026-04-12
host: x86_64-unknown-linux-gnu
release: 1.97.0-nightly
LLVM version: 22.1.2

-- inflate (chunks of 4096 bytes) --
target/release/flate2_bench_miniz_oxide
mean runtime 0.028137s (stdev of 0.000248) at 533.374 MB/s, ratio of 0.404
target/release/flate2_bench_fdeflate
mean runtime 0.018406s (stdev of 0.000335) at 815.348 MB/s, ratio of 0.404
target/release/flate2_bench_zlib_ng
mean runtime 0.024131s (stdev of 0.000032) at 621.911 MB/s, ratio of 0.404
target/release/flate2_bench_zlib_rs
mean runtime 0.020800s (stdev of 0.000088) at 721.516 MB/s, ratio of 0.404

-- deflate level 1 (chunks of 4096 bytes) --

target/release/flate2_bench_miniz_oxide
mean runtime 0.058150s (stdev of 0.002528) at 129.649 MB/s, ratio of 1.991
target/release/flate2_bench_fdeflate
mean runtime 0.040613s (stdev of 0.001354) at 237.573 MB/s, ratio of 1.555
target/release/flate2_bench_zlib_ng
mean runtime 0.044328s (stdev of 0.000423) at 183.417 MB/s, ratio of 1.846
target/release/flate2_bench_zlib_rs
mean runtime 0.047411s (stdev of 0.001216) at 171.491 MB/s, ratio of 1.846

-- deflate level 6 (chunks of 4096 bytes) --

target/release/flate2_bench_miniz_oxide
mean runtime 0.443965s (stdev of 0.000692) at 13.772 MB/s, ratio of 2.454
target/release/flate2_bench_fdeflate
mean runtime 0.267347s (stdev of 0.000694) at 23.136 MB/s, ratio of 2.426
target/release/flate2_bench_zlib_ng
mean runtime 0.162743s (stdev of 0.000310) at 37.848 MB/s, ratio of 2.436
target/release/flate2_bench_zlib_rs
mean runtime 0.170855s (stdev of 0.001357) at 36.051 MB/s, ratio of 2.436

-- deflate level 9 (chunks of 4096 bytes) --

target/release/flate2_bench_miniz_oxide
mean runtime 0.751213s (stdev of 0.005024) at 8.101 MB/s, ratio of 2.466
target/release/flate2_bench_fdeflate
mean runtime 0.332334s (stdev of 0.001516) at 18.511 MB/s, ratio of 2.439
target/release/flate2_bench_zlib_ng
mean runtime 0.335950s (stdev of 0.000510) at 18.051 MB/s, ratio of 2.475
target/release/flate2_bench_zlib_rs
mean runtime 0.335970s (stdev of 0.000383) at 18.050 MB/s, ratio of 2.475

run.sh output with fdeflate on Apple M4

Apple clang version 21.0.0 (clang-2100.0.123.102)
Target: arm64-apple-darwin25.4.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin

rustc 1.97.0-nightly (14196dbfa 2026-04-12)
binary: rustc
commit-hash: 14196dbfa3eb7c30195251eac092b1b86c8a2d84
commit-date: 2026-04-12
host: aarch64-apple-darwin
release: 1.97.0-nightly
LLVM version: 22.1.2
\n -- inflate (chunks of 4096 bytes) -- \n
target/release/flate2_bench_miniz_oxide
mean runtime 0.036069s (stdev of 0.010199) at 416.072 MB/s, ratio of 0.404
target/release/flate2_bench_fdeflate
mean runtime 0.022164s (stdev of 0.006995) at 677.090 MB/s, ratio of 0.404
target/release/flate2_bench_zlib_ng
mean runtime 0.036059s (stdev of 0.012593) at 416.191 MB/s, ratio of 0.404
target/release/flate2_bench_zlib_rs
mean runtime 0.028451s (stdev of 0.009108) at 527.478 MB/s, ratio of 0.404
\n -- deflate level 1 (chunks of 4096 bytes) -- \n
target/release/flate2_bench_miniz_oxide
mean runtime 0.051375s (stdev of 0.000310) at 146.743 MB/s, ratio of 1.991
target/release/flate2_bench_fdeflate
mean runtime 0.031730s (stdev of 0.000100) at 304.089 MB/s, ratio of 1.555
target/release/flate2_bench_zlib_ng
mean runtime 0.047112s (stdev of 0.000163) at 172.578 MB/s, ratio of 1.846
target/release/flate2_bench_zlib_rs
mean runtime 0.051220s (stdev of 0.000208) at 158.735 MB/s, ratio of 1.846
\n -- deflate level 6 (chunks of 4096 bytes) -- \n
target/release/flate2_bench_miniz_oxide
mean runtime 0.276999s (stdev of 0.001565) at 22.074 MB/s, ratio of 2.454
target/release/flate2_bench_fdeflate
mean runtime 0.194151s (stdev of 0.004169) at 31.858 MB/s, ratio of 2.426
target/release/flate2_bench_zlib_ng
mean runtime 0.129773s (stdev of 0.002081) at 47.464 MB/s, ratio of 2.436
target/release/flate2_bench_zlib_rs
mean runtime 0.145596s (stdev of 0.001938) at 42.306 MB/s, ratio of 2.436
\n -- deflate level 9 (chunks of 4096 bytes) -- \n
target/release/flate2_bench_miniz_oxide
mean runtime 0.456074s (stdev of 0.001685) at 13.344 MB/s, ratio of 2.466
target/release/flate2_bench_fdeflate
mean runtime 0.242042s (stdev of 0.006304) at 25.417 MB/s, ratio of 2.439
target/release/flate2_bench_zlib_ng
mean runtime 0.274085s (stdev of 0.002296) at 22.126 MB/s, ratio of 2.475
target/release/flate2_bench_zlib_rs
mean runtime 0.260981s (stdev of 0.001494) at 23.237 MB/s, ratio of 2.475
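
To read the inflate numbers above as relative speedups, divide the baseline mean runtime by the candidate mean runtime. A small sketch (the function name `speedup` is just for illustration; the inputs are the mean runtimes reported in the logs above):

```rust
/// Ratio of baseline runtime to candidate runtime.
/// Values above 1.0 mean the candidate is faster.
pub fn speedup(baseline_secs: f64, candidate_secs: f64) -> f64 {
    baseline_secs / candidate_secs
}

/// fdeflate vs zlib-rs at inflate, using the mean runtimes from the logs.
pub fn inflate_speedups() -> (f64, f64) {
    let zen4 = speedup(0.020800, 0.018406); // zlib_rs / fdeflate on Zen4
    let m4 = speedup(0.028451, 0.022164); // zlib_rs / fdeflate on Apple M4
    (zen4, m4)
}
```

This works out to roughly 1.13x on Zen4 and 1.28x on the M4. The M4 inflate runs have a much larger standard deviation relative to the mean than the Zen4 runs, so the M4 figure should be treated as approximate.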

@Shnatsel
Member Author

Compression is not yet as optimized as decompression in fdeflate, so there's still low-hanging fruit. A two-line change in image-rs/fdeflate#74 takes compression performance at level 6 from 23MB/s to 27MB/s, which is double the miniz_oxide performance, and at level 9 from 18MB/s to 22MB/s.

Compression ratios on silesia are slightly inferior to other backends. Compression heuristics have been tuned for PNG data, and might have to be adjusted to better handle other kinds of inputs.

@Shnatsel
Member Author

I've tried a corpus other than Silesia and found that fdeflate beats zlib-rs at compression speed by nearly 50% at levels 6 and 9. This suggests that zlib-ng and by proxy zlib-rs may be overfitted to perform well on Silesia but not on other kinds of data.

However, at compression level 1 fdeflate performance collapses, dipping even below its own level 6 compression speed, suggesting that level 1 may be overfitted to PNG data and will require changes.

You can find all the details at image-rs/fdeflate#75

@fintelia
Contributor

> The flate2 wrapper needs to maintain an internal window buffer since fdeflate uses the output buffer as a lookback window for back-references, but flate2 passes fresh output slices on each call. This introduces an extra in-memory copy in the inflate path.

I think this is just fundamental to flate2's API. The copying needs to happen somewhere. Other backends just do it internally while fdeflate outsources the responsibility.

> Mid-stream flushing doesn't seem to be supported by fdeflate

Started working on support here: image-rs/fdeflate#72. The unit tests in flate2 also assume fixed/stored blocks so image-rs/fdeflate#73 will also be needed to get them to work.

> Dictionary functionality also seems to be missing

I think that miniz_oxide doesn't support this either. AFAIK dictionaries aren't really used much

@Shnatsel
Member Author

The apparent slowdown at compression level 1 on some corpora turned out to be entirely due to the jank in this PR and not fdeflate's fault.

With up to 80% of the computation time being wasted due to the aforementioned jank, a proper implementation should show much higher compression performance than the early benchmarks conducted on this PR.

@Shnatsel
Member Author

Looking at the inflate profile, both appear to compute checksums, so I don't think fdeflate is doing any less work here. It really does seem to be just faster than zlib-rs, which also mirrors our findings for the png crate.

@Shnatsel
Member Author

Flushing was added to fdeflate in image-rs/fdeflate#72

The remaining required changes to fdeflate are quite trivial.

This was only ever meant as a proof of concept, and the concept seems sufficiently proven, so I'll go ahead and close this.

@Shnatsel Shnatsel closed this Apr 18, 2026

2 participants