Refactor container compression so the zstd ignores the uncompressable WEBP and Lepton data #60
Merged
8 commits
- 9db650c v1 of refactor (mcroomp)
- 82c7b15 work and tests (mcroomp)
- f2394a6 update with claude help (mcroomp)
- e2332ad remove pdf parse (mcroomp)
- 18701bc fix formatting (mcroomp)
- 7800a3f update claude.md (mcroomp)
- cbdd8b6 refactored out container reader / writer (mcroomp)
- 0804476 Apply suggestion from @Copilot (mcroomp)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,73 @@ | ||
| # CLAUDE.md | ||
|
|
||
| This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository. | ||
|
|
||
| > **Important:** This file (and every sub-project `CLAUDE.md` file) is checked into the | ||
| > repository. Only include information that is valid for **any** developer or machine: | ||
| > project conventions, architecture, commands, constraints. **Do not** add machine-specific | ||
| > paths, personal tool preferences, local environment settings, or anything that would | ||
| > not apply to every contributor. | ||
|
|
||
| ## Commands | ||
|
|
||
| ```bash | ||
| # Build | ||
| cargo build --all | ||
| cargo build --release --all | ||
|
|
||
| # Test | ||
| cargo test --all | ||
| cargo test <test_name> # Run a single test by name | ||
| cargo test -- --nocapture # Show test output | ||
|
|
||
| # Lint and format | ||
| cargo fmt --check --all | ||
| cargo clippy | ||
| ``` | ||
|
|
||
| The CI runs on `windows-latest` and builds for multiple targets: `wasm32-wasip1`, `aarch64-unknown-linux-musl`, `x86_64-pc-windows-msvc`, `x86_64-unknown-linux-gnu`. | ||
|
|
||
| The release build uses Spectre mitigations (`/Qspectre /sdl`) and produces `preflate_rs_0_7.dll` and `preflate_util.exe`. | ||
|
|
||
| ## Architecture | ||
|
|
||
| **preflate-rs** analyzes DEFLATE-compressed streams, extracts the uncompressed data plus a compact set of reconstruction parameters, and later recreates the exact original DEFLATE bitstream. This enables re-compression with modern algorithms (Zstd, Brotli) while preserving binary-exact round-trip fidelity. The key insight is detecting which compressor (zlib, libdeflate, zlib-ng, miniz, Windows zlib) produced a stream and storing only the differences from what that compressor would predict. | ||
|
|
||
| ### Workspace layout | ||
|
|
||
| | Crate | Output | Role | | ||
| |---|---|---| | ||
| | `preflate/` | library | Core DEFLATE analysis and reconstruction | | ||
| | `container/` | library | Scans binary files (ZIP, PNG, JPEG) for DEFLATE streams | | ||
| | `util/` | `preflate_util.exe` | CLI for testing on files/directories | | ||
| | `dll/` | `preflate_rs_0_7.dll` | C FFI wrapper for .NET interop | | ||
| | `fuzz/` | fuzz harnesses | libfuzzer targets | | ||
| | `tests/` | integration tests | End-to-end round-trip tests using `samples/` | | ||
|
|
||
| ### preflate crate (core) | ||
|
|
||
| The processing pipeline in `preflate/src/stream_processor.rs`: | ||
| 1. **`deflate/`** — Reads a DEFLATE bitstream into tokens (literals and length/distance back-references) and writes tokens back to DEFLATE with custom Huffman trees. | ||
| 2. **`estimator/`** — Estimates the compressor's parameters (`TokenPredictorParameters`): hash algorithm, `nice_length`, `max_chain`, window bits, add policy, matching type. | ||
| 3. **`token_predictor.rs`** — Replays the compression using estimated parameters and hash chains to predict what tokens the original compressor would have produced. | ||
| 4. **`tree_predictor.rs`** — Predicts Huffman tree structure. | ||
| 5. **`statistical_codec.rs` / `cabac_codec.rs`** — Encodes the *differences* from prediction using CABAC (Context Adaptive Binary Arithmetic Coding, shared with Lepton JPEG). | ||
| 6. **`stream_processor.rs`** — Public API: `PreflateStreamProcessor::decompress()` and `RecreateStreamProcessor::recreate()`. | ||
|
|
||
| Parameters are serialized via `bitcode`; corrections via CABAC. The format is chunked to bound memory use. | ||
|
|
||
| ### container crate | ||
|
|
||
| - **`scan_deflate.rs`** — Scans raw bytes to locate DEFLATE stream boundaries, identifying stream type (raw deflate, zlib-wrapped, PNG IDAT, ZIP, JPEG, etc.). | ||
| - **`idat_parse.rs`** — Extracts and reassembles PNG IDAT chunks. | ||
| - **`container_processor.rs`** — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress). Zstd encode/decode is handled inline using a single persistent encoder. | ||
| - **`utils.rs`** — `process_limited_buffer()` and test helpers. | ||
| - **`scoped_read.rs`** — Bounded reader adapter. | ||
|
|
||
| The optional `webp` feature (enabled by default) allows PNG images to be stored as lossless WebP instead of as raw pixel data. PDF streams are not scanned (`pdf_parse` was removed). | ||
|
|
||
| ### Code constraints | ||
|
|
||
| - **No unsafe code** — enforced via `#![forbid(unsafe_code)]` in each crate. | ||
| - Minimum Rust version: **1.85**, Edition **2024**. | ||
| - `.cargo/config.toml` sets Windows MSVC linker flags (`/DYNAMICBASE`, `/CETCOMPAT`, `/guard:cf`). |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,203 @@ | ||
| # container (preflate-container) | ||
|
|
||
| Scans binary files (ZIP, PNG, JPEG) for DEFLATE streams, orchestrates the | ||
| preflate + Zstd pipeline, and reassembles the output. Only format version 2 exists | ||
| (v1 was removed). | ||
|
|
||
| ## Public API (`lib.rs`) | ||
|
|
||
| ```rust | ||
| // Compress a file/buffer containing embedded DEFLATE streams | ||
| PreflateContainerProcessor::new(config: &PreflateContainerConfig, level: i32, test_baseline: bool) -> Self | ||
| impl ProcessBuffer for PreflateContainerProcessor { ... } | ||
|
|
||
| // Decompress a preflate container back to the original file | ||
| RecreateContainerProcessor::new(capacity: usize) -> Self | ||
| impl ProcessBuffer for RecreateContainerProcessor { ... } | ||
|
|
||
| // Core trait — both processors implement this | ||
| pub trait ProcessBuffer { | ||
| fn process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>; | ||
| fn stats(&self) -> PreflateStats { PreflateStats::default() } // default no-op; overridden by Compress | ||
| fn copy_to_end(&mut self, input: &mut impl BufRead, output: &mut impl Write) -> Result<()>; | ||
| fn copy_to_end_size(&mut self, input: &mut impl BufRead, output: &mut impl Write, chunk: usize) -> Result<()>; | ||
| } | ||
|
|
||
| // DLL helper: writes to a fixed output buffer, spills overflow into a VecDeque | ||
| fn process_limited_buffer( | ||
| process: &mut impl ProcessBuffer, | ||
| input: &[u8], | ||
| input_complete: bool, | ||
| output_buffer: &mut [u8], | ||
| output_extra: &mut VecDeque<u8>, | ||
| ) -> Result<(bool, usize)>; // (all_output_drained, bytes_written_to_output_buffer) | ||
| ``` | ||
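The output contract of `process_limited_buffer` (fill a fixed buffer, park the overflow in a `VecDeque`, report whether everything was drained) can be modelled with the standard library alone. The sketch below is not the crate's implementation: `drain_limited` is a hypothetical helper, `produced` stands in for whatever the processor wrote during the call, and the assumption that older spilled bytes are emitted before newer ones is mine.

```rust
use std::collections::VecDeque;

/// Hypothetical model of the output side of `process_limited_buffer`.
/// Returns (all_output_drained, bytes_written_to_output_buffer).
fn drain_limited(
    produced: &[u8],
    output_buffer: &mut [u8],
    output_extra: &mut VecDeque<u8>,
) -> (bool, usize) {
    // Queue new output behind anything left over from earlier calls,
    // so bytes always leave in the order they were produced.
    output_extra.extend(produced.iter().copied());

    // Copy as much as fits into the caller's fixed-size buffer.
    let n = output_extra.len().min(output_buffer.len());
    for (dst, src) in output_buffer.iter_mut().zip(output_extra.drain(..n)) {
        *dst = src;
    }

    (output_extra.is_empty(), n)
}
```

A caller in the DLL pattern would keep calling with empty input until the first element of the returned tuple is `true`.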
|
|
||
| `PreflateContainerConfig` holds knobs: `min_chunk_size`, `max_chunk_size`, | ||
| `total_plain_text_limit`, `chunk_plain_text_limit`, `validate_compression`, `max_chain_length`. | ||
|
|
||
| ## Wire Format (v2 only) | ||
|
|
||
| ### Outer framing (always raw / uncompressed) | ||
|
|
||
| ``` | ||
| [0x02] ← COMPRESSED_WRAPPER_VERSION_2 (1 byte, raw) | ||
|
|
||
| Repeat for each block: | ||
| [type] ← block type byte (1 byte, raw) — see bit-field below | ||
| [varint(content_len)] ← byte count of what follows (1–5 bytes, raw) | ||
| [content_bytes × content_len] ← meaning depends on type (see below) | ||
| ``` | ||
|
|
||
| All framing bytes (`type`, `varint`) are written directly to the output stream — | ||
| they are **never** inside the Zstd encoder. | ||
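The outer framing described above can be sketched with the standard library only. The varint encoding is an assumption: the document says only that it takes 1 to 5 bytes, which is consistent with a LEB128-style encoding of a 32-bit length (7 data bits per byte, high bit set on every byte except the last). The names `write_varint` and `read_varint` appear in the module layout, but the signatures here, along with `write_block` and `read_block`, are hypothetical.

```rust
use std::io::{self, Read, Write};

/// Assumed encoding: little-endian base-128, so a u32 takes 1 to 5 bytes.
fn write_varint(out: &mut impl Write, mut value: u32) -> io::Result<()> {
    loop {
        let byte = (value & 0x7F) as u8;
        value >>= 7;
        if value == 0 {
            return out.write_all(&[byte]);
        }
        out.write_all(&[byte | 0x80])?; // more bytes follow
    }
}

fn read_varint(input: &mut impl Read) -> io::Result<u32> {
    let mut value = 0u32;
    for shift in (0..35).step_by(7) {
        let mut byte = [0u8; 1];
        input.read_exact(&mut byte)?;
        value |= u32::from(byte[0] & 0x7F) << shift;
        if byte[0] & 0x80 == 0 {
            return Ok(value);
        }
    }
    Err(io::Error::new(io::ErrorKind::InvalidData, "varint longer than 5 bytes"))
}

/// Writes one outer frame: type byte, varint length, then the content bytes.
fn write_block(out: &mut impl Write, block_type: u8, content: &[u8]) -> io::Result<()> {
    let len = u32::try_from(content.len())
        .map_err(|_| io::Error::new(io::ErrorKind::InvalidInput, "block too large"))?;
    out.write_all(&[block_type])?;
    write_varint(out, len)?;
    out.write_all(content)
}

/// Reads one outer frame, or returns None at EOF (there is no terminator block).
fn read_block(input: &mut impl Read) -> io::Result<Option<(u8, Vec<u8>)>> {
    let mut block_type = [0u8; 1];
    if input.read(&mut block_type)? == 0 {
        return Ok(None);
    }
    let len = read_varint(input)? as usize;
    let mut content = vec![0u8; len];
    input.read_exact(&mut content)?;
    Ok(Some((block_type[0], content)))
}
```

Because the type byte and length are always raw, a reader can walk the block sequence and skip content it does not understand without touching the Zstd decoder.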
|
|
||
| ### Block type byte bit-field | ||
|
|
||
| Each block type byte encodes two fields: | ||
|
|
||
| ``` | ||
| Bit 7-6 BLOCK_COMPRESSION_* 00 = none/raw 01 = Zstd 10-11 = reserved | ||
| Bit 5-0 BLOCK_TYPE_* block content kind (0–63) | ||
| ``` | ||
|
|
||
| Mask constants (defined in `container_processor.rs`): | ||
|
|
||
| | Constant | Value | Meaning | | ||
| |---|---|---| | ||
| | `BLOCK_COMPRESSION_MASK` | `0xC0` | extracts bits 7–6 | | ||
| | `BLOCK_TYPE_MASK` | `0x3F` | extracts bits 5–0 | | ||
| | `BLOCK_COMPRESSION_NONE` | `0x00` | content is raw (not Zstd) | | ||
| | `BLOCK_COMPRESSION_ZSTD` | `0x40` | content is a Zstd flush segment | | ||
|
|
||
| ### Block content kinds and combined wire values | ||
|
|
||
| | `BLOCK_TYPE_*` | Value | Combined wire byte | Description | | ||
| |---|---|---|---| | ||
| | `BLOCK_TYPE_LITERAL` | `0x00` | `0x40` | Raw input bytes with no detectable DEFLATE stream | | ||
| | `BLOCK_TYPE_DEFLATE` | `0x01` | `0x41` | A raw/zlib DEFLATE stream (start of a new stream) | | ||
| | `BLOCK_TYPE_PNG` | `0x02` | `0x42` | A PNG IDAT stream stored without WebP | | ||
| | `BLOCK_TYPE_DEFLATE_CONTINUE` | `0x03` | `0x43` | Continuation of a DEFLATE stream that spanned a chunk boundary | | ||
| | `BLOCK_TYPE_JPEG_LEPTON` | `0x04` | `0x04` | JPEG re-compressed with Lepton; bypasses Zstd entirely | | ||
| | `BLOCK_TYPE_WEBP` | `0x05` | `0x05` | PNG image stored as WebP lossless; bypasses Zstd entirely | | ||
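The relationship between the mask constants and the combined wire values in the two tables can be checked with a few lines of Rust. The constant names and values are taken from the tables; their `u8` type and the helpers `split_type_byte` and `wire_byte` are assumptions made for illustration.

```rust
const BLOCK_COMPRESSION_MASK: u8 = 0xC0;
const BLOCK_TYPE_MASK: u8 = 0x3F;
const BLOCK_COMPRESSION_NONE: u8 = 0x00;
const BLOCK_COMPRESSION_ZSTD: u8 = 0x40;

const BLOCK_TYPE_LITERAL: u8 = 0x00;
const BLOCK_TYPE_DEFLATE: u8 = 0x01;
const BLOCK_TYPE_PNG: u8 = 0x02;
const BLOCK_TYPE_DEFLATE_CONTINUE: u8 = 0x03;
const BLOCK_TYPE_JPEG_LEPTON: u8 = 0x04;
const BLOCK_TYPE_WEBP: u8 = 0x05;

/// Splits a wire byte into (compression bits, content kind).
fn split_type_byte(wire: u8) -> (u8, u8) {
    (wire & BLOCK_COMPRESSION_MASK, wire & BLOCK_TYPE_MASK)
}

/// Builds the wire byte for a content kind, following the table:
/// Lepton and WebP payloads are stored raw, everything else goes through Zstd.
fn wire_byte(kind: u8) -> u8 {
    let compression = match kind {
        BLOCK_TYPE_JPEG_LEPTON | BLOCK_TYPE_WEBP => BLOCK_COMPRESSION_NONE,
        _ => BLOCK_COMPRESSION_ZSTD,
    };
    compression | (kind & BLOCK_TYPE_MASK)
}
```

A decoder would typically branch on the compression bits first (to decide whether the content passes through the Zstd decoder) and only then on the content kind.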
|
|
||
| ### Zstd encoder/decoder lifecycle | ||
|
|
||
| - A **single persistent `zstd::stream::write::Encoder`** is created once and shared across | ||
| all Zstd-compressed blocks (compression bits `0x40`). | ||
| - After writing each block's inner payload into the encoder, `encoder.flush()` is called, | ||
| which emits a Zstd `ZSTD_e_flush` segment. Those bytes are what get stored as | ||
| `content_bytes` in the outer framing. | ||
| - Each flush segment is decodable in sequence: the decoder is a persistent | ||
| `zstd::stream::raw::Decoder` that maintains cross-block history, so compression | ||
| quality benefits from all previously seen blocks. | ||
| - The stream is terminated by EOF — there is no explicit end-of-stream block. | ||
|
|
||
| ### Inner payload layout (inside Zstd, after decompression) | ||
|
|
||
| **`BLOCK_TYPE_LITERAL` (wire `0x40`)** | ||
| ``` | ||
| varint(data_len) | ||
| data[data_len] ← verbatim bytes from the original input | ||
| ``` | ||
|
|
||
| **`BLOCK_TYPE_DEFLATE` (wire `0x41`) and `BLOCK_TYPE_DEFLATE_CONTINUE` (wire `0x43`)** | ||
| ``` | ||
| varint(corrections_len) | ||
| varint(plaintext_len) | ||
| corrections[corrections_len] ← CABAC-encoded differences from predicted tokens | ||
| plaintext[plaintext_len] ← uncompressed data | ||
| ``` | ||
| `BLOCK_TYPE_DEFLATE_CONTINUE` has the same layout; the decoder reuses the | ||
| `RecreateStreamProcessor` state from the preceding `BLOCK_TYPE_DEFLATE` block. | ||
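As an illustration of the `BLOCK_TYPE_DEFLATE` layout, the following sketch splits an already Zstd-decompressed payload into its corrections and plaintext slices. It assumes a LEB128-style varint (the document does not specify the encoding), and `take_varint`, `DeflatePayload`, and `parse_deflate_payload` are hypothetical names rather than items from the crate.

```rust
/// Assumed LEB128-style varint; returns (value, bytes consumed).
fn take_varint(buf: &[u8]) -> Option<(usize, usize)> {
    let mut value = 0usize;
    for (i, &b) in buf.iter().enumerate().take(5) {
        value |= usize::from(b & 0x7F) << (7 * i);
        if b & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None // truncated or longer than 5 bytes
}

/// Hypothetical borrowed view over a DEFLATE / DEFLATE_CONTINUE payload.
struct DeflatePayload<'a> {
    corrections: &'a [u8],
    plaintext: &'a [u8],
}

fn parse_deflate_payload(payload: &[u8]) -> Option<DeflatePayload<'_>> {
    // Both lengths come first, then the two byte runs back to back.
    let (corrections_len, n1) = take_varint(payload)?;
    let (plaintext_len, n2) = take_varint(&payload[n1..])?;
    let body = &payload[n1 + n2..];

    let end = corrections_len.checked_add(plaintext_len)?;
    let corrections = body.get(..corrections_len)?;
    let plaintext = body.get(corrections_len..end)?;
    Some(DeflatePayload { corrections, plaintext })
}
```

Using `get` instead of direct slicing means a payload whose declared lengths exceed the available bytes yields `None` rather than a panic.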
|
|
||
| **`BLOCK_TYPE_PNG` (wire `0x42`) — non-WebP path** | ||
| ``` | ||
| varint(corrections_len) | ||
| varint(plaintext_len) | ||
| IdatContents metadata: | ||
| varint(chunk_size_1) … varint(chunk_size_N) varint(0) ← IDAT chunk size list (0-terminated) | ||
| zlib_header[2] | ||
| adler32[4] | ||
| 0xFF ← sentinel: no png_header present | ||
| corrections[corrections_len] | ||
| plaintext[plaintext_len] ← raw unfiltered pixel data | ||
| ``` | ||
|
|
||
| ### Raw block payload layout (outside Zstd) | ||
|
|
||
| **`BLOCK_TYPE_JPEG_LEPTON` (wire `0x04`)** | ||
| ``` | ||
| lepton_bytes[content_len] ← Lepton-compressed JPEG; decoded by lepton_jpeg::decode_lepton() | ||
| ``` | ||
|
|
||
| **`BLOCK_TYPE_WEBP` (wire `0x05`)** | ||
| ``` | ||
| varint(corrections_len) | ||
| varint(webp_data_len) | ||
| IdatContents metadata: | ||
| varint(chunk_size_1) … varint(chunk_size_N) varint(0) | ||
| zlib_header[2] | ||
| adler32[4] | ||
| color_type[1] ← PngColorType (RGB=2, RGBA=6) | ||
| varint(width) | ||
| varint(height) | ||
| filters[height] ← PNG row filter bytes (one per row) | ||
| corrections[corrections_len] | ||
| webp_data[webp_data_len] ← WebP lossless encoded pixel data | ||
| ``` | ||
| On decode, the WebP bytes are decompressed back to pixels, PNG filters are re-applied, | ||
| and the result is re-deflated using the corrections to recreate the original IDAT stream. | ||
|
|
||
| ## Idempotent Finalization (important bug history) | ||
|
|
||
| `process_buffer` may be called with `input_complete=true` multiple times (DLL pattern). | ||
| The finalization block must guard against double-finalization: | ||
|
|
||
| ```rust | ||
| if input_complete && !self.input_complete { // NOT just `if input_complete` | ||
| self.input_complete = true; | ||
| // ... encoder.take().unwrap() | ||
| } | ||
| ``` | ||
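To see why the guard matters, consider a reduced model in which a `Vec<u8>` stands in for the Zstd encoder. Everything in this sketch (`Finalizer`, its fields, and the simplified `process_buffer` signature) is hypothetical; only the guard condition mirrors the snippet above. Without the `!self.input_complete` check, the second call would hit `take().unwrap()` on a `None` and panic.

```rust
/// Hypothetical stand-in for a processor that owns an encoder until finalization.
struct Finalizer {
    input_complete: bool,
    encoder: Option<Vec<u8>>, // stand-in for the real Zstd encoder
    finish_count: usize,
}

impl Finalizer {
    fn new() -> Self {
        Finalizer { input_complete: false, encoder: Some(Vec::new()), finish_count: 0 }
    }

    fn process_buffer(&mut self, input: &[u8], input_complete: bool, out: &mut Vec<u8>) {
        // Input arriving after finalization is ignored in this model.
        if let Some(enc) = self.encoder.as_mut() {
            enc.extend_from_slice(input);
        }

        if input_complete && !self.input_complete {
            self.input_complete = true;
            // Safe to unwrap: the guard ensures this branch runs at most once.
            let enc = self.encoder.take().unwrap();
            out.extend_from_slice(&enc);
            self.finish_count += 1;
        }
    }
}
```

Calling `process_buffer(&[], true, &mut out)` repeatedly after the first finalization then leaves `out` unchanged, which is the behaviour the DLL pattern relies on.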
|
|
||
| ## Module Layout | ||
|
|
||
| ``` | ||
| src/ | ||
| lib.rs ← public types and re-exports | ||
| container_processor.rs ← PreflateContainerProcessor, RecreateContainerProcessor, | ||
| ProcessBuffer trait, MeasureWriteSink, | ||
| block-type constants, emit_compressed_block(), | ||
| write_chunk_block_v2(), write_varint(), read_varint() | ||
| scan_deflate.rs ← locates DEFLATE stream boundaries in raw bytes | ||
| identifies: raw deflate, zlib-wrapped, PNG IDAT, ZIP, JPEG | ||
| idat_parse.rs ← extracts / reassembles PNG IDAT chunks; parses IHDR | ||
| scoped_read.rs ← bounded reader adapter | ||
| utils.rs ← process_limited_buffer(), TakeReader, test helpers | ||
| ``` | ||
|
|
||
| ## Key Internal Types | ||
|
|
||
| | Type | Purpose | | ||
| |---|---| | ||
| | `MeasureWriteSink` | `pub(crate)` sink that counts bytes; used for baseline Zstd measurement | | ||
| | `PreflateStats` | pub struct: `deflate_compressed_size`, `zstd_compressed_size`, `uncompressed_size`, `overhead_bytes`, `hash_algorithm`, `zstd_baseline_size` | | ||
| | `TakeReader<T>` | `pub` BufRead wrapper that reads at most N bytes (used in utils.rs) | | ||
|
|
||
| ## Features | ||
|
|
||
| - `webp` (default-enabled) — allows PNG images to be stored as WebP instead of lossless PNG, | ||
| using the `webp` crate. | ||
|
|
||
| ## Dependencies of Note | ||
|
|
||
| - `lepton_jpeg` (0.5.1) — JPEG blocks are recompressed with Lepton, bypassing Zstd entirely. | ||
| - `zstd` (0.13) — single persistent encoder across all non-JPEG/WebP blocks. | ||
| - `preflate-rs` — core analysis/reconstruction (path dependency). | ||
| - `webp` (0.3, optional, default-enabled) — PNG images can be stored as WebP lossless. | ||
|
|
||
| ## Constraints | ||
|
|
||
| - `#![forbid(unsafe_code)]` enforced. | ||
| - `main.rs` exists but is a stub; this crate is a library. | ||
"addler32" is a misspelling of "adler32" (the Adler-32 checksum algorithm used in zlib). This appears in two places in the wire format documentation.