Merged
6 changes: 6 additions & 0 deletions .gitignore
@@ -8,3 +8,9 @@ target/

# MSVC Windows builds of rustc generate these, which store debugging information
*.pdb

# Local Claude Code settings (machine-specific)
.claude/

# Unreferenced / scratch files
.unref/
73 changes: 73 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,73 @@
# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

> **Important:** This file and all sub-project `CLAUDE.md` files are checked into the
> repository. Only include information that is valid for **any** developer or machine:
> project conventions, architecture, commands, constraints. **Do not** add machine-specific
> paths, personal tool preferences, local environment settings, or anything that would
> not apply to every contributor.

## Commands

```bash
# Build
cargo build --all
cargo build --release --all

# Test
cargo test --all
cargo test <test_name> # Run a single test by name
cargo test -- --nocapture # Show test output

# Lint and format
cargo fmt --check --all
cargo clippy
```

The CI runs on `windows-latest` and builds for multiple targets: `wasm32-wasip1`, `aarch64-unknown-linux-musl`, `x86_64-pc-windows-msvc`, `x86_64-unknown-linux-gnu`.

The release build uses Spectre mitigations (`/Qspectre /sdl`) and produces `preflate_rs_0_7.dll` and `preflate_util.exe`.

## Architecture

**preflate-rs** analyzes DEFLATE-compressed streams, extracts the uncompressed data plus a compact set of reconstruction parameters, and later recreates the exact original DEFLATE bitstream. This enables re-compression with modern algorithms (Zstd, Brotli) while preserving binary-exact round-trip fidelity. The key insight is detecting which compressor (zlib, libdeflate, zlib-ng, miniz, Windows zlib) produced a stream and storing only the differences from what that compressor would predict.

### Workspace layout

| Crate | Output | Role |
|---|---|---|
| `preflate/` | library | Core DEFLATE analysis and reconstruction |
| `container/` | library | Scans binary files (ZIP, PNG, JPEG) for DEFLATE streams |
| `util/` | `preflate_util.exe` | CLI for testing on files/directories |
| `dll/` | `preflate_rs_0_7.dll` | C FFI wrapper for .NET interop |
| `fuzz/` | fuzz harnesses | libFuzzer targets |
| `tests/` | integration tests | End-to-end round-trip tests using `samples/` |

### preflate crate (core)

The processing pipeline in `preflate/src/stream_processor.rs`:
1. **`deflate/`** — Reads a DEFLATE bitstream into tokens (literals and length/distance back-references) and writes tokens back to DEFLATE with custom Huffman trees.
2. **`estimator/`** — Estimates the compressor's parameters (`TokenPredictorParameters`): hash algorithm, `nice_length`, `max_chain`, window bits, add policy, matching type.
3. **`token_predictor.rs`** — Replays the compression using estimated parameters and hash chains to predict what tokens the original compressor would have produced.
4. **`tree_predictor.rs`** — Predicts Huffman tree structure.
5. **`statistical_codec.rs` / `cabac_codec.rs`** — Encodes the *differences* from prediction using CABAC (Context Adaptive Binary Arithmetic Coding, shared with Lepton JPEG).
6. **`stream_processor.rs`** — Public API: `PreflateStreamProcessor::decompress()` and `RecreateStreamProcessor::recreate()`.

Parameters are serialized via `bitcode`; corrections via CABAC. The format is chunked to bound memory use.
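The "store only the differences from prediction" idea can be illustrated with a deliberately simplified toy. This is a sketch under loud assumptions: tokens are modelled as plain `u16` values, both streams have the same length, and corrections are kept as an uncompressed list of `(index, value)` pairs. The real crate predicts literal/back-reference tokens and encodes corrections with CABAC; the function names below are hypothetical and do not exist in the crate.

```rust
/// Toy model (hypothetical): records (index, actual_token) wherever the
/// predictor guessed wrong. A perfect prediction yields an empty list.
pub fn encode_corrections(actual: &[u16], predicted: &[u16]) -> Vec<(usize, u16)> {
    assert_eq!(
        actual.len(),
        predicted.len(),
        "toy model: equal-length token streams only"
    );
    actual
        .iter()
        .zip(predicted)
        .enumerate()
        .filter(|(_, (a, p))| a != p)
        .map(|(i, (a, _))| (i, *a))
        .collect()
}

/// Toy model (hypothetical): re-runs the "prediction" and patches it with
/// the stored corrections to recover the original tokens exactly.
pub fn apply_corrections(predicted: &[u16], corrections: &[(usize, u16)]) -> Vec<u16> {
    let mut out = predicted.to_vec();
    for &(index, value) in corrections {
        out[index] = value;
    }
    out
}
```

The closer the estimated parameters are to the original compressor's, the shorter the correction list, which is why the estimator step matters for the final size.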

### container crate

- **`scan_deflate.rs`** — Scans raw bytes to locate DEFLATE stream boundaries, identifying stream type (raw deflate, zlib-wrapped, PNG IDAT, ZIP, JPEG, etc.).
- **`idat_parse.rs`** — Extracts and reassembles PNG IDAT chunks.
- **`container_processor.rs`** — Orchestrates scanning → preflate → Zstd (compress) and Zstd → recreate → reassembly (decompress). Zstd encode/decode is handled inline using a single persistent encoder.
- **`utils.rs`** — `process_limited_buffer()` and test helpers.
- **`scoped_read.rs`** — Bounded reader adapter.

The optional `webp` feature (enabled by default) allows PNG images to be stored as lossless WebP instead of as a preflate-processed IDAT stream. PDF streams are not scanned (`pdf_parse` was removed).

### Code constraints

- **No unsafe code** — enforced via `#![forbid(unsafe_code)]` in each crate.
- Minimum Rust version: **1.85**, Edition **2024**.
- `.cargo/config.toml` sets Windows MSVC linker flags (`/DYNAMICBASE`, `/CETCOMPAT`, `/guard:cf`).
10 changes: 5 additions & 5 deletions Cargo.lock

16 changes: 12 additions & 4 deletions Cargo.toml
@@ -1,11 +1,11 @@
# root project only exists to refer to the packages
- # and run the end-to-end tests in the tests directory
+ # and run the end-to-end tests in the tests directory

[package]
name = "preflate-rs-root"
- version = "0.0.0"
- edition = "2024"
- rust-version = "1.85"
+ version.workspace = true
+ edition.workspace = true
+ rust-version.workspace = true

[profile.release]
debug = true
@@ -14,6 +14,14 @@ debug = true
members = ["preflate", "container", "dll", "util", "fuzz"]
resolver = "2"

[workspace.package]
version = "0.7.6"
edition = "2024"
authors = ["Kristof Roomp <kristofr@microsoft.com>"]
license = "Apache-2.0"
rust-version = "1.85"
repository = "https://github.com/microsoft/preflate-rs"

[dev-dependencies]
preflate-rs = { path = "preflate" }
preflate-container = { path = "container" }
203 changes: 203 additions & 0 deletions container/CLAUDE.md
@@ -0,0 +1,203 @@
# container (preflate-container)

Scans binary files (ZIP, PNG, JPEG) for DEFLATE streams, orchestrates the
preflate + Zstd pipeline, and reassembles the output. Only format version 2 exists
(v1 was removed).

## Public API (`lib.rs`)

```rust
// Compress a file/buffer containing embedded DEFLATE streams
PreflateContainerProcessor::new(config: &PreflateContainerConfig, level: i32, test_baseline: bool) -> Self
impl ProcessBuffer for PreflateContainerProcessor { ... }

// Decompress a preflate container back to the original file
RecreateContainerProcessor::new(capacity: usize) -> Self
impl ProcessBuffer for RecreateContainerProcessor { ... }

// Core trait — both processors implement this
pub trait ProcessBuffer {
    fn process_buffer(&mut self, input: &[u8], input_complete: bool, writer: &mut impl Write) -> Result<()>;
    fn stats(&self) -> PreflateStats { PreflateStats::default() } // default no-op; overridden by Compress
    fn copy_to_end(&mut self, input: &mut impl BufRead, output: &mut impl Write) -> Result<()>;
    fn copy_to_end_size(&mut self, input: &mut impl BufRead, output: &mut impl Write, chunk: usize) -> Result<()>;
}

// DLL helper: writes to a fixed output buffer, spills overflow into a VecDeque
fn process_limited_buffer(
    process: &mut impl ProcessBuffer,
    input: &[u8],
    input_complete: bool,
    output_buffer: &mut [u8],
    output_extra: &mut VecDeque<u8>,
) -> Result<(bool, usize)>; // (all_output_drained, bytes_written_to_output_buffer)
```

`PreflateContainerConfig` holds knobs: `min_chunk_size`, `max_chunk_size`,
`total_plain_text_limit`, `chunk_plain_text_limit`, `validate_compression`, `max_chain_length`.
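
The spill behaviour described for `process_limited_buffer` in the API listing above (fill a fixed output buffer, keep the overflow in a `VecDeque`) could look roughly like the sketch below. This is an assumption about the shape of the logic, not the crate's implementation: `write_limited` is a hypothetical helper, and `produced` stands in for whatever bytes the processor emitted during the call.

```rust
use std::collections::VecDeque;

/// Hypothetical stand-in for the spill logic. Returns
/// (all_output_drained, bytes_written_to_output_buffer), mirroring the
/// tuple documented for `process_limited_buffer`.
pub fn write_limited(
    produced: &[u8],
    output_buffer: &mut [u8],
    output_extra: &mut VecDeque<u8>,
) -> (bool, usize) {
    // Queue new bytes behind any leftovers from earlier calls so that
    // output order is preserved across calls.
    output_extra.extend(produced.iter().copied());

    // Copy as much as fits into the caller's fixed-size buffer.
    let n = output_extra.len().min(output_buffer.len());
    for (slot, byte) in output_buffer[..n].iter_mut().zip(output_extra.drain(..n)) {
        *slot = byte;
    }

    // Anything still queued must be fetched by a later call.
    (output_extra.is_empty(), n)
}
```

A caller with a small buffer would keep calling (with empty input once the source is exhausted) until the first tuple element is `true`.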

## Wire Format (v2 only)

### Outer framing (always raw / uncompressed)

```
[0x02] ← COMPRESSED_WRAPPER_VERSION_2 (1 byte, raw)

Repeat for each block:
[type] ← block type byte (1 byte, raw) — see bit-field below
[varint(content_len)] ← byte count of what follows (1–5 bytes, raw)
[content_bytes × content_len] ← meaning depends on type (see below)
```

All framing bytes (`type`, `varint`) are written directly to the output stream —
they are **never** inside the Zstd encoder.
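
To make the outer framing concrete, here is a minimal sketch of a reader that splits a container into `(type, content)` pairs. It rests on an assumption the text does not confirm: that `varint` is a little-endian base-128 (LEB128-style) encoding of a `u32`, which would match the stated 1–5 byte range. The `write_varint`, `read_varint`, and `split_blocks` functions below are illustrative stand-ins, not the functions of the same name in `container_processor.rs`.

```rust
/// ASSUMED encoding: 7 payload bits per byte, low group first,
/// high bit set on every byte except the last.
pub fn write_varint(out: &mut Vec<u8>, mut value: u32) {
    while value >= 0x80 {
        out.push(((value & 0x7F) as u8) | 0x80);
        value >>= 7;
    }
    out.push(value as u8);
}

/// Returns (value, bytes_consumed), or None if the input is truncated
/// or the varint is longer than 5 bytes.
pub fn read_varint(input: &[u8]) -> Option<(u32, usize)> {
    let mut value: u32 = 0;
    for (i, &byte) in input.iter().enumerate().take(5) {
        value |= u32::from(byte & 0x7F) << (7 * i);
        if byte & 0x80 == 0 {
            return Some((value, i + 1));
        }
    }
    None
}

/// Splits a v2 container into (block type byte, content bytes) pairs.
/// The stream simply ends at EOF; there is no terminator block.
pub fn split_blocks(container: &[u8]) -> Option<Vec<(u8, &[u8])>> {
    let (&version, mut rest) = container.split_first()?;
    if version != 0x02 {
        return None; // only COMPRESSED_WRAPPER_VERSION_2 is handled
    }
    let mut blocks = Vec::new();
    while let Some((&block_type, after_type)) = rest.split_first() {
        let (len, used) = read_varint(after_type)?;
        let body = &after_type[used..];
        let len = len as usize;
        if body.len() < len {
            return None; // truncated block
        }
        let (content, tail) = body.split_at(len);
        blocks.push((block_type, content));
        rest = tail;
    }
    Some(blocks)
}
```

Because the framing is always raw, a reader like this can walk block boundaries without touching a Zstd decoder; only the `content` slices of Zstd-flagged blocks need to be fed to it.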

### Block type byte bit-field

Each block type byte encodes two fields:

```
Bit 7-6 BLOCK_COMPRESSION_* 00 = none/raw 01 = Zstd 10-11 = reserved
Bit 5-0 BLOCK_TYPE_* block content kind (0–63)
```

Mask constants (defined in `container_processor.rs`):

| Constant | Value | Meaning |
|---|---|---|
| `BLOCK_COMPRESSION_MASK` | `0xC0` | extracts bits 7–6 |
| `BLOCK_TYPE_MASK` | `0x3F` | extracts bits 5–0 |
| `BLOCK_COMPRESSION_NONE` | `0x00` | content is raw (not Zstd) |
| `BLOCK_COMPRESSION_ZSTD` | `0x40` | content is a Zstd flush segment |

### Block content kinds and combined wire values

| `BLOCK_TYPE_*` | Value | Combined wire byte | Description |
|---|---|---|---|
| `BLOCK_TYPE_LITERAL` | `0x00` | `0x40` | Raw input bytes with no detectable DEFLATE stream |
| `BLOCK_TYPE_DEFLATE` | `0x01` | `0x41` | A raw/zlib DEFLATE stream (start of a new stream) |
| `BLOCK_TYPE_PNG` | `0x02` | `0x42` | A PNG IDAT stream stored without WebP |
| `BLOCK_TYPE_DEFLATE_CONTINUE` | `0x03` | `0x43` | Continuation of a DEFLATE stream that spanned a chunk boundary |
| `BLOCK_TYPE_JPEG_LEPTON` | `0x04` | `0x04` | JPEG re-compressed with Lepton; bypasses Zstd entirely |
| `BLOCK_TYPE_WEBP` | `0x05` | `0x05` | PNG image stored as WebP lossless; bypasses Zstd entirely |
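
The two tables above can be exercised with a small sketch that splits a wire byte into its compression bits and content kind. The constant values are copied from the tables; the `Compression` enum and both helper functions are hypothetical names introduced for illustration only.

```rust
// Values copied from the mask table above.
const BLOCK_COMPRESSION_MASK: u8 = 0xC0;
const BLOCK_TYPE_MASK: u8 = 0x3F;
const BLOCK_COMPRESSION_NONE: u8 = 0x00;
const BLOCK_COMPRESSION_ZSTD: u8 = 0x40;

/// Hypothetical enum for illustration; not a type from the crate.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Compression {
    None,
    Zstd,
    Reserved,
}

/// Splits a wire byte into (compression, block content kind).
pub fn split_block_byte(wire: u8) -> (Compression, u8) {
    let compression = match wire & BLOCK_COMPRESSION_MASK {
        BLOCK_COMPRESSION_NONE => Compression::None,
        BLOCK_COMPRESSION_ZSTD => Compression::Zstd,
        _ => Compression::Reserved, // bit patterns 10 and 11
    };
    (compression, wire & BLOCK_TYPE_MASK)
}

/// Combines compression bits and a content kind into one wire byte.
pub fn make_block_byte(compression_bits: u8, kind: u8) -> u8 {
    (compression_bits & BLOCK_COMPRESSION_MASK) | (kind & BLOCK_TYPE_MASK)
}
```

For example, `BLOCK_TYPE_DEFLATE_CONTINUE` (`0x03`) combined with `BLOCK_COMPRESSION_ZSTD` (`0x40`) gives the wire byte `0x43` listed in the table, while the Lepton and WebP kinds keep compression bits `00` and so appear on the wire as `0x04` and `0x05`.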

### Zstd encoder/decoder lifecycle

- A **single persistent `zstd::stream::write::Encoder`** is created once and shared across
all Zstd-compressed blocks (compression bits `0x40`).
- After writing each block's inner payload into the encoder, `encoder.flush()` is called,
which emits a Zstd `ZSTD_e_flush` segment. Those bytes are what get stored as
`content_bytes` in the outer framing.
- Each flush segment is decodable in sequence: the decoder is a persistent
`zstd::stream::raw::Decoder` that maintains cross-block history, so compression
quality benefits from all previously seen blocks.
- The stream is terminated by EOF — there is no explicit end-of-stream block.

### Inner payload layout (inside Zstd, after decompression)

**`BLOCK_TYPE_LITERAL` (wire `0x40`)**
```
varint(data_len)
data[data_len] ← verbatim bytes from the original input
```

**`BLOCK_TYPE_DEFLATE` (wire `0x41`) and `BLOCK_TYPE_DEFLATE_CONTINUE` (wire `0x43`)**
```
varint(corrections_len)
varint(plaintext_len)
corrections[corrections_len] ← CABAC-encoded differences from predicted tokens
plaintext[plaintext_len] ← uncompressed data
```
`BLOCK_TYPE_DEFLATE_CONTINUE` has the same layout; the decoder reuses the
`RecreateStreamProcessor` state from the preceding `BLOCK_TYPE_DEFLATE` block.

**`BLOCK_TYPE_PNG` (wire `0x42`) — non-WebP path**
```
varint(corrections_len)
varint(plaintext_len)
IdatContents metadata:
varint(chunk_size_1) … varint(chunk_size_N) varint(0) ← IDAT chunk size list (0-terminated)
zlib_header[2]
adler32[4]
0xFF ← sentinel: no png_header present
corrections[corrections_len]
plaintext[plaintext_len] ← raw unfiltered pixel data
```

### Raw block payload layout (outside Zstd)

**`BLOCK_TYPE_JPEG_LEPTON` (wire `0x04`)**
```
lepton_bytes[content_len] ← Lepton-compressed JPEG; decoded by lepton_jpeg::decode_lepton()
```

**`BLOCK_TYPE_WEBP` (wire `0x05`)**
```
varint(corrections_len)
varint(webp_data_len)
IdatContents metadata:
varint(chunk_size_1) … varint(chunk_size_N) varint(0)
zlib_header[2]
adler32[4]
color_type[1] ← PngColorType (RGB=2, RGBA=6)
varint(width)
varint(height)
filters[height] ← PNG row filter bytes (one per row)
corrections[corrections_len]
webp_data[webp_data_len] ← WebP lossless encoded pixel data
```
On decode, the WebP bytes are decompressed back to pixels, PNG filters are re-applied,
and the result is re-deflated using the corrections to recreate the original IDAT stream.

## Idempotent Finalization (important bug history)

`process_buffer` may be called with `input_complete=true` multiple times (DLL pattern).
The finalization block must guard against double-finalization:

```rust
if input_complete && !self.input_complete { // NOT just `if input_complete`
    self.input_complete = true;
    // ... encoder.take().unwrap()
}
```
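
Why the extra `!self.input_complete` check matters can be seen in a self-contained toy. This is a sketch, not the crate's code: `ToyProcessor` is hypothetical, and an `Option<Vec<u8>>` stands in for the Zstd encoder so that `take().unwrap()` behaves the same way (it panics if called a second time).

```rust
/// Hypothetical processor with the same finalize-once shape.
pub struct ToyProcessor {
    input_complete: bool,
    // Stand-in for the encoder; `take()` leaves `None` behind.
    encoder: Option<Vec<u8>>,
    pub output: Vec<u8>,
}

impl ToyProcessor {
    pub fn new() -> Self {
        ToyProcessor {
            input_complete: false,
            encoder: Some(Vec::new()),
            output: Vec::new(),
        }
    }

    pub fn process_buffer(&mut self, input: &[u8], input_complete: bool) {
        // Feed input only while the encoder is still alive.
        if let Some(encoder) = self.encoder.as_mut() {
            encoder.extend_from_slice(input);
        }

        // Guarded finalization: runs exactly once, even if the caller
        // keeps passing `input_complete = true` to drain output.
        if input_complete && !self.input_complete {
            self.input_complete = true;
            let finished = self.encoder.take().unwrap();
            self.output.extend_from_slice(&finished);
        }
    }
}
```

With only `if input_complete`, the second call with `input_complete = true` would reach `take().unwrap()` on a `None` and panic.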

## Module Layout

```
src/
  lib.rs                  ← public types and re-exports
  container_processor.rs  ← PreflateContainerProcessor, RecreateContainerProcessor,
                            ProcessBuffer trait, MeasureWriteSink,
                            block-type constants, emit_compressed_block(),
                            write_chunk_block_v2(), write_varint(), read_varint()
  scan_deflate.rs         ← locates DEFLATE stream boundaries in raw bytes
                            identifies: raw deflate, zlib-wrapped, PNG IDAT, ZIP, JPEG
  idat_parse.rs           ← extracts / reassembles PNG IDAT chunks; parses IHDR
  scoped_read.rs          ← bounded reader adapter
  utils.rs                ← process_limited_buffer(), TakeReader, test helpers
```

## Key Internal Types

| Type | Purpose |
|---|---|
| `MeasureWriteSink` | `pub(crate)` sink that counts bytes; used for baseline Zstd measurement |
| `PreflateStats` | pub struct: `deflate_compressed_size`, `zstd_compressed_size`, `uncompressed_size`, `overhead_bytes`, `hash_algorithm`, `zstd_baseline_size` |
| `TakeReader<T>` | `pub` BufRead wrapper that reads at most N bytes (used in utils.rs) |

## Features

- `webp` (default-enabled): allows PNG images to be stored as WebP lossless (`BLOCK_TYPE_WEBP`)
  instead of as a `BLOCK_TYPE_PNG` block, using the `webp` crate.

## Dependencies of Note

- `lepton_jpeg` (0.5.1) — JPEG blocks are recompressed with Lepton, bypassing Zstd entirely.
- `zstd` (0.13) — single persistent encoder across all non-JPEG/WebP blocks.
- `preflate-rs` — core analysis/reconstruction (path dependency).
- `webp` (0.3, optional, default-enabled) — PNG images can be stored as WebP lossless.

## Constraints

- `#![forbid(unsafe_code)]` enforced.
- `main.rs` exists but is a stub; this crate is a library.
14 changes: 7 additions & 7 deletions container/Cargo.toml
@@ -1,15 +1,15 @@
[package]
name = "preflate-container"
- version = "0.7.5"
- edition = "2024"
- authors = ["Kristof Roomp <kristofr@microsoft.com>"]
- license = "Apache-2.0"
- rust-version = "1.85"
+ version.workspace = true
+ edition.workspace = true
+ authors.workspace = true
+ license.workspace = true
+ rust-version.workspace = true
+ repository.workspace = true
description = """
Scans binary files for zStd streams and uses Preflate-rs to decompress the stream and repack with
- zStd compression. For PNG files, we use WEBP compression for RGB and RGBA to get better results.
+ zStd compression. For PNG files, we use WEBP compression for RGB and RGBA to get better results.
"""
- repository = "https://github.com/microsoft/preflate-rs"
categories = ["compression"]
keywords = ["gzip", "deflate", "zlib", "zip"]
