70 changes: 70 additions & 0 deletions .github/workflows/publish.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,70 @@
name: Publish Crate

permissions:
contents: read

on:
push:
tags:
- "v*.*.*" # Triggers only for version tag pushes

jobs:
publish:
runs-on: ubuntu-latest

steps:
- name: Checkout code with full history
uses: actions/checkout@v3
with:
fetch-depth: 0 # Needed to compare commits and access tag history

- name: Ensure tag is at tip of main
id: verify_tag_commit
run: |
echo "Verifying tag points to main branch tip..."
git fetch origin main

TAG_COMMIT=$(git rev-parse ${{ github.ref }})
MAIN_COMMIT=$(git rev-parse origin/main)

echo "Tag commit: $TAG_COMMIT"
echo "Main commit: $MAIN_COMMIT"

if [ "$TAG_COMMIT" != "$MAIN_COMMIT" ]; then
echo "Tag is not at tip of main. Aborting."
exit 1
fi
echo "Tag is at tip of main."

- name: Extract tag version
id: tag_version
run: |
echo "TAG_VERSION=${GITHUB_REF#refs/tags/v}" >> "$GITHUB_OUTPUT"

- name: Read version from Cargo.toml
id: cargo_version
run: |
CARGO_VERSION=$(grep '^version\s*=' Cargo.toml | head -1 | sed -E 's/version\s*=\s*"([^"]+)"/\1/')
echo "CARGO_VERSION=$CARGO_VERSION" >> "$GITHUB_OUTPUT"

- name: Check tag version matches Cargo.toml
run: |
echo "Comparing tag and Cargo.toml versions..."
echo "Tag: ${{ steps.tag_version.outputs.TAG_VERSION }}"
echo "Cargo.toml: ${{ steps.cargo_version.outputs.CARGO_VERSION }}"

if [ "${{ steps.tag_version.outputs.TAG_VERSION }}" != "${{ steps.cargo_version.outputs.CARGO_VERSION }}" ]; then
echo "Version mismatch: tag does not match Cargo.toml"
exit 1
fi
echo "Tag version matches Cargo.toml."

- name: Set up Rust
uses: dtolnay/rust-toolchain@stable
with:
toolchain: stable

- name: Publish preflate-rs to crates.io
env:
CARGO_REGISTRY_TOKEN: ${{ secrets.CRATE_PUBLISH }}
run: cargo publish --verbose --package preflate-rs
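The tag/version comparison that the workflow performs in shell can be reproduced locally. Below is a rough Python equivalent of the `grep`/`sed` extraction and the tag check, for testing changes to `Cargo.toml` before pushing a tag (the helper names are hypothetical, not part of the workflow):

```python
import re

def extract_cargo_version(cargo_toml: str) -> str:
    """Mirror of the workflow's grep/sed pipeline: take the first
    line starting with `version = "..."` and return its value."""
    for line in cargo_toml.splitlines():
        m = re.match(r'version\s*=\s*"([^"]+)"', line)
        if m:
            return m.group(1)
    raise ValueError("no version key found in Cargo.toml")

def tag_matches(tag_ref: str, cargo_toml: str) -> bool:
    """tag_ref is e.g. 'refs/tags/v0.7.6', as found in $GITHUB_REF."""
    tag_version = tag_ref.removeprefix("refs/tags/v")
    return tag_version == extract_cargo_version(cargo_toml)

sample = 'name = "preflate-rs"\nversion = "0.7.6"\nedition = "2024"\n'
print(tag_matches("refs/tags/v0.7.6", sample))  # True
```

Note that, like the workflow's `grep '^version'`, the anchored `re.match` ignores the `rust-version = "..."` key because it does not start with `version`.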
2 changes: 1 addition & 1 deletion CLAUDE.md
@@ -69,5 +69,5 @@ The optional `webp` feature (enabled by default) allows PNG images to be stored
### Code constraints

- **No unsafe code** — enforced via `#![forbid(unsafe_code)]` in each crate.
- Minimum Rust version: **1.85**, Edition **2024**.
- Minimum Rust version: **1.89**, Edition **2024**.
- `.cargo/config.toml` sets Windows MSVC linker flags (`/DYNAMICBASE`, `/CETCOMPAT`, `/guard:cf`).
2 changes: 1 addition & 1 deletion Cargo.toml
@@ -19,7 +19,7 @@ version = "0.7.6"
edition = "2024"
authors = ["Kristof Roomp <kristofr@microsoft.com>"]
license = "Apache-2.0"
rust-version = "1.85"
rust-version = "1.89"
repository = "https://github.com/microsoft/preflate-rs"

[dev-dependencies]
159 changes: 103 additions & 56 deletions README.md
@@ -1,93 +1,140 @@
# preflate-rs
Preflate-rs is a library initially based on a port of the C++ [preflate library](https://github.com/deus-libri/preflate/), with the purpose of splitting deflate streams into uncompressed data and reconstruction information, or reconstructing the original deflate stream from those two.

Other similar libraries include precomp, reflate, and grittibanzli, although this library is probably the most feature-rich and achieves lower overhead across more compressor implementations.
**preflate-rs** is a Rust library for lossless re-compression of DEFLATE-compressed data. It analyzes an existing DEFLATE bitstream, extracts the uncompressed plaintext along with a compact set of reconstruction parameters, and later recreates the **bit-exact** original DEFLATE stream from those two pieces. This makes it possible to re-compress the plaintext with a more modern algorithm (Zstd, Brotli, LZMA) while preserving perfect binary round-trip fidelity.

IMPORTANT: This library is still in initial development, so breaking changes are likely to happen fairly frequently.
The library is used in production cloud storage systems where content must be stored with bit-exact fidelity while still benefiting from better compression ratios.

The resulting uncompressed content can then be recompressed by a more modern compression technique such as Zstd, LZMA, etc. This library is designed to be used as part of a cloud
storage system that requires exact binary storage of content, so the library needs to make
sure that the DEFLATE content is recreated exactly as it was written. This is not trivial, since
DEFLATE has a large degree of freedom in choosing both how the distance/length pairs are chosen
and how the Huffman trees are created.
[![unsafe forbidden](https://img.shields.io/badge/unsafe-forbidden-success.svg)](https://github.com/rust-secure-code/safety-dance/)

The library tries to detect the following compressors to try to do a reasonable job:
- [Zlib](https://github.com/madler/zlib): Zlib is more or less perfectly compressed.
- [MiniZ](https://github.com/richgel999/miniz): The fastest mode uses a different hash function.
- [Libdeflate](https://github.com/ebiggers/libdeflate): This library uses 4 byte hash-tables, which we try to detect.
- [Libzng](https://github.com/zlib-ng/zlib-ng): Works well except level 9
- Windows zlib implementation (used by the built-in PNG codec and shell ZIP compression)
---

The general approach is as follows:
1. Decompress stream into plaintext and a list of blocks containing tokens that are either literals (bytes) or distance, length pairs.
2. Estimate the dictionary update strategy by looking at which strings are referenced by the compressed data. For example, at low compression levels zlib only inserts the start of each match into its hash dictionary.
3. Estimate the maximum number of times we execute the loop to look for matches (also called chains, as in walking the chain of the hash table). We also test with different hash functions to figure out which hash function was likely used. Given the chain length, we estimate the other parameters that were likely used.
4. Rerun compression using the zlib algorithm using the parameters gathered above. A difference encoder is used to record each instance where the token predicted by our implementation of DEFLATE differs from what we found in the file.
## Why preflate-rs?

The following differences are corrected:
- Type of block (uncompressed, static huffman, dynamic huffman)
- Number of tokens in block (normally 16385)
- Dynamic huffman encoding (estimated using the zlib algorithm, but there are multiple ways to construct more or less optimal length limited Huffman codes)
- Literal vs (distance, length) pair (corrected by a single bit)
- Length or distance is incorrect (corrected by encoding the number of hops backwards until the correct one)
DEFLATE streams are not uniquely determined by their plaintext. The same input can compress to many different valid bitstreams depending on the compressor, its version, and the parameters used. Simply decompressing and recompressing will produce a *different* bitstream — which is a problem for systems that need to verify or reproduce file hashes exactly.

Note that the data formats of the recompression information are different and incompatible to the original preflate implementation, as this library uses a different arithmetic encoder (shared from the Lepton JPEG compression library).
preflate-rs solves this by treating the original DEFLATE stream as the ground truth and recording only the *differences* from what a reference model would predict. Since well-tuned compressors are highly predictable, these corrections are very small — typically well under 1% of the uncompressed data size.
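To see this non-uniqueness concretely, here is a small sketch using only Python's standard-library `zlib` module (not part of preflate-rs): two different compression levels yield two different valid bitstreams for the same plaintext.

```python
import zlib

data = b"the same plaintext can map to many valid DEFLATE bitstreams " * 100

stored = zlib.compress(data, level=0)  # level 0 emits stored (uncompressed) blocks
best = zlib.compress(data, level=9)    # level 9 compresses fully

assert stored != best                  # different bitstreams...
assert zlib.decompress(stored) == data # ...both decoding to the same plaintext
assert zlib.decompress(best) == data
print(len(stored), len(best))          # stored form is far larger
```

A naive "decompress, then recompress with zlib level 9" pipeline would therefore change the stored bytes whenever the original producer made different choices, which is exactly what preflate-rs avoids.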

[![unsafe forbidden](https://img.shields.io/badge/unsafe-forbidden-success.svg)](https://github.com/rust-secure-code/safety-dance/)
---

## How It Works

### Analysis (compress direction)

1. **Parse** — The DEFLATE bitstream is decoded into a sequence of tokens: literals (raw bytes) and length/distance back-references.
2. **Estimate** — The token sequence is analyzed to fingerprint the original compressor: hash algorithm, chain depth, nice-length cutoff, window size, and block-splitting strategy.
3. **Predict** — Compression is re-run using the estimated parameters. For each token, the model predicts what the original compressor would have chosen.
4. **Encode differences** — Wherever the prediction differs from the actual token, a correction is recorded using CABAC (Context Adaptive Binary Arithmetic Coding, the same codec used in Lepton JPEG compression).

The result is the uncompressed plaintext plus a small corrections blob. Both can be stored or re-compressed with any modern algorithm.
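The predict-and-correct idea in steps 3–4 can be sketched abstractly. The token type and helper below are illustrative only (the real library's types and the CABAC encoding are more involved): only the positions where the model's prediction disagrees with the actual stream need to be recorded.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    kind: str      # "lit" for a literal byte, "ref" for a back-reference
    value: int     # byte value for literals, match length for refs
    dist: int = 0  # back-reference distance (refs only)

def corrections(predicted, actual):
    """Record only the positions where the prediction differs from
    the actual token stream; everything else is implied by the model."""
    return [(i, a) for i, (p, a) in enumerate(zip(predicted, actual)) if p != a]

predicted = [Token("lit", 97), Token("ref", 3, 3), Token("lit", 98)]
actual    = [Token("lit", 97), Token("ref", 3, 6), Token("lit", 98)]
print(corrections(predicted, actual))  # one correction, at index 1
```

When the model is accurate, the corrections list is nearly empty, which is why the encoded overhead stays well below 1% for well-tuned compressors.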

### Reconstruction (decompress direction)

### Overhead
The plaintext and corrections are fed back into the predictor, which replays the original compression decisions step by step to recreate the exact original DEFLATE bitstream.

In order to faithfully recreate the exact deflate stream, the library stores
a stream of corrections to its predictive model. Depending on how good the
predictive model is, the corrections can take up more or less space. If you
want to improve the library, it's probably worth targeting the lower compression
levels that currently have significant overhead.
---

The amount of overhead vs uncompressed data is approximately the following,
depending on the compression level. If you want to benefit from using this
library, whatever better compression algorithm you use needs to be at least
that much better to make it worthwhile to recompress.
## Supported Compressors

| Library | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|--------|
The library detects and models the following DEFLATE implementations:

| Compressor | Notes |
|---|---|
| [zlib](https://github.com/madler/zlib) | All levels; near-zero overhead |
| [zlib-ng](https://github.com/zlib-ng/zlib-ng) | All levels except level 9 |
| [libdeflate](https://github.com/ebiggers/libdeflate) | 4-byte hash table variant detected |
| [miniz / miniz_oxide](https://github.com/richgel999/miniz) | Fastest mode uses distinct hash function |
| Windows zlib | Built-in PNG codec and shell ZIP compression |

Unrecognized compressors still round-trip correctly — the corrections overhead is simply higher.

---

## Reconstruction Overhead

The table below shows overhead (corrections size as a percentage of uncompressed data) for each supported compressor at each compression level. To benefit from re-compression, your target algorithm needs to beat the original by at least this margin.

| Compressor | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|--------------------|--------|--------|--------|--------|--------|--------|--------|--------|--------|-------|
| **zlib** | 0.01% | 0.01% | 0.01% | 0.01% | 0.01% | 0.01% | 0.01% | 0.08% | 0.03% | 0.01% |
| **libngz** | 0.01% | 0.01% | 0.01% | 0.01% | 0.97% | 1.07% | 0.90% | 0.01% | 0.01% | NoCompressionCandidates |
| **zlib-ng** | 0.01% | 0.01% | 0.01% | 0.01% | 0.97% | 1.07% | 0.90% | 0.01% | 0.01% | N/A |
| **libdeflate** | 0.01% | 0.25% | 1.04% | 0.91% | 1.51% | 1.04% | 0.96% | 0.87% | 1.04% | 1.03% |
| **miniz_oxide** | 0.01% | 0.06% | 2.70% | 1.78% | 0.53% | 0.30% | 0.09% | 0.06% | 0.08% | 0.07% |
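The break-even condition implied by the table can be computed directly. The figures below are purely illustrative, not measurements:

```python
def recompression_gain(deflate_size: int, plaintext_size: int,
                       modern_size: int, overhead_pct: float) -> int:
    """Bytes saved by storing (modern codec output + corrections)
    instead of the original DEFLATE stream. Negative means
    recompression is not worthwhile for this input."""
    corrections = plaintext_size * overhead_pct / 100.0
    return int(deflate_size - (modern_size + corrections))

# Illustrative: 1 MB of plaintext that DEFLATE took to 400 kB, a modern
# codec takes to 350 kB, with corrections overhead of 0.9% of plaintext.
gain = recompression_gain(400_000, 1_000_000, 350_000, 0.9)
print(gain)  # 41000 bytes saved
```

If the modern codec only reaches 398 kB in the same scenario, the 9 kB of corrections pushes the result past break-even and the original DEFLATE stream should be kept as-is.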

## How to Use This Library
---

## Workspace Layout

#### Building From Source
| Crate | Output | Description |
|---|---|---|
| [`preflate/`](preflate/) | library | Core DEFLATE analysis and reconstruction engine |
| [`container/`](container/) | library | Scans binary files (ZIP, PNG, JPEG) for DEFLATE streams and orchestrates the Zstd pipeline |
| [`util/`](util/) | `preflate_util.exe` | CLI tool for testing and benchmarking |
| [`dll/`](dll/) | `preflate_rs_0_7.dll` | C FFI wrapper for .NET interop |
| [`fuzz/`](fuzz/) | fuzz harnesses | libfuzzer targets for the core and container APIs |

- [Rust 1.70 or Above](https://www.rust-lang.org/tools/install)
---

```Shell
## Getting Started

### Requirements

- [Rust 1.89 or above](https://www.rust-lang.org/tools/install)

### Build from Source

```shell
git clone https://github.com/microsoft/preflate-rs
cd preflate-rs
cargo build
cargo test
cargo build --release
cargo build --all
cargo test --all
cargo build --release --all
```

#### Running
### Using the CLI

The `preflate_util` binary lets you test the library against any file or directory of files:

```shell
preflate_util [OPTIONS] <PATH>

Options:
--max-chain <N> Hash chain depth limit (default: 4096)
-c, --level <N> Zstd compression level 0–14 (default: 9)
--loglevel <L> Log verbosity (default: Error)
--verify <bool> Round-trip verify after compression (default: true)
--baseline <bool> Also measure raw Zstd-only size for comparison (default: false)
```

### Library Usage

For direct use of the core DEFLATE analysis API, see the [`preflate` crate](preflate/). For processing full binary files containing embedded DEFLATE streams (ZIP, PNG, JPEG), see the [`container` crate](container/).

---

## Design Notes

- **No unsafe code** — `#![forbid(unsafe_code)]` is enforced in every crate.
- **Chunked processing** — memory use is bounded regardless of input size.
- **Format versioning** — the DLL name encodes the format version (`preflate_rs_0_7.dll`) so old decoders can coexist with new ones during upgrades.
- **CABAC coding** — the corrections codec is shared with the [Lepton](https://github.com/microsoft/lepton_jpeg_rust) JPEG re-compression library.
- Parameters are serialized via [`bitcode`](https://crates.io/crates/bitcode); corrections via CABAC.

There is a `preflate_util.exe` wrapper built as part of the project that can be used to
test the library against DEFLATE-compressed content.
---

## Contributing

There are many ways in which you can participate in this project, for example:
* [Submit bugs and feature requests](https://github.com/microsoft/preflate-rs/issues)
* [Review or submit pull requests](https://github.com/microsoft/preflate-rs/pulls)
* The library uses only **stable Rust features**.

* [Submit bugs and feature requests](https://github.com/microsoft/preflate-rs/issues), and help us verify as they are checked in
* Review [source code changes](https://github.com/microsoft/preflate-rs/pulls) or submit your own features as pull requests.
* The library uses only **stable features**.
---

## Code of Conduct

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). See the [FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions.

## License

Copyright (c) Microsoft Corporation. All rights reserved.

Licensed under the Apache 2.0 license.
Licensed under the [Apache 2.0](LICENSE) license.
11 changes: 8 additions & 3 deletions azure-pipelines.yml
@@ -1,5 +1,10 @@
trigger:
- main
branches:
include:
- main
tags:
include:
- v*.*.*

resources:
repositories:
@@ -42,7 +47,7 @@ extends:
workflow: Rust
rust:
rustToolchain:
version: ms-prod-1.88
version: ms-prod-1.90
toolchainFeed: $(toolchainFeed)
cratesIoFeed: $(cratesIoFeed)
target: x86_64-pc-windows-msvc
@@ -147,7 +152,7 @@ extends:

- task: 1ES.PublishNuGet@1
displayName: 'NuGet push'
condition: and(succeeded(), in(variables['Build.Reason'], 'Manual'), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
condition: and(succeeded(), startsWith(variables['Build.SourceBranch'], 'refs/tags/v'))
inputs:
packageParentPath: '$(Pipeline.Workspace)'
packagesToPush: '$(Build.ArtifactStagingDirectory)\*.nupkg'