quant.cpp is the single-header C reference implementation of TurboQuant and related KV cache quantization research.
Not competing with Google. Not competing with llama.cpp. Filling the gap nobody else fills: TurboQuant-class compression anywhere a C compiler runs.
See docs/positioning.md for the full strategy.
Data-center TurboQuant? → Google reference (arXiv:2504.19874)
Workstation speed? → llama.cpp
Batch serving? → vLLM
TurboQuant on iPhone? → quant.cpp
TurboQuant in a browser? → quant.cpp
TurboQuant in a game engine? → quant.cpp
TurboQuant on a microcontroller? → quant.cpp
The world's simplest way to add an LLM to a C/C++ project.
- quant.h single header (15K LOC, 628KB)
- 6-function API (load, new, generate, ask, free_ctx, free_model)
- WASM build (192KB binary)
- MSVC/MinGW Windows support
- Zero external dependencies
- API documentation (docs/api.md)
- quant.h sync with latest source
- Embedding examples (minimal, chat, KV compare)
- pip install quantcpp (Python bindings)
- iOS SDK + demo app
- Android NDK build guide
- Unity C# plugin
- Unreal C++ integration
- npm package (WASM)
- GitHub Pages live demo with pre-loaded model
A C reference engine for KV cache quantization research.
- `turbo_kv_5b` 🏆 — RHT + 5-bit (32-level) Lloyd-Max codebook, near-lossless (Llama 3.2 3B PPL 13.60, +0.34% vs FP32). Quality-maximizing option.
- `turbo_kv_4b` ⭐ default — RHT + 4-bit Lloyd-Max codebook, beats `uniform_4b` and llama.cpp `q4_0` KV at the same bit budget (Llama 3.2 3B PPL 14.28, +5.3% vs FP32)
- `turbo_kv_3b` — RHT + 3-bit Lloyd-Max codebook (PPL 15.39, +13.5%)
- `uniform_4b` — KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
- `uniform_4b` + Q4 V combo (6.9x KV memory reduction)
- Delta compression (P-frame encoding)
- QK-norm aware compression (Gemma 4 / hybrid attention models)
- Plugin architecture (3 functions to add new type)
- Regression tests pinning `turbo_kv_4b`/`5b` quality
- 35 unit tests across macOS / Linux / Windows
- Random Hadamard Transform (`tq_rht.c`)
- Lloyd-Max-Gaussian codebook quantizer (`tq_codebook.c`, 1–4 bit)
- 1-bit QJL sign hash (`tq_qjl.c`) — research; contributes ~0 to scores in our regime
- PolarQuant (polar-coordinate) compression (`tq_polar.c`)
- Identified the gap in the literal port (commit 4da6915 — QJL contributes byte-identical zero)
- Variant F: drop QJL stage, double codebook size (commit ac3c46a — beats baseline)
- 5-bit codebook variant for ~5 bpc quality budget (commit 87e14cb)
- Regression tests pinning quality (commit 475872c)
- Per-channel outlier handling (turbo_kv_4bo/3bo, commits 4576910 + 5b5e4b7) — model-dependent, ships as research types; 5b remains the simpler quality champion
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction — issue #15
- "Add Your Own Type" tutorial polish (docs/custom-quantization.md)
- arXiv tech report
- llama.cpp KV type PR (ggml type registration) — only after paper reproduction works
- vLLM KV compression plugin
- Benchmarking suite (PPL across models × KV types)
- ❌ GPU speed competition with llama.cpp (requires tensor graph IR)
- ❌ Batch serving (vLLM's domain)
- ❌ Training support
- ❌ 100+ model coverage
- One file forward pass: tq_transformer.c contains the entire inference loop
- Plugin quantization: Add types via tq_traits.c registration
- Zero dependencies: libc + pthreads only (+ Metal on macOS)
- CPU-first: NEON/AVX2 optimized, GPU as optional accelerator
- Embeddable: quant.h works anywhere a C compiler does