quant.cpp is the single-header C reference implementation of TurboQuant and related KV cache quantization research.
Not competing with Google. Not competing with llama.cpp. Filling the gap nobody else fills: TurboQuant-class compression anywhere a C compiler runs.
See docs/positioning.md for the full strategy.
Data-center TurboQuant? → Google reference (arXiv:2504.19874)
Workstation speed? → llama.cpp
Batch serving? → vLLM
TurboQuant on iPhone? → quant.cpp
TurboQuant in a browser? → quant.cpp
TurboQuant in a game engine? → quant.cpp
TurboQuant on a microcontroller? → quant.cpp
The world's simplest way to add an LLM to a C/C++ project.
- quant.h single header (15K LOC, 628KB)
- 6-function API (load, new, generate, ask, free_ctx, free_model)
- WASM build (192KB binary)
- MSVC/MinGW Windows support
- Zero external dependencies
- API documentation (docs/api.md)
- quant.h sync with latest source
- Embedding examples (minimal, chat, KV compare)
- pip install quantcpp (Python bindings)
- iOS SDK + demo app
- Android NDK build guide
- Unity C# plugin
- Unreal C++ integration
- npm package (WASM)
- GitHub Pages live demo with pre-loaded model
A C reference engine for KV cache quantization research.
- `turbo_kv_5b` 🏆 — RHT + 5-bit (32-level) Lloyd-Max codebook, near-lossless (Llama 3.2 3B PPL 13.60, +0.34% vs FP32). Quality-maximizing option.
- `turbo_kv_4b` ⭐ default — RHT + 4-bit Lloyd-Max codebook, beats `uniform_4b` and llama.cpp `q4_0` KV at the same bit budget (Llama 3.2 3B PPL 14.28, +5.3% vs FP32)
- `turbo_kv_3b` — RHT + 3-bit Lloyd-Max codebook (PPL 15.39, +13.5%)
- `uniform_4b` — KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
- `uniform_4b` + Q4 V combo (6.9x KV memory reduction)
- Delta compression (P-frame encoding)
- QK-norm aware compression (Gemma 4 / hybrid attention models)
- Plugin architecture (3 functions to add new type)
- Regression tests pinning `turbo_kv_4b`/`5b` quality
- 35 unit tests across macOS / Linux / Windows
- Random Hadamard Transform (`tq_rht.c`)
- Lloyd-Max-Gaussian codebook quantizer (`tq_codebook.c`, 1–4 bit)
- 1-bit QJL sign hash (`tq_qjl.c`) — research; contributes ~0 to scores in our regime
- PolarQuant (polar-coordinate) compression (`tq_polar.c`)
- Identified the gap in the literal port (commit 4da6915 — QJL contributes byte-identical zero)
- Variant F: drop QJL stage, double codebook size (commit ac3c46a — beats baseline)
- 5-bit codebook variant for ~5 bpc quality budget (commit 87e14cb)
- Regression tests pinning quality (commit 475872c)
- Per-channel outlier handling (turbo_kv_4bo/3bo, commits 4576910 + 5b5e4b7) — model-dependent, ships as research types; 5b remains the simpler quality champion
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction — issue #15
- "Add Your Own Type" tutorial polish (docs/custom-quantization.md)
- arXiv tech report
- llama.cpp KV type PR (ggml type registration) — only after paper reproduction works
- vLLM KV compression plugin
- Benchmarking suite (PPL across models × KV types)
- ❌ GPU speed competition with llama.cpp (requires tensor graph IR)
- ❌ Batch serving (vLLM's domain)
- ❌ Training support
- ❌ 100+ model coverage
- One file forward pass: tq_transformer.c contains the entire inference loop
- Plugin quantization: Add types via tq_traits.c registration
- Zero dependencies: libc + pthreads only (+ Metal on macOS)
- CPU-first: NEON/AVX2 optimized, GPU as optional accelerator
- Embeddable: quant.h works anywhere a C compiler does