v0.6.0 — turbo_kv_4b champion, beats production baseline

@unamedkr released this 07 Apr 21:53
· 59 commits to main since this release

🏆 Highlights

After 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026), turbo_kv_4b is now the best 4-bit KV quantization in the project — beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget.

| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
| --- | --- | --- | --- |
| FP32 baseline | 32 | 13.56 | — |
| `turbo_kv_4b` | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | +10.6% |

The story

The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03 — worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found the QJL stage contributed exactly nothing: attention scores were byte-identical with it removed. We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit, 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.

Full optimization history: bench/results/turboquant_reproduction.md

Other changes

  • CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
  • @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
  • Windows CI green — fixed a `pthread_cond_wait` SRWLOCK deadlock, added MSVC shims for `__builtin_*` intrinsics, removed hard-coded `/tmp` paths in tests, and defined `M_PI` for `test_neon_scalar`. 35/35 tests pass on macOS / Linux / Windows.
  • Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
  • Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 rejected as reformatting noise, with its examples README and CMake separation salvaged.

What's tracked for next release

See issue #15:

  • Per-channel outlier handling (Google paper's 32-channel split)
  • Paper-faithful Llama 3.1 8B + LongBench-E reproduction
  • 5-bit codebook variant for ~5 bpc

Bug fixes

  • `tq_qjl.c`: NaN guard requires `dim > 0`
  • `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
  • `tq_transformer.c`: NULL-check key/value cache calloc results
  • `tq_ops.c`: on Windows, condition waits must use the SRW variant (`SleepConditionVariableSRW`); the CS variant on an SRWLOCK deadlocked the test_ops thread pool

Citations

If you use quant.cpp's KV compression in research, please cite: