v0.6.0 — turbo_kv_4b champion, beats production baseline
## 🏆 Highlights
After six rounds of Karpathy-loop iteration starting from a literal port of Google's TurboQuant (ICLR 2026), `turbo_kv_4b` is now the best 4-bit KV quantization scheme in the project — beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV cache at the same bit budget.
| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 baseline | 32 | 13.56 | — |
| `turbo_kv_4b` ⭐ | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp `q4_0` KV (rough) | 4 | ~14.99 | +10.6% |
## The story
The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03 — worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found the QJL stage contributed nothing: attention scores were byte-identical with it removed. We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit indices, 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.
Full optimization history: `bench/results/turboquant_reproduction.md`
## Other changes
- CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
- @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
- Windows CI green — `pthread_cond_wait` SRWLOCK deadlock fixed, MSVC shims for `__builtin*` intrinsics, hard-coded /tmp paths in tests fixed, `M_PI` defined for MSVC in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
- Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
- Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 rejected as reformatting noise, with its examples README and CMake separation salvaged.
## What's tracked for next release
See issue #15:
- Per-channel outlier handling (Google paper's 32-channel split)
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- 5-bit codebook variant for ~5 bpc
## Bug fixes
- `tq_qjl.c`: NaN guard requires `dim > 0`
- `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
- `tq_transformer.c`: NULL-check key/value cache calloc results
- `tq_ops.c`: the Windows `pthread_cond_wait` shim now uses the SRW wait variant (`SleepConditionVariableSRW`) — the CRITICAL_SECTION variant on an SRWLOCK deadlocked the `test_ops` thread pool
## Citations
If you use quant.cpp's KV compression in research, please cite: