v0.6.0 — turbo_kv_4b champion, beats production baseline

@unamedkr released this 07 Apr 21:53
· 59 commits to main since this release

🏆 Highlights

After 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026), turbo_kv_4b is now the best 4-bit KV quantization in the project — beating both our previous production baseline (`uniform_4b`) and llama.cpp's `q4_0` KV at the same bit budget.

| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
| --- | --- | --- | --- |
| FP32 baseline | 32 | 13.56 | — |
| `turbo_kv_4b` | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | +10.6% |

The story

The literal paper port (RHT → Lloyd-Max codebook → 1-bit QJL residual + ‖r‖₂) gave PPL 16.03 — worse than the simpler `uniform_4b` (14.41). A Karpathy-loop ablation found the QJL stage contributed exactly nothing: attention scores were byte-identical with it removed. We dropped it and reinvested the freed 16 bytes per block in a 2× larger codebook (3-bit → 4-bit, 8 → 16 levels). Same total block size, finer reconstruction, structurally simpler.

Full optimization history: bench/results/turboquant_reproduction.md

Other changes

  • CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically
  • @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project
  • Windows CI green — fixed a `pthread_cond_wait` SRWLOCK deadlock, added MSVC shims for `__builtin_*` intrinsics, removed hard-coded `/tmp` paths in tests, and defined `M_PI` for `test_neon_scalar`. 35/35 tests pass on macOS / Linux / Windows.
  • Honest TurboQuant story — public reproduction report with full ablation history. No overstated claims.
  • Public PR triage — PR #12 (5 critical bug fixes) cherry-picked; PR #13 rejected as reformatting noise, with its examples README and CMake separation salvaged.

What's tracked for next release

See issue #15:

  • Per-channel outlier handling (Google paper's 32-channel split)
  • Paper-faithful Llama 3.1 8B + LongBench-E reproduction
  • 5-bit codebook variant for ~5 bpc

Bug fixes

  • `tq_qjl.c`: NaN guard requires `dim > 0`
  • `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack)
  • `tq_transformer.c`: NULL-check key/value cache calloc results
  • `tq_ops.c`: on Windows, condition waits must use the SRW variant (`SleepConditionVariableSRW`); the CS variant on an SRWLOCK deadlocked the test_ops thread pool

Citations

If you use quant.cpp's KV compression in research, please cite: