# v0.6.3 — Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%
## ⚠️ Correction
The original release notes claimed 'turbo_kv beats fp32 KV speed'. That was wrong — an artifact of the fp32 attention path being unoptimized scalar while the quant path had NEON. After adding NEON to fp32 (commit `4490c83`):
| Type | Bytes/block | Compression | tok/s | vs FP32 | PPL Δ |
|---|---|---|---|---|---|
| FP32 KV (NEON) | — | 1× | 14.83 | baseline | — |
| `turbo_kv_4b` ⭐ | 72 | 7.1× | 13.67 | −7.8% | +5.7% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.13 | −11.5% | +0.7% |
| `turbo_kv_3b` | 56 | 9.1× | 13.40 | −9.6% | +13.3% |
The honest story: 9 rounds of Karpathy iteration closed the quant-KV speed gap from −45% to −8%, while the types compress 5.8–9.1×. We do not (yet) beat fp32 raw speed.
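The fix in `4490c83` amounted to replacing the scalar fp32 attention dot product with a NEON kernel. A minimal sketch of the pattern — a hypothetical function, not the repo's actual code, with two accumulators to hide FMA latency and a scalar fallback for non-NEON builds:

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Hypothetical sketch of the fp32 attention dot product after the NEON fix.
 * Two vector accumulators hide FMA latency; the tail is handled in scalar. */
static float dot_f32(const float *a, const float *b, size_t n) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc0 = vdupq_n_f32(0.0f), acc1 = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        acc0 = vfmaq_f32(acc0, vld1q_f32(a + i),     vld1q_f32(b + i));
        acc1 = vfmaq_f32(acc1, vld1q_f32(a + i + 4), vld1q_f32(b + i + 4));
    }
    float sum = vaddvq_f32(vaddq_f32(acc0, acc1));
    for (; i < n; i++) sum += a[i] * b[i];
    return sum;
#else
    /* scalar fallback: this is roughly what the "unoptimized" fp32
     * baseline looked like before the fix */
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) sum += a[i] * b[i];
    return sum;
#endif
}
```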
## What actually changed in this release
### Round 5 (the real bottleneck)
`tq_transformer.c`'s `use_quant_kv` path was calling `traits->dequantize` once per cached key per token, which internally ran `tq_rht_inverse()` (O(d log d)) per call — dominating the total cost at long context. Round 5 changes the inner loop to use the type's optimized `traits->attention` kernel, which:
- Pre-rotates the query ONCE per layer
- Does fused dequant + dot product per block in rotated space
- Skips per-position inverse RHT entirely
This took the quant path from 6.9 → 13.7 tok/s on Llama 3.2 3B (a real ~2× speedup). The fast path handles the common case (no QK-norm on stored keys, no high-res window, no sliding-window attention); anything else falls back to the original per-position path.
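The restructured loop can be sketched as follows. This is a toy model, not the real `tq_transformer.c` code: only the `traits->attention` shape comes from the notes above, and the RHT and fused kernel are trivial fp32 stand-ins (a length-2 Hadamard butterfly and a plain dot product):

```c
#include <assert.h>
#include <stddef.h>

/* Toy sketch of the Round 5 restructuring: rotate the query once per layer,
 * then run a fused dequant+dot kernel per cached block in rotated space,
 * instead of calling a per-position dequantize that runs an O(d log d)
 * inverse RHT for every cached key. */

typedef struct {
    /* fused dequant + dot over one cached block, in rotated space */
    float (*attention)(const void *block, const float *q_rot, size_t d);
} tq_type_traits;

/* stand-in for the real RHT: one unnormalized Hadamard butterfly per pair */
static void rht_forward(float *q, size_t d) {
    for (size_t i = 0; i + 1 < d; i += 2) {
        float a = q[i], b = q[i + 1];
        q[i]     = a + b;
        q[i + 1] = a - b;
    }
}

/* stand-in fused kernel: blocks here are just pre-rotated fp32 keys */
static float fused_dot(const void *block, const float *q_rot, size_t d) {
    const float *k = (const float *)block;
    float s = 0.0f;
    for (size_t i = 0; i < d; i++) s += k[i] * q_rot[i];
    return s;
}

static void attend_quant_kv(const tq_type_traits *traits,
                            const void *const *key_blocks, size_t n_pos,
                            float *q, size_t d, float *scores) {
    rht_forward(q, d);                  /* pre-rotate ONCE per layer */
    for (size_t p = 0; p < n_pos; p++)  /* no per-position inverse RHT */
        scores[p] = traits->attention(key_blocks[p], q, d);
}
```

Because the rotation is applied to both sides, the score comes out in rotated space and the per-position inverse transform disappears from the hot loop entirely.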
### Round 6
Hoisted the dequant LUT out of the inner loop in `turbo_kv_4bo` and `turbo_kv_3bo`, matching the optimization already applied to the 3b/4b/5b dequants.
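The hoist is the classic move of building a lookup table once per block instead of once per element. A toy before/after sketch — the real LUTs live inside the 4bo/3bo dequant functions; everything here is illustrative:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy illustration of the Round 6 LUT hoist for a 4-bit code. */
static void build_lut(float scale, float lut[16]) {
    for (int i = 0; i < 16; i++) lut[i] = scale * (float)(i - 8);
}

/* before: the table is rebuilt on every iteration (O(16n) extra work) */
static void dequant_slow(const uint8_t *codes, float scale,
                         float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        float lut[16];
        build_lut(scale, lut);          /* rebuilt per element */
        out[i] = lut[codes[i] & 0x0F];
    }
}

/* after: the table is hoisted above the loop, built once per block */
static void dequant_fast(const uint8_t *codes, float scale,
                         float *out, size_t n) {
    float lut[16];
    build_lut(scale, lut);              /* built once */
    for (size_t i = 0; i < n; i++)
        out[i] = lut[codes[i] & 0x0F];
}
```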
## Karpathy round-by-round
| Round | What changed | turbo_kv_4b tok/s |
|---|---|---|
| 0 | Baseline (per-position dequant + inline dot) | 6.9 |
| 1 | Single-pass dequant with hoisted LUT | 7.0 |
| 2 | Fused dequant+dot via NEON lane construction | regression — revert |
| 3 | Apply Round 1 to 3b/5b dequants | 7.0 |
| 4 | Pure scalar fused with 4 accumulators | 7.0 |
| 5 | transformer uses traits->attention (no per-pos RHT inverse) | 13.5 ✅ |
| 6 | Hoist LUT in 4bo/3bo dequants | 13.7 |
| 7 | NEON-optimize fp32 path (validation finding) | fp32: 12.6 → 14.8 |
## Lessons
The validation step (running the same comparison after fixing the unfair baseline) flipped the headline. This is exactly what validation is for. We caught the wrong claim before it shipped to users.
The Karpathy loop's discipline of measure → modify → measure → revert-if-worse kept us honest at each step. Round 2 looked clever but regressed and was reverted. Round 5 was an unglamorous transformer-level structural change that no amount of local kernel tuning could have found — but the systematic measurement revealed it.
## What you should use
- `./build/quant model.gguf` — defaults to `turbo_kv_4b` (best size × quality, 92% of fp32 speed)
- `-k turbo_kv_5b` — when you need near-lossless quality
- `-k turbo_kv_3b` — for maximum compression (9.1×) at ~13% PPL cost
- `-k fp32` — when you have memory to spare and want max speed
## Tests
35/35 tests pass. Regression tests pin cosine similarity at ≥ 0.99 for 4b/5b and enforce the invariant that 5b accuracy ≥ 4b.