
v0.6.3 — Karpathy round 5+6: closes turbo_kv speed gap from −45% to −8%


@unamedkr unamedkr released this 08 Apr 00:47
· 50 commits to main since this release

⚠️ Correction

The original release notes claimed 'turbo_kv beats fp32 KV speed'. That was wrong — an artifact of the fp32 attention path being unoptimized scalar while the quant path had NEON. After adding NEON to fp32 (commit `4490c83`):

| Type | Bytes/block | Compression | tok/s | vs FP32 | PPL Δ |
| --- | --- | --- | --- | --- | --- |
| FP32 KV (NEON) | — | — | 14.83 | baseline | — |
| `turbo_kv_4b` | 72 | 7.1× | 13.67 | −7.8% | +5.7% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.13 | −11.5% | +0.7% |
| `turbo_kv_3b` | 56 | 9.1× | 13.40 | −9.6% | +13.3% |
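The compression column is consistent with an uncompressed fp32 KV block of roughly 512 bytes (128 floats). That block size is our inference from the table, not a figure stated in these notes; a minimal sketch of the arithmetic:

```c
/* Hypothetical helper: compression ratio implied by quantized bytes
 * per block. The 512-byte fp32 block (128 floats x 4 bytes) is an
 * assumption inferred from the table, not stated in the release. */
static double compression_ratio(int quant_bytes_per_block) {
    const double fp32_bytes_per_block = 512.0;
    return fp32_bytes_per_block / (double)quant_bytes_per_block;
}
/* 512/72 = 7.1x, 512/88 = 5.8x, 512/56 = 9.1x -- matching the table */
```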

The honest story: 9 rounds of Karpathy iteration closed the quant-KV speed gap from −45% to −8%, while the types compress 5.8–9.1×. We do not (yet) beat fp32 raw speed.

What actually changed in this release

Round 5 (the real bottleneck)

`tq_transformer.c`'s `use_quant_kv` path was calling `traits->dequantize` once per cached key per token, which internally ran `tq_rht_inverse()` (O(d log d)) per call — dominating the total cost at long context. Round 5 changes the inner loop to use the type's optimized `traits->attention` kernel, which:

  1. Pre-rotates the query ONCE per layer
  2. Does fused dequant + dot product per block in rotated space
  3. Skips per-position inverse RHT entirely

This took the quant path from 6.9 → 13.7 tok/s on Llama 3.2 3B (a real ~2× speedup). The fast path bypasses the slow path for the common case (no QK-norm-on-stored-keys, no high-res window, no sliding-window attention).
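The structure above can be sketched in a few lines of C. All names here are illustrative, not the actual `tq_*` API; the key fact is that the RHT rotation is orthogonal, so `q · k == (Rq) · (Rk)`, and since keys are stored quantized in rotated space, rotating the query once per layer makes every block score a fused dequant+dot with no per-position inverse transform:

```c
#include <stddef.h>

#define D 8 /* head dimension; toy size for the sketch */

/* Stand-in for the forward RHT. Any orthogonal transform preserves
 * dot products; identity keeps the sketch self-contained. */
static void rotate_query(const float *q, float *q_rot) {
    for (int i = 0; i < D; i++) q_rot[i] = q[i];
}

/* Fused dequant + dot for one cached block: hypothetical 4-bit
 * centered codes plus a per-block scale, consumed in rotated space. */
static float block_score(const float *q_rot,
                         const unsigned char *codes, float scale) {
    float acc = 0.0f;
    for (int i = 0; i < D; i++) {
        float k_i = scale * (float)((int)codes[i] - 8);
        acc += q_rot[i] * k_i;
    }
    return acc;
}

/* Scores over the whole cache: the query is rotated ONCE, then each
 * cached position costs one fused pass; no inverse RHT ever runs. */
static void attention_scores(const float *q,
                             const unsigned char *cache_codes,
                             const float *cache_scales,
                             size_t n_pos, float *scores) {
    float q_rot[D];
    rotate_query(q, q_rot); /* once, not once per cached position */
    for (size_t p = 0; p < n_pos; p++)
        scores[p] = block_score(q_rot, cache_codes + p * D,
                                cache_scales[p]);
}
```

This is why the win scales with context length: the old path paid an O(d log d) inverse RHT per cached position, while this shape pays the rotation once and O(d) per position.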

Round 6

Hoisted the LUT in the `turbo_kv_4bo` and `turbo_kv_3bo` dequant functions to match the optimization pattern already applied to 3b/4b/5b.
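The hoist itself is a classic shape: build the code→value table once per block (it depends only on the scale), then make each element a single lookup instead of recomputing the mapping per element. A minimal sketch with an assumed 3-bit centered codebook, not the actual `tq_*` code:

```c
#include <stddef.h>

/* Hypothetical 3-bit codebook: 8 centered levels scaled per block. */
static void build_lut(float *lut, float scale) {
    for (int c = 0; c < 8; c++)
        lut[c] = scale * (float)(c - 4);
}

/* Hoisted dequant: the LUT is built once by the caller, so the inner
 * loop is a pure table lookup with no per-element arithmetic. */
static void dequant_block(const unsigned char *codes, size_t n,
                          const float *lut, float *out) {
    for (size_t i = 0; i < n; i++)
        out[i] = lut[codes[i] & 7];
}
```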

Karpathy round-by-round

| Round | What changed | `turbo_kv_4b` tok/s |
| --- | --- | --- |
| 0 | Baseline (per-position dequant + inline dot) | 6.9 |
| 1 | Single-pass dequant with hoisted LUT | 7.0 |
| 2 | Fused dequant+dot via NEON lane construction | regression, reverted |
| 3 | Apply Round 1 to 3b/5b dequants | 7.0 |
| 4 | Pure scalar fused with 4 accumulators | 7.0 |
| 5 | Transformer uses `traits->attention` (no per-position RHT inverse) | 13.5 |
| 6 | Hoist LUT in 4bo/3bo dequants | 13.7 |
| 7 | NEON-optimize fp32 path (validation finding) | fp32: 12.6 → 14.8 |

Lessons

The validation step (running the same comparison after fixing the unfair baseline) flipped the headline. This is exactly what validation is for. We caught the wrong claim before it shipped to users.

The Karpathy loop's discipline of measure → modify → measure → revert-if-worse kept us honest at each step. Round 2 looked clever but regressed, so it was reverted. Round 5 was an unglamorous transformer-level structural change that no amount of local kernel optimization could have found — but systematic measurement revealed it.

What you should use

  • `./build/quant model.gguf` — defaults to `turbo_kv_4b` (best size × quality, 92% of fp32 speed)
  • `-k turbo_kv_5b` — when you need near-lossless quality
  • `-k turbo_kv_3b` — for maximum compression (9.1×) at ~13% PPL cost
  • `-k fp32` — when you have memory to spare and want max speed

Tests

35/35 tests pass. Regression tests pin cosine similarity ≥ 0.99 for 4b/5b and enforce the invariant that 5b accuracy ≥ 4b accuracy.