# Changelog

## [0.6.3] — 2026-04-08

### 🏆 turbo_kv now BEATS FP32 KV speed at 7× compression

After 6 rounds of Karpathy iteration on the attention path, all three
production turbo_kv types are now **both more compressed AND faster**
than uncompressed FP32 KV on the Llama 3.2 3B PPL eval (1040 tokens, 28
layers, attention-heavy):

| Type | Bytes/block | tok/s | vs FP32 | PPL | Δ vs FP32 |
|---|---:|---:|---:|---:|---:|
| FP32 KV | — | 12.6 | baseline | 13.56 | — |
| **`turbo_kv_4b`** ⭐ | 72 | **13.9** | **+10% ⬆** | 14.33 | +5.7% |
| **`turbo_kv_3b`** | 56 | **13.4** | **+6% ⬆** | 15.36 | +13.3% |
| **`turbo_kv_5b`** 🏆 | 88 | **13.2** | **+5% ⬆** | 13.65 | +0.7% |

### What changed (Round 5: the real bottleneck)

The biggest win came from `tq_transformer.c`. The `use_quant_kv` path
was calling `traits->dequantize` once per cached key per token, which
internally ran `tq_rht_inverse()` (O(d log d)) per call, dominating
the total cost at long context.

Round 5 changes the inner loop to use the type's optimized
`traits->attention` kernel, which:
1. Pre-rotates the query ONCE per layer
2. Does a fused dequant + dot product per block in rotated space
3. Skips the per-position inverse RHT entirely

The old slow path is preserved as a fallback for the complex cases:
QK-norm-on-stored-keys, k_highres_window, and sliding-window attention.

### Karpathy loop (this release)

| Round | What changed | Llama 3.2 3B turbo_kv_4b tok/s |
|---:|---|---:|
| 0 | Baseline (per-position dequant + inline dot) | 6.9 |
| 1 | Single-pass dequant with hoisted LUT | 7.0 |
| 2 | Fused dequant+dot via NEON lane construction | regression, reverted |
| 3 | Apply Round 1 to the 3b/5b dequants | 7.0 |
| 4 | Pure scalar fused with 4 accumulators | 7.0 |
| 5 | **transformer uses traits->attention (no per-pos RHT inverse)** | **13.5** ✅ |
| 6 | Hoist LUT in 4bo/3bo dequants | 13.9 |

PPL shifted slightly under the floating-point reordering (a 0.3–0.5%
increase per type, all within the regression-test cosine ≥ 0.99/0.999
thresholds). 35/35 tests pass.

### Other changes

- New follow-up notes on tracking issue #15 covering per-head rotation
  seeds and a Llama 3.1 8B + LongBench-E reproduction (still open)
## [0.6.2] — 2026-04-08

### Highlights