Commit c58d4d7

CHANGELOG: v0.6.3 — Karpathy round 5+6, turbo_kv beats fp32 speed
Parent: 83f37fd

1 file changed: CHANGELOG.md (+53 lines, −0)
# Changelog

## [0.6.3] — 2026-04-08

### 🏆 turbo_kv now BEATS fp32 KV speed at 7× compression

After six rounds of Karpathy iteration on the attention path, all three
production turbo_kv types are now **both more compressed AND faster**
than uncompressed FP32 KV on the Llama 3.2 3B PPL eval (1040 tokens, 28
layers, attention-heavy):

| Type | Bytes/block | tok/s | vs FP32 | PPL | ΔPPL vs FP32 |
|---|---:|---:|---:|---:|---:|
| FP32 KV | — | 12.6 | baseline | 13.56 | — |
| **`turbo_kv_4b`** | 72 | **13.9** | **+10% ⬆** | 14.33 | +5.7% |
| **`turbo_kv_3b`** | 56 | **13.4** | **+6% ⬆** | 15.36 | +13.3% |
| **`turbo_kv_5b`** 🏆 | 88 | **13.2** | **+5% ⬆** | 13.65 | +0.7% |

### What changed (Round 5: the real bottleneck)

The biggest win came from `tq_transformer.c`. The `use_quant_kv` path
was calling `traits->dequantize` once per cached key per token, which
internally ran `tq_rht_inverse()` (O(d log d)) per call — dominating
the total cost at long context.

Round 5 changes the inner loop to use the type's optimized
`traits->attention` kernel, which:

1. Pre-rotates the query ONCE per layer
2. Does fused dequant + dot product per block in rotated space
3. Skips per-position inverse RHT entirely

The old slow path is preserved as a fallback for the complex cases:
QK-norm-on-stored-keys, k_highres_window, and sliding-window attention.

35+
### Karpathy loop (this release)
36+
37+
| Round | What changed | Llama 3.2 3B `turbo_kv_4b` tok/s |
|---:|---|---:|
| 0 | Baseline (per-position dequant + inline dot) | 6.9 |
| 1 | Single-pass dequant with hoisted LUT | 7.0 |
| 2 | Fused dequant+dot via NEON lane construction | regression; reverted |
| 3 | Apply Round 1 to 3b/5b dequants | 7.0 |
| 4 | Pure scalar fused with 4 accumulators | 7.0 |
| 5 | **Transformer uses `traits->attention` (no per-position RHT inverse)** | **13.5** |
| 6 | Hoist LUT in 4bo/3bo dequants | 13.9 |

PPL shifted slightly across the floating-point reordering (a 0.3–0.5%
increase per type, all within the regression-test cosine thresholds of
≥ 0.99/0.999). 35/35 tests pass.


### Other changes

- New tracking issue #15 follow-up notes for per-head rotation seeds and
  Llama 3.1 8B + LongBench-E reproduction (still open)
## [0.6.2] — 2026-04-08

### Highlights
