v0.6.4 — Honest validation pass + correction of v0.6.3 speed claims
⚠️ This release exists because validation matters
v0.6.3 shipped with the headline 'turbo_kv beats fp32 KV speed'. After re-running the comparison with both paths NEON-optimized, that claim turned out to be wrong. v0.6.4 publishes the honest numbers.
Final honest measurements (Llama 3.2 3B PPL eval, 3 runs each)
| Type | Avg tok/s | vs FP32 | PPL | PPL Δ | Compression |
|---|---|---|---|---|---|
| FP32 KV (NEON) | 14.63 | baseline | 13.56 | — | 1× |
| `turbo_kv_4b` ⭐ default | 13.57 | −7.2% | 14.33 | +5.7% | 7.1× |
| `turbo_kv_3b` | 13.13 | −10.2% | 15.36 | +13.3% | 9.1× |
| `turbo_kv_5b` 🏆 quality | 12.90 | −11.8% | 13.65 | +0.7% | 5.8× |
The Round 5 optimization in v0.6.3 (transformer → traits->attention) was real and meaningful: turbo_kv_4b went from 6.9 → 13.57 tok/s (+97%). What was wrong was the comparison baseline: the fp32 path was still unoptimized scalar code.
What changed in v0.6.4
| File | Change |
|---|---|
| `tq_transformer.c` | NEON-optimized fp32 attention path. fp32 went 12.6 → 14.63 tok/s (+16%). |
| `README.md`, `README.ko.md` | Headline tables and ASCII charts updated with honest numbers and a Correction note. |
| `CHANGELOG.md` | v0.6.3 entry has a prominent Correction notice; v0.6.4 entry documents the validation pass. |
| v0.6.3 release notes | Updated with the same Correction notice. |
| `tq_transformer.c` | Round 8 prefetch attempt and Round 9 strided-attention concept reverted (no measurable benefit). |
What we learned
Validation is the most valuable step: it caught the incorrect claim before it reached users.
The 9-round Karpathy loop was in good faith but the comparison baseline was unfair. Once we fixed the unfair baseline, the headline flipped from 'beats fp32' to 'within 8% of fp32 with 7× compression'. Both stories are interesting — but only one is true.
Pareto position (still strong)
`turbo_kv_4b` still dominates `uniform_4b` where it matters, winning clearly on quality and speed with near-identical compression:
| | `turbo_kv_4b` | `uniform_4b` |
|---|---|---|
| PPL on Llama 3.2 3B | 14.33 | 14.60 |
| Speed | 13.57 tok/s | 11.7 tok/s |
| Compression | 7.1× | 7.5× |
The compression edge for uniform_4b is marginal; turbo_kv_4b wins on the other two by clear margins.
Tests
35/35 unit tests pass on macOS / Linux / Windows. Regression tests pin cosine similarity at ≥ 0.99 (4b) and ≥ 0.999 (5b).
Closes
- Honest validation of v0.6.3 speed claims ✅
- Corrected README and release notes ✅
- Local optimum reached for the current attention path
- Future structural work (e.g., GPU dispatch, tensor graph IR) tracked separately