v0.6.4 — Honest validation pass + correction of v0.6.3 speed claims

@unamedkr unamedkr released this 08 Apr 02:43
· 46 commits to main since this release

⚠️ This release exists because validation matters

v0.6.3 shipped with the headline 'turbo_kv beats fp32 KV speed'. After re-running the comparison with both paths NEON-optimized, that claim turned out to be wrong. v0.6.4 publishes the honest numbers.

Final honest measurements (Llama 3.2 3B PPL eval, 3 runs each)

| Type | Avg tok/s | vs FP32 | PPL | PPL Δ | Compression |
|---|---|---|---|---|---|
| FP32 KV (NEON) | 14.63 | baseline | 13.56 | — | — |
| `turbo_kv_4b` ⭐ default | 13.57 | −7.2% | 14.33 | +5.7% | 7.1× |
| `turbo_kv_3b` | 13.13 | −10.2% | 15.36 | +13.3% | 9.1× |
| `turbo_kv_5b` 🏆 quality | 12.90 | −11.8% | 13.65 | +0.7% | 5.8× |

The Round 5 optimization in v0.6.3 (transformer → traits->attention) was real and meaningful: turbo_kv_4b went from 6.9 → 13.57 tok/s (+97%). What was wrong was the comparison baseline: the fp32 path was unoptimized scalar code.

What changed in v0.6.4

| File | Change |
|---|---|
| `tq_transformer.c` | NEON-optimized the fp32 attention path; fp32 went 12.6 → 14.83 tok/s (+18%). |
| `README.md`, `README.ko.md` | Headline tables and ASCII charts updated with honest numbers and a Correction note. |
| `CHANGELOG.md` | v0.6.3 entry now carries a prominent Correction notice; v0.6.4 entry documents the validation pass. |
| v0.6.3 release notes | Updated with the same Correction notice. |
| `tq_transformer.c` | Reverted the Round 8 prefetch attempt and the Round 9 strided-attention concept (no measurable benefit). |

What we learned

Validation is the most valuable step. It caught the wrong claim before it spread to users.

The 9-round Karpathy loop was in good faith but the comparison baseline was unfair. Once we fixed the unfair baseline, the headline flipped from 'beats fp32' to 'within 8% of fp32 with 7× compression'. Both stories are interesting — but only one is true.

Pareto position (still strong)

`turbo_kv_4b` beats `uniform_4b` on PPL and speed, and is nearly even on compression:

| | turbo_kv_4b | uniform_4b |
|---|---|---|
| PPL on Llama 3.2 3B | 14.33 | 14.60 |
| Speed | 13.57 tok/s | 11.7 tok/s |
| Compression | 7.1× | 7.5× |

The compression edge for uniform_4b is marginal; turbo_kv_4b wins on the other two by clear margins.

Tests

35/35 unit tests pass on macOS / Linux / Windows. Regression tests pin cosine ≥ 0.99 (4b) and ≥ 0.999 (5b).

Closes

  • Honest validation of v0.6.3 speed claims ✅
  • Corrected README and release notes ✅
  • Local optimum reached for the current attention path
  • Future structural work (e.g., GPU dispatch, tensor graph IR) tracked separately