v0.6.4 — Honest validation pass + correction of v0.6.3 speed claims
⚠️ This release exists because validation matters
v0.6.3 shipped with the headline 'turbo_kv beats fp32 KV speed'. After re-running the comparison with both paths NEON-optimized, that claim turned out to be wrong. v0.6.4 publishes the honest numbers.
Final honest measurements (Llama 3.2 3B PPL eval, 3 runs each)
| Type | Avg tok/s | vs FP32 | PPL | PPL Δ | Compression |
|---|---|---|---|---|---|
| FP32 KV (NEON) | 14.63 | baseline | 13.56 | — | 1× |
| `turbo_kv_4b` ⭐ default | 13.57 | −7.2% | 14.33 | +5.7% | 7.1× |
| `turbo_kv_3b` | 13.13 | −10.2% | 15.36 | +13.3% | 9.1× |
| `turbo_kv_5b` 🏆 quality | 12.90 | −11.8% | 13.65 | +0.7% | 5.8× |
The Round 5 optimization in v0.6.3 (transformer → traits->attention) was real and meaningful: turbo_kv_4b went from 6.9 → 13.57 tok/s (+97%). What was wrong was the comparison baseline: the fp32 path was still unoptimized scalar code.
What changed in v0.6.4
| File | Change |
|---|---|
| `tq_transformer.c` | NEON-optimized fp32 attention path. fp32 went 12.6 → 14.63 tok/s (+16%). |
| `README.md`, `README.ko.md` | Headline tables and ASCII charts updated with honest numbers and a Correction note. |
| `CHANGELOG.md` | v0.6.3 entry has a prominent Correction notice; v0.6.4 entry documents the validation pass. |
| v0.6.3 release notes | Updated with the same Correction notice. |
| `tq_transformer.c` | Round 8 prefetch attempt and Round 9 strided-attention concept reverted (no measurable benefit). |
What we learned
Validation is the most valuable step: it caught the incorrect claim before it reached users.
The 9-round Karpathy loop was in good faith but the comparison baseline was unfair. Once we fixed the unfair baseline, the headline flipped from 'beats fp32' to 'within 8% of fp32 with 7× compression'. Both stories are interesting — but only one is true.
Pareto position (still strong)
`turbo_kv_4b` still dominates `uniform_4b` where it matters, winning clearly on quality and speed with near-identical compression:
| | `turbo_kv_4b` | `uniform_4b` |
|---|---|---|
| PPL on Llama 3.2 3B | 14.33 | 14.60 |
| Speed | 13.57 tok/s | 11.7 tok/s |
| Compression | 7.1× | 7.5× |
The compression edge for uniform_4b is marginal; turbo_kv_4b wins on the other two by clear margins.
Tests
35/35 unit tests pass on macOS / Linux / Windows. Regression tests pin cosine similarity at ≥ 0.99 (4b) and ≥ 0.999 (5b).
Closes
- Honest validation of v0.6.3 speed claims ✅
- Corrected README and release notes ✅
- Local optimum reached for the current attention path
- Future structural work (e.g., GPU dispatch, tensor graph IR) tracked separately