
v0.8.0 — Cross-platform SIMD (AVX2)


@unamedkr unamedkr released this 08 Apr 23:04
· 224 commits to main since this release

Highlights

AVX2 port of turbo_kv attention. The Round 10/11 NEON vqtbl1q_s8 / vqtbl2q_s8 table-lookup pattern that achieved fp32 parity on Apple Silicon is now mirrored on x86 AVX2 for all four turbo_kv variants (4b / 5b / 5b_fast / 3b). Linux and Windows x86-64 builds get the same kernel structure as ARM, with no algorithmic changes.

| Variant | NEON | AVX2 |
| --- | --- | --- |
| 4b | `vqtbl1q_s8` | `_mm_shuffle_epi8` |
| 5b | `vqtbl2q_s8` | `_mm_shuffle_epi8` + `_mm_blendv_epi8` |
| 5b_fast | `vqtbl2q_s8` | same as 5b, no bit-unpack |
| 3b | `vqtbl1q_s8` | `_mm_shuffle_epi8` |

Issue #16 instrumentation. New tq_metal_diag_get()/tq_metal_diag_reset() flush counter; the PPL tool prints flushes/token and ops/flush in Metal builds. Running the issue's exact reproducer (Llama 3.2 3B Q8_0 turbo_kv_4b) shows 0 flushes: the Metal batch path never enters for Q8_0 weights. Findings posted to #16.

KL divergence tool. New --save-logits / --kl-baseline flags enable a two-pass softmax-distribution comparison against an fp32 baseline, required by the upcoming llama.cpp PR. Smoke test on SmolLM2 135M: fp32 PPL 18.66 → turbo_kv_4b PPL 19.73, mean KL 0.1575 over 1040 tokens.

Explored and reverted. The v0.9.0 vdotq experiment (int8 query quantization + vdotq_s32) gave +6% speed but a +1.5% PPL regression. The cosine test was not sensitive enough to catch it; PPL gating did. Documented for a future revisit with a calibrated per-segment query scale.

Deferred to v0.8.1. WASM SIMD port; it first requires un-stubbing turbo_kv attention in the quant.h single-header.

Tests

35/35 passing. New TurboKVRegression.KV_5B_FAST_AttentionCosine regression test adds previously missing coverage.

🤖 Generated with Claude Code