
v0.8.0 — Cross-platform SIMD (AVX2)


@unamedkr unamedkr released this 08 Apr 23:04
· 224 commits to main since this release

Highlights

AVX2 port of turbo_kv attention. The Round 10/11 NEON vqtbl1q_s8 / vqtbl2q_s8 table-lookup pattern that achieved fp32 parity on Apple Silicon is now mirrored on x86 AVX2 for all four turbo_kv variants (4b / 5b / 5b_fast / 3b). Linux and Windows x86-64 builds get the same kernel structure as ARM, with no algorithmic changes.

| Variant | NEON | AVX2 |
| --- | --- | --- |
| 4b | `vqtbl1q_s8` | `_mm_shuffle_epi8` |
| 5b | `vqtbl2q_s8` | `_mm_shuffle_epi8` + `_mm_blendv_epi8` |
| 5b_fast | `vqtbl2q_s8` | same as 5b, no bit-unpack |
| 3b | `vqtbl1q_s8` | `_mm_shuffle_epi8` |

Issue #16 instrumentation. New tq_metal_diag_get()/tq_metal_diag_reset() flush counter; the PPL tool prints flushes/token and ops/flush in Metal builds. Running the issue's exact reproducer (Llama 3.2 3B Q8_0 turbo_kv_4b) shows 0 flushes: the Metal batch path never enters for Q8_0 weights. Findings posted to #16.

KL divergence tool. New --save-logits / --kl-baseline flags enable a two-pass softmax-distribution comparison against an fp32 baseline, required by the upcoming llama.cpp PR. Smoke test on SmolLM2 135M: fp32 PPL 18.66 → turbo_kv_4b PPL 19.73, mean KL 0.1575 over 1040 tokens.

Explored and reverted. The v0.9.0 vdotq experiment (int8 query quantization + vdotq_s32) gave +6% speed but a +1.5% PPL regression. The cosine test was not sensitive enough to catch it; PPL gating did. Documented for a future revisit with a calibrated per-segment query scale.

Deferred to v0.8.1. WASM SIMD port; it first requires un-stubbing turbo_kv attention in the quant.h single-header.

Tests

35/35 passing. New TurboKVRegression.KV_5B_FAST_AttentionCosine regression test adds previously missing coverage.

🤖 Generated with Claude Code