
Commit 04b08d3

unamedkr and claude committed
v0.8.0: AVX2 turbo_kv port + Metal diag + KL tool
Bundles four work items completed in one session:

1. AVX2 port of turbo_kv_4b/5b/5b_fast/3b attention (commit 2dcbde4)
2. Metal flush diagnostic counter for Issue #16 (commit 34f5ef4)
3. KL divergence two-pass tool for llama.cpp PR validation (fd4148b)
4. v0.9.0 vdotq experiment: explored, measured (1.5% PPL regression), reverted

The CHANGELOG documents each item with measurements where available. The llama.cpp PR draft is updated to mark KL divergence as DONE (it was the main remaining blocker).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent fd4148b commit 04b08d3

File tree

2 files changed: +45 −1 lines


CHANGELOG.md

Lines changed: 44 additions & 0 deletions
@@ -1,5 +1,49 @@
# Changelog

## [0.8.0] — 2026-04-09

### Cross-platform SIMD: AVX2 port of turbo_kv attention

Round 10/11's NEON `vqtbl1q_s8` / `vqtbl2q_s8` table-lookup pattern is now mirrored on x86 AVX2 for all four turbo_kv attention variants. The breakthrough that achieved fp32 parity on Apple Silicon now extends to Linux/Windows x86-64 builds.
| Variant | NEON instruction | AVX2 instruction(s) | Layout |
|---|---|---|---|
| 4b | `vqtbl1q_s8` | `_mm_shuffle_epi8` | 16-entry codebook fits in 1 register |
| 5b | `vqtbl2q_s8` | `_mm_shuffle_epi8` + `_mm_blendv_epi8` | 32-entry codebook split low/high |
| 5b_fast | `vqtbl2q_s8` | same as 5b, no bit-unpack | direct 1-byte index loads |
| 3b | `vqtbl1q_s8` (lower 8) | `_mm_shuffle_epi8` | 8-entry codebook fits trivially |

The 32-entry codebook (5b/5b_fast) needs the BLENDV bit-trick on AVX2, since `PSHUFB` is a per-lane, 16-entry-only lookup. Performance is unmeasured on x86 in this release (CI builds and runs the new tests; benchmarking is deferred to v0.8.x).

Tests added:

- `TurboKVRegression.KV_5B_FAST_AttentionCosine` — was missing coverage; now exercises 5b_fast on synthetic Gaussian keys (cosine > 0.999).

### Investigation: Issue #16 Metal dispatch overhead

Added `tq_metal_diag_get/reset()` flush counter so the PPL tool prints `flushes/token` and `ops/flush` at end of run. Reproducing the issue's exact command on Llama 3.2 3B Q8_0 turbo_kv_4b shows **0 flushes/token** — Metal batch path is never entered for Q8_0 weights because the gate `layer_has_gguf` requires `gguf_w*` (Q4_K on-the-fly path). Metal=ON and Metal=OFF are now identical in throughput on this model.

The remaining suspected slowdown sources (Q4_K + `tq_metal_forward_layer` Q4 path) are documented as next steps in the issue. The diag counter unblocks anyone with the right model from getting empirical numbers in one command.

### llama.cpp PR validation: KL divergence tool

`tools/quant.c` gains `--save-logits` and `--kl-baseline` for two-pass KL measurement against an fp32 baseline:

```bash
quant model.gguf --ppl text.txt -k fp32 --save-logits base.bin
quant model.gguf --ppl text.txt -k turbo_kv_4b --kl-baseline base.bin
# → "KL divergence (baseline || quantized): mean = 0.157466 over 1040 tokens"
```
This is the standard llama-perplexity-style validation needed by the upcoming llama.cpp PR (`docs/pr/2026-04-09-llama-cpp-pr-draft.md`).

### Explored and reverted

- **vdotq query quantization** (v0.9.0 candidate): replacing the int8→fp32→fma chain with `vdotq_s32(int8_codebook, int8_query)` gave +6% speed but **+1.5% PPL regression** on turbo_kv_4b. The cosine test (>0.99) was not sensitive enough to catch it; PPL gating caught it. Reverted; documented in memory `feedback_vdotq_query_quant_tradeoff`.

### Deferred

- **WASM SIMD port**: requires un-stubbing turbo_kv attention in `quant.h` (single-header) first. Tracked for v0.8.1.
## [0.7.1] — 2026-04-08
### Round 11 — NEON tbl pattern applied to 3b/5b (partial parity)

docs/pr/2026-04-09-llama-cpp-pr-draft.md

Lines changed: 1 addition & 1 deletion
@@ -87,7 +87,7 @@ Per https://github.com/ggml-org/llama.cpp/blob/master/CONTRIBUTING.md the follow
  | Convert a small model to GGUF using the new type | N/A (KV-only) | This is a runtime KV cache type, not a weight quantization type. Models are not re-converted. |
  | Perplexity comparison vs FP16/BF16 and similar types || See result table above. PPL +3.8% vs FP32 KV on Llama 3.2 3B (Q8_0 weights). Need llama.cpp-side reproduction. |
- | KL divergence data | ⚠️ TODO | quant.cpp does not currently compute KL div. Will add to the reference engine and report before merge. |
+ | KL divergence data | ✅ DONE (commit fd4148b) | quant.cpp now has `--save-logits`/`--kl-baseline`. Smoke-test on SmolLM2 135M: fp32 PPL 18.66 → turbo_kv_4b PPL 19.73 (+5.7%), mean KL 0.1575 over 1040 tokens. Reproduce on Llama 3.2 3B before submission. |
  | Pure CPU performance benchmarking vs similar types || tok/s on Llama 3.2 3B PPL eval, 3-run average, no Metal. See result table above. |
  | Code style: 4-space indent, snake_case, no modern STL || The reference C code follows these. ggml port will too. |
9393
