
v0.7.1 — Round 11: SIMD lookup applied to 3b/5b (partial parity)


@unamedkr released this 08 Apr 14:53
· 32 commits to main since this release

Round 11: same primitive, different bit-packing

Round 10 (v0.7.0) achieved fp32 KV parity for `turbo_kv_4b` via NEON `vqtbl1q_s8` table lookup. Round 11 applies the same SIMD codebook lookup pattern to the remaining production variants. Result: large improvement for 5b and 3b but not full parity, because their bit-unaligned packing creates a new bottleneck in the unpack stage.

| Type | tok/s (3-run avg) | vs FP32 | PPL Δ | Compression |
| --- | --- | --- | --- | --- |
| FP32 | 18.43 | baseline | | |
| `turbo_kv_4b` ⭐ default | 18.17 | −1.4% | ✅ parity | 7.1× |
| `turbo_kv_5b` 🏆 quality | 16.80 | −8.8% | +0.7% | 5.8× |
| `turbo_kv_3b` | 16.57 | −10.1% | +13.3% | 9.1× |

5b narrowed its throughput gap from −14.5% to −8.8% vs FP32; 3b improved by roughly 3 percentage points.

Why 4b reached parity but 5b/3b didn't

| Type | Bit packing | Unpack | Result |
| --- | --- | --- | --- |
| 4b | byte-aligned (2 nibbles per byte) | pure SIMD `vandq_u8` + `vshrq_n_u8` | parity |
| 3b | bit-aligned (irregular 3-bit fields) | uint64 read + 16 scalar shifts | −10.1% |
| 5b | bit-aligned (irregular 5-bit fields) | uint64 read + 16 scalar shifts | −8.8% |

For 3-bit and 5-bit widths, the 16 indices of a block straddle byte boundaries irregularly. We use the fastest scalar unpack we found, but it still costs ~16 instructions per 16-element iteration, while the SIMD lookup itself is a single instruction. Unpack therefore dominates the 3b/5b runtime.

Bonus insight: matmul already used the same pattern

While investigating other optimization axes, we discovered that the GGUF Q4 matmul code (`tq_gguf_quants.c:1561`) already uses `vqtbl1q_s8` for codebook lookup, and has since v0.5. That is why fp32 and turbo_kv show nearly identical matmul times in the profile (38.6 vs 38.9 ms): they share the same NEON tbl matmul kernel.

Round 10's "breakthrough" was therefore applying a primitive we had already been using for matmul to the attention path. Profile-driven analysis would have spotted this in week 1.

What's not in v0.7.1

  • 5b/3b at full parity. Closing the remaining gap needs either:
    • Layout change: 1 byte per index, sacrificing compression (5b would go 5.8× → 3.6×)
    • SIMD bit-extraction trick: `vshlq` + bit-mask patterns, complex
    • Acceptance: ship 5b/3b at near-parity with honest disclosure (chosen for v0.7.1)
  • `turbo_kv_4bo` / `turbo_kv_3bo` — research types, still on Round 9 path
  • AVX2 / WASM SIMD ports

Tests

35/35 unit tests pass. PPL unchanged across all variants from Round 10.

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default — fp32 parity at 7.1× compression
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, slightly slower
./build/quant model.gguf -k turbo_kv_3b # max compression, +13% PPL trade
```

Cross-session lesson

Eleven Karpathy rounds across two sessions. Key learnings are now in persistent memory:

  1. Profile before optimizing — Round 10 found in 30s what 9 rounds of guessing missed
  2. SIMD table lookup pattern — `vqtbl1q_s8` / `vqtbl2q_s8` / `vtbl1_s8` for small codebooks
  3. SIMD unpack constraint — byte-alignment matters as much as the primitive itself
  4. Ship honest — 5b/3b are not at parity, the README and CHANGELOG say so explicitly