v0.7.1 — Round 11: SIMD lookup applied to 3b/5b (partial parity)
Round 11: same primitive, different bit-packing
Round 10 (v0.7.0) achieved fp32 KV parity for `turbo_kv_4b` via NEON `vqtbl1q_s8` table lookup. Round 11 applies the same SIMD codebook lookup pattern to the remaining production variants. Result: large improvement for 5b and 3b but not full parity, because their bit-unaligned packing creates a new bottleneck in the unpack stage.
| Type | tok/s (3-run avg) | vs FP32 | PPL Δ | Compression |
|---|---|---|---|---|
| FP32 | 18.43 | baseline | — | 1× |
| `turbo_kv_4b` ⭐ default | 18.17 | −1.4% ✅ parity | +3.8% | 7.1× |
| `turbo_kv_5b` 🏆 quality | 16.80 | −8.8% | +0.7% | 5.8× |
| `turbo_kv_3b` | 16.57 | −10.1% | +13.3% | 9.1× |
5b closed its gap from −14.5% to −8.8% versus fp32; 3b improved by roughly 3 percentage points.
Why 4b reached parity but 5b/3b didn't
| Type | Bit packing | Unpack | Result |
|---|---|---|---|
| 4b | byte-aligned (2 nibbles per byte) | pure SIMD `vandq_u8` + `vshrq_n_u8` | parity ✅ |
| 3b | bit-aligned (irregular 3-bit fields) | uint64 read + 16 scalar shifts | −10.1% |
| 5b | bit-aligned (irregular 5-bit fields) | uint64 read + 16 scalar shifts | −8.8% |
For 3-bit and 5-bit, the 16 indices in each iteration straddle byte boundaries irregularly. We use the fastest scalar unpack we found, but it still costs roughly 16 shift-and-mask instructions per 16-element iteration, while the SIMD lookup itself is a single instruction. The unpack therefore dominates the 3b/5b runtime.
Bonus insight: matmul already used the same pattern
While investigating other optimization axes, we found that the GGUF Q4 matmul code (`tq_gguf_quants.c:1561`) has used `vqtbl1q_s8` for codebook lookup since v0.5. That is why fp32 and turbo_kv show identical matmul time in the profile (38.6 vs 38.9 ms): they share the same NEON tbl matmul kernel.
The "breakthrough" of Round 10 was applying a primitive we'd already been using for matmul to the attention path. Profile-driven analysis would have spotted this in week 1.
What's not in v0.7.1
- 5b/3b at full parity. Closing the remaining gap needs one of:
  - Layout change: 1 byte per index, sacrificing compression (5b would go 5.8× → 3.6×)
  - SIMD bit-extraction trick: `vshlq` + bit-mask patterns, complex
  - Acceptance: ship 5b/3b at near-parity with honest disclosure (chosen for v0.7.1)
- `turbo_kv_4bo` / `turbo_kv_3bo` — research types, still on Round 9 path
- AVX2 / WASM SIMD ports
Tests
All 35 unit tests pass. PPL is unchanged from Round 10 across all variants.
What you should use
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j
./build/quant model.gguf # turbo_kv_4b default — fp32 parity at 7.1× compression
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, slightly slower
./build/quant model.gguf -k turbo_kv_3b # max compression, +13% PPL trade
```
Cross-session lesson
Eleven Karpathy-style rounds across two sessions. Key learnings now in persistent memory:
- Profile before optimizing — Round 10 found in 30s what 9 rounds of guessing missed
- SIMD table lookup pattern — `vqtbl1q_s8` / `vqtbl2q_s8` / `vtbl1_s8` for small codebooks
- SIMD unpack constraint — byte-alignment matters as much as the primitive itself
- Ship honest — 5b/3b are not at parity, the README and CHANGELOG say so explicitly