v0.7.1 — Round 11: SIMD lookup applied to 3b/5b (partial parity)
Round 11: same primitive, different bit-packing
Round 10 (v0.7.0) achieved fp32 KV parity for `turbo_kv_4b` via NEON `vqtbl1q_s8` table lookup. Round 11 applies the same SIMD codebook lookup pattern to the remaining production variants. Result: large improvement for 5b and 3b but not full parity, because their bit-unaligned packing creates a new bottleneck in the unpack stage.
| Type | tok/s (3-run avg) | vs FP32 | PPL Δ | Compression |
|---|---|---|---|---|
| FP32 | 18.43 | baseline | — | 1× |
| `turbo_kv_4b` ⭐ default | 18.17 | −1.4% ✅ parity | +3.8% | 7.1× |
| `turbo_kv_5b` 🏆 quality | 16.80 | −8.8% | +0.7% | 5.8× |
| `turbo_kv_3b` | 16.57 | −10.1% | +13.3% | 9.1× |
5b closed its gap from −14.5% to −8.8% versus fp32; 3b improved by roughly 3 percentage points.
Why 4b reached parity but 5b/3b didn't
| Type | Bit packing | Unpack | Result |
|---|---|---|---|
| 4b | byte-aligned (2 nibbles per byte) | pure SIMD `vandq_u8` + `vshrq_n_u8` | parity ✅ |
| 3b | bit-aligned (irregular 3-bit fields) | uint64 read + 16 scalar shifts | −10.1% |
| 5b | bit-aligned (irregular 5-bit fields) | uint64 read + 16 scalar shifts | −8.8% |
For 3-bit and 5-bit, the 16 indices in each iteration straddle byte boundaries irregularly. We use the fastest scalar unpack we found, but it still costs roughly 16 shift-and-mask instructions per 16-element iteration, while the SIMD lookup itself is a single instruction. The unpack therefore dominates the 3b/5b runtime.
Bonus insight: matmul already used the same pattern
While investigating other optimization axes, we found that the GGUF Q4 matmul code (`tq_gguf_quants.c:1561`) has used `vqtbl1q_s8` for codebook lookup since v0.5. That is why fp32 and turbo_kv show identical matmul time in the profile (38.6 vs 38.9 ms): they share the same NEON tbl matmul kernel.
The "breakthrough" of Round 10 was applying a primitive we'd already been using for matmul to the attention path. Profile-driven analysis would have spotted this in week 1.
What's not in v0.7.1
- 5b/3b at full parity. Closing the remaining gap needs one of:
  - Layout change: 1 byte per index, sacrificing compression (5b would go 5.8× → 3.6×)
  - SIMD bit-extraction trick: `vshlq` + bit-mask patterns, complex
  - Acceptance: ship 5b/3b at near-parity with honest disclosure (chosen for v0.7.1)
- `turbo_kv_4bo` / `turbo_kv_3bo` — research types, still on Round 9 path
- AVX2 / WASM SIMD ports
Tests
All 35 unit tests pass. PPL is unchanged from Round 10 across all variants.
What you should use
```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j
./build/quant model.gguf # turbo_kv_4b default — fp32 parity at 7.1× compression
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, slightly slower
./build/quant model.gguf -k turbo_kv_3b # max compression, +13% PPL trade
```
Cross-session lesson
Eleven Karpathy-style rounds across two sessions. Key learnings now in persistent memory:
- Profile before optimizing — Round 10 found in 30s what 9 rounds of guessing missed
- SIMD table lookup pattern — `vqtbl1q_s8` / `vqtbl2q_s8` / `vtbl1_s8` for small codebooks
- SIMD unpack constraint — byte-alignment matters as much as the primitive itself
- Ship honest — 5b/3b are not at parity, the README and CHANGELOG say so explicitly