
v0.7.0 — Round 10 BREAKTHROUGH: turbo_kv_4b matches fp32 KV speed at 7.1× compression


@unamedkr unamedkr released this 08 Apr 13:03
· 35 commits to main since this release

🏆 The breakthrough we've been chasing for 3 sessions

After 10 rounds of Karpathy-style iteration, `turbo_kv_4b` now matches fp32 KV speed on the Llama 3.2 3B PPL eval, while staying within 4% of fp32 PPL and delivering 7.1× memory compression. This is the point where the value proposition fundamentally changes.

| Type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|------|------------:|------------:|----:|----------:|------:|--------------:|
| FP32 KV | — | — | 13.56 | — | 17.9 | baseline |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.08 | +3.8% | 18.7 | +4.5% |

The Karpathy story

Rounds 1–9 applied local fusions to the inner loop without measuring where the time was actually going. Then we ran the existing `--profile` flag at long context (PPL eval, seq_len ~950) and finally saw the truth:

```
matmul attention other total
fp32 38.6 ms 15.7 ms 1.4 ms 55.7 ms
turbo_kv_4b 38.9 ms 19.8 ms 1.8 ms 60.5 ms
delta +0.3 +4.1 +0.4 +4.8 ← entire gap is in attention
```

The matmul code path is identical between fp32 and turbo_kv (it's a Q/K/V projection over Q4 weights). The 8% speed gap was entirely in the attention dot-product loop.

Root cause: turbo_kv inner loop was scalar (LUT load + mul + add per element) while fp32 was 4-way NEON SIMD. About 2× more instructions per element. The dequant lookup had become compute-bound, not memory-bound — surprising because we'd assumed memory was the bottleneck.

The fix: NEON `vqtbl1q_s8` (Round 10)

Apple Silicon NEON has `vqtbl1q_s8`, a single instruction that performs a 16-byte table lookup across 16 lanes at once, a perfect fit for our 16-entry codebook.

```c
// One-time at startup: quantize the 16 Lloyd-Max-Gaussian centroids to int8
// (the scale maps the extreme centroid value ±2.7326 to ±127).
static int8_t s_cb_i8[16];
for (int j = 0; j < 16; j++) {
    s_cb_i8[j] = (int8_t)(cb[j] * (127.0f / 2.7326f)); // ~1% precision loss
}
int8x16_t cb_vec = vld1q_s8(s_cb_i8);

// Per attention call, per block (mi points at the packed 4-bit codes):
for (int d = 0; d + 31 < dim; d += 32) {
    uint8x16_t bytes    = vld1q_u8(mi + d/2);             // 16 bytes = 32 nibbles
    uint8x16_t low_nib  = vandq_u8(bytes, vdupq_n_u8(0x0F));
    uint8x16_t high_nib = vshrq_n_u8(bytes, 4);
    int8x16_t low_vals  = vqtbl1q_s8(cb_vec, low_nib);    // 1 instruction, 16 gathers
    int8x16_t high_vals = vqtbl1q_s8(cb_vec, high_nib);
    int8x16x2_t inter   = vzipq_s8(low_vals, high_vals);  // interleave back to element order
    // ... int8 → int16 → fp32 → multiply by block scale → vfmaq_f32
}
```

32 elements per iteration (vs 8 in the previous scalar version), with one `vqtbl1q_s8` per 16 lookups instead of 16 scalar L1 hits.

Cross-model verification

| Model | Speed gap (R9 → R10) | PPL (R10) |
|-------|----------------------|-----------|
| SmolLM2 135M | -14.5% → -3.1% | +5.7% |
| Llama 3.2 1B | -16.3% → -1.3% | +5.4% |
| Llama 3.2 3B | -8.4% → +4.5% | +3.8% |

All three models show a large speed improvement, and Llama 3.2 3B is now at parity. PPL also improved slightly on all three relative to Round 9 (the int8 discretization happens to align favorably with the key statistics).

Honest framing change

| Before v0.7.0 | After v0.7.0 |
|---------------|--------------|
| "92% of fp32 speed at 7× compression" | "PARITY with fp32 speed at 7× compression" |

What you should use

```bash
cmake -B build -DCMAKE_BUILD_TYPE=Release # default: TQ_BUILD_METAL=OFF
cmake --build build -j

./build/quant model.gguf # turbo_kv_4b default (now fp32-parity)
./build/quant model.gguf -k turbo_kv_5b # near-lossless quality, still scalar
```

What's NOT in v0.7.0

The 5b/3b variants still use the previous scalar inner loop. Their numbers in the table are from Round 9. v0.7.1 will apply the same NEON tbl pattern to them (8-entry table for 3b, 32-entry split table for 5b).

Tests

35/35 unit tests pass. Regression tests pin attention cosine ≥ 0.99 (4b) — the int8 codebook precision loss is well within bounds.

The lesson

The user kept pushing: "답은 언제나 존재합니다. 그것을 찾아내는게 어려울 뿐입니다." (The answer always exists; finding it is the hard part.)

For 9 rounds we had been guessing at local optimizations. Round 10 was the result of:

  1. Stopping the guessing and running the existing `--profile` flag
  2. Reading the data: the entire gap was in attention, not matmul
  3. Web search for similar optimization patterns (NEON tbl, MLX implementations, sparse V)
  4. Choosing the right SIMD primitive (`vqtbl1q_s8`) for our specific 16-entry codebook
  5. Accepting the small precision loss (int8 vs fp32 LUT) because the regression tests guard quality

Three sessions of careful Karpathy discipline + one round of profile-driven analysis = the answer existed all along.