Summary
While exploring P3 (Metal compute graph for KV attention), we discovered that the existing Metal backend (`TQ_BUILD_METAL=ON`) makes inference 13–40% slower than the CPU-only build on every model size we tested. This applies to `fp32` and to all `turbo_kv_*` paths.
Measurements (3 runs each, Llama 3.2 3B Instruct, PPL eval)
| Build     | KV type     | tok/s |
|-----------|-------------|-------|
| Metal ON  | fp32        | 15.07 |
| Metal OFF | fp32        | 17.87 |
| Metal ON  | turbo_kv_4b | 14.17 |
| Metal OFF | turbo_kv_4b | 16.53 |
| Metal ON  | turbo_kv_5b | 13.43 |
| Metal OFF | turbo_kv_5b | 15.33 |
Across model sizes:
| Model        | Metal-OFF win          |
|--------------|------------------------|
| SmolLM2 135M | neutral (within noise) |
| Llama 3.2 1B | +13–17%                |
| Llama 3.2 3B | +14–22%                |
| Gemma 4 26B  | +40%                   |
Even on the largest model we tested (Gemma 4 26B at 1.0 tok/s with Metal vs 1.4 tok/s without), Metal is net negative.
Why?
The current Metal path uses per-matmul dispatch with command buffer commit + waitUntilCompleted at flush points. At batch-1 inference, the per-op dispatch overhead exceeds the GPU compute benefit. This is the same dispatch-overhead issue documented in our earlier failed compute-graph experiments.
What's surprising is that even on the very large Gemma 4 26B, Metal still loses. The matmul ops are large enough that GPU compute should win, but the dispatch + sync still dominates.
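A back-of-the-envelope model illustrates why batch-1 decode is so sensitive to dispatch cost. All constants below (ops per token, per-op dispatch+sync overhead, per-op compute times) are illustrative assumptions for the sketch, not measurements from quant.cpp:

```python
# Illustrative model of per-op GPU dispatch overhead at batch-1 decode.
# Every constant here is an assumed round number, not a measured value.
OPS_PER_TOKEN = 200      # matmuls + other ops in one decode step
DISPATCH_US   = 250.0    # per-op command-buffer commit + waitUntilCompleted
GPU_OP_US     = 80.0     # GPU compute time per op (faster than CPU...)
CPU_OP_US     = 120.0    # ...but the CPU pays no dispatch cost at all

gpu_token_us = OPS_PER_TOKEN * (DISPATCH_US + GPU_OP_US)
cpu_token_us = OPS_PER_TOKEN * CPU_OP_US

print(f"GPU path: {gpu_token_us / 1000:.1f} ms/token")
print(f"CPU path: {cpu_token_us / 1000:.1f} ms/token")

# Break-even: the GPU only wins once its per-op compute advantage
# exceeds the per-op dispatch + sync cost.
advantage_us = CPU_OP_US - GPU_OP_US
print(f"GPU wins only if dispatch overhead < {advantage_us:.0f} us/op")
```

Under these assumptions the GPU's per-op compute advantage is swamped by dispatch overhead several times its size, which matches the qualitative result: larger matmuls (bigger models) shrink the gap but never close it as long as every op pays a full commit + wait.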
Impact on past benchmarks
All quant.cpp benchmarks published before commit `` (2026-04-08) were run with `-DTQ_BUILD_METAL=ON` and were therefore 14–22% slower than what users actually get with the default CMake build. The README and CHANGELOG numbers have been updated to reflect the honest CPU-only baseline.
The CMake default is, and always has been, `TQ_BUILD_METAL=OFF`, so end users were always getting the fast path; only our internal benchmark numbers were affected.
Action items
Out of scope (won't fix here)
- Adding new Metal kernels (e.g., for turbo_kv attention) — would compound the problem until the existing dispatch path is fixed
- Full GPU compute graph (already failed in previous attempts)
How to reproduce
```bash
# CPU-only (fast, default)
cmake -B build_cpu -DTQ_BUILD_METAL=OFF
cmake --build build_cpu -j

# Metal (currently slower)
cmake -B build_metal -DTQ_BUILD_METAL=ON
cmake --build build_metal -j

# Compare
for k in fp32 turbo_kv_4b; do
  for build in build_cpu build_metal; do
    $build/quant models/Llama-3.2-3B-Instruct-Q8_0.gguf --ppl bench/data/ppl_1k.txt -j 8 -k $k
  done
done
```