perf(turboquant): bump mlx-vlm to 0.6.0 for batched-TQ decode speedup #1599

Open
popfido wants to merge 1 commit into jundot:main from popfido:perf/mlx-vlm-0.6.0


@popfido popfido commented Jun 2, 2026

Summary

Bumps the mlx-vlm dependency from the 6f60ee4 commit pin to the 0.6.0 PyPI release. This is the upstream-released version of the TurboQuant decode fixes oMLX has been carrying, and it additionally lands the RHT value-kernel speedup (Blaizzy/mlx-vlm#1252) that directly accelerates the batched TurboQuant decode path wired up in #1547.

Why it's faster

In 0.5.0, the L=1 value-sum fast kernel was gated off whenever RHT was active (if not self.use_rht and ...), so every decode step fell back to an einsum. RHT is on for any power-of-two head dim (e.g. Llama head_dim 64), so batched-TQ decode always took the slow path. 0.6.0 removes that gate and applies the RHT inverse to the kernel result (_value_rotate_inverse), so the Metal kernel now runs with RHT enabled. The value-sum cost scales with KV length, so the win shows at long context, which is exactly the regime where you'd reach for TurboQuant KV compression in the first place.
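Why the inverse can be applied to the kernel result rather than to every cached value: the rotation is linear, so it commutes with the attention-weighted sum. The sketch below illustrates this in pure Python, assuming RHT means a randomized Hadamard transform (random sign flips followed by a normalized Walsh-Hadamard transform). The helper names (fwht, rotate, rotate_inverse, weighted_sum, max_error) are hypothetical, quantization is left out, and none of this is the mlx-vlm implementation.

```python
import math
import random


def fwht(x):
    """Unnormalized fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    h = 1
    while h < len(y):
        for i in range(0, len(y), 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    return y


def rotate(v, signs):
    """Assumed RHT: flip signs, then apply the orthonormal Hadamard transform."""
    scale = 1.0 / math.sqrt(len(v))
    return [t * scale for t in fwht([s * x for s, x in zip(signs, v)])]


def rotate_inverse(v, signs):
    """Inverse of rotate: orthonormal Hadamard is its own inverse, then undo the signs."""
    scale = 1.0 / math.sqrt(len(v))
    return [s * t * scale for s, t in zip(signs, fwht(v))]


def weighted_sum(weights, rows):
    """Attention-style value sum for a single query (L=1): sum_i w_i * rows[i]."""
    dim = len(rows[0])
    return [sum(w * row[d] for w, row in zip(weights, rows)) for d in range(dim)]


def max_error(head_dim=64, kv_len=5, seed=0):
    """Compare the direct value sum with 'sum in rotated space, invert once'."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(head_dim)]
    values = [[rng.gauss(0.0, 1.0) for _ in range(head_dim)] for _ in range(kv_len)]
    raw = [rng.random() for _ in range(kv_len)]
    weights = [r / sum(raw) for r in raw]

    reference = weighted_sum(weights, values)
    rotated_cache = [rotate(v, signs) for v in values]  # what the cache would hold
    recovered = rotate_inverse(weighted_sum(weights, rotated_cache), signs)
    return max(abs(a - b) for a, b in zip(reference, recovered))
```

The point of the sketch is cost: the inverse touches one head_dim-sized vector per query instead of kv_len cached rows, so the KV-length-dependent work can stay in the fast kernel.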

Validation

Re-ran the #1547 harness plus a long-context decode A/B on mlx-community/Llama-3.2-1B-Instruct-4bit (ctx=3584, two-point measurement so prefill cancels and only decode steps are timed):

| scenario | 0.5.0 (tok/s) | 0.6.0 (tok/s) | Δ |
|---|---|---|---|
| batch TQ (B=4, masked decode) | 165.1 | 249.8 | +51% |
| single TQ (B=1, causal/fused path) | 124.2 | 126.1 | flat |
| batch fp16 (reference) | 506.5 | 493.8 | within noise |

The gain is isolated to the masked batch-decode path. Single-seq (fused causal kernel) and fp16 are unchanged, which is consistent with the improvement coming from the value kernel rather than from measurement drift.
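The two-point measurement mentioned above can be sketched as follows: time two generations that share the same prompt but differ in the number of new tokens, then divide the token difference by the time difference so that the shared prefill cost drops out. The function names and the generate callable are hypothetical stand-ins, not the actual #1547 harness.

```python
import time
from typing import Callable


def two_point_decode_rate(n_short: int, t_short: float,
                          n_long: int, t_long: float) -> float:
    """Decode throughput in tok/s from two (tokens, seconds) measurements.

    Both runs are assumed to pay the same prefill cost, so it cancels in
    the difference and only the extra decode steps are counted.
    """
    if n_long <= n_short or t_long <= t_short:
        raise ValueError("the long run must produce more tokens and take longer")
    return (n_long - n_short) / (t_long - t_short)


def time_generation(generate: Callable[[int], None], max_new_tokens: int) -> float:
    """Wall-clock seconds for one call to a caller-supplied generate function."""
    start = time.perf_counter()
    generate(max_new_tokens)
    return time.perf_counter() - start


def measure_decode_rate(generate: Callable[[int], None],
                        n_short: int = 32, n_long: int = 160) -> float:
    """Run the short and long generations and return the decode-only rate."""
    t_short = time_generation(generate, n_short)
    t_long = time_generation(generate, n_long)
    return two_point_decode_rate(n_short, t_short, n_long, t_long)
```

In practice each point would likely be repeated a few times after a warm-up run, since a single pair of timings is sensitive to noise.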

KV occupancy and accuracy are identical to 0.5.0 (TQ 0.312× fp16; 69% projected saving at 8K context; batch-vs-single token match unchanged).
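As a rough cross-check of these occupancy figures, the sketch below computes the fp16 KV-cache size and the saving implied by the 0.312× ratio. The model shape (16 layers, 8 KV heads, head_dim 64) is an assumption about Llama-3.2-1B that is not stated in this PR, and treating the 69% projection as simply 1 minus the ratio is also an assumption.

```python
# Assumed Llama-3.2-1B attention shape (not taken from this PR).
NUM_LAYERS = 16
NUM_KV_HEADS = 8
HEAD_DIM = 64
FP16_BYTES = 2

# Ratio reported in the PR: TurboQuant KV occupancy relative to fp16.
TQ_RATIO = 0.312


def fp16_kv_bytes(context_len: int) -> int:
    """Bytes for keys plus values across all layers at fp16."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * FP16_BYTES
    return per_token * context_len


def tq_kv_bytes(context_len: int, ratio: float = TQ_RATIO) -> float:
    """Projected TurboQuant KV bytes, assuming the ratio holds at this length."""
    return fp16_kv_bytes(context_len) * ratio


def saving_fraction(ratio: float = TQ_RATIO) -> float:
    """Fraction of fp16 KV memory saved at the given occupancy ratio."""
    return 1.0 - ratio
```

Under these assumptions, fp16_kv_bytes(8192) is 256 MiB per sequence, tq_kv_bytes(8192) is roughly 80 MiB, and saving_fraction() rounds to the 69% quoted above.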

Blast radius

  • No oMLX code changes. TurboQuantKVCache.{decode,prefill}_attention and from_cache signatures are unchanged; the existing omlx/patches/turboquant_attention.py routing works as-is.
  • transformers (already 5.8.1) satisfies 0.6.0's >=5.5.0; mlx-audio was already a transitive core dep, so no new core packages.

Test plan

  • TurboQuant suite (-m turboquant): 32/32 pass on 0.6.0
  • Full CI-equivalent suite (pytest -m "not slow and not integration"): 4877 passed, 37 skipped (the only remaining failures are the pre-existing local dflash_mlx env failures, which do not occur in CI and are unrelated to this bump)
  • #1547 memory/accuracy harness: occupancy + accuracy identical to 0.5.0
  • Long-context decode A/B: +51% batch-TQ decode (table above)

mlx-vlm 0.6.0 (PyPI release) supersedes the 6f60ee4 commit pin. Beyond the
TurboQuant decode correctness fixes oMLX already relied on
(Blaizzy/mlx-vlm#1244), it lands the RHT value-kernel speedup
(Blaizzy/mlx-vlm#1252): the L=1 value-sum now runs the Metal kernel under RHT
instead of falling back to an einsum. Because that cost scales with KV length,
batched TurboQuant masked decode gets markedly faster at long context.

Validated with the jundot#1547 harness + a long-context decode A/B on
Llama-3.2-1B-Instruct-4bit (ctx=3584, prefill cancelled):

  batch TQ (B=4) decode:  165 -> 250 tok/s  (+51%)
  single TQ (B=1):        124 -> 126 tok/s  (flat; causal/fused path)
  batch fp16 (reference): 506 -> 494 tok/s  (within noise)

No oMLX code change needed: TurboQuantKVCache.{decode,prefill}_attention and
from_cache signatures are unchanged. KV occupancy/accuracy are identical (TQ
0.312x fp16, 69% projected long-context saving); TQ suite 32/32 and the full
CI-equivalent suite pass.
@popfido popfido force-pushed the perf/mlx-vlm-0.6.0 branch from f7eb6dd to 5e5c5b9 Compare June 2, 2026 04:52