perf(turboquant): bump mlx-vlm to 0.6.0 for batched-TQ decode speedup by popfido · Pull Request #1599 · jundot/omlx

popfido · 2026-06-02T04:52:15Z

Summary

Bumps the mlx-vlm dependency from the 6f60ee4 commit pin to the 0.6.0 PyPI release. This is the upstream-released version of the TurboQuant decode fixes oMLX has been carrying, and it additionally lands the RHT value-kernel speedup (Blaizzy/mlx-vlm#1252) that directly accelerates the batched TurboQuant decode path wired up in #1547.

Why it's faster

In 0.5.0 the L=1 value-sum fast kernel was gated off whenever RHT was active (if not self.use_rht and ...) and fell back to an einsum on every decode step. RHT is on for any power-of-two head dim (e.g. Llama head_dim 64), so batched-TQ decode always took the slow path. 0.6.0 removes that gate and applies the RHT inverse on the kernel result (_value_rotate_inverse), so the Metal kernel runs under RHT. The value-sum cost scales with KV length, so the win shows at long context — exactly the regime where you'd reach for TurboQuant KV compression in the first place.
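The reason the inverse can be applied to the kernel result, rather than to the cached values, is linearity: a weighted sum of rotated rows equals the rotation of the weighted sum. The sketch below illustrates that identity in pure Python. It assumes RHT means a randomized Hadamard transform (random sign flips followed by a normalized Hadamard transform); the PR does not spell the acronym out. Every name here (`fwht`, `rotate`, `rotate_inverse`, `max_rotation_error`) is hypothetical and is not mlx-vlm's Metal kernel or its `_value_rotate_inverse`.

```python
import math
import random


def fwht(vec):
    """Normalized fast Walsh-Hadamard transform (self-inverse)."""
    out = list(vec)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]


def rotate(vec, signs):
    """Assumed RHT: flip signs, then apply the Hadamard transform."""
    return fwht([s * x for s, x in zip(signs, vec)])


def rotate_inverse(vec, signs):
    """Undo rotate(): Hadamard transform, then flip signs back."""
    return [s * x for s, x in zip(signs, fwht(vec))]


def weighted_sum(weights, rows):
    """Attention-style value sum: sum_i weights[i] * rows[i]."""
    dim = len(rows[0])
    return [sum(w * row[d] for w, row in zip(weights, rows)) for d in range(dim)]


def max_rotation_error(kv_len=16, head_dim=64, seed=0):
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(head_dim)]
    values = [[rng.gauss(0.0, 1.0) for _ in range(head_dim)] for _ in range(kv_len)]
    raw = [rng.random() for _ in range(kv_len)]
    weights = [r / sum(raw) for r in raw]

    # The cache is assumed to hold rotated values.
    rotated = [rotate(v, signs) for v in values]

    # Reference: undo the rotation on every cached row, then sum.
    reference = weighted_sum(weights, [rotate_inverse(r, signs) for r in rotated])
    # Alternative: sum in the rotated domain, undo the rotation once on the result.
    fast = rotate_inverse(weighted_sum(weights, rotated), signs)

    return max(abs(a - b) for a, b in zip(reference, fast))
```

Calling `max_rotation_error()` returns the largest element-wise gap between the two orderings, which should sit at floating-point rounding level.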

Validation

Re-ran the #1547 harness plus a long-context decode A/B on mlx-community/Llama-3.2-1B-Instruct-4bit (ctx=3584, two-point measurement so prefill cancels and only decode steps are timed):

scenario	0.5.0 (tok/s)	0.6.0 (tok/s)	Δ
batch TQ (B=4, masked decode)	165.1	249.8	+51%
single TQ (B=1, causal/fused path)	124.2	126.1	flat
batch fp16 (reference)	506.5	493.8	within noise
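The two-point measurement mentioned above can be sketched as follows: time two runs that share the same prompt but generate different numbers of tokens, then divide the extra tokens by the extra time so the shared prefill cost drops out. This is a minimal illustration, not the #1547 harness; `decode_throughput` is a hypothetical helper, and counting tokens across the whole batch is an assumption about how the tok/s figures are aggregated.

```python
def decode_throughput(short_run, long_run, batch_size=1):
    """Decode-only tokens/second from two timed runs.

    Each run is (generated_tokens_per_sequence, wall_seconds). Both runs are
    assumed to share the same prompt, so prefill time cancels in the difference.
    """
    n_short, t_short = short_run
    n_long, t_long = long_run
    if n_long <= n_short or t_long <= t_short:
        raise ValueError("long run must generate more tokens and take longer")
    return batch_size * (n_long - n_short) / (t_long - t_short)


# Synthetic timings for illustration only (not measurements from this PR):
# a fixed prefill cost plus a constant cost per decode step.
prefill_s, step_s = 2.0, 0.016
short_run = (32, prefill_s + 32 * step_s)
long_run = (128, prefill_s + 128 * step_s)
tok_per_s = decode_throughput(short_run, long_run, batch_size=4)
```

With these synthetic numbers the prefill term vanishes and `tok_per_s` reduces to `batch_size / step_s`, whatever value `prefill_s` takes.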

The gain is isolated to the masked batch-decode path; single-seq (fused causal kernel) and fp16 are unchanged — confirming the improvement is the value kernel, not measurement drift.

KV occupancy and accuracy are identical to 0.5.0 (TQ 0.312× fp16; 69% projected saving at 8K context; batch-vs-single token match unchanged).
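The headline figures can be recomputed from the numbers reported above. The snippet below is only a sanity check on the arithmetic; it assumes the 69% projected saving is simply one minus the 0.312× occupancy ratio, which the PR does not state explicitly.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent."""
    return 100.0 * (after - before) / before


# (0.5.0 tok/s, 0.6.0 tok/s) pairs copied from the validation table.
results = {
    "batch TQ (B=4)": (165.1, 249.8),
    "single TQ (B=1)": (124.2, 126.1),
    "batch fp16": (506.5, 493.8),
}
deltas = {name: round(percent_change(b, a), 1) for name, (b, a) in results.items()}

# Assumption: saving = 1 - (TQ bytes / fp16 bytes).
tq_ratio = 0.312
saving_pct = round(100.0 * (1.0 - tq_ratio), 1)
```

Under that assumption the batch-TQ delta comes out at about +51.3%, the single-TQ and fp16 deltas at about +1.5% and -2.5%, and the saving at about 68.8%, in line with the rounded values quoted in the PR.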

Blast radius

No oMLX code changes. TurboQuantKVCache.{decode,prefill}_attention and from_cache signatures are unchanged; the existing omlx/patches/turboquant_attention.py routing works as-is.
transformers (already 5.8.1) satisfies 0.6.0's >=5.5.0; mlx-audio was already a transitive core dep, so no new core packages.

Test plan

TurboQuant suite (-m turboquant): 32/32 pass on 0.6.0
Full CI-equivalent suite (pytest -m "not slow and not integration"): 4877 passed, 37 skipped (only the pre-existing, CI-absent local dflash_mlx env failures remain — unrelated to this bump)
#1547 memory/accuracy harness (feat(turboquant): batched KV-cache compression (single + batch), no worse than single): occupancy + accuracy identical to 0.5.0
Long-context decode A/B: +51% batch-TQ decode (table above)

mlx-vlm 0.6.0 (PyPI release) supersedes the 6f60ee4 commit pin. Beyond the TurboQuant decode correctness fixes oMLX already relied on (Blaizzy/mlx-vlm#1244), it lands the RHT value-kernel speedup (Blaizzy/mlx-vlm#1252): the L=1 value-sum now runs the Metal kernel under RHT instead of falling back to an einsum. Because that cost scales with KV length, batched TurboQuant masked decode gets markedly faster at long context.

Validated with the #1547 harness + a long-context decode A/B on Llama-3.2-1B-Instruct-4bit (ctx=3584, prefill cancelled):

batch TQ (B=4) decode: 165 -> 250 tok/s (+51%)
single TQ (B=1): 124 -> 126 tok/s (flat; causal/fused path)
batch fp16 (reference): 506 -> 494 tok/s (within noise)

No oMLX code change needed: TurboQuantKVCache.{decode,prefill}_attention and from_cache signatures are unchanged. KV occupancy/accuracy are identical (TQ 0.312x fp16, 69% projected long-context saving); TQ suite 32/32 and the full CI-equivalent suite pass.

popfido force-pushed the perf/mlx-vlm-0.6.0 branch from f7eb6dd to 5e5c5b9 Compare June 2, 2026 04:52

popfido wants to merge 1 commit into jundot:main from popfido:perf/mlx-vlm-0.6.0
