perf(turboquant): bump mlx-vlm to 0.6.0 for batched-TQ decode speedup #1599
Open
popfido wants to merge 1 commit into
mlx-vlm 0.6.0 (PyPI release) supersedes the 6f60ee4 commit pin. Beyond the TurboQuant decode correctness fixes oMLX already relied on (Blaizzy/mlx-vlm#1244), it lands the RHT value-kernel speedup (Blaizzy/mlx-vlm#1252): the L=1 value-sum now runs the Metal kernel under RHT instead of falling back to an einsum. Because that cost scales with KV length, batched TurboQuant masked decode gets markedly faster at long context.

Validated with the #1547 harness + a long-context decode A/B on Llama-3.2-1B-Instruct-4bit (ctx=3584, prefill cancelled):

  batch TQ (B=4) decode:   165 -> 250 tok/s (+51%)
  single TQ (B=1):         124 -> 126 tok/s (flat; causal/fused path)
  batch fp16 (reference):  506 -> 494 tok/s (within noise)

No oMLX code change needed: TurboQuantKVCache.{decode,prefill}_attention and from_cache signatures are unchanged. KV occupancy/accuracy are identical (TQ 0.312x fp16, 69% projected long-context saving); TQ suite 32/32 and the full CI-equivalent suite pass.
f7eb6dd to 5e5c5b9
Summary
Bumps the mlx-vlm dependency from the 6f60ee4 commit pin to the 0.6.0 PyPI release. This is the upstream-released version of the TurboQuant decode fixes oMLX has been carrying, and it additionally lands the RHT value-kernel speedup (Blaizzy/mlx-vlm#1252) that directly accelerates the batched TurboQuant decode path wired up in #1547.

Why it's faster
In 0.5.0 the L=1 value-sum fast kernel was gated off whenever RHT was active (if not self.use_rht and ...) and fell back to an einsum on every decode step. RHT is on for any power-of-two head dim (e.g. Llama head_dim 64), so batched-TQ decode always took the slow path. 0.6.0 removes that gate and applies the RHT inverse on the kernel result (_value_rotate_inverse), so the Metal kernel runs under RHT. The value-sum cost scales with KV length, so the win shows at long context, which is exactly the regime where you'd reach for TurboQuant KV compression in the first place.

Validation
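As a sanity check on the mechanism described under "Why it's faster", independent of the benchmark: applying the inverse rotation once to the kernel result is valid because the value-sum is linear. The sketch below illustrates that property in pure Python. It is not mlx-vlm code: it assumes a plain orthonormal Walsh-Hadamard transform stands in for RHT (no random sign flips), ignores quantization, and every name in it (fwht, weighted_sum, path_a, path_b) is hypothetical.

```python
import math
import random

def fwht(vec):
    """Orthonormal fast Walsh-Hadamard transform (its own inverse)."""
    out = list(vec)
    n = len(out)
    assert n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def weighted_sum(weights, rows):
    """Attention-style value-sum: sum_i weights[i] * rows[i]."""
    dim = len(rows[0])
    return [sum(w * row[d] for w, row in zip(weights, rows)) for d in range(dim)]

random.seed(0)
head_dim, kv_len = 64, 128  # illustrative sizes only
values = [[random.gauss(0, 1) for _ in range(head_dim)] for _ in range(kv_len)]
raw = [random.random() for _ in range(kv_len)]
weights = [r / sum(raw) for r in raw]  # normalized, like softmax output

# Assumption: values are held in rotated space.
rotated = [fwht(v) for v in values]

# Reference: undo the rotation on every row, then sum (kv_len inverse transforms).
path_a = weighted_sum(weights, [fwht(r) for r in rotated])

# Sum in rotated space, then undo the rotation once on the result.
path_b = fwht(weighted_sum(weights, rotated))

max_diff = max(abs(a - b) for a, b in zip(path_a, path_b))
print(f"max abs difference: {max_diff:.2e}")
```

The two paths agree up to floating-point rounding, and the second needs only one inverse transform per decode step regardless of KV length.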
Re-ran the #1547 harness plus a long-context decode A/B on mlx-community/Llama-3.2-1B-Instruct-4bit (ctx=3584, two-point measurement so prefill cancels and only decode steps are timed):

  batch TQ (B=4) decode:   165 -> 250 tok/s (+51%)
  single TQ (B=1):         124 -> 126 tok/s (flat; causal/fused path)
  batch fp16 (reference):  506 -> 494 tok/s (within noise)

The gain is isolated to the masked batch-decode path; single-seq (fused causal kernel) and fp16 are unchanged, confirming the improvement is the value kernel, not measurement drift.
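For readers unfamiliar with the two-point method, here is a minimal sketch of the arithmetic, assuming each run reports total generated tokens and total wall time for the same prompt. The function names and the timings in the example are hypothetical and are not taken from the #1547 harness; only the before/after tok/s figures come from the results above.

```python
def decode_tok_per_s(short_run, long_run):
    """Decode throughput from two runs on the same prompt.

    Each run is (generated_tokens, wall_seconds). Both runs pay the same
    prefill cost, so it cancels in the difference.
    """
    (n1, t1), (n2, t2) = short_run, long_run
    return (n2 - n1) / (t2 - t1)

def pct_change(before, after):
    """Relative change in percent."""
    return 100.0 * (after - before) / before

# Hypothetical timings: 2.0 s of prefill plus 4 ms per generated token.
rate = decode_tok_per_s(short_run=(16, 2.064), long_run=(80, 2.32))
print(f"decode rate: {rate:.1f} tok/s")

# Relative changes for the reported before/after figures.
for label, before, after in [
    ("batch TQ (B=4)", 165, 250),
    ("single TQ (B=1)", 124, 126),
    ("batch fp16", 506, 494),
]:
    print(f"{label}: {pct_change(before, after):+.1f}%")
```

Timing a single run end to end would fold prefill into the rate, which is why the difference of two runs is used.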
KV occupancy and accuracy are identical to 0.5.0 (TQ 0.312× fp16; 69% projected saving at 8K context; batch-vs-single token match unchanged).
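A rough consistency check on those two figures, as a sketch only: a 0.312× occupancy ratio on its own implies a saving of about 68.8%, in line with the quoted 69%, though the projection may have been computed differently. The model dimensions below (16 layers, 8 KV heads, head_dim 64) are assumed for Llama-3.2-1B, and kv_bytes_fp16 is a hypothetical helper, not an oMLX or mlx-vlm function.

```python
def kv_bytes_fp16(layers, kv_heads, head_dim, tokens, bytes_per_elem=2):
    """fp16 KV-cache size: keys and values for every layer, head, and token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_elem

TQ_RATIO = 0.312  # TQ occupancy relative to fp16, from the PR

# Assumed Llama-3.2-1B dimensions at 8K context.
fp16 = kv_bytes_fp16(layers=16, kv_heads=8, head_dim=64, tokens=8192)
tq = fp16 * TQ_RATIO
saving = 1.0 - tq / fp16

print(f"fp16 KV: {fp16 / 2**20:.0f} MiB")
print(f"TQ KV:   {tq / 2**20:.1f} MiB")
print(f"saving:  {saving:.1%}")
```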
Blast radius
TurboQuantKVCache.{decode,prefill}_attention and from_cache signatures are unchanged; the existing omlx/patches/turboquant_attention.py routing works as-is.
transformers (already 5.8.1) satisfies 0.6.0's >=5.5.0; mlx-audio was already a transitive core dep, so no new core packages.

Test plan
- TQ suite (pytest -m turboquant): 32/32 pass on 0.6.0
- Full CI-equivalent suite (pytest -m "not slow and not integration"): 4877 passed, 37 skipped (only the pre-existing, CI-absent local dflash_mlx env failures remain; they are unrelated to this bump)