fix(qwen3-vl): size KV from --max-ctx + chunk prefill (#138) by Pushkinist · Pull Request #145 · Pushkinist/rMLX

Pushkinist · 2026-06-18T16:40:44Z

Summary

Fixes #138.

Qwen3-VL image inference ignored --max-ctx: the KV cache stayed at 4096 tokens, so any prompt over 4096 tokens (a large image tiles to thousands of soft tokens) failed with the MLX error slice_update [broadcast_shapes] (1,4,N,128) vs (1,4,4096,128), where N is the prompt length.

Root cause

crates/rmlx-models/src/qwen3_vl_moe/generate.rs built per-layer KV via KvCache::with_quant(kv_quant), which defaults max_seq to KV_MAX_SEQ_DEFAULT=4096. It never called with_max_seq_ceiling and never bracketed prefill with enter_prefill()/exit_prefill(). With in_prefill=false, the None-path prefill wrote to the fixed 4096 decode buffer instead of the lazy-grow ring. Every other arch (gemma4, qwen3_5_moe, qwen3) sizes KV from kv_max_seq_and_ceiling(max_ctx_override, mpe) and brackets prefill; Qwen3-VL did neither, and the image path never even received max_ctx_override.
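The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not rMLX's KvCache: the ToyKv struct, its fields, the WriteError type, and the doubling growth policy are all assumptions made for illustration. Only the names KV_MAX_SEQ_DEFAULT, enter_prefill, and exit_prefill are borrowed from the PR text.

```rust
// Toy model of the failure mode. NOT rMLX's real KvCache: the struct,
// its fields, and the doubling growth policy are assumptions.
const KV_MAX_SEQ_DEFAULT: usize = 4096;

#[derive(Debug, PartialEq)]
enum WriteError {
    /// Stand-in for the MLX slice_update [broadcast_shapes] failure.
    BroadcastShapes { incoming: usize, buffer: usize },
}

#[derive(Debug)]
struct ToyKv {
    max_seq: usize,   // current capacity in tokens
    len: usize,       // tokens stored so far
    in_prefill: bool, // true between enter_prefill() and exit_prefill()
}

impl ToyKv {
    /// Mirrors the buggy construction: capacity falls back to the default.
    fn with_default_max_seq() -> Self {
        ToyKv { max_seq: KV_MAX_SEQ_DEFAULT, len: 0, in_prefill: false }
    }

    fn enter_prefill(&mut self) {
        self.in_prefill = true;
    }

    fn exit_prefill(&mut self) {
        self.in_prefill = false;
    }

    /// Append `n_tokens` of keys/values.
    fn write(&mut self, n_tokens: usize) -> Result<(), WriteError> {
        let needed = self.len + n_tokens;
        if self.in_prefill {
            // Growable path: capacity doubles until the write fits.
            while self.max_seq < needed {
                self.max_seq *= 2;
            }
        } else if needed > self.max_seq {
            // Fixed decode buffer: an oversized write cannot be broadcast.
            return Err(WriteError::BroadcastShapes {
                incoming: n_tokens,
                buffer: self.max_seq,
            });
        }
        self.len = needed;
        Ok(())
    }
}
```

In this sketch, writing 6776 tokens without entering prefill fails against the 4096 buffer, while the same write between enter_prefill() and exit_prefill() grows the capacity to 8192 and succeeds.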

Fix (general — same chain as the other arches, no model-name special-casing)

- Both the image and text paths resolve (initial_max_seq, ceiling) via kv_max_seq_and_ceiling(max_ctx_override, mpe) and build caches with with_quant_max_seq(...).with_max_seq_ceiling(...), a call shape byte-identical to gemma4/qwen3_5_moe.
- max_ctx_override is plumbed through run_qwen3vl_image → generate_image.
- Prefill is chunked (text via the shared chunked_prefill; image via a new forward_embeds_chunked with per-chunk deepstack slicing), governed by a qwen3_vl_moe row with prefill_chunk=512, so a long prompt is no longer one oversized Metal command buffer. The ViT now evals per block and materializes its merger/deepstack outputs.
- An over-cap prompt now hits the ceiling in ensure_prefill_capacity and returns KvCeilingExceeded (already classified Fatal → context_overflow), instead of the migratable slice_update broadcast error that was retried 3× and delivered 0 tokens.
- Removed the now-dead single-shot forward_embeds.
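The combination of chunked prefill and a ceiling-bounded, lazily growing cache can be sketched as follows. This is a simplified illustration under stated assumptions, not the rMLX code: the function signatures, the KvCeilingExceeded fields, and the doubling growth policy are hypothetical, and the model forward pass is reduced to a comment.

```rust
/// Hypothetical error shape; the real KvCeilingExceeded may carry other data.
#[derive(Debug, PartialEq)]
struct KvCeilingExceeded {
    needed: usize,
    ceiling: usize,
}

/// Return a capacity that holds `needed` tokens, growing by doubling
/// (an assumed policy) but never past `ceiling`.
fn ensure_prefill_capacity(
    max_seq: usize,
    needed: usize,
    ceiling: usize,
) -> Result<usize, KvCeilingExceeded> {
    if needed > ceiling {
        return Err(KvCeilingExceeded { needed, ceiling });
    }
    let mut cap = max_seq.max(1);
    while cap < needed {
        cap = cap.saturating_mul(2).min(ceiling);
    }
    Ok(cap)
}

/// Split [0, total) into half-open ranges of at most `chunk` tokens.
fn chunk_ranges(total: usize, chunk: usize) -> Vec<(usize, usize)> {
    assert!(chunk > 0, "chunk size must be positive");
    (0..total)
        .step_by(chunk)
        .map(|start| (start, (start + chunk).min(total)))
        .collect()
}

/// Walk the prompt chunk by chunk, growing the cache only as needed.
/// Returns the final capacity, or the ceiling error for an over-cap prompt.
fn chunked_prefill(
    total: usize,
    chunk: usize,
    initial_max_seq: usize,
    ceiling: usize,
) -> Result<usize, KvCeilingExceeded> {
    let mut cap = initial_max_seq.min(ceiling);
    for (_start, end) in chunk_ranges(total, chunk) {
        cap = ensure_prefill_capacity(cap, end, ceiling)?;
        // Real code would run the model forward on tokens[_start..end] here.
    }
    Ok(cap)
}
```

With these assumptions, a 4913-token prompt at chunk size 512, initial capacity 4096, and ceiling 16384 finishes with capacity 8192, while a prompt longer than the ceiling stops at the first chunk that would cross it and returns the typed error rather than a shape mismatch.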

Changes

crates/rmlx-models/src/qwen3_vl_moe/generate.rs — both paths: ceiling-sized caches + chunked prefill; max_ctx_override used (was _-prefixed) + added to generate_image.
crates/rmlx-models/src/qwen3_vl_moe/model.rs — new forward_embeds_chunked (per-chunk deepstack-row slicing + watchdog flush); dead forward_embeds removed.
crates/rmlx-models/src/qwen3_vl_moe/vision.rs — per-block h.eval() + materialize merger/deepstack outputs (ViT watchdog mitigation; numerically identical).
crates/rmlx-models/src/qwen3_vl_moe/generate_tests.rs — NEW: >4096 prefill fits, ceiling honored (not 4096), over-cap → KvCeilingExceeded not broadcast, small image unaffected.
crates/rmlx-models/src/prefill_chunk.rs (+tests) — qwen3_vl_moe → 512 row.
crates/rmlx-server/src/engine/{image.rs,arch_generator.rs} — thread max_ctx_override into the image path.
docs/MODELS.md — Qwen3-VL --max-ctx KV-sizing + chunking + ViT-watchdog note.
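One way to picture the per-chunk deepstack-row slicing mentioned for forward_embeds_chunked is the index arithmetic below. This is a sketch under an assumption the PR does not spell out: that deepstack feature rows are stored in order of image-token occurrence in the prompt, so a chunk needs exactly the rows for the image tokens it contains. The function name and the boolean mask are hypothetical.

```rust
use std::ops::Range;

/// Rows of a deepstack feature matrix that belong to prompt positions
/// `chunk`. Assumes one row per image token, stored in prompt order.
fn deepstack_rows_for_chunk(is_image_token: &[bool], chunk: Range<usize>) -> Range<usize> {
    let count = |r: Range<usize>| is_image_token[r].iter().filter(|&&b| b).count();
    // Image tokens before the chunk determine where its rows start.
    let first = count(0..chunk.start);
    // Image tokens inside the chunk determine how many rows it needs.
    let n = count(chunk);
    first..first + n
}
```

For a prompt laid out as two text tokens, five image tokens, and one text token, chunks of four positions would take rows 0..2 and then rows 2..5.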

Verification

make lint clean; cargo test -p rmlx-models -p rmlx-server all pass (new generate_tests 4/4, prefill_chunk 8/8). Rust review: no findings.
Real-model proof (Qwen3-VL-30B-A3B-Instruct-4bit, --max-ctx 16384):
- 4913-token text prompt completes coherently; KV ring grows past 4096 — log kv_alloc cause="grow" offset=4914 max_seq=8192.
- Over-cap (--max-ctx 4000) → {"error":{"message":"prompt has 4913 tokens, max_ctx is 4000","type":"context_length_exceeded"}} — clean, no panic.
- 2560×2560 image (6400 soft tokens): zero slice_update [broadcast_shapes] across the whole run — the 4096-broadcast symptom is gone.
- Small image (196 soft tokens) regression: coherent.
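The soft-token counts above are consistent with a simple patch calculation. The sketch below assumes a 16-pixel ViT patch and a 2×2 spatial merge, with image sides that are multiples of 32; these constants are assumptions chosen because they reproduce the 6400 figure for 2560×2560, and real preprocessing may resize the image first.

```rust
const PATCH: usize = 16; // assumed ViT patch edge in pixels
const MERGE: usize = 2;  // assumed spatial merge factor per axis

/// Number of ViT patches for a width x height image.
fn vit_patches(width: usize, height: usize) -> usize {
    (width / PATCH) * (height / PATCH)
}

/// Number of soft tokens the language model sees after the 2x2 merge.
fn soft_tokens(width: usize, height: usize) -> usize {
    vit_patches(width, height) / (MERGE * MERGE)
}
```

Under these assumptions a 2560×2560 image yields 25,600 patches and 6400 soft tokens, well past the old 4096 buffer. A 448×448 image would yield 196 soft tokens, though the PR does not state the dimensions of the small test image.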

Known separate limit (not this bug)

On a watchdog-constrained Apple Silicon GPU, a Qwen3-VL image that produces more than ~1300 soft tokens (roughly 5200+ ViT patches under O(n²) full attention) can overrun the ~10s Metal GPU watchdog in the vision tower, before text decode begins. This is a pre-existing vision-tower scaling limit, independent of KV sizing. The KV fix lets such images complete on a GPU with sufficient watchdog headroom (the original reporter's machine cleared the ViT and failed only at the now-fixed KV slice_update). ViT streaming/tiling is tracked as a follow-up.
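To see why the vision tower scales so differently from KV sizing, the quadratic term can be worked out directly. The sketch below counts attention score entries per head per layer for n patches; the function names are hypothetical, and the 25,600-patch figure assumes the 4:1 patch-to-soft-token ratio implied by the numbers in this PR (6400 soft tokens for the 2560×2560 image).

```rust
/// Attention score entries per head per layer under full attention.
fn attention_score_entries(n_patches: u64) -> u64 {
    n_patches * n_patches
}

/// How many times more attention work `large` patches need than `small`.
fn attention_cost_ratio(small: u64, large: u64) -> f64 {
    attention_score_entries(large) as f64 / attention_score_entries(small) as f64
}
```

Going from about 5200 patches to 25,600 patches is roughly a 5× increase in patch count but about a 24× increase in attention score entries, which is why chunking the text prefill does not help the ViT stage.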

🤖 Generated with Claude Code

…138)

Qwen3-VL image/text generate built per-layer KV caches with the bare KV_MAX_SEQ_DEFAULT=4096 and never bracketed prefill, so a prompt longer than 4096 (a large image tiles to thousands of soft tokens) overflowed the fixed decode buffer: `slice_update: [broadcast_shapes] (1,4,6776,128) vs (1,4,4096,128)`. The KV stayed at 4096 regardless of --max-ctx.

Both paths now resolve (initial_max_seq, ceiling) from the effective --max-ctx via kv_max_seq_and_ceiling (same chain as Gemma4 / Qwen3.5-MoE) and build the ring with with_quant_max_seq(...).with_max_seq_ceiling(...). The ring grows lazily up to the ceiling so a prompt up to --max-ctx fits; an over-cap prompt is rejected with a clean KvCeilingExceeded -> context_overflow instead of a cryptic slice_update broadcast. Plumbs max_ctx_override through run_qwen3vl_image -> generate_image.

Both prefills are also chunked (per-arch prefill_chunk=512) so a long prompt is not one multi-thousand-token Metal command buffer (the ~10s GPU watchdog): text uses the shared chunked_prefill; image uses a new forward_embeds_chunked with per-chunk deepstack slicing. The ViT now evaluates per block + materializes its merger/deepstack outputs so the vision graph does not fold into the first prefill chunk.

Proven on mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit: a >4096-token text prompt completes with coherent output (KV grows 4096->8192), an over-cap prompt returns context_length_exceeded, and a small image is unaffected.

A very large image (tens of thousands of ViT patches, O(n^2) full attention) can still hit the GPU watchdog on memory-constrained Apple Silicon. This is a vision-tower scaling limit independent of the KV-sizing fix; documented in docs/MODELS.md.

Pushkinist self-assigned this Jun 18, 2026

Pushkinist merged commit 941336a into main Jun 18, 2026
2 checks passed

Pushkinist deleted the fix/138-qwen3vl-maxctx-kv branch June 18, 2026 16:45

Pushkinist mentioned this pull request Jun 18, 2026

Qwen3-VL large-image ViT full-attention overruns Metal GPU watchdog (>~1300 soft tokens) #146

Open
