fix(qwen3-vl): size KV from --max-ctx + chunk prefill (#138) #145

Merged
Pushkinist merged 1 commit into main from fix/138-qwen3vl-maxctx-kv
Jun 18, 2026

Conversation

@Pushkinist

Owner

Summary

Fixes #138.

Qwen3-VL image inference ignored --max-ctx: the KV cache stayed at 4096, so a prompt over 4096 tokens (a large image tiles to thousands of soft tokens) hit mlx: slice_update [broadcast_shapes] (1,4,N,128) vs (1,4,4096,128).

Root cause

crates/rmlx-models/src/qwen3_vl_moe/generate.rs built per-layer KV via KvCache::with_quant(kv_quant) — which defaults max_seq to KV_MAX_SEQ_DEFAULT=4096 — and never called with_max_seq_ceiling nor bracketed prefill with enter_prefill()/exit_prefill(). With in_prefill=false, the None-path prefill went to the fixed 4096 decode buffer instead of the lazy-grow ring. Every other arch (gemma4, qwen3_5_moe, qwen3) sizes KV from kv_max_seq_and_ceiling(max_ctx_override, mpe) and brackets prefill; Qwen3-VL did neither, and the image path never even received max_ctx_override.
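The failure mode described above can be illustrated with a toy model. This is a minimal sketch, not the rmlx-models code: `ToyKv`, `ToyKvError`, and the doubling growth policy are assumptions made for illustration, and only the method names echo the real API. It shows why an un-bracketed prefill lands on the fixed 4096 buffer while a bracketed one with a ceiling can grow.

```rust
// Toy model only: `ToyKv` is a hypothetical stand-in, not rmlx-models' `KvCache`.
const KV_MAX_SEQ_DEFAULT: usize = 4096;

#[derive(Debug, PartialEq)]
enum ToyKvError {
    /// Write did not fit the fixed buffer (stands in for the slice_update broadcast error).
    ShapeMismatch { got: usize, capacity: usize },
    /// Write would exceed the configured ceiling (stands in for KvCeilingExceeded).
    CeilingExceeded { needed: usize, ceiling: usize },
}

struct ToyKv {
    len: usize,
    max_seq: usize,
    ceiling: usize,
    in_prefill: bool,
}

impl ToyKv {
    /// Without an explicit ceiling the cache can never grow past `max_seq`.
    fn new(max_seq: usize) -> Self {
        ToyKv { len: 0, max_seq, ceiling: max_seq, in_prefill: false }
    }

    fn with_max_seq_ceiling(mut self, ceiling: usize) -> Self {
        self.ceiling = ceiling.max(self.max_seq);
        self
    }

    fn enter_prefill(&mut self) {
        self.in_prefill = true;
    }

    fn exit_prefill(&mut self) {
        self.in_prefill = false;
    }

    /// Append `n_tokens` positions. Growth is only allowed while in prefill.
    fn append(&mut self, n_tokens: usize) -> Result<(), ToyKvError> {
        let needed = self.len + n_tokens;
        if needed > self.max_seq {
            if !self.in_prefill {
                // Un-bracketed path: the write targets the fixed-size buffer.
                return Err(ToyKvError::ShapeMismatch { got: n_tokens, capacity: self.max_seq });
            }
            if needed > self.ceiling {
                return Err(ToyKvError::CeilingExceeded { needed, ceiling: self.ceiling });
            }
            // Assumed policy: double until the request fits, clamped to the ceiling.
            let mut grown = self.max_seq.max(1);
            while grown < needed {
                grown *= 2;
            }
            self.max_seq = grown.min(self.ceiling);
        }
        self.len = needed;
        Ok(())
    }
}
```

In this toy, `ToyKv::new(KV_MAX_SEQ_DEFAULT).append(6776)` fails with a shape mismatch, while the same cache built with `.with_max_seq_ceiling(16384)` and wrapped in `enter_prefill()`/`exit_prefill()` grows to fit.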

Fix (general — same chain as the other arches, no model-name special-casing)

  • Both the image and text paths resolve (initial_max_seq, ceiling) via kv_max_seq_and_ceiling(max_ctx_override, mpe) and build caches with with_quant_max_seq(...).with_max_seq_ceiling(...) — byte-identical call shape to gemma4/qwen3_5_moe.
  • max_ctx_override plumbed through run_qwen3vl_image → generate_image.
  • Prefill chunked (text via shared chunked_prefill; image via new forward_embeds_chunked with per-chunk deepstack slicing) under a qwen3_vl_moe prefill_chunk=512 row, so a long prompt isn't one oversized Metal command buffer. ViT now evals per block + materializes merger/deepstack outputs.
  • Over-cap now hits ensure_prefill_capacity's ceiling → KvCeilingExceeded (already classified Fatal → context_overflow), instead of the migratable slice_update broadcast that was retried 3× delivering 0 tokens.
  • Removed the now-dead single-shot forward_embeds.
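The chunking bullet above can be sketched as index arithmetic. This is a hedged illustration, not the body of forward_embeds_chunked: `chunk_ranges` and `deepstack_rows_per_chunk` are hypothetical helpers. The sketch assumes deepstack features hold one row per visual position in sequence order, so each chunk needs only the rows for the visual positions it covers.

```rust
use std::ops::Range;

/// Split `total` token positions into consecutive ranges of at most `chunk` tokens.
fn chunk_ranges(total: usize, chunk: usize) -> Vec<Range<usize>> {
    assert!(chunk > 0, "chunk size must be non-zero");
    (0..total)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(total))
        .collect()
}

/// Pair every token chunk with the deepstack rows it needs.
///
/// Assumption: deepstack rows are indexed by visual-token order, so a chunk
/// covering positions [s, e) needs rows [visual_before(s), visual_before(e)).
fn deepstack_rows_per_chunk(
    is_visual: &[bool],
    chunk: usize,
) -> Vec<(Range<usize>, Range<usize>)> {
    let mut rows_seen = 0usize;
    chunk_ranges(is_visual.len(), chunk)
        .into_iter()
        .map(|tokens| {
            // Count visual positions inside this chunk.
            let n_visual = is_visual[tokens.clone()].iter().filter(|&&v| v).count();
            let rows = rows_seen..rows_seen + n_visual;
            rows_seen += n_visual;
            (tokens, rows)
        })
        .collect()
}
```

With a chunk size of 512, a 4913-token prompt splits into ten chunks, the last covering positions 4608..4913. A chunk with no visual positions gets an empty row range, so no deepstack rows would be added for it.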

Changes

  • crates/rmlx-models/src/qwen3_vl_moe/generate.rs — both paths: ceiling-sized caches + chunked prefill; max_ctx_override used (was _-prefixed) + added to generate_image.
  • crates/rmlx-models/src/qwen3_vl_moe/model.rs — new forward_embeds_chunked (per-chunk deepstack-row slicing + watchdog flush); dead forward_embeds removed.
  • crates/rmlx-models/src/qwen3_vl_moe/vision.rs — per-block h.eval() + materialize merger/deepstack outputs (ViT watchdog mitigation; numerically identical).
  • crates/rmlx-models/src/qwen3_vl_moe/generate_tests.rs — NEW: >4096 prefill fits, ceiling honored (not 4096), over-cap → KvCeilingExceeded not broadcast, small image unaffected.
  • crates/rmlx-models/src/prefill_chunk.rs (+tests) — qwen3_vl_moe → 512 row.
  • crates/rmlx-server/src/engine/{image.rs,arch_generator.rs} — thread max_ctx_override into the image path.
  • docs/MODELS.md — Qwen3-VL --max-ctx KV-sizing + chunking + ViT-watchdog note.

Verification

  • make lint clean; cargo test -p rmlx-models -p rmlx-server all pass (new generate_tests 4/4, prefill_chunk 8/8). Rust review: no findings.
  • Real-model proof (Qwen3-VL-30B-A3B-Instruct-4bit, --max-ctx 16384):
    • 4913-token text prompt completes coherently; KV ring grows past 4096 — log kv_alloc cause="grow" offset=4914 max_seq=8192.
    • Over-cap (--max-ctx 4000) → {"error":{"message":"prompt has 4913 tokens, max_ctx is 4000","type":"context_length_exceeded"}} — clean, no panic.
    • 2560×2560 image (6400 soft tokens): zero slice_update [broadcast_shapes] across the whole run — the 4096-broadcast symptom is gone.
    • Small image (196 soft tokens) regression: coherent.
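The soft-token counts quoted above can be reproduced with simple arithmetic. The sketch below assumes 16×16-pixel patches and a 2×2 patch merge; neither value is stated in this PR, but both agree with the figures it reports (2560×2560 giving 6400 soft tokens, and about 5200 patches for about 1300 soft tokens). The function names are hypothetical, and real preprocessing may resize the image first.

```rust
// Assumed geometry (not stated in this PR): 16x16-pixel patches, 2x2 patch merge.
const PATCH_PX: usize = 16;
const MERGE: usize = 2;
const KV_MAX_SEQ_DEFAULT: usize = 4096;

/// Number of ViT patches for an image whose sides are multiples of PATCH_PX.
fn vit_patches(width_px: usize, height_px: usize) -> usize {
    (width_px / PATCH_PX) * (height_px / PATCH_PX)
}

/// Number of soft tokens the language model sees after the patch merge.
fn soft_tokens(width_px: usize, height_px: usize) -> usize {
    vit_patches(width_px, height_px) / (MERGE * MERGE)
}

/// Whether the image alone would overflow the old fixed 4096-entry KV buffer.
fn exceeds_default_kv(width_px: usize, height_px: usize) -> bool {
    soft_tokens(width_px, height_px) > KV_MAX_SEQ_DEFAULT
}
```

Under these assumptions a 2560×2560 image yields 25600 patches and 6400 soft tokens, which is over the old 4096 limit, while a 448×448 image yields 196 soft tokens and stays well under it.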

Known separate limit (not this bug)

On a watchdog-constrained Apple-Silicon GPU, a Qwen3-VL image whose ViT exceeds ~1300 soft tokens (more than roughly 5200 patches, O(n²) full attention) can overrun the ~10s Metal GPU watchdog in the vision tower, before text decode. This is a pre-existing vision-tower scaling limit independent of KV sizing. The KV fix is what completes such images on a GPU with sufficient watchdog headroom (the original reporter's machine cleared the ViT and failed only at the now-fixed KV slice_update). Tracking ViT streaming/tiling as a follow-up.
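To make the O(n²) remark concrete, the sketch below compares pairwise attention work at different patch counts. It is a rough model under stated assumptions: it counts only query-key pairs for full attention over all patches, ignores every other cost in the vision tower, and uses hypothetical function names.

```rust
/// Query-key pairs for full attention over `patches` tokens (the O(n^2) term only).
fn attention_pairs(patches: u64) -> u64 {
    patches * patches
}

/// Attention work at `patches` relative to a `baseline` patch count.
fn relative_attention_cost(patches: u64, baseline: u64) -> f64 {
    attention_pairs(patches) as f64 / attention_pairs(baseline) as f64
}
```

Doubling the patch count quadruples this term. A 25,600-patch input (what a 2560×2560 image would produce, assuming 16-pixel patches) does roughly 24 times the pairwise work of a 5200-patch input, which is why headroom under the watchdog shrinks quickly as images grow.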

🤖 Generated with Claude Code

…138)

Qwen3-VL image/text generate built per-layer KV caches with the bare
KV_MAX_SEQ_DEFAULT=4096 and never bracketed prefill, so a prompt longer than
4096 (a large image tiles to thousands of soft tokens) overflowed the fixed
decode buffer: `slice_update: [broadcast_shapes] (1,4,6776,128) vs
(1,4,4096,128)`. The KV stayed at 4096 regardless of --max-ctx.

Both paths now resolve (initial_max_seq, ceiling) from the effective --max-ctx
via kv_max_seq_and_ceiling (same chain as Gemma4 / Qwen3.5-MoE) and build the
ring with with_quant_max_seq(...).with_max_seq_ceiling(...). The ring grows
lazily up to the ceiling so a prompt up to --max-ctx fits; an over-cap prompt
is rejected with a clean KvCeilingExceeded -> context_overflow instead of a
cryptic slice_update broadcast. Plumbs max_ctx_override through
run_qwen3vl_image -> generate_image.

Both prefills are also chunked (per-arch prefill_chunk=512) so a long prompt is
not one multi-thousand-token Metal command buffer (the ~10s GPU watchdog): text
uses the shared chunked_prefill; image uses a new forward_embeds_chunked with
per-chunk deepstack slicing. The ViT now evaluates per block + materializes its
merger/deepstack outputs so the vision graph does not fold into the first
prefill chunk.

Proven on mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit: a >4096-token text
prompt completes with coherent output (KV grows 4096->8192), an over-cap prompt
returns context_length_exceeded, and a small image is unaffected. A very large
image (tens of thousands of ViT patches, O(n^2) full attention) can still hit
the GPU watchdog on memory-constrained Apple Silicon — a vision-tower scaling
limit independent of the KV-sizing fix; documented in docs/MODELS.md.
@Pushkinist Pushkinist self-assigned this Jun 18, 2026
@Pushkinist Pushkinist merged commit 941336a into main Jun 18, 2026
2 checks passed
@Pushkinist Pushkinist deleted the fix/138-qwen3vl-maxctx-kv branch June 18, 2026 16:45
Development

Successfully merging this pull request may close these issues.

Qwen3-VL image inference ignores --max-ctx: KV cache stays at 4096, large images fail with slice_update broadcast
