Qwen3-VL large-image ViT full-attention overruns Metal GPU watchdog (>~1300 soft tokens)

## Summary

Separate, pre-existing limit surfaced while fixing #138 (KV-cache sizing, now resolved in #145).

On a watchdog-constrained Apple-Silicon GPU, a Qwen3-VL (`Qwen3VLMoeForConditionalGeneration`) image whose vision tower processes roughly 1300 or more soft tokens (about 5200 patches) overruns the ~10s Metal GPU watchdog **inside the ViT full-attention** (O(n²) per block), aborting the process *before* text decode. This is distinct from #138: that bug was the KV cache capping at 4096 and is fixed, whereas this one is vision-tower compute time.

## Evidence (post-#145, branch was `fix/138-qwen3vl-maxctx-kv`)

Real-model sweep on `mlx-community/Qwen3-VL-30B-A3B-Instruct-4bit`, `--max-ctx 16384`:

| Image (px) | Soft tokens | Outcome |
|---|---|---|
| 448×448 | 196 | completes, coherent |
| 896×896 | 784 | completes, coherent |
| 1152×1152 | 1296 | Metal GPU Timeout (ViT stage) |
| 2560×2560 | 6400 | ViT completes (~13s), then Metal GPU Timeout at KV prefill |
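
The soft-token column can be reproduced with simple arithmetic, which also shows how quickly the O(n²) attention cost grows. The sketch below is illustrative only: the 16-pixel patch size and 2×2 merge are assumptions inferred from the table (448×448 gives 196) and from the roughly 4:1 patch-to-soft-token ratio in the summary, not values read from the model config, and `vit_token_counts` and `relative_attention_cost` are hypothetical helper names. It also assumes the image is not resized by the preprocessor and that n is the pre-merge patch count.

```python
def vit_token_counts(width, height, patch=16, merge=2):
    """Return (patches, soft_tokens) for one image.

    patch=16 and merge=2 are assumptions inferred from the table,
    not values read from the model config.
    """
    patches = (width // patch) * (height // patch)
    soft_tokens = patches // (merge * merge)
    return patches, soft_tokens


def relative_attention_cost(patches, baseline_patches):
    """Ratio of n^2 attention-score counts between two images."""
    return (patches / baseline_patches) ** 2


if __name__ == "__main__":
    base_patches, _ = vit_token_counts(448, 448)
    for side in (448, 896, 1152, 2560):
        patches, soft = vit_token_counts(side, side)
        ratio = relative_attention_cost(patches, base_patches)
        print(f"{side}x{side}: {patches} patches, {soft} soft tokens, "
              f"~{ratio:.0f}x the 448x448 attention cost")
```

Under these assumptions the 1152×1152 case has about 44 times the attention-score count of the 448×448 case, so a modest increase in image side length moves the ViT from well inside the watchdog budget to past it.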

No `slice_update [broadcast_shapes]` appears anywhere in the sweep (the #138 symptom is gone). The failure is now a Metal command-buffer GPU timeout in the vision tower or prefill.

## Impact

Large, full-resolution images (which tile to thousands of soft tokens) cannot complete on memory/watchdog-constrained GPUs, even though the KV cache now sizes correctly. The original #138 reporter's machine had enough watchdog headroom to clear the ViT, so for them #145 is sufficient; lower-headroom machines need this.

## Likely direction

Stream / tile the ViT attention (and the image-prefill command buffers) so that no single Metal command buffer exceeds the watchdog budget: split the vision-tower attention into windowed/tiled passes with intermediate `eval()` barriers, analogous to the `prefill_chunk` chunking already added for the text/KV prefill in #145. Keep the result reference-faithful to mlx-vlm numerics.
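
One way to tile without changing numerics is to split only along the query axis: softmax is computed per query row, so processing queries in blocks against the full set of keys and values gives the same output as one full pass. The following is a minimal pure-Python sketch of that idea, not the project's implementation. `attention`, `chunked_attention`, and the `barrier` hook are hypothetical names, and the hook only marks where an `eval()` barrier might be placed. Whether any given chunk size actually stays under the watchdog budget is untested here.

```python
import math


def softmax(row):
    # Numerically stable softmax over one row of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Plain single-head attention on lists of row vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[d] for w, vj in zip(weights, v))
                    for d in range(len(v[0]))])
    return out


def chunked_attention(q, k, v, chunk, barrier=lambda rows: None):
    """Same result as attention(), one block of query rows at a time.

    Each block touches chunk * len(k) scores instead of len(q) * len(k).
    """
    out = []
    for start in range(0, len(q), chunk):
        rows = attention(q[start:start + chunk], k, v)
        barrier(rows)  # hypothetical hook: where an eval() barrier would go
        out.extend(rows)
    return out
```

Because each query row is computed by exactly the same operations in both paths, the chunked output matches the full pass, which is the property needed to stay faithful to the reference numerics. The trade-off is more, smaller submissions in exchange for a bounded amount of work per submission.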

## Out of scope

KV-cache sizing (done in #145). This issue is purely the vision-tower / prefill GPU-time scaling.

Qwen3-VL large-image ViT full-attention overruns Metal GPU watchdog (>~1300 soft tokens) #146
