Add NVIDIA LocateAnything-3B (MoonViT + Qwen2.5, autoregressive mode) by beshkenadze · Pull Request #1 · beshkenadze/mlx-vlm

beshkenadze · 2026-05-29T19:06:40Z

Summary

Ports nvidia/LocateAnything-3B — a visual-grounding VLM (object detection / referring-expression grounding / pointing / GUI & text localization) — into mlx-vlm so it runs on Apple Silicon via mlx_vlm.generate.

Architecture: MoonViT-SO-400M vision tower (shared with Kimi-VL) + Qwen2.5-3B text backbone + a 2-layer MLP connector (mlp1). Output is structured coordinate tokens, e.g. <ref>remote</ref><box><64><152><273><244></box>, with coordinates quantized to <0>…<1000> (normalized).
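To make the output format concrete, here is a minimal parsing sketch. It assumes each box carries its four values in (x1, y1, x2, y2) order and that dividing by 1000 recovers the normalized coordinate. Neither detail is spelled out above, and `parse_grounding` is a hypothetical helper, not part of the port.

```python
import re
from typing import List, Tuple

# One <box> with four quantized coordinate tokens.
BOX_RE = re.compile(r"<box><(\d+)><(\d+)><(\d+)><(\d+)></box>")
# A <ref> label followed by one or more boxes that belong to it.
REF_RE = re.compile(r"<ref>(.*?)</ref>((?:<box>(?:<\d+>){4}</box>)+)")

Box = Tuple[float, float, float, float]


def parse_grounding(text: str, width: int, height: int, bins: int = 1000) -> List[Tuple[str, Box]]:
    """Turn grounding text into (label, pixel box) pairs.

    Assumption: coordinates are ordered x1, y1, x2, y2 and normalized by `bins`.
    """
    results: List[Tuple[str, Box]] = []
    for label, boxes in REF_RE.findall(text):
        for match in BOX_RE.finditer(boxes):
            x1, y1, x2, y2 = (int(g) for g in match.groups())
            results.append(
                (label, (x1 / bins * width, y1 / bins * height,
                         x2 / bins * width, y2 / bins * height))
            )
    return results
```

Calling `parse_grounding(output, width, height)` on the example string yields one entry per box, each tagged with the preceding `<ref>` label.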

What's implemented

Model package mlx_vlm/models/locateanything/:
- config.py — VisionConfig (moonvit) / TextConfig (qwen2) / ModelConfig with grounding token ids.
- vision.py — MoonViT tower (2D RoPE, per-image block attention, 2×2 patch merge), ported from kimi_vl and reconciled to LocateAnything's weight names, PytorchGELUTanh activation, and LayerNorm eps.
- language.py — standard Qwen2.5-3B causal LM (1D RoPE, GQA 16/2, tied embeddings) + the non-causal "magi" block mask for PBD.
- locateanything.py — mlp1 connector + Model (vision → projector → scatter at <IMG_CONTEXT>) + sanitize().
- pbd.py — Parallel Box Decoding (MTP block decoder + bbox decode utils).
- image_processing_locateanything.py / processing_locateanything.py + chat_template.json (preprocessing matched bit-for-bit to the HF reference — bicubic ceil-resize, not center-crop).
prompt_utils.py — register locateanything → LIST_WITH_IMAGE_FIRST.
generate/dispatch.py — a small additive, opt-in hook routing fast/hybrid to PBD (gated on model_type == "locateanything"); slow and every other model are unchanged.
tests/test_locateanything.py — config, vision/language/full-forward shapes, sanitize coverage, magi-mask, PBD decode utils, max-tokens, image-input handling (22 tests, all green).
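The gating described for dispatch.py can be pictured roughly as below. This is only a sketch: `use_pbd` is a hypothetical name, and the particular three conditions are an assumption about what such a gate might check, not the actual hook.

```python
def use_pbd(model_type: str, mode: str, model: object) -> bool:
    """Return True only when every gate passes; otherwise stay on default AR.

    Assumed gates (illustrative): the model type, the requested decoding
    mode, and the presence of a callable pbd_generate entry point.
    """
    return (
        model_type == "locateanything"
        and mode in ("fast", "hybrid")
        and callable(getattr(model, "pbd_generate", None))
    )
```

Because all conditions must hold, any other model type, the `slow` mode, or a model without the entry point falls through to the unchanged autoregressive path.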

Decoding modes

| mode | description | throughput* |
|---|---|---|
| `slow` (default) | pure autoregressive | 1× |
| `fast` | Parallel Box Decoding (MTP, parallel blocks) | ~2× |
| `hybrid` | PBD with AR fallback on format irregularity | ~2× |

* 16-token COCO run. All three modes produce byte-identical grounding output.

Parity vs the PyTorch (CUDA) reference

Verified the MLX port numerically against the original HF/PyTorch model on an RTX 4090 (WSL, transformers==4.51, fp32). The HF vision_model + mlp1 were dumped on identical pixel_values, then fed to the MLX modules in fp32 and compared (scripts/la_parity_{ref,mlx}.py):

| stage | grid 64×64 (no pos-emb interp) | grid 36×46 (with interp) |
|---|---|---|
| `vision_model` | cos 0.999937, mean\|Δ\| ≈ 1e-3 | cos 0.9894 |
| `mlp1` (connector) | cos 0.999898 | cos 0.9971 |
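For readers who want to reproduce the two metrics in the table, a minimal sketch follows. It assumes both activations have been flattened to equal-length sequences of floats; the function names are illustrative and are not taken from the parity scripts.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flattened activation vectors."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mean_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute elementwise difference, the mean|Δ| column."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)
```

Cosine similarity is insensitive to a uniform scale error, which is why it is paired with the absolute difference here.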

The port's math is numerically faithful (cos ≈ 0.99994 on identical inputs).
The entire residual on non-square grids comes from one op: the bicubic interpolation of the learnable 2D pos-emb. MLX's shared bicubic_interpolate kernel uses a = -0.5, whereas PyTorch uses a = -0.75. The difference is localized to the additive pos-emb (hence connector cos > vision cos: the connector's LayerNorm partly cancels it), affects every MLX MoonViT port (incl. kimi_vl), and does not change the grounding output. Tracked in #2, "Align bicubic_interpolate with PyTorch (a=-0.75, not -0.5) for MoonViT pos-emb parity".
PBD parity: by the model's design invariant (hybrid falls back to AR for consistency), fast/hybrid are verified byte-identical to the slow (AR) path on the same input — so the verified AR path is itself the oracle for the parallel path.
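To show why the value of a matters, the sketch below evaluates the standard cubic convolution kernel for both settings. It is a one-dimensional illustration of the textbook formula, not the MLX or PyTorch kernel itself, and the function names are made up for this example.

```python
from typing import List


def cubic_weight(x: float, a: float) -> float:
    """Cubic convolution kernel W(x) with free parameter a."""
    x = abs(x)
    if x <= 1.0:
        return (a + 2.0) * x**3 - (a + 3.0) * x**2 + 1.0
    if x < 2.0:
        return a * x**3 - 5.0 * a * x**2 + 8.0 * a * x - 4.0 * a
    return 0.0


def tap_weights(t: float, a: float) -> List[float]:
    """Weights of the four neighbouring samples for fractional offset t in [0, 1)."""
    return [cubic_weight(d, a) for d in (1.0 + t, t, 1.0 - t, 2.0 - t)]


# Halfway between two samples:
#   a = -0.5  gives [-0.0625, 0.5625, 0.5625, -0.0625]
#   a = -0.75 gives [-0.09375, 0.59375, 0.59375, -0.09375]
```

Both settings produce weights that sum to 1, so the interpolated embedding keeps its overall level. Only the amount of overshoot differs, which is consistent with a small, localized residual.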

Verification

Unit tests: python -m unittest mlx_vlm.tests.test_locateanything → 22 passed.

End-to-end (real weights, Apple Silicon):

python -m mlx_vlm.generate --model nvidia/LocateAnything-3B \
  --image http://images.cocodataset.org/val2017/000000039769.jpg \
  --prompt "Detect all objects in the image." --max-tokens 128 --temperature 0.0

→ <ref>remote</ref><box><64><152><273><244></box><box><522><160><578><390></box> (boxes match the two remotes); prompt token count matches the HF processor exactly.

Large images: fixed a vision-attention OOM. A single-image dense [S,S] mask forced SDPA off the flash path; the mask is now omitted for a single image, so attention runs on the flash path with O(N) memory. See #3, "LocateAnything-3B: dense O(N²) vision attention OOMs on large images (Metal single-buffer cap)".
Codex CLI review (codex exec review): no P0/P1; two P2 edge-cases (PBD max_tokens, mx.array image input) fixed.

Blast radius

Everything PBD/model-specific lives in the locateanything package; the only shared edits are one additive line in prompt_utils.py and one additive, gated hook in dispatch.py. No other model, the vision tower, the processor, or shared SDPA is affected; slow AR remains the default.

Quantized weights

MLX builds published to mlx-community: bf16, 8bit, 4bit (mixed 4/8-bit — pure 4-bit degrades the tied coordinate-token embedding).

…nector/model, processor, tests

…e, no crop) + add parity harness

Parity vs HF reference (RTX 4090, transformers 4.51, fp32, identical inputs): vision_model cos=0.999937, mlp1 cos=0.999898 (grid 64x64, no pos-emb interp). Image processor previously center-cropped down (grid 34x44, dropped border pixels); HF bicubic-resizes up (grid 36x46). Now matched -> 442 prompt tokens identical to HF. Residual ~1% on non-square grids is the shared bicubic pos-emb interpolation kernel (MLX vs torch), which does not affect output correctness.
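The two grid sizes quoted above can be reproduced with a little arithmetic. The sketch assumes a 14-pixel patch and a 480×640 (height × width) input, neither of which is stated in this PR; the 2×2 merge is from the description. `patch_grid` is a hypothetical helper.

```python
import math
from typing import Tuple


def patch_grid(height: int, width: int, patch: int = 14, merge: int = 2,
               round_up: bool = True) -> Tuple[int, int]:
    """Patch grid (rows, cols) when each side must be a multiple of patch * merge.

    round_up=True models resizing up to the next multiple (the HF behaviour
    described above); round_up=False models cropping down to the previous one.
    """
    unit = patch * merge
    rnd = math.ceil if round_up else math.floor
    return rnd(height / unit) * merge, rnd(width / unit) * merge


rows, cols = patch_grid(480, 640)                            # (36, 46)
crop_rows, crop_cols = patch_grid(480, 640, round_up=False)  # (34, 44)
merged_tokens = rows * cols // (2 * 2)                       # 414 after the 2x2 merge
```

Under these assumptions the resize-up path yields 36×46 and the crop-down path 34×44, matching the grids reported in the commit message.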

Implements PBD, the headline LocateAnything-3B feature, as an opt-in multi-token-prediction (MTP) block decoder on top of the existing AR port.
- language.py: magi non-causal block-attention mask builder (build_magi_block_mask, dense equivalent of HF build_magi_ranges) plus an explicit-position RoPE path for the duplicated bridge token. The causal AR path is untouched (position_ids=None preserves original behaviour).
- pbd.py: PBD decode loop (MTP forward -> sample block -> accept / AR fallback) with ported decode utils (decode_bbox_avg, decode_ref, handle_pattern, is_valid_box_frame). KV cache rewind via KVCache.trim after each block.
- locateanything.py: Model.pbd_generate + make_cache entry points.
- config.py: block_size, causal_attn, text_mask/null/switch token ids, n_future_tokens.
- dispatch.py: additive, triple-gated opt-in hook routing locateanything fast/hybrid to pbd_generate; slow and every other model stay on default AR.

Verified on COCO cats image (greedy): fast == hybrid == slow == AR oracle (byte-identical). PBD ~2x faster than slow. 16 unit tests pass.

…(review finding 2)

…flash path (#3)

A single image's block mask is all-True (no-op), but passing it explicitly forced mx.fast.scaled_dot_product_attention off the flash kernel and materialized a dense [1,heads,S,S] fp32 score tensor -> 15.58GB / OOM on large frames (e.g. 2304x1296 -> 15604 patches). Now pass mask=None for a single image (flash, O(N) memory); multi-image batches keep the block-diagonal mask. Single-image output is unchanged (verified: identical COCO boxes); +3 mask-logic tests.
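The memory figure and the mask rule can be sketched in a few lines. The head count of 16 is an assumption (it is not stated here, but it reproduces the reported 15.58 GB), and both function names are hypothetical. Plain nested lists stand in for the real array type.

```python
from typing import List, Optional


def dense_score_bytes(seq_len: int, heads: int, bytes_per_elem: int = 4, batch: int = 1) -> int:
    """Size of a dense [batch, heads, S, S] attention score tensor in bytes."""
    return batch * heads * seq_len * seq_len * bytes_per_elem


# 15604 patches, fp32, assumed 16 heads: 15_583_028_224 bytes, about 15.58 GB.
large_frame_bytes = dense_score_bytes(15604, heads=16)


def vision_attention_mask(lengths: List[int]) -> Optional[List[List[bool]]]:
    """Return None for a single image, else a block-diagonal boolean mask.

    `lengths` holds the number of patches per image. For one image the block
    mask would be all-True, so skipping it changes nothing numerically.
    """
    if len(lengths) <= 1:
        return None
    total = sum(lengths)
    mask = [[False] * total for _ in range(total)]
    start = 0
    for n in lengths:
        for i in range(start, start + n):
            for j in range(start, start + n):
                mask[i][j] = True  # patches attend only within their own image
        start += n
    return mask
```

Returning `None` in the single-image case is what lets the attention call stay on the memory-efficient path, while multi-image batches still keep images isolated from each other.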

…ard + push)

- pbd: truncate generated tokens to max_tokens (fast/hybrid could overrun the budget by appending a full block past the limit, e.g. max_tokens<block_size).
- image processor: convert mx.array -> PIL before HF validation (make_list_of_images rejected mx.array, making the advertised array path dead code); reject unknown types consistently.

+3 regression tests (22 total).
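The overrun is easy to see in a toy block decoder. The sketch below assumes tokens arrive in whole blocks and that the loop checks the budget only between blocks; `collect_tokens` is a hypothetical stand-in, not the function in pbd.py.

```python
from typing import Iterable, List


def collect_tokens(blocks: Iterable[List[int]], max_tokens: int) -> List[int]:
    """Accumulate decoded blocks without ever exceeding max_tokens."""
    tokens: List[int] = []
    for block in blocks:
        if len(tokens) >= max_tokens:
            break
        # Appending a whole block can step past the budget...
        tokens.extend(block)
    # ...so the result is truncated before it is returned.
    return tokens[:max_tokens]
```

With blocks of six tokens and `max_tokens=4`, the first block alone would overshoot by two tokens; the final slice is what enforces the limit.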

Parity/oracle/upload helpers were local dev artifacts; they don't belong in the model port. Removed so the PR contains only the locateanything package + the additive prompt_utils/dispatch hooks + tests.

beshkenadze · 2026-05-30T11:23:22Z

Superseded by the upstream PR → Blaizzy#1242 (same branch). Continuing review there.

beshkenadze added 5 commits May 29, 2026 21:48

scaffold(locateanything): config dataclasses + package skeleton

972c4f7

feat(locateanything): MoonViT+Qwen2.5 AR port — vision, language, con…

a70913a

…nector/model, processor, tests

style(locateanything): black + isort

8244ef3

feat(locateanything): register LIST_WITH_IMAGE_FIRST prompt format

b862c6c

beshkenadze mentioned this pull request May 29, 2026

Align bicubic_interpolate with PyTorch (a=-0.75, not -0.5) for MoonViT pos-emb parity #2

Open

4 tasks

beshkenadze added 3 commits May 29, 2026 23:33

harden(locateanything): assert block_size==6 contract in PBD decoder …

8981e91

…(review finding 2)

beshkenadze force-pushed the feat/locateanything-3b branch from 5e471bb to 994806d on May 29, 2026 20:33

chore(locateanything): mlx-community upload helper (clean + LICENSE/c…

37adf5b

…ard + push)

beshkenadze mentioned this pull request May 30, 2026

LocateAnything-3B: dense O(N²) vision attention OOMs on large images (Metal single-buffer cap) #3

Closed

beshkenadze added 2 commits May 30, 2026 14:07

chore(locateanything): drop dev/verification scripts from the PR

53dea3f


beshkenadze closed this May 30, 2026

beshkenadze wants to merge 11 commits into main from feat/locateanything-3b
