Commits
22 commits
a08ea09
scaffold(locateanything): config dataclasses + package skeleton
beshkenadze May 29, 2026
f0b45dc
feat(locateanything): MoonViT+Qwen2.5 AR port — vision, language, con…
beshkenadze May 29, 2026
c6da8a3
style(locateanything): black + isort
beshkenadze May 29, 2026
1e980a3
feat(locateanything): register LIST_WITH_IMAGE_FIRST prompt format
beshkenadze May 29, 2026
213daec
fix(locateanything): match HF image preprocessing (bicubic ceil-resiz…
beshkenadze May 29, 2026
259c984
feat(locateanything): add Parallel Box Decoding (fast/hybrid/slow)
beshkenadze May 29, 2026
406f8f7
harden(locateanything): assert block_size==6 contract in PBD decoder …
beshkenadze May 29, 2026
d28e29e
fix(locateanything): drop dense vision mask for single image -> SDPA …
beshkenadze May 29, 2026
450445e
chore(locateanything): mlx-community upload helper (clean + LICENSE/c…
beshkenadze May 29, 2026
9678f20
fix(locateanything): address Codex review (P2)
beshkenadze May 30, 2026
55a7ec1
chore(locateanything): drop dev/verification scripts from the PR
beshkenadze May 30, 2026
a9dfcf1
address review (#1242): drop bundled chat_template.json + dispatch PB…
beshkenadze May 30, 2026
003aed5
feat(dispatch): generic capability-based PBD route + --generation-mod…
beshkenadze May 30, 2026
df7254d
style(locateanything): isort + autoflake (pre-commit CI fix)
beshkenadze May 30, 2026
4fa1385
Merge branch 'main' into feat/locateanything-3b
Blaizzy May 30, 2026
04591ef
Merge branch 'main' into feat/locateanything-3b
beshkenadze May 30, 2026
522e180
Revert LocateAnything generation dispatch
Blaizzy Jun 1, 2026
3c2f058
Merge branch 'main' into feat/locateanything-3b
Blaizzy Jun 1, 2026
bed2af8
Add LocateAnything processor save support
Blaizzy Jun 1, 2026
7108f3f
Align LocateAnything tests with model suite
Blaizzy Jun 1, 2026
0780e51
Remove LocateAnything comment noise
Blaizzy Jun 2, 2026
c328d2d
Add LocateAnything model README
Blaizzy Jun 2, 2026
1 change: 1 addition & 0 deletions README.md
@@ -42,6 +42,7 @@ Some models have detailed documentation with prompt formats, examples, and best
| MiniCPM-o | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/minicpmo/README.md) |
| Phi-4 Multimodal | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/phi4mm/README.md) |
| MolmoPoint | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/molmo_point/README.md) |
| LocateAnything | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/locateanything/README.md) |
| Moondream3 | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/moondream3/README.md) |
| Gemma 4 | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/gemma4/README.md) |
| Falcon-OCR | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/falcon_ocr/README.md) |
123 changes: 123 additions & 0 deletions mlx_vlm/models/locateanything/README.md
@@ -0,0 +1,123 @@
# LocateAnything

LocateAnything is NVIDIA's 3B vision-language grounding model for locating objects and referred regions in an image. This MLX-VLM port supports the MoonViT vision tower, Qwen2.5 text backbone, custom image processor, and the model-specific Parallel Box Decoding path.

## Model

| | |
|---|---|
| **Model ID** | `nvidia/LocateAnything-3B` |
| **Architecture** | MoonViT vision encoder + MLP connector + Qwen2.5 language model |
| **Parameters** | 3B |
| **Modalities** | Image + text |
| **Primary Tasks** | Visual grounding, open-vocabulary object localization, referring expression localization |

## CLI

The standard CLI uses autoregressive generation.

```bash
mlx_vlm.generate \
  --model nvidia/LocateAnything-3B \
  --image examples/images/cats.jpg \
  --prompt "Locate the cats." \
  --max-tokens 128 \
  --temperature 0.0
```

## Python

### Autoregressive generation

```python
from mlx_vlm import generate, load
from mlx_vlm.prompt_utils import apply_chat_template

model, processor = load("nvidia/LocateAnything-3B")

prompt = apply_chat_template(
    processor,
    model.config,
    "Locate the cats.",
    num_images=1,
)

result = generate(
    model=model,
    processor=processor,
    prompt=prompt,
    image="examples/images/cats.jpg",
    max_tokens=128,
    temperature=0.0,
)
print(result.text)
```

### Parallel Box Decoding

LocateAnything also exposes Parallel Box Decoding through `model.pbd_generate`. Use this direct model API for the `fast`, `hybrid`, and `slow` modes.

```python
from mlx_vlm import load
from mlx_vlm.prompt_utils import apply_chat_template
from mlx_vlm.utils import prepare_inputs

model, processor = load("nvidia/LocateAnything-3B")

prompt = apply_chat_template(
    processor,
    model.config,
    "Locate the cats.",
    num_images=1,
)
inputs = prepare_inputs(
    processor,
    images=["examples/images/cats.jpg"],
    prompts=prompt,
)

input_ids = inputs.pop("input_ids")
inputs.pop("attention_mask", None)

tokens = model.pbd_generate(
    input_ids,
    generation_mode="hybrid",
    max_tokens=128,
    **inputs,
)
print(processor.decode(tokens, skip_special_tokens=False))
```

`generation_mode` accepts:

- `hybrid`: starts with Parallel Box Decoding and falls back to autoregressive decoding when needed.
- `fast`: uses Parallel Box Decoding only.
- `slow`: uses autoregressive decoding through the LocateAnything PBD wrapper.
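
The decoded output contains coordinate tokens. As a rough sketch of how they could be turned back into pixel boxes: the `ModelConfig` defaults in this PR set `coord_start_token_id=151677` and `coord_end_token_id=152677`, a span of 1001 ids. The code below assumes (the README does not confirm this) that those ids map linearly onto a normalized coordinate in `[0, 1]`, that each box is four consecutive coordinate tokens in `x1, y1, x2, y2` order, and that the generated tokens are available as a plain list of ints. The helper names are hypothetical.

```python
from typing import List, Tuple

# Defaults copied from ModelConfig in config.py of this PR.
COORD_START_TOKEN_ID = 151677
COORD_END_TOKEN_ID = 152677
NUM_BINS = COORD_END_TOKEN_ID - COORD_START_TOKEN_ID  # 1000 intervals, 1001 ids


def is_coord_token(token_id: int) -> bool:
    """True if the id falls inside the coordinate token range."""
    return COORD_START_TOKEN_ID <= token_id <= COORD_END_TOKEN_ID


def coord_token_to_unit(token_id: int) -> float:
    """Map a coordinate token id to [0, 1]. ASSUMPTION: linear binning."""
    if not is_coord_token(token_id):
        raise ValueError(f"{token_id} is not a coordinate token")
    return (token_id - COORD_START_TOKEN_ID) / NUM_BINS


def extract_boxes(
    token_ids: List[int], width: int, height: int
) -> List[Tuple[float, float, float, float]]:
    """Collect runs of four coordinate tokens and scale them to pixels.

    ASSUMPTION: the four tokens are ordered x1, y1, x2, y2.
    """
    boxes = []
    run: List[float] = []
    for token_id in token_ids:
        if is_coord_token(token_id):
            run.append(coord_token_to_unit(token_id))
            if len(run) == 4:
                x1, y1, x2, y2 = run
                boxes.append((x1 * width, y1 * height, x2 * width, y2 * height))
                run = []
        else:
            run = []  # any non-coordinate token ends the current run
    return boxes
```

If the checkpoint uses a different coordinate order or binning, only `coord_token_to_unit` and the unpacking line need to change.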

## Architecture

- **Vision tower**: MoonViT image encoder with 14x14 patches, 2D RoPE, and patch merging.
- **Connector**: LayerNorm + two linear layers projecting merged vision features into the language hidden size.
- **Language model**: Qwen2-style decoder with tied embeddings.
- **Processor**: Expands `<image-N>` placeholders into `<img>...<IMG_CONTEXT>...</img>` spans based on the processed image grid.
- **PBD**: Generates fixed-size box token blocks for the model's coordinate format.
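
To make the processor bullet more concrete, here is a sketch of how the processed image grid follows from the input size. It mirrors the arithmetic of `rescale` and `patchify` in `image_processing_locateanything.py` from this PR (patch size 14, 2x2 merge kernel, `in_token_limit=25600`) without touching any pixels, and it omits that file's position-embedding size check. The function name is hypothetical. The last step, that the number of `<IMG_CONTEXT>` tokens equals the patch count divided by the merge area, is an assumption based on the 2x2 patch merging, not something the README states.

```python
import math
from typing import Tuple


def estimate_image_grid(
    width: int,
    height: int,
    patch_size: int = 14,
    merge_kernel_size: Tuple[int, int] = (2, 2),
    in_token_limit: int = 25600,
) -> Tuple[int, int, int]:
    """Return (grid_h, grid_w, merged_tokens) for an input image size."""
    # Step 1: downscale when the raw patch count exceeds the token limit.
    patches = (width // patch_size) * (height // patch_size)
    if patches > in_token_limit:
        scale = math.sqrt(in_token_limit / patches)
        width, height = int(width * scale), int(height * scale)

    # Step 2: round each side up to a multiple of merge_kernel * patch_size.
    pad_w = merge_kernel_size[1] * patch_size
    pad_h = merge_kernel_size[0] * patch_size
    width = math.ceil(width / pad_w) * pad_w
    height = math.ceil(height / pad_h) * pad_h

    # Step 3: patch grid, as returned by patchify.
    grid_h, grid_w = height // patch_size, width // patch_size

    # ASSUMPTION: one context token per merged 2x2 group of patches.
    merged_tokens = (grid_h // merge_kernel_size[0]) * (grid_w // merge_kernel_size[1])
    return grid_h, grid_w, merged_tokens
```

For example, a 640x480 input is resized to 644x504, which gives a 36x46 patch grid and, under the merging assumption, 414 context tokens.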

## Folder Structure

```text
mlx_vlm/models/locateanything/
  __init__.py
  config.py
  image_processing_locateanything.py
  language.py
  locateanything.py
  pbd.py
  processing_locateanything.py
  vision.py
```

## Notes

- For multi-image prompts, pass `num_images=len(images)` and provide the same number of images to `generate` or `prepare_inputs`.
- Increase `--max-tokens` (or `max_tokens` in Python) for scenes with many objects, since every additional box adds tokens to the output.
- The custom processor supports `save_pretrained()` and writes `processor_config.json`, `preprocessor_config.json`, and `chat_template.json`.
6 changes: 6 additions & 0 deletions mlx_vlm/models/locateanything/__init__.py
@@ -0,0 +1,6 @@
from .config import ModelConfig, TextConfig, VisionConfig
from .image_processing_locateanything import (
    LocateAnythingImageProcessor as ImageProcessor,
)
from .locateanything import LanguageModel, Model, VisionModel
from .processing_locateanything import LocateAnythingProcessor as Processor
71 changes: 71 additions & 0 deletions mlx_vlm/models/locateanything/config.py
@@ -0,0 +1,71 @@
from dataclasses import dataclass, field
from typing import List, Optional

from ..base import BaseModelConfig


@dataclass
class VisionConfig(BaseModelConfig):
    model_type: str = "moonvit"
    hidden_size: int = 1152
    num_hidden_layers: int = 27
    num_attention_heads: int = 16
    intermediate_size: int = 4304
    patch_size: int = 14
    init_pos_emb_height: int = 64
    init_pos_emb_width: int = 64
    num_channels: int = 3
    merge_kernel_size: List[int] = field(default_factory=lambda: [2, 2])

    def __post_init__(self):
        if self.merge_kernel_size is None:
            self.merge_kernel_size = [2, 2]
        self.depth = self.num_hidden_layers
        self.num_heads = self.num_attention_heads
        self.embed_dim = self.hidden_size
        self.spatial_merge_size = self.merge_kernel_size[0]


@dataclass
class TextConfig(BaseModelConfig):
    model_type: str = "qwen2"
    hidden_size: int = 2048
    num_hidden_layers: int = 36
    intermediate_size: int = 11008
    num_attention_heads: int = 16
    num_key_value_heads: Optional[int] = 2
    vocab_size: int = 152681
    rms_norm_eps: float = 1e-6
    rope_theta: float = 1000000.0
    rope_traditional: bool = False
    rope_scaling: Optional[dict] = None
    max_position_embeddings: int = 32768
    tie_word_embeddings: bool = True
    block_size: int = 6
    causal_attn: bool = False
    text_mask_token_id: int = 151676
    null_token_id: int = 152678
    switch_token_id: int = 152679

    def __post_init__(self):
        if self.num_key_value_heads is None:
            self.num_key_value_heads = self.num_attention_heads


@dataclass
class ModelConfig(BaseModelConfig):
    text_config: TextConfig
    vision_config: VisionConfig
    model_type: str = "locateanything"
    image_token_index: int = 151665
    box_start_token_id: int = 151668
    box_end_token_id: int = 151669
    coord_start_token_id: int = 151677
    coord_end_token_id: int = 152677
    ref_start_token_id: int = 151672
    ref_end_token_id: int = 151673
    none_token_id: int = 4064
    mlp_connector_layers: int = 2
    vocab_size: int = 152681
    eos_token_id: Optional[List[int]] = None
    n_future_tokens: int = 6
154 changes: 154 additions & 0 deletions mlx_vlm/models/locateanything/image_processing_locateanything.py
@@ -0,0 +1,154 @@
import math
from typing import List, Optional, Tuple, Union

import mlx.core as mx
from PIL import Image

_materialize = mx.eval

LOCATEANYTHING_IMAGE_MEAN = (0.5, 0.5, 0.5)
LOCATEANYTHING_IMAGE_STD = (0.5, 0.5, 0.5)


def _base_image_processor():
    from transformers.image_processing_utils import BaseImageProcessor

    return BaseImageProcessor


class LocateAnythingImageProcessor(_base_image_processor()):

    model_input_names = ["pixel_values", "image_grid_hws"]

    def __init__(
        self,
        patch_size: int = 14,
        image_mean: Tuple[float, float, float] = LOCATEANYTHING_IMAGE_MEAN,
        image_std: Tuple[float, float, float] = LOCATEANYTHING_IMAGE_STD,
        in_token_limit: int = 25600,
        merge_kernel_size: List[int] = None,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.in_token_limit = in_token_limit
        self.patch_size = patch_size
        self.image_mean = image_mean
        self.image_std = image_std
        self.merge_kernel_size = (
            merge_kernel_size if merge_kernel_size is not None else [2, 2]
        )

    def rescale(
        self, image: Image.Image, merge_kernel_size: List[int] = None
    ) -> Image.Image:
        if merge_kernel_size is None:
            merge_kernel_size = self.merge_kernel_size

        w, h = image.size
        patch_size = self.patch_size

        if (w // patch_size) * (h // patch_size) > self.in_token_limit:
            scale = math.sqrt(
                self.in_token_limit / ((w // patch_size) * (h // patch_size))
            )
            new_w, new_h = int(w * scale), int(h * scale)
            image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)

        new_w, new_h = image.size
        pad_w = merge_kernel_size[1] * patch_size
        pad_h = merge_kernel_size[0] * patch_size
        target_w = math.ceil(new_w / pad_w) * pad_w
        target_h = math.ceil(new_h / pad_h) * pad_h
        if (target_w, target_h) != (new_w, new_h):
            image = image.resize((target_w, target_h), Image.Resampling.BICUBIC)

        w, h = image.size
        if w // patch_size >= 512 or h // patch_size >= 512:
            raise ValueError("Exceed pos emb")

        return image

    def to_mlx(self, image: Image.Image) -> mx.array:
        image = image.convert("RGB")
        w, h = image.size
        arr = mx.array(list(image.getdata()), dtype=mx.float32).reshape(h, w, 3) / 255.0
        return arr.transpose(2, 0, 1)

    def normalize(self, image: mx.array) -> mx.array:
        mean = mx.array(self.image_mean, dtype=mx.float32).reshape(3, 1, 1)
        std = mx.array(self.image_std, dtype=mx.float32).reshape(3, 1, 1)
        return (image - mean) / std

    def patchify(self, image: mx.array) -> Tuple[mx.array, Tuple[int, int]]:
        patch_size = self.patch_size
        C, H, W = image.shape

        patches = image.reshape(
            C, H // patch_size, patch_size, W // patch_size, patch_size
        )
        patches = patches.transpose(1, 3, 0, 2, 4)
        patches = patches.reshape(-1, C, patch_size, patch_size)

        grid_hw = (H // patch_size, W // patch_size)
        return patches, grid_hw

    def _to_pil(self, image) -> Image.Image:
        if isinstance(image, Image.Image):
            return image
        if isinstance(image, mx.array):
            arr = image
            if arr.ndim == 3 and arr.shape[0] in (1, 3, 4):
                arr = arr.transpose(1, 2, 0)
            if arr.dtype in (mx.float32, mx.float16, mx.bfloat16):
                arr = (arr * 255).astype(mx.uint8)
            h, w, _ = arr.shape
            return Image.frombytes("RGB", (w, h), bytes(arr.reshape(-1).tolist()))
        raise ValueError(
            f"Invalid image type {type(image)}. Expected PIL.Image.Image or mx.array."
        )

    def _preprocess(self, image) -> Tuple[mx.array, Tuple[int, int]]:
        image = self.rescale(image, self.merge_kernel_size)
        image = self.to_mlx(image)
        image = self.normalize(image)
        return self.patchify(image)

    def preprocess(
        self,
        images,
        return_tensors: Optional[Union[str, object]] = None,
        **kwargs,
    ):
        from transformers.feature_extraction_utils import BatchFeature

        if isinstance(images, (mx.array, Image.Image)):
            images = [images]

        pixel_values_list = []
        image_grid_hws = []

        for image in images:
            patches, image_grid_hw = self._preprocess(self._to_pil(image))
            pixel_values_list.append(patches)
            image_grid_hws.append(image_grid_hw)

        pixel_values = mx.concatenate(pixel_values_list, axis=0)
        grid_shapes = [(int(h), int(w)) for h, w in image_grid_hws]
        image_grid_hws = mx.array(image_grid_hws)
        _materialize(pixel_values, image_grid_hws)

        data = {
            "pixel_values": pixel_values,
            "image_grid_hws": image_grid_hws,
            "_grid_shapes": grid_shapes,
        }

        return BatchFeature(data=data, tensor_type=return_tensors)

    def __call__(
        self,
        images,
        return_tensors: Optional[Union[str, object]] = None,
        **kwargs,
    ):
        return self.preprocess(images, return_tensors=return_tensors, **kwargs)