Blaizzy · beshkenadze · May 29, 2026
diff --git a/README.md b/README.md
@@ -42,6 +42,7 @@ Some models have detailed documentation with prompt formats, examples, and best
 | MiniCPM-o | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/minicpmo/README.md) |
 | Phi-4 Multimodal | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/phi4mm/README.md) |
 | MolmoPoint | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/molmo_point/README.md) |
+| LocateAnything | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/locateanything/README.md) |
 | Moondream3 | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/moondream3/README.md) |
 | Gemma 4 | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/gemma4/README.md) |
 | Falcon-OCR | [Docs](https://github.com/Blaizzy/mlx-vlm/blob/main/mlx_vlm/models/falcon_ocr/README.md) |

diff --git a/mlx_vlm/models/locateanything/README.md b/mlx_vlm/models/locateanything/README.md
@@ -0,0 +1,123 @@
+# LocateAnything
+
+LocateAnything is NVIDIA's 3B vision-language grounding model for locating objects and referred regions in an image. This MLX-VLM port supports the MoonViT vision tower, Qwen2.5 text backbone, custom image processor, and the model-specific Parallel Box Decoding path.
+
+## Model
+
+| | |
+|---|---|
+| **Model ID** | `nvidia/LocateAnything-3B` |
+| **Architecture** | MoonViT vision encoder + MLP connector + Qwen2.5 language model |
+| **Parameters** | 3B |
+| **Modalities** | Image + text |
+| **Primary Tasks** | Visual grounding, open-vocabulary object localization, referring expression localization |
+
+## CLI
+
+The standard CLI uses autoregressive generation.
+
+```bash
+mlx_vlm.generate \
+  --model nvidia/LocateAnything-3B \
+  --image examples/images/cats.jpg \
+  --prompt "Locate the cats." \
+  --max-tokens 128 \
+  --temperature 0.0
+```
+
+## Python
+
+### Autoregressive generation
+
+```python
+from mlx_vlm import generate, load
+from mlx_vlm.prompt_utils import apply_chat_template
+
+model, processor = load("nvidia/LocateAnything-3B")
+
+prompt = apply_chat_template(
+    processor,
+    model.config,
+    "Locate the cats.",
+    num_images=1,
+)
+
+result = generate(
+    model=model,
+    processor=processor,
+    prompt=prompt,
+    image="examples/images/cats.jpg",
+    max_tokens=128,
+    temperature=0.0,
+)
+print(result.text)
+```
+
+### Parallel Box Decoding
+
+LocateAnything also exposes Parallel Box Decoding through `model.pbd_generate`. Use this direct model API for the `fast`, `hybrid`, and `slow` modes.
+
+```python
+from mlx_vlm import load
+from mlx_vlm.prompt_utils import apply_chat_template
+from mlx_vlm.utils import prepare_inputs
+
+model, processor = load("nvidia/LocateAnything-3B")
+
+prompt = apply_chat_template(
+    processor,
+    model.config,
+    "Locate the cats.",
+    num_images=1,
+)
+inputs = prepare_inputs(
+    processor,
+    images=["examples/images/cats.jpg"],
+    prompts=prompt,
+)
+
+input_ids = inputs.pop("input_ids")
+inputs.pop("attention_mask", None)
+
+tokens = model.pbd_generate(
+    input_ids,
+    generation_mode="hybrid",
+    max_tokens=128,
+    **inputs,
+)
+print(processor.decode(tokens, skip_special_tokens=False))
+```
+
+`generation_mode` accepts:
+
+- `hybrid`: starts with Parallel Box Decoding and falls back to autoregressive decoding when needed.
+- `fast`: uses Parallel Box Decoding only.
+- `slow`: uses autoregressive decoding through the LocateAnything PBD wrapper.
+
+## Architecture
+
+- **Vision tower**: MoonViT image encoder with 14x14 patches, 2D RoPE, and patch merging.
+- **Connector**: LayerNorm + two linear layers projecting merged vision features into the language hidden size.
+- **Language model**: Qwen2-style decoder with tied embeddings.
+- **Processor**: Expands `<image-N>` placeholders into `<img>...<IMG_CONTEXT>...</img>` spans based on the processed image grid.
+- **PBD**: Generates fixed-size box token blocks for the model's coordinate format.
+
+## Folder Structure
+
+```text
+mlx_vlm/models/locateanything/
+  __init__.py
+  config.py
+  image_processing_locateanything.py
+  language.py
+  locateanything.py
+  pbd.py
+  processing_locateanything.py
+  vision.py
+```
+
+## Notes
+
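+- For multi-image prompts, pass `num_images=len(images)` to `apply_chat_template` and provide the same number of images to `generate` or `prepare_inputs`.
+- Increase `--max-tokens` (or `max_tokens` in Python) for scenes with many objects, since each located object adds tokens to the output.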
+- The custom processor supports `save_pretrained()` and writes `processor_config.json`, `preprocessor_config.json`, and `chat_template.json`.
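Note on the README's Processor bullet: it describes placeholder expansion only in words. The sketch below shows one way such an expansion could work. It is not the code in `processing_locateanything.py`; `expand_image_placeholders` is a hypothetical helper, and it assumes that placeholders are 1-based and that each image contributes one `<IMG_CONTEXT>` token per merged patch (grid height and width each divided by the 2x2 merge kernel).

```python
import re
from typing import List, Tuple


def expand_image_placeholders(
    prompt: str,
    grid_hws: List[Tuple[int, int]],
    merge_kernel_size: Tuple[int, int] = (2, 2),
) -> str:
    """Hypothetical helper: replace each <image-N> with an <img>...</img> span."""
    merge_h, merge_w = merge_kernel_size

    def replace(match):
        # Assumption: placeholders are numbered from 1.
        index = int(match.group(1)) - 1
        grid_h, grid_w = grid_hws[index]
        # Assumption: one context token per merged patch.
        num_tokens = (grid_h // merge_h) * (grid_w // merge_w)
        return "<img>" + "<IMG_CONTEXT>" * num_tokens + "</img>"

    return re.sub(r"<image-(\d+)>", replace, prompt)


# A 4x6 patch grid merged 2x2 leaves a 2x3 grid of context tokens.
expanded = expand_image_placeholders("<image-1>Locate the cats.", [(4, 6)])
print(expanded.count("<IMG_CONTEXT>"))
```

If the real processor counts tokens differently, only the `num_tokens` line would need to change.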
diff --git a/mlx_vlm/models/locateanything/__init__.py b/mlx_vlm/models/locateanything/__init__.py
@@ -0,0 +1,6 @@
+from .config import ModelConfig, TextConfig, VisionConfig
+from .image_processing_locateanything import (
+    LocateAnythingImageProcessor as ImageProcessor,
+)
+from .locateanything import LanguageModel, Model, VisionModel
+from .processing_locateanything import LocateAnythingProcessor as Processor
diff --git a/mlx_vlm/models/locateanything/config.py b/mlx_vlm/models/locateanything/config.py
@@ -0,0 +1,71 @@
+from dataclasses import dataclass, field
+from typing import List, Optional
+
+from ..base import BaseModelConfig
+
+
+@dataclass
+class VisionConfig(BaseModelConfig):
+    model_type: str = "moonvit"
+    hidden_size: int = 1152
+    num_hidden_layers: int = 27
+    num_attention_heads: int = 16
+    intermediate_size: int = 4304
+    patch_size: int = 14
+    init_pos_emb_height: int = 64
+    init_pos_emb_width: int = 64
+    num_channels: int = 3
+    merge_kernel_size: List[int] = field(default_factory=lambda: [2, 2])
+
+    def __post_init__(self):
+        if self.merge_kernel_size is None:
+            self.merge_kernel_size = [2, 2]
+        self.depth = self.num_hidden_layers
+        self.num_heads = self.num_attention_heads
+        self.embed_dim = self.hidden_size
+        self.spatial_merge_size = self.merge_kernel_size[0]
+
+
+@dataclass
+class TextConfig(BaseModelConfig):
+    model_type: str = "qwen2"
+    hidden_size: int = 2048
+    num_hidden_layers: int = 36
+    intermediate_size: int = 11008
+    num_attention_heads: int = 16
+    num_key_value_heads: Optional[int] = 2
+    vocab_size: int = 152681
+    rms_norm_eps: float = 1e-6
+    rope_theta: float = 1000000.0
+    rope_traditional: bool = False
+    rope_scaling: Optional[dict] = None
+    max_position_embeddings: int = 32768
+    tie_word_embeddings: bool = True
+    block_size: int = 6
+    causal_attn: bool = False
+    text_mask_token_id: int = 151676
+    null_token_id: int = 152678
+    switch_token_id: int = 152679
+
+    def __post_init__(self):
+        if self.num_key_value_heads is None:
+            self.num_key_value_heads = self.num_attention_heads
+
+
+@dataclass
+class ModelConfig(BaseModelConfig):
+    text_config: TextConfig
+    vision_config: VisionConfig
+    model_type: str = "locateanything"
+    image_token_index: int = 151665
+    box_start_token_id: int = 151668
+    box_end_token_id: int = 151669
+    coord_start_token_id: int = 151677
+    coord_end_token_id: int = 152677
+    ref_start_token_id: int = 151672
+    ref_end_token_id: int = 151673
+    none_token_id: int = 4064
+    mlp_connector_layers: int = 2
+    vocab_size: int = 152681
+    eos_token_id: Optional[List[int]] = None
+    n_future_tokens: int = 6
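Note on the config defaults above: a few derived quantities are easy to miss when reading the raw numbers. The sketch below recomputes them from values copied out of `config.py`. It assumes the usual convention that the attention head size is `hidden_size // num_attention_heads`, and that the coordinate token range is inclusive on both ends, which is consistent with `null_token_id` being the next ID (152678). `coord_bin` is a hypothetical helper, not part of the patch.

```python
# Values copied from the TextConfig and ModelConfig defaults in config.py.
hidden_size = 2048
num_attention_heads = 16
num_key_value_heads = 2
coord_start_token_id = 151677
coord_end_token_id = 152677

# Assumption: head size follows the usual hidden_size / num_heads convention.
head_dim = hidden_size // num_attention_heads

# Grouped-query attention: how many query heads share each key/value head.
queries_per_kv_head = num_attention_heads // num_key_value_heads

# Assumption: the coordinate range is inclusive on both ends.
num_coord_tokens = coord_end_token_id - coord_start_token_id + 1


def coord_bin(token_id: int) -> int:
    """Hypothetical helper: offset of a coordinate token within its range."""
    if not coord_start_token_id <= token_id <= coord_end_token_id:
        raise ValueError(f"{token_id} is not a coordinate token")
    return token_id - coord_start_token_id


print(head_dim, queries_per_kv_head, num_coord_tokens)
```

Under these assumptions the coordinate vocabulary has 1001 entries, so offsets run from 0 to 1000. How an offset maps to a pixel position is defined by the model and is not shown in this part of the patch.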
diff --git a/mlx_vlm/models/locateanything/image_processing_locateanything.py b/mlx_vlm/models/locateanything/image_processing_locateanything.py
@@ -0,0 +1,154 @@
+import math
+from typing import List, Optional, Tuple, Union
+
+import mlx.core as mx
+from PIL import Image
+
+_materialize = mx.eval
+
+LOCATEANYTHING_IMAGE_MEAN = (0.5, 0.5, 0.5)
+LOCATEANYTHING_IMAGE_STD = (0.5, 0.5, 0.5)
+
+
+def _base_image_processor():
+    from transformers.image_processing_utils import BaseImageProcessor
+
+    return BaseImageProcessor
+
+
+class LocateAnythingImageProcessor(_base_image_processor()):
+
+    model_input_names = ["pixel_values", "image_grid_hws"]
+
+    def __init__(
+        self,
+        patch_size: int = 14,
+        image_mean: Tuple[float, float, float] = LOCATEANYTHING_IMAGE_MEAN,
+        image_std: Tuple[float, float, float] = LOCATEANYTHING_IMAGE_STD,
+        in_token_limit: int = 25600,
+        merge_kernel_size: Optional[List[int]] = None,
+        **kwargs,
+    ):
+        super().__init__(**kwargs)
+        self.in_token_limit = in_token_limit
+        self.patch_size = patch_size
+        self.image_mean = image_mean
+        self.image_std = image_std
+        self.merge_kernel_size = (
+            merge_kernel_size if merge_kernel_size is not None else [2, 2]
+        )
+
+    def rescale(
+        self, image: Image.Image, merge_kernel_size: Optional[List[int]] = None
+    ) -> Image.Image:
+        if merge_kernel_size is None:
+            merge_kernel_size = self.merge_kernel_size
+
+        w, h = image.size
+        patch_size = self.patch_size
+
+        if (w // patch_size) * (h // patch_size) > self.in_token_limit:
+            scale = math.sqrt(
+                self.in_token_limit / ((w // patch_size) * (h // patch_size))
+            )
+            new_w, new_h = int(w * scale), int(h * scale)
+            image = image.resize((new_w, new_h), Image.Resampling.BICUBIC)
+
+        new_w, new_h = image.size
+        pad_w = merge_kernel_size[1] * patch_size
+        pad_h = merge_kernel_size[0] * patch_size
+        target_w = math.ceil(new_w / pad_w) * pad_w
+        target_h = math.ceil(new_h / pad_h) * pad_h
+        if (target_w, target_h) != (new_w, new_h):
+            image = image.resize((target_w, target_h), Image.Resampling.BICUBIC)
+
+        w, h = image.size
+        if w // patch_size >= 512 or h // patch_size >= 512:
+            raise ValueError("Image grid must be under 512 patches per side.")
+
+        return image
+
+    def to_mlx(self, image: Image.Image) -> mx.array:
+        image = image.convert("RGB")
+        w, h = image.size
+        arr = mx.array(list(image.getdata()), dtype=mx.float32).reshape(h, w, 3) / 255.0
+        return arr.transpose(2, 0, 1)
+
+    def normalize(self, image: mx.array) -> mx.array:
+        mean = mx.array(self.image_mean, dtype=mx.float32).reshape(3, 1, 1)
+        std = mx.array(self.image_std, dtype=mx.float32).reshape(3, 1, 1)
+        return (image - mean) / std
+
+    def patchify(self, image: mx.array) -> Tuple[mx.array, Tuple[int, int]]:
+        patch_size = self.patch_size
+        C, H, W = image.shape
+
+        patches = image.reshape(
+            C, H // patch_size, patch_size, W // patch_size, patch_size
+        )
+        patches = patches.transpose(1, 3, 0, 2, 4)
+        patches = patches.reshape(-1, C, patch_size, patch_size)
+
+        grid_hw = (H // patch_size, W // patch_size)
+        return patches, grid_hw
+
+    def _to_pil(self, image) -> Image.Image:
+        if isinstance(image, Image.Image):
+            return image
+        if isinstance(image, mx.array):
+            arr = image
+            if arr.ndim == 3 and arr.shape[0] in (1, 3, 4):
+                arr = arr.transpose(1, 2, 0)
+            if arr.dtype in (mx.float32, mx.float16, mx.bfloat16):
+                arr = (arr * 255).astype(mx.uint8)
+            h, w, _ = arr.shape
+            return Image.frombytes("RGB", (w, h), bytes(arr.reshape(-1).tolist()))
+        raise ValueError(
+            f"Invalid image type {type(image)}. Expected PIL.Image.Image or mx.array."
+        )
+
+    def _preprocess(self, image) -> Tuple[mx.array, Tuple[int, int]]:
+        image = self.rescale(image, self.merge_kernel_size)
+        image = self.to_mlx(image)
+        image = self.normalize(image)
+        return self.patchify(image)
+
+    def preprocess(
+        self,
+        images,
+        return_tensors: Optional[Union[str, object]] = None,
+        **kwargs,
+    ):
+        from transformers.feature_extraction_utils import BatchFeature
+
+        if isinstance(images, (mx.array, Image.Image)):
+            images = [images]
+
+        pixel_values_list = []
+        image_grid_hws = []
+
+        for image in images:
+            patches, image_grid_hw = self._preprocess(self._to_pil(image))
+            pixel_values_list.append(patches)
+            image_grid_hws.append(image_grid_hw)
+
+        pixel_values = mx.concatenate(pixel_values_list, axis=0)
+        grid_shapes = [(int(h), int(w)) for h, w in image_grid_hws]
+        image_grid_hws = mx.array(image_grid_hws)
+        _materialize(pixel_values, image_grid_hws)
+
+        data = {
+            "pixel_values": pixel_values,
+            "image_grid_hws": image_grid_hws,
+            "_grid_shapes": grid_shapes,
+        }
+
+        return BatchFeature(data=data, tensor_type=return_tensors)
+
+    def __call__(
+        self,
+        images,
+        return_tensors: Optional[Union[str, object]] = None,
+        **kwargs,
+    ):
+        return self.preprocess(images, return_tensors=return_tensors, **kwargs)
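Note on `rescale` above: the size arithmetic can be checked without loading an image. The sketch below mirrors the same steps in pure Python, using the default `patch_size`, `in_token_limit`, and `merge_kernel_size` from `__init__`. `planned_size` is a hypothetical helper written for illustration and is not part of the patch.

```python
import math
from typing import Tuple


def planned_size(
    width: int,
    height: int,
    patch_size: int = 14,
    in_token_limit: int = 25600,
    merge_kernel_size: Tuple[int, int] = (2, 2),
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((target_w, target_h), (grid_h, grid_w)) without touching pixels."""
    # Step 1: downscale if the patch count exceeds the token limit.
    patches = (width // patch_size) * (height // patch_size)
    if patches > in_token_limit:
        scale = math.sqrt(in_token_limit / patches)
        width, height = int(width * scale), int(height * scale)

    # Step 2: round each side up to a multiple of merge kernel * patch size.
    pad_w = merge_kernel_size[1] * patch_size
    pad_h = merge_kernel_size[0] * patch_size
    target_w = math.ceil(width / pad_w) * pad_w
    target_h = math.ceil(height / pad_h) * pad_h

    # Step 3: enforce the same per-side grid limit as the processor.
    grid_h, grid_w = target_h // patch_size, target_w // patch_size
    if grid_w >= 512 or grid_h >= 512:
        raise ValueError("grid exceeds the 512-patch limit per side")
    return (target_w, target_h), (grid_h, grid_w)


size, grid = planned_size(640, 480)
print(size, grid)  # (644, 504) (36, 46)
```

Two details follow from this arithmetic. The processor reaches the target size with a second `resize` rather than padding, so the image is stretched slightly. Because sizes are rounded up after downscaling, the final grid can land a little above `in_token_limit`: a 4000x3000 input yields a 140x186 grid, which is 26040 patches.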