A high-performance, MLX-native implementation of DINOv3 (the latest iteration of self-supervised ViT models from Meta) optimized for Apple Silicon.
This repository provides an implementation of the DinoVisionTransformer architecture, including modern features like Rotary Position Embeddings (RoPE), SwiGLU Feed-Forward Networks, and LayerScale, all built using the MLX framework.
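To make the rotary embedding idea concrete, here is a minimal sketch of a generic axial 2D RoPE: half of the rotation frequencies follow the patch row, the other half follow the patch column. It uses NumPy so the arithmetic is easy to inspect. The function names, the `base` value, and the use of raw integer grid coordinates are assumptions for illustration and may differ from the repository's MLX layer.

```python
import numpy as np

def rope_2d_angles(grid_h, grid_w, head_dim, base=100.0):
    """Rotation angles per patch, shape (grid_h * grid_w, head_dim).

    `base` is an illustrative default, not a value confirmed by the repository.
    """
    assert head_dim % 4 == 0, "head_dim must split into (y, x) x (pair halves)"
    n_freq = head_dim // 4
    # Geometric frequency schedule shared by both axes.
    freqs = 1.0 / (base ** (np.arange(n_freq) / n_freq))
    ys, xs = np.meshgrid(np.arange(grid_h), np.arange(grid_w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(np.float64)  # (N, 2)
    # (N, 2, n_freq) -> (N, head_dim / 2): row angles first, then column angles.
    angles = (coords[:, :, None] * freqs[None, None, :]).reshape(len(coords), -1)
    # Duplicate so element i and element i + head_dim / 2 share one angle.
    return np.concatenate([angles, angles], axis=-1)

def rotate_half(x):
    """Map (x1, x2) to (-x2, x1) along the last axis."""
    x1, x2 = np.split(x, 2, axis=-1)
    return np.concatenate([-x2, x1], axis=-1)

def apply_rope(x, angles):
    """Rotate each (x_i, x_{i + d/2}) pair by its position-dependent angle."""
    return x * np.cos(angles) + rotate_half(x) * np.sin(angles)

angles = rope_2d_angles(grid_h=4, grid_w=4, head_dim=64)
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same (row, column) offset of (1, 2) between query and key in both cases,
# so the attention logit is the same: positions enter only through differences.
logit_a = apply_rope(q, angles[0]) @ apply_rope(k, angles[6])    # (0,0) vs (1,2)
logit_b = apply_rope(q, angles[9]) @ apply_rope(k, angles[15])   # (2,1) vs (3,3)
print(np.isclose(logit_a, logit_b))
```

The rotation leaves vector norms unchanged, and the query-key dot product depends only on the relative offset between patches, which is the property that makes RoPE attractive for images of varying size.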
- MLX-Native: Built from the ground up for Apple Silicon using MLX.
- Multiple Architectures: Supports various ViT scales:
  - vit_small (384 embed dim, 12 layers, 6 heads)
  - vit_base (768 embed dim, 12 layers, 12 heads)
  - vit_large (1024 embed dim, 24 layers, 16 heads)
  - vit_giant2 (1536 embed dim, 40 layers, 24 heads)
  - vit_7b (4096 embed dim, 40 layers, 32 heads)
- Advanced Components:
  - RoPE: Integrated Rotary Position Embeddings for 2D images.
  - SwiGLU: Efficient SwiGLU FFN implementation.
  - LayerScale: For improved training stability in deep transformers.
  - Registers/Storage Tokens: Full support for additional storage tokens.
- Weight Conversion: Includes scripts to convert official PyTorch/HuggingFace checkpoints to MLX .safetensors.
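As a rough illustration of two of the components listed above, the sketch below writes out a SwiGLU feed-forward block and LayerScale in NumPy. It is not the repository's code: biases are omitted, the hidden width and token count are arbitrary, and the small `gamma` initial value is a common choice rather than something this README states. Only the embedding width of 384 comes from the vit_small entry.

```python
import numpy as np

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x @ W_gate) * (x @ W_up)) @ W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def layer_scale(x, gamma):
    """LayerScale: scale every channel by its own learnable factor."""
    return x * gamma

rng = np.random.default_rng(0)
dim = 384       # vit_small embedding width
hidden = 1024   # illustrative hidden width (assumption)
tokens = 201    # illustrative sequence length (assumption)

x = rng.normal(size=(tokens, dim))
w_gate = rng.normal(size=(dim, hidden)) * 0.02
w_up = rng.normal(size=(dim, hidden)) * 0.02
w_down = rng.normal(size=(hidden, dim)) * 0.02
gamma = np.full(dim, 1e-5)  # assumed small init; keeps the branch near zero at first

# Residual connection around the scaled feed-forward branch.
y = x + layer_scale(swiglu_ffn(x, w_gate, w_up, w_down), gamma)
print(y.shape)
```

With a small `gamma`, each block starts close to the identity function, which is the usual explanation for why LayerScale helps deep transformers train stably.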
Ensure you have Python 3.12+ and the necessary dependencies installed:
pip install mlx torch transformers pillow

To use the model, you first need to convert an official checkpoint to MLX format. Use the provided conversion script:

python dinov3/checkpoints/convert.py

Note: You may need to update CHECKPOINT_PATH in the script to point to your downloaded .pth file; otherwise, the script will attempt to download or locate the default ViT-S/16 checkpoint.
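One step a PyTorch-to-MLX conversion typically involves is changing the layout of convolution kernels, because PyTorch stores them as (out, in, kH, kW) while MLX convolutions expect (out, kH, kW, in). The sketch below shows that single step on NumPy arrays. The helper name `to_mlx_layout` and the key names are hypothetical, and the actual convert.py may do more or work differently.

```python
import numpy as np

def to_mlx_layout(state):
    """Return a copy of `state` with 4-D conv kernels moved from
    PyTorch's (out, in, kH, kW) layout to MLX's (out, kH, kW, in).

    Arrays of any other rank (linear weights, biases, norms) pass through.
    """
    converted = {}
    for name, value in state.items():
        arr = np.asarray(value)
        if arr.ndim == 4:
            arr = arr.transpose(0, 2, 3, 1)
        converted[name] = np.ascontiguousarray(arr)
    return converted

# Hypothetical key names and shapes standing in for a ViT-S/16 patch embedding.
state = {
    "patch_embed.proj.weight": np.zeros((384, 3, 16, 16), dtype=np.float32),
    "patch_embed.proj.bias": np.zeros((384,), dtype=np.float32),
}
converted = to_mlx_layout(state)
print({name: arr.shape for name, arr in converted.items()})
```

With MLX installed, a dictionary like `converted` could then be wrapped with `mx.array` and written using `mx.save_safetensors`. Any key renaming needed to match the MLX module names is left out of this sketch.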
Loading and running the model in MLX is straightforward:
import mlx.core as mx
from dinov3.models import vit_small
# Initialize model
model = vit_small(patch_size=16, n_storage_tokens=4)
# Load converted weights
model.load_weights("path/to/vit-small.safetensors")
# Forward pass
image = mx.random.uniform(shape=(1, 224, 224, 3)) # MLX uses NHWC
outputs = model(image, is_training=False) # Returns CLS token
print(outputs.shape)

- dinov3/models/: Core ViT architecture implementations.
- dinov3/layers/: Custom MLX layers (RoPE, SwiGLU, Attention, etc.).
- dinov3/checkpoints/: Weight conversion and utility scripts.
- main.py: Entry point for testing.
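The usage example feeds random data to the model. For a real image, the input has to be turned into a normalized NHWC batch first. The sketch below shows one way to do that in NumPy and also works out how many tokens the transformer would see for the example's settings. The ImageNet mean and standard deviation, the helper names, and the token breakdown (patches plus one CLS token plus storage tokens) are assumptions, not values confirmed by this repository.

```python
import numpy as np

# Assumed normalization statistics (standard ImageNet values).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def preprocess(image_u8):
    """Convert a uint8 (H, W, 3) image to a float32 (1, H, W, 3) NHWC batch."""
    x = image_u8.astype(np.float32) / 255.0
    x = (x - MEAN) / STD
    return x[None, ...]  # add the batch dimension

def num_tokens(height, width, patch_size=16, n_storage_tokens=4):
    """Patch tokens + 1 CLS token + storage tokens (assumed breakdown)."""
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size) + 1 + n_storage_tokens

# Stand-in for a decoded RGB image that has already been resized to 224x224.
image = np.zeros((224, 224, 3), dtype=np.uint8)
batch = preprocess(image)
print(batch.shape)           # (1, 224, 224, 3)
print(num_tokens(224, 224))  # 201
```

The resulting array can be passed to MLX with `mx.array(batch)`. No channel transpose is needed because the layout is already NHWC.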
This implementation is based on the official DINOv3 research by Meta AI. Special thanks to the MLX team for providing the framework that makes this possible.