DocMMIR is a comprehensive framework for document-level multimodal information retrieval. This repository contains:
- Training and evaluation code for multimodal retrieval models
- Data processing scripts for ArXiv, Wikipedia, and Slide datasets
- Multiple baseline models (CLIP, BLIP, ColPali, VisualBERT, etc.)
- Evaluation metrics and analysis tools
- Question-answer generation pipeline
- Support for multiple multimodal encoders (CLIP, BLIP, VisualBERT, ColPali, etc.)
- Flexible fusion strategies (weighted sum, MLP, attention)
- Document-level retrieval with both text and image modalities
- Distributed training with PyTorch Lightning
- Comprehensive evaluation metrics (MRR, Recall@K, NDCG)
- Neptune.ai integration for experiment tracking
- Cumulative learning experiments
- Python 3.8+
- CUDA-capable GPU (recommended)
- Git LFS (for downloading large model checkpoints)
# Clone the repository
git clone https://github.com/J1mL1/DocMMIR.git
cd DocMMIR
# Create a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # On Windows: venv\Scripts\activate
# Install the package in development mode
pip install -e .
# Or install dependencies directly
pip install -r requirements.txt

DocMMIR/
├── src/ # Source code
│ ├── dataset.py # Dataset loading and preprocessing
│ ├── retrieval_model.py # Main retrieval model (PyTorch Lightning)
│ ├── fusion_module.py # Multimodal fusion strategies
│ ├── metrics.py # Evaluation metrics
│ ├── utils.py # Utility functions
│ ├── encoders/ # Text and image encoders
│ │ ├── text_model.py # Text encoder wrapper
│ │ └── image_model.py # Image encoder wrapper
│ ├── models/ # Baseline model implementations
│ │ ├── clip_model.py # CLIP-based model
│ │ ├── blip_model.py # BLIP-based model
│ │ ├── colpali_model.py # ColPali model
│ │ ├── visualBert_model.py # VisualBERT model
│ │ ├── e5v_model.py # E5V model
│ │ ├── vlm2vec_model.py # VLM2Vec model
│ │ └── ... # Other baselines
│ ├── train/ # Training scripts
│ │ ├── train.py # Standard training
│ │ └── train_cumulative.py # Cumulative learning experiments
│ └── test/ # Testing scripts
│ ├── test.py # Model evaluation
│ ├── compute_embeddings.py # Precompute document embeddings
│ └── batch_embedding.py # Batch embedding computation
├── scripts/ # Data processing scripts
│ ├── arxiv_scripts/ # ArXiv dataset processing
│ ├── wikipages_scripts/ # Wikipedia dataset processing
│ ├── slides_scripts/ # Slides dataset processing
│ └── ... # Utility scripts
├── qa_generation/ # Question-answer generation
│ ├── qwen7b_qa_generate.py # QA generation with Qwen
│ ├── qwen_perplexity.py # Quality evaluation
│ └── split_json.py # Data splitting
├── outputs/ # Model outputs and checkpoints
├── requirements.txt # Python dependencies
├── setup.py # Package setup
└── README.md # This file
Download the DocMMIR dataset from Hugging Face:
# Install Hugging Face CLI
pip install huggingface_hub
# Download dataset
huggingface-cli download Lord-Jim/DocMMIR --repo-type dataset --local-dir ./data/DocMMIR
# Extract images
cd data/DocMMIR
cat archives/image.tar.gz.part* > image.tar.gz
tar -xzf image.tar.gz
rm image.tar.gz
cd ../..
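
Optionally, you can confirm the extraction before training. A minimal Python check, assuming the archive unpacks into the `media/` directory that the commands below pass via `--image_dir`:

```python
from pathlib import Path

media_dir = Path("data/DocMMIR/media")  # image directory used by the training/testing commands below

# Count the extracted image files as a quick sanity check
num_images = sum(1 for p in media_dir.rglob("*") if p.is_file())
print(f"Extracted image files: {num_images}")
```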
# Example: Train a CLIP-based model
python src/train/train.py \
--text_model bert-base-uncased \
--image_model openai/clip-vit-large-patch14 \
--data_path data/DocMMIR/full_dataset.json \
--image_dir data/DocMMIR/media \
--fusion_strategy weighted_sum \
--batch_size 32 \
--max_epochs 10 \
--devices 4 \
--experiment_name clip_vit_l_14_full

Key Training Parameters:
- `--text_model`: Text encoder (e.g., bert-base-uncased, roberta-base)
- `--image_model`: Image encoder (e.g., openai/clip-vit-large-patch14)
- `--fusion_strategy`: Fusion method (weighted_sum, mlp, attention)
- `--batch_size`: Training batch size
- `--devices`: Number of GPUs
- `--experiment_name`: Experiment identifier
For all training options:
python src/train/train.py --help

# Evaluate on test set
python src/test/test.py \
--text_model bert-base-uncased \
--image_model openai/clip-vit-large-patch14 \
--checkpoint_path outputs/checkpoints/your_model/best.ckpt \
--data_path data/DocMMIR/full_dataset.json \
--image_dir data/DocMMIR/media \
--fusion_strategy weighted_sum \
--batch_size 64
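
Evaluation uses MRR, Recall@K, and NDCG (see `src/metrics.py` in the project structure). For reference, a minimal, self-contained sketch of how these can be computed from the 1-based rank of the gold document for each query; this is illustrative only, and the repository's implementation may differ:

```python
import math

def mrr(ranks):
    """Mean Reciprocal Rank over the 1-based rank of the gold document per query."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def recall_at_k(ranks, k):
    """Fraction of queries whose gold document appears in the top-k results."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

def ndcg_at_k(ranks, k):
    """Binary-relevance NDCG@k: with one gold document the ideal DCG is 1."""
    return sum(1.0 / math.log2(r + 1) for r in ranks if r <= k) / len(ranks)

# Example: gold document ranked 1st, 3rd, and 12th for three queries
ranks = [1, 3, 12]
print(mrr(ranks), recall_at_k(ranks, 10), ndcg_at_k(ranks, 10))
```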
# Zero-shot evaluation (no fine-tuning)
python src/test/test.py \
--text_model bert-base-uncased \
--image_model openai/clip-vit-large-patch14 \
--data_path data/DocMMIR/full_dataset.json \
--image_dir data/DocMMIR/media \
--fusion_strategy weighted_sum \
--zero_shot \
--batch_size 64

The repository includes various multimodal baseline models:
| Model | File | Description |
|---|---|---|
| CLIP | `clip_model.py` | Contrastive Language-Image Pre-training |
| BLIP | `blip_model.py` | Bootstrapping Language-Image Pre-training |
| ColPali | `colpali_model.py` | Contextualized Late Interaction for Documents |
| VisualBERT | `visualBert_model.py` | Vision and Language BERT |
| E5V | `e5v_model.py` | E5 with Vision capabilities |
| VLM2Vec | `vlm2vec_model.py` | Vision-Language Model to Vector |
| MARVEL | `MARVEL.py` | Multimodal Representation Learning |
| SigLIP | `siglip2_model.py` | Sigmoid Loss for Language-Image Pre-training |
| BERT | `bert_model.py` | Text-only BERT baseline |
| ALIGN | `align_model.py` | Large-scale noisy image-text alignment |
Each model can be used as a backbone for the retrieval framework.
python src/train/train.py \
--text_model bert-base-uncased \
--image_model openai/clip-vit-large-patch14 \
--data_path data/DocMMIR/full_dataset.json \
--image_dir data/DocMMIR/media \
--fusion_strategy weighted_sum \
--batch_size 32 \
--max_epochs 10 \
--lr_text 2e-5 \
--lr_image 2e-5 \
--devices 4 \
--strategy ddp \
--experiment_name my_experiment

| Argument | Description | Default |
|---|---|---|
| `--text_model` | Text encoder model name | Required |
| `--image_model` | Image encoder model name | Required |
| `--data_path` | Path to training JSON file | Required |
| `--image_dir` | Directory containing images | Required |
| `--fusion_strategy` | Fusion method (weighted_sum, mlp, attention) | weighted_sum |
| `--batch_size` | Training batch size | 32 |
| `--max_epochs` | Maximum training epochs | 10 |
| `--lr_text` | Learning rate for text encoder | 2e-5 |
| `--lr_image` | Learning rate for image encoder | 2e-5 |
| `--weight_decay` | Weight decay for optimization | 0.01 |
| `--loss_type` | Loss function (infonce, bce); see the sketch below | infonce |
| `--devices` | Number of GPUs | 1 |
| `--strategy` | Training strategy (ddp, dp, fsdp) | ddp |
| `--precision` | Training precision (32, 16, bf16) | 32 |
| `--experiment_name` | Experiment identifier | Required |
| `--output_dir` | Output directory for checkpoints | outputs/checkpoints |
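
For reference, the default `--loss_type infonce` corresponds to an InfoNCE-style contrastive objective that treats the other documents in a batch as negatives. Below is a minimal PyTorch sketch over L2-normalized query and document embeddings; it is illustrative only, and details such as the temperature value may differ from the repository's implementation:

```python
import torch
import torch.nn.functional as F

def info_nce_loss(query_emb, doc_emb, temperature=0.07):
    """InfoNCE over in-batch negatives: the i-th query should match the i-th document."""
    query_emb = F.normalize(query_emb, dim=-1)
    doc_emb = F.normalize(doc_emb, dim=-1)
    logits = query_emb @ doc_emb.T / temperature          # (B, B) similarity matrix
    targets = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, targets)

# Example with random embeddings
q, d = torch.randn(8, 512), torch.randn(8, 512)
print(info_nce_loss(q, d))
```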
python src/test/test.py \
--text_model bert-base-uncased \
--image_model openai/clip-vit-large-patch14 \
--checkpoint_path outputs/checkpoints/my_experiment/best.ckpt \
--data_path data/DocMMIR/test.json \
--image_dir data/DocMMIR/media \
--fusion_strategy weighted_sum \
--batch_size 64 \
--devices 1

Evaluate pretrained models without fine-tuning:
python src/test/test.py \
--text_model bert-base-uncased \
--image_model openai/clip-vit-large-patch14 \
--data_path data/DocMMIR/test.json \
--image_dir data/DocMMIR/media \
--fusion_strategy weighted_sum \
--zero_shot \
--batch_size 64

For large-scale retrieval across all documents:
# Step 1: Precompute document embeddings
python src/test/compute_embeddings.py \
--checkpoint_path outputs/checkpoints/my_experiment/best.ckpt \
--data_path data/DocMMIR/full_dataset.json \
--image_dir data/DocMMIR/media \
--output_dir embeddings/my_experiment
# Step 2: Run global retrieval test
python src/test/test.py \
--checkpoint_path outputs/checkpoints/my_experiment/best.ckpt \
--is_global_test \
--doc_dir embeddings/my_experiment \
--batch_size 64
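
Conceptually, the global test ranks every document against each query by embedding similarity. A minimal PyTorch sketch of that retrieval step; the on-disk embedding format produced by compute_embeddings.py is not assumed here, so random tensors stand in for the precomputed embeddings:

```python
import torch
import torch.nn.functional as F

# Stand-in tensors: query embeddings (Q, D) and document embeddings (N, D)
query_emb = F.normalize(torch.randn(4, 512), dim=-1)
doc_emb = F.normalize(torch.randn(1000, 512), dim=-1)

# Cosine similarity of every query against every document, then take the top-k
scores = query_emb @ doc_emb.T                    # (Q, N)
topk_scores, topk_indices = scores.topk(k=10, dim=-1)

print(topk_indices[0])  # indices of the 10 highest-scoring documents for the first query
```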
Train and test on individual domains:

# ArXiv domain
python src/train/train.py \
--data_path data/DocMMIR/arxiv_train.json \
--experiment_name arxiv_only \
...
# Wikipedia domain
python src/train/train.py \
--data_path data/DocMMIR/wiki_train.json \
--experiment_name wiki_only \
...
# Slides domain
python src/train/train.py \
--data_path data/DocMMIR/slide_train.json \
--experiment_name slide_only \
...

Test with single modality:
# Text-only retrieval
python src/train/train.py \
--modality text_only \
--text_model bert-base-uncased \
--data_path data/DocMMIR/full_dataset.json \
--experiment_name text_only_ablation
# Image-only retrieval
python src/train/train.py \
--modality image_only \
--image_model openai/clip-vit-large-patch14 \
--data_path data/DocMMIR/full_dataset.json \
--image_dir data/DocMMIR/media \
--experiment_name image_only_ablation

The framework supports various encoder models. Model selection is based on keywords in the model name:
| Model | Keyword Trigger | Example Model Names | Text Encoder | Image Encoder |
|---|---|---|---|---|
| CLIP | `vit` | ViT-B-32, ViT-L-14, openai/clip-vit-base-patch32 | Yes | Yes |
| SigLIP2 | `siglip2` | siglip2-base, google/siglip-base-patch16-224 | Yes | Yes |
| BLIP | `blip` | Salesforce/blip-image-captioning-base | Yes | Yes |
| VLM2Vec | `vlm2vec` | vlm2vec-base, vlm2vec-large | Yes | Yes |
| ALIGN | `align` | align-base, kakaobrain/align-base | Yes | Yes |
| E5V | `e5v` | e5v-base, e5v-large | Yes | Yes |
| ColPali | `colpali` | vidore/colpali | Yes | Yes |
| Model | Keyword Trigger | Example Model Names | Notes |
|---|---|---|---|
| BERT | `bert` | bert-base-uncased, bert-large-uncased, roberta-base | Supports BERT and RoBERTa variants |
- Model Name Matching: The framework detects which model to use based on keywords in the model name string (case-insensitive); see the sketch after this list
- Priority Order: Checks are performed in the order listed in the code. If multiple keywords match, the first match is used
- Pretrained Weights: Use `--pretrained_weights` to specify custom checkpoint paths
- Custom Models: To add new models, implement them in `src/models/` and register them in the encoder files
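
To illustrate the keyword matching, here is a minimal sketch of how such dispatch might look. The keyword order below is hypothetical; the actual checks live in the encoder wrappers under `src/encoders/`:

```python
# Hypothetical illustration of keyword-based model selection (checked in order, first match wins)
MODEL_KEYWORDS = ["siglip2", "blip", "vlm2vec", "align", "e5v", "colpali", "vit", "bert"]

def detect_backbone(model_name: str) -> str:
    name = model_name.lower()  # matching is case-insensitive
    for keyword in MODEL_KEYWORDS:
        if keyword in name:
            return keyword
    raise ValueError(f"No known backbone keyword found in '{model_name}'")

print(detect_backbone("openai/clip-vit-large-patch14"))  # -> "vit" (CLIP)
print(detect_backbone("bert-base-uncased"))              # -> "bert"
```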
# CLIP models (triggered by "vit" keyword)
--text_model ViT-B-32 --image_model ViT-B-32
--text_model ViT-L-14 --image_model ViT-L-14
# BERT text encoder with CLIP image encoder
--text_model bert-base-uncased --image_model ViT-L-14
# SigLIP2 multimodal
--text_model siglip2-base --image_model siglip2-base
# BLIP multimodal
--text_model blip-base --image_model blip-base
# ColPali multimodal
--text_model colpali --image_model colpali
# E5V multimodal
--text_model e5v-base --image_model e5v-base
# VLM2Vec multimodal
--text_model vlm2vec-base --image_model vlm2vec-base
# ALIGN multimodal
--text_model align-base --image_model align-base

The training/testing data should be in JSON format with the following structure:
[
{
"id": "arxiv_12345",
"title": "Deep Learning for Computer Vision",
"class": "computer_science",
"query": "What are the main applications of deep learning in computer vision?",
"texts": [
"Deep learning has revolutionized computer vision...",
"Convolutional neural networks are the backbone...",
"..."
],
"images": [
"arxiv/12345/figure1.jpg",
"arxiv/12345/figure2.jpg"
],
"num_images": 2,
"domain": "arxiv"
},
...
]

- `id`: Unique identifier for the document
- `query`: Query text for retrieval
- `texts`: List of text chunks from the document
- `images`: List of image file paths (relative to dataset root)
- `num_images`: Number of images in the document
- `domain`: Data source (arxiv, wiki, slide)
- `title`: Document title
- `class`: Document category/topic
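
As an illustration of how a single record maps to model inputs, the sketch below joins the text chunks and opens the referenced images with PIL. This is a simplified stand-in for what `src/dataset.py` does during preprocessing; the paths and the joining scheme are assumptions:

```python
import json
from pathlib import Path
from PIL import Image

image_dir = Path("data/DocMMIR/media")

with open("data/DocMMIR/full_dataset.json", "r", encoding="utf-8") as f:
    doc = json.load(f)[0]  # take the first record as an example

query = doc["query"]
document_text = " ".join(doc["texts"])  # simple concatenation of the text chunks
images = [Image.open(image_dir / p).convert("RGB") for p in doc["images"]]

print(doc["id"], doc["domain"], len(doc["texts"]), len(images))
```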
# Data Parallel (DP)
python src/train/train.py --devices 4 --strategy dp ...
# Distributed Data Parallel (DDP) - Recommended
python src/train/train.py --devices 4 --strategy ddp ...
# Fully Sharded Data Parallel (FSDP) - For very large models
python src/train/train.py --devices 8 --strategy fsdp ...

# FP16 (faster, less memory)
python src/train/train.py --precision 16 ...
# BF16 (better numerical stability, requires Ampere+ GPUs)
python src/train/train.py --precision bf16 ...

- Data Loading: Use `--num_workers 8` or higher for faster data loading
- Caching: Precompute embeddings for frequent evaluation
- Mixed Precision: Use `--precision 16` for 2x speedup
- Batch Size: Maximize batch size within GPU memory limits
- Distributed Training: Use multiple GPUs with `--strategy ddp`
python src/train/train.py \
--text_model bert-base-uncased \
--image_model openai/clip-vit-large-patch14 \
--data_path data/DocMMIR/full_dataset.json \
--image_dir data/DocMMIR/media \
--fusion_strategy weighted_sum \
--loss_type infonce \
--batch_size 32 \
--max_epochs 10 \
--devices 4 \
--strategy ddp \
--precision 16 \
--experiment_name clip_vit_l_14_fullset

# Weighted sum fusion
python src/train/train.py --fusion_strategy weighted_sum --experiment_name fusion_ws ...
# MLP fusion
python src/train/train.py --fusion_strategy mlp --experiment_name fusion_mlp ...
# Attention fusion
python src/train/train.py --fusion_strategy attention --experiment_name fusion_attn ...
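
As a reference for what these options mean, one common formulation of weighted-sum fusion is a single learnable mixing weight between the text and image document embeddings. The sketch below is illustrative only; the exact formulations live in `src/fusion_module.py` and may differ:

```python
import torch
import torch.nn as nn

class WeightedSumFusion(nn.Module):
    """Fuse text and image embeddings with a single learnable mixing weight (illustrative)."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable mixing coefficient

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.alpha)                 # keep the weight in (0, 1)
        return w * text_emb + (1.0 - w) * image_emb

fusion = WeightedSumFusion()
doc_emb = fusion(torch.randn(8, 512), torch.randn(8, 512))
print(doc_emb.shape)  # torch.Size([8, 512])
```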
If you use this code or dataset in your research, please cite our paper:

@inproceedings{li-etal-2025-docmmir,
title = "{D}oc{MMIR}: A Framework for Document Multi-modal Information Retrieval",
author = "Li, Zirui and
Wu, Siwei and
Li, Yizhi and
Wang, Xingyu and
Zhou, Yi and
Lin, Chenghua",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.705/",
pages = "13117--13130",
ISBN = "979-8-89176-335-7",
}

- Paper: DocMMIR on arXiv
- Dataset: DocMMIR on Hugging Face
- Code: GitHub Repository
This project is licensed under the MIT License. See the LICENSE file for details.
This project builds upon several excellent open-source projects:
- PyTorch Lightning - Deep learning framework
- Hugging Face Transformers - Pretrained models
- OpenCLIP - CLIP implementations
- Neptune.ai - Experiment tracking
For questions or issues, please:
- Open an issue on GitHub
- Contact the authors via email (see paper)
