A reimplementation of the CLIP fine-tuning stage from *PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain?* (Eslami et al., Findings of EACL 2023).
The model fine-tunes OpenAI's CLIP on the ROCO dataset (Radiology Objects in Context) — over 65k radiology image–caption pairs in the training split alone — using symmetric contrastive learning to align medical images with their clinical descriptions.
Figure 1: Training objective (Panel A) and downstream VQA use-case (Panel B). Figure from Eslami et al., EACL 2023.
Panel A shows what this repo implements: a radiology image and its caption are encoded independently by the CLIP vision and text encoders, then a scaled pairwise cosine similarity loss aligns matching pairs and separates non-matching ones.
- Base model: `openai/clip-vit-base-patch32` (via Hugging Face Transformers)
- Training objective: symmetric cross-entropy over the image–text similarity matrix, i.e. `loss = 0.5 * language_loss + 0.5 * vision_loss`
- Data augmentation: random resized crop + horizontal flip (train); center crop (val/test)
- Optimizer: Adam, lr=1e-5, weight decay=5e-4
- LR scheduler: ReduceLROnPlateau
- Early stopping: patience=5 epochs
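The symmetric objective above can be sketched in a few lines of PyTorch. This is a minimal illustration of the loss described in the bullet list, not the repo's actual API (the function name and signature here are made up for the example):

```python
import torch
import torch.nn.functional as F

def symmetric_clip_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric cross-entropy over the scaled image-text cosine
    similarity matrix (illustrative sketch, not the repo's API)."""
    # L2-normalise so the dot product below is cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) matrix of scaled pairwise cosine similarities
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # matching image-caption pairs sit on the diagonal
    targets = torch.arange(image_embeds.size(0))

    vision_loss = F.cross_entropy(logits_per_image, targets)
    language_loss = F.cross_entropy(logits_per_text, targets)
    return 0.5 * language_loss + 0.5 * vision_loss
```

Cross-entropy over rows pulls each image toward its own caption and pushes it away from every other caption in the batch; transposing and averaging makes the objective symmetric in the two modalities.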
ROCO — radiology images sourced from PubMed with associated captions.
| Split | Samples |
|---|---|
| Train | 65,391 |
| Validation | 8,169 |
| Test | 8,172 |
```
roco-image-captioning/
├── config_generator.py             # Generate config.yaml
├── requirements.txt
├── src/
│   ├── lib/
│   │   ├── dataset/
│   │   │   └── ROCODataset.py      # PyTorch dataset (ROCO JSONL)
│   │   ├── data_transform/
│   │   │   └── collate.py          # CLIP-aware collate function
│   │   └── model/
│   │       └── clip.py             # MedCLIP Lightning module
│   └── main/
│       ├── train.py                # Training entry point
│       └── create_json.py          # CSV → JSONL converter
└── utils/
    └── monitoring_tools.py         # Output directory management
```
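A JSONL-backed dataset like the one in `src/lib/dataset/ROCODataset.py` can be sketched as follows. This is an assumption-laden illustration: the field names `image_path` and `caption` are hypothetical, and the real class may load and transform images rather than return raw records:

```python
import json
from torch.utils.data import Dataset

class ROCOJsonlDataset(Dataset):
    """Sketch of a JSONL-backed ROCO dataset.

    Assumes one JSON object per line with hypothetical
    "image_path" and "caption" keys; the repo's ROCODataset.py
    may use different field names and return decoded images.
    """

    def __init__(self, jsonl_path):
        with open(jsonl_path) as f:
            self.records = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        return rec["image_path"], rec["caption"]
```

In the real pipeline, a collate function (here `collate.py`) would batch these pairs through CLIP's processor so images and captions arrive as tensors.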
```shell
pip install -r requirements.txt
```

1. Prepare the dataset JSONs (once, from the raw ROCO CSVs):

   ```shell
   python src/main/create_json.py \
       --train-path roco-dataset/data/train \
       --validation-path roco-dataset/data/validation \
       --test-path roco-dataset/data/test \
       --json-output-path roco-dataset/json
   ```

2. Generate a config file (edit the paths inside first):

   ```shell
   python config_generator.py
   ```

3. Train:

   ```shell
   python src/main/train.py --config config.yaml
   ```

Training logs are tracked with Weights & Biases. Checkpoints are saved to `outputs/model_checkpoints/`.
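The optimization setup listed above (Adam, `lr=1e-5`, weight decay `5e-4`, `ReduceLROnPlateau`) maps directly onto standard PyTorch calls. A minimal sketch, assuming the function name and the use of PyTorch's scheduler defaults (the repo's Lightning module may configure `factor`/`patience` differently, and early stopping with patience 5 would live in a separate Lightning callback):

```python
import torch

def build_optimizer_and_scheduler(model):
    """Illustrative setup matching the hyperparameters listed above.

    Function name is hypothetical; early stopping (patience=5 epochs)
    is handled separately, e.g. by a Lightning EarlyStopping callback.
    """
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-5, weight_decay=5e-4
    )
    # Reduce the LR when the monitored validation loss stops improving
    # (factor/patience left at PyTorch defaults here as an assumption).
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min"
    )
    return optimizer, scheduler
```

Each epoch you would call `scheduler.step(val_loss)` with the validation loss so the scheduler can decide when to shrink the learning rate.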
```bibtex
@inproceedings{eslami-etal-2023-pubmedclip,
    title = "{P}ub{M}ed{CLIP}: How Much Does {CLIP} Benefit Visual Question Answering in the Medical Domain?",
    author = "Eslami, Sedigheh and
      de Melo, Gerard and
      Meinel, Christoph",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    year = "2023",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.88",
    pages = "1181--1193",
}
```