A reimplementation of the CLIP fine-tuning stage from *PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain?* (Eslami et al., Findings of EACL 2023).
The model fine-tunes OpenAI's CLIP on the ROCO dataset (Radiology Objects in Context) — over 65k radiology image–caption pairs in the training split alone — using symmetric contrastive learning to align medical images with their clinical descriptions.
Figure 1: Training objective (Panel A) and downstream VQA use-case (Panel B). Figure from Eslami et al., EACL 2023.
Panel A shows what this repo implements: a radiology image and its caption are encoded independently by the CLIP vision and text encoders, then a scaled pairwise cosine similarity loss aligns matching pairs and separates non-matching ones.
- Base model: `openai/clip-vit-base-patch32` (via Hugging Face Transformers)
- Training objective: symmetric cross-entropy over the image–text similarity matrix, i.e. `loss = 0.5 * language_loss + 0.5 * vision_loss`
- Data augmentation: random resized crop + horizontal flip (train); center crop (val/test)
- Optimizer: Adam, lr=1e-5, weight decay=5e-4
- LR scheduler: ReduceLROnPlateau
- Early stopping: patience=5 epochs
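The symmetric objective above can be sketched in a few lines of PyTorch. This is a minimal illustration of the loss described in the bullet list, not the repo's actual API (the function name and signature here are made up for the example):

```python
import torch
import torch.nn.functional as F

def symmetric_clip_loss(image_embeds, text_embeds, logit_scale):
    """Symmetric cross-entropy over the scaled image-text cosine
    similarity matrix (illustrative sketch, not the repo's API)."""
    # L2-normalise so the dot product below is cosine similarity
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # (batch, batch) matrix of scaled pairwise cosine similarities
    logits_per_image = logit_scale * image_embeds @ text_embeds.t()
    logits_per_text = logits_per_image.t()

    # matching image-caption pairs sit on the diagonal
    targets = torch.arange(image_embeds.size(0))

    vision_loss = F.cross_entropy(logits_per_image, targets)
    language_loss = F.cross_entropy(logits_per_text, targets)
    return 0.5 * language_loss + 0.5 * vision_loss
```

Cross-entropy over rows pulls each image toward its own caption and pushes it away from every other caption in the batch; transposing and averaging makes the objective symmetric in the two modalities.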
ROCO — radiology images sourced from PubMed with associated captions.
| Split | Samples |
|---|---|
| Train | 65,391 |
| Validation | 8,169 |
| Test | 8,172 |
```
roco-image-captioning/
├── config_generator.py             # Generate config.yaml
├── requirements.txt
├── src/
│   ├── lib/
│   │   ├── dataset/
│   │   │   └── ROCODataset.py      # PyTorch dataset (ROCO JSONL)
│   │   ├── data_transform/
│   │   │   └── collate.py          # CLIP-aware collate function
│   │   └── model/
│   │       └── clip.py             # MedCLIP Lightning module
│   └── main/
│       ├── train.py                # Training entry point
│       └── create_json.py          # CSV → JSONL converter
└── utils/
    └── monitoring_tools.py         # Output directory management
```
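A JSONL-backed dataset like the one in `src/lib/dataset/ROCODataset.py` can be sketched as follows. This is an assumption-laden illustration: the field names `image_path` and `caption` are hypothetical, and the real class may load and transform images rather than return raw records:

```python
import json
from torch.utils.data import Dataset

class ROCOJsonlDataset(Dataset):
    """Sketch of a JSONL-backed ROCO dataset.

    Assumes one JSON object per line with hypothetical
    "image_path" and "caption" keys; the repo's ROCODataset.py
    may use different field names and return decoded images.
    """

    def __init__(self, jsonl_path):
        with open(jsonl_path) as f:
            self.records = [json.loads(line) for line in f if line.strip()]

    def __len__(self):
        return len(self.records)

    def __getitem__(self, idx):
        rec = self.records[idx]
        return rec["image_path"], rec["caption"]
```

In the real pipeline, a collate function (here `collate.py`) would batch these pairs through CLIP's processor so images and captions arrive as tensors.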
```shell
pip install -r requirements.txt
```

1. Prepare the dataset JSONs (once, from the raw ROCO CSVs):

   ```shell
   python src/main/create_json.py \
       --train-path roco-dataset/data/train \
       --validation-path roco-dataset/data/validation \
       --test-path roco-dataset/data/test \
       --json-output-path roco-dataset/json
   ```

2. Generate a config file (edit the paths inside first):

   ```shell
   python config_generator.py
   ```

3. Train:

   ```shell
   python src/main/train.py --config config.yaml
   ```

Training logs are tracked with Weights & Biases. Checkpoints are saved to `outputs/model_checkpoints/`.
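The optimization setup listed above (Adam, `lr=1e-5`, weight decay `5e-4`, `ReduceLROnPlateau`) maps directly onto standard PyTorch calls. A minimal sketch, assuming the function name and the use of PyTorch's scheduler defaults (the repo's Lightning module may configure `factor`/`patience` differently, and early stopping with patience 5 would live in a separate Lightning callback):

```python
import torch

def build_optimizer_and_scheduler(model):
    """Illustrative setup matching the hyperparameters listed above.

    Function name is hypothetical; early stopping (patience=5 epochs)
    is handled separately, e.g. by a Lightning EarlyStopping callback.
    """
    optimizer = torch.optim.Adam(
        model.parameters(), lr=1e-5, weight_decay=5e-4
    )
    # Reduce the LR when the monitored validation loss stops improving
    # (factor/patience left at PyTorch defaults here as an assumption).
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min"
    )
    return optimizer, scheduler
```

Each epoch you would call `scheduler.step(val_loss)` with the validation loss so the scheduler can decide when to shrink the learning rate.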
```bibtex
@inproceedings{eslami-etal-2023-pubmedclip,
    title = "{P}ub{M}ed{CLIP}: How Much Does {CLIP} Benefit Visual Question Answering in the Medical Domain?",
    author = "Eslami, Sedigheh and
      de Melo, Gerard and
      Meinel, Christoph",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    year = "2023",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.findings-eacl.88",
    pages = "1181--1193",
}
```