fcoelhomrc/roco-image-captioning
Fine-tuning CLIP on medical images

A reimplementation of the CLIP fine-tuning stage from PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain? (Eslami et al., EACL 2023 Findings).

This project fine-tunes OpenAI's CLIP on the ROCO dataset (Radiology Objects in Context), a collection of 65k+ radiology image–caption pairs, using symmetric contrastive learning to align medical images with their clinical descriptions.

Overview

Figure 1 from Eslami et al., EACL 2023: the training objective (Panel A) and the downstream VQA use case (Panel B).

Panel A shows what this repo implements: a radiology image and its caption are encoded independently by the CLIP vision and text encoders, then a scaled pairwise cosine similarity loss aligns matching pairs and separates non-matching ones.

Approach

  • Base model: openai/clip-vit-base-patch32 (via HuggingFace Transformers)
  • Training objective: symmetric cross-entropy over the image–text similarity matrix
    • loss = 0.5 * language_loss + 0.5 * vision_loss
  • Data augmentation: random resized crop + horizontal flip (train); center crop (val/test)
  • Optimizer: Adam, lr=1e-5, weight decay=5e-4
  • LR scheduler: ReduceLROnPlateau
  • Early stopping: patience=5 epochs
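
The training objective above (symmetric cross-entropy over the image–text similarity matrix, weighted 0.5/0.5) can be sketched in PyTorch. This is an illustrative sketch, not code from the repo; the function name and tensor names are assumptions:

```python
import torch
import torch.nn.functional as F

def symmetric_clip_loss(image_embeds, text_embeds, logit_scale):
    """CLIP-style contrastive loss: loss = 0.5 * language_loss + 0.5 * vision_loss."""
    # Normalize embeddings so dot products become cosine similarities
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Scaled pairwise similarities: logits[i, j] = scale * cos(image_i, text_j)
    logits = logit_scale * image_embeds @ text_embeds.t()

    # Matching image-caption pairs sit on the diagonal of the batch
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy over rows (image -> text) and over columns (text -> image)
    vision_loss = F.cross_entropy(logits, targets)
    language_loss = F.cross_entropy(logits.t(), targets)
    return 0.5 * language_loss + 0.5 * vision_loss
```

Within a batch, each image's caption is its positive and all other captions act as negatives, which is why the loss pulls matching pairs together and pushes non-matching pairs apart.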

Dataset

ROCO — radiology images sourced from PubMed with associated captions.

| Split      | Samples |
|------------|---------|
| Train      | 65,391  |
| Validation | 8,169   |
| Test       | 8,172   |

Project Structure

roco-image-captioning/
├── config_generator.py          # Generate config.yaml
├── requirements.txt
├── src/
│   ├── lib/
│   │   ├── dataset/
│   │   │   └── ROCODataset.py   # PyTorch dataset (ROCO JSONL)
│   │   ├── data_transform/
│   │   │   └── collate.py       # CLIP-aware collate function
│   │   └── model/
│   │       └── clip.py          # MedCLIP Lightning module
│   └── main/
│       ├── train.py             # Training entry point
│       └── create_json.py       # CSV → JSONL converter
└── utils/
    └── monitoring_tools.py      # Output directory management

Setup

pip install -r requirements.txt

Usage

1. Prepare the dataset JSONL files (once, from the raw ROCO CSVs):

python src/main/create_json.py \
  --train-path      roco-dataset/data/train \
  --validation-path roco-dataset/data/validation \
  --test-path       roco-dataset/data/test \
  --json-output-path roco-dataset/json

2. Generate a config file (edit paths inside first):

python config_generator.py
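
The repo listing does not show the generated config.yaml, so the sketch below is illustrative only: every key name is an assumption, and the actual schema is defined by config_generator.py. The values are taken from the Approach section above.

```yaml
# Illustrative config sketch -- real key names come from config_generator.py
model_name: openai/clip-vit-base-patch32
json_path: roco-dataset/json
output_path: outputs
optimizer:
  lr: 1.0e-5
  weight_decay: 5.0e-4
training:
  scheduler: ReduceLROnPlateau
  early_stopping_patience: 5
```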

3. Train:

python src/main/train.py --config config.yaml

Training logs are tracked with Weights & Biases. Checkpoints are saved to outputs/model_checkpoints/.

Citation

@inproceedings{eslami-etal-2023-pubmedclip,
    title     = "{P}ub{M}ed{CLIP}: How Much Does {CLIP} Benefit Visual Question Answering in the Medical Domain?",
    author    = "Eslami, Sedigheh  and
                 de Melo, Gerard  and
                 Meinel, Christoph",
    booktitle = "Findings of the Association for Computational Linguistics: EACL 2023",
    year      = "2023",
    publisher = "Association for Computational Linguistics",
    url       = "https://aclanthology.org/2023.findings-eacl.88",
    pages     = "1181--1193",
}

About

VLM-based captioning and visual question answering for medical images
