This repository contains a three-stage training pipeline:
- Baseline YOLO training (standard Ultralytics training)
- Contrastive pretraining (YOLO image encoder + LLM text encoder, CLIP-style)
- LLM-guided fine-tuning (modified Ultralytics trainer)
## Requirements

- Python 3.8+
- PyTorch
- Ultralytics (official + modified version for Stage 3)
## Installation

Install dependencies (example):

```bash
pip install ultralytics torch torchvision torchaudio
```

## Stage 1: Baseline YOLO Training

Train a vanilla YOLO detector using the official Ultralytics pipeline.
```python
from ultralytics import YOLO

model = YOLO("yolo11m.yaml")
train_results = model.train(
    data="/chest.yaml",
    epochs=500,
    imgsz=640,
    device="0",
)
```

`chest.yaml` should define `train`, `val`, and `names`. `yolo11m.yaml` can be replaced with other YOLO configs depending on compute resources.
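A minimal `chest.yaml` might look like the following (the paths and class names here are illustrative placeholders, not the repo's actual dataset layout):

```yaml
# Illustrative Ultralytics dataset config; adapt paths and names to your data.
path: /data/chest          # dataset root
train: images/train        # training images, relative to path
val: images/val            # validation images, relative to path
names:
  0: nodule
  1: pneumonia
  2: cardiomegaly
```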
## Stage 2: Contrastive Pretraining

We perform CLIP-style contrastive pretraining between image features (YOLO backbone) and text features (LLM encoder) using PadChest ROI-level data.
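The CLIP-style objective aligns each ROI image embedding with its paired sentence embedding via a symmetric InfoNCE loss. A minimal sketch (function and variable names are illustrative, not the repo's API; the temperature matches the `--temperature 0.07` flag):

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired (image, text) embeddings."""
    # L2-normalize so dot products become cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # (B, B) similarity matrix, scaled by temperature.
    logits = image_emb @ text_emb.t() / temperature
    # Matching image/text pairs lie on the diagonal.
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)   # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random features standing in for encoder outputs.
img = torch.randn(16, 512)
txt = torch.randn(16, 512)
loss = clip_contrastive_loss(img, txt)
```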
```bash
python pretraining.py \
    --csv_path /root/autodl-tmp/dataset/PadChest-GR-yolo-6labels/roi256/roi256_box_sentence.csv \
    --batch_size 16 \
    --epochs 20 \
    --lr 5e-5 \
    --weight_decay 1e-2 \
    --temperature 0.07 \
    --device cuda:0 \
    --textencoder llama2 \
    --llama_rep Llama-2-7b-chat-hf \
    --context True \
    --context_length 8 \
    --n_prompts 2
```

## Stage 3: LLM-Guided Fine-Tuning

Fine-tune the YOLO detector using LLM-guided features.
```bash
python train.py
```

- The Ultralytics trainer has been modified to support text-guided modules.
- Pretrained weights from Stage 2 can be loaded for initialization.
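Initializing from Stage 2 typically means copying only the checkpoint tensors whose names and shapes match the detector's backbone, leaving the detection heads randomly initialized. A hedged sketch of that filtering step (the helper and the toy modules are illustrative, not the repo's actual loading code):

```python
import torch
import torch.nn as nn

def load_matching_weights(model: nn.Module, ckpt_state: dict) -> int:
    """Copy checkpoint tensors into `model` wherever name and shape match.

    Returns the number of tensors loaded. Non-strict loading like this lets a
    contrastively pretrained backbone initialize a detector whose heads have
    no counterpart in the checkpoint.
    """
    model_state = model.state_dict()
    matched = {k: v for k, v in ckpt_state.items()
               if k in model_state and v.shape == model_state[k].shape}
    model.load_state_dict(matched, strict=False)
    return len(matched)

# Toy demonstration: a "backbone" checkpoint partially initializes a larger model.
backbone = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 8))
detector = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 8),
                         nn.Linear(8, 2))  # extra "head" layer stays random
n = load_matching_weights(detector, backbone.state_dict())
```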
## Quick Start

Baseline training:

```bash
python -c "from ultralytics import YOLO; YOLO('yolo11m.yaml').train(data='/chest.yaml', epochs=500, imgsz=640, device='0')"
```

Contrastive pretraining:

```bash
python pretraining.py --device cuda:0
```

LLM-guided fine-tuning:

```bash
python train.py
```