This project studies molecular graph captioning: given a molecule’s atomic structure represented as a graph, the goal is to predict a coherent, human-readable textual description of the molecule.
We explore two complementary approaches:
- Retrieval-based captioning, where molecular graphs and textual descriptions are embedded into a shared space and matched via nearest-neighbor search.
- Generative captioning, where a language model directly generates a description conditioned on a graph embedding.
```
.
├── configs/                      # Experiment and model configuration files
├── data/                         # Datasets (graphs, descriptions, splits)
├── output/                       # Trained models, embeddings, predictions
│
├── models/
│   ├── decoding_gpt/             # Graph → text decoding
│   ├── graph_utils/              # Molecular graph construction & utilities
│   ├── molecular_gpt/            # Adapted MolecularGPT components
│   └── retrieval_system/         # Cross-modal graph–text retrieval models
│
├── graph_visualisation.ipynb     # Visualization of molecular graphs
├── plot_results.py               # Plot retrieval and evaluation metrics
├── train_retrieval_system.py     # Train graph–text retrieval model
├── train_decoding_gpt.py         # Train BioGPT-based generative model
├── train_molecular_gpt.py        # MolecularGPT adaptation experiments
├── requirements.txt              # Python dependencies
└── README.md
```
The retrieval system:
- Encodes molecular graphs using GNNs (GCN, GINE, GAT, and mixed variants).
- Encodes textual descriptions using pretrained language models (BERT, PubMedBERT, OpenAI embeddings).
- Aligns graph and text embeddings in a shared space using metric-learning objectives (MSE, Circle Loss, Multi-Similarity Loss).
- Captions are retrieved via nearest-neighbor search in the embedding space.
This approach provides the best overall performance in our experiments.
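For intuition, here is a minimal sketch of the retrieval step, assuming graph and text embeddings have already been trained into a shared space. All names, shapes, and the cosine-similarity choice are illustrative, not the project's actual API:

```python
import torch
import torch.nn.functional as F

def retrieve_captions(graph_emb: torch.Tensor,
                      text_emb: torch.Tensor,
                      captions: list[str],
                      k: int = 1) -> list[list[str]]:
    """For each graph embedding, return the k nearest captions by cosine similarity."""
    g = F.normalize(graph_emb, dim=-1)   # (num_graphs, d)
    t = F.normalize(text_emb, dim=-1)    # (num_captions, d)
    sim = g @ t.T                        # pairwise cosine similarities
    topk = sim.topk(k, dim=-1).indices   # (num_graphs, k) caption indices
    return [[captions[j] for j in row] for row in topk.tolist()]

# Toy usage with random embeddings:
graph_emb = torch.randn(4, 256)
text_emb = torch.randn(10, 256)
captions = [f"description {i}" for i in range(10)]
print(retrieve_captions(graph_emb, text_emb, captions, k=3))
```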
The generative system:
- Uses a trained graph encoder to produce molecular embeddings.
- Projects graph embeddings into a prefix representation.
- Fine-tunes a pretrained BioGPT model to generate molecular descriptions autoregressively.
This approach produces fluent and chemically plausible descriptions but is less reliable than retrieval under limited supervision.
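To illustrate the prefix mechanism, here is a minimal training-step sketch, assuming a frozen graph encoder that outputs one vector per molecule. The `GraphPrefix` module, its dimensions, and the example captions are assumptions made for illustration; the actual script may differ:

```python
import torch
import torch.nn as nn
from transformers import BioGptForCausalLM, BioGptTokenizer

class GraphPrefix(nn.Module):
    """Maps a graph embedding to `prefix_len` pseudo-token embeddings."""
    def __init__(self, graph_dim: int, lm_dim: int, prefix_len: int = 8):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.proj = nn.Linear(graph_dim, prefix_len * lm_dim)

    def forward(self, graph_emb: torch.Tensor) -> torch.Tensor:
        # (batch, graph_dim) -> (batch, prefix_len, lm_dim)
        return self.proj(graph_emb).view(-1, self.prefix_len, self.lm_dim)

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
prefix = GraphPrefix(graph_dim=256, lm_dim=model.config.hidden_size)

graph_emb = torch.randn(2, 256)  # stand-in for the trained graph encoder's output
tokens = tokenizer(["an aromatic ether", "a fatty acid anion"],
                   return_tensors="pt", padding=True)

# BioGPT scales token embeddings internally when given input_ids; reproduce
# that scaling here because we bypass input_ids and pass inputs_embeds.
scale = getattr(model.biogpt, "embed_scale", 1.0)
tok_emb = model.get_input_embeddings()(tokens.input_ids) * scale
inputs_embeds = torch.cat([prefix(graph_emb), tok_emb], dim=1)

# Prefix and padding positions are excluded from the loss with label -100.
cap_labels = tokens.input_ids.masked_fill(tokens.attention_mask == 0, -100)
labels = torch.cat([torch.full((2, prefix.prefix_len), -100), cap_labels], dim=1)
attn = torch.cat([torch.ones(2, prefix.prefix_len, dtype=torch.long),
                  tokens.attention_mask], dim=1)

loss = model(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels).loss
loss.backward()
```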
The MolecularGPT experiments:
- Experimental adaptation of instruction-tuned molecular LLMs to graph captioning.
- Based on Torch Geometric implementations with custom modifications.
- Ultimately underperforms retrieval-based methods and is included for completeness.
We provide a script for each set of experiments. Each script is meant to be opened, read, and modified like a notebook.
Install the required dependencies:

```bash
pip install -r requirements.txt
```

The retrieval system is the core of the project.
Open train_retrieval_system.py and scroll through it:
- Choose the text embedding model (BERT / PubMedBERT / OpenAI).
- Choose the graph encoder (GCN, GINE, GAT, mixed).
- Choose the loss (MSE, Circle Loss, MS Loss).
- Run training and retrieval.
Outputs (checkpoints, retrieved descriptions, metrics) are written to output/.
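The configuration choices above are typically edited directly in the script. A hypothetical sketch of what that block might look like (the variable names here are made up; check the actual ones near the top of train_retrieval_system.py):

```python
# Hypothetical configuration block; the real variable names may differ.
TEXT_MODEL    = "pubmedbert"   # one of: "bert", "pubmedbert", "openai"
GRAPH_ENCODER = "gine"         # one of: "gcn", "gine", "gat", "mixed"
LOSS          = "ms"           # one of: "mse", "circle", "ms"
```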
This part explores graph → text generation.
Open train_decoding_gpt.py:
- Loads a trained graph encoder.
- Builds a vector → text dataset.
- Fine-tunes BioGPT with a learned prefix.
- Generates descriptions for the test set.
This experiment is heavier and less stable than retrieval, but useful for qualitative analysis.
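For a sense of how generation from a graph-conditioned prefix can work, here is a minimal greedy-decoding sketch. It is a simplification under stated assumptions (single example, no KV cache, no sampling), not the script's actual decoding code:

```python
import torch
from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt").eval()
scale = getattr(model.biogpt, "embed_scale", 1.0)  # BioGPT's internal embedding scale

@torch.no_grad()
def generate_from_prefix(prefix_embeds: torch.Tensor, max_new_tokens: int = 60) -> str:
    """Greedy decoding that starts from graph-derived prefix embeddings."""
    embeds = prefix_embeds  # (1, prefix_len, hidden)
    ids = []
    for _ in range(max_new_tokens):
        logits = model(inputs_embeds=embeds).logits[:, -1, :]
        next_id = int(logits.argmax(dim=-1))
        if next_id == tokenizer.eos_token_id:
            break
        ids.append(next_id)
        next_emb = model.get_input_embeddings()(torch.tensor([[next_id]])) * scale
        embeds = torch.cat([embeds, next_emb], dim=1)
    return tokenizer.decode(ids)

# Toy call with a random prefix; a real prefix would come from the trained projection.
print(generate_from_prefix(torch.randn(1, 8, model.config.hidden_size)))
```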
This is an experimental attempt to adapt instruction-tuned molecular LLMs to graph captioning.
Open train_molecular_gpt.py:
- Mostly exploratory code.
- Kept for comparison and analysis.
Two utility files complement the training scripts:
- graph_visualisation.ipynb: visualize molecular graphs and node features.
- plot_results.py: plot results of the analysis.
All outputs are saved under output/, including:
- Trained model checkpoints
- Graph and text embeddings
- Retrieved or generated molecular descriptions
- Evaluation metrics and plots