
Molecular Graph Captioning

This project studies molecular graph captioning: given a molecule’s atomic structure represented as a graph, the goal is to predict a coherent, human-readable textual description of the molecule.

We explore two complementary approaches:

  • Retrieval-based captioning, where molecular graphs and textual descriptions are embedded into a shared space and matched via nearest-neighbor search.
  • Generative captioning, where a language model directly generates a description conditioned on a graph embedding.

Repository Structure

.
├── configs/                         # Experiment and model configuration files
├── data/                            # Datasets (graphs, descriptions, splits)
├── output/                          # Trained models, embeddings, predictions
│
├── models/
│   ├── decoding_gpt/                # Graph → text decoding
│   ├── graph_utils/                 # Molecular graph construction & utilities
│   ├── molecular_gpt/               # Adapted MolecularGPT components
│   └── retrieval_system/            # Cross-modal graph–text retrieval models
│
├── graph_visualisation.ipynb        # Visualization of molecular graphs
├── plot_results.py                  # Plot retrieval and evaluation metrics
├── train_retrieval_system.py        # Train graph–text retrieval model
├── train_decoding_gpt.py            # Train BioGPT-based generative model
├── train_molecular_gpt.py           # MolecularGPT adaptation experiments
├── requirements.txt                 # Python dependencies
└── README.md

Core Components

Retrieval-Based Captioning (models/retrieval_system/)

  • Encodes molecular graphs using GNNs (GCN, GINE, GAT, and mixed variants).
  • Encodes textual descriptions using pretrained language models (BERT, PubMedBERT, OpenAI embeddings).
  • Aligns graph and text embeddings in a shared space using metric-learning objectives (MSE, Circle Loss, Multi-Similarity Loss).
  • Captions are retrieved via nearest-neighbor search in the embedding space.

This approach provides the best overall performance in our experiments.
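
As a rough illustration of this pipeline (the encoder class, dimensions, and function names below are hypothetical, not the repository's exact API), the retrieval step amounts to embedding both modalities and matching by cosine similarity:

# Hypothetical sketch: embed graphs with a small GNN, then match each graph
# to the nearest text embedding by cosine similarity. Names and dims are illustrative.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GraphEncoder(torch.nn.Module):
    def __init__(self, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.proj = torch.nn.Linear(hidden_dim, out_dim)   # map into the shared space

    def forward(self, x, edge_index, batch):
        h = self.conv1(x, edge_index).relu()
        h = self.conv2(h, edge_index).relu()
        return self.proj(global_mean_pool(h, batch))        # one vector per molecule

@torch.no_grad()
def retrieve_captions(graph_emb, text_emb, captions):
    """For each graph, return the caption whose embedding is closest (cosine)."""
    sims = F.normalize(graph_emb, dim=-1) @ F.normalize(text_emb, dim=-1).T
    nearest = sims.argmax(dim=-1)                           # nearest-neighbor search
    return [captions[i] for i in nearest.tolist()]

In the actual system, the text embeddings come from the pretrained language models listed above, and the shared space is trained with the metric-learning objectives rather than being fixed.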

Generative Captioning (models/decoding_gpt/)

  • Uses a trained graph encoder to produce molecular embeddings.
  • Projects graph embeddings into a prefix representation.
  • Fine-tunes a pretrained BioGPT model to generate molecular descriptions autoregressively.

This approach produces fluent and chemically plausible descriptions but is less reliable than retrieval under limited supervision.
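
The prefix mechanism can be summarized roughly as follows; the checkpoint name, prefix length, and projection shape are assumptions for illustration, not the script's exact configuration:

# Hypothetical sketch: map a graph embedding to a few "prefix" token embeddings,
# prepend them to the caption embeddings, and fine-tune BioGPT with an LM loss.
import torch
from transformers import BioGptTokenizer, BioGptForCausalLM

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")

prefix_len, graph_dim = 8, 256                              # assumed sizes
hidden = model.config.hidden_size
prefix_proj = torch.nn.Linear(graph_dim, prefix_len * hidden)

def caption_loss(graph_emb, caption):
    """Language-modeling loss for one (graph embedding, caption) pair."""
    tokens = tokenizer(caption, return_tensors="pt")
    tok_emb = model.get_input_embeddings()(tokens.input_ids)        # (1, T, H)
    prefix = prefix_proj(graph_emb).view(1, prefix_len, hidden)     # (1, P, H)
    inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
    labels = torch.cat([torch.full((1, prefix_len), -100),          # ignore prefix positions
                        tokens.input_ids], dim=1)
    return model(inputs_embeds=inputs_embeds, labels=labels).loss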

MolecularGPT Adaptation (models/molecular_gpt/)

  • Experimental adaptation of instruction-tuned molecular LLMs to graph captioning.
  • Based on Torch Geometric implementations with custom modifications.
  • Ultimately underperforms retrieval-based methods and is included for completeness.

Run Experiments

We provide a set of experiment scripts. Each script is meant to be opened, read, and modified like a notebook.

1. Setup

Install the required dependencies:

pip install -r requirements.txt

2. Retrieval-based captioning (main experiment)

This is the core of the project.

Open train_retrieval_system.py and scroll through it:

  • Choose the text embedding model (BERT / PubMedBERT / OpenAI).
  • Choose the graph encoder (GCN, GINE, GAT, mixed).
  • Choose the loss (MSE, Circle Loss, MS Loss).
  • Run training and retrieval.

Outputs (checkpoints, retrieved descriptions, metrics) are written to output/.
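
For the metric-learning objectives listed above, one possible setup uses the pytorch-metric-learning package (illustrative only; the repository may implement the losses differently):

# Illustrative only: align graph and text embeddings with a metric-learning loss.
# Graph/text embeddings of the same molecule share a label so they are pulled together.
import torch
from pytorch_metric_learning import losses

loss_fn = losses.MultiSimilarityLoss()             # or losses.CircleLoss(), or plain MSE

def alignment_loss(graph_emb, text_emb):
    labels = torch.arange(graph_emb.size(0))       # pair i in both modalities = molecule i
    embeddings = torch.cat([graph_emb, text_emb], dim=0)
    return loss_fn(embeddings, torch.cat([labels, labels]))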

3. Generative captioning with BioGPT

This part explores graph → text generation.

Open train_decoding_gpt.py:

  • Loads a trained graph encoder.
  • Builds a vector → text dataset.
  • Fine-tunes BioGPT with a learned prefix.
  • Generates descriptions for the test set.

This experiment is heavier and less stable than retrieval, but useful for qualitative analysis.
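
For a sense of what generation looks like, a plain greedy decoding loop over the prefix-conditioned model could look like this (reusing the hypothetical names from the sketch in Core Components; not the script's actual interface):

# Hypothetical greedy decoding from a graph-embedding prefix.
@torch.no_grad()
def generate_caption(graph_emb, max_new_tokens=64):
    prefix = prefix_proj(graph_emb).view(1, prefix_len, hidden)
    generated = torch.tensor([[tokenizer.bos_token_id]])
    for _ in range(max_new_tokens):
        tok_emb = model.get_input_embeddings()(generated)
        logits = model(inputs_embeds=torch.cat([prefix, tok_emb], dim=1)).logits
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)        # greedy step
        generated = torch.cat([generated, next_id], dim=1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(generated[0], skip_special_tokens=True)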

4. MolecularGPT-style experiments

This is an experimental attempt to adapt instruction-tuned molecular LLMs to graph captioning.

Open train_molecular_gpt.py:

  • Mostly exploratory code.
  • Kept for comparison and analysis.

5. Visual inspection & analysis

  • graph_visualisation.ipynb visualizes molecular graphs and node features (a minimal standalone sketch follows below).
  • plot_results.py plots retrieval and evaluation metrics.
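
Outside the notebook, a quick way to draw a PyTorch Geometric molecular graph with networkx might look like this (illustrative snippet, not code from graph_visualisation.ipynb):

# Illustrative: convert a torch_geometric Data object to networkx and draw it,
# labelling nodes by their first feature (e.g. an atom-type index).
import networkx as nx
import matplotlib.pyplot as plt
from torch_geometric.utils import to_networkx

def draw_molecule(data):
    g = to_networkx(data, to_undirected=True)
    labels = {i: int(feat[0]) for i, feat in enumerate(data.x)}
    nx.draw(g, labels=labels, with_labels=True, node_color="lightblue")
    plt.show()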

Outputs

All outputs are saved under output/, including:

  • Trained model checkpoints
  • Graph and text embeddings
  • Retrieved or generated molecular descriptions
  • Evaluation metrics and plots
