This project studies molecular graph captioning: given a molecule’s atomic structure represented as a graph, the goal is to predict a coherent, human-readable textual description of the molecule.
We explore two complementary approaches:
- Retrieval-based captioning, where molecular graphs and textual descriptions are embedded into a shared space and matched via nearest-neighbor search.
- Generative captioning, where a language model directly generates a description conditioned on a graph embedding.
```
.
├── configs/                      # Experiment and model configuration files
├── data/                         # Datasets (graphs, descriptions, splits)
├── output/                       # Trained models, embeddings, predictions
│
├── models/
│   ├── decoding_gpt/             # Graph → text decoding
│   ├── graph_utils/              # Molecular graph construction & utilities
│   ├── molecular_gpt/            # Adapted MolecularGPT components
│   └── retrieval_system/         # Cross-modal graph–text retrieval models
│
├── graph_visualisation.ipynb     # Visualization of molecular graphs
├── plot_results.py               # Plot retrieval and evaluation metrics
├── train_retrieval_system.py     # Train graph–text retrieval model
├── train_decoding_gpt.py         # Train BioGPT-based generative model
├── train_molecular_gpt.py        # MolecularGPT adaptation experiments
├── requirements.txt              # Python dependencies
└── README.md
```
The retrieval system:
- Encodes molecular graphs using GNNs (GCN, GINE, GAT, and mixed variants).
- Encodes textual descriptions using pretrained language models (BERT, PubMedBERT, OpenAI embeddings).
- Aligns graph and text embeddings in a shared space using metric-learning objectives (MSE, Circle Loss, Multi-Similarity Loss).
- Captions are retrieved via nearest-neighbor search in the embedding space.
This approach provides the best overall performance in our experiments.
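For intuition, here is a minimal sketch of the retrieval step, assuming graph and text embeddings have already been trained into a shared space. All names, shapes, and the cosine-similarity choice are illustrative, not the project's actual API:

```python
import torch
import torch.nn.functional as F

def retrieve_captions(graph_emb: torch.Tensor,
                      text_emb: torch.Tensor,
                      captions: list[str],
                      k: int = 1) -> list[list[str]]:
    """For each graph embedding, return the k nearest captions by cosine similarity."""
    g = F.normalize(graph_emb, dim=-1)   # (num_graphs, d)
    t = F.normalize(text_emb, dim=-1)    # (num_captions, d)
    sim = g @ t.T                        # pairwise cosine similarities
    topk = sim.topk(k, dim=-1).indices   # (num_graphs, k) caption indices
    return [[captions[j] for j in row] for row in topk.tolist()]

# Toy usage with random embeddings:
graph_emb = torch.randn(4, 256)
text_emb = torch.randn(10, 256)
captions = [f"description {i}" for i in range(10)]
print(retrieve_captions(graph_emb, text_emb, captions, k=3))
```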
The generative system:
- Uses a trained graph encoder to produce molecular embeddings.
- Projects graph embeddings into a prefix representation.
- Fine-tunes a pretrained BioGPT model to generate molecular descriptions autoregressively.
This approach produces fluent and chemically plausible descriptions but is less reliable than retrieval under limited supervision.
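To illustrate the prefix mechanism, here is a minimal training-step sketch, assuming a frozen graph encoder that outputs one vector per molecule. The `GraphPrefix` module, its dimensions, and the example captions are assumptions made for illustration; the actual script may differ:

```python
import torch
import torch.nn as nn
from transformers import BioGptForCausalLM, BioGptTokenizer

class GraphPrefix(nn.Module):
    """Maps a graph embedding to `prefix_len` pseudo-token embeddings."""
    def __init__(self, graph_dim: int, lm_dim: int, prefix_len: int = 8):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.proj = nn.Linear(graph_dim, prefix_len * lm_dim)

    def forward(self, graph_emb: torch.Tensor) -> torch.Tensor:
        # (batch, graph_dim) -> (batch, prefix_len, lm_dim)
        return self.proj(graph_emb).view(-1, self.prefix_len, self.lm_dim)

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt")
prefix = GraphPrefix(graph_dim=256, lm_dim=model.config.hidden_size)

graph_emb = torch.randn(2, 256)  # stand-in for the trained graph encoder's output
tokens = tokenizer(["an aromatic ether", "a fatty acid anion"],
                   return_tensors="pt", padding=True)

# BioGPT scales token embeddings internally when given input_ids; reproduce
# that scaling here because we bypass input_ids and pass inputs_embeds.
scale = getattr(model.biogpt, "embed_scale", 1.0)
tok_emb = model.get_input_embeddings()(tokens.input_ids) * scale
inputs_embeds = torch.cat([prefix(graph_emb), tok_emb], dim=1)

# Prefix and padding positions are excluded from the loss with label -100.
cap_labels = tokens.input_ids.masked_fill(tokens.attention_mask == 0, -100)
labels = torch.cat([torch.full((2, prefix.prefix_len), -100), cap_labels], dim=1)
attn = torch.cat([torch.ones(2, prefix.prefix_len, dtype=torch.long),
                  tokens.attention_mask], dim=1)

loss = model(inputs_embeds=inputs_embeds, attention_mask=attn, labels=labels).loss
loss.backward()
```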
The MolecularGPT experiments:
- Experimental adaptation of instruction-tuned molecular LLMs to graph captioning.
- Based on Torch Geometric implementations with custom modifications.
- Ultimately underperforms retrieval-based methods and is included for completeness.
We provide a script for each set of experiments. Each script is meant to be opened, read, and modified like a notebook.
Install the required dependencies:

```bash
pip install -r requirements.txt
```

The retrieval system is the core of the project.
Open train_retrieval_system.py and scroll through it:
- Choose the text embedding model (BERT / PubMedBERT / OpenAI).
- Choose the graph encoder (GCN, GINE, GAT, mixed).
- Choose the loss (MSE, Circle Loss, MS Loss).
- Run training and retrieval.
Outputs (checkpoints, retrieved descriptions, metrics) are written to output/.
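The configuration choices above are typically edited directly in the script. A hypothetical sketch of what that block might look like (the variable names here are made up; check the actual ones near the top of train_retrieval_system.py):

```python
# Hypothetical configuration block; the real variable names may differ.
TEXT_MODEL    = "pubmedbert"   # one of: "bert", "pubmedbert", "openai"
GRAPH_ENCODER = "gine"         # one of: "gcn", "gine", "gat", "mixed"
LOSS          = "ms"           # one of: "mse", "circle", "ms"
```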
This part explores graph → text generation.
Open train_decoding_gpt.py:
- Loads a trained graph encoder.
- Builds a vector → text dataset.
- Fine-tunes BioGPT with a learned prefix.
- Generates descriptions for the test set.
This experiment is heavier and less stable than retrieval, but useful for qualitative analysis.
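For a sense of how generation from a graph-conditioned prefix can work, here is a minimal greedy-decoding sketch. It is a simplification under stated assumptions (single example, no KV cache, no sampling), not the script's actual decoding code:

```python
import torch
from transformers import BioGptForCausalLM, BioGptTokenizer

tokenizer = BioGptTokenizer.from_pretrained("microsoft/biogpt")
model = BioGptForCausalLM.from_pretrained("microsoft/biogpt").eval()
scale = getattr(model.biogpt, "embed_scale", 1.0)  # BioGPT's internal embedding scale

@torch.no_grad()
def generate_from_prefix(prefix_embeds: torch.Tensor, max_new_tokens: int = 60) -> str:
    """Greedy decoding that starts from graph-derived prefix embeddings."""
    embeds = prefix_embeds  # (1, prefix_len, hidden)
    ids = []
    for _ in range(max_new_tokens):
        logits = model(inputs_embeds=embeds).logits[:, -1, :]
        next_id = int(logits.argmax(dim=-1))
        if next_id == tokenizer.eos_token_id:
            break
        ids.append(next_id)
        next_emb = model.get_input_embeddings()(torch.tensor([[next_id]])) * scale
        embeds = torch.cat([embeds, next_emb], dim=1)
    return tokenizer.decode(ids)

# Toy call with a random prefix; a real prefix would come from the trained projection.
print(generate_from_prefix(torch.randn(1, 8, model.config.hidden_size)))
```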
This is an experimental attempt to adapt instruction-tuned molecular LLMs to graph captioning.
Open train_molecular_gpt.py:
- Mostly exploratory code.
- Kept for comparison and analysis.
Two utility files complement the training scripts:
- graph_visualisation.ipynb: visualize molecular graphs and node features.
- plot_results.py: plot results of the analysis.
All outputs are saved under output/, including:
- Trained model checkpoints
- Graph and text embeddings
- Retrieved or generated molecular descriptions
- Evaluation metrics and plots