Official implementation of MADGEN: Mass-Spec attends to De Novo Molecular generation by Yinkai Wang, Xiaohui Chen, Liping Liu, and Soha Hassoun.
- Overview
- Environment Setup
- Data Preparation
- Training
- Sampling
- Evaluation
- Predictive Retrieval
- Pre-trained Models
- Citation
- Contact
MADGEN is a novel approach for de novo molecular generation that leverages mass spectrometry data. The model uses a transformer-based architecture to generate molecular structures from mass spectra, incorporating both molecular and spectral information for improved generation quality.
- Mass-Spec Guided Generation: Utilizes mass spectrometry data to guide molecular generation
- Transformer Architecture: Leverages state-of-the-art transformer models for sequence generation
- Multi-Modal Integration: Combines molecular SMILES and spectral embeddings
- Comprehensive Evaluation: Includes both generation and retrieval evaluation metrics
- Python 3.9
- CUDA-compatible GPU (recommended)
- Conda package manager
- Create conda environment:
conda create --name madgen python=3.9 rdkit=2023.09.5 -c conda-forge -y
conda activate madgen
- Install additional dependencies:
pip install -r requirements.txt
- Verify installation:
python -c "import rdkit; print('RDKit version:', rdkit.__version__)"
The processed datasets are available in our Zenodo repository: MADGEN Data
wget "https://zenodo.org/records/15036069/files/msgym.pkl?download=1" -O ./data/msgym/raw/msgym.pkl
wget "https://zenodo.org/records/15036069/files/canopus.pkl?download=1" -O ./data/canopus/raw/canopus.pkl
The NIST dataset is commercially available and requires separate licensing. Please refer to the NIST website for acquisition details.
Each dataset contains:
- Molecular SMILES: Canonical SMILES representations
- Mass Spectra: Preprocessed spectral data
- Molecular Properties: Additional molecular descriptors
- Train/Validation/Test Splits: Predefined data splits
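
Before training, it can help to confirm that a downloaded file loads and to see what kind of object it holds. The following is a minimal sketch, not part of the repository: `summarize_pickle` is a hypothetical helper, and it makes no assumption about the internal layout of the `.pkl` files beyond their being standard Python pickles. Loading may require the packages from the environment above (for example RDKit) to be importable.

```python
import pickle
from pathlib import Path


def summarize_pickle(path):
    """Load a pickle file and return a small description of its top-level object."""
    with Path(path).open("rb") as handle:
        obj = pickle.load(handle)  # only unpickle files from a source you trust

    summary = {"type": type(obj).__name__}
    if hasattr(obj, "__len__"):
        summary["length"] = len(obj)
    if isinstance(obj, dict):
        # Show only a few keys so that large datasets stay readable.
        summary["sample_keys"] = sorted(str(k) for k in obj)[:5]
    return summary


if __name__ == "__main__":
    target = Path("data/msgym/raw/msgym.pkl")
    if target.exists():
        print(summarize_pickle(target))
```
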
python train.py --config configs/{dataset_name}.yaml --model Madgen
- configs/msgym.yaml: MSGym dataset configuration
- configs/canopus.yaml: CANOPUS dataset configuration
- configs/nist.yaml: NIST dataset configuration
Key training parameters can be modified in the respective YAML configuration files:
- batch_size: Training batch size
- learning_rate: Learning rate for optimization
- num_epochs: Number of training epochs
- model_dim: Model dimension
- num_layers: Number of transformer layers
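
The training examples in this README pass `--batch_size` and `--learning_rate` on the command line in addition to a YAML file. One common way to combine the two is to let command-line values take precedence over file values. The sketch below illustrates that pattern only; how `train.py` actually merges them is not documented here, `apply_overrides` is a hypothetical helper, the flags for the other three keys are assumed, and the numeric defaults are placeholders rather than values from the repository's configs.

```python
import argparse


def apply_overrides(config, argv):
    """Return a copy of `config` with any command-line overrides applied."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch_size", type=int)
    parser.add_argument("--learning_rate", type=float)
    # The three flags below are assumptions; only the two above appear in this README.
    parser.add_argument("--num_epochs", type=int)
    parser.add_argument("--model_dim", type=int)
    parser.add_argument("--num_layers", type=int)
    args = parser.parse_args(argv)

    merged = dict(config)
    for key, value in vars(args).items():
        if value is not None:  # the flag was given on the command line
            merged[key] = value
    return merged


# Placeholder values for illustration; not taken from the repository's configs.
file_config = {"batch_size": 64, "learning_rate": 3e-4, "num_epochs": 100,
               "model_dim": 256, "num_layers": 6}
merged = apply_overrides(file_config, ["--batch_size", "32", "--learning_rate", "1e-4"])
print(merged["batch_size"], merged["learning_rate"], merged["num_epochs"])
```

Keys that are not overridden keep their file values, and the original dictionary is left unmodified.
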
# Train on MSGym dataset
python train.py --config configs/msgym.yaml --model Madgen
# Train on CANOPUS dataset with custom parameters
python train.py --config configs/canopus.yaml --model Madgen --batch_size 32 --learning_rate 1e-4
CUDA_VISIBLE_DEVICES=1 python sample.py \
--config configs/{dataset_name}.yaml \
--checkpoint {checkpoint_path} \
--samples samples \
--model Madgen \
--mode test \
--n_samples 50 \
--n_steps 100 \
--table_name {table_name} \
--sampling_seed 42
- --n_samples: Number of molecules to generate
- --n_steps: Number of sampling steps
- --sampling_seed: Random seed for reproducibility
- --mode: Sampling mode (test/validation)
- --table_name: Output table name for results
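
When sampling with several seeds, assembling the command in a small script avoids copy-paste errors. This is a sketch under assumptions: `build_sample_command` is a hypothetical helper, it uses only flags that appear in this README, and the checkpoint path and extra seed are illustrative. It prints the commands rather than running them.

```python
import shlex


def build_sample_command(config, checkpoint, table_name,
                         n_samples=50, n_steps=100, seed=42, mode="test"):
    """Assemble a sample.py argument list from flags listed in this README."""
    return [
        "python", "sample.py",
        "--config", config,
        "--checkpoint", checkpoint,
        "--model", "Madgen",
        "--mode", mode,
        "--n_samples", str(n_samples),
        "--n_steps", str(n_steps),
        "--table_name", table_name,
        "--sampling_seed", str(seed),
    ]


for seed in (42, 123, 2024):
    cmd = build_sample_command(
        "configs/canopus.yaml",
        "checkpoints/canopus_best.pt",  # illustrative path
        table_name=f"canopus_seed{seed}",
        seed=seed,
    )
    print(shlex.join(cmd))
```

To execute a command instead of printing it, the list can be passed to `subprocess.run(cmd, check=True)`.
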
# Generate 100 molecules with 200 steps
python sample.py \
--config configs/msgym.yaml \
--checkpoint checkpoints/msgym_best.pt \
--n_samples 100 \
--n_steps 200 \
--table_name msgym_generation
# Generate molecules with different seeds for diversity
python sample.py \
--config configs/canopus.yaml \
--checkpoint checkpoints/canopus_best.pt \
--n_samples 50 \
--sampling_seed 123 \
--table_name canopus_seed123
Evaluate the quality of generated molecules:
python evaluation_generation.py --file_path /path/to/your/csvfile.csv
The evaluation includes the following metrics:
- Top-K Accuracy: Percentage of cases where the true molecule appears in the top-K predictions (based on InChI key matching)
- Maximum Tanimoto Similarity: Maximum Tanimoto similarity between the true molecule and top-K predictions using Morgan fingerprints (radius=2, 2048 bits)
- Scaffold Coverage: Percentage of cases with valid scaffold predictions (non-disconnected scaffolds)
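
To make the first two metrics concrete, here is a minimal sketch in plain Python. It is not the code in `evaluation_generation.py`: plain strings stand in for InChI keys and sets of integers stand in for the on-bits of a Morgan fingerprint, which in practice would be computed with RDKit. Returning 0.0 for two empty fingerprints is a convention of this sketch.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of on-bit indices."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def top_k_hit(true_key, predicted_keys, k):
    """True if the reference identifier is among the first k predictions."""
    return true_key in predicted_keys[:k]


def max_tanimoto(true_bits, predicted_bits, k):
    """Best similarity between the reference and any of the first k predictions."""
    return max((tanimoto(true_bits, p) for p in predicted_bits[:k]), default=0.0)


# Toy data: strings stand in for InChI keys, sets stand in for fingerprint on-bits.
true_key, true_bits = "KEY-A", {1, 5, 9, 12}
pred_keys = ["KEY-B", "KEY-A", "KEY-C"]
pred_bits = [{1, 5, 7}, {1, 5, 9, 12}, {2, 3}]

print(top_k_hit(true_key, pred_keys, k=1))       # False
print(top_k_hit(true_key, pred_keys, k=3))       # True
print(max_tanimoto(true_bits, pred_bits, k=1))   # 0.4
```

Averaging `top_k_hit` over all test cases gives a Top-K accuracy, and averaging `max_tanimoto` gives a mean maximum similarity.
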
The evaluation script expects a CSV file with the following columns:
- true: True SMILES string
- pred: Predicted SMILES string (can be repeated for multiple predictions)
- scaffold: Scaffold SMILES string
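
A quick header check can catch a malformed file before the evaluation runs. The sketch below assumes only the three column names listed above; `missing_columns` is a hypothetical helper and is not part of the repository.

```python
import csv
import io

REQUIRED_COLUMNS = ("true", "pred", "scaffold")


def missing_columns(csv_file):
    """Return the required column names that are absent from the CSV header."""
    reader = csv.reader(csv_file)
    header = next(reader, [])  # an empty file yields an empty header
    present = {name.strip() for name in header}
    return [name for name in REQUIRED_COLUMNS if name not in present]


sample = io.StringIO("true,pred,scaffold\nCCO,CCO,C\n")
print(missing_columns(sample))  # []
```

For a file on disk, open it with `open(path, newline="")` and pass the file object to `missing_columns`.
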
Download pre-computed retrieval results:
# MSGym retrieval results
wget "https://zenodo.org/records/17115068/files/ranks_msgym_pred.pkl?download=1" -O ./data/msgym/raw/ranks_msgym_pred.pkl
# CANOPUS retrieval results
wget "https://zenodo.org/records/17115068/files/ranks_canopus_pred.pkl?download=1" -O ./data/canopus/raw/ranks_canopus_pred.pkl
Download the required datasets for scaffold retrieval experiments:
# Create data directories
mkdir -p scaffold_retrieval/data/canopus
mkdir -p scaffold_retrieval/data/massspecgym
# Download CANOPUS dataset
wget "https://zenodo.org/records/17115068/files/canopus.zip?download=1" -O scaffold_retrieval/data/canopus.zip
cd scaffold_retrieval/data/canopus
unzip ../canopus.zip
cd ../../..
# Download MassSpecGym dataset
wget "https://zenodo.org/records/17115068/files/massspecgym.zip?download=1" -O scaffold_retrieval/data/massspecgym.zip
cd scaffold_retrieval/data/massspecgym
unzip ../massspecgym.zip
cd ../../..
- CANOPUS: A comprehensive dataset for molecular structure prediction from mass spectrometry data
- MassSpecGym: A benchmark dataset for mass spectrometry-based molecular identification
The scaffold_retrieval module provides a comprehensive framework for molecular scaffold-based retrieval using mass spectrometry data. This module implements contrastive learning approaches to learn molecular representations and enables retrieval of molecules based on their structural scaffolds.
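
As a generic illustration of embedding-based retrieval, the sketch below ranks candidates by cosine similarity to a query embedding. It is not the module's implementation: the encoders, the embedding dimension, and the scoring function used by `scaffold_retrieval` are not described in this README, and all names and vectors here are hypothetical toy values.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def rank_candidates(query, candidates):
    """Return candidate names sorted from most to least similar to the query."""
    scored = [(cosine_similarity(query, emb), name) for name, emb in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for _, name in scored]


# Toy 3-dimensional embeddings; real ones would come from trained encoders.
spectrum_embedding = [0.9, 0.1, 0.0]
scaffold_embeddings = {
    "scaffold_a": [1.0, 0.0, 0.0],
    "scaffold_b": [0.0, 1.0, 0.0],
    "scaffold_c": [0.5, 0.5, 0.0],
}
print(rank_candidates(spectrum_embedding, scaffold_embeddings))
# ['scaffold_a', 'scaffold_c', 'scaffold_b']
```

Contrastive training aims to place a spectrum's embedding close to that of its matching structure, so that a ranking of this kind puts the correct candidate near the top.
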
- Training: Use the training scripts for different datasets:
cd scaffold_retrieval/
python train_msgym.py    # For MassSpecGym dataset
python train_canopus.py  # For CANOPUS dataset
- Candidate Ranking: Generate ranked candidates:
python cand_rank_msg.py      # For MassSpecGym
python cand_rank_canopus.py  # For CANOPUS
Modify params_msg.yaml or params_canopus.yaml to adjust model parameters, dataset settings, and training configurations.
Then run the rank preprocessing script from the dataset's raw data directory:
cd data/msgym/raw/ # Or data/canopus/raw/
python preprocess_rank.py
Pre-trained model checkpoints are available at: MADGEN Checkpoints
- MSGym Model: Trained on MSGym dataset
- CANOPUS Model: Trained on CANOPUS dataset
If you find this code useful for your research, please consider citing our paper:
@inproceedings{
wang2025madgen,
title={MADGEN: Mass-Spec attends to De Novo Molecular generation},
author={Yinkai Wang and Xiaohui Chen and Liping Liu and Soha Hassoun},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=78tc3EiUrN}
}
We welcome contributions! Please feel free to submit issues and pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.
If you have any questions or need assistance, please contact:
- Yinkai Wang: yinkai.wang@tufts.edu
- Soha Hassoun: soha.hassoun@tufts.edu
We thank the developers of RDKit, PyTorch, and other open-source libraries that made this work possible.
