Skip to content

HassounLab/MADGEN

Repository files navigation

MADGEN: Mass-Spec attends to De Novo Molecular generation

Official implementation of MADGEN: Mass-Spec attends to De Novo Molecular generation by Yinkai Wang, Xiaohui Chen, Liping Liu, and Soha Hassoun.

MADGEN Architecture

πŸ“‹ Table of Contents

πŸ”¬ Overview

MADGEN is a novel approach for de novo molecular generation that leverages mass spectrometry data. The model uses a transformer-based architecture to generate molecular structures from mass spectra, incorporating both molecular and spectral information for improved generation quality.

Key Features

  • Mass-Spec Guided Generation: Utilizes mass spectrometry data to guide molecular generation
  • Transformer Architecture: Leverages state-of-the-art transformer models for sequence generation
  • Multi-Modal Integration: Combines molecular SMILES and spectral embeddings
  • Comprehensive Evaluation: Includes both generation and retrieval evaluation metrics

πŸ›  Environment Setup

Prerequisites

  • Python 3.9
  • CUDA-compatible GPU (recommended)
  • Conda package manager

Installation

  1. Create conda environment:
conda create --name madgen python=3.9 rdkit=2023.09.5 -c conda-forge -y
conda activate madgen
  1. Install additional dependencies:
pip install -r requirements.txt
  1. Verify installation:
python -c "import rdkit; print('RDKit version:', rdkit.__version__)"

πŸ“Š Data Preparation

Available Datasets

The processed datasets are available in our Zenodo repository: MADGEN Data

MSGym Dataset

wget https://zenodo.org/records/15036069/files/msgym.pkl?download=1 -O ./data/msgym/raw/msgym.pkl

CANOPUS Dataset

wget https://zenodo.org/records/15036069/files/canopus.pkl?download=1 -O ./data/canopus/raw/canopus.pkl

NIST Dataset

The NIST dataset is commercially available and requires separate licensing. Please refer to the NIST website for acquisition details.

Data Structure

Each dataset contains:

  • Molecular SMILES: Canonical SMILES representations
  • Mass Spectra: Preprocessed spectral data
  • Molecular Properties: Additional molecular descriptors
  • Train/Validation/Test Splits: Predefined data splits

πŸš€ Training

Basic Training Command

python train.py --config configs/{dataset_name}.yaml --model Madgen

Available Configurations

  • configs/msgym.yaml - MSGym dataset configuration
  • configs/canopus.yaml - CANOPUS dataset configuration
  • configs/nist.yaml - NIST dataset configuration

Training Parameters

Key training parameters can be modified in the respective YAML configuration files:

  • batch_size: Training batch size
  • learning_rate: Learning rate for optimization
  • num_epochs: Number of training epochs
  • model_dim: Model dimension
  • num_layers: Number of transformer layers

Example Training Session

# Train on MSGym dataset
python train.py --config configs/msgym.yaml --model Madgen

# Train on CANOPUS dataset with custom parameters
python train.py --config configs/canopus.yaml --model Madgen --batch_size 32 --learning_rate 1e-4

🎯 Sampling

Basic Sampling Command

CUDA_VISIBLE_DEVICES=1 python sample.py \
    --config configs/{dataset_name}.yaml \
    --checkpoint {checkpoint_path} \
    --samples samples \
    --model Madgen \
    --mode test \
    --n_samples 50 \
    --n_steps 100 \
    --table_name {table_name} \
    --sampling_seed 42

Sampling Parameters

  • --n_samples: Number of molecules to generate
  • --n_steps: Number of sampling steps
  • --sampling_seed: Random seed for reproducibility
  • --mode: Sampling mode (test/validation)
  • --table_name: Output table name for results

Example Sampling Sessions

# Generate 100 molecules with 200 steps
python sample.py \
    --config configs/msgym.yaml \
    --checkpoint checkpoints/msgym_best.pt \
    --n_samples 100 \
    --n_steps 200 \
    --table_name msgym_generation

# Generate molecules with different seeds for diversity
python sample.py \
    --config configs/canopus.yaml \
    --checkpoint checkpoints/canopus_best.pt \
    --n_samples 50 \
    --sampling_seed 123 \
    --table_name canopus_seed123

πŸ“ˆ Evaluation

Generation Evaluation

Evaluate the quality of generated molecules:

python evaluation_generation.py --file_path /path/to/your/csvfile.csv

Evaluation Metrics

The evaluation includes the following metrics:

  • Top-K Accuracy: Percentage of cases where the true molecule appears in the top-K predictions (based on InChI key matching)
  • Maximum Tanimoto Similarity: Maximum Tanimoto similarity between the true molecule and top-K predictions using Morgan fingerprints (radius=2, 2048 bits)
  • Scaffold Coverage: Percentage of cases with valid scaffold predictions (non-disconnected scaffolds)

Input Data Format

The evaluation script expects a CSV file with the following columns:

  • true: True SMILES string
  • pred: Predicted SMILES string (can be repeated for multiple predictions)
  • scaffold: Scaffold SMILES string

πŸ” Predictive Retrieval

Using Pre-computed Results

Download pre-computed retrieval results:

# MSGym retrieval results
wget https://zenodo.org/records/17115068/files/ranks_msgym_pred.pkl?download=1 -O ./data/msgym/raw/ranks_msgym_pred.pkl

# CANOPUS retrieval results
wget https://zenodo.org/records/17115068/files/ranks_canopus_pred.pkl?download=1 -O ./data/canopus/raw/ranks_canopus_pred.pkl

πŸ“₯ Data Download

Scaffold Retrieval Datasets

Download the required datasets for scaffold retrieval experiments:

# Create data directories
mkdir -p scaffold_retrieval/data/canopus
mkdir -p scaffold_retrieval/data/massspecgym

# Download CANOPUS dataset
wget https://zenodo.org/records/17115068/files/canopus.zip?download=1 -O scaffold_retrieval/data/canopus.zip
cd scaffold_retrieval/data/canopus
unzip ../canopus.zip
cd ../..

# Download MassSpecGym dataset
wget https://zenodo.org/records/17115068/files/massspecgym.zip?download=1 -O scaffold_retrieval/data/massspecgym.zip
cd scaffold_retrieval/data/massspecgym
unzip ../massspecgym.zip
cd ../..

Dataset Descriptions

  • CANOPUS: A comprehensive dataset for molecular structure prediction from mass spectrometry data
  • MassSpecGym: A benchmark dataset for mass spectrometry-based molecular identification

Running Custom Retrieval

The scaffold_retrieval module provides a comprehensive framework for molecular scaffold-based retrieval using mass spectrometry data. This module implements contrastive learning approaches to learn molecular representations and enables retrieval of molecules based on their structural scaffolds.

Usage:

  1. Training: Use the training scripts for different datasets:

    cd scaffold_retrieval/
    python train_msgym.py    # For MassSpecGym dataset
    python train_canopus.py  # For Canopus dataset
  2. Candidate Ranking: Generate ranked candidates:

    python cand_rank_msg.py   # For MassSpecGym
    python cand_rank_canopus.py # For Canopus

Configuration:

Modify params_msg.yaml or params_canopus.yaml to adjust model parameters, dataset settings, and training configurations.

Then,

cd data/msgym/raw/ # Or data/canopus/raw/
python preprocess_rank.py

πŸ† Pre-trained Models

Pre-trained model checkpoints are available at: MADGEN Checkpoints

Available Checkpoints

  • MSGym Model: Trained on MSGym dataset
  • CANOPUS Model: Trained on CANOPUS dataset

πŸ“š Citation

If you find this code useful for your research, please consider citing our paper:

@inproceedings{
    wang2025madgen,
    title={MADGEN: Mass-Spec attends to De Novo Molecular generation},
    author={Yinkai Wang and Xiaohui Chen and Liping Liu and Soha Hassoun},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=78tc3EiUrN}
}

🀝 Contributing

We welcome contributions! Please feel free to submit issues and pull requests.

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ“ž Contact

If you have any questions or need assistance, please contact:

πŸ™ Acknowledgments

We thank the developers of RDKit, PyTorch, and other open-source libraries that made this work possible.

About

MASS-SPEC ATTENDS TO DE NOVO MOLECULAR GENERATION

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages