Learning Continuous Solvent Effects from Transient Flow Data

Reference implementation for the paper “Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement” (arXiv: 2512.19530v1).

This repository provides code for:

GNN with GAT + DRFP + learned mixture encoding
DeepModel (Transformer-enhanced SwiGLU MLP)
GBDT baseline (multi-output)
Ensemble (inverse-variance weighted)
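
As a rough illustration of the inverse-variance weighting named above, the sketch below combines per-model yield predictions with weights proportional to 1/variance. The function names, the toy predictions, and the choice of validation error as the variance estimate are assumptions made for illustration; the repository's ensemble code may differ.

```python
def inverse_variance_weights(variances, eps=1e-12):
    """Return weights proportional to 1/variance, normalized to sum to 1."""
    inv = [1.0 / (v + eps) for v in variances]
    total = sum(inv)
    return [w / total for w in inv]


def ensemble_predict(predictions, variances):
    """Combine per-model predictions (one list of yields per model)."""
    weights = inverse_variance_weights(variances)
    n_outputs = len(predictions[0])
    return [
        sum(w * pred[j] for w, pred in zip(weights, predictions))
        for j in range(n_outputs)
    ]


# Three hypothetical models predicting (SM, P2, P3) yields for one sample.
preds = [[0.60, 0.30, 0.10], [0.50, 0.35, 0.15], [0.70, 0.20, 0.10]]
# Assumed variance estimates, e.g. each model's validation error.
val_variances = [0.01, 0.02, 0.04]

weights = inverse_variance_weights(val_variances)
combined = ensemble_predict(preds, val_variances)
```

Models with lower estimated variance receive larger weights, so the most reliable model dominates the combined prediction without the others being discarded.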

Method Summary (Paper Alignment)

The GNN implementation matches the paper’s design:

4 molecular graphs per sample: SM, P2, P3, Solvent
GAT stack: 4 layers, 8 heads, hidden dim 256, residual connections
Global mean + max pooling
Learned mixture encoding: e_mix = MLP([eA; eB; %B; T; time])
DRFP features: 2048-dim
Final MLP head with sigmoid output for 3 yields
Training: AdamW (lr 3e-4, weight decay 1e-5), batch 128, max epochs 400, early stopping 50, dropout 0.15, head dropout 0.075, grad clip 1.0, ReduceLROnPlateau (factor=0.7, patience=30)
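
To make the pooling and mixture-encoding items above concrete, here is a minimal pure-Python sketch of a mean + max readout and of assembling the vector [eA; eB; %B; T; time] that the mixture MLP would consume. It assumes eA and eB are pooled solvent-graph embeddings and leaves any scaling of the condition values aside; the helper names and toy numbers are hypothetical and not taken from the repository.

```python
def mean_max_pool(node_embeddings):
    """Concatenate the per-dimension mean and max over all nodes of one graph."""
    n_nodes = len(node_embeddings)
    dim = len(node_embeddings[0])
    mean = [sum(node[d] for node in node_embeddings) / n_nodes for d in range(dim)]
    maximum = [max(node[d] for node in node_embeddings) for d in range(dim)]
    return mean + maximum


def mixture_input(e_a, e_b, percent_b, temperature, residence_time):
    """Build the concatenated vector [eA; eB; %B; T; time]."""
    return list(e_a) + list(e_b) + [percent_b, temperature, residence_time]


# Toy node embeddings (2 features per atom) for two hypothetical solvents.
nodes_a = [[1.0, 0.0], [3.0, 2.0]]
nodes_b = [[0.0, 4.0], [2.0, 0.0], [4.0, 2.0]]

e_a = mean_max_pool(nodes_a)
e_b = mean_max_pool(nodes_b)
x_mix = mixture_input(e_a, e_b, percent_b=0.25, temperature=175.0, residence_time=5.0)
```

Concatenating mean and max doubles the embedding width, so a 256-dim GAT output would give a 512-dim graph vector under this readout.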

DeepModel matches the paper’s design:

Input projection to 384
Single 8-head self-attention block
4 residual SwiGLU blocks
2-layer MLP output head
Training: AdamW (lr 7e-4, weight decay 1e-5), batch 128, max epochs 400, early stopping 50, dropout 0.15, head dropout 0.075, grad clip 1.0
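
The residual SwiGLU blocks listed above can be sketched in pure Python as follows. This uses the common formulation x + W_down(SiLU(W_gate x) * (W_up x)); normalization, dropout, bias terms, and the hidden width are omitted or chosen arbitrarily, so treat it as an assumption-laden illustration rather than the repository's DeepModel.

```python
import math
import random


def silu(z):
    """SiLU (Swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(weights, x):
    """Multiply a weight matrix (stored as a list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def swiglu_residual_block(x, w_gate, w_up, w_down):
    """Return x + W_down (SiLU(W_gate x) * (W_up x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return [xi + di for xi, di in zip(x, matvec(w_down, hidden))]


def random_matrix(rows, cols, rng, scale=0.1):
    """Small random weights, used here only to have something to run."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, hidden_dim = 4, 8  # toy sizes; the list above gives a projection width of 384
x = [0.5, -1.0, 0.25, 2.0]

# Four stacked residual blocks, mirroring the count listed above.
params = [
    (
        random_matrix(hidden_dim, dim, rng),
        random_matrix(hidden_dim, dim, rng),
        random_matrix(dim, hidden_dim, rng),
    )
    for _ in range(4)
]

out = x
for w_gate, w_up, w_down in params:
    out = swiglu_residual_block(out, w_gate, w_up, w_down)
```

Because of the residual connection, a block whose weights are all zero reduces to the identity, which is a convenient sanity check when wiring up such a stack.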

Installation

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Data Format

Required CSV columns (GNN)

smiles_sm, smiles_p2, smiles_p3
smiles_solvent_a, smiles_solvent_b (empty for pure solvents)
solvent_a_id, solvent_b_id (IDs used in LOSO)
percent_b, temperature, residence_time
y_sm, y_p2, y_p3
ramp_id (for LORO)
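
Before training, it can help to check that a CSV header contains every column listed above. The sketch below does this with the standard library only; the helper name and the in-memory sample header are hypothetical, and the training scripts may perform their own validation.

```python
import csv
import io

# Column names copied from the list above.
REQUIRED_GNN_COLUMNS = [
    "smiles_sm", "smiles_p2", "smiles_p3",
    "smiles_solvent_a", "smiles_solvent_b",
    "solvent_a_id", "solvent_b_id",
    "percent_b", "temperature", "residence_time",
    "y_sm", "y_p2", "y_p3",
    "ramp_id",
]


def missing_columns(header, required=REQUIRED_GNN_COLUMNS):
    """Return the required column names that are absent from the header."""
    present = set(header)
    return [name for name in required if name not in present]


# Hypothetical header that lacks ramp_id; replace with open(path, newline="").
sample = io.StringIO(
    "smiles_sm,smiles_p2,smiles_p3,smiles_solvent_a,smiles_solvent_b,"
    "solvent_a_id,solvent_b_id,percent_b,temperature,residence_time,"
    "y_sm,y_p2,y_p3\n"
)
header = next(csv.reader(sample))
missing = missing_columns(header)
```

An empty result means the header is complete; here the check reports the single missing column needed for LORO splits.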

Optional DRFP matrix: a .npy file with shape (n_samples, 2048), passed to train_gnn.py via --drfp.

Required CSV columns (Baselines)

The CSV must already include numeric descriptor columns (Spange, ACS PCA, DRFP), plus temperature, residence_time, percent_b. All numeric columns except the 3 targets and these 3 condition columns are treated as descriptor features.
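
The column-selection rule described above can be sketched as follows: every column whose values all parse as numbers is kept as a descriptor unless it is a target or a condition column. The target names are assumed to match the GNN CSV (y_sm, y_p2, y_p3), and the descriptor column names in the sample are invented for illustration.

```python
import csv
import io

TARGETS = ("y_sm", "y_p2", "y_p3")  # assumed to match the GNN CSV
CONDITIONS = ("temperature", "residence_time", "percent_b")


def is_number(text):
    """Return True if the cell parses as a float."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def descriptor_columns(rows):
    """Select numeric columns that are neither targets nor conditions."""
    excluded = set(TARGETS) | set(CONDITIONS)
    return [
        name
        for name in rows[0].keys()
        if name not in excluded and all(is_number(row[name]) for row in rows)
    ]


# Hypothetical tabular CSV with one text column and two descriptor columns.
sample = io.StringIO(
    "solvent_name,spange_alpha,drfp_0,temperature,residence_time,percent_b,"
    "y_sm,y_p2,y_p3\n"
    "methanol,0.98,1,175.0,5.0,0.0,0.6,0.3,0.1\n"
    "toluene,0.00,0,200.0,2.5,0.5,0.4,0.4,0.2\n"
)
rows = list(csv.DictReader(sample))
features = descriptor_columns(rows)
```

Note that under this sketch an empty cell makes a column count as non-numeric, so missing descriptor values would need to be filled in beforehand.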

Usage

GNN (LOSO / LORO)

python3 train_gnn.py --csv /path/to/catechol.csv --split loso --drfp /path/to/drfp.npy
python3 train_gnn.py --csv /path/to/catechol.csv --split loro --drfp /path/to/drfp.npy

Baselines (GBDT / DeepModel / Ensemble)

python3 train_baselines.py --csv /path/to/catechol_tabular.csv --split loso
python3 train_baselines.py --csv /path/to/catechol_tabular.csv --split loro

Reproducibility

LOSO and LORO splits follow the paper’s protocol.
GNN and DeepModel training follow the paper’s hyperparameters and early stopping criteria.
DRFP features must be precomputed to reproduce reported metrics.
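
For orientation, a leave-one-group-out split of the kind that LOSO and LORO imply can be sketched as below: each distinct group ID is held out once while the rest is used for training. The function name is hypothetical, and how mixtures are assigned to solvent groups in LOSO is not specified here, so the example uses ramp_id values only; consult the paper for the exact protocol.

```python
from collections import defaultdict


def leave_one_group_out(group_ids):
    """Yield (held_out_group, train_indices, test_indices) per distinct group."""
    by_group = defaultdict(list)
    for index, group in enumerate(group_ids):
        by_group[group].append(index)
    for held_out in sorted(by_group):
        test = by_group[held_out]
        train = sorted(
            i for group, indices in by_group.items()
            if group != held_out
            for i in indices
        )
        yield held_out, train, test


# Hypothetical ramp_id column for six samples (LORO-style grouping).
ramp_ids = ["r1", "r1", "r2", "r3", "r3", "r2"]
folds = list(leave_one_group_out(ramp_ids))
```

Each sample appears in exactly one test fold, and no group contributes to both the training and test sides of the same fold.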

Dependencies

torch
torch-geometric
rdkit
numpy, pandas
scikit-learn

License

This repository is released under the terms of the LICENSE file in the repository root. If you plan to distribute or publish derived work, ensure you also comply with the licenses of the paper and the dataset.

Citation

If you use this code, please cite the paper:

@article{xing2025catechol,
  title={Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement},
  author={Xing, Hongsheng and Si, Qiuxin},
  journal={arXiv preprint arXiv:2512.19530v1},
  year={2025}
}

Contact

For questions about the dataset or methodology, please refer to the paper or contact the authors.

Email: starxsky@outlook.com
