Reference implementation for the paper “Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement” (arXiv: 2512.19530v1).
This repository provides code for:
- GNN with GAT + DRFP + learned mixture encoding
- DeepModel (Transformer-enhanced SwiGLU MLP)
- GBDT baseline (multi-output)
- Ensemble (inverse-variance weighted)
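The inverse-variance weighting named in the last item can be illustrated with a minimal, framework-free sketch. The function name, the `eps` guard, and the choice of variance estimate (for example, each model's validation MSE) are assumptions made for illustration and are not taken from the repository code.

```python
def inverse_variance_ensemble(predictions, variances, eps=1e-12):
    """Combine per-model predictions with weights proportional to 1/variance.

    predictions: list of per-model prediction vectors (all the same length).
    variances:   one error-variance estimate per model (assumed here to be
                 something like a validation MSE).
    """
    if len(predictions) != len(variances):
        raise ValueError("need exactly one variance per model")
    raw = [1.0 / (v + eps) for v in variances]  # eps guards against division by zero
    total = sum(raw)
    weights = [w / total for w in raw]          # normalised so the weights sum to 1
    n_outputs = len(predictions[0])
    combined = [
        sum(w * pred[j] for w, pred in zip(weights, predictions))
        for j in range(n_outputs)
    ]
    return combined, weights


# Hypothetical (SM, P2, P3) yield predictions from two models.
combined, weights = inverse_variance_ensemble(
    [[0.2, 0.5, 0.3], [0.4, 0.3, 0.3]],
    variances=[0.01, 0.03],
)
```

A model with a lower estimated variance receives a proportionally larger weight, so a poorly calibrated ensemble member is down-weighted rather than dropped.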
Repository layout:
- catechol_gnn/: core models and data utilities
- train_gnn.py: training/evaluation for GNN
- train_baselines.py: training/evaluation for GBDT/DeepModel/Ensemble
- requirements.txt: dependencies
GNN implementation matches the paper’s design:
- 4 molecular graphs per sample: SM, P2, P3, Solvent
- GAT stack: 4 layers, 8 heads, hidden dim 256, residual connections
- Global mean + max pooling
- Learned mixture encoding: e_mix = MLP([eA; eB; %B; T; time])
- DRFP features: 2048-dim
- Final MLP head with sigmoid output for 3 yields
- Training: AdamW (lr=3e-4, weight decay=1e-5), batch=128, max epochs=400, early stopping=50, dropout=0.15, head dropout=0.075, grad clip=1.0, ReduceLROnPlateau (factor=0.7, patience=30)
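The "global mean + max pooling" readout in the list above can be shown in plain Python. This is only a sketch of the idea: it assumes the mean-pooled and max-pooled vectors are concatenated, which the text does not state, and it does not reproduce the repository's torch-geometric code.

```python
def mean_max_pool(node_embeddings):
    """Concatenate the feature-wise mean and max over all nodes of one graph."""
    if not node_embeddings:
        raise ValueError("graph has no nodes")
    n_nodes = len(node_embeddings)
    columns = list(zip(*node_embeddings))  # one tuple per feature dimension
    mean_part = [sum(col) / n_nodes for col in columns]
    max_part = [max(col) for col in columns]
    # Assumption: the two pooled vectors are concatenated, doubling the width.
    return mean_part + max_part


# Toy graph: 3 nodes with 2-dimensional embeddings (illustrative values only).
graph_vector = mean_max_pool([[1.0, 4.0], [3.0, 0.0], [2.0, 2.0]])
```

Under the concatenation assumption, a hidden dimension of 256 would give a 512-dimensional vector per molecular graph.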
DeepModel matches the paper’s design:
- Input projection to 384
- Single 8-head self-attention block
- 4 residual SwiGLU blocks
- 2-layer MLP output head
- Training: AdamW (lr=7e-4, weight decay=1e-5), batch=128, max epochs=400, early stopping=50, dropout=0.15, head dropout=0.075, grad clip=1.0
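The early stopping setting in the training configurations can be read as a patience counter on the validation loss. The sketch below assumes that reading (stop after 50 epochs without improvement); the class name and the `min_delta` tolerance are hypothetical and not taken from the repository.

```python
class EarlyStopping:
    """Stop when the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=50, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta  # hypothetical tolerance, not from the source
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Toy run with a small patience so the behaviour is easy to follow.
stopper = EarlyStopping(patience=3)
losses = [0.9, 0.8, 0.85, 0.81, 0.82, 0.7]
stopped_at = None
for epoch, loss in enumerate(losses):
    if stopper.step(loss):
        stopped_at = epoch
        break
```

In this toy run the loop stops before the final loss value is seen, which is the usual trade-off of a patience-based criterion: a larger patience wastes more epochs but is less likely to stop ahead of a late improvement.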
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Input CSV columns for the GNN:
- smiles_sm, smiles_p2, smiles_p3
- smiles_solvent_a, smiles_solvent_b (empty for pure solvents)
- solvent_a_id, solvent_b_id (IDs used in LOSO)
- percent_b, temperature, residence_time
- y_sm, y_p2, y_p3
- ramp_id (for LORO)
Optional DRFP matrix: .npy with shape (n_samples, 2048).
The tabular CSV passed to train_baselines.py must already include numeric descriptor columns (Spange, ACS PCA, DRFP), plus
temperature, residence_time, percent_b. All numeric columns except the 3 targets and
these 3 condition columns are treated as descriptor features.
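The column-selection rule described above can be sketched with the standard library only. The target names `y_sm`, `y_p2`, `y_p3` are assumed to match the GNN CSV, and the descriptor names `spange_0` and `drfp_0` and all row values are made up for illustration; the repository's actual loader may differ.

```python
TARGETS = ("y_sm", "y_p2", "y_p3")  # assumed target column names
CONDITIONS = ("temperature", "residence_time", "percent_b")


def is_numeric_column(rows, name):
    """True if every value in the column parses as a float."""
    try:
        for row in rows:
            float(row[name])
    except (TypeError, ValueError):
        return False
    return True


def descriptor_columns(rows):
    """Numeric columns that are neither targets nor condition columns."""
    excluded = set(TARGETS) | set(CONDITIONS)
    return [
        name for name in rows[0]
        if name not in excluded and is_numeric_column(rows, name)
    ]


# Hypothetical two-row table, shaped like the output of csv.DictReader.
rows = [
    {"solvent": "MeOH", "spange_0": "0.6", "drfp_0": "1", "temperature": "175",
     "residence_time": "2.0", "percent_b": "0",
     "y_sm": "0.1", "y_p2": "0.5", "y_p3": "0.4"},
    {"solvent": "MeCN", "spange_0": "0.3", "drfp_0": "0", "temperature": "200",
     "residence_time": "5.0", "percent_b": "50",
     "y_sm": "0.2", "y_p2": "0.4", "y_p3": "0.4"},
]
features = descriptor_columns(rows)
```

Note that under this rule any integer-coded ID column would also be picked up as a descriptor, so such columns would need to be dropped or stored as non-numeric strings.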
python3 train_gnn.py --csv /path/to/catechol.csv --split loso --drfp /path/to/drfp.npy
python3 train_gnn.py --csv /path/to/catechol.csv --split loro --drfp /path/to/drfp.npy
python3 train_baselines.py --csv /path/to/catechol_tabular.csv --split loso
python3 train_baselines.py --csv /path/to/catechol_tabular.csv --split loro
Notes:
- LOSO and LORO splits follow the paper’s protocol.
- GNN and DeepModel training follow the paper’s hyperparameters and early stopping criteria.
- DRFP features must be precomputed to reproduce reported metrics.
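As a rough illustration of the split protocol, the sketch below implements a generic leave-one-group-out loop. It assumes LOSO and LORO mean leaving out one solvent ID and one ramp ID at a time, as the column descriptions suggest; the function name is hypothetical, and how mixtures with two solvent IDs are assigned to folds is not covered here and should be taken from the paper.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) for each distinct group."""
    for held_out in sorted(set(groups)):
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Hypothetical ramp IDs for six samples (LORO); pass solvent IDs instead for LOSO.
ramp_ids = ["r1", "r1", "r2", "r3", "r3", "r3"]
folds = list(leave_one_group_out(ramp_ids))
```

Each fold holds out every sample from one group, so no ramp (or solvent) ever appears in both the training and test indices of the same fold.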
Dependencies:
- torch
- torch-geometric
- rdkit
- numpy, pandas
- scikit-learn
This repository is released under the license specified by the authors. If you plan to distribute or publish derived work, ensure you comply with the paper’s and dataset’s licenses.
If you use this code, please cite the paper:
@article{xing2025catechol,
title={Learning Continuous Solvent Effects from Transient Flow Data: A Graph Neural Network Benchmark on Catechol Rearrangement},
author={Xing, Hongsheng and Si, Qiuxin},
journal={arXiv preprint arXiv:2512.19530v1},
year={2025}
}
For questions about the dataset or methodology, please refer to the paper or contact the authors.
- Email: starxsky@outlook.com