Benchmarking Hyperparameter Optimization Strategies for Graph Neural Networks on ADMET Prediction Under Scaffold-Split Evaluation
Authors: Martin Stamenov, Mila Gjurovska, Viktorija Vodilovska, Ilinka Ivanoska
Paper: paper_1/main.tex
Systematic HPO Benchmark for Molecular GNNs is a reproducible benchmarking framework that systematically evaluates seven hyperparameter optimization (HPO) strategies for Graph Neural Networks (GNNs) on six ADMET datasets from the Therapeutics Data Commons (TDC). It additionally compares optimized GNNs against frozen foundation model baselines (ChemBERTa, MolCLR, Morgan-FP, MolE-FP), and provides multi-seed statistical validation with confidence intervals.
The framework answers two core questions:
- Which HPO algorithm should practitioners choose for GNN-based molecular property prediction under scaffold-split evaluation?
- Can task-specific GNNs with systematic HPO match or exceed frozen pretrained foundation models?
Benchmark at a glance:

| Metric | Value |
|---|---|
| Datasets | 6 (4 ADME regression + 2 Toxicity classification) |
| Total molecules | 11,805 |
| HPO algorithms | 7 (Random, PSO, ABC, GA, SA, HC, TPE) |
| Trials per run | 50 |
| Total HPO runs | 42 (6 datasets x 7 algorithms) |
| Total model evaluations | 2,100+ |
| Multi-seed validation | 5 seeds per dataset |
| Foundation model baselines | 4 (ChemBERTa, MolCLR, Morgan-FP, MolE-FP) |
| GNN backbone | GCN (GraphConv) |
| Evaluation protocol | Scaffold split (Bemis-Murcko, 80/10/10) |
| Hardware | NVIDIA RTX 3060, i7-8700K, 16 GB RAM |
| Total compute | ~45 hours |
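The GCN backbone (GraphConv) updates each node by combining its own features with an aggregate of its neighbors' features. A toy single-layer sketch in pure Python with scalar node features (illustration only; the actual model in `src/core/optimized_gnn.py` uses PyTorch Geometric):

```python
def graph_conv_layer(x, adj, w_self, w_neigh):
    """One GraphConv step: x_i' = w_self * x_i + w_neigh * sum(x_j for neighbors j).
    Scalar weights and scalar node features, for illustration."""
    out = []
    for i, xi in enumerate(x):
        neigh_sum = sum(x[j] for j in adj[i])
        out.append(w_self * xi + w_neigh * neigh_sum)
    return out

# Toy molecular graph: a 4-node path (e.g. a butane-like carbon chain)
x   = [1.0, 2.0, 3.0, 4.0]
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
out = graph_conv_layer(x, adj, w_self=0.5, w_neigh=0.25)
print(out)  # → [1.0, 2.0, 3.0, 2.75]
```

Stacking 3-7 such layers (the searched range) followed by a pooling step and an MLP head yields the molecule-level prediction.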
- No universal optimizer exists. Random Search wins on 3/4 regression tasks; metaheuristics (SA, ABC) win on classification. Algorithm choice is task-dependent.
- Random Search is a strong baseline. No metaheuristic achieves statistically significant improvement over Random Search (Wilcoxon signed-rank, p > 0.05) under a 50-trial budget with scaffold split.
- Scaffold-split evaluation changes optimizer rankings. The noisy validation landscape induced by scaffold split reduces the advantage of adaptive metaheuristics compared to random-split settings.
- GNNs outperform frozen foundation models on toxicity. hERG: GNN AUC=0.825 vs ChemBERTa 0.770. Tox21: GNN AUC=0.742 vs ChemBERTa 0.728.
- Structure-only models fail on complex pharmacokinetic (PK) endpoints. Hepatocyte clearance reaches R^2 = -1.02 (worse than predicting the mean); foundation models provide a more stable starting point on this task.
- Dataset difficulty varies dramatically, from hERG (AUC=0.825, strong signal) to hepatocyte clearance (R^2=-1.02, effectively unlearnable from structure alone).
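The significance claim above rests on a paired Wilcoxon signed-rank test. A minimal sketch with SciPy, using made-up paired scores purely for illustration (the real per-run results live in `runs/`):

```python
from scipy.stats import wilcoxon

# Hypothetical paired validation scores (e.g. AUC across seeds/datasets)
# for Random Search vs. a metaheuristic -- illustrative values only.
random_search = [0.713, 0.705, 0.721, 0.699, 0.717]
metaheuristic = [0.735, 0.701, 0.726, 0.702, 0.719]

stat, p = wilcoxon(random_search, metaheuristic)
print(f"W={stat}, p={p:.3f}")
if p > 0.05:
    print("No significant difference at alpha=0.05")
```

The test is paired (same datasets/seeds for both optimizers) and makes no normality assumption, which suits small samples of noisy validation scores.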
Best RMSE per HPO algorithm on the regression datasets (lower is better):

| Dataset | PSO | ABC | GA | SA | HC | Random | TPE |
|---|---|---|---|---|---|---|---|
| Caco2_Wang | 0.0031 | 0.0029 | 0.0031 | 0.0029 | 0.0030 | 0.0027 | 0.0030 |
| Half_Life_Obach | 21.66 | 21.66 | 21.66 | 23.70 | 24.52 | 22.31 | 22.34 |
| Clearance_Hepatocyte_AZ | 70.21 | 72.04 | 71.34 | 72.04 | 72.04 | 68.22 | 52.16 |
| Clearance_Microsome_AZ | 42.76 | 42.29 | 42.29 | 40.94 | 41.63 | 38.75 | 44.34 |
Best AUC-ROC per HPO algorithm on the classification datasets (higher is better):

| Dataset | PSO | ABC | GA | SA | HC | Random | TPE |
|---|---|---|---|---|---|---|---|
| Tox21 (NR-AR) | 0.692 | 0.735 | 0.735 | 0.742 | 0.652 | 0.713 | 0.705 |
| hERG | 0.747 | 0.825 | 0.747 | 0.802 | 0.821 | 0.747 | 0.772 |
Note: TPE uses Optuna and additionally searches over dropout (8-dim space), while NiaPy-based algorithms share a 7-dim search space.
Multi-seed validation of the best configurations (5 seeds per dataset):

| Dataset | Task | Metric | Mean +/- Std (95% CI) |
|---|---|---|---|
| Caco2_Wang | Regr. | RMSE | 0.0033 +/- 0.0005 (0.0027--0.0039) |
| Half_Life_Obach | Regr. | RMSE | 20.05 +/- 1.17 (18.61--21.50) |
| Clearance_Hepatocyte_AZ | Regr. | RMSE | 52.37 +/- 2.87 (48.81--55.93) |
| Clearance_Microsome_AZ | Regr. | RMSE | 53.46 +/- 13.56 (36.63--70.30) |
| Tox21 (NR-AR) | Class. | AUC | 0.711 +/- 0.012 (0.696--0.727) |
| hERG | Class. | AUC | 0.805 +/- 0.022 (0.778--0.832) |
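The intervals above are consistent with a two-sided Student-t 95% CI over 5 seeds (t ≈ 2.776 at 4 degrees of freedom). A sketch reproducing the Tox21 row from its reported mean and std:

```python
import math

def t_ci_95(mean, std, n, t_crit=2.776):
    """Two-sided 95% CI from summary stats; t_crit hardcoded for df = n - 1 = 4."""
    half = t_crit * std / math.sqrt(n)
    return mean - half, mean + half

lo, hi = t_ci_95(0.711, 0.012, 5)        # Tox21 (NR-AR) AUC row
print(f"95% CI: ({lo:.3f}, {hi:.3f})")   # close to the reported (0.696, 0.727)
```

Small discrepancies in the last digit come from rounding the reported mean/std before the calculation.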
Best HPO-tuned GNN versus frozen foundation-model baselines:

| Model | Caco2 (R^2) | Half_Life (RMSE) | Clear_Hep (RMSE) | Clear_Micro (RMSE) | Tox21 (AUC) | hERG (AUC) |
|---|---|---|---|---|---|---|
| GNN-Best | 0.48 | 21.66 | 68.22 | 38.75 | 0.743 | 0.825 |
| Morgan-FP | -- | 22.12 | 48.36 | 40.36 | 0.722 | 0.611 |
| ChemBERTa | 0.48 | 27.39 | 47.31 | 42.56 | 0.728 | 0.770 |
| MolE-FP | -- | 25.01 | 47.22 | 41.79 | 0.675 | 0.672 |
| MolCLR | -- | 21.71 | 48.92 | 42.19 | 0.452 | 0.401 |
The Caco2 comparison uses R^2 (scale-invariant) because the GNN reports RMSE in original units while the foundation models report it in z-score-normalized space.
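The scale-invariance claim holds because R^2 is unchanged when the same affine transform (such as z-scoring) is applied to both targets and predictions, and it goes negative whenever a model is worse than predicting the mean. A small self-contained demo:

```python
def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

y    = [2.0, 4.0, 6.0, 8.0]
yhat = [2.5, 3.5, 6.5, 7.5]
print(round(r2(y, yhat), 3))        # 0.95

# z-scoring both series (same mu/sd) leaves R^2 unchanged
mu = sum(y) / len(y)
sd = (sum((v - mu) ** 2 for v in y) / len(y)) ** 0.5
z  = lambda xs: [(v - mu) / sd for v in xs]
print(round(r2(z(y), z(yhat)), 3))  # 0.95 again

# a model worse than the mean yields negative R^2 (cf. Clearance_Hepatocyte)
print(r2(y, [8.0, 6.0, 4.0, 2.0]) < 0)  # True
```

This is why RMSE values computed in different unit systems cannot be compared directly, while R^2 values can.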
Quickstart:

```shell
git clone https://github.com/NitramVonemats/MANU_Project.git
cd MANU_Project
pip install -r requirements.txt

# Run HPO and benchmarks
python scripts/run_hpo_50_trials.py
python scripts/run_tpe_benchmark.py
python scripts/run_complete_foundation_benchmark.py
python scripts/run_chemberta_finetune.py
python scripts/run_multi_seed_validation.py

# Generate figures
python scripts/create_hpo_visualizations.py
python scripts/create_foundation_comparison_plots.py
```

Repository layout:

```
MANU/
|-- paper_1/                          # LaTeX paper
|   |-- main.tex                      # Main manuscript
|   |-- refs.bib                      # Bibliography
|   `-- images/                       # Paper figures (PNG)
|
|-- src/core/                         # Core source code
|   |-- optimized_gnn.py              # GNN model, training, evaluation
|   `-- model_comparison.py           # Model comparison utilities
|
|-- optimization/                     # HPO framework
|   |-- space.py                      # 7-dim search space definition
|   |-- problem.py                    # NiaPy problem wrapper
|   |-- runner.py                     # HPO execution runner
|   |-- foundation_problem.py         # Foundation model HPO wrapper
|   |-- foundation_runner.py          # Foundation model HPO runner
|   `-- algorithms/                   # Algorithm implementations
|       |-- pso.py                    # Particle Swarm Optimization
|       |-- genetic.py                # Genetic Algorithm
|       |-- abc.py                    # Artificial Bee Colony
|       |-- simulated_annealing.py    # Simulated Annealing
|       |-- hill_climbing.py          # Hill Climbing
|       `-- random_search.py          # Random Search
|
|-- scripts/                          # Execution and analysis scripts
|   |-- run_hpo_50_trials.py          # Main HPO runner (50 trials)
|   |-- run_tpe_benchmark.py          # TPE via Optuna
|   |-- run_multi_seed_validation.py  # 5-seed validation
|   |-- run_chemberta_finetune.py     # ChemBERTa fine-tuning
|   |-- run_complete_foundation_benchmark.py
|   |-- create_hpo_visualizations.py  # HPO figures
|   |-- create_foundation_comparison_plots.py
|   |-- statistical_significance_tests.py
|   `-- analyses/                     # Detailed analysis scripts
|
|-- runs/                             # HPO results (JSON, per dataset/algo)
|   |-- Caco2_Wang/                   # 6 algo result files
|   |-- Half_Life_Obach/
|   |-- Clearance_Hepatocyte_AZ/
|   |-- Clearance_Microsome_AZ/
|   |-- tox21/
|   `-- herg/
|
|-- results/                          # Processed results
|   |-- multi_seed/                   # 5-seed validation results
|   |-- tpe_benchmark/                # TPE results (6 datasets)
|   |-- foundation_benchmark/         # Foundation model comparison CSV
|   |-- chemberta_finetune/           # ChemBERTa fine-tuning results
|   |-- figures/                      # Generated tables and figures
|   `-- hpo/                          # Processed HPO results
|
|-- datasets/                         # Raw datasets (CSV)
|   |-- adme/                         # 4 ADME regression datasets
|   `-- toxicity/                     # Tox21, hERG, ClinTox
|
|-- external/MolCLR/                  # MolCLR pretrained checkpoints
|-- figures/paper/                    # Generated LaTeX tables
|-- archive/                          # Old experiments and scripts
|-- requirements.txt                  # Python dependencies
`-- README.md                         # This file
```
All datasets are from the Therapeutics Data Commons (TDC) ADMET benchmark.
| Dataset | Task | Molecules | Primary Metric | Difficulty |
|---|---|---|---|---|
| Caco2_Wang | Permeability (regression) | 910 | RMSE, R^2 | Moderate (R^2=0.48) |
| Half_Life_Obach | Half-life (regression) | 667 | RMSE, R^2 | Very Hard (R^2=0.004) |
| Clearance_Hepatocyte_AZ | Clearance (regression) | 1,213 | RMSE, R^2 | Impossible (R^2=-1.02) |
| Clearance_Microsome_AZ | Clearance (regression) | 1,102 | RMSE, R^2 | Weak (R^2=0.19) |
| Tox21 (NR-AR) | Toxicity (classification) | 7,258 | AUC-ROC | Moderate (3.5% pos) |
| hERG | Cardiotoxicity (classification) | 655 | AUC-ROC | Good (AUC=0.825) |
Splitting: Bemis-Murcko scaffold split (80/10/10 train/val/test), seed 42.
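Scaffold splitting groups molecules by their Bemis-Murcko scaffold and assigns whole groups to train/val/test, so no test-set scaffold is seen during training. A framework-free sketch of the common largest-group-first variant; `scaffold_of` is a stand-in argument (the real pipeline would derive scaffolds with RDKit's MurckoScaffold utilities):

```python
import random
from collections import defaultdict

def scaffold_split(smiles, scaffold_of, frac=(0.8, 0.1, 0.1), seed=42):
    """Assign whole scaffold groups to train/val/test so test scaffolds
    are unseen during training (largest-group-first variant)."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles):
        groups[scaffold_of(smi)].append(i)
    buckets = list(groups.values())
    random.Random(seed).shuffle(buckets)  # break ties among equal-size groups
    buckets.sort(key=len, reverse=True)   # largest scaffold groups first
    n = len(smiles)
    train, val, test = [], [], []
    for bucket in buckets:
        if len(train) + len(bucket) <= frac[0] * n:
            train += bucket
        elif len(train) + len(val) + len(bucket) <= (frac[0] + frac[1]) * n:
            val += bucket
        else:
            test += bucket
    return train, val, test

# Toy stand-in for a Bemis-Murcko scaffold: the first SMILES character
toy = ["CCO", "CCN", "c1ccccc1", "c1ccncc1", "CC(=O)O", "NCC", "OCC", "c1ccoc1"]
tr, va, te = scaffold_split(toy, scaffold_of=lambda s: s[0])
print(len(tr), len(va), len(te))  # → 6 1 1
```

Because whole groups move together, the realized split ratios only approximate 80/10/10; this is expected behavior, not a bug.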
HPO algorithms and configurations:

| Algorithm | Type | Framework | Config |
|---|---|---|---|
| Random Search | Baseline | NiaPy | Uniform sampling |
| PSO | Swarm intelligence | NiaPy | pop=16, C1=2.0, C2=2.0, w=0.7 |
| ABC | Swarm intelligence | NiaPy | colony=16, limit=50 |
| GA | Evolutionary | NiaPy | pop=16, mutation=0.1, crossover=0.8 |
| SA | Probabilistic | NiaPy | T0=1.0, alpha=0.99 |
| HC | Local search | NiaPy | Greedy, single init |
| TPE | Bayesian | Optuna | 10 startup trials, median pruning (5 startup) |
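The SA configuration above (T0=1.0, alpha=0.99) corresponds to geometric cooling with Metropolis acceptance. A minimal generic sketch, minimizing a hypothetical 1-D objective that stands in for the validation loss (NiaPy's internals may differ in detail):

```python
import math
import random

def simulated_annealing(f, x0, step, t0=1.0, alpha=0.99, iters=500, seed=0):
    """Geometric cooling (T <- alpha * T) with Metropolis acceptance."""
    rng = random.Random(seed)
    x, fx, t = x0, f(x0), t0
    best, fbest = x, fx
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        fc = f(cand)
        # Always accept improvements; accept worse moves with prob e^(-delta/T)
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
        t *= alpha
    return best, fbest

x, fx = simulated_annealing(lambda v: (v - 3.0) ** 2, x0=0.0, step=0.5)
print(round(x, 2), round(fx, 4))
```

Early on (high T) worse moves are often accepted, which helps escape local minima in the noisy validation landscape; as T decays the search becomes greedy.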
Search space (7 dimensions shared by the NiaPy algorithms; TPE adds dropout as an 8th):

| Hyperparameter | Range | Type |
|---|---|---|
| Hidden dimensions | {64, 96, 128, 192, 256, 384, 512} | Categorical |
| Number of layers | {3, 4, 5, 6, 7} | Categorical |
| MLP head layer 1 | {128, 192, 256, 384, 512} | Categorical |
| MLP head layer 2 | {64, 96, 128, 192, 256} | Categorical |
| MLP head layer 3 | {32, 48, 64, 96, 128} | Categorical |
| Learning rate | [1e-4, 1e-2] | Log-uniform |
| Weight decay | [1e-6, 1e-2] | Log-uniform |
| Dropout (TPE only) | [0.0, 0.5] | Uniform |
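Random search over this space simply samples each dimension independently: categorical choices uniformly, and the continuous rates log-uniformly. A sketch of one draw (hyperparameter names follow the table; the actual definition lives in `optimization/space.py`):

```python
import math
import random

SPACE = {
    "hidden_dim": [64, 96, 128, 192, 256, 384, 512],
    "num_layers": [3, 4, 5, 6, 7],
    "mlp_1":      [128, 192, 256, 384, 512],
    "mlp_2":      [64, 96, 128, 192, 256],
    "mlp_3":      [32, 48, 64, 96, 128],
}
LOG_UNIFORM = {"lr": (1e-4, 1e-2), "weight_decay": (1e-6, 1e-2)}

def sample_config(rng, with_dropout=False):
    """Draw one configuration: uniform categoricals + log-uniform continuous."""
    cfg = {k: rng.choice(v) for k, v in SPACE.items()}
    for k, (lo, hi) in LOG_UNIFORM.items():
        # Sample the exponent uniformly so each decade is equally likely
        cfg[k] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    if with_dropout:  # TPE's extra 8th dimension
        cfg["dropout"] = rng.uniform(0.0, 0.5)
    return cfg

rng = random.Random(42)
cfg = sample_config(rng, with_dropout=True)
print(cfg)
```

Log-uniform sampling matters for learning rate and weight decay: uniform sampling over [1e-4, 1e-2] would almost never try values near 1e-4.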
Practical recommendations:

| Scenario | Recommended Algorithm | Reason |
|---|---|---|
| Regression (general) | Random Search or PSO | Fast, competitive; Random wins 3/4 ADME tasks |
| Classification / toxicity | SA or ABC | Better handles class imbalance; wins on both tox tasks |
| Complex metabolic endpoints | TPE (Optuna) | Best sample efficiency on Clearance_Hepatocyte |
| Quick baseline | Morgan-FP + MLP | Simple, interpretable, no GPU needed |
| Limited compute budget | Random Search | Zero optimizer overhead, competitive with 50 trials |
MIT License
- Therapeutics Data Commons (TDC) -- Datasets and benchmarks
- PyTorch Geometric -- GNN framework
- NiaPy -- Metaheuristic algorithms
- Optuna -- TPE optimization
- Hugging Face Transformers -- ChemBERTa
- RDKit -- Molecular featurization
Last updated: 2026-04-01