Benchmarking Hyperparameter Optimization Strategies for Graph Neural Networks on ADMET Prediction Under Scaffold-Split Evaluation
Authors: Martin Stamenov, Mila Gjurovska, Viktorija Vodilovska, Ilinka Ivanoska
Paper: paper_1/main.tex
Systematic HPO Benchmark for Molecular GNNs is a reproducible benchmarking framework that systematically evaluates seven hyperparameter optimization (HPO) strategies for Graph Neural Networks (GNNs) on six ADMET datasets from the Therapeutics Data Commons (TDC). It additionally compares optimized GNNs against frozen foundation model baselines (ChemBERTa, MolCLR, Morgan-FP, MolE-FP), and provides multi-seed statistical validation with confidence intervals.
The framework answers two core questions:
- Which HPO algorithm should practitioners choose for GNN-based molecular property prediction under scaffold-split evaluation?
- Can task-specific GNNs with systematic HPO match or exceed frozen pretrained foundation models?
Benchmark at a glance:

| Metric | Value |
|---|---|
| Datasets | 6 (4 ADME regression + 2 Toxicity classification) |
| Total molecules | 11,805 |
| HPO algorithms | 7 (Random, PSO, ABC, GA, SA, HC, TPE) |
| Trials per run | 50 |
| Total HPO runs | 42 (6 datasets x 7 algorithms) |
| Total model evaluations | 2,100+ |
| Multi-seed validation | 5 seeds per dataset |
| Foundation model baselines | 4 (ChemBERTa, MolCLR, Morgan-FP, MolE-FP) |
| GNN backbone | GCN (GraphConv) |
| Evaluation protocol | Scaffold split (Bemis-Murcko, 80/10/10) |
| Hardware | NVIDIA RTX 3060, i7-8700K, 16 GB RAM |
| Total compute | ~45 hours |
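The GCN backbone (GraphConv) updates each node by combining its own features with an aggregate of its neighbors' features. A toy single-layer sketch in pure Python with scalar node features (illustration only; the actual model in `src/core/optimized_gnn.py` uses PyTorch Geometric):

```python
def graph_conv_layer(x, adj, w_self, w_neigh):
    """One GraphConv step: x_i' = w_self * x_i + w_neigh * sum(x_j for neighbors j).
    Scalar weights and scalar node features, for illustration."""
    out = []
    for i, xi in enumerate(x):
        neigh_sum = sum(x[j] for j in adj[i])
        out.append(w_self * xi + w_neigh * neigh_sum)
    return out

# Toy molecular graph: a 4-node path (e.g. a butane-like carbon chain)
x   = [1.0, 2.0, 3.0, 4.0]
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
out = graph_conv_layer(x, adj, w_self=0.5, w_neigh=0.25)
print(out)  # → [1.0, 2.0, 3.0, 2.75]
```

Stacking 3-7 such layers (the searched range) followed by a pooling step and an MLP head yields the molecule-level prediction.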
- No universal optimizer exists. Random Search wins on 3/4 regression tasks; metaheuristics (SA, ABC) win on classification. Algorithm choice is task-dependent.
- Random Search is a strong baseline. No metaheuristic achieves statistically significant improvement over Random Search (Wilcoxon signed-rank, p > 0.05) under a 50-trial budget with scaffold split.
- Scaffold-split evaluation changes optimizer rankings. The noisy validation landscape induced by scaffold split reduces the advantage of adaptive metaheuristics compared to random-split settings.
- GNNs outperform frozen foundation models on toxicity. hERG: GNN AUC=0.825 vs ChemBERTa 0.770. Tox21: GNN AUC=0.742 vs ChemBERTa 0.728.
- Structure-only models fail on complex pharmacokinetic (PK) endpoints. Hepatocyte clearance reaches R^2 = -1.02 (worse than predicting the mean); foundation models provide a more stable starting point on this task.
- Dataset difficulty varies dramatically, from hERG (AUC=0.825, strong signal) to hepatocyte clearance (R^2=-1.02, effectively unlearnable from structure alone).
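The significance claim above rests on a paired Wilcoxon signed-rank test. A minimal sketch with SciPy, using made-up paired scores purely for illustration (the real per-run results live in `runs/`):

```python
from scipy.stats import wilcoxon

# Hypothetical paired validation scores (e.g. AUC across seeds/datasets)
# for Random Search vs. a metaheuristic -- illustrative values only.
random_search = [0.713, 0.705, 0.721, 0.699, 0.717]
metaheuristic = [0.735, 0.701, 0.726, 0.702, 0.719]

stat, p = wilcoxon(random_search, metaheuristic)
print(f"W={stat}, p={p:.3f}")
if p > 0.05:
    print("No significant difference at alpha=0.05")
```

The test is paired (same datasets/seeds for both optimizers) and makes no normality assumption, which suits small samples of noisy validation scores.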
Best RMSE per HPO algorithm on the regression datasets (lower is better):

| Dataset | PSO | ABC | GA | SA | HC | Random | TPE |
|---|---|---|---|---|---|---|---|
| Caco2_Wang | 0.0031 | 0.0029 | 0.0031 | 0.0029 | 0.0030 | 0.0027 | 0.0030 |
| Half_Life_Obach | 21.66 | 21.66 | 21.66 | 23.70 | 24.52 | 22.31 | 22.34 |
| Clearance_Hepatocyte_AZ | 70.21 | 72.04 | 71.34 | 72.04 | 72.04 | 68.22 | 52.16 |
| Clearance_Microsome_AZ | 42.76 | 42.29 | 42.29 | 40.94 | 41.63 | 38.75 | 44.34 |
Best AUC-ROC per HPO algorithm on the classification datasets (higher is better):

| Dataset | PSO | ABC | GA | SA | HC | Random | TPE |
|---|---|---|---|---|---|---|---|
| Tox21 (NR-AR) | 0.692 | 0.735 | 0.735 | 0.742 | 0.652 | 0.713 | 0.705 |
| hERG | 0.747 | 0.825 | 0.747 | 0.802 | 0.821 | 0.747 | 0.772 |
Note: TPE uses Optuna and additionally searches over dropout (8-dim space), while NiaPy-based algorithms share a 7-dim search space.
Multi-seed validation of the best configurations (5 seeds per dataset):

| Dataset | Task | Metric | Mean +/- Std (95% CI) |
|---|---|---|---|
| Caco2_Wang | Regr. | RMSE | 0.0033 +/- 0.0005 (0.0027--0.0039) |
| Half_Life_Obach | Regr. | RMSE | 20.05 +/- 1.17 (18.61--21.50) |
| Clearance_Hepatocyte_AZ | Regr. | RMSE | 52.37 +/- 2.87 (48.81--55.93) |
| Clearance_Microsome_AZ | Regr. | RMSE | 53.46 +/- 13.56 (36.63--70.30) |
| Tox21 (NR-AR) | Class. | AUC | 0.711 +/- 0.012 (0.696--0.727) |
| hERG | Class. | AUC | 0.805 +/- 0.022 (0.778--0.832) |
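The intervals above are consistent with a two-sided Student-t 95% CI over 5 seeds (t ≈ 2.776 at 4 degrees of freedom). A sketch reproducing the Tox21 row from its reported mean and std:

```python
import math

def t_ci_95(mean, std, n, t_crit=2.776):
    """Two-sided 95% CI from summary stats; t_crit hardcoded for df = n - 1 = 4."""
    half = t_crit * std / math.sqrt(n)
    return mean - half, mean + half

lo, hi = t_ci_95(0.711, 0.012, 5)        # Tox21 (NR-AR) AUC row
print(f"95% CI: ({lo:.3f}, {hi:.3f})")   # close to the reported (0.696, 0.727)
```

Small discrepancies in the last digit come from rounding the reported mean/std before the calculation.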
Best HPO-tuned GNN versus frozen foundation-model baselines:

| Model | Caco2 (R^2) | Half_Life (RMSE) | Clear_Hep (RMSE) | Clear_Micro (RMSE) | Tox21 (AUC) | hERG (AUC) |
|---|---|---|---|---|---|---|
| GNN-Best | 0.48 | 21.66 | 68.22 | 38.75 | 0.743 | 0.825 |
| Morgan-FP | -- | 22.12 | 48.36 | 40.36 | 0.722 | 0.611 |
| ChemBERTa | 0.48 | 27.39 | 47.31 | 42.56 | 0.728 | 0.770 |
| MolE-FP | -- | 25.01 | 47.22 | 41.79 | 0.675 | 0.672 |
| MolCLR | -- | 21.71 | 48.92 | 42.19 | 0.452 | 0.401 |
The Caco2 comparison uses R^2 (scale-invariant) because the GNN reports RMSE in original units while the foundation models report it in z-score-normalized space.
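The scale-invariance claim holds because R^2 is unchanged when the same affine transform (such as z-scoring) is applied to both targets and predictions, and it goes negative whenever a model is worse than predicting the mean. A small self-contained demo:

```python
def r2(y, yhat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

y    = [2.0, 4.0, 6.0, 8.0]
yhat = [2.5, 3.5, 6.5, 7.5]
print(round(r2(y, yhat), 3))        # 0.95

# z-scoring both series (same mu/sd) leaves R^2 unchanged
mu = sum(y) / len(y)
sd = (sum((v - mu) ** 2 for v in y) / len(y)) ** 0.5
z  = lambda xs: [(v - mu) / sd for v in xs]
print(round(r2(z(y), z(yhat)), 3))  # 0.95 again

# a model worse than the mean yields negative R^2 (cf. Clearance_Hepatocyte)
print(r2(y, [8.0, 6.0, 4.0, 2.0]) < 0)  # True
```

This is why RMSE values computed in different unit systems cannot be compared directly, while R^2 values can.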
Quickstart:

```shell
git clone https://github.com/NitramVonemats/MANU_Project.git
cd MANU_Project
pip install -r requirements.txt

# Run HPO and benchmarks
python scripts/run_hpo_50_trials.py
python scripts/run_tpe_benchmark.py
python scripts/run_complete_foundation_benchmark.py
python scripts/run_chemberta_finetune.py
python scripts/run_multi_seed_validation.py

# Generate figures
python scripts/create_hpo_visualizations.py
python scripts/create_foundation_comparison_plots.py
```

Repository layout:

```
MANU/
|-- paper_1/                          # LaTeX paper
|   |-- main.tex                      # Main manuscript
|   |-- refs.bib                      # Bibliography
|   `-- images/                       # Paper figures (PNG)
|
|-- src/core/                         # Core source code
|   |-- optimized_gnn.py              # GNN model, training, evaluation
|   `-- model_comparison.py           # Model comparison utilities
|
|-- optimization/                     # HPO framework
|   |-- space.py                      # 7-dim search space definition
|   |-- problem.py                    # NiaPy problem wrapper
|   |-- runner.py                     # HPO execution runner
|   |-- foundation_problem.py         # Foundation model HPO wrapper
|   |-- foundation_runner.py          # Foundation model HPO runner
|   `-- algorithms/                   # Algorithm implementations
|       |-- pso.py                    # Particle Swarm Optimization
|       |-- genetic.py                # Genetic Algorithm
|       |-- abc.py                    # Artificial Bee Colony
|       |-- simulated_annealing.py    # Simulated Annealing
|       |-- hill_climbing.py          # Hill Climbing
|       `-- random_search.py          # Random Search
|
|-- scripts/                          # Execution and analysis scripts
|   |-- run_hpo_50_trials.py          # Main HPO runner (50 trials)
|   |-- run_tpe_benchmark.py          # TPE via Optuna
|   |-- run_multi_seed_validation.py  # 5-seed validation
|   |-- run_chemberta_finetune.py     # ChemBERTa fine-tuning
|   |-- run_complete_foundation_benchmark.py
|   |-- create_hpo_visualizations.py  # HPO figures
|   |-- create_foundation_comparison_plots.py
|   |-- statistical_significance_tests.py
|   `-- analyses/                     # Detailed analysis scripts
|
|-- runs/                             # HPO results (JSON, per dataset/algo)
|   |-- Caco2_Wang/                   # 6 algo result files
|   |-- Half_Life_Obach/
|   |-- Clearance_Hepatocyte_AZ/
|   |-- Clearance_Microsome_AZ/
|   |-- tox21/
|   `-- herg/
|
|-- results/                          # Processed results
|   |-- multi_seed/                   # 5-seed validation results
|   |-- tpe_benchmark/                # TPE results (6 datasets)
|   |-- foundation_benchmark/         # Foundation model comparison CSV
|   |-- chemberta_finetune/           # ChemBERTa fine-tuning results
|   |-- figures/                      # Generated tables and figures
|   `-- hpo/                          # Processed HPO results
|
|-- datasets/                         # Raw datasets (CSV)
|   |-- adme/                         # 4 ADME regression datasets
|   `-- toxicity/                     # Tox21, hERG, ClinTox
|
|-- external/MolCLR/                  # MolCLR pretrained checkpoints
|-- figures/paper/                    # Generated LaTeX tables
|-- archive/                          # Old experiments and scripts
|-- requirements.txt                  # Python dependencies
`-- README.md                         # This file
```
All datasets are from the Therapeutics Data Commons (TDC) ADMET benchmark.
| Dataset | Task | Molecules | Primary Metric | Difficulty |
|---|---|---|---|---|
| Caco2_Wang | Permeability (regression) | 910 | RMSE, R^2 | Moderate (R^2=0.48) |
| Half_Life_Obach | Half-life (regression) | 667 | RMSE, R^2 | Very Hard (R^2=0.004) |
| Clearance_Hepatocyte_AZ | Clearance (regression) | 1,213 | RMSE, R^2 | Impossible (R^2=-1.02) |
| Clearance_Microsome_AZ | Clearance (regression) | 1,102 | RMSE, R^2 | Weak (R^2=0.19) |
| Tox21 (NR-AR) | Toxicity (classification) | 7,258 | AUC-ROC | Moderate (3.5% pos) |
| hERG | Cardiotoxicity (classification) | 655 | AUC-ROC | Good (AUC=0.825) |
Splitting: Bemis-Murcko scaffold split (80/10/10 train/val/test), seed 42.
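Scaffold splitting groups molecules by their Bemis-Murcko scaffold and assigns whole groups to train/val/test, so no test-set scaffold is seen during training. A framework-free sketch of the common largest-group-first variant; `scaffold_of` is a stand-in argument (the real pipeline would derive scaffolds with RDKit's MurckoScaffold utilities):

```python
import random
from collections import defaultdict

def scaffold_split(smiles, scaffold_of, frac=(0.8, 0.1, 0.1), seed=42):
    """Assign whole scaffold groups to train/val/test so test scaffolds
    are unseen during training (largest-group-first variant)."""
    groups = defaultdict(list)
    for i, smi in enumerate(smiles):
        groups[scaffold_of(smi)].append(i)
    buckets = list(groups.values())
    random.Random(seed).shuffle(buckets)  # break ties among equal-size groups
    buckets.sort(key=len, reverse=True)   # largest scaffold groups first
    n = len(smiles)
    train, val, test = [], [], []
    for bucket in buckets:
        if len(train) + len(bucket) <= frac[0] * n:
            train += bucket
        elif len(train) + len(val) + len(bucket) <= (frac[0] + frac[1]) * n:
            val += bucket
        else:
            test += bucket
    return train, val, test

# Toy stand-in for a Bemis-Murcko scaffold: the first SMILES character
toy = ["CCO", "CCN", "c1ccccc1", "c1ccncc1", "CC(=O)O", "NCC", "OCC", "c1ccoc1"]
tr, va, te = scaffold_split(toy, scaffold_of=lambda s: s[0])
print(len(tr), len(va), len(te))  # → 6 1 1
```

Because whole groups move together, the realized split ratios only approximate 80/10/10; this is expected behavior, not a bug.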
HPO algorithms and configurations:

| Algorithm | Type | Framework | Config |
|---|---|---|---|
| Random Search | Baseline | NiaPy | Uniform sampling |
| PSO | Swarm intelligence | NiaPy | pop=16, C1=2.0, C2=2.0, w=0.7 |
| ABC | Swarm intelligence | NiaPy | colony=16, limit=50 |
| GA | Evolutionary | NiaPy | pop=16, mutation=0.1, crossover=0.8 |
| SA | Probabilistic | NiaPy | T0=1.0, alpha=0.99 |
| HC | Local search | NiaPy | Greedy, single init |
| TPE | Bayesian | Optuna | 10 startup trials, median pruning (5 startup) |
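The SA configuration above (T0=1.0, alpha=0.99) corresponds to geometric cooling with Metropolis acceptance. A minimal generic sketch, minimizing a hypothetical 1-D objective that stands in for the validation loss (NiaPy's internals may differ in detail):

```python
import math
import random

def simulated_annealing(f, x0, step, t0=1.0, alpha=0.99, iters=500, seed=0):
    """Geometric cooling (T <- alpha * T) with Metropolis acceptance."""
    rng = random.Random(seed)
    x, fx, t = x0, f(x0), t0
    best, fbest = x, fx
    for _ in range(iters):
        cand = x + rng.uniform(-step, step)
        fc = f(cand)
        # Always accept improvements; accept worse moves with prob e^(-delta/T)
        if fc <= fx or rng.random() < math.exp(-(fc - fx) / t):
            x, fx = cand, fc
            if fx < fbest:
                best, fbest = x, fx
        t *= alpha
    return best, fbest

x, fx = simulated_annealing(lambda v: (v - 3.0) ** 2, x0=0.0, step=0.5)
print(round(x, 2), round(fx, 4))
```

Early on (high T) worse moves are often accepted, which helps escape local minima in the noisy validation landscape; as T decays the search becomes greedy.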
Search space (7 dimensions shared by the NiaPy algorithms; TPE adds dropout as an 8th):

| Hyperparameter | Range | Type |
|---|---|---|
| Hidden dimensions | {64, 96, 128, 192, 256, 384, 512} | Categorical |
| Number of layers | {3, 4, 5, 6, 7} | Categorical |
| MLP head layer 1 | {128, 192, 256, 384, 512} | Categorical |
| MLP head layer 2 | {64, 96, 128, 192, 256} | Categorical |
| MLP head layer 3 | {32, 48, 64, 96, 128} | Categorical |
| Learning rate | [1e-4, 1e-2] | Log-uniform |
| Weight decay | [1e-6, 1e-2] | Log-uniform |
| Dropout (TPE only) | [0.0, 0.5] | Uniform |
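Random search over this space simply samples each dimension independently: categorical choices uniformly, and the continuous rates log-uniformly. A sketch of one draw (hyperparameter names follow the table; the actual definition lives in `optimization/space.py`):

```python
import math
import random

SPACE = {
    "hidden_dim": [64, 96, 128, 192, 256, 384, 512],
    "num_layers": [3, 4, 5, 6, 7],
    "mlp_1":      [128, 192, 256, 384, 512],
    "mlp_2":      [64, 96, 128, 192, 256],
    "mlp_3":      [32, 48, 64, 96, 128],
}
LOG_UNIFORM = {"lr": (1e-4, 1e-2), "weight_decay": (1e-6, 1e-2)}

def sample_config(rng, with_dropout=False):
    """Draw one configuration: uniform categoricals + log-uniform continuous."""
    cfg = {k: rng.choice(v) for k, v in SPACE.items()}
    for k, (lo, hi) in LOG_UNIFORM.items():
        # Sample the exponent uniformly so each decade is equally likely
        cfg[k] = math.exp(rng.uniform(math.log(lo), math.log(hi)))
    if with_dropout:  # TPE's extra 8th dimension
        cfg["dropout"] = rng.uniform(0.0, 0.5)
    return cfg

rng = random.Random(42)
cfg = sample_config(rng, with_dropout=True)
print(cfg)
```

Log-uniform sampling matters for learning rate and weight decay: uniform sampling over [1e-4, 1e-2] would almost never try values near 1e-4.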
Practical recommendations:

| Scenario | Recommended Algorithm | Reason |
|---|---|---|
| Regression (general) | Random Search or PSO | Fast, competitive; Random wins 3/4 ADME tasks |
| Classification / toxicity | SA or ABC | Better handles class imbalance; wins on both tox tasks |
| Complex metabolic endpoints | TPE (Optuna) | Best sample efficiency on Clearance_Hepatocyte |
| Quick baseline | Morgan-FP + MLP | Simple, interpretable, no GPU needed |
| Limited compute budget | Random Search | Zero optimizer overhead, competitive with 50 trials |
MIT License
- Therapeutics Data Commons (TDC) -- Datasets and benchmarks
- PyTorch Geometric -- GNN framework
- NiaPy -- Metaheuristic algorithms
- Optuna -- TPE optimization
- Hugging Face Transformers -- ChemBERTa
- RDKit -- Molecular featurization
Last updated: 2026-04-01