Target-aware baseline pipeline for ranking small-molecule ligands against kinase targets by predicted binding strength. The repository now runs end to end: download data, process it into a consistent schema, train a baseline model, evaluate ranking quality, and score new ligand-target pairs with uncertainty.
The current baseline is designed for candidate prioritization, not drug discovery claims.
It provides:
- A reproducible data pipeline for kinase-ligand affinity data
- A consistent `p_activity` target across `Kd`, `Ki`, and `IC50` measurements
- A target-aware baseline model using Morgan fingerprints plus protein-sequence features
- Ranking metrics on held-out kinase targets
- Uncertainty estimates calibrated on validation data
Scientists use AI models to help decide which kinase-targeting compounds are worth testing in the lab. The problem is that many standard evaluations are too easy: proteins in the test set can still be very similar to proteins the model already saw during training. That can make a model look stronger than it really is.
An easy way to think about it: if you train someone to recognize dogs, but the test only shows them dogs that look almost identical to the ones they already studied, the test will overstate how well they generalize.
KinBench-UQ exists to make that evaluation stricter and more realistic for kinase-specific candidate prioritization. In particular, it adds sequence-identity-aware splitting, mutation-family evaluation, external validation, and generated analysis artifacts that make leakage and split difficulty visible.
So the main contribution is not simply "a better model." It is also "a better exam" for evaluating kinase prioritization models.
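To illustrate the splitting idea, here is a hypothetical sketch (not the repo's implementation, which uses global pairwise alignment): group targets so that similar proteins end up in the same cluster, then assign whole clusters to train/val/test. The `identity` function below is a crude stand-in using `difflib`, and the greedy clustering is only for illustration.

```python
from difflib import SequenceMatcher

def identity(a: str, b: str) -> float:
    # crude stand-in for global-alignment sequence identity
    return SequenceMatcher(None, a, b).ratio()

def greedy_identity_clusters(sequences, threshold=0.4):
    """Greedily group sequences; a sequence joins the first cluster
    containing any member above the identity threshold."""
    clusters = []
    for seq in sequences:
        for cluster in clusters:
            if any(identity(seq, other) >= threshold for other in cluster):
                cluster.append(seq)
                break
        else:
            clusters.append([seq])
    return clusters

# Whole clusters are then assigned to train/val/test, so near-duplicate
# proteins never straddle the split boundary.
clusters = greedy_identity_clusters(["MKVLAA", "MKVLAG", "WWPQRS"])
```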
```
raw affinity data
  -> cleaning + canonical SMILES
  -> target-aware train/val/test split
  -> ligand fingerprints + protein sequence features
  -> ridge ensemble regressor
  -> ranked ligands + uncertainty
```
```
bio1/
  README.md
  requirements.txt
  data/
    raw/
    processed/
  docs/
    benchmark_protocol.md
    benchmark_results.md
    methodology.md
    project_plan.md
    results_baseline.md
  paper/
    draft.md
  models/
    baseline/
  results/
    baseline/
  scripts/
    download_data.py
    process_dataset.py
    generate_benchmark_splits.py
    train_baseline.py
    predict_rank.py
    evaluate_budgeted_policies.py
    evaluate_conformal.py
    run_benchmark_suite.py
  src/kinase_ligand_ranking/
  tests/
```
The intended paper-grade environment is pinned for Python 3.12.

```
conda env create -f environment.yml
conda activate kinbench-uq
```

If you prefer venv:

```
python3.12 -m venv .venv
source .venv/bin/activate
pip install -r requirements-lock.txt   # pinned versions for reproduction
pip install -r requirements.txt        # or the unpinned requirements
```
```
# 1. Download the Davis kinase dataset
python scripts/download_data.py --source davis

# 2. Clean and split the dataset
python scripts/process_dataset.py --source davis

# 3. Generate benchmark split families
python scripts/generate_benchmark_splits.py

# 4. Train and evaluate the baseline
python scripts/train_baseline.py

# 5. Evaluate budget-constrained decision policies
python scripts/evaluate_budgeted_policies.py

# 6. Evaluate split-conformal uncertainty
python scripts/evaluate_conformal.py

# 7. Download and build the external validation set
python scripts/download_data.py --source bindingdb
python scripts/build_external_validation.py

# 8. Run the full benchmark suite
python scripts/run_benchmark_suite.py \
    --splits random cold_target cold_ligand scaffold both_new sequence_identity mutation_holdout \
    --models ligand_only_ridge ridge_ensemble dual_tower_uq deepdta_exact graphdta_gcn_exact

# 9. Run external-source validation
python scripts/run_external_validation.py \
    --models ligand_only_ridge ridge_ensemble dual_tower_uq deepdta_exact graphdta_gcn_exact

# 10. Generate manuscript figures and tables from results/*
python scripts/generate_paper_assets.py

# 11. Score new ligand-target pairs
python scripts/predict_rank.py --input path/to/candidates.csv
```

One-command submission pipeline:

```
python scripts/run_submission_pipeline.py --device cpu
```

If you only want the main files for reproducing the paper or comparing models, start here:
- benchmark runner: scripts/run_benchmark_suite.py
- external validation runner: scripts/run_external_validation.py
- split generation: scripts/generate_benchmark_splits.py
- manuscript asset generation: scripts/generate_paper_assets.py
- one-command pipeline: scripts/run_submission_pipeline.py
- new-candidate scoring: scripts/predict_rank.py
Main model code:
- repo-native model: `neural_modeling.py` (class `DualTowerUncertaintyRanker`)
- literature baselines: `literature_models.py` (models `deepdta_exact`, `graphdta_gcn_exact`)
- feature construction: `features.py`
- split definitions: `splits.py`
If you only want to compare methods under the main benchmark, run:
```
python scripts/run_benchmark_suite.py \
    --splits random cold_target sequence_identity mutation_holdout \
    --models ligand_only_ridge ridge_ensemble dual_tower_uq deepdta_exact graphdta_gcn_exact \
    --results-dir results/benchmark \
    --device cpu
```

For a step-by-step reproducibility path, see docs/reproduce_paper.md.
The processed dataset keeps the original assay family in `affinity_type`, but the regression target is `p_activity = -log10(affinity in molar units)`.
This matters because:
- `Kd`, `Ki`, and `IC50` are not interchangeable assay labels
- the previous `pic50` naming was wrong for mixed-type data
- the repo now preserves both the raw assay type and the generic transformed target
Processed CSV columns include:
`smiles`, `target_id`, `target_label`, `target_sequence`, `affinity_type`, `activity_label`, `affinity_nm`, `p_activity`, `measurement_count`, `source`
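The transform itself is simple. A minimal sketch (the repo's actual processing code may differ in edge-case handling):

```python
import math

def p_activity(affinity_nm: float) -> float:
    """Convert an affinity in nM to p_activity = -log10(affinity in molar units)."""
    return -math.log10(affinity_nm * 1e-9)

# e.g. a 10 nM Kd corresponds to p_activity of about 8.0,
# and a 1000 nM (1 uM) IC50 to about 6.0
```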
The current model is a bootstrap ensemble of ridge regressors trained on:
- Morgan fingerprints for ligands
- amino-acid composition and length-derived target sequence features
- assay-type one-hot features
- source one-hot features
Model selection uses the validation split to choose ridge regularization, then refits on train+validation before final evaluation on held-out test targets.
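The ensemble idea can be sketched in a few lines with synthetic features (this is an illustration, not the repo's training code; `fit_ridge` is a hypothetical closed-form helper):

```python
import numpy as np

def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge weights: (X^T X + alpha * I)^-1 X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))                 # stand-in for fingerprint + sequence features
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# bootstrap ensemble: refit on resampled rows, use member spread as uncertainty
weights = []
for _ in range(10):
    idx = rng.integers(0, 200, size=200)       # resample with replacement
    weights.append(fit_ridge(X[idx], y[idx]))

preds = np.stack([X[:5] @ w for w in weights])
mean_pred = preds.mean(axis=0)                 # used for ranking
std_pred = preds.std(axis=0)                   # ensemble uncertainty
```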
Artifacts were regenerated locally on March 10, 2026 with:

```
python scripts/download_data.py --source davis
python scripts/process_dataset.py --source davis
python scripts/train_baseline.py
```

Held-out test metrics from `metrics.json`:
- RMSE: 0.779
- Global Spearman: 0.454
- ROC-AUC at `p_activity >= 6.0`: 0.797
- Mean per-target Spearman: 0.465
- Mean top-10% enrichment: 3.424x
- 95% interval coverage after calibration: 0.952
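Top-10% enrichment can be computed along these lines (a sketch under a common definition, hit rate in the top fraction divided by the overall base rate; the repo's exact definition may differ):

```python
import numpy as np

def topk_enrichment(scores, is_active, frac=0.10):
    """Active fraction among the top `frac` by score, divided by the base rate."""
    n_top = max(1, int(round(frac * len(scores))))
    top = np.argsort(scores)[::-1][:n_top]     # indices of the highest scores
    return is_active[top].mean() / is_active.mean()
```

A value of 1.0 means ranking is no better than random selection; 3.4x means the top decile is 3.4 times richer in actives than the full set.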
Decision-oriented results from `budget_policy_metrics.json`:
- budget-1 hit rate: 0.785
- budget-3 hit rate: 0.646
- budget-5 hit rate: 0.600
- budget-10 hit rate: 0.506
- uncertainty/error Spearman: 0.431
The current calibrated uncertainty is informative about error, but the simple risk-adjusted selection policy does not yet beat mean-only ranking on the Davis test split. That limitation is reported as part of the benchmark story rather than hidden.
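One common way to calibrate interval width (a hypothetical sketch; the repo's calibration procedure may differ) is to pick a scalar multiplier for the ensemble std that achieves the desired empirical coverage on validation residuals:

```python
import numpy as np

def calibrate_scale(val_residuals, val_stds, coverage=0.95):
    """Smallest s such that |residual| <= s * std for `coverage` of validation points."""
    return float(np.quantile(np.abs(val_residuals) / val_stds, coverage))

# test-time intervals are then: prediction +/- s * prediction_std
```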
Conformal uncertainty results from `conformal_metrics.json`:
- normalized conformal at `alpha=0.10`: coverage 0.895, mean width 2.032
- normalized conformal at `alpha=0.05`: coverage 0.952, mean width 3.289
- at `alpha=0.10`, the model abstains on about 69.5% of cases while still identifying a large confident-inactive region
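For reference, plain (unnormalized) split conformal looks like the following sketch of the standard recipe; the normalized variant additionally scales calibration residuals by the model's predicted std:

```python
import numpy as np

def split_conformal_halfwidth(cal_residuals, alpha=0.10):
    """Half-width q such that [pred - q, pred + q] covers roughly 1 - alpha
    of new points, using the finite-sample-corrected quantile level."""
    n = len(cal_residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    return float(np.quantile(np.abs(cal_residuals), level))
```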
Benchmark comparison results:
- ridge baselines across `random`, `cold_target`, `cold_ligand`, `scaffold`, and `both_new` splits are summarized in results/benchmark_ridge/summary.csv
- the interaction-aware `dual_tower_uq` model is summarized in results/benchmark_dual_tower/summary.csv
- the mutation-transfer benchmark is summarized in results/benchmark_mutation/summary.csv
- the split manifests live in data/benchmark/manifest.json
- exact-architecture DeepDTA and GraphDTA-style baselines run through scripts/run_benchmark_suite.py
- external-source validation is produced by scripts/run_external_validation.py
- manuscript evidence is generated by scripts/generate_paper_assets.py
Current takeaways:
- ligand/scaffold/both-new splits are materially harsher than random splits for the ridge baselines
- target-held-out is not automatically the hardest split on Davis
- the interaction-aware model materially outperforms the ridge baselines on the currently run `random` and `cold_target` splits
- on the new `mutation_holdout` split, `dual_tower_uq` is much stronger than the ridge baselines
scripts/predict_rank.py expects a CSV with:
- required: `smiles`, `target_id`
- optional: `target_sequence`, `affinity_type`, `source`

Predictions are written with:
- `predicted_p_activity`
- `prediction_std`
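A minimal candidates file can be produced like this (the SMILES are well-known placeholder molecules, and the `target_id` values here are hypothetical; real IDs must match the processed dataset):

```python
import csv

rows = [
    {"smiles": "CC(=O)Oc1ccccc1C(=O)O", "target_id": "ABL1"},       # aspirin, placeholder
    {"smiles": "Cn1cnc2c1c(=O)n(C)c(=O)n2C", "target_id": "EGFR"},  # caffeine, placeholder
]
with open("candidates.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["smiles", "target_id"])
    writer.writeheader()
    writer.writerows(rows)
```

Then score it with `python scripts/predict_rank.py --input candidates.csv`.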
Run the tests with:

```
python -m unittest discover -s tests
```

Recently added:
- exact target-sequence-identity split generation using global pairwise alignment
- mutation-family analysis and leakage summaries as generated artifacts
- exact-architecture `deepdta_exact` and `graphdta_gcn_exact` comparison baselines
- Davis-to-BindingDB external validation
- generated manuscript tables and figures from `results/*`