nyelidl/pKaNET_Cloud

pKaNET Cloud+: Tautomer-Aware Protonation Engine

Heuristic microstate ranking · pH-adjusted SMILES · 3D structure generation · Google Colab & Streamlit

Open In Colab Open in Streamlit PyPI Python License: MIT

pKaNET Workflow


🔍 What This Tool Does

pKaNET Cloud+ determines the dominant docking-relevant protonation state of a small molecule at a user-defined pH using a tautomer-aware Henderson–Hasselbalch microstate-ranking workflow.

The workflow is designed for ligand preparation before molecular docking, molecular mechanics parameterisation, and cheminformatics dataset curation.

Main functions:

  • Identifies ionisable sites using a calibrated SMARTS-based heuristic pKa table with context-aware rules for over 50 functional-group classes.
  • Uses Dimorphite-DL-assisted ionisation-state enumeration, followed by pKaNET Cloud+ tautomer-aware microstate filtering and re-ranking.
  • Ranks candidate microstates using a Henderson–Hasselbalch-inspired scoring function with multi-site charge-cap logic.
  • Optionally queries PubChem for experimental dissociation-constant evidence when available.
  • Returns the dominant microspecies as a pH-adjusted SMILES with formal charge.
  • Provides a fast sub-millisecond heuristic_net_charge() path and a smart predict_charge(mode='auto') dispatcher for large-scale screening.
  • Builds and minimises a 3D ligand structure using ETKDG followed by MMFF optimisation, with UFF fallback.
  • Exports docking- and parameterisation-ready PDB and SDF files.
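The Henderson–Hasselbalch relation behind the ranking step can be illustrated with a minimal sketch. The function name `ionised_fraction` and the example pKa values are assumptions made for illustration; this is not the scoring function used by pKaNET Cloud+, only the textbook single-site relation it is described as being inspired by.

```python
def ionised_fraction(pka: float, ph: float, kind: str) -> float:
    """Fraction of a single site present in its charged form at the given pH.

    Textbook Henderson-Hasselbalch relation for one isolated site.
    Hypothetical helper, not part of the pKaNET API.
    """
    if kind == "acid":   # HA <-> A- + H+, charged form is A-
        return 1.0 / (1.0 + 10.0 ** (pka - ph))
    if kind == "base":   # BH+ <-> B + H+, charged form is BH+
        return 1.0 / (1.0 + 10.0 ** (ph - pka))
    raise ValueError(f"kind must be 'acid' or 'base', got {kind!r}")


# Illustrative pKa values only (not taken from the pKaNET table).
acid = ionised_fraction(4.2, 7.4, "acid")
base = ionised_fraction(9.5, 7.4, "base")
print(f"acid site ionised: {acid:.3f}, base site protonated: {base:.3f}")
```

A site whose pKa equals the target pH gives a fraction of exactly 0.5, which is why sites close to the target pH are the hardest to assign.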

📦 Dependencies

| Library | Required | Purpose |
| --- | --- | --- |
| rdkit | ✅ required | SMARTS matching, molecule standardisation, tautomer handling, formal charge assignment, 3D conformer generation, and geometry optimisation |
| dimorphite-dl | ✅ required | Initial ionisation-state enumeration; pKaNET Cloud+ re-ranks the generated microstates using its heuristic pKa scoring model |
| requests | ⚙️ optional | PubChem experimental pKa / dissociation-constant lookup |
| pkasolver | ⚙️ optional | ML-GNN pKa backend when available |
| propka | ⚙️ optional | Semi-empirical pKa backend or fallback |

py3Dmol is used only in the accompanying Colab notebook or visualisation interface. It is not required by the core pKaNET.py engine.


🚀 Installation

# Minimal (RDKit + requests only; heuristic mode)
pip install pkanet-cloud

# Recommended (adds Dimorphite-DL and pandas)
pip install "pkanet-cloud[recommended]"

# Full (adds propka and Streamlit web UI)
pip install "pkanet-cloud[all]"

RDKit note: rdkit is listed as a dependency, but the rdkit wheel on PyPI requires Python ≥ 3.9 on a 64-bit platform. If your environment uses a conda-managed RDKit, install without dependencies:

pip install pkanet-cloud --no-deps
pip install requests dimorphite-dl pandas   # then add these manually

💻 Command-Line Interface (CLI)

After installation, a pkanet command is available globally.

Basic usage

# Single SMILES string
pkanet --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" --ph 7.4

# Compound name, resolved automatically via PubChem
pkanet --name "baicalein" --ph 7.4 --out-dir ./results

# SMILES file (one SMILES [name] per line); also accepted via --file
pkanet --smi-file ligands.smi --formats PDB MOL2 --top-n 3

# Ligand structure file (.pdb / .mol2 / .sdf)
pkanet --file ligand.pdb --no-pubchem --keep-stereo

Fast heuristic charge estimation

For large libraries where 3D generation is not needed, --fast uses the sub-millisecond heuristic_net_charge() path and skips tautomer enumeration and Dimorphite-DL entirely.

# Single molecule
pkanet --smiles "NCC(=O)O" --ph 7.4 --fast

# Batch: prints TSV to stdout (name / SMILES / charge / mode)
pkanet --smi-file library.smi --fast --quiet

# .smi file passed via --file is auto-detected and works identically
pkanet --file library.smi --fast --quiet

All flags

| Flag | Default | Description |
| --- | --- | --- |
| --ph | 7.4 | Target pH |
| --out-dir | ./pkanet_out | Output directory |
| --out-name | ligand | Base name for output files |
| --formats PDB MOL2 SDF | PDB | 3D output format(s) |
| --ph-window | 1.0 | Dimorphite-DL enumeration window (target pH ± window/2) |
| --max-tautomers | 8 | Maximum tautomers to enumerate |
| --top-n | 5 | Top N microstates to rank |
| --top-k-3d | 3 | Write 3D structures for top-k microstates |
| --keep-stereo | off | Skip R/S stereoisomer enumeration |
| --no-pubchem | off | Disable PubChem experimental pKa lookup |
| --fast | off | Heuristic charge only; no 3D, no tautomers |
| --json-out FILE | (none) | Write full results as JSON |
| --quiet | off | TSV one-liner output, no banner |
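As a worked example of the --ph-window convention, the sketch below converts a target pH and window into the enumeration range. The helper name `enumeration_range` is hypothetical, and how pKaNET passes this range to Dimorphite-DL internally is an assumption not confirmed here.

```python
def enumeration_range(target_ph: float, ph_window: float = 1.0) -> tuple[float, float]:
    """Return (low, high) pH bounds for a window centred on the target pH.

    Hypothetical helper illustrating the 'target pH +/- window/2' convention.
    """
    half = ph_window / 2.0
    return (target_ph - half, target_ph + half)


low, high = enumeration_range(7.4, 1.0)
print(f"enumerate between pH {low:.1f} and pH {high:.1f}")
```

With the defaults (--ph 7.4, --ph-window 1.0) the enumeration range is pH 6.9 to 7.9.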

Example output (normal mode)

╔══════════════════════════════════════════════════════════╗
║             pKaNET Cloud+  -  CLI                        ║
╚══════════════════════════════════════════════════════════╝

  pKa backend : heuristic (SMARTS table)
  Target pH   : 7.4

════════════════════════════════════════════════════════════════
  🏆  aspirin  -  Rank-1 Microstate
════════════════════════════════════════════════════════════════
  Score                       : -0.312
  SMILES                      : CC(=O)Oc1ccccc1C(=O)[O-]
  Charge @ pH 7.4             : -1
  Zwitterion (strict)         : NO
  Amide preserved             : NO
  pKa source                  : heuristic
  PubChem CID                 : 2244

  📊  Ranked microstates (top 5)
  Rank     Score  Charge  Rec  SMILES
  ────  ────────  ──────  ───  ────────────────────────────────────────
     1    -0.312      -1    ★  CC(=O)Oc1ccccc1C(=O)[O-]
     2    -1.562       0       CC(=O)Oc1ccccc1C(=O)O

⚑ Public API

Three public functions are available for programmatic and large-scale use:

from pkanet import predict_charge, heuristic_net_charge, batch_predict_charges

# Sub-millisecond heuristic estimate with multi-site charge caps
charge = heuristic_net_charge("CC(=O)O", ph=7.4)          # → −1

# Smart dispatcher: fast normally, full pipeline for borderline pKa or tautomeric risk
charge, mode = predict_charge("CC(=O)O", ph=7.4, mode="auto")

# Batch prediction β€” returns a pandas DataFrame
df = batch_predict_charges(["CC(=O)O", "CN", "NCC(=O)O"], ph=7.4, mode="auto")

predict_charge(mode='auto') automatically escalates to the full tautomer + Dimorphite-DL + scoring pipeline when:

  • any detected site pKa is within 1.5 pH units of the target, or
  • the molecule has ring systems but no detectable ionisable sites on the parent form, indicating tautomeric enol risk, such as warfarin supplied as the keto form.
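The two escalation conditions above can be sketched as a small decision function. The name `needs_full_pipeline`, its arguments, and the inclusive comparison are assumptions made for illustration; the actual dispatch logic inside predict_charge may differ.

```python
from typing import Sequence


def needs_full_pipeline(site_pkas: Sequence[float], ph: float,
                        has_rings: bool, margin: float = 1.5) -> bool:
    """Decide whether a molecule should leave the fast heuristic path.

    Hypothetical sketch of the two documented rules, not pKaNET's own code.
    """
    # Rule 1: at least one site pKa lies within `margin` pH units of the target.
    if any(abs(pka - ph) <= margin for pka in site_pkas):
        return True
    # Rule 2: ring system present but no ionisable site detected on the parent
    # form, which may hide an acidic enol tautomer (the warfarin keto case).
    if has_rings and not site_pkas:
        return True
    return False


print(needs_full_pipeline([4.2], 7.4, has_rings=True))   # far from pH 7.4
print(needs_full_pipeline([6.5], 7.4, has_rings=False))  # borderline site
print(needs_full_pipeline([], 7.4, has_rings=True))      # possible enol risk
```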

The heuristic_net_charge() function applies two charge-cap rules that suppress systematic over-charging in the fast path: a polyamine cap and a multi-acid cap.
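The idea of a charge cap can be illustrated with a minimal sketch. Everything here is an assumption: the function name `capped_net_charge`, the rule that a site counts as charged when its pKa is on the charged side of the pH, and in particular the cap values of 2, which are placeholders and not the limits used by heuristic_net_charge().

```python
from typing import Sequence


def capped_net_charge(acid_pkas: Sequence[float], base_pkas: Sequence[float],
                      ph: float, max_negative: int = 2, max_positive: int = 2) -> int:
    """Integer net charge with separate caps on anionic and cationic sites.

    Hypothetical illustration only; cap values are placeholders.
    """
    # An acid is counted as deprotonated when its pKa is below the pH.
    n_negative = sum(1 for pka in acid_pkas if pka < ph)
    # A base is counted as protonated when its pKa is above the pH.
    n_positive = sum(1 for pka in base_pkas if pka > ph)
    # Caps stop many similar sites from each contributing a full charge.
    return min(n_positive, max_positive) - min(n_negative, max_negative)


# Four strongly basic amines: an uncapped count would give +4.
print(capped_net_charge([], [10.0, 10.0, 10.0, 10.0], ph=7.4))
```

Capping reflects the fact that successive ionisations on the same molecule become progressively less favourable, so counting every site independently tends to over-charge polyamines and polyacids.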

Full pipeline via run_job

from pkanet.core import run_job

result = run_job(
    input_type     = "SMILES",
    smiles_text    = "CC(=O)OC1=CC=CC=C1C(=O)O",
    uploaded_bytes = None,
    uploaded_name  = None,
    target_pH      = 7.4,
    output_name    = "aspirin",
    out_dir        = "./output",
    output_formats = ["PDB"],
    use_pubchem    = True,
)

top = result["results"][0]
print(top["selected_microstate_smiles"])   # CC(=O)Oc1ccccc1C(=O)[O-]
print(top["formal_charge"])                # -1
print(top["minimized_pdb"])                # ./output/aspirin_micro1_min.pdb

🧪 Internal Regression Test Suite

The internal regression suite contains 65 chemically curated test cases across 12 functional-group classes.

How to run

python3 test_pkanet.py pKaNET.py          # full suite
python3 test_pkanet.py pKaNET.py G8       # flavonoid group only
python3 test_pkanet.py pKaNET.py G12      # drug regression panel only

65 / 65 PASS (100%)

| Group | Description | Pass | Fail | Status |
| --- | --- | --- | --- | --- |
| G1 | Imidazole-type N-H | 10 | 0 | ✅ |
| G2 | Phosphonate / phosphate | 7 | 0 | ✅ |
| G3 | Thiol ArSH / AlkSH | 5 | 0 | ✅ |
| G4 | Carboxylic acid | 5 | 0 | ✅ |
| G5 | Phenol variants, including warfarin enol acid | 6 | 0 | ✅ |
| G6 | Amine bases | 5 | 0 | ✅ |
| G7 | Sulfonamide / saccharin | 4 | 0 | ✅ |
| G8 | Flavonoid regression (must not change) | 4 | 0 | ✅ |
| G9 | Zwitterion / multi-site | 5 | 0 | ✅ |
| G10 | PubChem pKa guard | 3 | 0 | ✅ |
| G11 | Truly neutral | 4 | 0 | ✅ |
| G12 | Drug regression panel | 7 | 0 | ✅ |
| Total | | 65 | 0 | ✅ |

📊 External Benchmark: pKaHub-Derived Validation Subset

pKaNET Cloud+ was benchmarked against a docking-relevant subset derived from pKaHub, an experimental aqueous pKa database with macroscopic charge-state transition annotations.

Benchmark Endpoint

The benchmark endpoint is net-charge agreement at pH 7.4. This checks whether pKaNET Cloud+ predicts the same dominant net formal charge as the pKaHub-derived reference annotation at physiological pH.

This benchmark is not a numerical pKa prediction benchmark. The reported agreement values must not be interpreted as pKa MAE, RMSE, or quantitative pKa accuracy.
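The endpoint is a plain classification match on integer charges. A minimal sketch, assuming a hypothetical helper name `net_charge_agreement` and toy charge lists that are not benchmark data:

```python
from typing import Sequence


def net_charge_agreement(predicted: Sequence[int], reference: Sequence[int]) -> float:
    """Fraction of molecules whose predicted net charge equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one molecule is required")
    matches = sum(1 for p, r in zip(predicted, reference) if p == r)
    return matches / len(reference)


# Toy values for illustration only.
predicted = [-1, 0, 1, 0]
reference = [-1, 0, 1, -1]
print(f"{net_charge_agreement(predicted, reference):.2%}")
```

Because only the integer charge is compared, a prediction counts as correct even if the underlying pKa estimate is off by a unit or more, provided the dominant charge at pH 7.4 is unchanged.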

Drug-Like Screening Criteria

| Property | Threshold |
| --- | --- |
| Molecular weight | 100–600 Da |
| H-bond donors | ≤ 7 |
| H-bond acceptors | ≤ 12 |
| logP | −3 to +6.5 |
| TPSA | ≤ 180 Å² |
| Rotatable bonds | ≤ 15 |
| Heavy atoms | ≥ 7 |

From 38,724 unique SMILES, 35,872 passed these criteria. The final validation subset contained 27,218 molecules prioritised for interpretable charge-state annotation and experimental pKa availability.
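The screening criteria can be expressed as a simple range check over precomputed descriptors. The sketch below assumes the descriptors are already available in a dictionary (for example from RDKit), that all bounds are inclusive, and that the key names are free to choose; none of these details are confirmed by the benchmark description.

```python
from typing import Mapping, Optional, Tuple

# (lower bound, upper bound); None means unbounded on that side.
# Key names are hypothetical; thresholds are those listed in the criteria table.
DRUG_LIKE_LIMITS: Mapping[str, Tuple[Optional[float], Optional[float]]] = {
    "mol_wt": (100.0, 600.0),
    "hbd": (None, 7),
    "hba": (None, 12),
    "logp": (-3.0, 6.5),
    "tpsa": (None, 180.0),
    "rot_bonds": (None, 15),
    "heavy_atoms": (7, None),
}


def passes_drug_like_filter(descriptors: Mapping[str, float]) -> bool:
    """Return True if every descriptor lies inside its (inclusive) range."""
    for key, (low, high) in DRUG_LIKE_LIMITS.items():
        value = descriptors[key]
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


# Approximate, illustrative descriptor values for aspirin.
aspirin = {"mol_wt": 180.16, "hbd": 1, "hba": 3, "logp": 1.3,
           "tpsa": 63.6, "rot_bonds": 3, "heavy_atoms": 13}
print(passes_drug_like_filter(aspirin))
```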

Overall Results

| Dataset | Correct | Total | Agreement rate |
| --- | --- | --- | --- |
| 65-case curated regression set | 65 | 65 | 100.00% |
| pKaHub-derived validation subset | 18,850 | 27,183 | 69.34% |
Agreement by Expected Charge State at pH 7.4

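| Expected charge | Count | Correct | Agreement |
| --- | --- | --- | --- |
| −4 | 7 | 1 | 14.3% |
| −3 | 84 | 14 | 16.7% |
| −2 | 694 | 310 | 44.7% |
| −1 | 7,495 | 4,321 | 57.7% |
| 0 | 12,146 | 9,240 | 76.1% |
| +1 | 6,386 | 4,923 | 77.1% |
| +2 | 309 | 41 | 13.3% |
| +3 or above | 62 | 0 | 0.0% |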
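The overall agreement rate can be recovered from the per-charge rows. The short script below re-aggregates the counts exactly as tabulated; the variable names are arbitrary.

```python
# (expected charge, count, correct), copied from the per-charge table.
rows = [
    ("-4", 7, 1),
    ("-3", 84, 14),
    ("-2", 694, 310),
    ("-1", 7495, 4321),
    ("0", 12146, 9240),
    ("+1", 6386, 4923),
    ("+2", 309, 41),
    ("+3 or above", 62, 0),
]

total = sum(count for _, count, _ in rows)
correct = sum(ok for _, _, ok in rows)
overall = 100.0 * correct / total

for label, count, ok in rows:
    print(f"{label:>12}  {100.0 * ok / count:5.1f}%")
print(f"overall: {correct} / {total} = {overall:.2f}%")  # 18850 / 27183 = 69.34%
```

The rows sum to 27,183 scored molecules and 18,850 agreements, which matches the overall results table and leaves 8,333 disagreements.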

Interpretation

For neutral and monocationic drug-like molecules, which together account for 18,532 of the 27,183 scored molecules, pKaNET Cloud+ assigns the same dominant net charge as the pKaHub-derived reference annotation for approximately three out of four compounds (76.1% and 77.1%, respectively). Agreement is lower for monoanions (57.7%) and drops sharply for species with charge magnitude ≥ 2 (0.0–44.7%), as expected for polyprotic and zwitterionic molecules.

The 8,333 disagreements (30.66%) are concentrated in molecules where at least one predicted ionisable site has a heuristic pKa within ±1.5 units of pH 7.4 and in strongly polyprotic species with charge magnitude ≥ 2, where multi-site pKa ordering is not recoverable from the heuristic table alone. The dominant failure modes are single-step over-prediction of basicity and single-step over-prediction of acidity.


🔑 pKa Backends (auto-selected by priority)

| Priority | Backend | Install |
| --- | --- | --- |
| 1 | pkasolver (GNN) | pip install pkasolver |
| 2 | propka (semi-empirical) | pip install "pkanet-cloud[propka]" |
| 3 | unipka CLI | system install |
| 4 | heuristic SMARTS table | built-in, always available |
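Priority-based selection of this kind is commonly implemented by probing for each backend in order and falling back to the built-in option. The sketch below is an illustration under assumptions: the function name `select_backend` is hypothetical, the importable module names are taken to be `pkasolver` and `propka`, and the executable name `unipka` is assumed. It is not the selection code used by pKaNET Cloud+.

```python
import importlib.util
import shutil
from typing import Callable, Optional


def select_backend(has_module: Optional[Callable[[str], bool]] = None,
                   has_command: Optional[Callable[[str], bool]] = None) -> str:
    """Return the name of the highest-priority backend that appears available.

    The two probes are injectable so the priority order can be tested
    without installing any backend.
    """
    if has_module is None:
        has_module = lambda name: importlib.util.find_spec(name) is not None
    if has_command is None:
        has_command = lambda name: shutil.which(name) is not None

    if has_module("pkasolver"):   # priority 1: GNN backend
        return "pkasolver"
    if has_module("propka"):      # priority 2: semi-empirical backend
        return "propka"
    if has_command("unipka"):     # priority 3: external CLI (name assumed)
        return "unipka"
    return "heuristic"            # priority 4: built-in SMARTS table


print(select_backend())
```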

🧬 Supported Inputs

| Format | Extension |
| --- | --- |
| SMILES | .smi or plain text |
| MDL Molfile | .mol |
| Structure-data file | .sdf |
| Tripos Mol2 | .mol2 |
| Protein Data Bank ligand | .pdb |
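When preparing mixed input folders, it can help to check file extensions against this table before calling the tool. A minimal sketch, assuming a hypothetical helper `detect_input_format` and case-insensitive extension matching; how pKaNET itself detects input types is not shown here.

```python
from pathlib import Path

# Extension-to-format mapping taken from the supported-inputs table.
FORMAT_BY_EXTENSION = {
    ".smi": "SMILES",
    ".mol": "MDL Molfile",
    ".sdf": "Structure-data file",
    ".mol2": "Tripos Mol2",
    ".pdb": "Protein Data Bank ligand",
}


def detect_input_format(filename: str) -> str:
    """Map a file name to a supported input format, or raise ValueError."""
    suffix = Path(filename).suffix.lower()
    try:
        return FORMAT_BY_EXTENSION[suffix]
    except KeyError:
        raise ValueError(f"unsupported input extension: {suffix or '(none)'}") from None


print(detect_input_format("ligand.PDB"))
print(detect_input_format("library.smi"))
```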

📤 Outputs

| Output | Description |
| --- | --- |
| pH-adjusted SMILES | Dominant predicted microspecies at the target pH |
| Net formal charge | Integer formal charge of the selected microspecies |
| minimized_ligand.pdb | 3D ligand structure after geometry minimisation |
| minimized_ligand.sdf | 3D ligand structure with explicit hydrogens and formal charge |
| Preparation log | Site-level protonation decisions, pKa evidence, and ranking information |

🗂️ Benchmark Files

| File | Description |
| --- | --- |
| pKaNET_pKahub_docking_relevant_subset_validation.csv | Curated validation subset with pKaHub-derived reference charge labels and pKaNET predictions |
| pKaNET_pKahub_docking_relevant_failed_cases.csv | Disagreement cases for manual review and future rule refinement |
| curated_regression_set.csv | Internal 65-compound chemically curated regression set |
| validation_pass.csv | Molecules with correct net-charge assignment |
| validation_fail.csv | Disagreement cases |
| validation_summary_template.csv | Template for recording new validation outputs |
| failed_cases_review_template.csv | Template for manually classifying disagreement cases |

🎯 Intended Use Cases

  • Ligand preparation before molecular docking with AutoDock Vina, VinaXB, GNINA, Glide, GOLD, or rDock.
  • GAFF2, CGenFF, or other force-field parameterisation workflows.
  • QSAR, ADMET, and virtual-screening dataset curation.
  • Teaching pKa, protonation state, microspecies, and docking-preparation concepts.

🔗 Integration with Anyone Can Dock

pKaNET Cloud+ is the default protonation engine in the Anyone Can Dock web application, replacing the previous Dimorphite-DL-only pipeline.

protonate_pkanet()
input ligand β†’ standardisation β†’ Dimorphite-DL enumeration β†’
pKaNET Cloud+ ranking β†’ dominant microspecies β†’ 3D generation β†’
minimisation β†’ docking-ready output

⚠️ Important Notes

  • pKaNET Cloud+ combines a calibrated heuristic pKa table with a microstate-ranking workflow; it is not a quantitative experimental pKa predictor.
  • For borderline cases where one or more predicted site pKa values fall within ±1.5 units of the target pH, the predicted charge should be treated as uncertain. Use predict_charge(mode='auto') to escalate these cases automatically to the full pipeline.
  • heuristic_net_charge() returns charge 0 for keto-form warfarin input because no OH group is detectable on the parent form; predict_charge(mode='auto') escalates to the full pipeline and returns −1.
  • Net-charge agreement does not guarantee that the exact ionised atom or tautomer is correct, especially for polyprotic or zwitterionic molecules.
  • The pKaHub-derived benchmark subset is a curated validation subset, not a redistribution of the complete raw pKaHub database.
  • In the G12 drug regression panel, Gefitinib and Imatinib are assigned as +1, whereas Erlotinib and Osimertinib are assigned as neutral. Kinase inhibitors, including the EGFR inhibitors Gefitinib, Erlotinib, and Osimertinib, should be evaluated compound by compound rather than assigned a uniform charge class.
  • The 65-case internal regression suite was run with Dimorphite-DL active. The large pKaHub-derived benchmark used the fast charge-estimation path. Using predict_charge(mode='auto') for the full benchmark may further improve agreement for borderline and polyprotic cases, at the cost of longer run time.

✅ Recommended Wording for Manuscript or ESI

The protonation-state assignment module of pKaNET Cloud+ was evaluated using an internal chemically curated regression set (65 molecules, 100% net-charge agreement at pH 7.4) and a drug-like subset derived from the pKaHub experimental pKa database (27,218 molecules, 69.34% net-charge agreement at pH 7.4; Sipos-Szabó et al.). The benchmark endpoint was dominant net-charge agreement at pH 7.4, not numerical pKa prediction accuracy. The pKaHub-derived benchmark subset was curated to retain molecules with interpretable macroscopic charge-state annotations relevant to ligand docking. The complete raw pKaHub database was not redistributed; only curated validation outputs and disagreement summaries were provided for reproducibility. Full benchmark data and extended ligand preparation methodology are provided in the Supporting Information.


🚫 Wording to Avoid

| Avoid | Use instead |
| --- | --- |
| "pKaNET pKa accuracy is 69.34%" | "pKaNET net-charge agreement at pH 7.4 is 69.34%" |
| "pKaNET predicts pKa correctly" | "pKaNET assigns the correct dominant net charge" |
| "Fully validated against experimental data" | "Benchmarked against pKaHub-derived charge-state annotations" |
| "All imidazole cases are fixed" | "The reported imidazole N-H deprotonation issue is resolved in the regression set; residual failures may remain for complex imidazole-containing molecules" |

🙏 Acknowledgements

This tool builds on:

  • RDKit: molecule standardisation, SMARTS matching, tautomer handling, formal charge assignment, ETKDG conformer generation, and MMFF/UFF geometry optimisation.
  • Dimorphite-DL: initial ionisation-state enumeration; pKaNET Cloud+ performs independent re-ranking.
  • pKaSolver: optional ML-GNN pKa backend.
  • PROPKA: optional semi-empirical pKa backend or fallback.
  • requests: optional HTTP client for PubChem lookup.
  • pKaHub: external experimental pKa reference resource used to derive the docking-relevant benchmark subset.

📖 Citation

If you use pKaNET Cloud+ in your work, please cite:

Hengphasatporn, K. et al. DFDD: A Cloud-Ready Tool for Distance-Guided Fully Dynamic Docking in Host–Guest Complexation. Journal of Chemical Information and Modeling 2026, 66, 1955–1963. DOI: 10.1021/acs.jcim.5c02852.

For the pKaHub benchmark reference dataset, cite:

Sipos-Szabó, L.; Bajusz, D.; Balogh, G. T.; Keserű, G. M. Benchmarking pKa Prediction Algorithms against an Extensive, Public Data Set. Journal of Chemical Information and Modeling 2026, 66, 4607–4619. DOI: 10.1021/acs.jcim.6c00107.


📌 Project Context

pKaNET Cloud+ is developed as part of the ligand-preparation workflow for Anyone Can Dock and related computational drug-discovery tools. The method improves docking-readiness by reducing common protonation-state errors caused by direct rule-based ionisation workflows, especially for imidazole-like motifs, flavonoids, phosphates/phosphonates, sulfonamide-like acids, zwitterions, warfarin-type enol acids, and drug-like polyprotic molecules.
