Heuristic microstate ranking · pH-adjusted SMILES · 3D structure generation · Google Colab & Streamlit
pKaNET Cloud+ determines the dominant docking-relevant protonation state of a small molecule at a user-defined pH using a tautomer-aware Henderson–Hasselbalch microstate-ranking workflow.
The workflow is designed for ligand preparation before molecular docking, molecular mechanics parameterisation, and cheminformatics dataset curation.
Main functions:
- Identifies ionisable sites using a calibrated SMARTS-based heuristic pKa table with context-aware rules for over 50 functional-group classes.
- Uses Dimorphite-DL-assisted ionisation-state enumeration, followed by pKaNET Cloud+ tautomer-aware microstate filtering and re-ranking.
- Ranks candidate microstates using a Henderson–Hasselbalch-inspired scoring function with multi-site charge-cap logic.
- Optionally queries PubChem for experimental dissociation-constant evidence when available.
- Returns the dominant microspecies as a pH-adjusted SMILES with formal charge.
- Provides a fast sub-millisecond `heuristic_net_charge()` path and a smart `predict_charge(mode='auto')` dispatcher for large-scale screening.
- Builds and minimises a 3D ligand structure using ETKDG followed by MMFF optimisation, with UFF fallback.
- Exports docking- and parameterisation-ready PDB and SDF files.
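The Henderson–Hasselbalch relation that the ranking draws on can be illustrated for a single ionisable site. The sketch below is not the pKaNET Cloud+ scoring function: `ionised_fraction` is a hypothetical helper, and the pKa of 3.5 used for the carboxylic acid of aspirin is an assumed value for illustration only.

```python
def ionised_fraction(pka: float, ph: float, acid: bool = True) -> float:
    """Fraction of one site in its charged form (Henderson-Hasselbalch).

    Hypothetical helper for illustration; not part of the pkanet API.
    """
    # Acid: HA <-> A- + H+, the charged form is A-.
    # Base: BH+ <-> B + H+, the charged form is BH+.
    exponent = (pka - ph) if acid else (ph - pka)
    return 1.0 / (1.0 + 10.0 ** exponent)


# Assumed pKa of 3.5 for an aromatic carboxylic acid at pH 7.4.
print(f"{ionised_fraction(3.5, 7.4, acid=True):.4f}")
# A site whose pKa equals the pH is half ionised.
print(ionised_fraction(7.4, 7.4))
```

A site whose pKa lies several units from the target pH is almost entirely in one form, whereas a site near the target pH is split between both forms. This is why borderline sites are the hard cases for a net-charge assignment.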
| Library | Required | Purpose |
|---|---|---|
| `rdkit` | required | SMARTS matching, molecule standardisation, tautomer handling, formal charge assignment, 3D conformer generation, and geometry optimisation |
| `dimorphite-dl` | required | Initial ionisation-state enumeration; pKaNET Cloud+ re-ranks the generated microstates using its heuristic pKa scoring model |
| `requests` | optional | PubChem experimental pKa / dissociation-constant lookup |
| `pkasolver` | optional | Optional ML-GNN pKa backend when available |
| `propka` | optional | Optional semi-empirical pKa backend or fallback |

`py3Dmol` is used only in the accompanying Colab notebook or visualisation interface. It is not required by the core `pKaNET.py` engine.
# Minimal (RDKit + requests only; heuristic mode)
pip install pkanet-cloud
# Recommended (adds Dimorphite-DL and pandas)
pip install "pkanet-cloud[recommended]"
# Full (adds propka and Streamlit web UI)
pip install "pkanet-cloud[all]"

RDKit note: `rdkit` is listed as a dependency, but PyPI's `rdkit` wheel requires Python ≥ 3.9 on 64-bit platforms. If your environment uses a conda-managed RDKit, install without dependencies and then add the remaining packages manually:

pip install pkanet-cloud --no-deps
pip install requests dimorphite-dl pandas
After installation a pkanet command is available globally.
# Single SMILES string
pkanet --smiles "CC(=O)OC1=CC=CC=C1C(=O)O" --ph 7.4
# Compound name, resolved automatically via PubChem
pkanet --name "baicalein" --ph 7.4 --out-dir ./results
# SMILES file (one SMILES [name] per line); also accepted via --file
pkanet --smi-file ligands.smi --formats PDB MOL2 --top-n 3
# Ligand structure file (.pdb / .mol2 / .sdf)
pkanet --file ligand.pdb --no-pubchem --keep-stereo

For large libraries where 3D generation is not needed, `--fast` uses the sub-millisecond `heuristic_net_charge()` path and skips tautomer enumeration and Dimorphite-DL entirely.
# Single molecule
pkanet --smiles "NCC(=O)O" --ph 7.4 --fast
# Batch: prints TSV to stdout (name / SMILES / charge / mode)
pkanet --smi-file library.smi --fast --quiet
# .smi file passed via --file is auto-detected and works identically
pkanet --file library.smi --fast --quiet

| Flag | Default | Description |
|---|---|---|
| `--ph` | 7.4 | Target pH |
| `--out-dir` | ./pkanet_out | Output directory |
| `--out-name` | ligand | Base name for output files |
| `--formats PDB MOL2 SDF` | PDB | 3D output format(s) |
| `--ph-window` | 1.0 | Dimorphite-DL enumeration window (±window/2) |
| `--max-tautomers` | 8 | Maximum tautomers to enumerate |
| `--top-n` | 5 | Top N microstates to rank |
| `--top-k-3d` | 3 | Write 3D structures for top-k microstates |
| `--keep-stereo` | off | Skip R/S stereoisomer enumeration |
| `--no-pubchem` | off | Disable PubChem experimental pKa lookup |
| `--fast` | off | Heuristic charge only: no 3D, no tautomers |
| `--json-out FILE` | (none) | Write full results as JSON |
| `--quiet` | off | TSV one-liner output, no banner |
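The TSV emitted by `--fast --quiet` can be post-processed with the standard library. The sketch below assumes four tab-separated columns in the order given above (name, SMILES, charge, mode) and no guaranteed header row. `parse_fast_tsv` is a hypothetical helper, and the sample rows are hand-written placeholders rather than captured tool output.

```python
import csv
import io
from collections import Counter


def parse_fast_tsv(text):
    """Parse rows of name, SMILES, charge, mode into dictionaries."""
    rows = []
    for fields in csv.reader(io.StringIO(text), delimiter="\t"):
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        name, smiles, charge, mode = fields
        try:
            charge_value = int(charge)
        except ValueError:
            continue  # e.g. a header row, if the tool prints one
        rows.append({"name": name, "smiles": smiles,
                     "charge": charge_value, "mode": mode})
    return rows


# Placeholder rows for illustration only (not real pkanet output).
sample = "mol_a\tCC(=O)O\t-1\tfast\nmol_b\tCN\t1\tfast\n"
records = parse_fast_tsv(sample)
print(Counter(r["charge"] for r in records))
```

In practice the text would come from a redirected run such as `pkanet --smi-file library.smi --fast --quiet > charges.tsv`.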
╔══════════════════════════════════════════════════════════╗
║                  pKaNET Cloud+ – CLI                     ║
╚══════════════════════════════════════════════════════════╝
pKa backend : heuristic (SMARTS table)
Target pH : 7.4
──────────────────────────────────────────────────────────────
aspirin – Rank-1 Microstate
──────────────────────────────────────────────────────────────
Score : -0.312
SMILES : CC(=O)Oc1ccccc1C(=O)[O-]
Charge @ pH 7.4 : -1
Zwitterion (strict) : NO
Amide preserved : NO
pKa source : heuristic
PubChem CID : 2244
Ranked microstates (top 5)
Rank  Score     Charge  Rec  SMILES
────  ────────  ──────  ───  ────────────────────────────────────────
1     -0.312    -1      ✓    CC(=O)Oc1ccccc1C(=O)[O-]
2     -1.562    0            CC(=O)Oc1ccccc1C(=O)O
Three public functions are available for programmatic and large-scale use:
from pkanet import predict_charge, heuristic_net_charge, batch_predict_charges
# Sub-millisecond heuristic estimate with multi-site charge caps
charge = heuristic_net_charge("CC(=O)O", ph=7.4) # → -1
# Smart dispatcher: fast normally, full pipeline for borderline pKa or tautomeric risk
charge, mode = predict_charge("CC(=O)O", ph=7.4, mode="auto")
# Batch prediction β returns a pandas DataFrame
df = batch_predict_charges(["CC(=O)O", "CN", "NCC(=O)O"], ph=7.4, mode="auto")

`predict_charge(mode='auto')` automatically escalates to the full tautomer + Dimorphite-DL + scoring pipeline when:
- any detected site pKa is within 1.5 pH units of the target, or
- the molecule has ring systems but no detectable ionisable sites on the parent form, indicating tautomeric enol risk, such as warfarin supplied as the keto form.
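The two escalation conditions above can be written as a small predicate. This is a sketch under stated assumptions, not the dispatcher inside `predict_charge`: `needs_full_pipeline` and its arguments are hypothetical names, the site pKa values are assumed to be already available as a list, and the 1.5-unit boundary is treated as inclusive.

```python
def needs_full_pipeline(site_pkas, ph, has_rings, window=1.5):
    """Return True when the fast heuristic path should be escalated.

    Hypothetical helper mirroring the two documented conditions.
    """
    # Condition 1: any site pKa within `window` pH units of the target.
    borderline = any(abs(pka - ph) <= window for pka in site_pkas)
    # Condition 2: ring system present but no ionisable site detected,
    # which may hide an enol tautomer (the keto-form warfarin case).
    hidden_enol_risk = has_rings and not site_pkas
    return borderline or hidden_enol_risk


print(needs_full_pipeline([4.8], ph=7.4, has_rings=False))  # far from pH
print(needs_full_pipeline([6.9], ph=7.4, has_rings=True))   # borderline
print(needs_full_pipeline([], ph=7.4, has_rings=True))      # enol risk
```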
The heuristic_net_charge() function applies two charge-cap rules that suppress systematic over-charging in the fast path: a polyamine cap and a multi-acid cap.
from pkanet.core import run_job
result = run_job(
input_type = "SMILES",
smiles_text = "CC(=O)OC1=CC=CC=C1C(=O)O",
uploaded_bytes = None,
uploaded_name = None,
target_pH = 7.4,
output_name = "aspirin",
out_dir = "./output",
output_formats = ["PDB"],
use_pubchem = True,
)
top = result["results"][0]
print(top["selected_microstate_smiles"]) # CC(=O)Oc1ccccc1C(=O)[O-]
print(top["formal_charge"]) # -1
print(top["minimized_pdb"]) # ./output/aspirin_micro1_min.pdb

The internal regression suite contains 65 chemically curated test cases across 12 functional-group classes.
python3 test_pkanet.py pKaNET.py # full suite
python3 test_pkanet.py pKaNET.py G8 # flavonoid group only
python3 test_pkanet.py pKaNET.py G12 # drug regression panel only

65 / 65 PASS (100%)
| Group | Description | Pass | Fail | Status |
|---|---|---|---|---|
| G1 | Imidazole-type N-H | 10 | 0 | ✅ |
| G2 | Phosphonate / phosphate | 7 | 0 | ✅ |
| G3 | Thiol ArSH / AlkSH | 5 | 0 | ✅ |
| G4 | Carboxylic acid | 5 | 0 | ✅ |
| G5 | Phenol variants, including warfarin enol acid | 6 | 0 | ✅ |
| G6 | Amine bases | 5 | 0 | ✅ |
| G7 | Sulfonamide / saccharin | 4 | 0 | ✅ |
| G8 | Flavonoid regression (must not change) | 4 | 0 | ✅ |
| G9 | Zwitterion / multi-site | 5 | 0 | ✅ |
| G10 | PubChem pKa guard | 3 | 0 | ✅ |
| G11 | Truly neutral | 4 | 0 | ✅ |
| G12 | Drug regression panel | 7 | 0 | ✅ |
| Total | | 65 | 0 | ✅ |
pKaNET Cloud+ was benchmarked against a docking-relevant subset derived from pKaHub, an experimental aqueous pKa database with macroscopic charge-state transition annotations.
The benchmark endpoint is net-charge agreement at pH 7.4. This checks whether pKaNET Cloud+ predicts the same dominant net formal charge as the pKaHub-derived reference annotation at physiological pH.
This benchmark is not a numerical pKa prediction benchmark. The reported agreement values must not be interpreted as pKa MAE, RMSE, or quantitative pKa accuracy.
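The endpoint reduces to an exact-match rate over integer charges. The sketch below is a minimal illustration, assuming predicted and reference charges are available as equal-length lists; `net_charge_agreement` is a hypothetical helper and the four values are toy data, not benchmark entries.

```python
def net_charge_agreement(predicted, reference):
    """Fraction of molecules whose predicted net charge equals the reference."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        raise ValueError("at least one molecule is required")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Toy data: three of four charges agree.
predicted = [-1, 0, 1, 0]
reference = [-1, 0, 1, 1]
print(f"{net_charge_agreement(predicted, reference):.2%}")  # → 75.00%
```

A molecule counts as correct only when the integer net charge matches exactly. The size of an underlying pKa error is not measured, which is why the result cannot be read as pKa accuracy.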
| Property | Threshold |
|---|---|
| Molecular weight | 100–600 Da |
| H-bond donors | ≤ 7 |
| H-bond acceptors | ≤ 12 |
| logP | −3 to +6.5 |
| TPSA | ≤ 180 Å² |
| Rotatable bonds | ≤ 15 |
| Heavy atoms | ≥ 7 |
From 38,724 unique SMILES, 35,872 passed these criteria. The final validation subset contained 27,218 molecules prioritised for interpretable charge-state annotation and experimental pKa availability.
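The property thresholds in the table can be expressed as a single predicate. The sketch below assumes the descriptors have already been computed (for example with RDKit) and that the range limits are inclusive. `Descriptors` and `passes_docking_filter` are hypothetical names, and the aspirin values are approximate, hand-entered figures for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    mol_weight: float      # Da
    h_donors: int
    h_acceptors: int
    logp: float
    tpsa: float            # square angstroms
    rotatable_bonds: int
    heavy_atoms: int


def passes_docking_filter(d: Descriptors) -> bool:
    """Apply the docking-relevance thresholds from the table (inclusive)."""
    return (
        100.0 <= d.mol_weight <= 600.0
        and d.h_donors <= 7
        and d.h_acceptors <= 12
        and -3.0 <= d.logp <= 6.5
        and d.tpsa <= 180.0
        and d.rotatable_bonds <= 15
        and d.heavy_atoms >= 7
    )


# Approximate descriptor values for aspirin, entered by hand.
aspirin = Descriptors(180.16, 1, 4, 1.2, 63.6, 3, 13)
print(passes_docking_filter(aspirin))  # → True
```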
| Dataset | Correct | Total | Agreement rate |
|---|---|---|---|
| 65-case curated regression set | 65 | 65 | 100.00% |
| pKaHub-derived validation subset | 18,850 | 27,183 | 69.34% |
| Expected charge | Count | Correct | Agreement |
|---|---|---|---|
| −4 | 7 | 1 | 14.3% |
| −3 | 84 | 14 | 16.7% |
| −2 | 694 | 310 | 44.7% |
| −1 | 7,495 | 4,321 | 57.7% |
| 0 | 12,146 | 9,240 | 76.1% |
| +1 | 6,386 | 4,923 | 77.1% |
| +2 | 309 | 41 | 13.3% |
| +3 or above | 62 | 0 | 0.0% |
For monoprotic drug-like molecules, which represent the majority of practical lead-optimisation cases, pKaNET Cloud+ assigns the same dominant net charge as the pKaHub-derived reference annotation for approximately three out of four compounds. Agreement is highest for neutral molecules (76.1%) and monocations (77.1%), and lower for polyprotic and zwitterionic molecules, as expected.
The 8,333 disagreements (30.66%) are concentrated in molecules where at least one predicted ionisable site has a heuristic pKa within ±1.5 units of pH 7.4 and in strongly polyprotic species with charge magnitude ≥ 2, where multi-site pKa ordering is not recoverable from the heuristic table alone. The dominant failure modes are single-step over-prediction of basicity and single-step over-prediction of acidity.
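The overall figures follow directly from the per-charge table. The short check below re-derives them from the reported counts; the dictionary simply transcribes the table, and the variable names are illustrative.

```python
# (count, correct) per expected charge class, transcribed from the table.
per_charge = {
    "-4": (7, 1),
    "-3": (84, 14),
    "-2": (694, 310),
    "-1": (7495, 4321),
    "0": (12146, 9240),
    "+1": (6386, 4923),
    "+2": (309, 41),
    "+3 or above": (62, 0),
}

total = sum(count for count, _ in per_charge.values())
correct = sum(ok for _, ok in per_charge.values())
wrong = total - correct

print(total, correct, wrong)            # → 27183 18850 8333
print(f"{100 * correct / total:.2f}%")  # → 69.34%
print(f"{100 * wrong / total:.2f}%")    # → 30.66%
```

Neutral molecules and monocations together account for 18,532 of the 27,183 scored molecules, so the overall rate is dominated by those two classes.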
| Priority | Backend | Install |
|---|---|---|
| 1 | pkasolver (GNN) | pip install pkasolver |
| 2 | propka (semi-empirical) | pip install "pkanet-cloud[propka]" |
| 3 | unipka CLI | system install |
| 4 | heuristic SMARTS table | built-in, always available |
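A priority-ordered fallback of this kind can be probed without importing the heavy packages. The sketch below is an assumption about how such a selection could be written, not the selection code in pKaNET Cloud+: `pick_pka_backend` is a hypothetical name, the import names `pkasolver` and `propka` are assumed to match the package names, and the executable name `unipka` is a guess.

```python
import importlib.util
import shutil


def pick_pka_backend() -> str:
    """Return the first available backend in the documented priority order."""
    # Priority 1 and 2: optional Python packages, detected without importing.
    for module_name in ("pkasolver", "propka"):
        if importlib.util.find_spec(module_name) is not None:
            return module_name
    # Priority 3: a system-installed command-line tool (name assumed).
    if shutil.which("unipka") is not None:
        return "unipka"
    # Priority 4: the built-in SMARTS table is always available.
    return "heuristic"


print(pick_pka_backend())
```

Probing with `find_spec` avoids paying the import cost of a GNN backend when only the availability check is needed.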
| Format | Extension |
|---|---|
| SMILES | .smi or plain text |
| MDL Molfile | .mol |
| Structure-data file | .sdf |
| Tripos Mol2 | .mol2 |
| Protein Data Bank ligand | .pdb |
| Output | Description |
|---|---|
| pH-adjusted SMILES | Dominant predicted microspecies at the target pH |
| Net formal charge | Integer formal charge of the selected microspecies |
| `minimized_ligand.pdb` | 3D ligand structure after geometry minimisation |
| `minimized_ligand.sdf` | 3D ligand structure with explicit hydrogens and formal charge |
| Preparation log | Site-level protonation decisions, pKa evidence, and ranking information |
| File | Description |
|---|---|
| `pKaNET_pKahub_docking_relevant_subset_validation.csv` | Curated validation subset with pKaHub-derived reference charge labels and pKaNET predictions |
| `pKaNET_pKahub_docking_relevant_failed_cases.csv` | Disagreement cases for manual review and future rule refinement |
| `curated_regression_set.csv` | Internal 65-compound chemically curated regression set |
| `validation_pass.csv` | Molecules with correct net-charge assignment |
| `validation_fail.csv` | Disagreement cases |
| `validation_summary_template.csv` | Template for recording new validation outputs |
| `failed_cases_review_template.csv` | Template for manually classifying disagreement cases |
- Ligand preparation before molecular docking with AutoDock Vina, VinaXB, GNINA, Glide, GOLD, or rDock.
- GAFF2, CGenFF, or other force-field parameterisation workflows.
- QSAR, ADMET, and virtual-screening dataset curation.
- Teaching pKa, protonation state, microspecies, and docking-preparation concepts.
pKaNET Cloud+ is the default protonation engine in the Anyone Can Dock web application, replacing the previous Dimorphite-DL-only pipeline.
`protonate_pkanet()`

input ligand → standardisation → Dimorphite-DL enumeration → pKaNET Cloud+ ranking → dominant microspecies → 3D generation → minimisation → docking-ready output
- pKaNET Cloud+ is a calibrated heuristic pKa table combined with a microstate-ranking workflow; it is not a quantitative experimental pKa predictor.
- For borderline cases where one or more predicted site pKa values fall within ±1.5 units of the target pH, the predicted charge should be treated as uncertain. Use `predict_charge(mode='auto')` to escalate these cases automatically to the full pipeline.
- `heuristic_net_charge()` returns charge 0 for keto-form warfarin input because no OH group is detectable on the parent form; `predict_charge(mode='auto')` escalates to the full pipeline and returns −1.
- Net-charge agreement does not guarantee that the exact ionised atom or tautomer is correct, especially for polyprotic or zwitterionic molecules.
- The pKaHub-derived benchmark subset is a curated validation subset, not a redistribution of the complete raw pKaHub database.
- In the G12 drug regression panel, Gefitinib and Imatinib are assigned as +1, whereas Erlotinib and Osimertinib are assigned as neutral. EGFR inhibitors should be evaluated compound by compound rather than assigned a uniform charge class.
- The 65-case internal regression suite was run with Dimorphite-DL active. The large pKaHub-derived benchmark used the fast charge-estimation path. Using `predict_charge(mode='auto')` for the full benchmark may further improve agreement for borderline and polyprotic cases, at the cost of longer run time.
The protonation-state assignment module of pKaNET Cloud+ was evaluated using an internal chemically curated regression set (65 molecules, 100% net-charge agreement at pH 7.4) and a drug-like subset derived from the pKaHub experimental pKa database (27,218 molecules, 69.34% net-charge agreement at pH 7.4; Sipos-Szabó et al.). The benchmark endpoint was dominant net-charge agreement at pH 7.4, not numerical pKa prediction accuracy. The pKaHub-derived benchmark subset was curated to retain molecules with interpretable macroscopic charge-state annotations relevant to ligand docking. The complete raw pKaHub database was not redistributed; only curated validation outputs and disagreement summaries were provided for reproducibility. Full benchmark data and extended ligand preparation methodology are provided in the Supporting Information.
| Avoid | Use instead |
|---|---|
| "pKaNET pKa accuracy is 69.34%" | "pKaNET net-charge agreement at pH 7.4 is 69.34%" |
| "pKaNET predicts pKa correctly" | "pKaNET assigns the correct dominant net charge" |
| "Fully validated against experimental data" | "Benchmarked against pKaHub-derived charge-state annotations" |
| "All imidazole cases are fixed" | "The reported imidazole N-H deprotonation issue is resolved in the regression set; residual failures may remain for complex imidazole-containing molecules" |
This tool builds on:
- RDKit: molecule standardisation, SMARTS matching, tautomer handling, formal charge assignment, ETKDG conformer generation, and MMFF/UFF geometry optimisation.
- Dimorphite-DL: initial ionisation-state enumeration; pKaNET Cloud+ performs independent re-ranking.
- pKaSolver: optional ML-GNN pKa backend.
- PROPKA: optional semi-empirical pKa backend or fallback.
- requests: optional HTTP client for PubChem lookup.
- pKaHub: external experimental pKa reference resource used to derive the docking-relevant benchmark subset.
If you use pKaNET Cloud+ in your work, please cite:
Hengphasatporn, K. et al. DFDD: A Cloud-Ready Tool for Distance-Guided Fully Dynamic Docking in Host–Guest Complexation. Journal of Chemical Information and Modeling 2026, 66, 1955–1963. DOI: 10.1021/acs.jcim.5c02852.
For the pKaHub benchmark reference dataset, cite:
Sipos-Szabó, L.; Bajusz, D.; Balogh, G. T.; Keserű, G. M. Benchmarking pKa Prediction Algorithms against an Extensive, Public Data Set. Journal of Chemical Information and Modeling 2026, 66, 4607–4619. DOI: 10.1021/acs.jcim.6c00107.
pKaNET Cloud+ is developed as part of the ligand-preparation workflow for Anyone Can Dock and related computational drug-discovery tools. The method improves docking-readiness by reducing common protonation-state errors caused by direct rule-based ionisation workflows, especially for imidazole-like motifs, flavonoids, phosphates/phosphonates, sulfonamide-like acids, zwitterions, warfarin-type enol acids, and drug-like polyprotic molecules.
