jam-sudo
diff --git a/CLAUDE.md b/CLAUDE.md
Lines changed: 31 additions & 19 deletions
diff --git a/data/clinical/platinum_reference.json b/data/clinical/platinum_reference.json
Lines changed: 26 additions & 18 deletions
@@ -22,22 +22,23 @@ The current PBPK module is the foundation layer: SMILES in → PK profile + unce
 SMILES → EnsembleADMEPredictor (XGBoost CLint/fup/rbp/VDss + polynomial)
        → _build_drug (IVIVE scaling + Berezhkovskiy Kp + renal CL + P-gp correction)
        → 35-state ODE simulation → raw C(t) curve
-       → Hybrid Cmax selector (adaptive-weight blend of ODE + analytical 1-cpt)
        → PBPK/ML ensemble (DirectCmaxPredictor, confidence-weighted)
        → Conformal UQ (90% prediction intervals)
        → SimulationResult
 ```
 
-### Honest Performance (2026-03-19, scaffold-stratified holdout)
-- **In-sample (24 drugs):** Cmax AAFE 1.646 [95% CI: 1.44, 1.92], 88% 2-fold — contaminated by CLint anchors
-- **Holdout ALL (71 drugs):** Cmax AAFE 2.897 [2.23, 3.92], 56% 2-fold — permanent scaffold-split
-- **Holdout IN-DOMAIN (53 drugs):** Cmax AAFE **1.847** [1.65, 2.09], 64% 2-fold, **89% 3-fold**
-- **Out-of-domain (17 drugs):** prodrugs, extreme lipophilic, P-gp efflux, DDI-boosted references
+### Honest Performance (2026-03-22, scaffold-stratified holdout)
+- **Core-24 (clinical ref, selector OFF):** Cmax AAFE **1.977** [1.65, 2.43], 58% 2-fold
+- **Holdout ALL (100 drugs):** Cmax AAFE **2.520** [2.14, 3.00], 49% 2-fold, 72% 3-fold
+- **Holdout IN-DOMAIN (79 drugs):** Cmax AAFE **1.987** [1.79, 2.22], 57% 2-fold, **82% 3-fold**
+- **Out-of-domain (21 drugs):** prodrugs, extreme lipophilic, P-gp efflux, DDI-boosted, HIGH_MW
+- **Spearman ρ = 0.9379** (in-domain ranking correlation) — excellent for screening
 - **CLint anchors NOT inflating:** ANCHORED 1.813 vs CLEAN 1.736 (delta +0.078)
 - **Data leakage:** 36/107 (34%) gold tier drugs overlap with ADME training set
 - **Error cancellation confirmed:** predicted ADME beats measured ADME — structural, not CLint-specific
-- **Data quality >> model improvements:** 14 reference fixes = AAFE 3.520→1.847 (-47%), zero model changes
+- **Data quality >> model improvements:** 19 reference fixes = AAFE 3.520→1.987 (-44%), zero model changes
 - **Benchmark CSVs are synthetic** (1-cpt generated, not real clinical C(t) data)
+- **UQ intervals over-wide:** Cmax CI 97% coverage but median width 4880x; AUC/t½ CI broken
 
 ## Workflow
 
@@ -59,18 +60,18 @@ Details + cross-review protocol: `.claude/commands/team.md`
 
 1. **ADMET-AI disabled in production** — fup/logP changes break warfarin/metformin/losartan via Kp/Vd
 2. **XGBoost CLint is primary** — 18 reference anchors at 50x weight (semi-supervised, partially circular)
-3. **Hybrid Cmax selector is essential** — ablation: +0.278 AAFE, 88% → 58% 2-fold (biggest single component)
+3. **~~REVOKED~~ Hybrid Cmax selector is DISABLED** — ablation on synthetic CSV showed +0.278 AAFE, but holdout 100 drugs shows Δ-0.284 (HARMFUL). Selector was overfitted to N=24 synthetic benchmark. `use_hybrid_selector=False` since 2026-03-22.
 4. **Don't replace ODE with pure ML** — v1-v5 GNN all failed (distillation ceiling)
 5. **Don't touch phase files** — build ML alongside, deprecate later
 6. **PK-DB + FDA labels** for clinical data; ChEMBL deprioritized
-7. **Analytical 1-cpt > raw ODE for Cmax** (67% vs 38% 2-fold) — but hybrid of both is best (79%)
+7. **~~REVISED~~ ODE >> Analytical 1-cpt for Cmax** — Holdout verification (2026-03-22): ODE AAFE 2.41 vs Analytical 11.64, ODE wins 81% of drugs. Original "analytical > ODE" (67% vs 38% 2-fold) was on synthetic CSV = overfitting. Pure ODE is default and correct.
 8. **All benchmark CSVs are synthetic** — generated from 1-cpt model, not real clinical data. Warfarin replaced with PK-DB data (2026-03-17)
 9. **External validation AAFE ~3.0** — in-sample 1.72 reflects tuning, not generalization
 10. **Error cancellation is real** — predicted ADME beats measured ADME (2.46 vs 2.69). Fix ODE structure BEFORE improving ADME.
 11. **Ridge correction is dead code** — not loaded in pipeline, zero contribution confirmed by ablation
 12. **Gut CLint drives Cmax, not hepatic CLint** — Sobol: gut CLint ST=0.470, hepatic CLint ST=0.000 for Cmax. Hepatic CLint only affects AUC.
 13. **Ridge correction is confirmed dead code** — ablation study (Phase 0.1) shows NO_RIDGE = FULL with Δ=0.000 AAFE. The ridge model file exists in models/correction/ but is never loaded at inference. Keep for reproducibility only.
-14. **Hybrid Cmax selector is the dominant component** — ablation Δ+0.278 AAFE without it (88%→58% 2-fold) with Phase 3a.1 + acid-Kp pipeline. Don't remove or simplify it.
+14. **~~REVOKED~~ Hybrid Cmax selector was overfitted** — The Δ+0.278 ablation was measured on synthetic CSV benchmark (N=24). On holdout 100 drugs: selector WORSENS AAFE by 0.284. Selector LOO-CV tuned on N=24 synthetic data = classic overfitting. See #3.
 15. **Error cancellation is systematic** — 79% of drugs (CI < 0.5), mean CI = 0.303. Fixing individual ADME params without joint balance will worsen aggregate AAFE.
 16. **Phase 3a blocker is fm_CYP3A4 false positives, NOT Fh** — Polynomial clint_3a4 assigns fm_CYP3A4=0.887 to propranolol, 0.939 to ibuprofen. Fix: threshold guard `clint_3a4 > 2.0 µL/min/pmol` → AAFE 1.747, 83% 2-fold (diagnostic: 2026-03-18). Combined with acid-Kp D-fix → AAFE 1.665, 88% 2-fold.
 17. **CLint_gut formula uses pre-inverted CLint** — `clint_L_per_h` is 22-223x larger than `CLh_target`; the 1.7× factor was calibrated for CLh_target. This is a known architectural bug, not a Phase 3a blocker. Fix in Phase 3b.
@@ -83,11 +84,21 @@ Details + cross-review protocol: `.claude/commands/team.md`
 20. **OATP correction disabled — wrong direction for atorvastatin** — Atorvastatin AUC is UNDER-predicted (fe=3.64×; pred=0.048 vs obs=0.176 mg*h/L), meaning CLint is already over-predicted. OATP adds more clearance → makes AUC worse. CLint already >>QH (near-complete extraction), so any CLint addition has minimal but harmful effect. Root cause: CLint over-prediction, not missing uptake transporter. Code archived in pipeline with `_ENABLE_OATP_CORRECTION = False`.
 25. **CLint anchors do NOT inflate gold-24 metrics** — Anchor contamination analysis: ANCHORED AAFE 1.813 vs CLEAN 1.736 (delta +0.078). Error cancellation is structural (pipeline architecture), not CLint-specific.
 26. **MLP cannot beat XGBoost at 1K-4K drug scale** — UDE Phase 1/2 (134K params MLP) achieved holdout AAFE 3.46-3.50 vs pipeline 3.52. Early stopping at epoch 5-8 = underfitting. Multi-dose data expansion (3.3x) WORSENED results due to noise. XGBoost remains superior.
-27. **Data quality >> model improvements** — 14 platinum reference fixes + AD filter achieved AAFE 3.520→1.847 (-47.5%) on holdout with ZERO model changes. This is the single highest-ROI intervention.
-28. **Applicability domain filter in pipeline** — `SimulationResult.in_applicability_domain` + `ad_flags`. SMARTS: val-ester, thienopyridine, pivoxil, quaternary amine, inorganic. Thresholds: logP>5.5, MW>700, P-gp efflux risk (MW>500+logP>3.5+TPSA>100).
-29. **Permanent scaffold-stratified holdout** — 76 train / 71 holdout from platinum 147 (Murcko generic, seed=42). `data/clinical/holdout_split.json`. 30 MMPK drugs SMILES-match holdout → excluded from UDE training.
+27. **Data quality >> model improvements** — 19 platinum reference fixes + AD filter achieved AAFE 3.520→1.987 (-44%) on holdout in-domain with ZERO model changes. Single highest-ROI intervention.
+28. **Applicability domain filter in pipeline** — `SimulationResult.in_applicability_domain` + `ad_flags`. SMARTS: val-ester, thienopyridine, pivoxil, nucleoside 5'-ester, quaternary amine, inorganic. Thresholds: logP>5.5, MW>700, P-gp efflux risk (MW>500+logP>3.5+TPSA>100). DDI-boosted flag in platinum reference.
+29. **Permanent scaffold-stratified holdout** — 76 train / 100 holdout from platinum 176 (Murcko generic, seed=42). `data/clinical/holdout_split.json`. 30 MMPK drugs SMILES-match holdout → excluded from UDE training.
 30. **torchdiffeq ODE is impractical for training** — 13-state PBPK ODE: 11s/drug forward+backward. 60 epochs = 160 hours. Need surrogate ODE approach for differentiable training.
 31. **Pipeline structural gaps** — flutamide (CYP1A2 172x), buspirone (F=4% 44x), pantoprazole (enteric coating 5x). These are genuine mechanistic limitations, not data errors.
+32. **All benchmarks must use clinical reference only** — Synthetic 1-cpt CSV benchmark deprecated (2026-03-22). Core-24 AAFE with clinical ref = 1.977 (was 1.502 on synthetic). Synthetic CSV inflated accuracy by ~0.5 AAFE.
+33. **Optuna E2E constants do not generalize** — 5 Optuna-tuned constants (gut_threshold=0.97, peff_min=0.76, pgp=0.34, gse=1.11, ka_scale) hurt holdout by +0.091 AAFE. MMPK optimization doesn't transfer to platinum holdout.
+34. **Overfitting has 3 layers** — (1) synthetic CSV benchmark, (2) hybrid selector LOO-CV on N=24, (3) Optuna E2E on MMPK. Removing all 3: holdout 3.064→2.690→2.520.
+35. **Spearman ρ = 0.94 (in-domain)** — Pipeline ranking is excellent despite AAFE ~2.0. Screening applications viable. Kendall τ = 0.80. Binary high/low accuracy = 88%.
+36. **UQ recalibrated (2026-03-22)** — Cmax: 93.7% coverage (in-domain), median width 20.6x (was 4880x). AUC/t½: heuristic scaling from Cmax q-value (q×1.35 for AUC, q×1.0 for t½). AdaptiveConformal recalibrated on 68 clean drugs, k=30.
+38. **CYP3A4 ML classifier deferred** — TDC 670 compounds, test AUROC 0.634 → too low. Multi-CYP normalization partially works but unreliable. Zero holdout drugs trigger gut wall fix → no holdout improvement possible. Model saved for future use.
+39. **ODE >> Analytical for Cmax on clinical data** — Holdout: ODE AAFE 2.41 vs Analytical 11.64, ODE wins 81%. KD#7 "analytical > ODE" was synthetic CSV artifact.
+37. **AD filter catches prodrugs + DDI-boosted** — SMARTS: val-ester, thienopyridine, pivoxil phosphonate, nucleoside 5'-ester. Flags: PRODRUG, DDI_BOOSTED, EXTREME_LIPOPHILIC, HIGH_MW, PGP_EFFLUX_RISK, QUATERNARY_AMINE, INORGANIC. 21/100 holdout excluded.
+40. **AUC AAFE 3.2 on holdout (32 drugs)** — 2x worse than Cmax (1.7). AUC Spearman ρ=0.77 (vs Cmax 0.94). Root cause: VDss over-prediction + CLint error compounding through ODE. AUC improvement requires better CL prediction, not Cmax tuning.
+41. **VDss systematically over-predicted** — Lombardo cross-validation (17 drugs): VDss AAFE 3.71, Spearman ρ=0.27. XGBoost VDss AAFE 1.31 (94% 2-fold) vs Berezhkovskiy 4.11. Fix: weighted geometric mean (XGB^0.7 × Berez^0.3) for t½ → Core-24 AUC improved 2.344→2.142 (-8.6%). Cmax unchanged (ODE Kp preserved).
 
 ## Codebase Rules
 
@@ -135,13 +146,14 @@ Details + cross-review protocol: `.claude/commands/team.md`
 
 | Level | Criteria | Status |
 |-------|---------|--------|
-| **1** | `omega predict <SMILES>` → PK profile. ADME AAFE<3.0. PK ≤2-fold for ≥70% of 20+ drugs | **PASS** (in-sample 1.665, 88%) |
-| **2** | SMILES→PK <500ms. AAFE<2.0 | **PASS** (73ms, 1.665 in-sample) |
+| **1** | `omega predict <SMILES>` → PK profile. ADME AAFE<3.0. PK ≤2-fold for ≥70% of 20+ drugs | **PASS** (core-24: 1.977) |
+| **2** | SMILES→PK <500ms. AAFE<2.0 | **PASS** (73ms, core-24 1.977 in-sample) |
 | **3** | Patient covariates. Few-shot (<5 obs) | **Prototype** (allometric + Bayesian) |
-| **4** | Batch screening 1000+ molecules with UQ | **Done** (batch_predict + conformal CI) |
-| **Ext** | External validation AAFE<2.5 on unseen drugs | **PASS** (holdout in-domain 53 drugs: AAFE 1.847) |
-| **v7** | Bootstrap CI, ER-stratified, N=50+ holdout, scaffold-split | **PASS** (71-drug holdout, scaffold-stratified, bootstrap CI) |
-| **v8** | Holdout in-domain AAFE<1.7, %2-fold>70% | **NOT MET** (1.847, 64.2%) |
+| **4** | Batch screening 1000+ molecules with UQ | **Done** (batch_predict + conformal CI, but UQ intervals over-wide) |
+| **Ext** | External validation AAFE<2.5 on unseen drugs | **PASS** (holdout in-domain 79 drugs: AAFE 1.987) |
+| **v7** | Bootstrap CI, ER-stratified, N=50+ holdout, scaffold-split | **PASS** (100-drug holdout, scaffold-stratified, bootstrap CI) |
+| **v8** | Holdout in-domain AAFE<1.7, %2-fold>70% | **NOT MET** (1.987, 57.0%) |
+| **v9** | Spearman ρ>0.90, UQ coverage 85-95% | **PARTIAL** (ρ=0.94 PASS, UQ 97% over-covered) |
 
 ## Tech Stack
 
 
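Reviewer sketch (not part of the commit): the CLAUDE.md changes quote AAFE, %2-fold, and two geometric-mean calculations without defining them. The snippet below assumes the conventional definitions, AAFE = 10^(mean |log10(pred/obs)|) and "n-fold" meaning max(pred/obs, obs/pred) ≤ n. The function names `aafe`, `fraction_within_fold`, and `weighted_geomean` are hypothetical and do not come from the repository; the pipeline's own metric code may differ.

```python
import math


def aafe(pred, obs):
    """Absolute average fold error: 10 ** mean(|log10(pred / obs)|)."""
    logs = [abs(math.log10(p / o)) for p, o in zip(pred, obs)]
    return 10 ** (sum(logs) / len(logs))


def fraction_within_fold(pred, obs, fold=2.0):
    """Share of predictions whose fold error is at most `fold`."""
    hits = [max(p / o, o / p) <= fold for p, o in zip(pred, obs)]
    return sum(hits) / len(hits)


def weighted_geomean(a, b, weight_a):
    """Weighted geometric mean a**w * b**(1 - w), for 0 <= w <= 1."""
    return (a ** weight_a) * (b ** (1.0 - weight_a))


if __name__ == "__main__":
    # Illustrative values only, not pipeline output.
    pred = [2.0, 0.5, 4.0]
    obs = [1.0, 1.0, 1.0]
    print(f"AAFE = {aafe(pred, obs):.3f}")
    print(f"2-fold fraction = {fraction_within_fold(pred, obs):.2f}")

    # Pindolol note: geometric mean of the 45-167 ng/mL label range.
    print(f"pindolol Cmax = {weighted_geomean(45.0, 167.0, 0.5):.1f} ng/mL")

    # VDss blend form from KD#41 (XGB^0.7 x Berez^0.3); inputs are made up.
    print(f"blended VDss = {weighted_geomean(1.2, 3.5, 0.7):.2f} L/kg")
```

With equal weights, `weighted_geomean` reduces to sqrt(a·b), which is the calculation cited in the pindolol note; with `weight_a=0.7` it has the same form as the XGBoost/Berezhkovskiy VDss blend in KD#41.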
@@ -760,7 +760,8 @@
       "tuning_contaminated": false,
       "nonlinear_pk": true,
       "data_quality": "fda_label_exact",
-      "notes": "WARNING: Cmax 7.32 mg/L likely includes ritonavir boosting (CYP3A4 inhibition). Pipeline predicts unboosted single-drug PK. Flag as DDI-contaminated."
+      "notes": "DDI-BOOSTED: Cmax 7.32 mg/L from ritonavir-boosted steady-state (Prezista 600mg + ritonavir 100mg BID). Dose=100mg is error (possibly from 100mg/mL suspension concentration). Pipeline predicts unboosted single-drug PK.",
+      "ddi_boosted": true
     },
     "dasatinib": {
       "smiles": "CC1=C(C(=CC=C1)Cl)NC(=O)C2=CN=C(S2)NC3=CC(=NC(=N3)C)N4CCN(CC4)CCO",
@@ -810,17 +811,18 @@
     "diclofenac": {
       "smiles": "C1=CC=C(C(=C1)CC(=O)O)NC2=C(C=CC=C2Cl)Cl",
       "dose_mg": 25.0,
-      "cmax_mg_L": 0.024,
+      "cmax_mg_L": 1.0,
       "source_type": "literature",
-      "source_id": "gold",
+      "source_id": "FDA label diclofenac sodium DR tablets",
       "fasted_confidence": "assumed_fasted",
       "formulation": "IR",
       "route": "oral",
       "population": "healthy",
       "single_dose": true,
       "tuning_contaminated": false,
       "nonlinear_pk": false,
-      "data_quality": "clinical_exact"
+      "data_quality": "clinical_exact",
+      "notes": "FIXED 2026-03-22: Cmax was 0.024 (unit error). FDA label: ~1.0 mcg/mL for 25mg delayed-release tablet."
     },
     "efavirenz": {
       "smiles": "C1CC1C#CC2(C3=C(C=CC(=C3)Cl)NC(=O)O2)C(F)(F)F",
@@ -1095,18 +1097,19 @@
     },
     "lenacapavir": {
       "smiles": "CC(C)(C#CC1=NC(=C(C=C1)C2=C3C(=C(C=C2)Cl)C(=NN3CC(F)(F)F)NS(=O)(=O)C)C(CC4=CC(=CC(=C4)F)F)NC(=O)CN5C6=C(C7CC7C6(F)F)C(=N5)C(F)(F)F)S(=O)(=O)C",
-      "dose_mg": 927.0,
-      "cmax_mg_L": 0.0738,
+      "dose_mg": 600.0,
+      "cmax_mg_L": 0.101,
       "source_type": "fda_label",
-      "source_id": "fda_expanded",
+      "source_id": "FDA Sunlenca label (oral loading dose)",
       "fasted_confidence": "assumed_fasted",
       "formulation": "IR",
       "route": "oral",
       "population": "healthy",
       "single_dose": true,
       "tuning_contaminated": false,
       "nonlinear_pk": false,
-      "data_quality": "fda_label_exact"
+      "data_quality": "fda_label_exact",
+      "notes": "FIXED 2026-03-22: Was 927mg SubQ dose (Cmax 0.0738). Corrected to 600mg oral loading (2x300mg tablets). Cmax ~101.4 ng/mL from Sunlenca FDA label. Low oral F (6-10%)."
     },
     "levocetirizine": {
       "smiles": "C1CN(CCN1CCOCC(=O)O)C(C2=CC=CC=C2)C3=CC=C(C=C3)Cl",
@@ -1136,7 +1139,9 @@
       "single_dose": true,
       "tuning_contaminated": false,
       "nonlinear_pk": false,
-      "data_quality": "fda_label_exact"
+      "data_quality": "fda_label_exact",
+      "ddi_boosted": true,
+      "notes": "DDI-BOOSTED: Cmax 9.8 mg/L is from Kaletra (lopinavir 400mg + ritonavir 100mg) steady-state. Unboosted lopinavir has ~1/10 exposure. Never used without ritonavir clinically. Dose also wrong (200 vs 400mg in label). Pipeline predicts unboosted single-drug PK."
     },
     "lorlatinib": {
       "smiles": "CC1C2=C(C=CC(=C2)F)C(=O)N(CC3=NN(C(=C3C4=CC(=C(N=C4)N)O1)C#N)C)C",
@@ -1398,17 +1403,18 @@
     "posaconazole": {
       "smiles": "CCC(C(C)O)N1C(=O)N(C=N1)C2=CC=C(C=C2)N3CCN(CC3)C4=CC=C(C=C4)OCC5CC(OC5)(CN6C=NC=N6)C7=C(C=C(C=C7)F)F",
       "dose_mg": 300.0,
-      "cmax_mg_L": 0.014,
+      "cmax_mg_L": 0.893,
       "source_type": "fda_label",
-      "source_id": "fda_expanded",
+      "source_id": "PMC4432131 Fang 2015",
       "fasted_confidence": "assumed_fasted",
       "formulation": "IR",
       "route": "oral",
       "population": "healthy",
       "single_dose": true,
       "tuning_contaminated": false,
       "nonlinear_pk": false,
-      "data_quality": "fda_label_exact"
+      "data_quality": "fda_label_exact",
+      "notes": "FIXED 2026-03-22: Cmax was 0.014 (extraction error). Correct: 893 ng/mL fasted, DR tablet 300mg single dose (PMC4432131 Fang 2015)."
     },
     "primaquine": {
       "smiles": "CC(CCCN)NC1=C2C(=CC(=C1)OC)C=CC=N2",
@@ -1701,17 +1707,18 @@
     "valganciclovir": {
       "smiles": "CC(C)C(C(=O)OCC(CO)OCN1C=NC2=C1N=C(NC2=O)N)N",
       "dose_mg": 900.0,
-      "cmax_mg_L": 58.0,
+      "cmax_mg_L": 5.61,
       "source_type": "literature",
-      "source_id": "gold",
+      "source_id": "PMID 10496303 (Valcyte PK)",
       "fasted_confidence": "assumed_fasted",
       "formulation": "IR",
       "route": "oral",
       "population": "healthy",
       "single_dose": true,
       "tuning_contaminated": false,
       "nonlinear_pk": false,
-      "data_quality": "clinical_exact"
+      "data_quality": "clinical_exact",
+      "notes": "FIXED 2026-03-22: Cmax was 58.0 (decimal error, 10x too high). Correct: ganciclovir Cmax 5.61 mcg/mL after 900mg valganciclovir oral with food (PMID 10496303). PRODRUG: valganciclovir\u2192ganciclovir. Pipeline predicts parent."
     },
     "vilazodone": {
       "smiles": "C1CN(CCN1CCCCC2=CNC3=C2C=C(C=C3)C#N)C4=CC5=C(C=C4)OC(=C5)C(=O)N",
@@ -2499,17 +2506,18 @@
     "pindolol": {
       "smiles": "CC(C)NCC(O)COc1cccc2[nH]ccc12",
       "dose_mg": 20.0,
-      "cmax_mg_L": 0.167,
+      "cmax_mg_L": 0.087,
       "source_type": "fda_label",
-      "source_id": "OpenFDA automated extraction (2026-03-20)",
+      "source_id": "FDA label pindolol (DailyMed) geometric mean",
       "fasted_confidence": "assumed_fasted",
       "formulation": "IR",
       "route": "oral",
       "population": "healthy",
       "single_dose": true,
       "tuning_contaminated": false,
       "nonlinear_pk": false,
-      "data_quality": "fda_label_exact"
+      "data_quality": "fda_label_exact",
+      "notes": "FIXED 2026-03-22: Cmax was 0.167 (upper bound of 4-fold range). FDA label: 45-167 ng/mL for 20mg. Geometric mean sqrt(45*167) = 87 ng/mL = 0.087 mg/L."
     },
     "amantadine": {
       "smiles": "NC12CC3CC(CC(C3)C1)C2",