Home
A leakage-safe tabular model-selection forge that measures how much CV lies, and by how much fixing it helps.
MLForge searches a grid of from-scratch estimators and reports a cross-validated accuracy — but the point is to measure how much that reported number overestimates true held-out accuracy, and to fix it. Two common malpractices inflate the score you would put in a report; MLForge isolates and corrects each one:
- Proper pipelining buys a preprocessing-leakage correction.
- Nested cross-validation buys a selection-bias correction.
On a high-dimensional task, the leaky-but-common protocol reports 0.871 accuracy for a model that truly generalizes at 0.715 — a +0.155 lie. Pipelining removes most of it (+0.050 optimism); nested CV removes the rest (+0.013 optimism). On random-label data, the leaky protocol manufactures 0.763 accuracy from pure noise; nested CV correctly refuses, reporting 0.491.
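The attribution in the paragraph above is plain subtraction on the quoted optimism gaps. A minimal sketch using only the numbers stated in the text; the variable names are illustrative and are not MLForge identifiers:

```python
# Optimism gaps quoted above for the high-dimensional task (reported - oracle).
optimism = {"leaky": 0.155, "pipeline": 0.050, "nested": 0.013}

# Each fix is credited with the drop in optimism it produces.
removed_by_pipelining = optimism["leaky"] - optimism["pipeline"]
removed_by_nesting = optimism["pipeline"] - optimism["nested"]
residual = optimism["nested"]

# The three parts add back up to the leaky gap.
total = removed_by_pipelining + removed_by_nesting + residual

for name, value in [("pipelining", removed_by_pipelining),
                    ("nested CV", removed_by_nesting),
                    ("residual", residual)]:
    share = value / optimism["leaky"]
    print(f"{name:>10}: {value:+.3f} ({share:.0%} of the leaky gap)")

# Sanity check on the headline numbers: the rounded scores differ by 0.156,
# so the quoted +0.155 was presumably computed before rounding (an assumption).
headline_gap = round(0.871 - 0.715, 3)
```

Reading the output this way makes the "one fix, one reduction" claim concrete: the pipelining and nesting terms are separate drops, and whatever is left is the residual optimism of the nested estimate.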
flowchart LR
G[synthetic generator<br/>known signal] --> TR[train split]
G --> OR[oracle split<br/>large, held-out]
TR --> F{forge: pick a protocol}
F -->|leaky| L[fit preprocessing on ALL data<br/>+ best-of-grid flat CV]
F -->|pipeline| P[refit pipeline in each fold<br/>+ best-of-grid flat CV]
F -->|nested| N[inner CV selects<br/>outer CV estimates]
L & P & N --> R[reported score]
L & P & N --> SHIP[ship: refit chosen<br/>pipeline on all data]
SHIP --> SC[score on oracle]
OR --> SC
SC --> O[oracle score]
R --> D[optimism = reported - oracle]
O --> D
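The oracle half of the diagram can be sketched in a few lines. This is an illustration under stated assumptions, not MLForge's code: the generator `make_signal_data`, the nearest-centroid classifier, and all sizes and parameters are hypothetical stand-ins chosen so the example runs with the standard library alone. It shows a small train split, a large held-out oracle split drawn from the same generator, and optimism computed as reported minus oracle.

```python
import random

def make_signal_data(n, p, n_informative, shift, rng):
    """Gaussian features; the first n_informative columns differ in mean by `shift` between classes."""
    X, y = [], []
    for i in range(n):
        label = i % 2
        offset = shift / 2 if label == 1 else -shift / 2
        X.append([rng.gauss(0.0, 1.0) + (offset if j < n_informative else 0.0)
                  for j in range(p)])
        y.append(label)
    return X, y

def fit_centroids(X, y):
    """From-scratch nearest-centroid fit: one mean vector per class."""
    cents = {}
    for c in (0, 1):
        members = [row for row, label in zip(X, y) if label == c]
        cents[c] = [sum(col) / len(members) for col in zip(*members)]
    return cents

def accuracy(cents, X, y):
    """Fraction of rows whose nearest centroid matches the label."""
    hits = 0
    for row, label in zip(X, y):
        dist = {c: sum((a - b) ** 2 for a, b in zip(row, m)) for c, m in cents.items()}
        hits += int(min(dist, key=dist.get) == label)
    return hits / len(y)

rng = random.Random(0)
X_train, y_train = make_signal_data(60, 20, 3, 1.0, rng)      # small train split
X_oracle, y_oracle = make_signal_data(5000, 20, 3, 1.0, rng)  # large held-out oracle split

# Reported score: 5-fold CV on the train split only.
n_folds = 5
fold_scores = []
for f in range(n_folds):
    test_idx = set(range(f, len(y_train), n_folds))
    tr = [i for i in range(len(y_train)) if i not in test_idx]
    te = sorted(test_idx)
    cents = fit_centroids([X_train[i] for i in tr], [y_train[i] for i in tr])
    fold_scores.append(accuracy(cents, [X_train[i] for i in te], [y_train[i] for i in te]))
reported = sum(fold_scores) / n_folds

# Ship: refit on all training data, then score once on the oracle split.
shipped = fit_centroids(X_train, y_train)
oracle_score = accuracy(shipped, X_oracle, y_oracle)

optimism = reported - oracle_score
print(f"reported={reported:.3f} oracle={oracle_score:.3f} optimism={optimism:+.3f}")
```

Because this sketch has no preprocessing step and no grid search, its optimism should only reflect sampling noise; the point is the measurement itself, which the oracle split makes possible because the generator can produce as much held-out data as needed.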
Three protocols search the same grid over the same data and ship the same way. They differ by exactly one structural choice each, so each optimism-gap reduction is attributable to a single fix.
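The structural difference between the three protocols is easiest to see on pure-noise data, where true accuracy is 0.5 by construction. The sketch below is an assumption-laden illustration rather than MLForge's implementation: top-k feature selection stands in for "preprocessing", a nearest-centroid classifier stands in for the from-scratch estimators, the grid is just two values of k, and every function name is hypothetical.

```python
import random

def make_null_data(n, p, seed):
    """Pure-noise features with balanced labels that are independent of X."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    return X, y

def top_k_features(X, y, rows, k):
    """Rank features by |mean(class 1) - mean(class 0)| using only `rows`."""
    ones = [i for i in rows if y[i] == 1]
    zeros = [i for i in rows if y[i] == 0]
    def gap(j):
        m1 = sum(X[i][j] for i in ones) / len(ones)
        m0 = sum(X[i][j] for i in zeros) / len(zeros)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]

def centroid_accuracy(X, y, train, test, feats):
    """Fit a nearest-centroid classifier on `train`, score it on `test`."""
    cents = {}
    for c in (0, 1):
        members = [i for i in train if y[i] == c]
        cents[c] = [sum(X[i][j] for i in members) / len(members) for j in feats]
    hits = 0
    for i in test:
        dist = {c: sum((X[i][j] - m) ** 2 for j, m in zip(feats, cents[c]))
                for c in (0, 1)}
        hits += int(min(dist, key=dist.get) == y[i])
    return hits / len(test)

def folds(rows, n_folds):
    """Yield (train, test) index lists; rows are dealt round-robin."""
    for f in range(n_folds):
        test = rows[f::n_folds]
        held = set(test)
        yield [i for i in rows if i not in held], test

def flat_cv(X, y, rows, k, leaky, n_folds=5):
    """Flat CV for one grid point; `leaky` selects features on all rows first."""
    global_feats = top_k_features(X, y, rows, k) if leaky else None
    scores = []
    for train, test in folds(rows, n_folds):
        feats = global_feats if leaky else top_k_features(X, y, train, k)
        scores.append(centroid_accuracy(X, y, train, test, feats))
    return sum(scores) / len(scores)

def nested_cv(X, y, rows, grid, n_folds=5):
    """Inner CV picks k on the outer-train rows; the outer fold only estimates."""
    scores = []
    for train, test in folds(rows, n_folds):
        best_k = max(grid, key=lambda k: flat_cv(X, y, train, k, leaky=False))
        feats = top_k_features(X, y, train, best_k)
        scores.append(centroid_accuracy(X, y, train, test, feats))
    return sum(scores) / len(scores)

X, y = make_null_data(n=40, p=1000, seed=0)
rows = list(range(len(y)))
grid = (5, 20)

leaky = max(flat_cv(X, y, rows, k, leaky=True) for k in grid)      # best-of-grid, leaked selection
pipeline = max(flat_cv(X, y, rows, k, leaky=False) for k in grid)  # best-of-grid, per-fold selection
nested = nested_cv(X, y, rows, grid)                               # selection hidden from the estimate

for name, score in [("leaky", leaky), ("pipeline", pipeline), ("nested", nested)]:
    print(f"{name:>8}: reported={score:.3f} optimism={score - 0.5:+.3f}")
```

With many more features than samples, the leaky score would be expected to land well above chance while the other two stay near 0.5; exact values depend on the seed and on the sizes chosen here. The only lines that differ between protocols are where `top_k_features` is called and whether the best-of-grid score or an outer-fold score is reported, which mirrors the one-choice-per-protocol design described above.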
python -m venv .venv && source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q # tests, all offline
mlforge compare --dataset highdim # all three protocols side by side

- Architecture: leaky / pipeline / nested protocol design, oracle, optimism gap, null and low-dim controls
- Evaluation: benchmark setup, results tables, dissociation proof, reproduce commands
- Configuration: env vars, backend options, .env.example
- Development: setup, code structure, how to add a new protocol or dataset generator