Home

MLForge

A leakage-safe model-selection forge for tabular data that measures how much cross-validation overstates accuracy, and how much each fix recovers.

MLForge searches a grid of from-scratch estimators and reports a cross-validated accuracy. Its real purpose is to measure how much that reported number overestimates true held-out accuracy, and to correct it. Two common malpractices inflate the score you would put in a report; MLForge isolates and corrects each one:

Proper pipelining (refitting the preprocessing inside each fold) corrects preprocessing leakage.
Nested cross-validation (inner CV selects, outer CV estimates) corrects selection bias.

On a high-dimensional task, the leaky-but-common protocol reports 0.871 accuracy for a model that truly generalizes at 0.715, a +0.155 lie. Pipelining removes most of it, leaving +0.050 optimism; nested CV removes most of what remains, leaving +0.013. On random-label data, the leaky protocol manufactures 0.763 accuracy from pure noise; nested CV correctly refuses, reporting 0.491.
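
The random-label effect can be reproduced in miniature. The sketch below is not MLForge's code: it assumes feature selection as the preprocessing step and a nearest-centroid classifier, and every name in it (`make_null_data`, `top_k_features`, `cv_accuracy`) is hypothetical. It scores the same pure-noise data twice, once selecting features on all rows before cross-validation and once re-selecting inside each training fold.

```python
import random


def make_null_data(n, p, rng):
    """n rows of p standard-normal features; labels are independent of the features."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]  # balanced binary labels
    rng.shuffle(y)
    return X, y


def top_k_features(X, y, rows, k):
    """Rank features by |mean(class 1) - mean(class 0)| computed on `rows` only."""
    pos = [i for i in rows if y[i] == 1]
    neg = [i for i in rows if y[i] == 0]

    def gap(j):
        return abs(sum(X[i][j] for i in pos) / len(pos)
                   - sum(X[i][j] for i in neg) / len(neg))

    return sorted(range(len(X[0])), key=gap, reverse=True)[:k]


def cv_accuracy(X, y, k_feats, n_folds, leaky):
    """k-fold accuracy of a nearest-centroid classifier on the selected features."""
    all_rows = list(range(len(X)))
    # Leaky protocol: select features once, using rows that are later held out.
    global_feats = top_k_features(X, y, all_rows, k_feats) if leaky else None
    correct = 0
    for f in range(n_folds):
        test = all_rows[f::n_folds]
        held_out = set(test)
        train = [i for i in all_rows if i not in held_out]
        # Pipelined protocol: selection only ever sees the training fold.
        feats = global_feats if leaky else top_k_features(X, y, train, k_feats)
        centroids = {}
        for label in (0, 1):
            members = [i for i in train if y[i] == label]
            centroids[label] = [sum(X[i][j] for i in members) / len(members)
                                for j in feats]
        for i in test:
            dist = {label: sum((X[i][j] - c) ** 2 for j, c in zip(feats, cen))
                    for label, cen in centroids.items()}
            correct += min(dist, key=dist.get) == y[i]
    return correct / len(X)


rng = random.Random(0)
X, y = make_null_data(n=50, p=1000, rng=rng)
leaky = cv_accuracy(X, y, k_feats=10, n_folds=5, leaky=True)
honest = cv_accuracy(X, y, k_feats=10, n_folds=5, leaky=False)
print(f"select on all rows: {leaky:.3f}   select inside folds: {honest:.3f}")
```

Because the labels carry no information about the features, any accuracy well above chance in the first number comes from the selection step having seen the held-out rows.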

Architecture overview

flowchart LR
    G[synthetic generator<br/>known signal] --> TR[train split]
    G --> OR[oracle split<br/>large, held-out]
    TR --> F{forge: pick a protocol}
    F -->|leaky| L[fit preprocessing on ALL data<br/>+ best-of-grid flat CV]
    F -->|pipeline| P[refit pipeline in each fold<br/>+ best-of-grid flat CV]
    F -->|nested| N[inner CV selects<br/>outer CV estimates]
    L & P & N --> R[reported score]
    L & P & N --> SHIP[ship: refit chosen<br/>pipeline on all data]
    SHIP --> SC[score on oracle]
    OR --> SC
    SC --> O[oracle score]
    R --> D[optimism = reported - oracle]
    O --> D
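
The difference between the flat best-of-grid branches and the nested branch of the flowchart can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not MLForge's implementation: the estimator is a toy k-NN, the grid is over the number of neighbours, and the function names (`knn_predict`, `cv_accuracy`, `flat_best_of_grid`, `nested_cv`) are hypothetical.

```python
import random


def knn_predict(X, y, train, k, i):
    """Majority vote among the k training rows nearest to row i (k odd, binary labels)."""
    nearest = sorted(
        train, key=lambda r: sum((a - b) ** 2 for a, b in zip(X[r], X[i]))
    )[:k]
    votes = sum(y[r] for r in nearest)
    return 1 if 2 * votes > k else 0


def cv_accuracy(X, y, rows, k, n_folds):
    """Plain k-fold accuracy of k-NN, restricted to `rows`."""
    correct = 0
    for f in range(n_folds):
        test = rows[f::n_folds]
        held_out = set(test)
        train = [r for r in rows if r not in held_out]
        correct += sum(knn_predict(X, y, train, k, i) == y[i] for i in test)
    return correct / len(rows)


def flat_best_of_grid(X, y, rows, grid, n_folds):
    """Flat CV: the same folds both pick the winner and produce the reported score."""
    return max(cv_accuracy(X, y, rows, k, n_folds) for k in grid)


def nested_cv(X, y, rows, grid, outer_folds, inner_folds):
    """Nested CV: inner CV selects k on the outer-training rows; the outer fold only scores."""
    correct = 0
    for f in range(outer_folds):
        test = rows[f::outer_folds]
        held_out = set(test)
        train = [r for r in rows if r not in held_out]
        best_k = max(grid, key=lambda k: cv_accuracy(X, y, train, k, inner_folds))
        correct += sum(knn_predict(X, y, train, best_k, i) == y[i] for i in test)
    return correct / len(rows)


# Pure-noise data: labels are shuffled independently of the features.
rng = random.Random(1)
n = 60
X = [[rng.gauss(0.0, 1.0) for _ in range(5)] for _ in range(n)]
y = [i % 2 for i in range(n)]
rng.shuffle(y)

rows = list(range(n))
grid = [1, 3, 5, 7, 9, 11, 15]
flat = flat_best_of_grid(X, y, rows, grid, 5)
nested = nested_cv(X, y, rows, grid, 5, 5)
print(f"flat best-of-grid: {flat:.3f}   nested: {nested:.3f}")
```

The flat score is a maximum over noisy estimates, so it tends to be biased upward. In the nested version, no row used to score a configuration was also used to choose it.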

The three protocols search the same grid over the same data and ship the same way. Each differs from the previous one by exactly one structural choice, so each reduction in the optimism gap is attributable to a single fix.
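
As a worked example of that attribution, the snippet below takes the three optimism gaps quoted for the high-dimensional task and differences them. The variable names are illustrative; only the three input numbers come from the results above.

```python
# Optimism gaps (reported - oracle) quoted for the high-dimensional task.
optimism = {"leaky": 0.155, "pipeline": 0.050, "nested": 0.013}

# Consecutive protocols differ by one fix, so consecutive differences
# give the share of the gap that each fix removes.
preprocessing_leakage = optimism["leaky"] - optimism["pipeline"]
selection_bias = optimism["pipeline"] - optimism["nested"]
residual = optimism["nested"]

print(f"removed by pipelining: {preprocessing_leakage:.3f}")  # 0.105
print(f"removed by nested CV:  {selection_bias:.3f}")         # 0.037
print(f"remaining optimism:    {residual:.3f}")               # 0.013
```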

Quick start

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q                                            # tests, all offline
mlforge compare --dataset highdim                   # all three protocols side by side

Wiki pages

Architecture — leaky / pipeline / nested protocol design, oracle, optimism gap, null and low-dim controls
Evaluation — benchmark setup, results tables, dissociation proof, reproduce commands
Configuration — env vars, backend options, .env.example
Development — setup, code structure, how to add a new protocol or dataset generator
