Rana Faraz edited this page Jun 23, 2026 · 1 revision

MLForge

License: MIT

A leakage-safe tabular model-selection forge that measures how much CV lies and how much each fix helps.

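MLForge searches a grid of from-scratch estimators and reports a cross-validated accuracy. Its real purpose is to measure how much that reported number overestimates true held-out accuracy, and to correct it. Two common malpractices inflate the score you would put in a report; MLForge isolates and corrects each one:

  • Preprocessing leakage: preprocessing is fitted on all the data before cross-validation. Proper pipelining, which refits the pipeline inside each fold, corrects it.
  • Selection bias: the best-of-grid score from a flat CV is reported as the estimate. Nested cross-validation, where an inner CV selects and an outer CV estimates, corrects it.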
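The preprocessing-leakage correction can be illustrated with a minimal, standard-library sketch. Everything in it is an assumption made for illustration, not MLForge's code: the feature selector, the nearest-centroid classifier, and all function names are hypothetical stand-ins. The labels are pure noise, so any accuracy above chance can only come from leakage.

```python
import random


def make_noise_data(n=60, p=1000, seed=0):
    """Gaussian features with balanced, shuffled labels: there is no signal to find."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [i % 2 for i in range(n)]
    rng.shuffle(y)
    return X, y


def class_means(X, y, rows, cols):
    """Per-class mean of each column in cols, computed from the given rows only."""
    means = {}
    for label in (0, 1):
        members = [i for i in rows if y[i] == label]
        means[label] = [sum(X[i][j] for i in members) / len(members) for j in cols]
    return means


def select_features(X, y, rows, k):
    """Preprocessing step: keep the k columns whose class means differ most."""
    cols = list(range(len(X[0])))
    m = class_means(X, y, rows, cols)
    gap = [abs(a - b) for a, b in zip(m[0], m[1])]
    return sorted(cols, key=lambda j: gap[j], reverse=True)[:k]


def count_correct(X, y, train, test, cols):
    """Nearest-centroid classifier fitted on train rows, scored on test rows."""
    m = class_means(X, y, train, cols)
    correct = 0
    for i in test:
        dist = {c: sum((X[i][j] - mu) ** 2 for j, mu in zip(cols, m[c])) for c in (0, 1)}
        correct += min(dist, key=dist.get) == y[i]
    return correct


def cv_accuracy(X, y, k_features=10, folds=5, leaky=False):
    """Flat k-fold CV; `leaky` decides where the feature selector is fitted."""
    rows = list(range(len(X)))
    # Leaky: the selector sees every row, including the labels of future test folds.
    leaked_cols = select_features(X, y, rows, k_features) if leaky else None
    correct = 0
    for f in range(folds):
        test = rows[f::folds]
        train = [i for i in rows if i % folds != f]
        # Pipelined: the selector is refitted on this fold's training rows only.
        cols = leaked_cols if leaky else select_features(X, y, train, k_features)
        correct += count_correct(X, y, train, test, cols)
    return correct / len(rows)


if __name__ == "__main__":
    X, y = make_noise_data()
    print("leaky    :", cv_accuracy(X, y, leaky=True))
    print("pipelined:", cv_accuracy(X, y, leaky=False))
```

The two calls share the same data, folds, and classifier. The only difference is whether `select_features` is fitted once on all rows or again inside each fold, which is the single structural choice that pipelining changes.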

On a high-dimensional task, the leaky-but-common protocol reports 0.871 accuracy for a model that truly generalizes at 0.715, a +0.155 lie. Pipelining removes most of it, leaving +0.050 optimism; nested CV removes most of the remainder, leaving +0.013 optimism. On random-label data, the leaky protocol manufactures 0.763 accuracy from pure noise; nested CV correctly refuses, reporting 0.491.
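The attribution in the high-dimensional result can be worked through as arithmetic. The sketch below assumes that the +0.050 and +0.013 figures are the optimism remaining under the pipeline and nested protocols, and that the fixes are cumulative; under that reading, pipelining removes 0.105 of the leaky gap and nested CV a further 0.037, which is about 68% and 24% of the original +0.155, with about 8% left over. The function and key names are hypothetical.

```python
# Optimism gaps (reported minus oracle) quoted in the text for the high-dimensional task.
OPTIMISM = {"leaky": 0.155, "pipeline": 0.050, "nested": 0.013}


def attribute_gap(optimism):
    """Split the leaky gap into what each fix removes plus what is left over."""
    parts = {
        "removed by pipelining": optimism["leaky"] - optimism["pipeline"],
        "removed by nested CV": optimism["pipeline"] - optimism["nested"],
        "residual optimism": optimism["nested"],
    }
    # Round to the three decimals used in the text to hide float noise.
    return {name: round(value, 3) for name, value in parts.items()}


if __name__ == "__main__":
    for name, value in attribute_gap(OPTIMISM).items():
        share = value / OPTIMISM["leaky"]
        print(f"{name:<22} {value:+.3f}  ({share:.0%} of the leaky gap)")
```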

Architecture overview

flowchart LR
    G[synthetic generator<br/>known signal] --> TR[train split]
    G --> OR[oracle split<br/>large, held-out]
    TR --> F{forge: pick a protocol}
    F -->|leaky| L[fit preprocessing on ALL data<br/>+ best-of-grid flat CV]
    F -->|pipeline| P[refit pipeline in each fold<br/>+ best-of-grid flat CV]
    F -->|nested| N[inner CV selects<br/>outer CV estimates]
    L & P & N --> R[reported score]
    L & P & N --> SHIP[ship: refit chosen<br/>pipeline on all data]
    SHIP --> SC[score on oracle]
    OR --> SC
    SC --> O[oracle score]
    R --> D[optimism = reported - oracle]
    O --> D

Three protocols search the same grid over the same data and ship the same way. They differ by exactly one structural choice each, so each optimism-gap reduction is attributable to a single fix.
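The difference between best-of-grid flat CV and nested CV, and the optimism computed against an oracle split, can be sketched in a few lines. This is an illustration under stated assumptions rather than MLForge's implementation: the 1-D k-NN classifier, the Gaussian toy data, the grid of `k` values, and every function name are hypothetical, and preprocessing is left out to isolate the selection step.

```python
import random


def avg(values):
    values = list(values)
    return sum(values) / len(values)


def kfold(rows, k):
    """Return k (train, test) pairs that partition `rows` into disjoint test folds."""
    folds = [rows[i::k] for i in range(k)]
    return [([r for j, fold in enumerate(folds) if j != i for r in fold], folds[i])
            for i in range(k)]


def accuracy(X, y, train, test, k):
    """k-NN on 1-D features: majority vote among the k nearest training rows."""
    hits = 0
    for t in test:
        nearest = sorted(train, key=lambda i: abs(X[i] - X[t]))[:k]
        vote = 1 if 2 * sum(y[i] for i in nearest) > len(nearest) else 0
        hits += vote == y[t]
    return hits / len(test)


def flat_cv(X, y, rows, grid, folds=5):
    """Best-of-grid flat CV: the reported score is the maximum over the grid."""
    scores = {k: avg(accuracy(X, y, tr, te, k) for tr, te in kfold(rows, folds))
              for k in grid}
    best = max(scores, key=scores.get)
    return best, scores[best]


def nested_cv(X, y, rows, grid, folds=5):
    """Inner CV selects on each outer training split; outer folds only estimate."""
    outer = []
    for outer_train, outer_test in kfold(rows, folds):
        best, _ = flat_cv(X, y, outer_train, grid, folds)  # never sees outer_test
        outer.append(accuracy(X, y, outer_train, outer_test, best))
    return avg(outer)


def optimism(reported, oracle):
    """Optimism gap as defined in the diagram: reported minus oracle."""
    return reported - oracle


def make_data(n, seed):
    """Toy 1-D data: class 1 is shifted by one standard deviation."""
    rng = random.Random(seed)
    y = [rng.randint(0, 1) for _ in range(n)]
    X = [rng.gauss(float(label), 1.0) for label in y]
    return X, y


if __name__ == "__main__":
    n_train, n_oracle, grid = 60, 2000, [1, 3, 5, 9]
    X, y = make_data(n_train + n_oracle, seed=0)
    train = list(range(n_train))
    oracle = list(range(n_train, n_train + n_oracle))
    best, flat_score = flat_cv(X, y, train, grid)
    nested_score = nested_cv(X, y, train, grid)
    oracle_score = accuracy(X, y, train, oracle, best)  # ship: fit on all train rows
    print("flat   optimism:", optimism(flat_score, oracle_score))
    print("nested optimism:", optimism(nested_score, oracle_score))
```

Both protocols ship the same model here, the grid point chosen on all training rows, so the oracle score is shared and only the reported number differs. The flat score is a maximum over noisy estimates, while the nested score comes from outer folds that played no part in the selection.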

Quick start

python -m venv .venv && source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q                                            # tests, all offline
mlforge compare --dataset highdim                   # all three protocols side by side

Wiki pages

  • Architecture — leaky / pipeline / nested protocol design, oracle, optimism gap, null and low-dim controls
  • Evaluation — benchmark setup, results tables, dissociation proof, reproduce commands
  • Configuration — env vars, backend options, .env.example
  • Development — setup, code structure, how to add a new protocol or dataset generator
