Development

Setup

git clone https://github.com/ranafaraz/MLForge.git
cd MLForge
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q                        # all tests should pass

Running the benchmark

python -m evals.harness          # full benchmark, writes evals/RESULTS.md
python -m evals.gate             # CI quality gate (asserts dissociation shape)

Linting

ruff check .
ruff format .
mypy src/mlforge

Code structure

src/mlforge/
    types.py           -- Dataset, Protocol, EstimatorConfig, Result dataclasses
    config.py          -- Settings (reads env vars), SALT constant
    metrics.py         -- accuracy, optimism computation, pooling utilities
    preprocessing.py   -- StandardScaler, SelectKBest (from-scratch numpy implementations)
    estimators/
        base.py        -- BaseEstimator interface + clone()
        logistic.py    -- LogisticRegression (SGD, numpy)
        knn.py         -- KNeighborsClassifier (numpy)
        naive_bayes.py -- GaussianNaiveBayes (numpy)
        factory.py     -- estimator_zoo(), param_grid(), make_grid()
    pipeline.py        -- Pipeline([preprocessor, estimator]): fit/predict/clone
    cv.py              -- stratified_folds(), cross_val_score()
    selection.py       -- leaky_select(), pipeline_select(), nested_select()
    data.py            -- synthetic_dataset(), oracle_split(), dataset_factory()
    cli.py             -- mlforge data | select | compare | eval

evals/
    metrics.py         -- per-regime aggregation utilities
    harness.py         -- runs all protocols × regimes × seeds, writes RESULTS.md
    gate.py            -- CI gate: asserts ordered dissociation, control, null collapse

tests/               -- pytest tests (unit + integration, all offline)
examples/
    run_forge.py       -- minimal usage example
docs/
    ARCHITECTURE.md    -- design write-up
    DECISIONS.md       -- key decisions and rationale

How to add a new protocol

A protocol is a function that takes training data, a parameter grid, and settings, and returns a (reported_score, chosen_config, shipped_estimator) triple. The shipped estimator is the model refit on all training data with the chosen config, so that the oracle can score it.

Add a function in selection.py with the signature:

def myprotocol_select(
    X_train: np.ndarray,
    y_train: np.ndarray,
    grid: list[EstimatorConfig],
    settings: Settings,
) -> tuple[float, EstimatorConfig, BaseEstimator]:
    """Returns (reported_cv_score, chosen_config, shipped_estimator)."""

Register the protocol in config.py's PROTOCOL_REGISTRY and in the CLI choices in cli.py.
Ensure it uses cv.stratified_folds / cross_val_score from the shared CV module. This is what keeps all protocols comparable: the same splitting engine, the same fold sizes, the same random seed.
Add tests in tests/ covering: the protocol runs without error, the reported score is in [0, 1], and the shipped estimator can predict on held-out data.
Add the protocol to evals/harness.py so it runs in the full benchmark, and update evals/gate.py if you want CI to assert a specific property of the new protocol.
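
The steps above can be sketched end to end as follows. This is a minimal, self-contained illustration, not MLForge's code: Settings, EstimatorConfig and ThresholdClassifier are hypothetical stand-ins, and the stratified_folds / cross_val_score defined here are simplified placeholders for the shared functions in cv.py, whose real signatures are not shown on this page. A real protocol would import those instead of redefining them.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class Settings:                 # hypothetical stand-in for mlforge.config.Settings
    n_folds: int = 5
    seed: int = 0


@dataclass(frozen=True)
class EstimatorConfig:          # hypothetical stand-in for mlforge.types.EstimatorConfig
    threshold: float


class ThresholdClassifier:      # hypothetical stand-in for a BaseEstimator subclass
    """Predicts 1 when feature 0 exceeds a fixed threshold (nothing is learned)."""

    def __init__(self, config):
        self.config = config

    def fit(self, X, y):
        return self

    def predict(self, X):
        return (X[:, 0] > self.config.threshold).astype(int)


def stratified_folds(y, n_folds, rng):
    """Placeholder for cv.stratified_folds: yields (train_idx, test_idx) pairs."""
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        # Shuffle each class separately, then deal its rows round-robin to folds.
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    for k in range(n_folds):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)


def cross_val_score(config, X, y, settings):
    """Placeholder for cv.cross_val_score: mean held-out accuracy over folds."""
    rng = np.random.default_rng(settings.seed)  # same seed, so same folds per config
    scores = []
    for train_idx, test_idx in stratified_folds(y, settings.n_folds, rng):
        model = ThresholdClassifier(config).fit(X[train_idx], y[train_idx])
        scores.append(float(np.mean(model.predict(X[test_idx]) == y[test_idx])))
    return float(np.mean(scores))


def myprotocol_select(X_train, y_train, grid, settings):
    """Returns (reported_cv_score, chosen_config, shipped_estimator)."""
    # Score every config on identical folds, then keep the best one.
    scores = [cross_val_score(c, X_train, y_train, settings) for c in grid]
    best = int(np.argmax(scores))
    # Ship a model refit on all training data with the chosen config.
    shipped = ThresholdClassifier(grid[best]).fit(X_train, y_train)
    return scores[best], grid[best], shipped
```

Note that this sketch reports the maximum cross-validated score over the grid, which is optimistically biased by the selection itself. Whether a new protocol reports that number or a corrected one is exactly the design choice the protocol encodes.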

How to add a new dataset generator

A dataset generator is a function that returns (X_train, y_train, X_oracle, y_oracle) for a given seed and settings.

Add a function in data.py:

def myregime_dataset(settings: Settings, seed: int) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Returns (X_train, y_train, X_oracle, y_oracle)."""

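Use settings.rng(SALT, seed, "myregime") for all random draws. Never seed globally with numpy.random.seed(), and never derive seeds from hash(), whose values for strings change between Python processes unless PYTHONHASHSEED is fixed.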

Register the regime in data.py's DATASET_REGISTRY and in the CLI choices.
Document the Bayes ceiling if it is analytically computable, and include it in the output of mlforge data --dataset myregime.
Add to the benchmark: add the new regime to evals/harness.py and decide what shape assertion (if any) belongs in evals/gate.py.
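
As a rough illustration of the steps above, here is a sketch of a generator with an analytically known Bayes ceiling. It is not taken from data.py. It accepts a NumPy Generator directly rather than (settings, seed), because the Settings API is not shown on this page; in MLForge the generator would come from settings.rng(SALT, seed, "myregime"). The sizes, the single informative feature, and the flip rate are all assumed values chosen for the example.

```python
import numpy as np


def myregime_dataset(rng, n_train=200, n_oracle=2000, n_features=10, flip=0.1):
    """Returns (X_train, y_train, X_oracle, y_oracle).

    `rng` is a numpy.random.Generator. All defaults are illustrative,
    not values used by the project.
    """
    n = n_train + n_oracle
    X = rng.normal(size=(n, n_features))
    clean = (X[:, 0] > 0).astype(int)      # only feature 0 carries signal
    flipped = rng.random(n) < flip         # independent symmetric label noise
    y = np.where(flipped, 1 - clean, clean)
    # Train and oracle rows come from the same distribution but never overlap.
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]
```

With symmetric label flips at rate flip < 0.5, the best possible classifier predicts the clean label and is right with probability 1 - flip, so the Bayes ceiling for this sketch is 0.9 at the default setting. That is the kind of figure worth documenting alongside a new regime.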

Key invariants

One splitting engine, one independent variable. All protocols must call cv.stratified_folds / cross_val_score. The comparison is fair only if the fold structure and random seeds are identical across protocols.
The oracle is never shown to the forge. X_oracle / y_oracle must only appear in the scoring step after the chosen pipeline is shipped. Never pass oracle rows to any protocol function.
pipeline and nested ship the same model. Both must refit using the config that a flat, honest selection (no leaked data) over all training data would pick. If they ship different models, their oracle scores differ and the selection-bias correction can no longer be isolated.
Determinism. Seed everything from Settings.rng(SALT, seed, scope). Never call numpy.random.seed() globally, and never derive seeds from hash().
Estimators are honest, not strawmen. The task's Bayes ceiling is below 1.0 by design (label noise). Do not increase signal to make optimism gaps look larger.
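
The determinism invariant can be illustrated with a small sketch of scoped seed derivation. This is an assumption about how a (SALT, seed, scope) triple could be turned into a reproducible generator, not the implementation of Settings.rng. The names derive_seed and scoped_rng and the SALT value are hypothetical. It uses only the standard library; the same integer could seed numpy.random.default_rng instead.

```python
import hashlib
import random

SALT = "mlforge-example"   # hypothetical value; the real SALT lives in config.py


def derive_seed(salt: str, seed: int, scope: str) -> int:
    """Map (salt, seed, scope) to a stable 64-bit integer seed.

    SHA-256 gives the same digest in every process and on every platform,
    unlike the built-in hash(), which is salted per process for strings.
    """
    key = f"{salt}|{seed}|{scope}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big")


def scoped_rng(salt: str, seed: int, scope: str) -> random.Random:
    """Return an independent generator; no global random state is touched."""
    return random.Random(derive_seed(salt, seed, scope))
```

Because each scope gets its own generator, adding random draws in one component (say, a new dataset regime) does not shift the stream consumed by another (say, fold assignment), which keeps protocol comparisons reproducible.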
