Development

Rana Faraz edited this page Jun 23, 2026 · 1 revision

Setup

git clone https://github.com/ranafaraz/MLForge.git
cd MLForge
python -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q                        # all tests should pass

Running the benchmark

python -m evals.harness          # full benchmark, writes evals/RESULTS.md
python -m evals.gate             # CI quality gate (asserts dissociation shape)

Linting

ruff check .
ruff format .
mypy src/mlforge

Code structure

src/mlforge/
    types.py           -- Dataset, Protocol, EstimatorConfig, Result dataclasses
    config.py          -- Settings (reads env vars), SALT constant
    metrics.py         -- accuracy, optimism computation, pooling utilities
    preprocessing.py   -- StandardScaler, SelectKBest (from-scratch numpy implementations)
    estimators/
        base.py        -- BaseEstimator interface + clone()
        logistic.py    -- LogisticRegression (SGD, numpy)
        knn.py         -- KNeighborsClassifier (numpy)
        naive_bayes.py -- GaussianNaiveBayes (numpy)
        factory.py     -- estimator_zoo(), param_grid(), make_grid()
    pipeline.py        -- Pipeline([preprocessor, estimator]): fit/predict/clone
    cv.py              -- stratified_folds(), cross_val_score()
    selection.py       -- leaky_select(), pipeline_select(), nested_select()
    data.py            -- synthetic_dataset(), oracle_split(), dataset_factory()
    cli.py             -- mlforge data | select | compare | eval

evals/
    metrics.py         -- per-regime aggregation utilities
    harness.py         -- runs all protocols × regimes × seeds, writes RESULTS.md
    gate.py            -- CI gate: asserts ordered dissociation, control, null collapse

tests/               -- pytest tests (unit + integration, all offline)
examples/
    run_forge.py       -- minimal usage example
docs/
    ARCHITECTURE.md    -- design write-up
    DECISIONS.md       -- key decisions and rationale

How to add a new protocol

A protocol is a function that takes the training data, a parameter grid, and the settings, and returns a (reported_score, chosen_config, shipped_estimator) triple. The shipped estimator is the model refit on all training data with the chosen config; it is what the oracle scores.

  1. Add a function in selection.py with the signature:

    def myprotocol_select(
        X_train: np.ndarray,
        y_train: np.ndarray,
        grid: list[EstimatorConfig],
        settings: Settings,
    ) -> tuple[float, EstimatorConfig, BaseEstimator]:
        """Returns (reported_cv_score, chosen_config, shipped_estimator)."""

  2. Register the protocol in config.py's PROTOCOL_REGISTRY and in the CLI choices in cli.py.

  3. Ensure it uses cv.stratified_folds / cross_val_score from the shared CV module. This is what keeps all protocols comparable: the same splitting engine, the same fold sizes, the same random seed.

  4. Add tests in tests/ covering: the protocol runs without error, the reported score is in [0, 1], and the shipped estimator can predict on held-out data.

  5. Add to evals/harness.py so it runs in the full benchmark, and update evals/gate.py if you want CI to assert a specific property about the new protocol.
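The steps above can be sketched end to end as follows. This is a minimal, self-contained illustration, not the project's code: `EstimatorConfig`, `Settings`, `KNN`, `stratified_folds`, and `cross_val_score` below are hypothetical stand-ins with assumed fields and signatures, and a real protocol would import the shared versions from `mlforge` and draw randomness from `Settings.rng` instead. The sketch is a flat selection that reports the best grid score, so its reported score is optimistically biased.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class EstimatorConfig:  # hypothetical stand-in; the real fields are not shown here
    k: int              # number of neighbours


@dataclass(frozen=True)
class Settings:         # hypothetical stand-in; field names are assumptions
    n_folds: int = 5
    seed: int = 0


class KNN:              # hypothetical stand-in for a BaseEstimator subclass
    def __init__(self, config: EstimatorConfig):
        self.config = config

    def fit(self, X, y):
        self.X_, self.y_ = X, y
        return self

    def predict(self, X):
        # Squared Euclidean distance from every query row to every training row.
        d = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d, axis=1)[:, : self.config.k]
        # Majority vote over binary labels {0, 1}.
        return (self.y_[nearest].mean(axis=1) >= 0.5).astype(int)


def stratified_folds(y, n_folds, rng):
    """Assign each row a fold id so every fold has a similar class balance."""
    folds = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        folds[idx] = np.arange(len(idx)) % n_folds
    return folds


def cross_val_score(config, X, y, folds):
    """Mean held-out accuracy of one config over a fixed fold assignment."""
    scores = []
    for f in np.unique(folds):
        test = folds == f
        model = KNN(config).fit(X[~test], y[~test])
        scores.append(float((model.predict(X[test]) == y[test]).mean()))
    return float(np.mean(scores))


def myprotocol_select(X_train, y_train, grid, settings):
    """Returns (reported_cv_score, chosen_config, shipped_estimator)."""
    rng = np.random.default_rng(settings.seed)
    # One fold assignment shared by every config in the grid.
    folds = stratified_folds(y_train, settings.n_folds, rng)
    scores = [cross_val_score(c, X_train, y_train, folds) for c in grid]
    best = int(np.argmax(scores))
    shipped = KNN(grid[best]).fit(X_train, y_train)  # refit on all training data
    return scores[best], grid[best], shipped


# Usage on a toy problem, mirroring the three checks suggested for tests/.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(60, 3))
y = (X[:, 0] > 0).astype(int)
grid = [EstimatorConfig(k=1), EstimatorConfig(k=5)]
score, config, model = myprotocol_select(X, y, grid, Settings())
assert 0.0 <= score <= 1.0
assert model.predict(X[:7]).shape == (7,)
```

The three values returned match the signature in step 1; the two assertions at the end correspond to the score-range and held-out-prediction tests from step 4.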

How to add a new dataset generator

A dataset generator is a function that returns (X_train, y_train, X_oracle, y_oracle) for a given seed and settings.

  1. Add a function in data.py:

    def myregime_dataset(settings: Settings, seed: int) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
        """Returns (X_train, y_train, X_oracle, y_oracle)."""

    Use settings.rng(SALT, seed, "myregime") for all random draws. Never call numpy.random.seed() globally, and never derive seeds from hash().

  2. Register the regime in data.py's DATASET_REGISTRY and in the CLI choices.

  3. Document the Bayes ceiling if it is analytically computable, and include it in the output of mlforge data --dataset myregime.

  4. Add to the benchmark: add the new regime to evals/harness.py and decide what shape assertion (if any) belongs in evals/gate.py.
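A generator following these steps might look like the sketch below. It is an illustration under stated assumptions: the `Settings` class, its fields, the SHA-256 seed derivation inside `rng`, and the `SALT` value are all hypothetical stand-ins for whatever `config.py` actually defines. The toy regime puts all signal in feature 0 and flips each label independently with probability `flip`, so for `flip < 0.5` its Bayes ceiling is `1 - flip`.

```python
import hashlib
from dataclasses import dataclass

import numpy as np

SALT = "example-salt"  # hypothetical value; the real SALT lives in config.py


@dataclass(frozen=True)
class Settings:        # hypothetical stand-in; field names are assumptions
    n_train: int = 200
    n_oracle: int = 2000
    n_features: int = 5
    flip: float = 0.1  # label-noise rate

    def rng(self, salt: str, seed: int, scope: str) -> np.random.Generator:
        # Assumed derivation: a SHA-256 digest is stable across processes
        # and machines, unlike the built-in hash() on strings.
        digest = hashlib.sha256(f"{salt}:{seed}:{scope}".encode("utf-8")).digest()
        return np.random.default_rng(int.from_bytes(digest[:8], "big"))


def myregime_dataset(settings: Settings, seed: int):
    """Returns (X_train, y_train, X_oracle, y_oracle)."""
    rng = settings.rng(SALT, seed, "myregime")
    n = settings.n_train + settings.n_oracle
    X = rng.normal(size=(n, settings.n_features))
    clean = (X[:, 0] > 0).astype(int)        # only feature 0 carries signal
    flipped = rng.random(n) < settings.flip  # independent label noise
    y = np.where(flipped, 1 - clean, clean)
    k = settings.n_train
    return X[:k], y[:k], X[k:], y[k:]


def bayes_ceiling(settings: Settings) -> float:
    """Best achievable accuracy for this toy regime (valid for flip < 0.5)."""
    return 1.0 - settings.flip


# Same seed, same data: the generator is a pure function of (settings, seed).
first = myregime_dataset(Settings(), seed=3)
second = myregime_dataset(Settings(), seed=3)
assert all(np.array_equal(a, b) for a, b in zip(first, second))
```

Training and oracle rows come from the same draw and are split by position, so the oracle set follows the same distribution as the training set but shares no rows with it.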

Key invariants

  • One splitting engine, one independent variable. All protocols must call cv.stratified_folds / cross_val_score. The comparison is fair only if the fold structure and random seeds are identical across protocols.
  • The oracle is never shown to the forge. X_oracle / y_oracle must only appear in the scoring step after the chosen pipeline is shipped. Never pass oracle rows to any protocol function.
  • pipeline and nested ship the same model. Both must refit on all training data using the config that a flat, honest selection (no leaked data) over the full training set would pick. If they ship different models, their oracle scores differ and the effect of the selection-bias correction can no longer be isolated.
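  • Determinism. Seed everything from Settings.rng(SALT, seed, scope). Never call numpy.random.seed() globally, and never derive seeds from the built-in hash(): its output for strings changes between interpreter runs unless PYTHONHASHSEED is fixed.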
  • Estimators are honest, not strawmen. The task's Bayes ceiling is below 1.0 by design (label noise). Do not increase signal to make optimism gaps look larger.
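The oracle invariant can be made concrete with a small scoring helper. This is a hedged sketch, not the project's `metrics.py`: the function name `oracle_report`, the `ThresholdModel` class, and the definition of optimism as reported score minus oracle score are assumptions made for illustration. The point is structural: the oracle rows enter only here, after the protocol has already returned its shipped estimator.

```python
import numpy as np


class ThresholdModel:
    """Hypothetical shipped estimator: predicts 1 when feature 0 is positive."""

    def predict(self, X):
        return (X[:, 0] > 0).astype(int)


def oracle_report(reported_score, shipped, X_oracle, y_oracle):
    """Score an already-shipped estimator on oracle rows.

    Assumes optimism = reported score - oracle score.
    """
    oracle_score = float(np.mean(shipped.predict(X_oracle) == y_oracle))
    return {
        "reported": reported_score,
        "oracle": oracle_score,
        "optimism": reported_score - oracle_score,
    }


# Usage: the protocol has finished; only now do oracle rows appear.
X_oracle = np.array([[1.0], [-1.0], [2.0], [-2.0]])
y_oracle = np.array([1, 0, 1, 1])
report = oracle_report(0.9, ThresholdModel(), X_oracle, y_oracle)
```

Because the helper takes a fitted estimator rather than a protocol function, no selection step can see `X_oracle` or `y_oracle` through it.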
