Development
git clone https://github.com/ranafaraz/MLForge.git
cd MLForge
python -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -e ".[dev]"
pytest -q # all tests should pass
python -m evals.harness # full benchmark, writes evals/RESULTS.md
python -m evals.gate # CI quality gate (asserts dissociation shape)
ruff check .
ruff format .
mypy src/mlforge
src/mlforge/
  types.py -- Dataset, Protocol, EstimatorConfig, Result dataclasses
  config.py -- Settings (reads env vars), SALT constant
  metrics.py -- accuracy, optimism computation, pooling utilities
  preprocessing.py -- StandardScaler, SelectKBest (from-scratch numpy implementations)
  estimators/
    base.py -- BaseEstimator interface + clone()
    logistic.py -- LogisticRegression (SGD, numpy)
    knn.py -- KNeighborsClassifier (numpy)
    naive_bayes.py -- GaussianNaiveBayes (numpy)
    factory.py -- estimator_zoo(), param_grid(), make_grid()
  pipeline.py -- Pipeline([preprocessor, estimator]): fit/predict/clone
  cv.py -- stratified_folds(), cross_val_score()
  selection.py -- leaky_select(), pipeline_select(), nested_select()
  data.py -- synthetic_dataset(), oracle_split(), dataset_factory()
  cli.py -- mlforge data | select | compare | eval
evals/
  metrics.py -- per-regime aggregation utilities
  harness.py -- runs all protocols × regimes × seeds, writes RESULTS.md
  gate.py -- CI gate: asserts ordered dissociation, control, null collapse
tests/ -- pytest tests (unit + integration, all offline)
examples/
  run_forge.py -- minimal usage example
docs/
  ARCHITECTURE.md -- design write-up
  DECISIONS.md -- key decisions and rationale
A protocol is a function that takes a training dataset and a parameter grid and returns a (reported_score, chosen_config) pair. It must also ship a model (refit on all training data with the chosen config) so the oracle can score it.
- Add a function in selection.py with the signature:

      def myprotocol_select(
          X_train: np.ndarray,
          y_train: np.ndarray,
          grid: list[EstimatorConfig],
          settings: Settings,
      ) -> tuple[float, EstimatorConfig, BaseEstimator]:
          """Returns (reported_cv_score, chosen_config, shipped_estimator)."""

- Register the protocol in config.py's PROTOCOL_REGISTRY and in the CLI choices in cli.py.
- Ensure it uses cv.stratified_folds / cross_val_score from the shared CV module. This is what keeps all protocols comparable: the same splitting engine, the same fold sizes, the same random seed.
- Add tests in tests/ covering: the protocol runs without error, the reported score is in [0, 1], and the shipped estimator can predict on held-out data.
- Add the protocol to evals/harness.py so it runs in the full benchmark, and update evals/gate.py if you want CI to assert a specific property about the new protocol.
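The steps above can be sketched end to end without importing MLForge. Everything in this block is a hypothetical stand-in written for illustration: the `EstimatorConfig` dataclass, the `KNN` class, and the local `stratified_folds` helper only mimic the roles of the real types and of `cv.stratified_folds`, and the `settings` argument is replaced by a plain `seed` to keep the example self-contained. A real protocol should call the shared CV module instead of its own splitter.

```python
from dataclasses import dataclass

import numpy as np


@dataclass(frozen=True)
class EstimatorConfig:
    """Hypothetical stand-in for mlforge's EstimatorConfig."""
    k: int  # number of neighbours


class KNN:
    """Tiny k-nearest-neighbours classifier for binary labels in {0, 1}."""

    def __init__(self, config: EstimatorConfig) -> None:
        self.config = config

    def fit(self, X: np.ndarray, y: np.ndarray) -> "KNN":
        self.X_, self.y_ = X, y
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Squared Euclidean distance from every query row to every training row.
        d = ((X[:, None, :] - self.X_[None, :, :]) ** 2).sum(axis=2)
        nearest = np.argsort(d, axis=1)[:, : self.config.k]
        # Majority vote; ties go to class 1.
        return (self.y_[nearest].mean(axis=1) >= 0.5).astype(int)


def stratified_folds(y: np.ndarray, n_folds: int, rng: np.random.Generator):
    """Stand-in splitter: returns a list of (train_idx, test_idx) pairs."""
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_of[idx] = np.arange(len(idx)) % n_folds  # spread each class evenly
    return [
        (np.flatnonzero(fold_of != f), np.flatnonzero(fold_of == f))
        for f in range(n_folds)
    ]


def myprotocol_select(X_train, y_train, grid, seed=0, n_folds=5):
    """Returns (reported_cv_score, chosen_config, shipped_estimator)."""
    # Build the folds once so every config is scored on identical splits.
    folds = stratified_folds(y_train, n_folds, np.random.default_rng(seed))
    best_score, best_config = -1.0, grid[0]
    for config in grid:
        accs = []
        for tr, te in folds:
            model = KNN(config).fit(X_train[tr], y_train[tr])
            accs.append(float((model.predict(X_train[te]) == y_train[te]).mean()))
        score = float(np.mean(accs))
        if score > best_score:
            best_score, best_config = score, config
    # Ship a model refit on all training data with the chosen config.
    shipped = KNN(best_config).fit(X_train, y_train)
    return best_score, best_config, shipped
```

The three checks suggested for tests/ map directly onto the return value: the call completes, the first element lies in [0, 1], and the third element can call `predict` on rows it has not seen.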
A dataset generator is a function that returns (X_train, y_train, X_oracle, y_oracle) for a given seed and settings.
- Add a function in data.py:

      def myregime_dataset(
          settings: Settings,
          seed: int,
      ) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
          """Returns (X_train, y_train, X_oracle, y_oracle)."""

  Use settings.rng(SALT, seed, "myregime") for all random draws; never use numpy.random.seed() globally or hash().
- Register the regime in data.py's DATASET_REGISTRY and in the CLI choices.
- Document the Bayes ceiling if it is analytically computable; add it to the mlforge data --dataset myregime output.
- Add to the benchmark: add the new regime to evals/harness.py and decide what shape assertion (if any) belongs in evals/gate.py.
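A generator following these steps might look like the sketch below. It is self-contained and does not use MLForge: `scoped_rng` is a hypothetical stand-in for `settings.rng(SALT, seed, scope)`, the `SALT` value is invented, and the size and noise parameters are illustrative defaults rather than anything taken from `Settings`. The stand-in derives its seed with `hashlib`, whose digests are stable across processes, whereas the built-in `hash()` of a string is randomised per interpreter run unless `PYTHONHASHSEED` is fixed.

```python
import hashlib

import numpy as np

SALT = "mlforge-example"  # hypothetical value; the real constant lives in config.py


def scoped_rng(salt: str, seed: int, scope: str) -> np.random.Generator:
    """Derive a reproducible generator from (salt, seed, scope)."""
    digest = hashlib.sha256(f"{salt}|{seed}|{scope}".encode("utf-8")).digest()
    return np.random.default_rng(int.from_bytes(digest[:8], "big"))


def myregime_dataset(
    seed: int,
    n_train: int = 200,
    n_oracle: int = 1000,
    n_features: int = 10,
    label_noise: float = 0.1,
) -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
    """Returns (X_train, y_train, X_oracle, y_oracle)."""
    rng = scoped_rng(SALT, seed, "myregime")  # the only source of randomness
    n = n_train + n_oracle
    X = rng.normal(size=(n, n_features))
    # Only the first feature carries signal; the rest are pure noise.
    y = (X[:, 0] > 0).astype(int)
    # Flip a fraction of labels so the Bayes ceiling stays below 1.0.
    flip = rng.random(n) < label_noise
    y = np.where(flip, 1 - y, y)
    return X[:n_train], y[:n_train], X[n_train:], y[n_train:]
```

For this toy regime the Bayes ceiling is analytically computable: with independent symmetric label flips at a rate below 0.5, the best possible classifier predicts the sign of the first feature and is correct with probability `1 - label_noise`.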
- One splitting engine, one independent variable. All protocols must call cv.stratified_folds / cross_val_score. The comparison is fair only if the fold structure and random seeds are identical across protocols.
- The oracle is never shown to the forge. X_oracle / y_oracle must only appear in the scoring step after the chosen pipeline is shipped. Never pass oracle rows to any protocol function.
- pipeline and nested ship the same model. Both must refit using the config that a flat honest selection (no leaked data) on all training data would pick. If they ship different models, their oracle scores differ and the selection-bias correction is no longer isolable.
- Determinism. Seed everything from Settings.rng(SALT, seed, scope). Never use numpy.random.seed() globally or hash().
- Estimators are honest, not strawmen. The task's Bayes ceiling is below 1.0 by design (label noise). Do not increase signal to make optimism gaps look larger.
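The oracle rule can be made concrete with a small scoring sketch. It assumes that optimism means the reported CV score minus the oracle score, which this document does not spell out, and every name here (`accuracy`, `score_shipped`, `ThresholdModel`) is hypothetical rather than part of MLForge. The point is the call order: the oracle arrays are touched only by a function that receives an already-shipped model.

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    """Fraction of matching labels."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))


def score_shipped(reported_score: float, shipped, X_oracle, y_oracle):
    """Score a shipped model on oracle rows; returns (oracle_score, optimism)."""
    oracle_score = accuracy(y_oracle, shipped.predict(X_oracle))
    # Assumed definition: positive optimism means the protocol over-reported.
    return oracle_score, reported_score - oracle_score


class ThresholdModel:
    """Hypothetical shipped model: predicts 1 when the first feature is positive."""

    def predict(self, X):
        return (np.asarray(X)[:, 0] > 0).astype(int)


# The protocol has already returned its reported score and shipped model;
# only now do the oracle rows enter the picture.
X_oracle = np.array([[1.0], [-1.0], [2.0], [-2.0]])
y_oracle = np.array([1, 0, 0, 0])
oracle_score, optimism = score_shipped(0.9, ThresholdModel(), X_oracle, y_oracle)
```

Keeping the oracle arrays out of every protocol signature, as in the signatures shown earlier, makes this separation easy to check by inspection.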