Home
Rana Faraz edited this page Jun 23, 2026 · 1 revision
A benchmark for when ensembling helps and which combining trick buys which kind of robustness.
EnsembleKit synthesizes base-learner predictions as log-odds of a known Bayes label, then scores four combiners on how well they recover it. The result is a clean 2x2 dissociation: competence weighting and robust aggregation each buy robustness to a different failure mode, and you need both to be robust everywhere.
- Heterogeneous learner competence (dead learners) breaks a uniform average; competence weighting fixes it.
- Intermittent per-sample corruption breaks any fixed-weight combiner; robust aggregation (median) fixes it.
- A diversity sweep shows ensemble gain collapses to zero as learner correlation rho approaches 1.
- A scrambled-label null control confirms that every AUROC signal is real rather than a pipeline artifact.
- No models to train, no datasets, no API keys -- numpy only.
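
The diversity claim above can be checked with a few lines of numpy. The sketch below is not EnsembleKit's implementation: it assumes equicorrelated Gaussian noise (one shared component plus one private component per learner), and the names `auroc`, `diversity_gain`, `k`, `n`, and `a` are hypothetical. Under that assumption the noise variance of a uniform average of K learners is rho + (1 - rho) / K, which equals the single-learner variance when rho = 1, so the gain over one learner should vanish.

```python
import numpy as np


def auroc(scores, labels):
    """Rank-based (Mann-Whitney) AUROC; assumes continuous scores, so ties are ignored."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def diversity_gain(rho, rng, k=8, n=20_000, a=0.5):
    """AUROC of a uniform average of k learners minus AUROC of a single learner."""
    y = rng.integers(0, 2, size=n)           # known Bayes label
    s = 2.0 * y - 1.0                        # signed signal, +1 or -1
    shared = rng.normal(size=(n, 1))         # noise common to all learners
    private = rng.normal(size=(n, k))        # noise unique to each learner
    # Assumed noise model: unit variance per learner, pairwise correlation rho.
    noise = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * private
    z = a * s[:, None] + noise               # z_k = a_k * s + noise
    return auroc(z.mean(axis=1), y) - auroc(z[:, 0], y)


rng = np.random.default_rng(0)
gains = {rho: diversity_gain(rho, rng) for rho in (0.0, 0.25, 0.5, 0.75, 1.0)}
for rho, gain in gains.items():
    print(f"rho={rho:.2f}  ensemble gain={gain:+.3f}")
```

At rho = 1 every learner emits the same score, so averaging cannot change the ranking and the gain is zero up to floating-point rounding.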
```mermaid
flowchart LR
    subgraph input["Base learner synthesis"]
        bayes["Known Bayes label\n(exact ground truth)"]
        learners["Base learners\nz_k = a_k * s + noise"]
    end
    subgraph combiners["Combiners = weighting x aggregation"]
        average["average\nuniform + mean"]
        weighted["weighted\ncompetence + mean"]
        robust["robust\nuniform + median"]
        full["full\ncompetence + median"]
    end
    subgraph eval["Evaluation"]
        auroc["AUROC vs Bayes label"]
        gate["Dissociation gate (CI)"]
    end
    bayes --> learners --> combiners --> auroc --> gate
```
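
The pipeline in the flowchart can be sketched end to end in numpy. This is a minimal illustration under stated assumptions, not EnsembleKit's code: every function name is hypothetical, the competence weights are taken to be the true `a_k` (an oracle; the project may estimate them differently), "competence + median" is read as a weighted median, and corruption is modelled as replacing 10% of entries with draws from a wide Gaussian. Only the regime names `het_competence` and `corrupted` come from the CLI.

```python
import numpy as np


def auroc(scores, labels):
    """Rank-based (Mann-Whitney) AUROC; assumes continuous scores, so ties are ignored."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    return (ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def synthesize(rng, competence, n=20_000, corrupt_rate=0.0):
    """Draw z_k = a_k * s + noise, where s = +/-1 encodes the known Bayes label."""
    y = rng.integers(0, 2, size=n)
    s = 2.0 * y - 1.0
    z = competence[None, :] * s[:, None] + rng.normal(size=(n, len(competence)))
    # Assumed corruption model: random entries are overwritten with wild values.
    mask = rng.random(z.shape) < corrupt_rate
    z = np.where(mask, rng.normal(scale=50.0, size=z.shape), z)
    return z, y


def weighted_mean(z, w):
    return z @ w / w.sum()


def weighted_median(z, w):
    """Per-row weighted median: first sorted value whose cumulative weight reaches half."""
    order = np.argsort(z, axis=1)
    z_sorted = np.take_along_axis(z, order, axis=1)
    cum_w = np.cumsum(w[order], axis=1)
    idx = (cum_w >= 0.5 * w.sum()).argmax(axis=1)
    return z_sorted[np.arange(len(z)), idx]


def combine_all(z, competence):
    """The 2x2 grid: {uniform, competence} weighting x {mean, median} aggregation."""
    uniform = np.ones(z.shape[1])
    return {
        "average": weighted_mean(z, uniform),
        "weighted": weighted_mean(z, competence),
        "robust": weighted_median(z, uniform),
        "full": weighted_median(z, competence),
    }


rng = np.random.default_rng(0)
regimes = {
    # Three competent learners and six dead ones (a_k = 0).
    "het_competence": dict(competence=np.array([0.7] * 3 + [0.0] * 6), corrupt_rate=0.0),
    # All learners competent, but 10% of individual predictions are corrupted.
    "corrupted": dict(competence=np.full(9, 0.7), corrupt_rate=0.1),
}
results = {}
for name, cfg in regimes.items():
    z, y = synthesize(rng, cfg["competence"], corrupt_rate=cfg["corrupt_rate"])
    scores = combine_all(z, cfg["competence"])
    results[name] = {k: auroc(v, y) for k, v in scores.items()}
    print(name, {k: round(float(v), 3) for k, v in results[name].items()})

# Scrambled-label null: with permuted labels, AUROC should sit near chance (0.5).
null_auroc = auroc(scores["average"], rng.permutation(y))
print("null", round(float(null_auroc), 3))
```

With these illustrative settings, competence weighting should help in `het_competence`, the median should help in `corrupted`, and only `full` should beat the plain average in both. The sketch omits the confidence-interval gate from the flowchart.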
```bash
pip install -e ".[dev]"
ensemblekit compare --regime het_competence
ensemblekit compare --regime corrupted
ensemblekit diversity
python -m evals.harness
pytest -q
```

- Architecture -- log-odds formulation, base learner synthesis, combiner implementations, 2x2 design
- Evaluation -- benchmark setup, results table, diversity sweep
- Configuration -- env vars, .env.example
- Development -- setup, tests, how to add a combiner or learner regime