Rana Faraz edited this page Jun 23, 2026 · 1 revision

EnsembleKit

License: MIT

A benchmark for when ensembling helps and which combining trick buys which kind of robustness.

EnsembleKit synthesizes base-learner predictions as log-odds of a known Bayes label, then scores four combiners on how well they recover it. The combiners cross two weightings (uniform or competence) with two aggregators (mean or median). The result is a clean 2x2 dissociation: competence weighting and robust aggregation each buy robustness to a different failure mode, and you need both to be robust everywhere.

  • Heterogeneous learner competence (dead learners) breaks a uniform average; competence weighting fixes it.
  • Intermittent per-sample corruption breaks any fixed-weight combiner; robust aggregation (median) fixes it.
  • A diversity sweep shows ensemble gain collapses to zero as learner correlation rho approaches 1.
  • A scrambled-label null (labels shuffled, so AUROC should fall to chance) checks that every AUROC signal is real.
  • No models to train, no datasets, no API keys -- numpy only.
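
The diversity sweep and the scrambled-label null can be sketched in a few lines of numpy. This is an illustrative sketch, not EnsembleKit's code: `auroc` and `correlated_learners` are hypothetical helpers, and it assumes a standard-normal signal, unit competence for every learner, and noise built from a shared component plus a learner-specific one so that the pairwise noise correlation equals rho.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U). Assumes continuous scores, no tie correction."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def correlated_learners(rho, k=9, n=4000, seed=0):
    """Hypothetical generator: k equally competent learners with noise correlation rho."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(n)             # assumed log-odds signal
    y = (s > 0).astype(int)                # Bayes label = sign of the signal
    common = rng.standard_normal((n, 1))   # noise shared by every learner
    indep = rng.standard_normal((n, k))    # learner-specific noise
    z = s[:, None] + np.sqrt(rho) * common + np.sqrt(1.0 - rho) * indep
    return z, y

# Ensemble gain = AUROC of the uniform average minus AUROC of a single learner.
rhos = [0.0, 0.5, 0.9, 1.0]
gains = []
for rho in rhos:
    z, y = correlated_learners(rho)
    gains.append(auroc(z.mean(axis=1), y) - auroc(z[:, 0], y))

# Null control: shuffle the labels so the scores carry no information about them.
z, y = correlated_learners(0.0)
y_scrambled = np.random.default_rng(1).permutation(y)
null_auroc = auroc(z.mean(axis=1), y_scrambled)

for rho, gain in zip(rhos, gains):
    print(f"rho={rho:.1f}  gain={gain:+.3f}")
print(f"scrambled-label AUROC={null_auroc:.3f}")
```

At rho = 1 every learner emits the same prediction, so averaging cannot cancel any noise and the gain vanishes. The scrambled-label AUROC should sit near 0.5.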
flowchart LR
    subgraph input["Base learner synthesis"]
        bayes["Known Bayes label\n(exact ground truth)"]
        learners["Base learners\nz_k = a_k * s + noise"]
    end
    subgraph combiners["Combiners = weighting x aggregation"]
        average["average\nuniform + mean"]
        weighted["weighted\ncompetence + mean"]
        robust["robust\nuniform + median"]
        full["full\ncompetence + median"]
    end
    subgraph eval["Evaluation"]
        auroc["AUROC vs Bayes label"]
        gate["Dissociation gate (CI)"]
    end
    bayes --> learners --> combiners --> auroc --> gate
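
The pipeline in the diagram can be approximated as follows. This is a minimal sketch under stated assumptions, not the project's implementation: the helper names (`synthesize`, `weighted_median`, `combine`), the use of the true competences `a_k` as oracle weights, the lower weighted median for the `full` combiner, and the corruption model (each prediction independently replaced by a large-variance outlier with probability `corrupt_p`) are all illustrative choices. Only the regime names and the form `z_k = a_k * s + noise` come from this page.

```python
import numpy as np

def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U). Assumes continuous scores, no tie correction."""
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def synthesize(a, n=4000, corrupt_p=0.0, corrupt_sd=20.0, seed=0):
    """z_k = a_k * s + noise, with optional per-sample outlier corruption (assumed model)."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(n)                       # assumed log-odds signal
    y = (s > 0).astype(int)                          # Bayes label = sign of the signal
    z = a[None, :] * s[:, None] + rng.standard_normal((n, len(a)))
    hit = rng.random(z.shape) < corrupt_p            # which predictions get corrupted
    z = np.where(hit, corrupt_sd * rng.standard_normal(z.shape), z)
    return z, y

def weighted_median(z, w):
    """Row-wise lower weighted median: first sorted value reaching half the total weight."""
    order = np.argsort(z, axis=1)
    z_sorted = np.take_along_axis(z, order, axis=1)
    cum_w = np.cumsum(w[order], axis=1)
    idx = (cum_w >= 0.5 * w.sum()).argmax(axis=1)
    return z_sorted[np.arange(len(z)), idx]

def combine(z, w):
    """The four combiners: {uniform, competence} weighting x {mean, median} aggregation."""
    return {
        "average": z.mean(axis=1),
        "weighted": z @ w / w.sum(),
        "robust": np.median(z, axis=1),
        "full": weighted_median(z, w),
    }

# Illustrative regime settings (not EnsembleKit's defaults).
regimes = {
    "het_competence": dict(a=np.array([1.0] * 3 + [0.0] * 6), corrupt_p=0.0),
    "corrupted": dict(a=np.ones(9), corrupt_p=0.1),
}
results = {}
for name, cfg in regimes.items():
    z, y = synthesize(**cfg)
    results[name] = {c: auroc(score, y) for c, score in combine(z, cfg["a"]).items()}
    print(name, {c: round(v, 3) for c, v in results[name].items()})
```

Under these settings, competence weighting should lift AUROC in the `het_competence` regime, where six of nine learners are dead, while the median-based combiners should lift it in the `corrupted` regime. With equal competences, `weighted` reduces to `average` and `full` reduces to `robust`.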

Quick start

pip install -e ".[dev]"
ensemblekit compare --regime het_competence
ensemblekit compare --regime corrupted
ensemblekit diversity
python -m evals.harness
pytest -q

Wiki pages

  • Architecture -- log-odds formulation, base learner synthesis, combiner implementations, 2x2 design
  • Evaluation -- benchmark setup, results table, diversity sweep
  • Configuration -- env vars, .env.example
  • Development -- setup, tests, how to add a combiner or learner regime
