fireflyframework · ancongui · Jun 25, 2026 · Jun 25, 2026
diff --git a/README.md b/README.md
@@ -189,8 +189,9 @@ Propose → execute (sandboxed) → observe → **verify** (correctness ≠ ran)
 | [Architecture](docs/architecture.md) | layers, hexagonal ports, auto-configuration, the DI container |
 | [Configuration](docs/configuration.md) | env / `.env` / YAML / profiles precedence |
 | [Datasets](docs/datasets.md) | the `Dataset` container and loaders |
-| [Classical AutoML](docs/automl.md) | the `AutoML` facade, trainers, search, metrics |
-| [GenAI Feature Engineering](docs/genai-features.md) | propose → execute → measure → gate |
+| [Classical AutoML](docs/automl.md) | the `AutoML` facade, trainers, search, metrics, calibration, ensembling, PR-AUC selection & CV strategies |
+| [Explainability](docs/explainability.md) | deterministic global + local feature importances (permutation, SHAP) |
+| [GenAI Feature Engineering](docs/genai-features.md) | propose → execute → measure → gate; the persisted audit trail |
 | [Agentic ML-Engineering Loop](docs/agentic-loop.md) | propose → verify → reflect → select |
 | [Deep Learning & TabFM](docs/deep-learning.md) | MLP, TabPFN, the PyTorch integration point |
 | [Serving & Lineage](docs/serving.md) | in-process and gated servers, lineage |

diff --git a/docs/README.md b/docs/README.md
@@ -24,8 +24,9 @@
 |---|---|
 | [Architecture](architecture.md) | the five layers, hexagonal ports/adapters, the DI container, auto-configuration |
 | [Datasets](datasets.md) | the `Dataset` container, loaders, `train_test_split`, task inference |
-| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard |
-| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate` |
+| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard, calibration, ensembling, PR-AUC selection & CV strategies |
+| [Explainability](explainability.md) | deterministic global + local feature importances (permutation, SHAP) |
+| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate`; the persisted audit trail |
 | [Agentic ML-Engineering Loop](agentic-loop.md) | propose → train → verify → reflect → select |
 | [Deep Learning & TabFM](deep-learning.md) | sklearn-MLP, PyTorch Lightning, HuggingFace, TabPFN |
 | [Serving & Lineage](serving.md) | the in-process server, gated backends, lineage |

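Reviewer note: the table rows above name the `CostBenefitGate` and its propose → execute → measure → gate loop. The gating step can be pictured with the minimal sketch below. It assumes a greedy gate that keeps a feature only when its measured score beats the running baseline by a margin. `gate_features`, `Decision`, `min_lift`, and the score values are hypothetical illustrations, not the library's API or measured results.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    feature: str
    decision: str  # "accepted" or "rejected"
    lift: float


def gate_features(baseline: float, candidates: dict[str, float], min_lift: float = 0.001) -> list[Decision]:
    """Greedy gate: keep a feature only when its score beats the running baseline."""
    decisions = []
    current = baseline
    for name, score in candidates.items():
        lift = score - current
        if lift >= min_lift:
            decisions.append(Decision(name, "accepted", lift))
            current = score  # later candidates must beat the improved baseline
        else:
            decisions.append(Decision(name, "rejected", lift))
    return decisions


# Hypothetical cross-validated scores, for illustration only.
decisions = gate_features(0.950, {"ratio": 0.960, "noise": 0.960, "interaction": 0.965})
for d in decisions:
    print(f"{d.decision:<9} {d.feature:<12} lift={d.lift:+.4f}")
```

In this simplification the candidate scores are given up front; a real gate would re-measure each candidate against the features already accepted.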
diff --git a/docs/brief/firefly-datascience-complete-guide.pdf b/docs/brief/firefly-datascience-complete-guide.pdf
diff --git a/docs/index.md b/docs/index.md
@@ -102,6 +102,15 @@ swappability, and security by default.
 
     [:octicons-arrow-right-24: Security model](security.md)
 
+-   :material-lightbulb-on-outline:{ .lg .middle } __Explainable & trustworthy__
+
+    ---
+
+    Deterministic global + local feature importances (permutation, SHAP) and **calibrated**
+    probabilities — so every model can be explained, and its scores trusted for real decisions.
+
+    [:octicons-arrow-right-24: Explainability](explainability.md)
+
 </div>
 
 ## Get started in 30 seconds
@@ -175,7 +184,7 @@ app = FireflyDataScienceApplication.run(config=config)
 -   :material-flask-outline:{ .middle } __[Classical AutoML](automl.md)__
 
     ---
-    The classical-first engine: train, score, select.
+    Train, score, select — with calibration, stacking ensembles, PR-AUC selection & CV strategies.
 
 -   :material-lightbulb-on-outline:{ .middle } __[Explainability](explainability.md)__
 

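Reviewer note: the new card advertises deterministic permutation importances. For readers unfamiliar with the technique, here is a standard-library sketch of the general idea: shuffle one column at a time with a seeded generator and record the mean drop in score. The function names, the toy data, and the stand-in model are assumptions made for illustration; this is not how the framework implements it.

```python
import random
from statistics import mean


def accuracy(predict, rows, labels):
    """Fraction of rows the model labels correctly."""
    return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(rows, labels))


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when one column is shuffled; the fixed seed makes it reproducible."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for col in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[col] for r in rows]
            rng.shuffle(shuffled)
            permuted = [r[:col] + (v,) + r[col + 1:] for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(mean(drops))
    return importances


# Toy data: the label depends only on column 0; column 1 is irrelevant.
rows = [(i % 2, (i * 7) % 5) for i in range(20)]
labels = [r[0] for r in rows]


def model(row):
    """Stand-in "model" that reads column 0 only."""
    return row[0]


imps = permutation_importance(model, rows, labels)
```

Because the model never reads column 1, shuffling it cannot change any prediction, so its importance is exactly zero. Re-running with the same seed returns the same numbers.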
diff --git a/docs/samples.md b/docs/samples.md
@@ -2,10 +2,11 @@
 
 **Every sample is a single, runnable script that is covered by a test — so each one is guaranteed to work.**
 
-The four scripts live in
+The five scripts live in
 [`samples/`](https://github.com/fireflyframework/fireflyframework-datascience/tree/main/samples).
-Three run **offline with no LLM key** (the GenAI steps use deterministic stand-in proposers); one
-calls a **real LLM**. Pick the card that matches what you want to see, copy its run command, and go.
+Four run **offline with no LLM key** (the GenAI steps use deterministic stand-in proposers);
+`genai_llm_showcase.py` and, when a key is present, `advanced_automl.py` use a **real LLM**.
+Pick the card that matches what you want to see, copy its run command, and go.
 
 !!! firefly "The pattern every sample demonstrates — the LLM proposes; the classical engine decides"
 
@@ -14,7 +15,7 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
     swap the LLM for a fixed proposer so the exact same gate runs without a key — see
     [GenAI Feature Engineering](genai-features.md) and the [Agentic Loop](agentic-loop.md).
 
-## The four samples
+## The five samples
 
 <div class="grid cards" markdown>
 
@@ -47,6 +48,22 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
 
     [:octicons-arrow-right-24: Use case: Lumen](use-case-lumen.md)
 
+-   :material-tune-vertical:{ .lg .middle } __`advanced_automl.py` — production-grade selection & trust__
+
+    ---
+
+    Every modeling/trust feature on the **real** `breast_cancer` data: a `StratifiedKFold` splitter,
+    **PR-AUC** selection, a **calibrated stacking ensemble**, deterministic **explainability** (global
+    importances), and a persisted **audit trail** of every GenAI gate decision. Runs **offline**, and
+    automatically uses a real LLM to propose features when a key is present. Needs the `tabular` extra
+    (`explain` adds SHAP).
+
+    ```bash
+    uv run python samples/advanced_automl.py
+    ```
+
+    [:octicons-arrow-right-24: Classical AutoML](automl.md)
+
 -   :material-database-outline:{ .lg .middle } __`industry_showcase.py` — real public data__
 
     ---
@@ -91,6 +108,7 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
     ```bash
     uv run python samples/tutorial.py            # the full guided tour
     uv run python samples/lumen_credit_risk.py   # focused credit-risk use case
+    uv run python samples/advanced_automl.py     # calibration, ensembling, PR-AUC, explainability, audit
     uv run python samples/industry_showcase.py   # real OpenML data (needs network)
     ```
 
@@ -99,7 +117,8 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
     ```bash
     export ANTHROPIC_API_KEY=sk-ant-...                                       # or OPENAI_API_KEY=...
     export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=anthropic:claude-haiku-4-5  # optional; this is the default
-    uv run python samples/genai_llm_showcase.py
+    uv run python samples/genai_llm_showcase.py   # real LLM: feature engineering + agentic loop
+    uv run python samples/advanced_automl.py      # real LLM proposes features; calibrated ensemble + explainability
     ```
 
 !!! tip "The offline samples are the place to start"

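Reviewer note: the new card selects on PR-AUC (`average_precision`). The sketch below is a plain-Python illustration of why that metric is stricter than ROC-AUC when positives are rare. It assumes untied scores and uses made-up labels, and it is not the scorer the framework calls.

```python
def average_precision(y_true, scores):
    """Mean of the precision values measured at the rank of each positive (assumes no tied scores)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# 2 positives among 100 rows, ranked 3rd and 4th by a hypothetical model.
y_true = [0, 0, 1, 1] + [0] * 96
scores = [float(100 - i) for i in range(100)]

ap = average_precision(y_true, scores)
auc = roc_auc(y_true, scores)
print(f"PR-AUC (average precision): {ap:.4f}")
print(f"ROC-AUC:                    {auc:.4f}")
```

The two false positives ranked above the real ones barely move ROC-AUC, because 96 easy negatives dominate the pair count, while average precision drops well below 0.5.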
diff --git a/samples/advanced_automl.py b/samples/advanced_automl.py
@@ -0,0 +1,152 @@
+# Copyright 2026 Firefly Software Foundation.
+"""Advanced AutoML — production-grade selection, trust, and governance, on real data.
+
+One runnable script that exercises every modeling/trust feature added in the latest release, on the
+real ``breast_cancer`` dataset (no synthetic data):
+
+  1. **Governed feature engineering with a persisted audit trail.** The LLM (or, offline, a
+     deterministic stand-in) proposes features; the cost/benefit gate keeps only measured wins; and
+     *every* decision — accepted or rejected — is appended to a JSONL audit log.
+  2. **Robust model selection.** A scikit-learn ``StratifiedKFold`` splitter drives cross-validation,
+     and the winner is chosen on **PR-AUC** (``average_precision``), which is usually more
+     informative than ROC-AUC when the positive class is rare or costly to miss.
+  3. **A calibrated stacking ensemble.** The top-k candidates are stacked, then calibrated so the
+     predicted probabilities are trustworthy (reported via the Brier score).
+  4. **Explainability.** The winner reports deterministic global feature importances on the holdout.
+
+The GenAI step uses a real LLM when a key is present (``ANTHROPIC_API_KEY`` / ``OPENAI_API_KEY`` /
+``GEMINI_API_KEY``), and a deterministic ``StaticFeatureProposer`` otherwise — so the same sample runs
+offline in CI and against a live model when credentials are available.
+
+Run it:  ``python samples/advanced_automl.py``   (needs the ``tabular`` extra; ``explain`` for SHAP)
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import tempfile
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+
+from fireflyframework_datascience.automl import AutoML
+from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader
+from fireflyframework_datascience.features import FeatureProposal, StaticFeatureProposer
+from fireflyframework_datascience.features.audit import JsonlAuditLog
+from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
+
+DEFAULT_MODEL = os.getenv("FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL", "anthropic:claude-haiku-4-5")
+
+
+def _has_llm_key() -> bool:
+    return bool(os.getenv("ANTHROPIC_API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("GEMINI_API_KEY"))
+
+
+def _make_proposer() -> tuple[Any, str]:
+    """A real LLM proposer when a key is present; otherwise a deterministic, offline stand-in."""
+    if _has_llm_key():
+        from fireflyframework_datascience.features.genai import AgentFeatureProposer
+
+        return AgentFeatureProposer(model=DEFAULT_MODEL), "llm"
+    # Offline: real feature code over real columns. The gate decides what (if anything) earns its keep.
+    proposer = StaticFeatureProposer(
+        [
+            FeatureProposal(
+                "area_to_perimeter",
+                "df['area_to_perimeter'] = df['worst area'] / (df['worst perimeter'] + 1)",
+                "compactness proxy",
+            ),
+            FeatureProposal(
+                "concavity_interaction",
+                "df['concavity_interaction'] = df['mean concavity'] * df['mean concave points']",
+                "interaction term",
+            ),
+            FeatureProposal("noise", "df['noise'] = 0.0", "should be rejected — adds nothing"),
+        ]
+    )
+    return proposer, "static"
+
+
+def _apply(engineered: Any, X: pd.DataFrame) -> pd.DataFrame:
+    """Apply the *accepted* feature code to a frame, keeping train and test consistent."""
+    from fireflyframework_datascience.features.executor import FeatureCodeExecutor
+
+    executor = FeatureCodeExecutor()
+    working = X.copy()
+    for accepted in engineered.accepted:
+        working = executor.execute(accepted.proposal.code, working)
+    return working
+
+
+def run(audit_path: str | Path | None = None) -> dict[str, Any]:
+    """Run the advanced AutoML pipeline end-to-end and return a report dict."""
+    from sklearn.model_selection import StratifiedKFold
+
+    ds = SklearnDatasetLoader().load("breast_cancer")  # real data
+    train, test = ds.train_test_split(test_size=0.25, random_state=0)
+
+    # 1. Governed GenAI feature engineering with a persisted, append-only audit trail.
+    cleanup = False
+    if audit_path is None:
+        audit_path = Path(tempfile.mkdtemp(prefix="firefly-audit-")) / "genai-decisions.jsonl"
+        cleanup = True
+    proposer, proposer_kind = _make_proposer()
+    audit = JsonlAuditLog(audit_path)
+    engineer = GenAIFeatureEngineer(proposer, audit_log=audit, cv=4)
+    engineered = engineer.engineer(train)
+
+    # 2. Robust selection: a StratifiedKFold splitter + PR-AUC, with a calibrated stacking ensemble.
+    splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
+    automl = AutoML(cv=splitter, n_trials=1, calibrate=True, ensemble=True, ensemble_size=3, random_state=0)
+    result = automl.fit(engineered.dataset, metric="average_precision")
+
+    test_engineered = test.with_features(_apply(engineered, test.X))
+    evaluation = result.evaluate(test_engineered)
+
+    # 3. Explainability — deterministic global feature importances on the holdout.
+    explanation = result.explain(test_engineered)
+
+    # 4. Read the persisted audit trail back (one JSON line per gate decision).
+    audit_records = [json.loads(line) for line in Path(audit_path).read_text(encoding="utf-8").splitlines()]
+    if cleanup:
+        Path(audit_path).unlink(missing_ok=True)
+
+    return {
+        "proposer": proposer_kind,
+        "accepted_features": [a.proposal.name for a in engineered.accepted],
+        "rejected_features": [r.proposal.name for r in engineered.rejected],
+        "fe_lift": engineered.lift,
+        "winner": result.best_model.name,
+        "selection_metric": result.metric,
+        "cv_scoring": result.cv_scoring,
+        "leaderboard": result.leaderboard_table(),
+        "holdout": evaluation.metrics,
+        "explanation_method": explanation.method,
+        "top_features": explanation.top(8),
+        "audit_trail": audit_records,
+    }
+
+
+def main() -> None:
+    report = run()
+    print("=== Advanced AutoML — calibrated stacking ensemble, PR-AUC selection, explainability ===")
+    print(f"proposer          : {report['proposer']}")
+    print(f"accepted features : {report['accepted_features']}")
+    print(f"rejected features : {report['rejected_features']}")
+    print(f"winning model     : {report['winner']}")
+    print(f"selected on       : {report['selection_metric']} (cv scorer: {report['cv_scoring']})")
+    print("leaderboard:")
+    print(report["leaderboard"])
+    print(f"holdout metrics   : {report['holdout']}")
+    print(f"explanation       : {report['explanation_method']}")
+    for name, importance in report["top_features"]:
+        print(f"  {name:<26} {importance:+.4f}")
+    print(f"audit trail       : {len(report['audit_trail'])} decisions persisted")
+    for record in report["audit_trail"]:
+        print(f"  {record['decision']:<9} {record['feature']:<24} ({record['detail']})")
+
+
+if __name__ == "__main__":
+    main()
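Reviewer note: step 4 of the sample reads the audit trail back as one JSON object per line. The sketch below shows that append-only JSONL pattern using only the standard library. `SketchAuditLog` is a hypothetical stand-in, not the library's `JsonlAuditLog`. The record keys (`feature`, `decision`, `detail`) mirror the ones `main()` prints, and the `detail` strings are placeholders.

```python
import json
import tempfile
from pathlib import Path


class SketchAuditLog:
    """Hypothetical append-only JSONL log, for illustration only."""

    def __init__(self, path):
        self.path = Path(path)

    def append(self, feature, decision, detail):
        record = {"feature": feature, "decision": decision, "detail": detail}
        # Mode "a" only ever writes at the end, so earlier decisions are never rewritten.
        with self.path.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, sort_keys=True) + "\n")

    def read(self):
        if not self.path.exists():
            return []
        lines = self.path.read_text(encoding="utf-8").splitlines()
        return [json.loads(line) for line in lines if line]


# TemporaryDirectory removes both the log file and its directory on exit.
with tempfile.TemporaryDirectory() as tmp:
    log = SketchAuditLog(Path(tmp) / "genai-decisions.jsonl")
    log.append("area_to_perimeter", "accepted", "placeholder detail")
    log.append("noise", "rejected", "placeholder detail")
    records = log.read()

for record in records:
    print(f"{record['decision']:<9} {record['feature']}")
```

One design point worth weighing in the sample itself: `run()` creates its default location with `tempfile.mkdtemp` and later unlinks only the file, which leaves the empty directory behind. A `TemporaryDirectory` context, as in this sketch, cleans up both.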
diff --git a/tests/samples/test_advanced_automl.py b/tests/samples/test_advanced_automl.py
@@ -0,0 +1,71 @@
+# Copyright 2026 Firefly Software Foundation.
+"""Smoke tests for the advanced-AutoML sample (calibration, ensembling, PR-AUC, explainability, audit).
+
+The offline test forces the deterministic proposer (so it runs in CI with no key); the integration
+test runs the *same* sample against a live LLM when ``ANTHROPIC_API_KEY`` is present.
+"""
+
+from __future__ import annotations
+
+import os
+
+import pytest
+
+
+def _load_sample():  # type: ignore[no-untyped-def]
+    import pathlib
+    import sys
+
+    sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[2] / "samples"))
+    import advanced_automl
+
+    return advanced_automl
+
+
+def test_advanced_automl_offline(monkeypatch: pytest.MonkeyPatch) -> None:
+    # Force the offline (deterministic) path regardless of the ambient environment.
+    for var in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY"):
+        monkeypatch.delenv(var, raising=False)
+
+    report = _load_sample().run()
+
+    # Offline, deterministic feature engineering with a full audit trail (one record per proposal).
+    assert report["proposer"] == "static"
+    assert "noise" in report["rejected_features"]  # the no-information feature is always rejected
+    assert len(report["audit_trail"]) == 3
+    assert all(r["decision"] in ("accepted", "rejected") for r in report["audit_trail"])
+    assert {r["feature"] for r in report["audit_trail"]} == {"area_to_perimeter", "concavity_interaction", "noise"}
+
+    # Winner is a stacking ensemble, selected on PR-AUC.
+    assert report["winner"] == "stacking_ensemble"
+    assert report["selection_metric"] == "average_precision"
+    assert report["cv_scoring"] == "average_precision"
+
+    # Richer metrics are reported, including calibration quality (Brier) and PR-AUC, on a strong model.
+    holdout = report["holdout"]
+    assert {"roc_auc", "average_precision", "brier_score"} <= set(holdout)
+    assert holdout["roc_auc"] > 0.95
+    assert 0.0 <= holdout["brier_score"] <= 0.25  # sanity bound: a constant 0.5 prediction scores exactly 0.25
+
+    # Explainability produced ranked global importances.
+    assert report["explanation_method"] in ("permutation_importance", "shap")
+    assert len(report["top_features"]) > 0
+
+
+@pytest.mark.integration
+def test_advanced_automl_with_real_llm() -> None:
+    if not os.getenv("ANTHROPIC_API_KEY"):
+        pytest.skip("needs ANTHROPIC_API_KEY")
+
+    report = _load_sample().run()
+
+    # The real LLM proposed features and every gate decision was persisted to the audit trail.
+    assert report["proposer"] == "llm"
+    assert len(report["audit_trail"]) >= 1
+    assert all(r["decision"] in ("accepted", "rejected") for r in report["audit_trail"])
+
+    # The robust selection pipeline still produced a calibrated stacking ensemble selected on PR-AUC.
+    assert report["winner"] == "stacking_ensemble"
+    assert report["selection_metric"] == "average_precision"
+    assert report["holdout"]["roc_auc"] > 0.9
+    assert len(report["top_features"]) > 0
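Reviewer note: the offline test bounds `brier_score` at 0.25. As a point of reference for that threshold, the plain-Python sketch below computes the Brier score (the mean squared error between predicted probability and outcome) for three illustrative predictors. The labels and probabilities are invented for the example.

```python
def brier_score(y_true, probs):
    """Mean squared difference between predicted probability and the 0/1 outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, probs)) / len(y_true)


y = [1, 0, 1, 1, 0, 0, 1, 0]

uninformative = brier_score(y, [0.5] * len(y))                    # always predicts 0.5
confident_right = brier_score(y, [0.9 if t else 0.1 for t in y])  # leans the right way
confident_wrong = brier_score(y, [0.1 if t else 0.9 for t in y])  # leans the wrong way

print(f"constant 0.5    : {uninformative:.4f}")
print(f"confident, right: {confident_right:.4f}")
print(f"confident, wrong: {confident_wrong:.4f}")
```

A constant 0.5 prediction lands exactly on 0.25 whatever the class balance, so a score at or below 0.25 shows the model is no worse than that uninformative baseline; a much tighter bound would be needed to claim good calibration.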