diff --git a/README.md b/README.md
index b17bf8e..a7d982f 100644
--- a/README.md
+++ b/README.md
@@ -189,8 +189,9 @@ Propose → execute (sandboxed) → observe → **verify** (correctness ≠ ran)
| [Architecture](docs/architecture.md) | layers, hexagonal ports, auto-configuration, the DI container |
| [Configuration](docs/configuration.md) | env / `.env` / YAML / profiles precedence |
| [Datasets](docs/datasets.md) | the `Dataset` container and loaders |
-| [Classical AutoML](docs/automl.md) | the `AutoML` facade, trainers, search, metrics |
-| [GenAI Feature Engineering](docs/genai-features.md) | propose → execute → measure → gate |
+| [Classical AutoML](docs/automl.md) | the `AutoML` facade, trainers, search, metrics, calibration, ensembling, PR-AUC selection & CV strategies |
+| [Explainability](docs/explainability.md) | deterministic global + local feature importances (permutation, SHAP) |
+| [GenAI Feature Engineering](docs/genai-features.md) | propose → execute → measure → gate; the persisted audit trail |
| [Agentic ML-Engineering Loop](docs/agentic-loop.md) | propose → verify → reflect → select |
| [Deep Learning & TabFM](docs/deep-learning.md) | MLP, TabPFN, the PyTorch integration point |
| [Serving & Lineage](docs/serving.md) | in-process and gated servers, lineage |
diff --git a/docs/README.md b/docs/README.md
index b477b90..5d8ed0e 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -24,8 +24,9 @@
|---|---|
| [Architecture](architecture.md) | the five layers, hexagonal ports/adapters, the DI container, auto-configuration |
| [Datasets](datasets.md) | the `Dataset` container, loaders, `train_test_split`, task inference |
-| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard |
-| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate` |
+| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard, calibration, ensembling, PR-AUC selection & CV strategies |
+| [Explainability](explainability.md) | deterministic global + local feature importances (permutation, SHAP) |
+| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate`; the persisted audit trail |
| [Agentic ML-Engineering Loop](agentic-loop.md) | propose → train → verify → reflect → select |
| [Deep Learning & TabFM](deep-learning.md) | sklearn-MLP, PyTorch Lightning, HuggingFace, TabPFN |
| [Serving & Lineage](serving.md) | the in-process server, gated backends, lineage |
diff --git a/docs/brief/firefly-datascience-complete-guide.pdf b/docs/brief/firefly-datascience-complete-guide.pdf
index 79ee13d..8c050b7 100644
Binary files a/docs/brief/firefly-datascience-complete-guide.pdf and b/docs/brief/firefly-datascience-complete-guide.pdf differ
diff --git a/docs/index.md b/docs/index.md
index 2f33cba..4a1eb64 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -102,6 +102,15 @@ swappability, and security by default.
[:octicons-arrow-right-24: Security model](security.md)
+- :material-lightbulb-on-outline:{ .lg .middle } __Explainable & trustworthy__
+
+ ---
+
+ Deterministic global + local feature importances (permutation, SHAP) and **calibrated**
+ probabilities — so every model can be explained, and its scores trusted for real decisions.
+
+ [:octicons-arrow-right-24: Explainability](explainability.md)
+
## Get started in 30 seconds
@@ -175,7 +184,7 @@ app = FireflyDataScienceApplication.run(config=config)
- :material-flask-outline:{ .middle } __[Classical AutoML](automl.md)__
---
- The classical-first engine: train, score, select.
+ Train, score, select — with calibration, stacking ensembles, PR-AUC selection & CV strategies.
- :material-lightbulb-on-outline:{ .middle } __[Explainability](explainability.md)__
diff --git a/docs/samples.md b/docs/samples.md
index c042597..551f514 100644
--- a/docs/samples.md
+++ b/docs/samples.md
@@ -2,10 +2,11 @@
**Every sample is a single, runnable script that is covered by a test — so each one is guaranteed to work.**
-The four scripts live in
+The five scripts live in
[`samples/`](https://github.com/fireflyframework/fireflyframework-datascience/tree/main/samples).
-Three run **offline with no LLM key** (the GenAI steps use deterministic stand-in proposers); one
-calls a **real LLM**. Pick the card that matches what you want to see, copy its run command, and go.
+Most run **offline with no LLM key** (the GenAI steps use deterministic stand-in proposers); two use a
+**real LLM** when a key is present. Pick the card that matches what you want to see, copy its run
+command, and go.
!!! firefly "The pattern every sample demonstrates — the LLM proposes; the classical engine decides"
@@ -14,7 +15,7 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
swap the LLM for a fixed proposer so the exact same gate runs without a key — see
[GenAI Feature Engineering](genai-features.md) and the [Agentic Loop](agentic-loop.md).
-## The four samples
+## The five samples
@@ -47,6 +48,22 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
[:octicons-arrow-right-24: Use case: Lumen](use-case-lumen.md)
+- :material-tune-vertical:{ .lg .middle } __`advanced_automl.py` — production-grade selection & trust__
+
+ ---
+
+ Every modeling/trust feature on the **real** `breast_cancer` data: a `StratifiedKFold` splitter,
+ **PR-AUC** selection, a **calibrated stacking ensemble**, deterministic **explainability** (global
+ importances), and a persisted **audit trail** of every GenAI gate decision. Runs **offline**, and
+ automatically uses a real LLM to propose features when a key is present. Needs the `tabular` extra
+ (`explain` adds SHAP).
+
+ ```bash
+ uv run python samples/advanced_automl.py
+ ```
+
+ [:octicons-arrow-right-24: Classical AutoML](automl.md)
+
- :material-database-outline:{ .lg .middle } __`industry_showcase.py` — real public data__
---
@@ -91,6 +108,7 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
```bash
uv run python samples/tutorial.py # the full guided tour
uv run python samples/lumen_credit_risk.py # focused credit-risk use case
+ uv run python samples/advanced_automl.py # calibration, ensembling, PR-AUC, explainability, audit
uv run python samples/industry_showcase.py # real OpenML data (needs network)
```
@@ -99,7 +117,8 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
```bash
export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY=...
export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=anthropic:claude-haiku-4-5 # optional; this is the default
- uv run python samples/genai_llm_showcase.py
+ uv run python samples/genai_llm_showcase.py # real LLM: feature engineering + agentic loop
+ uv run python samples/advanced_automl.py # real LLM proposes features; calibrated ensemble + explainability
```
!!! tip "The offline samples are the place to start"
diff --git a/samples/advanced_automl.py b/samples/advanced_automl.py
new file mode 100644
index 0000000..107e2d9
--- /dev/null
+++ b/samples/advanced_automl.py
@@ -0,0 +1,152 @@
+# Copyright 2026 Firefly Software Foundation.
+"""Advanced AutoML — production-grade selection, trust, and governance, on real data.
+
+One runnable script that exercises every modeling/trust feature added in the latest release, on the
+real ``breast_cancer`` dataset (no synthetic data):
+
+ 1. **Governed feature engineering with a persisted audit trail.** The LLM (or, offline, a
+ deterministic stand-in) proposes features; the cost/benefit gate keeps only measured wins; and
+ *every* decision — accepted or rejected — is appended to a JSONL audit log.
+ 2. **Robust model selection.** A scikit-learn ``StratifiedKFold`` splitter drives cross-validation,
+ and the winner is chosen on **PR-AUC** (``average_precision``) — the right target for imbalanced
+ or cost-sensitive binary problems.
+ 3. **A calibrated stacking ensemble.** The top-k candidates are stacked, then calibrated so the
+ predicted probabilities are trustworthy (reported via the Brier score).
+ 4. **Explainability.** The winner reports deterministic global feature importances on the holdout.
+
+The GenAI step uses a real LLM when a key is present (``ANTHROPIC_API_KEY`` / ``OPENAI_API_KEY`` /
+``GEMINI_API_KEY``), and a deterministic ``StaticFeatureProposer`` otherwise — so the same sample runs
+offline in CI and against a live model when credentials are available.
+
+Run it: ``python samples/advanced_automl.py`` (needs the ``tabular`` extra; ``explain`` for SHAP)
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import tempfile
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+
+from fireflyframework_datascience.automl import AutoML
+from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader
+from fireflyframework_datascience.features import FeatureProposal, StaticFeatureProposer
+from fireflyframework_datascience.features.audit import JsonlAuditLog
+from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
+
+DEFAULT_MODEL = os.getenv("FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL", "anthropic:claude-haiku-4-5")
+
+
+def _has_llm_key() -> bool:
+    return bool(os.getenv("ANTHROPIC_API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("GEMINI_API_KEY"))
+
+
+def _make_proposer() -> tuple[Any, str]:
+    """A real LLM proposer when a key is present; otherwise a deterministic, offline stand-in."""
+    if _has_llm_key():
+        from fireflyframework_datascience.features.genai import AgentFeatureProposer
+
+        return AgentFeatureProposer(model=DEFAULT_MODEL), "llm"
+    # Offline: real feature code over real columns. The gate decides what (if anything) earns its keep.
+    proposer = StaticFeatureProposer(
+        [
+            FeatureProposal(
+                "area_to_perimeter",
+                "df['area_to_perimeter'] = df['worst area'] / (df['worst perimeter'] + 1)",
+                "compactness proxy",
+            ),
+            FeatureProposal(
+                "concavity_interaction",
+                "df['concavity_interaction'] = df['mean concavity'] * df['mean concave points']",
+                "interaction term",
+            ),
+            FeatureProposal("noise", "df['noise'] = 0.0", "should be rejected — adds nothing"),
+        ]
+    )
+    return proposer, "static"
+
+
+def _apply(engineered: Any, X: pd.DataFrame) -> pd.DataFrame:
+    """Apply the *accepted* feature code to a frame, keeping train and test consistent."""
+    from fireflyframework_datascience.features.executor import FeatureCodeExecutor
+
+    executor = FeatureCodeExecutor()
+    working = X.copy()
+    for accepted in engineered.accepted:
+        working = executor.execute(accepted.proposal.code, working)
+    return working
+
+
+def run(audit_path: str | Path | None = None) -> dict[str, Any]:
+    """Run the advanced AutoML pipeline end-to-end and return a report dict."""
+    from sklearn.model_selection import StratifiedKFold
+
+    ds = SklearnDatasetLoader().load("breast_cancer")  # real data
+    train, test = ds.train_test_split(test_size=0.25, random_state=0)
+
+    # 1. Governed GenAI feature engineering with a persisted, append-only audit trail.
+    cleanup = False
+    if audit_path is None:
+        audit_path = Path(tempfile.mkdtemp(prefix="firefly-audit-")) / "genai-decisions.jsonl"
+        cleanup = True
+    proposer, proposer_kind = _make_proposer()
+    audit = JsonlAuditLog(audit_path)
+    engineer = GenAIFeatureEngineer(proposer, audit_log=audit, cv=4)
+    engineered = engineer.engineer(train)
+
+    # 2. Robust selection: a StratifiedKFold splitter + PR-AUC, with a calibrated stacking ensemble.
+    splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
+    automl = AutoML(cv=splitter, n_trials=1, calibrate=True, ensemble=True, ensemble_size=3, random_state=0)
+    result = automl.fit(engineered.dataset, metric="average_precision")
+
+    test_engineered = test.with_features(_apply(engineered, test.X))
+    evaluation = result.evaluate(test_engineered)
+
+    # 3. Explainability — deterministic global feature importances on the holdout.
+    explanation = result.explain(test_engineered)
+
+    # 4. Read the persisted audit trail back (one JSON line per gate decision).
+    audit_records = [json.loads(line) for line in Path(audit_path).read_text(encoding="utf-8").splitlines()]
+    if cleanup:
+        Path(audit_path).unlink(missing_ok=True)
+
+    return {
+        "proposer": proposer_kind,
+        "accepted_features": [a.proposal.name for a in engineered.accepted],
+        "rejected_features": [r.proposal.name for r in engineered.rejected],
+        "fe_lift": engineered.lift,
+        "winner": result.best_model.name,
+        "selection_metric": result.metric,
+        "cv_scoring": result.cv_scoring,
+        "leaderboard": result.leaderboard_table(),
+        "holdout": evaluation.metrics,
+        "explanation_method": explanation.method,
+        "top_features": explanation.top(8),
+        "audit_trail": audit_records,
+    }
+
+
+def main() -> None:
+    report = run()
+    print("=== Advanced AutoML — calibrated stacking ensemble, PR-AUC selection, explainability ===")
+    print(f"proposer          : {report['proposer']}")
+    print(f"accepted features : {report['accepted_features']}")
+    print(f"rejected features : {report['rejected_features']}")
+    print(f"winning model     : {report['winner']}")
+    print(f"selected on       : {report['selection_metric']} (cv scorer: {report['cv_scoring']})")
+    print("leaderboard:")
+    print(report["leaderboard"])
+    print(f"holdout metrics   : {report['holdout']}")
+    print(f"explanation       : {report['explanation_method']}")
+    for name, importance in report["top_features"]:
+        print(f"  {name:<26} {importance:+.4f}")
+    print(f"audit trail       : {len(report['audit_trail'])} decisions persisted")
+    for record in report["audit_trail"]:
+        print(f"  {record['decision']:<9} {record['feature']:<24} ({record['detail']})")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tests/samples/test_advanced_automl.py b/tests/samples/test_advanced_automl.py
new file mode 100644
index 0000000..47166fe
--- /dev/null
+++ b/tests/samples/test_advanced_automl.py
@@ -0,0 +1,71 @@
+# Copyright 2026 Firefly Software Foundation.
+"""Smoke tests for the advanced-AutoML sample (calibration, ensembling, PR-AUC, explainability, audit).
+
+The offline test forces the deterministic proposer (so it runs in CI with no key); the integration
+test runs the *same* sample against a live LLM when ``ANTHROPIC_API_KEY`` is present.
+"""
+
+from __future__ import annotations
+
+import os
+
+import pytest
+
+
+def _load_sample():  # type: ignore[no-untyped-def]
+    import pathlib
+    import sys
+
+    sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[2] / "samples"))
+    import advanced_automl
+
+    return advanced_automl
+
+
+def test_advanced_automl_offline(monkeypatch: pytest.MonkeyPatch) -> None:
+    # Force the offline (deterministic) path regardless of the ambient environment.
+    for var in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY"):
+        monkeypatch.delenv(var, raising=False)
+
+    report = _load_sample().run()
+
+    # Offline, deterministic feature engineering with a full audit trail (one record per proposal).
+    assert report["proposer"] == "static"
+    assert "noise" in report["rejected_features"]  # the no-information feature is always rejected
+    assert len(report["audit_trail"]) == 3
+    assert all(r["decision"] in ("accepted", "rejected") for r in report["audit_trail"])
+    assert {r["feature"] for r in report["audit_trail"]} == {"area_to_perimeter", "concavity_interaction", "noise"}
+
+    # Winner is a stacking ensemble, selected on PR-AUC.
+    assert report["winner"] == "stacking_ensemble"
+    assert report["selection_metric"] == "average_precision"
+    assert report["cv_scoring"] == "average_precision"
+
+    # Richer metrics are reported, including calibration quality (Brier) and PR-AUC, on a strong model.
+    holdout = report["holdout"]
+    assert {"roc_auc", "average_precision", "brier_score"} <= set(holdout)
+    assert holdout["roc_auc"] > 0.95
+    assert 0.0 <= holdout["brier_score"] <= 0.25  # no worse than always predicting 0.5 (Brier 0.25)
+
+    # Explainability produced ranked global importances.
+    assert report["explanation_method"] in ("permutation_importance", "shap")
+    assert len(report["top_features"]) > 0
+
+
+@pytest.mark.integration
+def test_advanced_automl_with_real_llm() -> None:
+    if not os.getenv("ANTHROPIC_API_KEY"):
+        pytest.skip("needs ANTHROPIC_API_KEY")
+
+    report = _load_sample().run()
+
+    # The real LLM proposed features and every gate decision was persisted to the audit trail.
+    assert report["proposer"] == "llm"
+    assert len(report["audit_trail"]) >= 1
+    assert all(r["decision"] in ("accepted", "rejected") for r in report["audit_trail"])
+
+    # The robust selection pipeline still produced a calibrated stacking ensemble selected on PR-AUC.
+    assert report["winner"] == "stacking_ensemble"
+    assert report["selection_metric"] == "average_precision"
+    assert report["holdout"]["roc_auc"] > 0.9
+    assert len(report["top_features"]) > 0
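
Review note on the `brier_score` bound in `test_advanced_automl_offline`: the Brier score is the mean squared error between predicted probabilities and the 0/1 labels, so a model that always predicts 0.5 scores exactly 0.25 whatever the labels are. The sketch below illustrates that reference point. It is a standalone illustration only; `brier_score` here is a hypothetical helper, not the framework's implementation, and the labels and probabilities are made up for the example.

```python
from collections.abc import Sequence


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels.

    Hypothetical helper for illustration; not part of the framework's API.
    """
    if len(y_true) != len(y_prob) or len(y_true) == 0:
        raise ValueError("y_true and y_prob must be non-empty and the same length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


# Made-up labels and predictions, purely for illustration.
labels = [1, 0, 1, 1, 0]

# An uninformed model that always predicts 0.5 scores 0.25 on any labels.
uninformed = brier_score(labels, [0.5] * len(labels))

# Confident, mostly-correct probabilities score much lower (better).
confident = brier_score(labels, [0.9, 0.1, 0.8, 0.7, 0.2])

print(f"uninformed: {uninformed:.3f}  confident: {confident:.3f}")
```

Read this way, an upper bound of 0.25 checks that the ensemble beats an uninformed predictor; a tighter bound would be needed to assert calibration quality as such.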