diff --git a/README.md b/README.md
index b17bf8e..a7d982f 100644
--- a/README.md
+++ b/README.md
@@ -189,8 +189,9 @@ Propose → execute (sandboxed) → observe → **verify** (correctness ≠ ran)
 | [Architecture](docs/architecture.md) | layers, hexagonal ports, auto-configuration, the DI container |
 | [Configuration](docs/configuration.md) | env / `.env` / YAML / profiles precedence |
 | [Datasets](docs/datasets.md) | the `Dataset` container and loaders |
-| [Classical AutoML](docs/automl.md) | the `AutoML` facade, trainers, search, metrics |
-| [GenAI Feature Engineering](docs/genai-features.md) | propose → execute → measure → gate |
+| [Classical AutoML](docs/automl.md) | the `AutoML` facade, trainers, search, metrics, calibration, ensembling, PR-AUC selection & CV strategies |
+| [Explainability](docs/explainability.md) | deterministic global + local feature importances (permutation, SHAP) |
+| [GenAI Feature Engineering](docs/genai-features.md) | propose → execute → measure → gate; the persisted audit trail |
 | [Agentic ML-Engineering Loop](docs/agentic-loop.md) | propose → verify → reflect → select |
 | [Deep Learning & TabFM](docs/deep-learning.md) | MLP, TabPFN, the PyTorch integration point |
 | [Serving & Lineage](docs/serving.md) | in-process and gated servers, lineage |
diff --git a/docs/README.md b/docs/README.md
index b477b90..5d8ed0e 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -24,8 +24,9 @@
 |---|---|
 | [Architecture](architecture.md) | the five layers, hexagonal ports/adapters, the DI container, auto-configuration |
 | [Datasets](datasets.md) | the `Dataset` container, loaders, `train_test_split`, task inference |
-| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard |
-| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate` |
+| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard, calibration, ensembling, PR-AUC selection & CV strategies |
+| [Explainability](explainability.md) | deterministic global + local feature importances (permutation, SHAP) |
+| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate`; the persisted audit trail |
 | [Agentic ML-Engineering Loop](agentic-loop.md) | propose → train → verify → reflect → select |
 | [Deep Learning & TabFM](deep-learning.md) | sklearn-MLP, PyTorch Lightning, HuggingFace, TabPFN |
 | [Serving & Lineage](serving.md) | the in-process server, gated backends, lineage |
diff --git a/docs/brief/firefly-datascience-complete-guide.pdf b/docs/brief/firefly-datascience-complete-guide.pdf
index 79ee13d..8c050b7 100644
Binary files a/docs/brief/firefly-datascience-complete-guide.pdf and b/docs/brief/firefly-datascience-complete-guide.pdf differ
diff --git a/docs/index.md b/docs/index.md
index 2f33cba..4a1eb64 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -102,6 +102,15 @@ swappability, and security by default.
     [:octicons-arrow-right-24: Security model](security.md)
+- :material-lightbulb-on-outline:{ .lg .middle } __Explainable & trustworthy__
+
+    ---
+
+    Deterministic global + local feature importances (permutation, SHAP) and **calibrated**
+    probabilities — so every model can be explained, and its scores trusted for real decisions.
+
+    [:octicons-arrow-right-24: Explainability](explainability.md)
+
 ## Get started in 30 seconds
@@ -175,7 +184,7 @@ app = FireflyDataScienceApplication.run(config=config)
 - :material-flask-outline:{ .middle } __[Classical AutoML](automl.md)__
     ---
-    The classical-first engine: train, score, select.
+    Train, score, select — with calibration, stacking ensembles, PR-AUC selection & CV strategies.
 - :material-lightbulb-on-outline:{ .middle } __[Explainability](explainability.md)__
diff --git a/docs/samples.md b/docs/samples.md
index c042597..551f514 100644
--- a/docs/samples.md
+++ b/docs/samples.md
@@ -2,10 +2,11 @@
 **Every sample is a single, runnable script that is covered by a test — so each one is guaranteed to work.**
-The four scripts live in
+The five scripts live in
 [`samples/`](https://github.com/fireflyframework/fireflyframework-datascience/tree/main/samples).
-Three run **offline with no LLM key** (the GenAI steps use deterministic stand-in proposers); one
-calls a **real LLM**. Pick the card that matches what you want to see, copy its run command, and go.
+Most run **offline with no LLM key** (the GenAI steps use deterministic stand-in proposers); two use a
+**real LLM** when a key is present. Pick the card that matches what you want to see, copy its run
+command, and go.
 !!! firefly "The pattern every sample demonstrates — the LLM proposes; the classical engine decides"
@@ -14,7 +15,7 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
     swap the LLM for a fixed proposer so the exact same gate runs without a key — see
     [GenAI Feature Engineering](genai-features.md) and the [Agentic Loop](agentic-loop.md).
-## The four samples
+## The five samples
@@ -47,6 +48,22 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
     [:octicons-arrow-right-24: Use case: Lumen](use-case-lumen.md)
+- :material-tune-vertical:{ .lg .middle } __`advanced_automl.py` — production-grade selection & trust__
+
+    ---
+
+    Every modeling/trust feature on the **real** `breast_cancer` data: a `StratifiedKFold` splitter,
+    **PR-AUC** selection, a **calibrated stacking ensemble**, deterministic **explainability** (global
+    importances), and a persisted **audit trail** of every GenAI gate decision. Runs **offline**, and
+    automatically uses a real LLM to propose features when a key is present. Needs the `tabular` extra
+    (`explain` adds SHAP).
+
+    ```bash
+    uv run python samples/advanced_automl.py
+    ```
+
+    [:octicons-arrow-right-24: Classical AutoML](automl.md)
+
 - :material-database-outline:{ .lg .middle } __`industry_showcase.py` — real public data__
     ---
@@ -91,6 +108,7 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
     ```bash
     uv run python samples/tutorial.py # the full guided tour
     uv run python samples/lumen_credit_risk.py # focused credit-risk use case
+    uv run python samples/advanced_automl.py # calibration, ensembling, PR-AUC, explainability, audit
     uv run python samples/industry_showcase.py # real OpenML data (needs network)
     ```
@@ -99,7 +117,8 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
     ```bash
     export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY=...
     export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=anthropic:claude-haiku-4-5 # optional; this is the default
-    uv run python samples/genai_llm_showcase.py
+    uv run python samples/genai_llm_showcase.py # real LLM: feature engineering + agentic loop
+    uv run python samples/advanced_automl.py # real LLM proposes features; calibrated ensemble + explainability
     ```
 !!! tip "The offline samples are the place to start"
diff --git a/samples/advanced_automl.py b/samples/advanced_automl.py
new file mode 100644
index 0000000..107e2d9
--- /dev/null
+++ b/samples/advanced_automl.py
@@ -0,0 +1,152 @@
+# Copyright 2026 Firefly Software Foundation.
+"""Advanced AutoML — production-grade selection, trust, and governance, on real data.
+
+One runnable script that exercises every modeling/trust feature added in the latest release, on the
+real ``breast_cancer`` dataset (no synthetic data):
+
+  1. **Governed feature engineering with a persisted audit trail.** The LLM (or, offline, a
+     deterministic stand-in) proposes features; the cost/benefit gate keeps only measured wins; and
+     *every* decision — accepted or rejected — is appended to a JSONL audit log.
+  2. **Robust model selection.** A scikit-learn ``StratifiedKFold`` splitter drives cross-validation,
+     and the winner is chosen on **PR-AUC** (``average_precision``) — the right target for imbalanced
+     or cost-sensitive binary problems.
+  3. **A calibrated stacking ensemble.** The top-k candidates are stacked, then calibrated so the
+     predicted probabilities are trustworthy (reported via the Brier score).
+  4. **Explainability.** The winner reports deterministic global feature importances on the holdout.
+
+The GenAI step uses a real LLM when a key is present (``ANTHROPIC_API_KEY`` / ``OPENAI_API_KEY`` /
+``GEMINI_API_KEY``), and a deterministic ``StaticFeatureProposer`` otherwise — so the same sample runs
+offline in CI and against a live model when credentials are available.
+
+Run it: ``python samples/advanced_automl.py`` (needs the ``tabular`` extra; ``explain`` for SHAP)
+"""
+
+from __future__ import annotations
+
+import json
+import os
+import tempfile
+from pathlib import Path
+from typing import Any
+
+import pandas as pd
+
+from fireflyframework_datascience.automl import AutoML
+from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader
+from fireflyframework_datascience.features import FeatureProposal, StaticFeatureProposer
+from fireflyframework_datascience.features.audit import JsonlAuditLog
+from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
+
+DEFAULT_MODEL = os.getenv("FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL", "anthropic:claude-haiku-4-5")
+
+
+def _has_llm_key() -> bool:
+    return bool(os.getenv("ANTHROPIC_API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("GEMINI_API_KEY"))
+
+
+def _make_proposer() -> tuple[Any, str]:
+    """A real LLM proposer when a key is present; otherwise a deterministic, offline stand-in."""
+    if _has_llm_key():
+        from fireflyframework_datascience.features.genai import AgentFeatureProposer
+
+        return AgentFeatureProposer(model=DEFAULT_MODEL), "llm"
+    # Offline: real feature code over real columns. The gate decides what (if anything) earns its keep.
+    proposer = StaticFeatureProposer(
+        [
+            FeatureProposal(
+                "area_to_perimeter",
+                "df['area_to_perimeter'] = df['worst area'] / (df['worst perimeter'] + 1)",
+                "compactness proxy",
+            ),
+            FeatureProposal(
+                "concavity_interaction",
+                "df['concavity_interaction'] = df['mean concavity'] * df['mean concave points']",
+                "interaction term",
+            ),
+            FeatureProposal("noise", "df['noise'] = 0.0", "should be rejected — adds nothing"),
+        ]
+    )
+    return proposer, "static"
+
+
+def _apply(engineered: Any, X: pd.DataFrame) -> pd.DataFrame:
+    """Apply the *accepted* feature code to a frame, keeping train and test consistent."""
+    from fireflyframework_datascience.features.executor import FeatureCodeExecutor
+
+    executor = FeatureCodeExecutor()
+    working = X.copy()
+    for accepted in engineered.accepted:
+        working = executor.execute(accepted.proposal.code, working)
+    return working
+
+
+def run(audit_path: str | Path | None = None) -> dict[str, Any]:
+    """Run the advanced AutoML pipeline end-to-end and return a report dict."""
+    from sklearn.model_selection import StratifiedKFold
+
+    ds = SklearnDatasetLoader().load("breast_cancer")  # real data
+    train, test = ds.train_test_split(test_size=0.25, random_state=0)
+
+    # 1. Governed GenAI feature engineering with a persisted, append-only audit trail.
+    cleanup = False
+    if audit_path is None:
+        audit_path = Path(tempfile.mkdtemp(prefix="firefly-audit-")) / "genai-decisions.jsonl"
+        cleanup = True
+    proposer, proposer_kind = _make_proposer()
+    audit = JsonlAuditLog(audit_path)
+    engineer = GenAIFeatureEngineer(proposer, audit_log=audit, cv=4)
+    engineered = engineer.engineer(train)
+
+    # 2. Robust selection: a StratifiedKFold splitter + PR-AUC, with a calibrated stacking ensemble.
+    splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
+    automl = AutoML(cv=splitter, n_trials=1, calibrate=True, ensemble=True, ensemble_size=3, random_state=0)
+    result = automl.fit(engineered.dataset, metric="average_precision")
+
+    test_engineered = test.with_features(_apply(engineered, test.X))
+    evaluation = result.evaluate(test_engineered)
+
+    # 3. Explainability — deterministic global feature importances on the holdout.
+    explanation = result.explain(test_engineered)
+
+    # 4. Read the persisted audit trail back (one JSON line per gate decision).
+    audit_records = [json.loads(line) for line in Path(audit_path).read_text(encoding="utf-8").splitlines()]
+    if cleanup:
+        Path(audit_path).unlink(missing_ok=True)
+
+    return {
+        "proposer": proposer_kind,
+        "accepted_features": [a.proposal.name for a in engineered.accepted],
+        "rejected_features": [r.proposal.name for r in engineered.rejected],
+        "fe_lift": engineered.lift,
+        "winner": result.best_model.name,
+        "selection_metric": result.metric,
+        "cv_scoring": result.cv_scoring,
+        "leaderboard": result.leaderboard_table(),
+        "holdout": evaluation.metrics,
+        "explanation_method": explanation.method,
+        "top_features": explanation.top(8),
+        "audit_trail": audit_records,
+    }
+
+
+def main() -> None:
+    report = run()
+    print("=== Advanced AutoML — calibrated stacking ensemble, PR-AUC selection, explainability ===")
+    print(f"proposer : {report['proposer']}")
+    print(f"accepted features : {report['accepted_features']}")
+    print(f"rejected features : {report['rejected_features']}")
+    print(f"winning model : {report['winner']}")
+    print(f"selected on : {report['selection_metric']} (cv scorer: {report['cv_scoring']})")
+    print("leaderboard:")
+    print(report["leaderboard"])
+    print(f"holdout metrics : {report['holdout']}")
+    print(f"explanation : {report['explanation_method']}")
+    for name, importance in report["top_features"]:
+        print(f" {name:<26} {importance:+.4f}")
+    print(f"audit trail : {len(report['audit_trail'])} decisions persisted")
+    for record in report["audit_trail"]:
+        print(f" {record['decision']:<9} {record['feature']:<24} ({record['detail']})")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/tests/samples/test_advanced_automl.py b/tests/samples/test_advanced_automl.py
new file mode 100644
index 0000000..47166fe
--- /dev/null
+++ b/tests/samples/test_advanced_automl.py
@@ -0,0 +1,71 @@
+# Copyright 2026 Firefly Software Foundation.
+"""Smoke tests for the advanced-AutoML sample (calibration, ensembling, PR-AUC, explainability, audit).
+
+The offline test forces the deterministic proposer (so it runs in CI with no key); the integration
+test runs the *same* sample against a live LLM when ``ANTHROPIC_API_KEY`` is present.
+"""
+
+from __future__ import annotations
+
+import os
+
+import pytest
+
+
+def _load_sample():  # type: ignore[no-untyped-def]
+    import pathlib
+    import sys
+
+    sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[2] / "samples"))
+    import advanced_automl
+
+    return advanced_automl
+
+
+def test_advanced_automl_offline(monkeypatch: pytest.MonkeyPatch) -> None:
+    # Force the offline (deterministic) path regardless of the ambient environment.
+    for var in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY"):
+        monkeypatch.delenv(var, raising=False)
+
+    report = _load_sample().run()
+
+    # Offline, deterministic feature engineering with a full audit trail (one record per proposal).
+    assert report["proposer"] == "static"
+    assert "noise" in report["rejected_features"]  # the no-information feature is always rejected
+    assert len(report["audit_trail"]) == 3
+    assert all(r["decision"] in ("accepted", "rejected") for r in report["audit_trail"])
+    assert {r["feature"] for r in report["audit_trail"]} == {"area_to_perimeter", "concavity_interaction", "noise"}
+
+    # Winner is a stacking ensemble, selected on PR-AUC.
+    assert report["winner"] == "stacking_ensemble"
+    assert report["selection_metric"] == "average_precision"
+    assert report["cv_scoring"] == "average_precision"
+
+    # Richer metrics are reported, including calibration quality (Brier) and PR-AUC, on a strong model.
+    holdout = report["holdout"]
+    assert {"roc_auc", "average_precision", "brier_score"} <= set(holdout)
+    assert holdout["roc_auc"] > 0.95
+    assert 0.0 <= holdout["brier_score"] <= 0.25  # sanity bound: a constant 0.5 prediction scores 0.25
+
+    # Explainability produced ranked global importances.
+    assert report["explanation_method"] in ("permutation_importance", "shap")
+    assert len(report["top_features"]) > 0
+
+
+@pytest.mark.integration
+def test_advanced_automl_with_real_llm() -> None:
+    if not os.getenv("ANTHROPIC_API_KEY"):
+        pytest.skip("needs ANTHROPIC_API_KEY")
+
+    report = _load_sample().run()
+
+    # The real LLM proposed features and every gate decision was persisted to the audit trail.
+    assert report["proposer"] == "llm"
+    assert len(report["audit_trail"]) >= 1
+    assert all(r["decision"] in ("accepted", "rejected") for r in report["audit_trail"])
+
+    # The robust selection pipeline still produced a calibrated stacking ensemble selected on PR-AUC.
+    assert report["winner"] == "stacking_ensemble"
+    assert report["selection_metric"] == "average_precision"
+    assert report["holdout"]["roc_auc"] > 0.9
+    assert len(report["top_features"]) > 0
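
Reviewer note on the holdout assertions: the tests above check `brier_score` and `average_precision`. To show what those numbers mean, here is a minimal pure-Python sketch of both metrics. It is an illustration only, not the framework's implementation; the function names `brier_score` and `average_precision` are hypothetical helpers introduced for this note. The average-precision sum follows the usual step-wise definition, precision weighted by each gain in recall.

```python
from __future__ import annotations


def brier_score(y_true: list[int], y_prob: list[float]) -> float:
    """Mean squared gap between the predicted probability and the 0/1 outcome."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("y_true and y_prob must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def average_precision(y_true: list[int], y_score: list[float]) -> float:
    """Step-wise area under the precision-recall curve (illustrative helper)."""
    positives = sum(y_true)
    if positives == 0:
        raise ValueError("average precision needs at least one positive label")
    # Rank by descending score; tied scores are treated as a single threshold.
    ranked = sorted(zip(y_score, y_true), key=lambda pair: -pair[0])
    total = 0.0
    true_pos = 0
    seen = 0
    prev_recall = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every example that shares this score before measuring.
        while i < len(ranked) and ranked[i][0] == threshold:
            true_pos += ranked[i][1]
            seen += 1
            i += 1
        recall = true_pos / positives
        precision = true_pos / seen
        total += (recall - prev_recall) * precision
        prev_recall = recall
    return total


if __name__ == "__main__":
    labels = [1, 0, 1, 1, 0]
    # An uninformative constant 0.5 prediction lands exactly on the 0.25 bound.
    print(brier_score(labels, [0.5] * len(labels)))  # 0.25
    # A mostly correct ranking: 29/36, about 0.806.
    print(average_precision(labels, [0.9, 0.8, 0.7, 0.6, 0.1]))
```

Read this way, the `<= 0.25` Brier bound is a loose sanity check: a constant 0.5 prediction already reaches it. With tied scores, average precision collapses to the positive rate, which is why PR-AUC is read relative to class prevalence.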