Merged
5 changes: 3 additions & 2 deletions README.md
@@ -189,8 +189,9 @@ Propose → execute (sandboxed) → observe → **verify** (correctness ≠ ran)
| [Architecture](docs/architecture.md) | layers, hexagonal ports, auto-configuration, the DI container |
| [Configuration](docs/configuration.md) | env / `.env` / YAML / profiles precedence |
| [Datasets](docs/datasets.md) | the `Dataset` container and loaders |
| [Classical AutoML](docs/automl.md) | the `AutoML` facade, trainers, search, metrics |
| [GenAI Feature Engineering](docs/genai-features.md) | propose → execute → measure → gate |
| [Classical AutoML](docs/automl.md) | the `AutoML` facade, trainers, search, metrics, calibration, ensembling, PR-AUC selection & CV strategies |
| [Explainability](docs/explainability.md) | deterministic global + local feature importances (permutation, SHAP) |
| [GenAI Feature Engineering](docs/genai-features.md) | propose → execute → measure → gate; the persisted audit trail |
| [Agentic ML-Engineering Loop](docs/agentic-loop.md) | propose → verify → reflect → select |
| [Deep Learning & TabFM](docs/deep-learning.md) | MLP, TabPFN, the PyTorch integration point |
| [Serving & Lineage](docs/serving.md) | in-process and gated servers, lineage |
5 changes: 3 additions & 2 deletions docs/README.md
@@ -24,8 +24,9 @@
|---|---|
| [Architecture](architecture.md) | the five layers, hexagonal ports/adapters, the DI container, auto-configuration |
| [Datasets](datasets.md) | the `Dataset` container, loaders, `train_test_split`, task inference |
| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard |
| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate` |
| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard, calibration, ensembling, PR-AUC selection & CV strategies |
| [Explainability](explainability.md) | deterministic global + local feature importances (permutation, SHAP) |
| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate`; the persisted audit trail |
| [Agentic ML-Engineering Loop](agentic-loop.md) | propose → train → verify → reflect → select |
| [Deep Learning & TabFM](deep-learning.md) | sklearn-MLP, PyTorch Lightning, HuggingFace, TabPFN |
| [Serving & Lineage](serving.md) | the in-process server, gated backends, lineage |
Binary file modified docs/brief/firefly-datascience-complete-guide.pdf
Binary file not shown.
11 changes: 10 additions & 1 deletion docs/index.md
@@ -102,6 +102,15 @@ swappability, and security by default.

[:octicons-arrow-right-24: Security model](security.md)

- :material-lightbulb-on-outline:{ .lg .middle } __Explainable & trustworthy__

---

Deterministic global + local feature importances (permutation, SHAP) and **calibrated**
probabilities — so every model can be explained, and its scores trusted for real decisions.

[:octicons-arrow-right-24: Explainability](explainability.md)

</div>

## Get started in 30 seconds
@@ -175,7 +184,7 @@ app = FireflyDataScienceApplication.run(config=config)
- :material-flask-outline:{ .middle } __[Classical AutoML](automl.md)__

---
The classical-first engine: train, score, select.
Train, score, select — with calibration, stacking ensembles, PR-AUC selection & CV strategies.

- :material-lightbulb-on-outline:{ .middle } __[Explainability](explainability.md)__

29 changes: 24 additions & 5 deletions docs/samples.md
@@ -2,10 +2,11 @@

**Every sample is a single, runnable script that is covered by a test — so each one is guaranteed to work.**

The four scripts live in
The five scripts live in
[`samples/`](https://github.com/fireflyframework/fireflyframework-datascience/tree/main/samples).
Three run **offline with no LLM key** (the GenAI steps use deterministic stand-in proposers); one
calls a **real LLM**. Pick the card that matches what you want to see, copy its run command, and go.
Most run **offline with no LLM key** (the GenAI steps use deterministic stand-in proposers); two use a
**real LLM** when a key is present. Pick the card that matches what you want to see, copy its run
command, and go.

!!! firefly "The pattern every sample demonstrates — the LLM proposes; the classical engine decides"

@@ -14,7 +15,7 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
swap the LLM for a fixed proposer so the exact same gate runs without a key — see
[GenAI Feature Engineering](genai-features.md) and the [Agentic Loop](agentic-loop.md).
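
The "LLM proposes; the classical engine decides" pattern can be sketched roughly as follows. This is a minimal, dependency-free illustration and not the framework's gate: the `Proposal` dataclass, the `gate` function, the `min_lift` threshold, the record fields, and the toy scorer are all hypothetical names chosen for this example. The toy scorer stands in for a cross-validated model score.

```python
from __future__ import annotations

import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class Proposal:
    """A proposed feature (hypothetical shape, for illustration only)."""

    name: str
    rationale: str


def gate(
    proposals: list[Proposal],
    score_with: Callable[[frozenset], float],
    audit_path: Path,
    min_lift: float = 0.001,
) -> list[str]:
    """Keep a proposal only if it measurably improves the score; log every decision."""
    accepted: list[str] = []
    baseline = score_with(frozenset())
    with audit_path.open("a", encoding="utf-8") as log:
        for proposal in proposals:
            candidate = score_with(frozenset(accepted + [proposal.name]))
            lift = candidate - baseline
            keep = lift >= min_lift
            if keep:
                accepted.append(proposal.name)
                baseline = candidate  # later proposals must beat the improved score
            record = {
                "feature": proposal.name,
                "decision": "accepted" if keep else "rejected",
                "lift": round(lift, 6),
            }
            log.write(json.dumps(record) + "\n")  # one JSON line per decision
    return accepted


# Toy scorer: an assumption of this sketch, standing in for measured model quality.
GAINS = {"ratio": 0.02, "interaction": 0.01, "noise": 0.0}


def toy_score(features: frozenset) -> float:
    return 0.90 + sum(GAINS[name] for name in sorted(features))


audit_file = Path(tempfile.mkdtemp(prefix="gate-sketch-")) / "decisions.jsonl"
kept = gate(
    [
        Proposal("ratio", "compactness proxy"),
        Proposal("interaction", "interaction term"),
        Proposal("noise", "adds nothing"),
    ],
    toy_score,
    audit_file,
)
records = [json.loads(line) for line in audit_file.read_text(encoding="utf-8").splitlines()]
print(kept)
print(records)
```

Because the proposer is just an input to `gate`, a fixed list of proposals and an LLM-generated list go through the same acceptance logic, which is the property the samples rely on.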

## The four samples
## The five samples

<div class="grid cards" markdown>

@@ -47,6 +48,22 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its

[:octicons-arrow-right-24: Use case: Lumen](use-case-lumen.md)

- :material-tune-vertical:{ .lg .middle } __`advanced_automl.py` — production-grade selection & trust__

---

Every modeling/trust feature on the **real** `breast_cancer` data: a `StratifiedKFold` splitter,
**PR-AUC** selection, a **calibrated stacking ensemble**, deterministic **explainability** (global
importances), and a persisted **audit trail** of every GenAI gate decision. Runs **offline**, and
automatically uses a real LLM to propose features when a key is present. Needs the `tabular` extra
(`explain` adds SHAP).

```bash
uv run python samples/advanced_automl.py
```

[:octicons-arrow-right-24: Classical AutoML](automl.md)

- :material-database-outline:{ .lg .middle } __`industry_showcase.py` — real public data__

---
@@ -91,6 +108,7 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
```bash
uv run python samples/tutorial.py # the full guided tour
uv run python samples/lumen_credit_risk.py # focused credit-risk use case
uv run python samples/advanced_automl.py # calibration, ensembling, PR-AUC, explainability, audit
uv run python samples/industry_showcase.py # real OpenML data (needs network)
```
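
`advanced_automl.py` selects its winner on PR-AUC (`average_precision`) and reports a Brier score for calibration quality. As a rough, dependency-free sketch of what those two numbers measure (the function names below are illustrative, not the framework's metrics API, and tied scores are not specially handled):

```python
def average_precision(y_true, scores):
    """Step-wise PR-AUC: the mean of precision@k over the ranks k that hold a positive."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("average precision is undefined without positive labels")
    # Rank examples from the highest score to the lowest.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            ap += hits / rank  # precision at the moment recall increases
    return ap / total_pos


def brier_score(y_true, probs):
    """Mean squared error between predicted probabilities and the 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, y_true)) / len(y_true)


labels = [1, 0, 1, 1, 0]
predicted = [0.9, 0.8, 0.7, 0.4, 0.2]
ap = average_precision(labels, predicted)
brier = brier_score(labels, predicted)
print(f"average precision: {ap:.4f}")
print(f"brier score      : {brier:.4f}")
```

Average precision depends only on how the positives are ranked, while the Brier score depends on the probability values themselves, so the two are complementary. For reference, a constant prediction of 0.5 scores exactly 0.25 on the Brier score whatever the labels are.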

@@ -99,7 +117,8 @@ calls a **real LLM**. Pick the card that matches what you want to see, copy its
```bash
export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY=...
export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=anthropic:claude-haiku-4-5 # optional; this is the default
uv run python samples/genai_llm_showcase.py
uv run python samples/genai_llm_showcase.py # real LLM: feature engineering + agentic loop
uv run python samples/advanced_automl.py # real LLM proposes features; calibrated ensemble + explainability
```
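
The switch between the offline and real-LLM paths comes down to whether a provider key is set. A minimal sketch of that check, mirroring the environment variables the sample reads (`choose_proposer_kind` and its `env` parameter are hypothetical, added here so the logic can be exercised without touching the real environment):

```python
import os

# Environment variables the sample checks for a provider key.
LLM_KEY_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")


def choose_proposer_kind(env=None):
    """Return "llm" when any provider key is set and non-empty, otherwise "static"."""
    env = os.environ if env is None else env
    return "llm" if any(env.get(var) for var in LLM_KEY_VARS) else "static"


print(choose_proposer_kind({}))                            # no key: offline stand-in
print(choose_proposer_kind({"OPENAI_API_KEY": "sk-test"}))  # key present: real LLM
```

An empty value counts as "no key", so exporting a blank variable keeps the run on the deterministic path.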

!!! tip "The offline samples are the place to start"
152 changes: 152 additions & 0 deletions samples/advanced_automl.py
@@ -0,0 +1,152 @@
# Copyright 2026 Firefly Software Foundation.
"""Advanced AutoML — production-grade selection, trust, and governance, on real data.

One runnable script that exercises every modeling/trust feature added in the latest release, on the
real ``breast_cancer`` dataset (no synthetic data):

1. **Governed feature engineering with a persisted audit trail.** The LLM (or, offline, a
   deterministic stand-in) proposes features; the cost/benefit gate keeps only measured wins; and
   *every* decision — accepted or rejected — is appended to a JSONL audit log.
2. **Robust model selection.** A scikit-learn ``StratifiedKFold`` splitter drives cross-validation,
   and the winner is chosen on **PR-AUC** (``average_precision``) — the right target for imbalanced
   or cost-sensitive binary problems.
3. **A calibrated stacking ensemble.** The top-k candidates are stacked, then calibrated so the
   predicted probabilities are trustworthy (reported via the Brier score).
4. **Explainability.** The winner reports deterministic global feature importances on the holdout.

The GenAI step uses a real LLM when a key is present (``ANTHROPIC_API_KEY`` / ``OPENAI_API_KEY`` /
``GEMINI_API_KEY``), and a deterministic ``StaticFeatureProposer`` otherwise — so the same sample runs
offline in CI and against a live model when credentials are available.

Run it: ``python samples/advanced_automl.py`` (needs the ``tabular`` extra; ``explain`` for SHAP)
"""

from __future__ import annotations

import json
import os
import tempfile
from pathlib import Path
from typing import Any

import pandas as pd

from fireflyframework_datascience.automl import AutoML
from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader
from fireflyframework_datascience.features import FeatureProposal, StaticFeatureProposer
from fireflyframework_datascience.features.audit import JsonlAuditLog
from fireflyframework_datascience.features.genai import GenAIFeatureEngineer

DEFAULT_MODEL = os.getenv("FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL", "anthropic:claude-haiku-4-5")


def _has_llm_key() -> bool:
    return bool(os.getenv("ANTHROPIC_API_KEY") or os.getenv("OPENAI_API_KEY") or os.getenv("GEMINI_API_KEY"))


def _make_proposer() -> tuple[Any, str]:
    """A real LLM proposer when a key is present; otherwise a deterministic, offline stand-in."""
    if _has_llm_key():
        from fireflyframework_datascience.features.genai import AgentFeatureProposer

        return AgentFeatureProposer(model=DEFAULT_MODEL), "llm"
    # Offline: real feature code over real columns. The gate decides what (if anything) earns its keep.
    proposer = StaticFeatureProposer(
        [
            FeatureProposal(
                "area_to_perimeter",
                "df['area_to_perimeter'] = df['worst area'] / (df['worst perimeter'] + 1)",
                "compactness proxy",
            ),
            FeatureProposal(
                "concavity_interaction",
                "df['concavity_interaction'] = df['mean concavity'] * df['mean concave points']",
                "interaction term",
            ),
            FeatureProposal("noise", "df['noise'] = 0.0", "should be rejected — adds nothing"),
        ]
    )
    return proposer, "static"


def _apply(engineered: Any, X: pd.DataFrame) -> pd.DataFrame:
    """Apply the *accepted* feature code to a frame, keeping train and test consistent."""
    from fireflyframework_datascience.features.executor import FeatureCodeExecutor

    executor = FeatureCodeExecutor()
    working = X.copy()
    for accepted in engineered.accepted:
        working = executor.execute(accepted.proposal.code, working)
    return working


def run(audit_path: str | Path | None = None) -> dict[str, Any]:
    """Run the advanced AutoML pipeline end-to-end and return a report dict."""
    from sklearn.model_selection import StratifiedKFold

    ds = SklearnDatasetLoader().load("breast_cancer")  # real data
    train, test = ds.train_test_split(test_size=0.25, random_state=0)

    # 1. Governed GenAI feature engineering with a persisted, append-only audit trail.
    cleanup = False
    if audit_path is None:
        audit_path = Path(tempfile.mkdtemp(prefix="firefly-audit-")) / "genai-decisions.jsonl"
        cleanup = True
    proposer, proposer_kind = _make_proposer()
    audit = JsonlAuditLog(audit_path)
    engineer = GenAIFeatureEngineer(proposer, audit_log=audit, cv=4)
    engineered = engineer.engineer(train)

    # 2. Robust selection: a StratifiedKFold splitter + PR-AUC, with a calibrated stacking ensemble.
    splitter = StratifiedKFold(n_splits=4, shuffle=True, random_state=0)
    automl = AutoML(cv=splitter, n_trials=1, calibrate=True, ensemble=True, ensemble_size=3, random_state=0)
    result = automl.fit(engineered.dataset, metric="average_precision")

    test_engineered = test.with_features(_apply(engineered, test.X))
    evaluation = result.evaluate(test_engineered)

    # 3. Explainability — deterministic global feature importances on the holdout.
    explanation = result.explain(test_engineered)

    # 4. Read the persisted audit trail back (one JSON line per gate decision).
    audit_records = [json.loads(line) for line in Path(audit_path).read_text(encoding="utf-8").splitlines()]
    if cleanup:
        Path(audit_path).unlink(missing_ok=True)

    return {
        "proposer": proposer_kind,
        "accepted_features": [a.proposal.name for a in engineered.accepted],
        "rejected_features": [r.proposal.name for r in engineered.rejected],
        "fe_lift": engineered.lift,
        "winner": result.best_model.name,
        "selection_metric": result.metric,
        "cv_scoring": result.cv_scoring,
        "leaderboard": result.leaderboard_table(),
        "holdout": evaluation.metrics,
        "explanation_method": explanation.method,
        "top_features": explanation.top(8),
        "audit_trail": audit_records,
    }


def main() -> None:
    report = run()
    print("=== Advanced AutoML — calibrated stacking ensemble, PR-AUC selection, explainability ===")
    print(f"proposer : {report['proposer']}")
    print(f"accepted features : {report['accepted_features']}")
    print(f"rejected features : {report['rejected_features']}")
    print(f"winning model : {report['winner']}")
    print(f"selected on : {report['selection_metric']} (cv scorer: {report['cv_scoring']})")
    print("leaderboard:")
    print(report["leaderboard"])
    print(f"holdout metrics : {report['holdout']}")
    print(f"explanation : {report['explanation_method']}")
    for name, importance in report["top_features"]:
        print(f" {name:<26} {importance:+.4f}")
    print(f"audit trail : {len(report['audit_trail'])} decisions persisted")
    for record in report["audit_trail"]:
        print(f" {record['decision']:<9} {record['feature']:<24} ({record['detail']})")


if __name__ == "__main__":
    main()
71 changes: 71 additions & 0 deletions tests/samples/test_advanced_automl.py
@@ -0,0 +1,71 @@
# Copyright 2026 Firefly Software Foundation.
"""Smoke tests for the advanced-AutoML sample (calibration, ensembling, PR-AUC, explainability, audit).

The offline test forces the deterministic proposer (so it runs in CI with no key); the integration
test runs the *same* sample against a live LLM when ``ANTHROPIC_API_KEY`` is present.
"""

from __future__ import annotations

import os

import pytest


def _load_sample():  # type: ignore[no-untyped-def]
    import pathlib
    import sys

    sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[2] / "samples"))
    import advanced_automl

    return advanced_automl


def test_advanced_automl_offline(monkeypatch: pytest.MonkeyPatch) -> None:
    # Force the offline (deterministic) path regardless of the ambient environment.
    for var in ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY"):
        monkeypatch.delenv(var, raising=False)

    report = _load_sample().run()

    # Offline, deterministic feature engineering with a full audit trail (one record per proposal).
    assert report["proposer"] == "static"
    assert "noise" in report["rejected_features"]  # the no-information feature is always rejected
    assert len(report["audit_trail"]) == 3
    assert all(r["decision"] in ("accepted", "rejected") for r in report["audit_trail"])
    assert {r["feature"] for r in report["audit_trail"]} == {"area_to_perimeter", "concavity_interaction", "noise"}

    # Winner is a stacking ensemble, selected on PR-AUC.
    assert report["winner"] == "stacking_ensemble"
    assert report["selection_metric"] == "average_precision"
    assert report["cv_scoring"] == "average_precision"

    # Richer metrics are reported, including calibration quality (Brier) and PR-AUC, on a strong model.
    holdout = report["holdout"]
    assert {"roc_auc", "average_precision", "brier_score"} <= set(holdout)
    assert holdout["roc_auc"] > 0.95
    assert 0.0 <= holdout["brier_score"] <= 0.25  # no worse than a constant 0.5 prediction (Brier 0.25)

    # Explainability produced ranked global importances.
    assert report["explanation_method"] in ("permutation_importance", "shap")
    assert len(report["top_features"]) > 0


@pytest.mark.integration
def test_advanced_automl_with_real_llm() -> None:
    if not os.getenv("ANTHROPIC_API_KEY"):
        pytest.skip("needs ANTHROPIC_API_KEY")

    report = _load_sample().run()

    # The real LLM proposed features and every gate decision was persisted to the audit trail.
    assert report["proposer"] == "llm"
    assert len(report["audit_trail"]) >= 1
    assert all(r["decision"] in ("accepted", "rejected") for r in report["audit_trail"])

    # The robust selection pipeline still produced a calibrated stacking ensemble selected on PR-AUC.
    assert report["winner"] == "stacking_ensemble"
    assert report["selection_metric"] == "average_precision"
    assert report["holdout"]["roc_auc"] > 0.9
    assert len(report["top_features"]) > 0