Merged
14 changes: 9 additions & 5 deletions README.md
@@ -44,9 +44,11 @@ swappability, and security by default.
- **One reproducible pattern.** The LLM proposes code/features/pipelines/seeds; a deterministic
classical engine trains, scores, and selects; every GenAI step is gated behind a measured
improvement over a seeded classical baseline.
-- **Hexagonal & swappable.** Every ML/MLOps library (scikit-learn, XGBoost, LightGBM, CatBoost,
-AutoGluon, TabPFN, PyTorch Lightning, HuggingFace, MLflow, Feast, BentoML, …) is a swappable adapter
-behind a `Protocol` port. The core stays library-agnostic.
+- **Hexagonal & swappable.** Each ML/MLOps library sits behind a `Protocol` port, so the core stays
+library-agnostic. Adapters that ship today: scikit-learn, XGBoost, LightGBM, CatBoost, TabPFN,
+PyTorch Lightning, HuggingFace, and MLflow. Ports with reference or planned adapters (AutoGluon,
+Feast, BentoML packaging, a model registry) are marked as such in the docs — the seams exist; the
+adapters are landing.
- **Firefly-native.** Auto-configuration, dependency injection, a startup banner + wiring summary,
CalVer, and the same CI gates as the rest of the Firefly Framework.

@@ -121,8 +123,10 @@ Five acyclic layers, mirroring `fireflyframework-agentic` with a **DataScience**

### Hexagonal ports & adapters

-Every ML/MLOps library (scikit-learn, XGBoost, AutoGluon, TabPFN, PyTorch Lightning, HuggingFace,
-MLflow, BentoML, …) is a swappable adapter behind a `Protocol` port. The core stays library-agnostic.
+Each ML/MLOps library sits behind a `Protocol` port, so the core stays library-agnostic. Shipping
+adapters today: scikit-learn, XGBoost, LightGBM, CatBoost, TabPFN, PyTorch Lightning, HuggingFace,
+MLflow. AutoGluon, Feast, BentoML packaging and a model registry are ports with reference/planned
+adapters.

<p align="center">
<img src="docs/img/hexagonal.svg" alt="Hexagonal ports and adapters" width="78%">
103 changes: 103 additions & 0 deletions docs/explainability.md
@@ -0,0 +1,103 @@
# Explainability

**Every fitted model can explain which features drive its predictions — with deterministic,
well-understood methods, never an LLM.**

Explainability is a first-class port in Firefly DataScience. After AutoML selects and refits a winner,
you get a model that can describe *why* it predicts what it predicts: globally (which features matter
across the dataset) and — with the optional SHAP adapter — locally (which features moved a single
prediction). This is table-stakes for regulated domains (lending, healthcare, insurance) where a model
you cannot explain is a model you cannot ship.

!!! firefly "Explanations are classical, not generated"

    Importances come from **permutation importance** (the dependency-free default) or **SHAP** — both
    deterministic, peer-reviewed methods. The LLM is never in the explanation path: just as GenAI
    *proposes* and a classical engine *decides*, here the classical engine also *explains*. The numbers
    are reproducible from a seed.

## Global feature importance

Every `AutoMLResult` can explain its winning model. Call `explain()` with a dataset (typically your
held-out split):

```python
from fireflyframework_datascience.automl import AutoML
from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader

ds = SklearnDatasetLoader().load("breast_cancer")
train, test = ds.train_test_split(test_size=0.25, random_state=0)

result = AutoML(cv=3, random_state=0).fit(train)
explanation = result.explain(test) # (1)!

print(explanation.method) # "permutation_importance"
for name, importance in explanation.top(8):
    print(f"{name:<26} {importance:+.4f}")
```

1. `explain()` uses the DI-wired `ExplainerPort` when the result came from `AutoML.from_context`,
otherwise the dependency-free permutation-importance explainer. It returns a `GlobalExplanation`.

!!! success "Expected (a real run on `breast_cancer`)"

    ```text
    winner: linear | holdout roc_auc: 0.9952
    permutation_importance
    radius error               +0.0182
    fractal dimension error    +0.0140
    mean concave points        +0.0091
    mean concavity             +0.0077
    compactness error          +0.0056
    worst area                 +0.0056
    worst symmetry             +0.0056
    perimeter error            +0.0049
    ```

    Each value is the mean drop in the model's score when that feature is randomly permuted — higher
    means more important. A pure-noise column lands at ≈ 0. Exact numbers depend on the data, the
    winning model, and the seed.

`GlobalExplanation` exposes `feature_importances` (a `dict[str, float]`), `std`, `baseline_score`,
`.top(k)`, and `.to_frame()` for a tidy pandas table.
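
As a rough illustration of what a permutation-importance number means, the sketch below re-implements the idea with the standard library only. It is not the framework's `PermutationImportanceExplainer`; the toy data, the `accuracy` scorer, and the function names are all assumptions made for this example.

```python
import random
from statistics import mean


def accuracy(predict, rows, labels):
    """Fraction of rows the model labels correctly."""
    return mean(1.0 if predict(r) == y else 0.0 for r, y in zip(rows, labels))


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in score when one feature column is shuffled (hypothetical helper)."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = {}
    for name in rows[0]:
        drops = []
        for _ in range(n_repeats):
            shuffled = [r[name] for r in rows]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            permuted = [{**r, name: v} for r, v in zip(rows, shuffled)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances[name] = mean(drops)
    return baseline, importances


# Toy data: the label depends on "signal" only; "noise" is never used by the model.
data_rng = random.Random(42)
rows = [{"signal": data_rng.random(), "noise": data_rng.random()} for _ in range(200)]
labels = [int(r["signal"] > 0.5) for r in rows]


def predict(row):
    return int(row["signal"] > 0.5)


baseline, importances = permutation_importance(predict, rows, labels)
for name, value in sorted(importances.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<26} {value:+.4f}")
```

Because the toy model ignores `noise`, shuffling that column cannot change any prediction, so its importance is exactly zero; shuffling `signal` breaks the model and produces a large positive drop. Fixing `seed` makes the run repeatable.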

## Local (per-prediction) explanations with SHAP

For per-prediction attributions — "why did *this* applicant score the way they did?" — install the
optional `explain` extra and the SHAP explainer is used automatically:

```bash
uv add "fireflyframework-datascience[explain]" # adds shap
```

```python
from fireflyframework_datascience.explainability.adapters import ShapExplainer

explainer = ShapExplainer()
local = explainer.explain_local(result.best_model, test.X.iloc[:1]) # one LocalExplanation per row
for feature, contribution in local[0].top(5):
    print(f"{feature:<26} {contribution:+.4f}")
```

Each `LocalExplanation` carries the `prediction`, per-feature `contributions`, and a `base_value`;
`.top(k)` ranks features by absolute contribution.
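
To make the shape of a local explanation concrete, here is a minimal stand-in. `LocalExplanationSketch` is a hypothetical class written for this example, not the framework's `LocalExplanation`; only the three field names and the "rank by absolute contribution" behaviour come from the text above, and the numbers are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalExplanationSketch:
    """Hypothetical stand-in mirroring the documented fields."""

    prediction: float
    base_value: float
    contributions: dict[str, float]

    def top(self, k: int) -> list[tuple[str, float]]:
        # Rank by magnitude so strong negative drivers are not hidden.
        ranked = sorted(self.contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
        return ranked[:k]


example = LocalExplanationSketch(
    prediction=0.60,
    base_value=0.50,
    contributions={"income": 0.25, "debt_ratio": -0.30, "age": 0.05, "tenure": 0.10},
)

print(example.top(2))  # [('debt_ratio', -0.3), ('income', 0.25)]

# SHAP attributions are additive: base value plus contributions gives the model output.
reconstructed = example.base_value + sum(example.contributions.values())
print(abs(reconstructed - example.prediction) < 1e-9)  # True
```

Sorting on the absolute value is what lets a feature that pushes the prediction down (here `debt_ratio`) rank above one that pushes it up by a smaller amount.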

## How it fits the architecture

| Piece | What it is |
| --- | --- |
| `ExplainerPort` | The `Protocol`: `supports(model)` + `explain_global(model, dataset)`. |
| `PermutationImportanceExplainer` | Default adapter — model-agnostic, scikit-learn only (no extra dependency). |
| `ShapExplainer` | Optional adapter (`explain` extra) — global **and** local attributions. |
| `ExplainabilityAutoConfiguration` | Registers the explainer (SHAP when installed, else permutation). |
| `AutoMLResult.explain(dataset)` | The handle most users hold — delegates to the wired explainer. |

Because it is a port, you can inject your own explainer (register an `ExplainerPort` bean and it wins),
and the DI-wired `AutoML.from_context(app)` automatically threads it into every result.
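
A custom explainer only has to match the port's two methods structurally. The sketch below shows that idea with plain `typing.Protocol`; `ExplainerPortSketch`, `CoefficientExplainer`, and `ToyLinearModel` are hypothetical names for this example, the return type is simplified to a `dict` rather than a `GlobalExplanation`, and bean registration is not shown because its API is not described here.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class ExplainerPortSketch(Protocol):
    """Assumed shape of the port: the two methods named in the table above."""

    def supports(self, model: Any) -> bool: ...

    def explain_global(self, model: Any, dataset: Any) -> Any: ...


class CoefficientExplainer:
    """Hypothetical adapter: importance is the magnitude of a linear coefficient."""

    def supports(self, model: Any) -> bool:
        return hasattr(model, "coef_") and hasattr(model, "feature_names")

    def explain_global(self, model: Any, dataset: Any) -> dict[str, float]:
        return {name: abs(c) for name, c in zip(model.feature_names, model.coef_)}


class ToyLinearModel:
    """Hypothetical fitted model used only to exercise the adapter."""

    feature_names = ["a", "b"]
    coef_ = [0.5, -2.0]


explainer = CoefficientExplainer()
model = ToyLinearModel()

# No inheritance needed: the Protocol check is structural.
print(isinstance(explainer, ExplainerPortSketch))  # True
if explainer.supports(model):
    print(explainer.explain_global(model, dataset=None))  # {'a': 0.5, 'b': 2.0}
```

The adapter never imports the port, which is the property that keeps the core library-agnostic.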

## See also

- [Classical AutoML](automl.md) — the engine that produces the model you explain.
- [Architecture](architecture.md) — how ports, adapters, and auto-configuration fit together.
- [GenAI features](genai-features.md) — the gated accelerator (proposals are explained the same way).
- [Security](security.md) — why generated code is treated as untrusted, and what is enforced today.
5 changes: 5 additions & 0 deletions docs/index.md
@@ -177,6 +177,11 @@ app = FireflyDataScienceApplication.run(config=config)
---
The classical-first engine: train, score, select.

- :material-lightbulb-on-outline:{ .middle } __[Explainability](explainability.md)__

---
Deterministic global + local feature importances (permutation, SHAP).

- :material-creation-outline:{ .middle } __[GenAI features](genai-features.md)__

---
21 changes: 15 additions & 6 deletions docs/security.md
@@ -99,7 +99,16 @@ The post-conditions are enforced in order: a non-`DataFrame` result raises `Feat

## Layer 3 — the tiered sandbox

-Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. For untrusted data, escalate isolation with `execution.sandbox` in `ExecutionConfig`:
+!!! warning "Implementation status — what is enforced today"
+
+    Layers 1–2 (static analysis + the restricted in-process namespace) are **enforced now** and are
+    what protects you today. The sandbox *tiers* below (`docker`, `e2b`), `execution.timeout_seconds`,
+    and the `require_approval` HITL gate are currently **declared, validated configuration** — their
+    routing and enforcement are on the roadmap (a `CodeExecutorPort` with per-tier adapters and an
+    approval gate). Until that ships, the real isolation is the in-process `monty` / `local` path:
+    **do not run genuinely untrusted data through GenAI expecting container/microVM isolation yet.**
+
+Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. The configuration surface below lets you *declare* stronger isolation for untrusted data via `execution.sandbox` in `ExecutionConfig` (enforcement is roadmap, per the note above):

```python
from fireflyframework_datascience.core.config import FireflyDataScienceConfig
@@ -142,7 +151,7 @@ The literal type for `sandbox` is exactly `Literal["monty", "docker", "e2b", "lo

Profile overlays outrank the base `firefly-datascience.yaml`, so a `prod` profile can tighten isolation without touching the base file. See [Configuration](configuration.md) for the full precedence order.

-Beyond the strongest sandbox sits **HITL** (human-in-the-loop): when `execution.require_approval` is `True` (the default), generated code is surfaced for human approval before it runs. This is the final tier — a person, not a policy, signs off.
+Beyond the strongest sandbox sits **HITL** (human-in-the-loop): `execution.require_approval` defaults to `True`, and the design's final tier is a person — not a policy — signing off on generated code before it runs. (Per the status note above, the approval-gate wiring is on the roadmap; today the field is declared and validated.)

!!! note "Defaults are the safe end of every axis"
Out of the box, `sandbox = "monty"` (in-process restricted interpreter), `timeout_seconds = 60`, and `require_approval = True`. You loosen these deliberately — and only `local` removes isolation entirely.
@@ -154,19 +163,19 @@ The subtle attack is not the model going rogue on its own; it is a **column valu
1. **Static analysis is content-blind.** It rejects `os`, `subprocess`, `socket`, dunder access, and `eval`/`exec`/`open` regardless of *why* the model wrote them — so a successful injection still produces code that gets rejected.
2. **The restricted namespace** means even "clever" injected code has no I/O, no imports, no host reach.
3. **The numeric-new-column contract** means injected code that tries to do anything other than add a numeric feature fails the post-conditions.
-4. **Sandboxing + HITL** mean that for genuinely untrusted data you route to `docker`/`e2b` and require approval — so injection cannot silently reach a capability.
+4. **Sandboxing + HITL** are the *intended* outer tiers for genuinely untrusted data (route to `docker`/`e2b`, require approval). Their enforcement is on the roadmap (see the Layer 3 status note) — today, rely on points 1–3, which are enforced in-process.

!!! warning "The framework does not read your data's meaning"
Firefly cannot inspect or sanitize the *semantics* of your data. Prompt-injection defense rests on capability restriction and sandboxing, not on detecting malicious text. Treat data of unknown provenance as untrusted input: raise `execution.sandbox` and keep `require_approval` on.
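
The "content-blind" property of point 1 can be sketched with Python's `ast` module. This is an illustrative approximation, not the framework's analyzer: the deny-lists below are taken from the names listed in point 1, and `violations` is a hypothetical helper. A real checker would likely be stricter (the restricted namespace described in point 2 allows no imports at all).

```python
import ast

# Assumed deny-lists, copied from the capabilities named in point 1.
BANNED_MODULES = {"os", "subprocess", "socket"}
BANNED_BUILTINS = {"eval", "exec", "open"}


def violations(source: str) -> list[str]:
    """Return the reasons a snippet would be rejected; an empty list means it passes."""
    found: list[str] = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            roots = [alias.name.split(".")[0] for alias in node.names]
            found += [f"import {r}" for r in roots if r in BANNED_MODULES]
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root in BANNED_MODULES:
                found.append(f"import {root}")
        elif isinstance(node, ast.Attribute):
            if node.attr.startswith("__") and node.attr.endswith("__"):
                found.append(f"dunder access: {node.attr}")
        elif isinstance(node, ast.Name) and node.id in BANNED_BUILTINS:
            found.append(f"banned builtin: {node.id}")
    return found


print(violations("df['x_sq'] = df['x'] ** 2"))       # []
print(violations("import os"))                       # ['import os']
print(violations("eval('1 + 1')"))                   # ['banned builtin: eval']
print(len(violations("().__class__.__bases__")))     # 2
```

The check looks only at syntax, so it gives the same verdict whether the model wrote `import os` by mistake or because a column value told it to.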

## Governance — the CostBenefitGate

-GenAI is **off by default** (`genai.enabled = False`) — Firefly is classical-first. When you do enable it, the `CostBenefitGate` is the governance control: it decides whether an LLM call is *worth it* before spending tokens, bounded by a budget.
+GenAI is **off by default** (`genai.enabled = False`) — Firefly is classical-first. When you do enable it, the `CostBenefitGate` is the governance control: it is a **post-hoc, measured-lift filter** — a proposal (feature or pipeline) is adopted only if it *measurably beats the seeded baseline* on cross-validation; anything that doesn't is discarded. (It governs *what is kept*, not token spend: `genai.budget_usd` is a declared ceiling whose pre-call enforcement is on the roadmap.)

```python
config.genai.enabled # False by default
config.genai.cost_benefit_gate # True — gate LLM spend on expected benefit
-config.genai.budget_usd # optional hard ceiling (float | None), e.g. 5.00
+config.genai.budget_usd # declared ceiling (float | None); pre-call enforcement is roadmap
```

```yaml
@@ -179,7 +188,7 @@ genai:
```

!!! firefly "Two orthogonal gates: how much, and what"
-    The `CostBenefitGate` is a *governance* control, not a security control: it limits spend and runaway agentic loops, not capability. Keep both axes in mind — `cost_benefit_gate` governs **how much** the model runs; the executor and sandbox govern **what its output may do**. Neither substitutes for the other.
+    The `CostBenefitGate` is a *governance* control, not a security control: it governs **what GenAI output is kept** (only proposals that measurably beat the baseline), not capability. Keep both axes in mind — the gate governs **whether a proposal earns its place**; the executor and sandbox govern **what its output may do**. Neither substitutes for the other.
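
The measured-lift rule can be pictured as a comparison of cross-validation scores. The sketch below is one possible reading, not the `CostBenefitGate` implementation: `passes_gate`, the `min_lift` threshold, and the use of the mean fold score are assumptions, and the scores are invented.

```python
from statistics import mean


def passes_gate(baseline_scores, candidate_scores, min_lift=0.0):
    """Keep a proposal only if its mean CV score beats the baseline by more than min_lift."""
    return mean(candidate_scores) - mean(baseline_scores) > min_lift


# Hypothetical per-fold scores from the same seeded CV splits.
baseline = [0.90, 0.91, 0.89]
better = [0.93, 0.92, 0.94]
worse = [0.90, 0.89, 0.90]

print(passes_gate(baseline, better))                 # True: adopted
print(passes_gate(baseline, worse))                  # False: discarded
print(passes_gate(baseline, baseline))               # False: a tie is not a lift
print(passes_gate(baseline, better, min_lift=0.05))  # False: lift of about 0.03 is below the bar
```

The strict inequality means a proposal that merely matches the baseline is dropped, which keeps the classical result as the default whenever GenAI adds nothing measurable.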

## Limits of the trust model

1 change: 1 addition & 0 deletions mkdocs.yml
@@ -104,6 +104,7 @@ nav:
- Architecture: architecture.md
- Datasets: datasets.md
- Classical AutoML: automl.md
- Explainability: explainability.md
- GenAI Feature Engineering: genai-features.md
- Agentic ML-Engineering Loop: agentic-loop.md
- Deep Learning & TabFM: deep-learning.md
5 changes: 4 additions & 1 deletion pyproject.toml
@@ -51,6 +51,7 @@ nlp = ["transformers>=4.45.0", "datasets>=3.0.0", "peft>=0.13.0", "trl>=0.11.0",
tracking = ["mlflow>=2.17.0"]
tracking-wandb = ["wandb>=0.18.0"]
validation = ["pandera>=0.20.0"]
explain = ["shap>=0.46.0"]
featurestore = ["feast>=0.40.0"]
serving = ["bentoml>=1.3.0"]
serving-llm = ["vllm>=0.6.0"]
@@ -64,7 +65,7 @@ genai = [
]
# convenience bundles
automl-stack = ["fireflyframework-datascience[tabular,tabfm,automl,tracking,validation,data]"]
-full = ["fireflyframework-datascience[tabular,tabfm,automl,dl,nlp,tracking,validation,featurestore,serving,lineage,orchestration,data,genai]"]
+full = ["fireflyframework-datascience[tabular,tabfm,automl,dl,nlp,tracking,validation,explain,featurestore,serving,lineage,orchestration,data,genai]"]

[project.scripts]
firefly-ds = "fireflyframework_datascience.cli.main:cli"
@@ -77,6 +78,7 @@ datasets = "fireflyframework_datascience.datasets.auto_configuration:DatasetsAut
engineering = "fireflyframework_datascience.engineering.auto_configuration:EngineeringAutoConfiguration"
models = "fireflyframework_datascience.models.auto_configuration:ModelsAutoConfiguration"
evaluation = "fireflyframework_datascience.evaluation.auto_configuration:EvaluationAutoConfiguration"
explainability = "fireflyframework_datascience.explainability.auto_configuration:ExplainabilityAutoConfiguration"
features = "fireflyframework_datascience.features.auto_configuration:FeaturesAutoConfiguration"
search = "fireflyframework_datascience.search.auto_configuration:SearchAutoConfiguration"
validation = "fireflyframework_datascience.validation.auto_configuration:ValidationAutoConfiguration"
@@ -125,6 +127,7 @@ ignore = ["E501", "TC001", "TC002", "TC003", "UP040", "UP046", "UP047", "B008",
"src/fireflyframework_datascience/models/**" = ["PLC0415"]
"src/fireflyframework_datascience/engineering/**" = ["PLC0415"]
"src/fireflyframework_datascience/evaluation/**" = ["PLC0415"]
"src/fireflyframework_datascience/explainability/**" = ["PLC0415"]
"src/fireflyframework_datascience/features/**" = ["PLC0415"]
"src/fireflyframework_datascience/search/**" = ["PLC0415"]
"src/fireflyframework_datascience/validation/**" = ["PLC0415"]
15 changes: 15 additions & 0 deletions src/fireflyframework_datascience/automl/__init__.py
@@ -18,6 +18,7 @@

if TYPE_CHECKING:
    from fireflyframework_datascience.automl.facade import AutoML
    from fireflyframework_datascience.explainability import ExplainerPort, GlobalExplanation


@dataclass
@@ -44,6 +45,7 @@ class AutoMLResult:
    evaluator: MetricsEvaluatorPort
    cv_scoring: str = ""
    extras: dict[str, Any] = field(default_factory=dict)
    explainer: ExplainerPort | None = None

    @property
    def best_score(self) -> float:
@@ -69,6 +71,19 @@ def evaluate(self, dataset: Dataset) -> EvaluationResult:
    def leaderboard_table(self) -> str:
        return "\n".join(str(entry) for entry in self.leaderboard)

    def explain(self, dataset: Dataset) -> GlobalExplanation:
        """Global feature importances for the winning model.

        Uses the injected :class:`ExplainerPort` (the DI-wired explainer when built via
        ``AutoML.from_context``), falling back to the dependency-free permutation-importance explainer.
        """
        explainer = self.explainer
        if explainer is None:
            from fireflyframework_datascience.explainability.adapters import PermutationImportanceExplainer

            explainer = PermutationImportanceExplainer()
        return explainer.explain_global(self.best_model, dataset)


@runtime_checkable
class AutoMLBackendPort(Protocol):
31 changes: 24 additions & 7 deletions src/fireflyframework_datascience/automl/facade.py
@@ -16,6 +16,7 @@
from fireflyframework_datascience.core.types import TaskType
from fireflyframework_datascience.datasets import Dataset
from fireflyframework_datascience.evaluation import MetricsEvaluatorPort
from fireflyframework_datascience.explainability import ExplainerPort
from fireflyframework_datascience.models import Model, TrainerPort
from fireflyframework_datascience.search import SearchPolicyPort
from fireflyframework_datascience.tracking import TrackerPort
@@ -35,6 +36,7 @@ def __init__(
        search_policy: SearchPolicyPort | None = None,
        validator: ValidatorPort | None = None,
        tracker: TrackerPort | None = None,
        explainer: ExplainerPort | None = None,
        cv: int = 5,
        n_trials: int = 20,
        random_state: int = 42,
@@ -44,6 +46,7 @@ def __init__(
        self._search = search_policy or _default_search()
        self._validator = validator
        self._tracker = tracker
        self._explainer = explainer
        self._cv = cv
        self._n_trials = n_trials
        self._random_state = random_state
@@ -59,6 +62,7 @@ def from_context(cls, context: Any, **overrides: Any) -> AutoML:
            search_policy=container.resolve_optional(SearchPolicyPort) or _default_search(),
            validator=container.resolve_optional(ValidatorPort),
            tracker=container.resolve_optional(TrackerPort),
            explainer=container.resolve_optional(ExplainerPort),
            **overrides,
        )

@@ -113,6 +117,7 @@ def fit(self, dataset: Dataset, *, task: TaskType | None = None, metric: str | N
            task=task,
            evaluator=self._evaluator,
            cv_scoring=scoring,
            explainer=self._explainer,
        )

# -- internals --------------------------------------------------------
@@ -164,13 +169,25 @@ def _track_results(self, run: Any, model: Model, leaderboard: list[LeaderboardEn


def _default_trainers() -> list[TrainerPort]:
-    from fireflyframework_datascience.models.adapters import (
-        HistGradientBoostingTrainer,
-        LinearTrainer,
-        RandomForestTrainer,
-    )
-
-    return [RandomForestTrainer(), LinearTrainer(), HistGradientBoostingTrainer()]
+    import importlib
+    import importlib.util
+
+    adapters = importlib.import_module("fireflyframework_datascience.models.adapters")
+    trainers: list[TrainerPort] = [
+        adapters.RandomForestTrainer(),
+        adapters.LinearTrainer(),
+        adapters.HistGradientBoostingTrainer(),
+    ]
+    # Match the documented "+ XGBoost / LightGBM / CatBoost when installed" behaviour (the DI and
+    # agentic paths already do this) by including each boosting trainer whose library is importable.
+    for lib, cls_name in (
+        ("xgboost", "XGBoostTrainer"),
+        ("lightgbm", "LightGBMTrainer"),
+        ("catboost", "CatBoostTrainer"),
+    ):
+        if importlib.util.find_spec(lib) is not None:
+            trainers.append(getattr(adapters, cls_name)())
+    return trainers


def _default_evaluator() -> MetricsEvaluatorPort: