diff --git a/README.md b/README.md
index 7631582..5a221db 100644
--- a/README.md
+++ b/README.md
@@ -74,9 +74,9 @@ rediscovering a withheld driver (`revenue = price × units`) from the schema –
guarantees it never regresses, at **< $0.01**. Every number is reproducible – see the
[benchmark results](benchmarks/RESULTS.md).
-> **For business & transformation leaders:** a polished
-> [Strategic Introduction (PDF)](docs/brief/firefly-datascience-strategic-introduction.pdf) frames the
-> value without the engineering detail.
+> **The whole story in one document:** **[The Complete Guide (PDF)](docs/brief/firefly-datascience-complete-guide.pdf)**
+> combines the executive summary and strategic case with the architecture, a full hands-on tutorial,
+> and the benchmark evidence – for both leaders and engineers.
## Quick start
diff --git a/assets/README.md b/assets/README.md
index 7c6993e..2046ef0 100644
--- a/assets/README.md
+++ b/assets/README.md
@@ -10,15 +10,34 @@ Visual assets for Firefly DataScience, consistent with the Firefly ecosystem hou
| Primary (cyan) | `#06b6d4` |
| Deep | `#0891b2` |
| Light | `#67e8f9` |
-| Wordmark gradient | `#a5f3fc → #22d3ee → #0891b2` |
-| Sky (banner) | `#07131a → #0a1a1f → #08161a` |
+| Wordmark gradient | `#d6fbff → #22d3ee → #0e7d97` |
+| Sky (banner) | `#06121a → #0a1b22 → #071419` |
| Shared family firefly (amber) | `#f6a821` |
-## Banner
+## Banner, logo & favicon
-`banner.svg` – 1280×320, Maven Pro wordmark with the cyan gradient, dark teal sky gradient + radial
-glows, and the shared firefly motif (amber + cyan "sister" fireflies) connected as a data
-constellation. Self-contained (system-font fallback, no external fonts), GitHub-safe.
+Generated by `tools/gen_banner.py` into `docs/img/`:
+
+- `banner.svg` – 1280×320 wide hero. Dark-teal sky gradient + radial glows, the **Maven Pro**
+  wordmark in the cyan gradient, eyebrow / tagline / sub-tagline, and the shared firefly motif
+  (amber + cyan "sister" fireflies) connected as a data constellation.
+- `logo.svg` – 40×40 transparent firefly-hub glyph for the (dark) MkDocs header; sits left of the
+  Maven-Pro-styled site title.
+- `favicon.svg` – 64×64 firefly-hub glyph on a dark rounded tile; legible down to 16px.
+
+**Maven Pro, the house-style way.** The ecosystem rule is *no external fonts in SVG* (so they render
+identically on GitHub `<img>`, in MkDocs, and in WeasyPrint PDFs). The generator therefore converts
+the wordmark to **real Maven Pro vector paths** – [HarfBuzz](https://harfbuzz.github.io/)
+(`uharfbuzz`) shapes each string with the font's GPOS kerning and fontTools extracts the glyph
+outlines. The committed SVGs are pure `<path>` data, so the published build never needs the font.
+
+```bash
+uv pip install uharfbuzz cairosvg   # build-time only – NOT a project/docs dependency
+python assets/tools/gen_banner.py # -> docs/img/{banner,logo,favicon}.svg
+```
+
+Maven Pro TTFs are read from `$MAVEN_PRO_DIR` (default `~/Library/Fonts`, weights 400/500/600/700).
+Regenerate after editing the generator and commit the resulting SVGs.
## Diagrams
diff --git a/assets/tools/gen_banner.py b/assets/tools/gen_banner.py
new file mode 100644
index 0000000..6f26f82
--- /dev/null
+++ b/assets/tools/gen_banner.py
@@ -0,0 +1,203 @@
+# Copyright 2026 Firefly Software Foundation.
+"""Generate the Firefly DataScience brand banner, nav logo and favicon as self-contained SVGs.
+
+House style (matches the agentic/pyfly/rust ecosystem): **no external fonts**. The "Maven Pro"
+wordmark is therefore converted to real Maven Pro *vector paths* – HarfBuzz (``uharfbuzz``) shapes
+each string (applying the font's GPOS kerning) and fontTools extracts the glyph outlines. The
+resulting SVGs render identically on GitHub ``<img>``, in MkDocs, and in WeasyPrint PDFs with zero
+font dependency. Teal brand palette; amber shared-family firefly motif.
+
+Build-time tooling only (not a runtime/docs dependency). Regenerate with::
+
+ uv pip install uharfbuzz cairosvg # one-off, into your working venv
+ python assets/tools/gen_banner.py # -> docs/img/{banner,logo,favicon}.svg
+
+Maven Pro TTFs are read from ``$MAVEN_PRO_DIR`` or ``~/Library/Fonts`` (weights 400/500/600/700).
+Generated SVGs are committed, so the published build never needs the font or these tools.
+"""
+
+from __future__ import annotations
+
+import os
+from pathlib import Path
+
+import uharfbuzz as hb
+from fontTools.pens.svgPathPen import SVGPathPen
+from fontTools.ttLib import TTFont
+
+# --- Teal brand palette (the DataScience member colour) + shared amber firefly. -----------------
+SKY1, SKY2, SKY3 = "#06121a", "#0a1b22", "#071419"
+CYAN = "#06b6d4"
+CYAN_L = "#67e8f9"
+CYAN_XL = "#d6fbff"
+CYAN_D = "#0e7d97"
+EYEBROW = "#5fb3c6"
+TAG = "#dff6fb"
+SUBTAG = "#7fa9b1"
+LINE = "#3fb6cf"
+AMBER = "#f6a821"
+AMBER_HI = "#ffd980"
+
+_FONT_DIR = Path(os.environ.get("MAVEN_PRO_DIR", str(Path.home() / "Library" / "Fonts")))
+FONTS = {w: _FONT_DIR / f"MavenPro-{w}.ttf" for w in (400, 500, 600, 700)}
+
+_OUT = Path(__file__).resolve().parents[2] / "docs" / "img"
+
+# Cache of loaded fonttools/HarfBuzz objects per weight.
+_cache: dict[int, tuple] = {}
+
+
+def _load(weight: int):
+    if weight not in _cache:
+        path = str(FONTS[weight])
+        tt = TTFont(path)
+        blob = hb.Blob.from_file_path(path)
+        face = hb.Face(blob)
+        font = hb.Font(face)  # default scale == upem, so advances/offsets are in font units
+        _cache[weight] = (tt, font, tt["head"].unitsPerEm, tt.getGlyphOrder(), tt.getGlyphSet())
+    return _cache[weight]
+
+
+def wordmark(text: str, weight: int, size: float, tracking: float = 0.0, fill: str = "#000",
+             baseline_xy: tuple[float, float] = (0.0, 0.0)) -> tuple[str, float, float]:
+    """Return ``(svg_group, width_px, cap_height_px)`` placing ``text`` in Maven Pro.
+
+    The group's baseline origin sits at ``baseline_xy`` (SVG coords, y-down). ``tracking`` is extra
+    letter-spacing in px (like CSS ``letter-spacing``). Output is pure ``<path>`` data – no fonts.
+    """
+    tt, font, upem, order, glyphs = _load(weight)
+    scale = size / upem
+    buf = hb.Buffer()
+    buf.add_str(text)
+    buf.guess_segment_properties()
+    hb.shape(font, buf, {"kern": True, "liga": True})
+
+    tracking_units = tracking / scale
+    bx, by = baseline_xy
+    parts: list[str] = []
+    x = 0.0
+    for info, pos in zip(buf.glyph_infos, buf.glyph_positions):
+        pen = SVGPathPen(glyphs)
+        glyphs[order[info.codepoint]].draw(pen)  # after shaping, codepoint holds the glyph id
+        d = pen.getCommands()
+        if d:
+            gx = bx + (x + pos.x_offset) * scale
+            gy = by - pos.y_offset * scale
+            # NOTE: the SVG markup on this line was lost in extraction; this <path> is a
+            # reconstruction (font units are y-up, so the y scale is negated for SVG).
+            parts.append(f'<path transform="translate({gx:.2f} {gy:.2f}) '
+                         f'scale({scale:.5f} {-scale:.5f})" d="{d}"/>')
+        x += pos.x_advance + tracking_units
+    width = (x - tracking_units) * scale
+    cap = getattr(tt["OS/2"], "sCapHeight", 700) * scale
+    # NOTE: reconstructed wrapper; the original group markup was lost in extraction.
+    return f'<g fill="{fill}">{"".join(parts)}</g>', width, cap
+
+
+# --- Reusable SVG fragments. --------------------------------------------------------------------
+def _defs() -> str:
+ return f"""
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+"""
+
+
+def _constellation() -> str:
+ """Right-side data constellation with the amber firefly as the hub (Candidate A)."""
+ return f"""
+
+
+
+
+
+
+
+
+ """
+
+
+# --- Asset builders. ----------------------------------------------------------------------------
+def build_banner() -> str:
+ eyebrow, _, _ = wordmark("THE FIREFLY FRAMEWORK", 700, 13, 3.0, EYEBROW, (88, 88))
+ mark, _, _ = wordmark("Firefly DataScience", 700, 76, -1.0, "url(#wordmark)", (82, 176))
+ tagline, _, _ = wordmark("AutoML that fuses GenAI with classical ML & Deep Learning.", 600, 21.5, 0.1, TAG, (86, 248))
+ sub, _, _ = wordmark("Hexagonal Β· secure-by-default Β· governed GenAI Β· built on Firefly Agentic", 400, 15, 0.1, SUBTAG, (86, 280))
+ aria = ("Firefly DataScience β AutoML that fuses GenAI with classical ML and Deep Learning, "
+ "built on the Firefly Framework")
+ return f""""""
+
+
+def _glyph(cx: float, cy: float, s: float) -> str:
+ """Compact firefly-hub mark: amber firefly + 3 cyan nodes + edges. ``s`` = radius of node ring."""
+ import math
+
+ nodes = [(cx + s * math.cos(a), cy + s * math.sin(a)) for a in (-2.5, -0.4, 1.4)]
+ edges = "".join(f''
+ for nx, ny in nodes)
+ dots = "".join(f'' for nx, ny in nodes)
+ r = s * 0.42
+ fly = (f''
+ f''
+ f'')
+ return edges + dots + fly
+
+
+def build_logo() -> str:
+ """Transparent firefly-hub icon for the (dark) Material header, sits left of the site title."""
+ return f""""""
+
+
+def build_favicon() -> str:
+ """Square firefly-hub glyph on a dark rounded tile; legible at 16px."""
+ return f""""""
+
+
+def main() -> None:
+    _OUT.mkdir(parents=True, exist_ok=True)
+    for name, svg in (("banner", build_banner()), ("logo", build_logo()), ("favicon", build_favicon())):
+        path = _OUT / f"{name}.svg"
+        path.write_text(svg)
+        print(f"wrote {path.relative_to(_OUT.parents[1])} ({len(svg)} bytes)")
+
+
+if __name__ == "__main__":
+ main()
diff --git a/docs/agentic-loop.md b/docs/agentic-loop.md
index e601db2..5600cbc 100644
--- a/docs/agentic-loop.md
+++ b/docs/agentic-loop.md
@@ -1,4 +1,4 @@
-# Agentic ML-Engineering Loop
+# Agentic ML-engineering loop
**An LLM proposes, the classical engine decides – a deterministic verifier, not "it ran", is the judge.**
@@ -8,7 +8,14 @@ cross-validates* it, and a **Verifier** – a stage distinct from execution-succ
the result is genuinely good. Search is greedy with reflection over the attempt history, bounded by
an iteration and patience budget.
-The whole cycle is: **propose → train/CV → verify → reflect → select**.
+The whole cycle is **propose → train/CV → verify → reflect → select**.
+
+!!! firefly "The recurring pattern – the LLM proposes; the classical engine decides"
+
+    The LLM never gets the last word. It *suggests* the next `(trainer, params)` candidate; the
+    classical engine cross-validates it and a deterministic `Verifier` rules on whether it actually
+    beats a trivial baseline. A candidate that runs but fails to clear the baseline is rejected – so
+    the loop can only ever return a model that measurably earned its place.
@@ -27,6 +34,11 @@ The whole cycle is: **propose → train/CV → verify → reflect → select**.
| `AgentSolutionProposer` | LLM-backed proposer (reflects via a `FireflyAgent`). |
| `SequenceProposer` | Deterministic proposer for tests / fixed strategies. |
+`SolutionCandidate`, `Verdict`, and `AttemptRecord` are frozen dataclasses; `EngineeringRun` carries
+the trace and the refit model. The two proposers both satisfy the `CandidateProposer` protocol, and
+`DeterministicVerifier` satisfies the `Verifier` protocol – so any of them can be swapped for a custom
+implementation.
+
## Quick start
`AgenticAutoML` takes any proposer and runs the loop over a `Dataset`:
@@ -48,9 +60,14 @@ loop = AgenticAutoML(proposer, max_iterations=8, patience=3, cv=5)
run = loop.solve(dataset)
print(run.summary())
-# Agentic AutoML: 3 attempts (2 verified); best=hist_gradient_boosting roc_auc=0.9123 (baseline 0.5000)
```
+!!! success "Expected"
+
+ ```text
+ Agentic AutoML: 3 attempts (2 verified); best=hist_gradient_boosting roc_auc=0.9123 (baseline 0.5000)
+ ```
+
The engine seeds the candidates from `propose_initial`, then repeatedly calls `propose_next` until a
candidate is `None`, the iteration budget is spent, or patience runs out.
@@ -67,9 +84,47 @@ The engine never trusts a candidate just because it executed. Each attempt is:
4. **reflect** – `propose_next` is handed the full `history` to inform the next proposal.
5. **select** – the highest-scoring *verified* candidate wins and is refit on all data.
+```python
+for candidate in self._proposer.propose_initial(dataset, names):  # (1)!
+    record = self._attempt(dataset, candidate, task, scoring, baseline)  # (2)!
+    attempts.append(record)
+    if record.verdict.valid and record.score > best_score:  # (3)!
+        best, best_score = candidate, record.score
+
+patience = self._patience
+for _ in range(self._max_iterations):
+    candidate = self._proposer.propose_next(dataset, attempts, names)  # (4)!
+    if candidate is None:
+        break
+    record = self._attempt(dataset, candidate, task, scoring, baseline)
+    attempts.append(record)
+    if record.verdict.valid and record.score > best_score:
+        best, best_score, patience = candidate, record.score, self._patience  # (5)!
+    else:
+        patience -= 1
+        if patience <= 0:
+            break
+```
+
+1. **Seed** – `propose_initial` returns the starting population, evaluated before any reflection.
+2. **Train / CV + verify** – `_attempt` cross-validates the candidate and asks the `Verifier` for a
+   `Verdict` in one step.
+3. **Select** – only a `valid` verdict that *improves* on the running best can take the lead.
+4. **Reflect** – `propose_next` receives the full `attempts` history; a returned `None` ends the loop.
+5. **Patience reset** – an improving verified attempt restores the full patience budget; a
+   non-improving one decrements it, stopping the loop at zero.
+
## DeterministicVerifier – correctness, not execution
A run that produces a number is not the same as a run that produced a *good* number.
+
+!!! note "Correctness ≠ ran"
+
+    A candidate that trained and returned a score has only proven it *executed*. Verification is a
+    separate stage: `DeterministicVerifier` demands a *finite* score that *beats the trivial
+    baseline* by a `margin`. Anything else is rejected – execution-success is never mistaken for
+    correctness.
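
The acceptance rule in the note can be sketched in a few lines of plain Python. This is a minimal illustration, not the library's `DeterministicVerifier`: `SketchVerdict` and `verify` are hypothetical names, and the sketch assumes a higher-is-better metric and a strict comparison against `baseline + margin`.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchVerdict:
    """Hypothetical stand-in for the library's verdict type."""

    valid: bool
    reason: str


def verify(score: float, baseline: float, margin: float = 0.0) -> SketchVerdict:
    """Accept only a finite score that strictly exceeds baseline + margin."""
    if not math.isfinite(score):
        # NaN or +/-inf means the run executed but produced nothing usable.
        return SketchVerdict(False, "score is not finite")
    if score <= baseline + margin:
        # Assumption: higher is better, and a tie with the threshold is a rejection.
        return SketchVerdict(False, "does not beat the trivial baseline by the margin")
    return SketchVerdict(True, "beats the trivial baseline")


print(verify(0.9123, baseline=0.5, margin=0.01))
print(verify(float("nan"), baseline=0.5))
```

A candidate that merely matches the baseline is rejected under this sketch, which is the behaviour the note describes: producing a number is not enough.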
+
`DeterministicVerifier` requires a finite score that beats the trivial baseline (a `DummyClassifier`
with `strategy="prior"` / `DummyRegressor` with `strategy="mean"`) by a `margin`:
@@ -93,40 +148,52 @@ You can supply any object implementing the `Verifier` protocol
## Proposers
-### AgentSolutionProposer – the LLM in the loop
+Both built-in proposers satisfy the `CandidateProposer` protocol – pick the LLM-backed one for real
+search, or the deterministic one for tests and fixed strategies.
-`AgentSolutionProposer` seeds every trainer at its defaults, then reflects on the ranked history via a
-`FireflyAgent`. The LLM client is built lazily on first reflection – no client is created at startup:
+=== "AgentSolutionProposer (LLM)"
-```python
-from fireflyframework_datascience.engineering.loop import AgenticAutoML, AgentSolutionProposer
+    `AgentSolutionProposer` seeds every trainer at its defaults, then reflects on the ranked history
+    via a `FireflyAgent`. The LLM client is built lazily on first reflection – no client is created at
+    startup:
-proposer = AgentSolutionProposer(model="openai:gpt-4o")
-run = AgenticAutoML(proposer, max_iterations=10).solve(dataset)
-```
+    ```python
+    from fireflyframework_datascience.engineering.loop import AgenticAutoML, AgentSolutionProposer
-On each `propose_next`, the agent receives the task, the allowed trainers, and the best-first
-attempt history, and returns a structured `(trainer, params_json, rationale)`. If the model names a
-trainer outside the allowed list, the proposer falls back to the best trainer seen so far; malformed
-`params_json` degrades to `{}`. You can also inject a pre-built agent with `AgentSolutionProposer(agent=my_agent)`.
+    proposer = AgentSolutionProposer(model="openai:gpt-4o")
+    run = AgenticAutoML(proposer, max_iterations=10).solve(dataset)
+    ```
-### SequenceProposer – deterministic strategies
+    On each `propose_next`, the agent receives the task, the allowed trainers, and the best-first
+    attempt history (top 8), and returns a structured `(trainer, params_json, rationale)`. If the
+    model names a trainer outside the allowed list, the proposer falls back to the best trainer seen so
+    far; malformed `params_json` degrades to `{}`. You can also inject a pre-built agent with
+    `AgentSolutionProposer(agent=my_agent)`.
-For tests or fixed search plans, `SequenceProposer` replays a fixed candidate list – the first is the
-seed, the rest are dispensed one per `propose_next`:
+=== "SequenceProposer (deterministic)"
-```python
-from fireflyframework_datascience.engineering import SequenceProposer, SolutionCandidate
+    For tests or fixed search plans, `SequenceProposer` replays a fixed candidate list – the first is
+    the seed, the rest are dispensed one per `propose_next`:
-proposer = SequenceProposer([
-    SolutionCandidate("linear", rationale="cheap baseline"),
-    SolutionCandidate("random_forest", {"max_depth": 8}),
-])
-```
+    ```python
+    from fireflyframework_datascience.engineering import SequenceProposer, SolutionCandidate
+
+    proposer = SequenceProposer([
+        SolutionCandidate("linear", rationale="cheap baseline"),
+        SolutionCandidate("random_forest", {"max_depth": 8}),
+    ])
+    ```
To write your own, implement the `CandidateProposer` protocol: `propose_initial(dataset, trainers)`
and `propose_next(dataset, history, trainers)`.
+!!! tip "Which trainers are allowed"
+
+ The `trainers` list a proposer sees comes from the loop's registry. By default that is `linear`,
+ `random_forest`, and `hist_gradient_boosting`, plus `xgboost`, `lightgbm`, and `catboost` when
+ those optional libraries are installed. Pass `trainers=...` to `AgenticAutoML` to constrain or
+ extend the search space.
+
## The EngineeringRun trace
`solve` returns an `EngineeringRun` – a full, auditable trace plus the refit best model:
@@ -146,6 +213,9 @@ for a in run.attempts:
print(a.candidate.trainer, a.score, a.verdict.valid, a.verdict.reason)
```
+`n_iterations` and `valid_attempts` are derived from `attempts`, and `summary()` renders the one-line
+recap shown under [Quick start](#quick-start).
+
## Budgets: iterations and patience
The loop is greedy with two knobs:
@@ -166,9 +236,16 @@ loop = AgenticAutoML(
)
```
+!!! warning "Patience only counts after seeding"
+
+ The initial population from `propose_initial` is always fully evaluated; patience and
+ `max_iterations` bound only the reflection rounds that follow. A run with an empty or trivial seed
+ still respects the iteration budget.
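
To make the interaction of the two budgets concrete, here is a small simulation of the reflection rounds only. It is a sketch under assumptions: `reflection_rounds` is a hypothetical helper, each round is reduced to a `(score, valid)` pair, and running out of pairs stands in for the proposer returning `None`.

```python
def reflection_rounds(rounds, seed_best, max_iterations, patience):
    """Return (rounds_evaluated, best_score) for a sequence of (score, valid) pairs.

    Mirrors the greedy rule: a verified improvement resets patience, anything
    else spends one unit, and the loop stops when patience reaches zero.
    """
    best = seed_best          # best verified score found while seeding
    left = patience
    evaluated = 0
    for score, valid in rounds[:max_iterations]:   # iteration budget
        evaluated += 1
        if valid and score > best:
            best, left = score, patience           # improvement: reset patience
        else:
            left -= 1                              # no improvement: spend patience
            if left <= 0:
                break
    return evaluated, best


history = [(0.80, True), (0.79, True), (0.95, False), (0.70, True), (0.99, True)]
print(reflection_rounds(history, seed_best=0.75, max_iterations=8, patience=3))
print(reflection_rounds(history, seed_best=0.75, max_iterations=8, patience=4))
```

With `patience=3` the fifth candidate is never evaluated, because three consecutive non-improving rounds (one of them unverified) exhaust the budget first; one extra unit of patience lets the run reach it.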
+
## See also
- [Datasets](datasets.md) – the `Dataset` the loop searches over.
-- [Models & trainers](automl.md) – the trainer registry candidates draw from.
-- [Evaluation](automl.md) – metrics, scoring, and the trivial baseline.
-- [Preprocessing](automl.md) – the pipeline wrapped around every candidate.
+- [AutoML](automl.md) – the trainer registry, metrics, and the preprocessing pipeline wrapped around every candidate.
+- [GenAI features](genai-features.md) – other places the LLM proposes and the classical engine decides.
+- [LLM configuration](llm-configuration.md) – wiring the model behind `AgentSolutionProposer`.
+- [Architecture](architecture.md) – the ports and adapters the loop plugs into.
diff --git a/docs/architecture.md b/docs/architecture.md
index b139057..ac759b4 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -1,8 +1,8 @@
# Architecture
-**Firefly DataScience is a hexagonal, auto-configured data-science framework: a light DI container wires ports to adapters, and a Spring-Boot-style application context boots it all from packaging entry points.**
+**Firefly DataScience is a hexagonal, auto-configured data-science framework: a lean DI container wires ports to adapters, and a Spring-Boot-style application context boots it all from packaging entry points.**
-This page explains how the pieces fit together: the five layers, the ports-and-adapters (hexagonal) core, entry-point auto-configuration, the dependency-injection container, and the `FireflyDataScienceApplication` startup lifecycle.
+This page explains how the pieces fit together: the five layers, the ports-and-adapters (hexagonal) core, entry-point auto-configuration, the dependency-injection container, and the `FireflyDataScienceApplication` startup lifecycle. The design goal throughout is that the domain never depends on a vendor SDK, and that an adapter can be added, swapped, or overridden without touching calling code.

@@ -16,7 +16,15 @@ The framework is organised top-to-bottom so that the domain never depends on a v
4. **Domain / Ports** – protocol interfaces (e.g. `DatasetLoaderPort`) plus the light, dependency-free core types in `core.types` (`TaskType`, `Modality`, `Scope`).
5. **Adapters** – concrete implementations of the ports backed by optional extras (scikit-learn, OpenML, deep-learning, GenAI, ...), each gated by a condition.
-The core stays importable with **no** optional ML extra installed – vendor imports live inside adapters and `@bean` methods, never at module top level.
+The core stays importable with **no** optional ML extra installed – vendor imports live inside adapters and `@bean` methods, never at module top level. `core.types` enforces this with hand-written `StrEnum`s (`TaskType`, `Modality`, `Scope`) and no third-party ML imports.
+
+!!! firefly "The reproducible pattern – the LLM proposes; the classical engine decides"
+
+    The same separation that keeps vendor SDKs out of the domain keeps GenAI out of the decision
+    path. GenAI lives in **adapters** behind ports; the deterministic classical engine trains, scores
+    and selects. The architecture is what makes the rule enforceable: a GenAI adapter can only ever
+    *propose* – the container resolves a port, and the classical engine decides whether a proposal
+    survives a measured improvement over a seeded baseline.
## Hexagonal: ports and adapters
@@ -39,6 +47,8 @@ class SklearnDatasetLoader:
return getattr(datasets, f"load_{name}")()
```
+Each data-science port is declared as a `Protocol` in **its own domain module** (not in a central package): `DatasetLoaderPort` in `datasets`, `TrainerPort` in `models`, `AutoMLBackendPort` in `automl`, `FeatureEngineerPort` in `features`, `SearchPolicyPort` in `search`, `MetricsEvaluatorPort` in `evaluation`, `ValidatorPort` in `validation`, and `TrackerPort` / `RegistryPort` in `tracking`. Each is a contract the domain calls; the concrete class that fulfils it is decided at boot.
+
The adapter is contributed by an auto-configuration, gated on the optional dependency being importable:
```python
@@ -55,12 +65,30 @@ from fireflyframework_datascience.datasets import DatasetLoaderPort
@configuration
class DatasetsAutoConfiguration:
    @bean(name="sklearn_dataset_loader")
-    def sklearn_loader(self) -> DatasetLoaderPort:
+    def sklearn_loader(self) -> DatasetLoaderPort:  # (1)!
        from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader
        return SklearnDatasetLoader()
```
-The `@bean` method's **return annotation is the provided type** – `DatasetLoaderPort` here – so the container registers `SklearnDatasetLoader` under the port. Resolving `DatasetLoaderPort` yields whichever adapter won.
+1. The `@bean` method's **return annotation is the provided type** – `DatasetLoaderPort` here – so the container registers `SklearnDatasetLoader` under the port. Resolving `DatasetLoaderPort` yields whichever adapter won. (At boot, `_apply_one` reads `get_type_hints(method)["return"]`; a `@bean` method with no return annotation is skipped.)
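
The return-annotation lookup can be reproduced with the standard library alone. The sketch below is illustrative: `LoaderPort`, `SketchConfiguration`, and `provided_type` are hypothetical names, and the real `_apply_one` may handle more cases than this.

```python
from typing import Protocol, get_type_hints


class LoaderPort(Protocol):
    """Hypothetical port, standing in for something like DatasetLoaderPort."""

    def load(self, name: str) -> object: ...


class SketchConfiguration:
    def annotated(self) -> LoaderPort:
        return object()  # placeholder adapter instance

    def unannotated(self):
        return object()


def provided_type(method):
    """Return the method's return annotation, or None when there is none."""
    return get_type_hints(method).get("return")


print(provided_type(SketchConfiguration.annotated))    # the port the bean would be registered under
print(provided_type(SketchConfiguration.unannotated))  # None: nothing to register under
```

Using `.get("return")` instead of indexing is a choice made for this sketch so that the missing-annotation case yields `None` rather than raising `KeyError`.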
+
+### Key types
+
+A small, stable vocabulary spans the wiring layer. These are the names you actually import:
+
+| Type / decorator | Module | Role |
+| --- | --- | --- |
+| `Container` | `container.container` | The IoC container; resolution by type annotation. |
+| `Scope` | `core.types` | `SINGLETON` (cached, default) or `TRANSIENT` (new each resolve). |
+| `@configuration` / `@bean` | `container.stereotypes` | Mark a class as holding factory methods; mark a method as a bean factory. |
+| `@component` | `container.stereotypes` | Mark a class as injectable by its own type. |
+| `@auto_configuration` | `container.conditions` | Mark a class discoverable via the entry-point group. |
+| `@order` | `core.ordering` | Set ordering (lower runs/resolves first). |
+| `ConditionContext` | `container.conditions` | What a condition is evaluated against (`config` + `container`). |
+| `ApplicationContext` | `application` | A started app: the loaded config plus the wired container. |
+| `WiringError` | `core.exceptions` | Raised on ambiguous, missing, or circular dependencies. |
+
+The `@bean` decorator defaults to `scope=Scope.SINGLETON` and `primary=False`; pass `name=`, `scope=`, or `primary=` to override. `@component` and the container's `register_*` methods share the same defaults.
## Entry-point auto-configuration
@@ -73,7 +101,7 @@ datasets = "fireflyframework_datascience.datasets.auto_configuration:DatasetsAut
models = "fireflyframework_datascience.models.auto_configuration:ModelsAutoConfiguration"
```
-At startup `discover_auto_configurations()` loads every class in the group, tolerating any whose optional extra is missing (it is simply skipped), then sorts them by `@order`:
+At startup `discover_auto_configurations()` loads every class in the group, tolerating any whose optional extra is missing (it is simply skipped – its `@conditional_on_class` would have excluded it anyway), then sorts them by `@order`:
```python
from fireflyframework_datascience.core.plugin import discover_auto_configurations
@@ -82,6 +110,25 @@ for cls in discover_auto_configurations():
print(cls.__name__)
```
+`CoreAutoConfiguration` is the always-on reference example: it has no `@conditional_on_class`, so it always applies, and it registers a single `RuntimeInfo` bean snapshotting the framework version, Python version, platform, default ML framework, and whether GenAI is enabled:
+
+```python
+@auto_configuration
+@configuration
+class CoreAutoConfiguration:
+    @bean
+    def runtime_info(self, config: FireflyDataScienceConfig) -> RuntimeInfo:  # (1)!
+        return RuntimeInfo(
+            framework_version=__version__,
+            python_version=platform.python_version(),
+            platform=platform.platform(),
+            default_ml_framework=config.default_ml_framework,
+            genai_enabled=config.genai.enabled,
+        )
+```
+
+1. The method's only parameter, `config`, is filled by type hint: `FireflyDataScienceConfig` is already registered as a bean (the application context registers it first), so the container injects it when it calls the factory.
+
### Conditions
Conditions gate both whole auto-configurations and individual `@bean` methods. Each is evaluated against a `ConditionContext` (the loaded config plus the partially-wired container):
@@ -95,24 +142,52 @@ from fireflyframework_datascience.container.conditions import (
)
```
-`conditional_on_missing_bean(DatasetLoaderPort)` is the **secure-by-default override hook**: a framework default applies only when you have not already registered your own.
+- `conditional_on_class("sklearn")` matches when `importlib.util.find_spec("sklearn")` resolves – i.e. the optional extra is installed.
+- `conditional_on_property("genai.enabled")` reads a dotted path off the config; with no `having_value` it matches when the value is truthy (`"1"`, `"true"`, `"yes"`, `"on"`, or any truthy object), and `match_if_missing=True` controls behaviour when the key is absent.
+- `conditional_on_bean(SomePort)` / `conditional_on_missing_bean(SomePort)` query the partially-wired container – so ordering (`@order`) decides what is already present when a condition runs.
+
+`conditional_on_missing_bean(DatasetLoaderPort)` is the **secure-by-default override hook**: a framework default applies only when you have not already registered your own. Because conditions see the live container, registering your adapter first (lower `@order`, or via `extra_auto_configurations`) is enough to win.
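
The first two checks described above are simple enough to sketch with the standard library. This is an approximation, not the framework's code: `class_available` and `property_matches` are hypothetical names, the config is modelled as a nested `dict`, and treating strings case-insensitively is an assumption.

```python
import importlib.util

_TRUTHY_STRINGS = {"1", "true", "yes", "on"}


def class_available(module_name: str) -> bool:
    """True when a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


def property_matches(config: dict, dotted: str, match_if_missing: bool = False) -> bool:
    """Walk a dotted path through nested dicts and test the value for truthiness."""
    node = config
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return match_if_missing        # key absent: fall back to the flag
        node = node[key]
    if isinstance(node, str):
        # Assumption: strings count as truthy only when they spell a known "on" value.
        return node.strip().lower() in _TRUTHY_STRINGS
    return bool(node)


print(class_available("json"))
print(property_matches({"genai": {"enabled": "yes"}}, "genai.enabled"))
print(property_matches({}, "genai.enabled", match_if_missing=True))
```

`find_spec` only locates the module, so probing for an optional extra this way does not pay the import cost or trigger its side effects.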
## The DI container
-`Container` is a lean IoC container; resolution is by type annotation, with constructor injection and circular-dependency detection.
+`Container` is a lean IoC container; resolution is by type annotation, with constructor injection and circular-dependency detection. There are three ways to register a bean:
-```python
-from fireflyframework_datascience.container.container import Container
-from fireflyframework_datascience.core.types import Scope
+=== "Register an instance"
+
+ ```python
+ from fireflyframework_datascience.container.container import Container
+
+ container = Container()
+ container.register_instance(DatasetLoaderPort, SklearnDatasetLoader()) # (1)!
+ ```
-container = Container()
+ 1. Register an already-constructed object as a singleton. Use this when you built the
+ instance yourself (e.g. the application context registers the loaded config this way).
-# Three ways to register a bean:
-container.register_instance(DatasetLoaderPort, SklearnDatasetLoader()) # pre-built
-container.register_type(SklearnDatasetLoader, scope=Scope.SINGLETON) # ctor-injected
-container.register_factory(DatasetLoaderPort, lambda: SklearnDatasetLoader()) # factory
+=== "Register a type"
-loader = container.resolve(DatasetLoaderPort) # single bean (honours @primary)
+ ```python
+ from fireflyframework_datascience.core.types import Scope
+
+ container.register_type(SklearnDatasetLoader, scope=Scope.SINGLETON) # (1)!
+ ```
+
+ 1. Register a class; its constructor parameters are resolved by type hint on demand.
+ Pass `provided_type=` to register it under a port rather than its own class.
+
+=== "Register a factory"
+
+ ```python
+ container.register_factory(DatasetLoaderPort, lambda: SklearnDatasetLoader()) # (1)!
+ ```
+
+ 1. Register a callable whose own parameters are injected by type hint. `@bean` methods are
+ registered this way under the hood.
+
+Resolution mirrors the three shapes you need in practice:
+
+```python
+loader = container.resolve(DatasetLoaderPort) # single bean (honours @primary)
maybe = container.resolve_optional(DatasetLoaderPort) # None if absent
allof = container.resolve_all(DatasetLoaderPort) # every bean, sorted by @order
```
@@ -121,9 +196,16 @@ Key behaviours:
- **Scopes** – `Scope.SINGLETON` (cached, the default) and `Scope.TRANSIENT` (new each resolve).
- **Ambiguity** – multiple beans for one type require exactly one marked `primary=True`, else `resolve` raises `WiringError`. Resolve by name with `resolve_by_name(...)` to disambiguate.
-- **Injection** – constructor / factory parameters are filled by type hint; `Optional[X]` / `X | None` params resolve to `None` when no bean exists.
+- **Injection** – constructor / factory parameters are filled by type hint; `Optional[X]` / `X | None` params resolve to `None` when no bean exists, and a parameter with a default is left to its default when no matching bean is found.
+- **Circular dependencies** – detected during construction; a cycle raises `WiringError` rather than recursing.
- **Fail-fast** – `eager_init()` instantiates every singleton at boot, validating the whole wiring graph before your code runs.
+!!! note "Resolution is by annotation, not by name"
+
+ `resolve(...)` looks up registrations by the *provided type*. Names are a side channel:
+ `register_*` accept a `name=`, and `resolve_by_name(...)` / `bean_names()` work off it. A bean
+ with no usable return annotation is never registered (the application context skips it).
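
Resolution by annotation and cycle detection can be shown with a toy container. This is a deliberately small sketch, not the framework's `Container`: `TinyContainer` is hypothetical, it supports singleton factories only, and its `WiringError` is a local stand-in for the class of the same name.

```python
from typing import Any, Callable, get_type_hints


class WiringError(Exception):
    """Local stand-in for the framework's error of the same name."""


class TinyContainer:
    def __init__(self) -> None:
        self._factories: dict[type, Callable[..., Any]] = {}
        self._singletons: dict[type, Any] = {}
        self._resolving: list[type] = []   # types currently under construction

    def register_factory(self, provided: type, factory: Callable[..., Any]) -> None:
        self._factories[provided] = factory

    def resolve(self, wanted: type) -> Any:
        if wanted in self._singletons:
            return self._singletons[wanted]
        if wanted not in self._factories:
            raise WiringError(f"no bean for {wanted.__name__}")
        if wanted in self._resolving:
            chain = " -> ".join(t.__name__ for t in [*self._resolving, wanted])
            raise WiringError(f"circular dependency: {chain}")
        self._resolving.append(wanted)
        try:
            factory = self._factories[wanted]
            hints = get_type_hints(factory)
            hints.pop("return", None)      # only parameters are injected
            instance = factory(**{name: self.resolve(t) for name, t in hints.items()})
        finally:
            self._resolving.pop()
        self._singletons[wanted] = instance  # cache: singleton scope
        return instance


class Config:
    pass


class Loader:
    def __init__(self, config: Config) -> None:
        self.config = config


def make_loader(config: Config) -> Loader:
    return Loader(config)


container = TinyContainer()
container.register_factory(Config, lambda: Config())
container.register_factory(Loader, make_loader)
loader = container.resolve(Loader)
print(loader.config is container.resolve(Config))
```

The `_resolving` stack is what turns a would-be infinite recursion into an error naming the cycle; a real container adds names, `primary` flags, scopes, and optional parameters on top of the same idea.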
+
## The application lifecycle
`FireflyDataScienceApplication.start()` runs a fixed sequence, mirroring pyfly's lifecycle:
@@ -131,7 +213,7 @@ Key behaviours:
1. Load config (`FireflyDataScienceConfig.load`) – unless one is passed in.
2. Print the banner.
3. Create the `Container` and register the config as a bean.
-4. Discover auto-configurations (entry points + any extras), de-duplicate, sort by `@order`.
+4. Discover auto-configurations (entry points + any extras), de-duplicate while preserving order, sort by `@order`.
5. Evaluate each auto-configuration's conditions; for those that pass, instantiate the class and register every `@bean` method whose own conditions also pass.
6. `eager_init()` all singletons (fail-fast).
7. Print the wiring summary and return a ready `ApplicationContext`.
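
Step 4 of the sequence above (de-duplicate while preserving order, then sort) can be sketched as follows. The names are illustrative: `plan_auto_configurations` is hypothetical, and reading the order from a `sketch_order` class attribute with a default of `0` is an assumption about how `@order` might be stored, not the framework's mechanism.

```python
def plan_auto_configurations(discovered, extras=()):
    """De-duplicate keeping first-seen order, then sort by each class's order value."""
    # dict.fromkeys keeps insertion order, so the first occurrence of a class wins.
    unique = list(dict.fromkeys([*discovered, *extras]))
    # sorted() is stable: classes with equal order stay in discovery order.
    return sorted(unique, key=lambda cls: getattr(cls, "sketch_order", 0))


class Core:
    sketch_order = -100   # lower runs first


class Datasets:
    sketch_order = 0


class Models:
    sketch_order = 0


planned = plan_auto_configurations([Datasets, Models, Core], extras=[Models])
print([cls.__name__ for cls in planned])
```

Preserving discovery order before the stable sort matters because it makes ties deterministic: two auto-configurations with the same order value are applied in the order they were found.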
@@ -145,10 +227,27 @@ ctx = FireflyDataScienceApplication.run()
print(ctx.bean_count, "beans")
print([ac.__name__ for ac in ctx.applied_auto_configurations])
-loader = ctx.get(DatasetLoaderPort) # resolve a bean by type
+loader = ctx.get(DatasetLoaderPort) # resolve a bean by type
tracker = ctx.get_optional(SomeOptionalPort) # None if not wired
```
+When the banner is on, boot ends by printing the wiring summary – a quick check that the expected adapters were applied:
+
+!!! success "Expected"
+
+ ```text
+ Firefly DataScience is ready.
+ profiles : default
+ beans : 7
+ auto-config : 3 applied (CoreAutoConfiguration, DatasetsAutoConfiguration, ModelsAutoConfiguration)
+ ml framework : sklearn
+ genai : disabled
+ sandbox : ...
+ ```
+
+ The exact bean count and applied list depend on which optional extras are installed; the line
+ *shape* (profiles, beans, auto-config, ml framework, genai, sandbox) is fixed.
+
You can steer the boot without forking the framework:
```python
@@ -162,12 +261,11 @@ ctx = FireflyDataScienceApplication.run(
Passing `auto_configurations=[...]` **replaces** discovery entirely (handy for hermetic tests); `extra_auto_configurations=[...]` **appends** to whatever was discovered. The returned `ApplicationContext` exposes `.config`, `.container`, `.applied_auto_configurations`, `.bean_count`, and the `get` / `get_optional` resolvers.
-## See also
+!!! tip "Quiet boots and hermetic tests"
-- [Getting started](quickstart.md)
-- [Configuration](./configuration.md)
-- [Ports and adapters reference](index.md)
-- [Writing an auto-configuration](index.md)
+ Pass `print_output=False` to silence the banner and wiring summary, and
+ `auto_configurations=[...]` to pin an exact set of auto-configurations β together they make the
+ context fully deterministic for tests, with no dependence on which extras happen to be installed.
## Auto-configuration flow
@@ -176,3 +274,11 @@ Adapters self-register via the `firefly_datascience.auto_configuration` entry-po
+
+## See also
+
+- [Quickstart](quickstart.md): boot the application context in one call.
+- [Configuration](configuration.md): the `FireflyDataScienceConfig` that conditions read.
+- [AutoML](automl.md): what the wired ports drive end to end.
+- [GenAI features](genai-features.md): the gated adapters behind the ports.
+- [Security](security.md): the override and sandbox guarantees this wiring underpins.
diff --git a/docs/automl.md b/docs/automl.md
index 2987a55..45f87c5 100644
--- a/docs/automl.md
+++ b/docs/automl.md
@@ -7,7 +7,14 @@ trainer that supports the task (optionally tuning each one), and returns a fitte
leaderboard. It is import-light: scikit-learn is only loaded when you actually call `fit`, so
`from fireflyframework_datascience.automl import AutoML` stays cheap.
-
+!!! firefly "The LLM proposes; the classical engine decides"
+
+ `AutoML` is pure classical machine learning: deterministic, seeded, and reproducible. Where GenAI
+ enters elsewhere in the framework, it only ever *proposes* (seeds, bounds, candidate features);
+ this engine *decides* by cross-validated score. The search is owned by Optuna and scikit-learn,
+ never by a language model.
+
+
## Quick start
@@ -38,11 +45,27 @@ result = AutoML().fit(train, task=TaskType.BINARY, metric="f1")
For each trainer that `supports(task)`, `AutoML`:
-1. Builds the trainer's hyperparameter search space (skipped when `n_trials <= 1`).
+1. Builds the trainer's hyperparameter search space, but only when `n_trials > 1`; with `n_trials <= 1`
+ the space is empty and the search collapses to a single default-hyperparameter evaluation.
2. Runs the search policy, whose objective wraps the estimator in a preprocessing pipeline and
cross-validates it (`cross_val_score`, `cv` folds).
3. Records a `LeaderboardEntry` with the best CV score.
-4. Refits the highest-scoring trainer on the full training data and wraps it as a `Model`.
+4. After all candidates are scored, refits the highest-scoring trainer on the full training data and
+ wraps it as a `Model`.
+
+```python
+space = trainer.param_space(task) if n_trials > 1 else {} # (1)!
+result = search_policy.optimize(objective, space, n_trials=n_trials, seed=random_state) # (2)!
+leaderboard.append(LeaderboardEntry(trainer.name, dict(result.best_params), result.best_score, metric))
+if best is None or result.best_score > best[0]: # (3)!
+ best = (result.best_score, trainer, dict(result.best_params))
+```
+
+1. No tuning budget means no space to search, so the policy evaluates the estimator's defaults once.
+2. The CV objective returns the mean fold score. A candidate that raises during CV is logged and scored
+ `-inf`, so one broken estimator never aborts the whole run.
+3. Selection is strictly by CV score (greater is better, always). The winning trainer is then refit on
+ `dataset.X, dataset.y` inside the same preprocessing pipeline.
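
The "a failing candidate scores `-inf`" behaviour from annotation 2 can be sketched without scikit-learn. The helper names below (`safe_objective`, `good_folds`, `broken_folds`) are hypothetical and the fold scores are made up; the point is only to show why one broken estimator cannot win or abort the loop.

```python
import logging
import math
from statistics import mean

logger = logging.getLogger("automl_sketch")


def safe_objective(score_folds):
    """Wrap a per-fold scoring callable so that any failure becomes -inf."""

    def objective(params):
        try:
            return mean(score_folds(params))  # mean fold score, greater is better
        except Exception:
            logger.exception("candidate failed during CV; scoring it -inf")
            return -math.inf

    return objective


def good_folds(params):
    return [0.90, 0.92, 0.91]  # illustrative fold scores


def broken_folds(params):
    raise ValueError("estimator could not be fitted")


candidates = {"broken": broken_folds, "good": good_folds}
best = None
scores = {}
for name, folds in candidates.items():
    score = safe_objective(folds)({})
    scores[name] = score
    if best is None or score > best[0]:  # selection strictly by CV score
        best = (score, name)
```

Because `-inf` compares below every finite score, the broken candidate still appears in `scores` but can only be selected if every other candidate fails too.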
The preprocessing pipeline is built automatically from the column dtypes: numeric columns get median
imputation + `StandardScaler`; categorical columns get most-frequent imputation + `OneHotEncoder`
@@ -71,20 +94,35 @@ result = automl.fit(train)
```
The constructor accepts `trainers`, `evaluator`, `search_policy`, `validator`, `tracker`, plus the
-`cv`, `n_trials`, and `random_state` knobs. Anything left as `None` falls back to sensible defaults:
-`[RandomForestTrainer(), LinearTrainer(), HistGradientBoostingTrainer()]`, the `SklearnMetricsEvaluator`,
-and the `DefaultSearchPolicy`.
+`cv`, `n_trials`, and `random_state` knobs (defaults `cv=5`, `n_trials=20`, `random_state=42`). Anything
+left as `None` falls back to sensible defaults: `[RandomForestTrainer(), LinearTrainer(), HistGradientBoostingTrainer()]`,
+the `SklearnMetricsEvaluator`, and the `DefaultSearchPolicy`. A `validator` and `tracker` stay `None`
+unless you supply them. When present, the validator runs first and raises on failure, and the tracker
+logs the winner's params, CV score, and model artifact.
-### DI-wired construction
+### Wiring it up
-In an application, resolve the components from a started `ApplicationContext` instead of wiring them by hand:
+=== "Imperative / notebook"
-```python
-automl = AutoML.from_context(app, cv=10, n_trials=50)
-```
+ Construct the engine by hand and call `fit` directly β ideal for exploration:
+
+ ```python
+ automl = AutoML(cv=10, n_trials=50)
+ result = automl.fit(train)
+ ```
+
+=== "Declarative / DI"
-`from_context` pulls every registered `TrainerPort` from the container and resolves the optional
-evaluator, search policy, validator, and tracker. Keyword `overrides` win over the resolved components.
+ In an application, resolve the components from a started `ApplicationContext` instead of wiring
+ them by hand:
+
+ ```python
+ automl = AutoML.from_context(app, cv=10, n_trials=50)
+ ```
+
+ `from_context` pulls every registered `TrainerPort` from the container and resolves the optional
+ evaluator, search policy, validator, and tracker. Keyword `overrides` win over the resolved
+ components, and missing evaluator/search policy fall back to the same defaults as the constructor.
## Trainers
@@ -104,20 +142,27 @@ The boosting-library trainers (`xgboost`, `lightgbm`, `catboost`) import their b
pay for the extra you install. A trainer exposes three methods used by the engine:
```python
-trainer.supports(task) # -> bool
-trainer.make_estimator(task, params) # -> unfitted estimator
-trainer.param_space(task) # -> ParamSpace
+trainer.supports(task) # -> bool
+trainer.make_estimator(task, params) # -> unfitted estimator (sensible defaults merged with params)
+trainer.param_space(task) # -> ParamSpace
```
+!!! note "Defaults are baked in, not magic"
+
+ `make_estimator` merges your `params` over each trainer's defaults β for example `RandomForestTrainer`
+ starts from `n_estimators=200, n_jobs=-1, random_state=42`, and `HistGradientBoostingTrainer` from
+ `learning_rate=0.1, max_iter=200, random_state=42`. The `param_space` only widens the dimensions worth
+ tuning (e.g. `n_estimators`, `max_depth`, `max_features` for random forest).
+
## Search policies
A search policy optimizes the cross-validation objective over a trainer's `ParamSpace`. Scores are always
"greater is better" (the evaluator maps loss-style metrics to negated sklearn scorers).
- **`DefaultSearchPolicy`** (`name="default"`) evaluates the estimator's default hyperparameters once, which is fast
- and fully deterministic. This is the engine default.
+ and fully deterministic. This is the engine default, and it reports `n_trials=1`.
- **`OptunaSearchPolicy`** (`name="optuna"`) runs seeded Bayesian optimization (TPE). The space spec drives
- the suggestions; if the space is empty it degrades to a single default evaluation.
+ the suggestions; if the space is empty it degrades to a single default evaluation (`n_trials=1`).
```python
from fireflyframework_datascience.search.adapters import OptunaSearchPolicy
@@ -132,35 +177,47 @@ result = OptunaSearchPolicy().optimize(objective, space, n_trials=40, seed=42)
print(result.best_params, result.best_score, result.n_trials)
```
-Both policies return a `SearchResult(best_params, best_score, n_trials)`. The seeded sampler keeps the search
-reproducible β classical HPO owns the search, not an LLM.
+Both policies return a `SearchResult(best_params, best_score, n_trials)`. The seeded TPE sampler
+(`TPESampler(seed=seed)`) keeps the search reproducible: classical HPO owns the search, not an LLM.
## Metrics
The default `SklearnMetricsEvaluator` (`name="sklearn"`) supplies CV scoring names and a panel of held-out
metrics:
-- **Classification**: `accuracy`, `f1` (weighted), `precision`, `recall`, plus `roc_auc` and `log_loss` when
- probabilities are available.
+- **Classification**: `accuracy`, `f1` (weighted), `precision` (weighted), `recall` (weighted), plus `roc_auc`
+ and `log_loss` when probabilities are available.
- **Regression**: `rmse`, `mae`, `r2`.
```python
ev = result.evaluator
-ev.default_metric(result.task) # "roc_auc" for binary
+ev.default_metric(result.task) # "roc_auc" for binary, "accuracy" multiclass, "rmse" regression
ev.scoring_name(result.task, "f1") # "f1_weighted" (the CV scorer)
ev.greater_is_better("rmse") # False
```
+The leaderboard and CV objective use the *scoring* name, not the raw metric: `f1` maps to the
+`f1_weighted` scorer, `rmse` to `neg_root_mean_squared_error`, and binary `roc_auc` stays `roc_auc` while
+multiclass `roc_auc` becomes `roc_auc_ovr_weighted`. This is why CV scores are always maximized: a lower
+RMSE shows up as a larger (less negative) `neg_root_mean_squared_error`.
+
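The metric-to-scorer mapping described above can be written out as a small lookup. This is a sketch, not the evaluator's code: `scoring_name` here is a free function with plain string task names (the real method takes a `TaskType`), and the RMSE values are illustrative. The scorer strings themselves are the scikit-learn names quoted in the text.

```python
SCORING_NAMES = {
    "f1": "f1_weighted",
    "rmse": "neg_root_mean_squared_error",
}


def scoring_name(task: str, metric: str) -> str:
    """Map a human-facing metric to a scoring string (task names are assumed)."""
    if metric == "roc_auc":
        return "roc_auc" if task == "binary" else "roc_auc_ovr_weighted"
    return SCORING_NAMES.get(metric, metric)


# Negating a loss turns "lower is better" into "greater is better".
rmse_by_model = {"model_a": 56.46, "model_b": 54.10}  # illustrative values
cv_scores = {name: -rmse for name, rmse in rmse_by_model.items()}
winner = max(cv_scores, key=cv_scores.get)  # the model with the lowest RMSE
```

With every score expressed as "greater is better", the selection loop needs a single comparison regardless of the metric.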
+!!! tip "Two scores, one winner"
+
+ `result.metric` is the human-facing metric name (e.g. `roc_auc`); `result.cv_scoring` is the sklearn
+ scoring string actually used for cross-validation. The leaderboard's `cv_score` is the mean CV score
+ under that scorer, and `evaluate(test)` recomputes the full panel on held-out data.
+
## The `AutoMLResult` API
`fit` returns an `AutoMLResult` carrying the fitted winner, the sorted leaderboard, and the evaluator used:
```python
result.best_model # Model: name, estimator, task, feature_names, params
-result.best_score # top leaderboard cv_score
+result.best_score # leaderboard[0].cv_score (the top CV score)
result.leaderboard # list[LeaderboardEntry] sorted best-first
result.metric # primary metric name
result.task # TaskType
+result.cv_scoring # sklearn scoring string used during CV
result.predict(test.X) # winner predictions
result.predict_proba(test.X) # class probabilities (classification)
@@ -171,13 +228,28 @@ print(result.leaderboard_table()) # one line per candidate
```
Each `LeaderboardEntry` holds `model_name`, `params`, `cv_score`, and `metric`, and prints as a tidy
-`model_name metric=score` line. `evaluate` automatically passes probabilities through for classification
-when the winning estimator exposes `predict_proba`.
+`model_name metric=score` line (the name is left-padded to 24 columns, the score to 4 decimals).
+`best_score` is a property that reads the top entry, so it always agrees with the first row of the table.
+`evaluate` automatically passes probabilities through for classification when the winning estimator exposes
+`predict_proba`.
+
+!!! success "Expected"
+
+ `result.leaderboard_table()` on the breast-cancer quick-start prints one line per candidate, sorted
+ best CV score first. Each line is the `LeaderboardEntry.__str__` format: the trainer `name`
+ left-padded to 24 columns, then `metric=score` to 4 decimals (values vary slightly by environment
+ and library versions):
+
+ ```text
+ hist_gradient_boosting roc_auc=0.9921
+ random_forest roc_auc=0.9907
+ linear roc_auc=0.9886
+ ```
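
A leaderboard line of the shape shown above could be produced along these lines. This is a hedged sketch: `EntrySketch` is a hypothetical stand-in for `LeaderboardEntry`, and it assumes the name is left-aligned in a 24-character field followed by a single space, which is one reading of the padding described here.

```python
from dataclasses import dataclass, field


@dataclass
class EntrySketch:
    """Hypothetical stand-in for LeaderboardEntry (same four fields)."""

    model_name: str
    cv_score: float
    metric: str
    params: dict = field(default_factory=dict)

    def __str__(self) -> str:
        # Name in a 24-character field, then metric=score to 4 decimals.
        return f"{self.model_name:<24} {self.metric}={self.cv_score:.4f}"


entries = [
    EntrySketch("linear", 0.9886, "roc_auc"),
    EntrySketch("hist_gradient_boosting", 0.9921, "roc_auc"),
    EntrySketch("random_forest", 0.9907, "roc_auc"),
]
# Sort best-first; the top row is what a `best_score` property would read.
leaderboard = sorted(entries, key=lambda e: e.cv_score, reverse=True)
table = "\n".join(str(e) for e in leaderboard)
best_score = leaderboard[0].cv_score
```

Deriving `best_score` from the sorted list rather than storing it separately is what keeps it in agreement with the first row of the table.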
## See also
-- [Datasets and loaders](./datasets.md)
-- [Models and trainers](automl.md)
-- [Hyperparameter tuning](index.md)
-- [Evaluation and metrics](index.md)
-- [GenAI + classical fusion](genai-features.md)
+- [Datasets and loaders](datasets.md): build the `Dataset` you feed to `fit`.
+- [GenAI + classical fusion](genai-features.md): how the LLM proposes and this engine decides.
+- [The agentic loop](agentic-loop.md): the cost-benefit gate around GenAI proposals.
+- [Serving the winner](serving.md): deploy `result.best_model`.
+- [Benchmarks](benchmarks.md): measured leaderboard results on real datasets.
diff --git a/docs/benchmarks.md b/docs/benchmarks.md
index 7244c32..8ac1357 100644
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -1,8 +1,15 @@
# Benchmarks & Datasets
-**A three-tier evaluation strategy: credible public benchmarks, fast CI smoke datasets, and agentic capability suites β fed by pluggable dataset loaders.**
+**A three-tier evaluation strategy (credible public benchmarks, fast CI smoke datasets, and agentic capability suites) fed by pluggable dataset loaders, with every published number produced by a bundled, runnable harness.**
-Firefly DataScience separates *how we prove the framework is good* from *how we load data day-to-day*. The same `DatasetLoaderPort` that powers a quick `iris` smoke test in CI also pulls real OpenML benchmark suites for credibility runs. This page describes the evaluation roadmap and shows how to load datasets through the loaders.
+Firefly DataScience separates *how we prove the framework is good* from *how we load data day-to-day*. The same `DatasetLoaderPort` that powers a quick `iris` smoke test in CI also pulls real OpenML benchmark suites for credibility runs. This page describes the evaluation strategy, shows how to load datasets through the loaders, and lists the real, reproducible results. Every figure here was produced by running a script in `benchmarks/` with no manual tuning; see the full table in [`benchmarks/RESULTS.md`](https://github.com/fireflyframework/fireflyframework-datascience/blob/main/benchmarks/RESULTS.md).
+
+!!! firefly "The recurring thesis: the LLM proposes; the classical engine decides"
+
+ GenAI proposes feature code; a deterministic classical engine measures the cross-validated lift;
+ and a **cost/benefit gate keeps only what is proven on the data**. That is why the GenAI ablation
+ below can only improve or stay neutral, never regress. The benchmarks measure both the classical
+ core and the gated accelerator on the same footing.
## The three tiers
@@ -18,57 +25,61 @@ Tier 2 is the only tier that runs without network access, which is why it backs
Two loaders ship today, both implementing `DatasetLoaderPort` (`name`, `can_load`, `load`).
-### Tier 2 β scikit-learn (offline, no network)
+=== "Tier 2: scikit-learn (offline, no network)"
-`SklearnDatasetLoader` resolves bare names or `sklearn:`-prefixed names against scikit-learn's bundled datasets. No download, fully deterministic.
+ `SklearnDatasetLoader` resolves bare names or `sklearn:`-prefixed names against scikit-learn's
+ bundled datasets. No download, fully deterministic.
-```python
-from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader
+ ```python
+ from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader
-loader = SklearnDatasetLoader()
-loader.can_load("breast_cancer") # True
-loader.can_load("sklearn:diabetes") # True (prefix is stripped)
+ loader = SklearnDatasetLoader()
+ loader.can_load("breast_cancer") # True
+ loader.can_load("sklearn:diabetes") # True (prefix is stripped)
-ds = loader.load("breast_cancer")
-print(ds.name, ds.task, ds.n_rows, ds.n_features)
-# breast_cancer TaskType.BINARY 569 30
-```
+ ds = loader.load("breast_cancer")
+ print(ds.name, ds.task, ds.n_rows, ds.n_features)
+ # breast_cancer TaskType.BINARY 569 30
+ ```
-The built-in Tier 2 names map to fixed task types:
+ The built-in Tier 2 names map to fixed task types:
-```python
-# binary -> breast_cancer
-# multiclass -> iris, wine, digits
-# regression -> diabetes, california_housing
-```
+ ```python
+ # binary -> breast_cancer
+ # multiclass -> iris, wine, digits
+ # regression -> diabetes, california_housing
+ ```
-Each `load` returns a `Dataset` dataclass with `X`, `y`, `task`, `target_name`, `feature_names`, and a `metadata` dict (`source`, `n_rows`, `n_features`).
+ Each `load` returns a `Dataset` dataclass with `X`, `y`, `task`, `target_name`, `feature_names`,
+ and a `metadata` dict (`source`, `n_rows`, `n_features`).
-### Tier 1 β OpenML (benchmark suites, network)
+=== "Tier 1: OpenML (benchmark suites, network)"
-`OpenMLDatasetLoader` fetches by numeric id or by name using the `openml:` prefix. It needs the `data` extra (`openml`) and network access; without the extra it raises `AdapterUnavailableError`.
+ `OpenMLDatasetLoader` fetches by numeric id or by name using the `openml:` prefix. It needs the
+ `data` extra (`openml`) and network access; without the extra it raises `AdapterUnavailableError`.
-```python
-from fireflyframework_datascience.datasets.adapters import OpenMLDatasetLoader
+ ```python
+ from fireflyframework_datascience.datasets.adapters import OpenMLDatasetLoader
-loader = OpenMLDatasetLoader()
-loader.can_load("openml:31") # True
-loader.can_load("breast_cancer") # False (no openml: prefix)
+ loader = OpenMLDatasetLoader()
+ loader.can_load("openml:31") # True
+ loader.can_load("breast_cancer") # False (no openml: prefix)
-ds = loader.load("openml:31") # by id, e.g. the 'credit-g' task
-ds = loader.load("openml:credit-g") # by name
-ds = loader.load("openml:31", target="class") # override the default target
+ ds = loader.load("openml:31") # by id, e.g. the 'credit-g' task
+ ds = loader.load("openml:credit-g") # by name
+ ds = loader.load("openml:31", target="class") # override the default target
-print(ds.metadata["openml_id"], ds.task)
-```
+ print(ds.metadata["openml_id"], ds.task)
+ ```
-OpenML dataset ids are how Tier 1 suites (OpenMLβCC18, OpenMLβCTR23, AMLB) are addressed β each suite is a curated set of these ids, so a credibility run is "load each id, fit, score, compare".
+ OpenML dataset ids are how Tier 1 suites (OpenML-CC18, OpenML-CTR23, AMLB) are addressed: each
+ suite is a curated set of these ids, so a credibility run is "load each id, fit, score, compare".
-Install the extra:
+ Install the extra:
-```bash
-pip install "fireflyframework-datascience[data]"
-```
+ ```bash
+ pip install "fireflyframework-datascience[data]"
+ ```
### Working with a loaded `Dataset`
@@ -77,14 +88,16 @@ pip install "fireflyframework-datascience[data]"
```python
ds = SklearnDatasetLoader().load("iris")
-train, test = ds.train_test_split(test_size=0.25, random_state=42)
-# classification targets are stratified automatically
+train, test = ds.train_test_split(test_size=0.25, random_state=42) # (1)!
print(train.name, test.name) # iris[train] iris[test]
ds.has_target # True
ds.task.is_classification() # True
```
+1. Classification targets are stratified automatically, so each split preserves the class balance,
+ which matters on the small (~1000-row) datasets used in the Tier-1 runs below.
+
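To make "stratified" concrete, here is a pure-Python sketch of a per-class split. It is not what `train_test_split` does internally (that presumably delegates to scikit-learn); `stratified_split` and its rounding rule are assumptions chosen for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.25, seed=42):
    """Split row indices so each class keeps roughly the same share in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for label in sorted(by_class):
        idxs = by_class[label]
        rng.shuffle(idxs)  # shuffle within the class only
        n_test = round(len(idxs) * test_size)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


labels = ["majority"] * 80 + ["minority"] * 20  # an imbalanced toy target
train_idx, test_idx = stratified_split(labels)
test_minority = sum(1 for i in test_idx if labels[i] == "minority")
test_majority = sum(1 for i in test_idx if labels[i] == "majority")
```

Both parts keep the 80/20 class ratio, which a purely random split of 100 rows would only match by chance.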
When the target is unknown (OpenML without a declared task type), the loader infers it:
```python
@@ -112,50 +125,91 @@ Tier 3 measures the *agent*, not a single estimator: given a task description an
## Results (real, executed)
-These are produced by running the harnesses β fixed `random_state=0`, default trainers, no manual
-tuning. Full table and reproduction steps: [`benchmarks/RESULTS.md`](https://github.com/fireflyframework/fireflyframework-datascience/blob/main/benchmarks/RESULTS.md).
+These results are produced by running the harnesses with fixed `random_state=0`, default trainers, and no manual tuning. Full table and reproduction steps: [`benchmarks/RESULTS.md`](https://github.com/fireflyframework/fireflyframework-datascience/blob/main/benchmarks/RESULTS.md).
+
+To reproduce locally:
+
+```bash
+uv sync --extra tabular --extra data --extra validation
+uv run python benchmarks/automl_benchmark.py # Tier-2 (offline, no network)
+uv run python benchmarks/amlb_benchmark.py # Tier-1 (OpenML, needs network)
+```
+
+!!! success "Expected: Tier-2 offline suite (`automl_benchmark.py`)"
+
+ `AutoML(cv=3)` over the default trainers (RandomForest, Linear, HistGradientBoosting; + XGBoost /
+ LightGBM / CatBoost when installed). Runs in seconds, no network.
+
+ | Dataset | Task | Metric | CV | Holdout | Winner | Seconds |
+ |---|---|---|---:|---:|---|---:|
+ | breast_cancer | binary | roc_auc | 0.9939 | **0.9952** | linear | 1.8 |
+ | iris | multiclass | accuracy | 0.9467 | **1.0000** | random_forest | 1.6 |
+ | wine | multiclass | accuracy | 0.9700 | **1.0000** | linear | 1.0 |
+ | diabetes | regression | rmse | -54.10 | **56.46** | linear | 1.4 |
+ | california_housing | regression | rmse | -0.473 | **0.455** | hist_gradient_boosting | 9.0 |
+
+### Tier-1: OpenML-CC18 (AMLB-style)
+
+`amlb_benchmark.py` runs `AutoML(cv=5)` across real OpenML tasks with genuine categorical data (e.g. `credit-g`), exercising the dtype-aware preprocessing and string-target encoding. Holdout ROC-AUC:
-**Tier-1 β OpenML-CC18 (AMLB-style), holdout ROC-AUC:**
+| OpenML id | Dataset | CV | Holdout | Winner |
+|---:|---|---:|---:|---|
+| 31 | credit-g | 0.7689 | **0.825** | random_forest |
+| 37 | diabetes | 0.8155 | **0.872** | linear |
+| 1464 | blood-transfusion | 0.7465 | **0.751** | linear |
+| 1480 | ilpd | 0.7347 | **0.780** | linear |
-| credit-g | diabetes | blood-transfusion | ilpd |
-|---:|---:|---:|---:|
-| 0.825 | 0.872 | 0.751 | 0.780 |
+These scores are comparable to published AutoGluon / H2O / FLAML numbers on the same datasets, obtained out of the box on real data with categorical features.
-Comparable to published AutoGluon / H2O / FLAML numbers on the same datasets β out of the box, on real
-data with categorical features.
+!!! note "On real finance & retail data (`samples/industry_showcase.py`)"
-**On real finance & retail data** (`samples/industry_showcase.py`): German credit risk (`credit-g`)
-reaches **0.82** holdout ROC-AUC and bank-marketing campaign conversion reaches **0.92** β each a full
-load β validate β AutoML β evaluate run on public OpenML data, no Kaggle account required.
+ German credit risk (`credit-g`) reaches **0.82** holdout ROC-AUC and bank-marketing campaign
+ conversion reaches **0.92**; each is a full load → validate → AutoML → evaluate run on public OpenML
+ data, no Kaggle account required.
### Unbiased comparison β nested cross-validation
-`benchmarks/scientific_eval.py` uses **nested 5-fold CV** (inner CV selects the model; the untouched
-outer fold gives the unbiased estimate) to compare Firefly AutoML against fixed single models on
-identical folds, with a Wilcoxon signed-rank test:
+`benchmarks/scientific_eval.py` uses **nested 5-fold CV** (inner CV selects the model on each outer fold's *training* data only; the untouched outer fold gives the unbiased estimate) to compare Firefly AutoML against fixed single models on identical folds, with a one-sided Wilcoxon signed-rank test over all 25 paired deltas (5 folds × 5 datasets):
-| Firefly AutoML vs⦠| mean ΠROC-AUC | Wilcoxon p |
-|---|---:|---:|
-| LogReg (linear) | **+0.029** | **0.046** |
-| RandomForest | +0.012 | 0.051 (on par) |
-| XGBoost | **+0.030** | **7.5e-6** |
+| Firefly AutoML vs… | mean Δ ROC-AUC | wins / ties / losses | Wilcoxon p |
+|---|---:|---|---:|
+| LogReg (linear) | **+0.029** | 8 / 14 / 3 | **0.046** |
+| RandomForest | +0.012 | 16 / 2 / 7 | 0.051 (on par) |
+| XGBoost | **+0.030** | 22 / 1 / 2 | **7.5e-6** |
-Firefly **significantly beats** single LogReg and single XGBoost and is **statistically on par with**
-RandomForest β because it *adapts* per dataset (boosting on non-linear data, linear where linear wins).
-On 2 of 5 small datasets a fixed model edges it out by ~0.01 (selection variance) β reported honestly.
+Firefly **significantly beats** single LogReg and single XGBoost and is **statistically on par with** RandomForest, because it *adapts* per dataset (boosting/bagging on non-linear data like `phoneme`, linear where linear genuinely wins, e.g. `blood-transfusion` and `ilpd`). On 2 of 5 small datasets a fixed model edges it out by ~0.01 to 0.02 (selection variance on ~1000-row data), which is reported honestly.
+
+!!! note "Why nested CV"
+
+ An AutoML system that reports the cross-validated score of the model it *selected* is
+ optimistically biased: it is the maximum over many models scored on the same folds. Nested CV
+ removes that bias: model selection happens on the inner CV of each outer fold's training data, and
+ the outer fold, never seen during selection, gives the honest estimate.
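
The selection bias described in the note can be reproduced with a small simulation. This is a toy model under stated assumptions, not the benchmark harness: every candidate has the same true score, CV estimates are that score plus Gaussian noise, and all the numbers are invented.

```python
import random
from statistics import mean


def simulate(n_models=10, true_score=0.80, noise=0.02, repeats=2000, seed=0):
    """Compare the winner's own CV score with a fresh estimate of the same model."""
    rng = random.Random(seed)
    reported, honest = [], []
    for _ in range(repeats):
        # Inner CV: one noisy estimate per candidate, all equally good in truth.
        inner = [rng.gauss(true_score, noise) for _ in range(n_models)]
        winner = max(range(n_models), key=inner.__getitem__)
        reported.append(inner[winner])  # score used for selection (biased up)
        # Outer fold: an independent estimate of the selected model.
        honest.append(rng.gauss(true_score, noise))
    return mean(reported), mean(honest)


reported_mean, honest_mean = simulate()
```

The winner's own score averages well above the true 0.80 because it is a maximum over noisy estimates, while the independent estimate stays centred on 0.80; that gap is the optimism nested CV removes.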
### GenAI value β controlled ablation (real LLM)
-`benchmarks/genai_value.py` isolates the GenAI contribution on a retail task whose driver
-(`revenue = price Γ units`) is withheld. Over 8 splits with `anthropic:claude-haiku-4-5`, GenAI feature
-engineering lifts a **linear model by +0.0205 ROC-AUC** (0.975 β 0.996, **Wilcoxon p = 0.0039**) β Claude
-rediscovered `total_revenue` from the schema alone. On Firefly's tree-based AutoML the lift is smaller
-(+0.002) and the **gate guarantees no regression**. Cost: 8 calls, **< $0.01**. GenAI is a *Pareto-safe
-accelerator* β significant value where structure exists, never a regression.
+`benchmarks/genai_value.py` isolates the GenAI contribution on a retail "high-value customer" task whose true driver (`revenue = unit_price × units`) is withheld from the model, a multiplicative interaction a *linear* learner cannot derive on its own. The comparison covers four systems over 8 repeated train/test splits with real `anthropic:claude-haiku-4-5`:
+
+| System | ROC-AUC (mean ± std) |
+|---|---:|
+| linear (raw) | 0.9752 ± 0.006 |
+| **linear + GenAI** | **0.9957 ± 0.002** |
+| Firefly AutoML (raw) | 0.9929 ± 0.003 |
+| Firefly AutoML + GenAI | 0.9950 ± 0.003 |
+
+GenAI feature engineering lifts the **linear model by +0.0205 ROC-AUC** (0.975 → 0.996, **Wilcoxon p = 0.0039**): Claude proposed and the gate accepted `total_revenue` / `price_volume_ratio`, rediscovering the withheld multiplicative driver from the schema alone. On Firefly's tree-based AutoML the lift is smaller (+0.002): trees already approximate the interaction, so there is less to add, and the **gate guarantees no regression**. Cost: 8 LLM calls, **well under $0.01** with Claude Haiku.
+
+!!! tip "Pareto-safe accelerator"
+
+ GenAI feature engineering adds measurable, significant value where the data has structure a model
+ cannot reach on its own, surfaces interpretable domain features, and is gated to never hurt, at
+ negligible cost. See [GenAI features](genai-features.md) and the [agentic loop](agentic-loop.md)
+ for the propose-measure-gate mechanism.
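
The "never regress" property follows directly from how a gate of this kind works. The function below is a minimal sketch, assuming the gate compares a cross-validated baseline with the score after adding the proposed features; `gate` and `min_lift` are hypothetical names, and the harmful proposal is invented for contrast.

```python
def gate(baseline_score: float, candidate_score: float, min_lift: float = 0.0):
    """Accept a proposal only if it beats the baseline by more than min_lift."""
    accepted = (candidate_score - baseline_score) > min_lift
    kept_score = candidate_score if accepted else baseline_score
    return accepted, kept_score


baseline = 0.9752  # linear (raw), mean ROC-AUC from the table above

helpful = gate(baseline, 0.9957)  # linear + GenAI from the table above
harmful = gate(baseline, 0.9700)  # invented proposal that would hurt
```

Whatever the proposal, the kept score is at least the baseline: a rejected proposal simply leaves the classical result unchanged.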
## See also
-- [Datasets API](./datasets.md)
-- [Container & auto-configuration](index.md)
-- [Task types](index.md)
+- [Datasets API](datasets.md)
+- [Classical AutoML](automl.md)
+- [GenAI features](genai-features.md)
+- [Configuration](configuration.md)
- [Getting started](quickstart.md)
diff --git a/docs/brief/firefly-datascience-complete-guide.pdf b/docs/brief/firefly-datascience-complete-guide.pdf
new file mode 100644
index 0000000..79ee13d
Binary files /dev/null and b/docs/brief/firefly-datascience-complete-guide.pdf differ
diff --git a/docs/brief/firefly-datascience-strategic-introduction.pdf b/docs/brief/firefly-datascience-strategic-introduction.pdf
deleted file mode 100644
index 20911ee..0000000
Binary files a/docs/brief/firefly-datascience-strategic-introduction.pdf and /dev/null differ
diff --git a/docs/configuration.md b/docs/configuration.md
index 4ca414e..b87f384 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -12,31 +12,66 @@ print(config.default_ml_framework) # "sklearn"
print(config.genai.enabled) # False
```
+The model is classical-first by design: GenAI is **off** until you turn it on, and even then every GenAI call sits behind a cost/benefit gate.
+
+!!! firefly "The LLM proposes; the classical engine decides"
+
+ Configuration encodes the framework's governance posture. `genai.enabled` defaults to `False`
+ and `genai.cost_benefit_gate` defaults to `True`, so the deterministic classical core runs unless
+ you explicitly opt into the GenAI accelerator, and even then GenAI stays gated behind a measured
+ improvement. You change one model; the whole app inherits the policy.
+
## Precedence
-Values resolve from highest priority to lowest:
+Values resolve from highest priority to lowest. Anything set at a higher level wins; a missing source is simply skipped.
-1. **Constructor kwargs**: values passed directly to `FireflyDataScienceConfig(...)`.
-2. **Environment variables**: prefixed `FIREFLY_DATASCIENCE_`, nested via `__`.
-3. **`.env` file**: same naming as environment variables.
-4. **Profile YAML overlays**: `firefly-datascience-<profile>.yaml` (later profiles outrank earlier ones).
-5. **Base YAML**: `firefly-datascience.yaml`.
-6. **Field defaults**: the defaults shown below.
+| Priority | Source | How to set it |
+| --- | --- | --- |
+| 1 (highest) | **Constructor kwargs** | values passed directly to `FireflyDataScienceConfig(...)` |
+| 2 | **Environment variables** | prefixed `FIREFLY_DATASCIENCE_`, nested via `__` |
+| 3 | **`.env` file** | same naming as environment variables |
+| 4 | **Profile YAML overlays** | `firefly-datascience-<profile>.yaml` (later profiles outrank earlier ones) |
+| 5 | **Base YAML** | `firefly-datascience.yaml` |
+| 6 (lowest) | **Field defaults** | the defaults shown below |
-Anything set at a higher level wins. A missing source is simply skipped.
+This ordering comes straight from `settings_customise_sources`, which returns `(init_settings, env_settings, dotenv_settings, *reversed(yaml_sources), file_secret_settings)`; earlier sources win, and reversing the YAML list lets profile overlays outrank the base file.
-```bash
-# Environment beats both YAML files and the field default:
-export FIREFLY_DATASCIENCE_DEFAULT_ML_FRAMEWORK=pytorch
-export FIREFLY_DATASCIENCE_GENAI__ENABLED=true # nested via __
-export FIREFLY_DATASCIENCE_BANNER__MODE=MINIMAL
-```
+=== "Environment beats YAML"
-```python
-# Constructor kwargs beat everything (useful in tests):
-config = FireflyDataScienceConfig(default_ml_framework="xgboost")
-assert config.default_ml_framework == "xgboost"
-```
+ ```bash
+ # Environment beats both YAML files and the field default:
+ export FIREFLY_DATASCIENCE_DEFAULT_ML_FRAMEWORK=pytorch
+ export FIREFLY_DATASCIENCE_GENAI__ENABLED=true # nested via __
+ export FIREFLY_DATASCIENCE_BANNER__MODE=MINIMAL
+ ```
+
+=== "Constructor beats everything"
+
+ ```python
+ # Constructor kwargs beat env, .env, YAML, and defaults (useful in tests):
+ config = FireflyDataScienceConfig(default_ml_framework="xgboost")
+ assert config.default_ml_framework == "xgboost"
+ ```
+
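The "earlier sources win" rule can be mimicked with the standard library's `ChainMap`, which also looks keys up in its first mapping first. This is a simplified sketch, not the settings machinery itself: it uses flat dictionaries (no nested merge), omits the file-secret source, and `resolve_settings` is a hypothetical helper.

```python
from collections import ChainMap


def resolve_settings(init, env, dotenv, yaml_sources, defaults):
    """Layer sources so that earlier ones win.

    yaml_sources is ordered base file first, then overlays in activation order;
    reversing it lets the last-activated profile outrank everything before it.
    """
    return ChainMap(init, env, dotenv, *reversed(yaml_sources), defaults)


defaults = {"default_ml_framework": "sklearn", "tracking_enabled": False}
base_yaml = {"default_ml_framework": "xgboost"}
dev_overlay = {"default_ml_framework": "lightgbm"}
gpu_overlay = {"default_ml_framework": "pytorch"}
yaml_sources = [base_yaml, dev_overlay, gpu_overlay]

from_yaml = resolve_settings({}, {}, {}, yaml_sources, defaults)
from_env = resolve_settings({}, {"default_ml_framework": "catboost"}, {}, yaml_sources, defaults)
from_init = resolve_settings(
    {"default_ml_framework": "sklearn"},
    {"default_ml_framework": "catboost"},
    {},
    yaml_sources,
    defaults,
)
```

A key missing from every higher layer (here `tracking_enabled`) falls through to the defaults, matching the "a missing source is simply skipped" behaviour.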
+### Environment variable naming
+
+Every field is reachable from the environment using the `FIREFLY_DATASCIENCE_` prefix; nested models use the `__` delimiter once per level of nesting.
+
+| Field path | Environment variable |
+| --- | --- |
+| `default_ml_framework` | `FIREFLY_DATASCIENCE_DEFAULT_ML_FRAMEWORK` |
+| `tracking_enabled` | `FIREFLY_DATASCIENCE_TRACKING_ENABLED` |
+| `banner.mode` | `FIREFLY_DATASCIENCE_BANNER__MODE` |
+| `genai.enabled` | `FIREFLY_DATASCIENCE_GENAI__ENABLED` |
+| `genai.budget_usd` | `FIREFLY_DATASCIENCE_GENAI__BUDGET_USD` |
+| `execution.sandbox` | `FIREFLY_DATASCIENCE_EXECUTION__SANDBOX` |
+| `execution.timeout_seconds` | `FIREFLY_DATASCIENCE_EXECUTION__TIMEOUT_SECONDS` |
+
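The naming rule in the table reduces to a one-line transformation. The helper below is a hypothetical convenience for illustration (the framework does not necessarily expose such a function); it reproduces the rows above.

```python
PREFIX = "FIREFLY_DATASCIENCE_"


def env_var_for(field_path: str) -> str:
    """Turn a dotted field path into its environment variable name."""
    # Each level of nesting becomes one "__" delimiter; the result is upper-cased.
    return PREFIX + field_path.replace(".", "__").upper()


names = {path: env_var_for(path) for path in ("default_ml_framework", "banner.mode", "genai.budget_usd")}
```

The double underscore is what lets single underscores inside field names (such as `budget_usd`) survive unambiguously.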
+!!! note "Two loader-only environment variables"
+
+ `FIREFLY_DATASCIENCE_CONFIG_DIR` and `FIREFLY_DATASCIENCE_PROFILES` are read by `load` itself (they
+ are not declared model fields) to discover the YAML directory and the active profiles. See
+ [`load(config_dir, profiles)`](#loadconfig_dir-profiles).
## `load(config_dir, profiles)`
@@ -57,16 +92,20 @@ def load(
```python
# Explicit arguments:
-config = FireflyDataScienceConfig.load(config_dir="config", profiles=["dev", "gpu"])
+config = FireflyDataScienceConfig.load(config_dir="config", profiles=["dev", "gpu"]) # (1)!
# Driven entirely by the environment:
# FIREFLY_DATASCIENCE_CONFIG_DIR=config
# FIREFLY_DATASCIENCE_PROFILES=dev,gpu
-config = FireflyDataScienceConfig.load()
+config = FireflyDataScienceConfig.load() # (2)!
-print(config.profiles) # ["dev", "gpu"]
+print(config.profiles) # ["dev", "gpu"] # (3)!
```
+1. Explicit `config_dir` and `profiles` take priority over the matching environment variables.
+2. With no arguments, `load` reads `FIREFLY_DATASCIENCE_CONFIG_DIR` and the comma-separated `FIREFLY_DATASCIENCE_PROFILES`.
+3. When profiles came from the loader (not from YAML), `load` back-fills the `profiles` field so the active profiles are visible on the returned config.
+
YAML files are discovered relative to `config_dir`:
```
@@ -76,6 +115,8 @@ config/
firefly-datascience-gpu.yaml # overlay for profile "gpu" (outranks "dev")
```
+A file that does not exist is skipped; only the base file and the overlays for active profiles are read.
+
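The discovery step can be pictured roughly as follows. This is a sketch, not the loader's actual code: `discover_yaml_files` is a hypothetical name, and only the file-naming pattern is taken from the layout above.

```python
from pathlib import Path


def discover_yaml_files(config_dir: str, profiles: list[str]) -> list[Path]:
    """Return the base file and active profile overlays that exist on disk.

    Illustrative stand-in for the loader's discovery step; the naming follows
    the layout shown above, everything else is an assumption.
    """
    root = Path(config_dir)
    candidates = [root / "firefly-datascience.yaml"]
    candidates += [root / f"firefly-datascience-{name}.yaml" for name in profiles]
    # Missing files are skipped rather than treated as errors.
    return [path for path in candidates if path.is_file()]
```

Returning the base file first and the overlays in activation order keeps the list ready for whatever merge step follows.
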
## Configuration fields
### Top level
@@ -117,34 +158,55 @@ Secure code-execution settings for LLM-generated code.
| `execution.timeout_seconds` | `int` | `60` | Per-execution timeout. |
| `execution.require_approval` | `bool` | `True` | Require human approval before running generated code. |
-## Example YAML
-
-`config/firefly-datascience.yaml` (base):
-
-```yaml
-app_name: lumen-ds
-default_ml_framework: sklearn
-tracking_enabled: false
-banner:
- mode: TEXT
-genai:
- enabled: false
- default_model: openai:gpt-4o
- cost_benefit_gate: true
-execution:
- sandbox: monty
- timeout_seconds: 60
- require_approval: true
-```
+## Profiles in practice
-`config/firefly-datascience-gpu.yaml` (profile overlay):
+A profile is just a named YAML overlay. Keep a base file with shared settings, then add one overlay per environment or hardware target and activate them by name. Each tab below is a complete, self-contained overlay.
-```yaml
-default_ml_framework: pytorch
-execution:
- sandbox: docker
- timeout_seconds: 300
-```
+=== "Base"
+
+ `config/firefly-datascience.yaml`
+
+ ```yaml
+ app_name: lumen-ds
+ default_ml_framework: sklearn
+ tracking_enabled: false
+ banner:
+ mode: TEXT
+ genai:
+ enabled: false
+ default_model: openai:gpt-4o
+ cost_benefit_gate: true
+ execution:
+ sandbox: monty
+ timeout_seconds: 60
+ require_approval: true
+ ```
+
+=== "gpu"
+
+ `config/firefly-datascience-gpu.yaml`
+
+ ```yaml
+ default_ml_framework: pytorch
+ execution:
+ sandbox: docker
+ timeout_seconds: 300
+ ```
+
+=== "prod"
+
+ `config/firefly-datascience-prod.yaml`
+
+ ```yaml
+ tracking_enabled: true
+ genai:
+ enabled: true
+ budget_usd: 25.0
+ execution:
+ require_approval: true
+ ```
+
+Activating the `gpu` profile overlays the base file; untouched keys fall back to base, then to field defaults:
```python
config = FireflyDataScienceConfig.load(config_dir="config", profiles=["gpu"])
@@ -152,6 +214,11 @@ assert config.default_ml_framework == "pytorch" # overlay wins over base
assert config.tracking_enabled is False # untouched key falls back to base
```
+!!! tip "Stacking profiles"
+
+ Pass more than one profile to compose overlays, for example `profiles=["dev", "gpu"]`. They apply in order,
+ and a later profile outranks an earlier one for any key both set.
+
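One way to picture stacking is as a left-to-right fold over dictionaries. The sketch below assumes nested sections such as `execution` are merged key by key rather than replaced wholesale, which this page does not state explicitly; `overlay`, `apply_profiles`, and the contents of the `dev` overlay are hypothetical.

```python
from typing import Any


def overlay(base: dict[str, Any], top: dict[str, Any]) -> dict[str, Any]:
    """Return a new dict with top layered over base, merging nested mappings."""
    merged = dict(base)
    for key, value in top.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = overlay(merged[key], value)
        else:
            merged[key] = value
    return merged


def apply_profiles(base: dict[str, Any], overlays: list[dict[str, Any]]) -> dict[str, Any]:
    """Apply overlays in activation order, so a later profile outranks an earlier one."""
    result = base
    for layer in overlays:
        result = overlay(result, layer)
    return result


base = {"default_ml_framework": "sklearn", "execution": {"sandbox": "monty", "timeout_seconds": 60}}
dev = {"execution": {"timeout_seconds": 120}}  # made-up overlay for the example
gpu = {"default_ml_framework": "pytorch", "execution": {"sandbox": "docker", "timeout_seconds": 300}}

resolved = apply_profiles(base, [dev, gpu])
print(resolved["execution"]["timeout_seconds"])  # 300, because gpu is applied after dev
```

Swapping the order to `[gpu, dev]` leaves `timeout_seconds` at 120, so the order of the `profiles` list matters.
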
## The banner
`banner.mode` controls the startup banner. Build a printer from a loaded config:
@@ -167,14 +234,31 @@ print(printer.render())
- `MINIMAL`: a single `:: Firefly DataScience :: (vX.Y.Z)` line.
- `OFF`: renders the empty string.
-Override it without touching YAML:
+`from_config` carries the active profiles and `genai.enabled` into the printer, so the `TEXT` status line reflects the resolved config:
+
+!!! success "Expected: `TEXT` status line"
+
+ ```
+ :: Firefly DataScience :: (v1.2.3) app=lumen-ds v1.0.0 profiles=['gpu'] genai=off
+ ```
+
+The framework version is filled in automatically; `app=`, `profiles=`, and `genai=` come from the config and the arguments you pass to `from_config`.
+
+Override the mode without touching YAML:
```bash
export FIREFLY_DATASCIENCE_BANNER__MODE=OFF
```
+!!! warning "Enum values are case-sensitive"
+
+ `BannerMode` is a string enum with members `TEXT`, `MINIMAL`, and `OFF`. Set the env var to one of
+ those exact upper-case strings; `off` or `text` will not parse.
+
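The case sensitivity follows from how Python enums look up members by value. The class below is a stand-in that assumes each member's value equals its upper-case name; it is not the framework's own `BannerMode`, and `parse_banner_mode` is a hypothetical helper.

```python
from enum import Enum


class BannerMode(str, Enum):
    """Stand-in for the framework's BannerMode, with the members listed above."""

    TEXT = "TEXT"
    MINIMAL = "MINIMAL"
    OFF = "OFF"


def parse_banner_mode(raw: str) -> BannerMode:
    """Look up a mode by exact value; a lower-case string raises ValueError."""
    return BannerMode(raw)


print(parse_banner_mode("OFF").value)
try:
    parse_banner_mode("off")
except ValueError as exc:
    print(f"rejected: {exc}")
```

If lenient input is wanted, normalising with `raw.upper()` before the lookup is one option, but that would be a choice made in your own code rather than documented framework behaviour.
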
## See also
- [Getting Started](quickstart.md)
+- [Configure the LLM](llm-configuration.md)
- [GenAI Accelerator](genai-features.md)
-- [Code Execution](security.md)
+- [Code Execution & Security](security.md)
+- [Architecture](architecture.md)
diff --git a/docs/datasets.md b/docs/datasets.md
index afa0b55..29ba1e3 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -1,20 +1,40 @@
# Datasets
-**A small, dependency-light container for tabular data, plus pluggable loaders for getting it.**
+**One dependency-light container to pass around your data, plus pluggable loaders that know how to fetch it.**
-The `datasets` module gives you one thing to pass around: a `Dataset` holding features `X`, an
-optional target `y`, the `task`, and metadata. The module itself is import-light (pandas and
-scikit-learn are imported lazily), so the `Dataset` type and the `DatasetLoaderPort` protocol are
-usable without the `tabular` extra installed. Concrete loaders live in
+The `datasets` module gives the rest of the framework a single thing to hand around: a `Dataset`
+holding features `X`, an optional target `y`, the `task`, and metadata. The module is import-light
+(pandas and scikit-learn are imported lazily, inside the methods that need them), so the `Dataset`
+type and the `DatasetLoaderPort` protocol are usable without the `tabular` extra installed. Concrete
+loaders (the things that actually touch sklearn or the network) live in
`fireflyframework_datascience.datasets.adapters`.
-
-
-
+That split is the hexagonal "ports and adapters" idea applied to data: the **port**
+(`DatasetLoaderPort`) is a tiny protocol in the pure core; the **adapters**
+(`SklearnDatasetLoader`, `OpenMLDatasetLoader`) carry the heavy, optional dependencies.
+
+
+
+!!! firefly "Why a port, not a base class"
+ A loader is just any object with a `name`, a `can_load`, and a `load`. `DatasetLoaderPort` is a
+ `@runtime_checkable` `Protocol`, so your loader does not import or subclass anything from the
+    framework; duck typing is enough. The dependency points inward: adapters depend on the core,
+ never the reverse.
## The `Dataset` container
-`Dataset` is a dataclass. The only required fields are `name` and `X`.
+`Dataset` is a `@dataclass`. The only required fields are `name` and `X`; everything else has a
+default.
+
+| Field | Type | Default | Meaning |
+| --- | --- | --- | --- |
+| `name` | `str` | *(required)* | A human-readable label, carried into split names. |
+| `X` | `Any` | *(required)* | The feature matrix (a pandas DataFrame or array-like). |
+| `y` | `Any` | `None` | The target; `None` for unsupervised data. |
+| `task` | `TaskType` | `TaskType.CLASSIFICATION` | The learning task (see [Core types](architecture.md)). |
+| `target_name` | `str \| None` | `None` | The target column's name. |
+| `feature_names` | `list[str]` | `[]` | Column names for `X`. |
+| `metadata` | `dict[str, Any]` | `{}` | Free-form provenance (e.g. `{"source": "sklearn"}`). |
```python
from fireflyframework_datascience.core.types import TaskType
@@ -35,25 +55,31 @@ ds.n_features # int: X.shape[1]
ds.has_target # bool: y is not None
```
+`n_rows`, `n_features`, and `has_target` are read-only properties, so they always reflect the
+current `X` and `y`; there is nothing to keep in sync.
+
### `train_test_split`
Splits into `(train, test)` datasets. For classification tasks with a target present, the split is
-stratified on `y` automatically. The returned datasets carry the same `task`, `target_name`,
-`feature_names`, and a copy of `metadata`; their names are suffixed `[train]` / `[test]`.
+stratified on `y` automatically; otherwise no stratification is applied. The returned datasets carry
+the same `task`, `target_name`, and `feature_names`, plus a *copy* of `metadata`; their names are
+suffixed `[train]` / `[test]`.
```python
-train, test = ds.train_test_split(test_size=0.25, random_state=42)
+train, test = ds.train_test_split(test_size=0.25, random_state=42) # (1)!
train.name # "my_data[train]"
test.name # "my_data[test]"
```
-Both arguments are keyword-only with the defaults shown above.
+1. Both arguments are **keyword-only** with the defaults shown. Stratification kicks in only when
+ `task.is_classification()` is true *and* `y is not None`.
### `with_features`
-Returns a copy with the feature matrix `X` replaced (used by feature engineering). The new
-`feature_names` are taken from the DataFrame's columns; `y` and the rest are preserved.
+Returns a copy with the feature matrix `X` replaced; this is how feature engineering hands work
+back without mutating the original. The new `feature_names` are taken from the DataFrame's columns;
+`y`, `task`, `target_name`, and a copy of `metadata` are preserved.
```python
engineered = ds.with_features(new_frame_x)
@@ -62,27 +88,72 @@ engineered.feature_names == list(new_frame_x.columns) # True
## `infer_task`
-Infers a `TaskType` from a target series or array:
+Infers a `TaskType` from a target series or array. Useful when a source does not tell you what kind
+of problem its target represents.
```python
from fireflyframework_datascience.datasets import infer_task
infer_task([0, 1, 0, 1]) # TaskType.BINARY (2 unique values)
infer_task(["a", "b", "c"]) # TaskType.MULTICLASS (>2 categorical)
-infer_task([0.1, 0.2, ... , 3.4]) # TaskType.REGRESSION (float, >20 unique)
+infer_task([0.1, 0.2, ..., 3.4]) # TaskType.REGRESSION (float, >20 unique)
```
-The rules: float or integer targets with more than 20 distinct values are `REGRESSION`; exactly two
-unique values are `BINARY`; everything else is `MULTICLASS`.
+The rules, in order:
+
+- A **float** target (`dtype.kind == "f"`) with **more than 20** distinct values → `REGRESSION`.
+- An **integer** target (`dtype.kind` in `"i"`/`"u"`) with **more than 20** distinct values →
+ `REGRESSION`.
+- Exactly **two** unique values → `BINARY`.
+- Everything else → `MULTICLASS`.
+
+!!! note "The 20-value threshold is a heuristic"
+ An integer target with, say, 10 distinct levels is treated as `MULTICLASS`, not `REGRESSION`.
+ If the inference is wrong for your data, set `Dataset.task` explicitly rather than relying on
+ `infer_task`.
+
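The rules above can be restated as a small pure-Python function. This is a stand-in rather than the framework's implementation: it checks Python value types instead of `dtype.kind`, and the string labels are placeholders for the `TaskType` members.

```python
from collections.abc import Sequence
from typing import Any

REGRESSION_THRESHOLD = 20  # the "more than 20 distinct values" cutoff


def infer_task_sketch(target: Sequence[Any]) -> str:
    """Apply the documented rules to a plain sequence and return a task label.

    Stand-in only: the framework inspects dtype.kind on an array or Series,
    whereas this sketch looks at the Python type of each value.
    """
    distinct = set(target)
    # bool is a subclass of int, so it is excluded from the numeric check.
    numeric = all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in target)
    if numeric and len(distinct) > REGRESSION_THRESHOLD:
        return "regression"
    if len(distinct) == 2:
        return "binary"
    return "multiclass"


print(infer_task_sketch([0, 1, 0, 1]))        # binary
print(infer_task_sketch(list(range(10))))     # multiclass: numeric, but only 10 distinct values
```

The second call shows the threshold at work: a ten-level integer target is classified, not regressed, unless you set the task yourself.
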
+## `TaskType` and `Modality`
+
+Two enums from the core (`fireflyframework_datascience.core.types`) describe *what* you are learning
+and *on what kind of data*. Both are `StrEnum`s, so they compare and serialize as plain strings.
+
+`TaskType` is the learning task and lives on every `Dataset`:
+
+```python
+from fireflyframework_datascience.core.types import TaskType
+
+TaskType.BINARY.is_classification() # True
+TaskType.REGRESSION.is_classification() # False
+```
+
+`is_classification()` returns `True` for `BINARY`, `MULTICLASS`, and the generic `CLASSIFICATION`.
+The full set is `BINARY`, `MULTICLASS`, `CLASSIFICATION`, `REGRESSION`, `CLUSTERING`, and
+`FORECASTING`.
+
+`Modality` describes the *kind of data* a pipeline operates on β orthogonal to the task. The
+`datasets` module ships tabular loaders, but the enum is the framework-wide vocabulary other modules
+key off:
+
+| `Modality` | Value | Typical use |
+| --- | --- | --- |
+| `TABULAR` | `"tabular"` | Rows and columns β what `Dataset` and the built-in loaders produce. |
+| `TEXT` | `"text"` | Free-text / NLP inputs. |
+| `VISION` | `"vision"` | Images. |
+| `TIMESERIES` | `"timeseries"` | Ordered observations over time. |
+| `MULTIMODAL` | `"multimodal"` | A mix of the above. |
+
+`Modality` and `TaskType` are both re-exported at the top level, so
+`from fireflyframework_datascience import Modality, TaskType` works too.
## `DatasetLoaderPort`
-A `@runtime_checkable` `Protocol`. Implement it to teach the framework about a new data source.
+A `@runtime_checkable` `Protocol`. Implement it to teach the framework about a new data source; no
+inheritance required.
```python
-from typing import Any
-from fireflyframework_datascience.datasets import Dataset, DatasetLoaderPort
+from typing import Any, Protocol, runtime_checkable
+@runtime_checkable
class DatasetLoaderPort(Protocol):
name: str
def can_load(self, source: str) -> bool: ...
@@ -90,7 +161,46 @@ class DatasetLoaderPort(Protocol):
```
A loader inspects a string `source` (a name, an id, or a URI), reports whether it `can_load` it, and
-returns a fully-populated `Dataset`.
+if so, returns a fully-populated `Dataset`. The framework's auto-configuration registers the
+built-in loaders as DI beans when their libraries are importable: the sklearn loader is registered
+when `sklearn` is present, and the OpenML loader additionally when `openml` is present.
+
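A check of this kind can be written with the standard library alone. The sketch below is an assumption about the shape of such a check, not the framework's auto-configuration code; in particular it reads "additionally" as meaning the OpenML loader needs both packages, and both function names are hypothetical.

```python
from importlib.util import find_spec


def is_importable(module_name: str) -> bool:
    """Report whether a top-level module can be found, without importing it."""
    return find_spec(module_name) is not None


def loaders_to_register() -> list[str]:
    """List the loader names this sketch would register, per the rule above."""
    names = []
    if is_importable("sklearn"):
        names.append("sklearn")
        # Assumption: the OpenML loader is only considered once sklearn is present.
        if is_importable("openml"):
            names.append("openml")
    return names


print(loaders_to_register())
```

Using `find_spec` avoids paying the import cost of a heavy library just to learn whether it is installed.
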
+### A worked example: your own loader
+
+Any object matching the protocol is a valid loader. Because the port is `@runtime_checkable`, you
+can even assert the match at runtime:
+
+```python
+import pandas as pd
+from fireflyframework_datascience.core.types import TaskType
+from fireflyframework_datascience.datasets import Dataset, DatasetLoaderPort, infer_task
+
+class CsvDatasetLoader: # (1)!
+ name = "csv"
+
+ def can_load(self, source: str) -> bool:
+ return source.endswith(".csv")
+
+ def load(self, source, *, target=None, **kwargs):
+ frame = pd.read_csv(source)
+ series_y = frame.pop(target) if target else None # (2)!
+ return Dataset(
+ name=source,
+ X=frame,
+ y=series_y,
+ task=infer_task(series_y) if series_y is not None else TaskType.CLASSIFICATION,
+ target_name=target,
+ feature_names=list(frame.columns),
+ metadata={"source": "csv"},
+ )
+
+loader = CsvDatasetLoader()
+isinstance(loader, DatasetLoaderPort) # True: duck typing satisfies the protocol
+```
+
+1. No base class: defining `name`, `can_load`, and `load` is all the protocol asks for.
+2. `target` is keyword-only on `load`, matching the port signature. Letting `infer_task` pick the
+ `TaskType` mirrors what `OpenMLDatasetLoader` does.
## `SklearnDatasetLoader` (offline)
@@ -111,6 +221,13 @@ ds.n_features # 30
ds.metadata # {"source": "sklearn", "n_rows": ..., "n_features": ...}
```
+!!! success "Expected"
+ ```python
+ ds.task # TaskType.BINARY
+ ds.n_features # 30
+ ds.has_target # True
+ ```
+
Available names and their tasks:
| Source | Task |
@@ -119,35 +236,53 @@ Available names and their tasks:
| `iris`, `wine`, `digits` | `MULTICLASS` |
| `diabetes`, `california_housing` | `REGRESSION` |
-An unknown name raises `ValueError` listing the available datasets.
+The loader resolves each name to `load_<name>` (or, failing that, `fetch_<name>`) in
+`sklearn.datasets` and calls it with `as_frame=True`, so `X` is a DataFrame and `y` a Series. An
+unknown name raises `ValueError` listing the available datasets.
+
+!!! warning "Requires scikit-learn"
+ `load` imports `sklearn.datasets` directly. Without scikit-learn installed (the `tabular`
+ extra), the import raises a plain `ImportError`.
## `OpenMLDatasetLoader` (`data` extra)
Loads datasets from [OpenML](https://www.openml.org/) by id (`openml:31`) or name
(`openml:credit-g`). Requires the `data` extra (`openml`) and network access. The `task` is inferred
-via `infer_task` from the resolved target.
+via `infer_task` from the resolved target (or defaults to `TaskType.CLASSIFICATION` when there is no
+target).
-```bash
-pip install "fireflyframework-datascience[data]"
-```
+=== "Install"
-```python
-from fireflyframework_datascience.datasets.adapters import OpenMLDatasetLoader
+ ```bash
+ pip install "fireflyframework-datascience[data]"
+ ```
-loader = OpenMLDatasetLoader()
-loader.name # "openml"
-loader.can_load("openml:31") # True
+=== "Use"
-ds = loader.load("openml:credit-g") # by name
-ds = loader.load("openml:31", target="class") # override the default target
-ds.metadata # {"source": "openml", "openml_id": ...}
-```
+ ```python
+ from fireflyframework_datascience.datasets.adapters import OpenMLDatasetLoader
+
+ loader = OpenMLDatasetLoader()
+ loader.name # "openml"
+ loader.can_load("openml:31") # True
+
+ ds = loader.load("openml:credit-g") # by name
+ ds = loader.load("openml:31", target="class") # override the default target
+ ds.metadata # {"source": "openml", "openml_id": ...}
+ ```
+
+The reference after `openml:` is treated as an id when it is all digits, and as a name otherwise.
+When you do not pass `target`, the loader falls back to the dataset's `default_target_attribute`.
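
The id-versus-name rule can be sketched as a small parser. `parse_openml_source` is a hypothetical helper that mirrors only the all-digits rule described above; it makes no claim about how the loader is actually written.

```python
from typing import Tuple, Union

OPENML_PREFIX = "openml:"


def parse_openml_source(source: str) -> Tuple[str, Union[int, str]]:
    """Split an ``openml:<ref>`` source into an ("id", int) or ("name", str) pair.

    Hypothetical helper; it only mirrors the all-digits rule described above.
    """
    if not source.startswith(OPENML_PREFIX):
        raise ValueError(f"not an OpenML source: {source!r}")
    ref = source[len(OPENML_PREFIX):]
    # Restrict to ASCII digits so int() cannot fail on exotic Unicode digits.
    if ref.isascii() and ref.isdigit():
        return ("id", int(ref))
    return ("name", ref)


print(parse_openml_source("openml:31"))        # ('id', 31)
print(parse_openml_source("openml:credit-g"))  # ('name', 'credit-g')
```
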
-If the `openml` package is not installed, `load` raises `AdapterUnavailableError("OpenMLDatasetLoader", "data")`.
+!!! warning "Adapter unavailable without the extra"
+ If the `openml` package is not installed, `load` raises
+ `AdapterUnavailableError("OpenMLDatasetLoader", "data")`, whose message tells you exactly which
+ extra to install.
## See also
-- [Core types](architecture.md): `TaskType` and the rest of the core enums
-- [Adapters](architecture.md): the adapter pattern and the `data` / `tabular` extras
-- [Feature engineering](genai-features.md): consumers of `Dataset.with_features`
-- [Getting started](quickstart.md)
+- [Architecture](architecture.md): the hexagonal ports-and-adapters design and the optional extras
+- [Quickstart](quickstart.md): load a dataset and run a pipeline end to end
+- [AutoML](automl.md): how the engine consumes a `Dataset` and its `task`
+- [GenAI features](genai-features.md): consumers of `Dataset.with_features`
+- [Benchmarks](benchmarks.md): the sklearn/OpenML datasets used to measure the framework
diff --git a/docs/deep-learning.md b/docs/deep-learning.md
index 1405e55..38e4191 100644
--- a/docs/deep-learning.md
+++ b/docs/deep-learning.md
@@ -1,10 +1,15 @@
# Deep Learning & Tabular Foundation Models
-**Neural and tabular-foundation-model training behind two import-light ports, with a verified sklearn reference and gated PyTorch / TabPFN adapters.**
+**Neural and tabular-foundation-model training behind two import-light ports, with a verified sklearn reference and gated PyTorch / TabPFN adapters that share the exact contract of the classical engine.**
-The `dl` module defines two ports for non-classical-ML training. `DLTrainerPort` covers neural trainers; `TabFMPort` covers tabular foundation models (in-context fit/predict, e.g. TabPFN). Both are runtime-checkable `Protocol`s and share the same shape as the rest of the framework: `supports(task)` plus `fit(dataset) -> Model`.
+The `dl` module defines two ports for non-classical-ML training. `DLTrainerPort` covers neural trainers; `TabFMPort` covers tabular foundation models (in-context fit/predict, e.g. TabPFN). Both are runtime-checkable `Protocol`s and share the same shape as the rest of the framework: `name`, `supports(task)`, and `fit(dataset) -> Model`.
-The module itself is import-light: it pulls in no heavy dependencies. A verified `MLPTrainer` (scikit-learn) ships as the reference adapter and needs only the `tabular` extra. The heavy adapters (`TabPFNPredictor`, `TorchTabularTrainer`) are **gated behind extras** and raise a clear error when those extras are missing.
+That parity is the point. A deep-learning adapter is not a special case; it is the same `supports` + `fit` seam the classical trainers expose, so AutoML can rank an `MLPTrainer` against a gradient-boosted tree without any extra wiring. The framework's discipline still holds across modalities:
+
+!!! firefly "The LLM proposes; the classical engine decides"
+    Deep learning, text, and vision adapters widen *what* can be proposed: neural nets, transformers, CNNs, in-context foundation models. They do not change *who decides*. Every adapter returns a measured `Model`, and the same cost-benefit gate that scores classical models scores these, on held-out data. A heavyweight neural trainer earns its place only when it beats the dependency-light baseline on the metric, not because it is fashionable.
+
+The module itself is import-light: it pulls in no heavy dependencies at import time. A verified `MLPTrainer` (scikit-learn) ships as the reference adapter and needs only the `tabular` extra. The heavy adapters (`TabPFNPredictor`, `TorchTabularTrainer`) are **gated behind extras** and raise a clear error when those extras are missing.
## The ports
@@ -17,19 +22,22 @@ from fireflyframework_datascience.datasets import Dataset
from fireflyframework_datascience.models import Model
-class DLTrainerPort(Protocol):
+class DLTrainerPort(Protocol): # (1)!
name: str
def supports(self, task: TaskType) -> bool: ...
def fit(self, dataset: Dataset) -> Model: ...
-class TabFMPort(Protocol):
+class TabFMPort(Protocol): # (2)!
name: str
def supports(self, task: TaskType) -> bool: ...
def fit(self, dataset: Dataset) -> Model: ...
```
-Because they are `@runtime_checkable`, any object with the right attributes satisfies them:
+1. Neural trainers. The verified sklearn-MLP reference ships here; PyTorch Lightning / HuggingFace adapters plug in behind the `dl` / `nlp` extras.
+2. Tabular foundation models: in-context fit/predict (e.g. TabPFN), behind the `tabfm` extra.
+
+Because they are `@runtime_checkable`, any object with the right attributes satisfies them, with no base class and no registration:
```python
from fireflyframework_datascience.dl import DLTrainerPort
@@ -38,9 +46,11 @@ from fireflyframework_datascience.dl.adapters import MLPTrainer
assert isinstance(MLPTrainer(), DLTrainerPort) # True
```
+This is the same structural-typing trick the classical `TrainerPort` uses, which is why a DL adapter slots into the engine with zero glue code.
+
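One caveat worth keeping in mind: a `@runtime_checkable` `isinstance` check only confirms that the required members exist, not that their signatures match. The sketch below uses a hypothetical `TrainerLike` protocol with the same three members as the ports above; the two example classes are made up for the demonstration.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TrainerLike(Protocol):
    """Hypothetical port with the same three members as the ports above."""

    name: str

    def supports(self, task: Any) -> bool: ...

    def fit(self, dataset: Any) -> Any: ...


class WrongSignature:
    name = "oops"

    def supports(self) -> bool:  # missing the task parameter
        return True

    def fit(self) -> None:  # missing the dataset parameter
        return None


class MissingFit:
    name = "partial"

    def supports(self, task: Any) -> bool:
        return True


print(isinstance(WrongSignature(), TrainerLike))  # True: members exist, signatures are not checked
print(isinstance(MissingFit(), TrainerLike))      # False: there is no fit attribute
```

A static type checker catches the signature mismatch that the runtime check lets through, so the two are best used together.
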
## MLPTrainer: the verified neural reference
-`MLPTrainer` wraps scikit-learn's `MLPClassifier` / `MLPRegressor` in the standard preprocessing pipeline and returns a fitted `Model`. It is verified in the PR gate and needs only the `tabular` extra; no PyTorch required.
+`MLPTrainer` wraps scikit-learn's `MLPClassifier` / `MLPRegressor` in the standard preprocessing pipeline and returns a fitted `Model`. It is verified in the PR gate and needs only the `tabular` extra; no PyTorch required. The estimator is built with `build_pipeline(...)`, so it shares the exact preprocessing every other trainer gets.
```bash
pip install "fireflyframework-datascience[tabular]"
@@ -64,6 +74,7 @@ dataset = Dataset(
)
trainer = MLPTrainer()
+trainer.name # "mlp"
assert trainer.supports(dataset.task)
model = trainer.fit(dataset) # -> fireflyframework_datascience.models.Model
@@ -78,6 +89,8 @@ trainer.supports(TaskType.MULTICLASS) # True
trainer.supports(TaskType.REGRESSION) # True
```
+Under the hood the network is fixed and reproducible (`hidden_layer_sizes=(64, 32)`, `max_iter=400`, `random_state=42`), chosen as a sensible baseline rather than a tuning target.
+
The returned `Model` is the same wrapper the rest of the framework uses, so you get `predict` / `predict_proba` / `save` / `load` for free.
## TabPFNPredictor: tabular foundation model (`tabfm` extra)
@@ -92,13 +105,15 @@ pip install "fireflyframework-datascience[tabfm]"
from fireflyframework_datascience.dl.adapters import TabPFNPredictor
predictor = TabPFNPredictor()
-predictor.name # "tabpfn"
+predictor.name # "tabpfn"
predictor.supports(TaskType.BINARY) # True
model = predictor.fit(dataset) # in-context fit -> Model
preds = model.predict(dataset.X)
```
+Like `MLPTrainer`, it dispatches on the task (`TabPFNClassifier` for classification, `TabPFNRegressor` for regression) and wraps the result through the same `build_pipeline` / `Model` path, so the foundation model is just another scored candidate from AutoML's point of view.
+
If the `tabfm` extra is not installed, `fit` raises a clear, actionable error rather than an opaque `ImportError`:
```python
@@ -111,6 +126,13 @@ except AdapterUnavailableError as exc:
print(exc)
```
+!!! success "Expected"
+ ```text
+ Adapter 'TabPFNPredictor' requires the optional dependency group 'tabfm'.
+ Install it with: pip install 'fireflyframework-datascience[tabfm]'
+ ```
+ The exception carries `.adapter` and `.extra` attributes too, so callers can branch programmatically instead of parsing the message.
+
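To show what branching on those attributes might look like, here is a stand-in exception that mirrors the documented `.adapter` / `.extra` attributes and the message above. It is not the framework's class: the base class, the line break inside the message, and the `install_hint` helper are all assumptions.

```python
class AdapterUnavailableErrorSketch(Exception):
    """Stand-in mirroring the adapter/extra attributes and message shown above."""

    def __init__(self, adapter: str, extra: str) -> None:
        self.adapter = adapter
        self.extra = extra
        super().__init__(
            f"Adapter '{adapter}' requires the optional dependency group '{extra}'.\n"
            f"Install it with: pip install 'fireflyframework-datascience[{extra}]'"
        )


def install_hint(exc: AdapterUnavailableErrorSketch) -> str:
    """Build an install command from the structured attribute, not the message text."""
    return f"pip install 'fireflyframework-datascience[{exc.extra}]'"


try:
    raise AdapterUnavailableErrorSketch("TabPFNPredictor", "tabfm")
except AdapterUnavailableErrorSketch as exc:
    print(install_hint(exc))  # pip install 'fireflyframework-datascience[tabfm]'
```

Reading `exc.extra` keeps calling code stable even if the wording of the message changes.
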
## TorchTabularTrainer: the PyTorch / Lightning integration point (`dl` extra)
`TorchTabularTrainer` is the `DLTrainerPort` adapter where full deep-learning workloads plug in: PyTorch Lightning, HuggingFace Accelerate, distributed training (FSDP/DDP), and PEFT/TRL all share this contract. It is **gated behind the `dl` extra** and verified under the nightly/integration suite rather than the PR gate.
@@ -122,12 +144,17 @@ pip install "fireflyframework-datascience[dl]"
```python
from fireflyframework_datascience.dl.adapters import TorchTabularTrainer
-trainer = TorchTabularTrainer(epochs=50, hidden=64, lr=1e-3)
-trainer.name # "torch_tabular"
+trainer = TorchTabularTrainer(epochs=50, hidden=64, lr=1e-3) # (1)!
+trainer.name # "torch_tabular"
trainer.supports(TaskType.REGRESSION) # True
```
-Be honest about current state: in the published package the training loop lives behind the `dl` extra and the nightly suite. Without `torch` installed, `fit` raises `AdapterUnavailableError("TorchTabularTrainer", "dl")`; with `torch` present, the reference build raises `NotImplementedError` pointing at the nightly DL suite. Treat it as the **integration seam** β the contract (`supports` + `fit(dataset) -> Model`) is identical to `MLPTrainer`, so your own Lightning/HF trainer can drop straight in:
+1. These are the constructor defaults (`epochs=50`, `hidden=64`, `lr=1e-3`), surfaced explicitly here for clarity. With the `dl` extra present, `fit` builds the preprocessor, encodes labels, and trains a small MLP via the bundled torch implementation.
+
+!!! warning "Gated behind extras"
+    Without `torch` installed, `fit` raises `AdapterUnavailableError("TorchTabularTrainer", "dl")`. The training loop runs only with the `dl` extra and is exercised by the nightly suite, not the PR gate.
+
+Treat it as the **integration seam**: the contract (`supports` + `fit(dataset) -> Model`) is identical to `MLPTrainer`, so your own Lightning/HF trainer can drop straight in with no inheritance and no registration:
```python
class MyLightningTrainer:
@@ -143,6 +170,57 @@ class MyLightningTrainer:
# isinstance(MyLightningTrainer(), DLTrainerPort) -> True
```
+## Beyond tabular: text and vision share the shape
+
+The same ports-parity discipline extends past the `dl` module. Each modality lives in its own import-light package, defines a port plus a result model, and exposes a single `fit(...)` method; the only thing that varies is the input type.
+
+=== "Text (`nlp`)"
+
+ `HFTextClassifier` fine-tunes a HuggingFace sequence-classification model on `(texts, labels)` and returns a `TextModel`. It defaults to DistilBERT and is **gated behind the `nlp` extra** (transformers + torch).
+
+ ```python
+ from fireflyframework_datascience.nlp.adapters import HFTextClassifier
+
+ clf = HFTextClassifier() # model_name="distilbert-base-uncased"
+ clf.name # "hf_text"
+ model = clf.fit(["great product", "broke on day one"], ["pos", "neg"])
+ model.predict(["love it"]) # -> ["pos"] (a TextModel)
+ ```
+
+    Swap `model_name` for any other sequence-classification checkpoint (RoBERTa, DeBERTa, …). Defaults: `epochs=3`, `lr=5e-5`, `max_length=64`, `batch_size=8`. Without the extra, `fit` raises `AdapterUnavailableError("HFTextClassifier", "nlp")`.
+
+=== "Vision (`dl`)"
+
+ `TorchCNNClassifier` trains a small CNN on `(N, C, H, W)` image arrays and returns an `ImageModel`. It is **gated behind the `dl` extra**.
+
+ ```python
+ from fireflyframework_datascience.vision.adapters import TorchCNNClassifier
+
+ clf = TorchCNNClassifier() # epochs=15, lr=1e-3, batch_size=16
+ clf.name # "torch_cnn"
+ model = clf.fit(images, labels) # images: (N, C, H, W) float array
+ model.predict(images) # -> list of labels (an ImageModel)
+ ```
+
+ Without the `dl` extra, `fit` raises `AdapterUnavailableError("TorchCNNClassifier", "dl")`.
+
+The text and vision ports (`TextClassifierPort`, `ImageClassifierPort`) take `(inputs, labels)` rather than a `Dataset`, because text and image inputs are not the tabular `Dataset` the `dl` ports consume; the fit/predict rhythm, the gated-extra discipline, and the structural-typing contract are nonetheless identical.
+
+## Modalities at a glance
+
+The framework names five modalities in the `Modality` enum (`fireflyframework_datascience.core.types`). Three have shipping adapters today; two are reserved in the type system but have no built-in adapter yet, so you would supply your own against the relevant port.
+
+| Modality | Port | Result | Reference adapter | Extra | Input |
+| --- | --- | --- | --- | --- | --- |
+| `TABULAR` | `DLTrainerPort` / `TabFMPort` | `Model` | `MLPTrainer`, `TabPFNPredictor`, `TorchTabularTrainer` | `tabular` / `tabfm` / `dl` | `Dataset` |
+| `TEXT` | `TextClassifierPort` | `TextModel` | `HFTextClassifier` | `nlp` | `(texts, labels)` |
+| `VISION` | `ImageClassifierPort` | `ImageModel` | `TorchCNNClassifier` | `dl` | `(images, labels)` |
+| `TIMESERIES` | n/a | n/a | none built in | n/a | n/a |
+| `MULTIMODAL` | n/a | n/a | none built in | n/a | n/a |
+
+!!! note "Reserved, not implemented"
+ `TIMESERIES` and `MULTIMODAL` exist in the `Modality` enum (and `TaskType.FORECASTING` exists for forecasting tasks), but the published package ships no adapter for them. The enum names the design space; bring your own adapter on the same `supports` + `fit` contract and it will satisfy the port like any other.
+
## Choosing an adapter
| Adapter | Port | Extra | Status |
@@ -150,12 +228,16 @@ class MyLightningTrainer:
| `MLPTrainer` | `DLTrainerPort` | `tabular` | Verified (PR gate) |
| `TabPFNPredictor` | `TabFMPort` | `tabfm` | Gated; runs with the extra |
| `TorchTabularTrainer` | `DLTrainerPort` | `dl` | Integration seam; nightly suite |
+| `HFTextClassifier` | `TextClassifierPort` | `nlp` | Gated; runs with the extra |
+| `TorchCNNClassifier` | `ImageClassifierPort` | `dl` | Gated; runs with the extra |
-Start with `MLPTrainer` for a dependency-light neural baseline, reach for `TabPFNPredictor` on small/medium tables where a foundation model shines, and use `TorchTabularTrainer` as the entry point for full PyTorch-based deep learning.
+!!! tip "Where to start"
+    Reach for `MLPTrainer` for a dependency-light neural baseline, `TabPFNPredictor` on small/medium tables where a foundation model shines, and `TorchTabularTrainer` as the entry point for full PyTorch-based deep learning. For text or images, start with `HFTextClassifier` and `TorchCNNClassifier`. In every case the adapter only earns its keep if it beats the baseline on the held-out metric; the cost-benefit gate, not the adapter, makes the call.
## See also
-- [Datasets](./datasets.md): the `Dataset` container and `DatasetLoaderPort`
-- [Models](automl.md): the fitted `Model` wrapper and `TrainerPort`
-- [Preprocessing](automl.md): `build_pipeline`, shared by every adapter
-- [Core Types](configuration.md): `TaskType` and friends
+- [Datasets](datasets.md): the `Dataset` container and loader ports
+- [AutoML](automl.md): the fitted `Model` wrapper, `TrainerPort`, and the cost-benefit gate that scores every adapter
+- [GenAI features](genai-features.md): where the LLM proposes and the classical engine decides
+- [Configuration](configuration.md): `TaskType`, `Modality`, and how extras toggle adapters on
+- [Architecture](architecture.md): the hexagonal ports-and-adapters design these trainers plug into
diff --git a/docs/genai-features.md b/docs/genai-features.md
index 67f0904..2fd0129 100644
--- a/docs/genai-features.md
+++ b/docs/genai-features.md
@@ -8,124 +8,154 @@ cross-validation lift of each one, and a `CostBenefitGate` keeps a feature only
beats the current baseline by a measurable margin. The LLM never touches the score; it
just generates candidates, and the data does the rest.
-
+
## The loop
`GenAIFeatureEngineer` runs **propose → execute → measure → gate**:
-1. **Propose**: a `FeatureProposer` returns a list of `FeatureProposal`s (name, code, rationale).
+1. **Propose**: a `FeatureProposer` returns a list of `FeatureProposal`s (`name`, `code`, `rationale`).
2. **Execute**: `FeatureCodeExecutor` statically vets and safely runs each snippet against a copy of the frame.
3. **Measure**: a classical estimator scores the candidate frame via cross-validation.
-4. **Gate**: `CostBenefitGate` accepts the feature only if the score improves by at least `min_gain`.
+4. **Gate**: `CostBenefitGate` accepts the feature only if the score improves by more than `min_gain`.
-Everything is injectable, so the loop runs fully offline with a stub proposer; no LLM is required for tests.
+Each accepted feature is folded into the working frame, so the next proposal is measured
+against the *improved* baseline: features must earn their keep on top of everything kept
+so far. Everything is injectable, so the loop runs fully offline with a stub proposer; no
+LLM is required for tests.
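
The incremental behaviour described above (each kept feature raises the bar for the next one) can be illustrated with a minimal, library-free sketch. Everything here is hypothetical: `greedy_gate_loop` and `toy_score` are illustration-only names, and the toy scorer stands in for cross-validation; this is not the framework's implementation.

```python
from typing import Callable, Iterable


def greedy_gate_loop(
    candidates: Iterable[str],
    score: Callable[[frozenset[str]], float],
    min_gain: float = 0.0,
) -> tuple[list[str], float, float]:
    """Keep a candidate only if it beats the best score so far by more than min_gain."""
    kept: list[str] = []
    baseline = score(frozenset())  # score before any candidate is added
    best = baseline
    for name in candidates:
        # Measure the candidate on top of everything accepted so far.
        trial = score(frozenset(kept) | {name})
        if trial - best > min_gain:
            kept.append(name)
            best = trial  # the next candidate must beat this improved score
    return kept, baseline, best


def toy_score(features: frozenset[str]) -> float:
    """Hypothetical scorer: 'ratio_a' and 'ratio_b' carry the same signal; 'noise' hurts."""
    value = 0.80
    if "ratio_a" in features or "ratio_b" in features:
        value += 0.02
    if "noise" in features:
        value -= 0.01
    return value


kept, baseline, final = greedy_gate_loop(["ratio_a", "ratio_b", "noise"], toy_score)
print(kept, baseline, final)
```

In this toy setup the second, redundant ratio adds nothing once the first is in the frame, so it is rejected even though it would have been accepted on its own. Proposal order therefore matters in a greedy loop of this shape.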
-## Quick start (no LLM)
+!!! firefly "The cost-benefit gate: GenAI proposes, the measured score decides"
-Use `StaticFeatureProposer` to drive the loop with a fixed, known set of features.
+ `CostBenefitGate` is the governance primitive that keeps GenAI honest. It compares the
+ candidate score against the current best and accepts only a strict improvement beyond
+ `min_gain`:
-```python
-from fireflyframework_datascience.features import FeatureProposal, StaticFeatureProposer
-from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
+ ```python
+ gate.accepts(current_score, candidate_score)
+ # True iff (candidate_score - current_score) > min_gain
+ ```
-proposer = StaticFeatureProposer([
- FeatureProposal(
- name="income_per_dependent",
- code="df['income_per_dependent'] = df['income'] / (df['dependents'] + 1)",
- rationale="Normalises income by household size.",
- ),
- FeatureProposal(
- name="utilization",
- code="df['utilization'] = df['balance'] / (df['credit_limit'] + 1)",
- rationale="Classic credit-risk ratio.",
- ),
-])
-
-engineer = GenAIFeatureEngineer(proposer, cv=5, max_features=5)
-result = engineer.engineer(dataset)
-
-print(result.summary())
-# GenAI feature engineering: 1 accepted, 1 rejected; roc_auc 0.8123 -> 0.8310 (lift +0.0187)
-```
+ With the default `min_gain=0.0`, any strict improvement is kept; raise it to demand
+ features that clear a meaningful bar before they earn their complexity. A proposal that
+ does not measurably beat the seeded classical baseline is rejected; the LLM never
+ overrides the data.
-## Reading the result
+## Quick start
-`engineer()` returns an `EngineeringResult` with the engineered dataset plus a full audit trail.
+Pick a proposer: a deterministic `StaticFeatureProposer` for known features and LLM-free
+runs, or an `AgentFeatureProposer` that asks a model for candidates.
-```python
-result.dataset # Dataset with accepted features merged in (dataset.with_features(...))
-result.baseline_score # CV score before any GenAI feature
-result.final_score # CV score after accepted features
-result.lift # final_score - baseline_score
-result.metric # e.g. "roc_auc"
+=== "Static (no LLM)"
-for acc in result.accepted: # AcceptedFeature
- print(acc.proposal.name, acc.score, acc.gain)
+ `StaticFeatureProposer` drives the loop with a fixed, known set of features, ideal for
+ tests, reproducible pipelines, and codifying domain knowledge.
-for rej in result.rejected: # RejectedFeature
- print(rej.proposal.name, rej.reason, rej.score)
-```
+ ```python
+ from fireflyframework_datascience.features import FeatureProposal, StaticFeatureProposer
+ from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
-A proposal is rejected when its code is unsafe, fails at runtime, adds no new numeric
-column, or produces **no measured lift** over the current best score.
+ proposer = StaticFeatureProposer([
+ FeatureProposal(
+ name="income_per_dependent",
+ code="df['income_per_dependent'] = df['income'] / (df['dependents'] + 1)",
+ rationale="Normalises income by household size.",
+ ),
+ FeatureProposal(
+ name="utilization",
+ code="df['utilization'] = df['balance'] / (df['credit_limit'] + 1)",
+ rationale="Classic credit-risk ratio.",
+ ),
+ ])
-## The gate is the governance primitive
+ engineer = GenAIFeatureEngineer(proposer, cv=5, max_features=5)
+ result = engineer.engineer(dataset)
+ print(result.summary())
+ ```
-`CostBenefitGate` is what keeps GenAI honest. It compares the candidate score against the
-current best and only accepts a strict improvement beyond `min_gain`.
+ !!! success "Expected"
-```python
-from fireflyframework_datascience.features import CostBenefitGate
-from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
+ ```text
+ GenAI feature engineering: 1 accepted, 1 rejected; roc_auc 0.8123 -> 0.8310 (lift +0.0187)
+ ```
-# Require at least +0.005 of lift before a feature is worth its complexity.
-gate = CostBenefitGate(min_gain=0.005)
-engineer = GenAIFeatureEngineer(proposer, gate=gate)
-```
+=== "Agent (LLM)"
-`gate.accepts(current_score, candidate_score)` returns `True` only when
-`candidate_score - current_score > min_gain`. With the default `min_gain=0.0`, any strict
-improvement is kept; raise it to demand features that earn their keep.
+ `AgentFeatureProposer` wraps a `FireflyAgent` from `fireflyframework-agentic`. It sends the
+ schema, a few sample rows, and the task to the model, then maps the structured output to
+ `FeatureProposal`s. The agent is built lazily on first use, so no LLM client is created at
+ startup.
-## Proposing with an LLM agent
+ ```python
+ from fireflyframework_datascience.features.genai import (
+ AgentFeatureProposer,
+ GenAIFeatureEngineer,
+ )
-`AgentFeatureProposer` wraps a `FireflyAgent` from `fireflyframework-agentic`. The agent is
-built lazily on first use, so no LLM client is created at startup. It sends the schema, a
-few sample rows, and the task to the model, then maps the structured output to
-`FeatureProposal`s.
+ proposer = AgentFeatureProposer(model="openai:gpt-4o", sample_rows=5) # (1)!
-```python
-from fireflyframework_datascience.features.genai import (
- AgentFeatureProposer,
- GenAIFeatureEngineer,
-)
+ engineer = GenAIFeatureEngineer(proposer, cv=5, max_features=8)
+ result = engineer.engineer(dataset)
+ print(result.summary())
+ ```
-# Pass a model string (default "openai:gpt-4o") or your own pre-built FireflyAgent.
-proposer = AgentFeatureProposer(model="openai:gpt-4o", sample_rows=5)
+ 1. Pass a model string (defaults to `"openai:gpt-4o"`) or your own pre-built `FireflyAgent`
+ via `agent=...`. `sample_rows` controls how many rows of the frame are sent to the model.
-engineer = GenAIFeatureEngineer(proposer, cv=5, max_features=8)
-result = engineer.engineer(dataset)
-print(result.summary())
-```
+ The agent is instructed to return short pandas snippets that add **exactly one new numeric
+ column** to a DataFrame named `df`, using only `df`, `pd`, and `np`: no imports, no I/O.
+ See [LLM configuration](llm-configuration.md) for choosing and configuring the model.
+
+## Proposers and the proposer port
-The agent is instructed to return short pandas snippets that add **exactly one new numeric
-column** to a DataFrame named `df`, using only `df`, `pd`, and `np`: no imports, no I/O.
+Both proposers satisfy the `FeatureProposer` protocol,
+`propose(dataset, *, max_features=5) -> list[FeatureProposal]`, so the engineer depends only on
+the port, never a concrete LLM.
-For tests, inject a pre-built agent (or a fake) instead of a model:
+| Proposer | Signature | LLM? | Use it for |
+|---|---|---|---|
+| `StaticFeatureProposer` | `StaticFeatureProposer(proposals: list[FeatureProposal])` | No | Tests, reproducible runs, domain-known features |
+| `AgentFeatureProposer` | `AgentFeatureProposer(*, model=None, agent=None, sample_rows=5)` | Yes (lazy) | Discovering candidates from the schema |
+
+For tests, inject a pre-built agent (or a fake) instead of a model, so there is no network call
+and no startup client:
```python
proposer = AgentFeatureProposer(agent=my_fake_agent)
```
+The structured output the agent returns is a `FeatureList` of `Feature` objects
+(`name`, `code`, `rationale`), defined in `features/_schema.py`; the proposer maps each one
+into a `FeatureProposal` and truncates to `max_features`.
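
To make the port idea concrete, here is a minimal structural-typing sketch. The names `ProposalSketch`, `ProposerSketch` and `FixedProposer` are hypothetical stand-ins written for this example; they only mirror the documented `propose(dataset, *, max_features=5)` shape and are not the framework's classes.

```python
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass(frozen=True)
class ProposalSketch:
    """Hypothetical stand-in for a proposal: a name, a snippet, and a rationale."""
    name: str
    code: str
    rationale: str


class ProposerSketch(Protocol):
    """Hypothetical port: anything with a matching propose() satisfies it."""

    def propose(self, dataset: object, *, max_features: int = 5) -> list[ProposalSketch]: ...


class FixedProposer:
    """A deterministic proposer for tests; it never inherits from the port."""

    def __init__(self, proposals: Sequence[ProposalSketch]) -> None:
        self._proposals = list(proposals)

    def propose(self, dataset: object, *, max_features: int = 5) -> list[ProposalSketch]:
        # Truncating here is an assumption of this sketch.
        return self._proposals[:max_features]


def collect_names(proposer: ProposerSketch, dataset: object, max_features: int) -> list[str]:
    """Consumer code depends only on the port, not on a concrete proposer."""
    return [p.name for p in proposer.propose(dataset, max_features=max_features)]


fixed = FixedProposer(
    [
        ProposalSketch("utilization", "df['utilization'] = df['balance'] / (df['credit_limit'] + 1)", "Credit-risk ratio."),
        ProposalSketch("income_per_dependent", "df['income_per_dependent'] = df['income'] / (df['dependents'] + 1)", "Household size."),
    ]
)
print(collect_names(fixed, dataset=None, max_features=1))
```

Because the port is a `Protocol`, a fake used in tests needs no base class, only a compatible `propose` method.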
+
+## Reading the result
+
+`engineer()` returns an `EngineeringResult` with the engineered dataset plus a full audit trail.
+
+```python
+result.dataset # Dataset with accepted features merged in (dataset.with_features(...))
+result.baseline_score # CV score before any GenAI feature
+result.final_score # CV score after accepted features
+result.lift # final_score - baseline_score
+result.metric # e.g. "roc_auc"
+
+for acc in result.accepted: # AcceptedFeature
+ print(acc.proposal.name, acc.score, acc.gain)
+
+for rej in result.rejected: # RejectedFeature
+ print(rej.proposal.name, rej.reason, rej.score)
+```
+
+`AcceptedFeature` carries the `proposal`, its `score`, and the `gain` over the previous best;
+`RejectedFeature` carries the `proposal`, a `reason`, and the candidate `score` (`NaN` when
+the code never ran). A proposal is rejected when its code is unsafe, fails at runtime, adds
+no new numeric column, or produces **no measured lift** over the current best score, in which
+case the reason reads `no lift ( <= )`.
+
## Secure execution
LLM-generated code is an attack surface, so `FeatureCodeExecutor` applies defence in depth
before anything runs. It reuses the static safety analysis from
-`fireflyframework_agentic.execution` (denying imports, dunder access, and dangerous
-builtins like `eval`/`exec`/`open`), then executes the vetted snippet in a restricted
-namespace exposing only `df` (a copy of the frame), `pd`, and `np`, with a minimal
-`__builtins__` allowlist.
+`fireflyframework_agentic.execution` (`analyze_code` against a `SafetyPolicy`), then executes
+the vetted snippet in a restricted namespace.
```python
from fireflyframework_datascience.features.executor import (
@@ -140,10 +170,30 @@ except FeatureExecutionError as exc:
print("rejected:", exc)
```
-`execute(code, X)` raises `FeatureExecutionError` if the code is unsafe, errors at runtime,
-leaves something other than a DataFrame in `df`, adds no new column, or adds a non-numeric
-column. Newly added columns also have `±inf` replaced with `NaN` so downstream estimators
-do not break. You can pass a custom executor into the engineer:
+The defence is layered:
+
+1. **Static analysis** rejects denied modules (`os`, `sys`, `subprocess`, `shutil`, `socket`,
+ `pathlib`, `importlib`, `builtins`), dunder access, and dangerous builtins
+ (`eval`, `exec`, `compile`, `open`, `__import__`, `input`, `globals`, `locals`, `vars`,
+ `getattr`, `setattr`) before anything runs.
+2. **Restricted execution** runs the snippet against a *copy* of the frame in a namespace that
+ exposes only `df`, `pd`, and `np`, with a minimal `__builtins__` allowlist (arithmetic and
+ aggregation helpers like `abs`, `min`, `max`, `sum`, `round`, `len`, `range`, and nothing
+ that performs I/O).
+3. **Output validation** rejects anything that is not a DataFrame, adds no new column, or adds
+ a non-numeric column; surviving new columns have `±inf` replaced with `NaN` so downstream
+ estimators do not break.
+
+`execute(code, X)` raises `FeatureExecutionError` if any layer fails. This is the CAAFE
+pattern: pandas/numpy transforms only, never arbitrary capability.
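
As an illustration of the first layer only, the sketch below vets a snippet with Python's standard `ast` module before anything would run. It is a simplified stand-in, not the `analyze_code` / `SafetyPolicy` implementation from `fireflyframework_agentic.execution`: `find_violations` is a hypothetical name, and rejecting every import is an assumption that is stricter than a module deny-list.

```python
import ast

# Builtins the sketch refuses to see referenced anywhere in the snippet.
DENIED_BUILTINS = frozenset(
    {"eval", "exec", "compile", "open", "__import__", "input",
     "globals", "locals", "vars", "getattr", "setattr"}
)


def find_violations(code: str) -> list[str]:
    """Return human-readable reasons why a snippet should not be executed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations: list[str] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("import statement")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            violations.append(f"dunder attribute: {node.attr}")
        elif isinstance(node, ast.Name) and (
            node.id in DENIED_BUILTINS or node.id.startswith("__")
        ):
            violations.append(f"denied name: {node.id}")
    return violations


print(find_violations("df['ratio'] = df['a'] / (df['b'] + 1)"))
print(find_violations("import os"))
print(find_violations("df.__class__"))
```

Static checks like this are cheap and run before execution, but they are only one layer; the restricted namespace and output validation described above still apply to anything that passes.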
+
+!!! warning "Untrusted data needs more than the in-process allowlist"
+
+ The in-process sandbox blocks the obvious escapes, but for untrusted inputs you should also
+ require human-in-the-loop approval and/or route execution to a container sandbox via
+ `config.execution.sandbox`. See [Security](security.md).
+
+You can pass a custom executor into the engineer:
```python
GenAIFeatureEngineer(proposer, executor=FeatureCodeExecutor())
@@ -151,24 +201,43 @@ GenAIFeatureEngineer(proposer, executor=FeatureCodeExecutor())
## Customising the measurement
-By default the engineer measures lift with a `HistGradientBoosting*` estimator (classifier
-or regressor, chosen by task) wrapped in an imputation/encoding pipeline, scored with the
-evaluator's default metric for the task. Override the scoring estimator or evaluator:
+By default the engineer measures lift with a `HistGradientBoosting*` estimator (a
+`HistGradientBoostingClassifier` for classification tasks, otherwise a
+`HistGradientBoostingRegressor`) wrapped in an imputation/encoding pipeline (median impute
+for numerics; most-frequent impute plus one-hot encoding for categoricals). It is scored with
+the evaluator's default metric for the task. Override the scoring estimator, the evaluator, or
+the CV folds:
```python
from sklearn.ensemble import RandomForestClassifier
engineer = GenAIFeatureEngineer(
proposer,
- scorer_estimator=lambda task: RandomForestClassifier(n_estimators=200),
+ scorer_estimator=lambda task: RandomForestClassifier(n_estimators=200), # (1)!
cv=10,
random_state=7,
)
```
+1. `scorer_estimator` is a `Callable[[TaskType], estimator]`: it receives the task and returns
+ the estimator used to measure lift. The same estimator scores both the baseline and every
+ candidate, so the comparison stays fair.
+
+To tighten the acceptance bar, supply a gate with a non-zero `min_gain`:
+
+```python
+from fireflyframework_datascience.features import CostBenefitGate
+from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
+
+# Require more than +0.005 of lift before a feature is worth its complexity.
+gate = CostBenefitGate(min_gain=0.005)
+engineer = GenAIFeatureEngineer(proposer, gate=gate)
+```
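
The strictness of the comparison is easy to miss, so here is a small self-contained sketch of the documented rule `(candidate_score - current_score) > min_gain`. `GateSketch` is a hypothetical stand-in written for this example, not the framework's `CostBenefitGate`; the scores are chosen to be exactly representable in binary floating point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateSketch:
    """Hypothetical stand-in for the gate: strict improvement beyond min_gain."""

    min_gain: float = 0.0

    def accepts(self, current_score: float, candidate_score: float) -> bool:
        return (candidate_score - current_score) > self.min_gain


default_gate = GateSketch()
strict_gate = GateSketch(min_gain=0.25)

print(default_gate.accepts(0.5, 0.5))   # a tie is not an improvement
print(strict_gate.accepts(0.5, 0.75))   # a gain equal to min_gain is not enough
print(strict_gate.accepts(0.5, 1.0))    # a gain above min_gain passes
```

With real metric values such as `0.805 - 0.800`, floating-point rounding can push a gain marginally above or below the threshold, so boundary cases should not be relied on.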
+
## See also
-- [Datasets](datasets.md)
-- [Evaluation & Metrics](automl.md)
-- [Core Types](configuration.md)
-- [AutoML Pipeline](automl.md)
+- [Datasets](datasets.md): the `Dataset` the engineer consumes and `with_features` returns.
+- [AutoML pipeline](automl.md): where GenAI feature engineering fits in the end-to-end run.
+- [Agentic loop](agentic-loop.md): the broader propose-gate pattern across the framework.
+- [LLM configuration](llm-configuration.md): choosing and wiring the model behind the agent.
+- [Security](security.md): sandboxing and approval for model-generated code.
diff --git a/docs/img/banner.svg b/docs/img/banner.svg
index 5a96550..1cf1ce2 100644
--- a/docs/img/banner.svg
+++ b/docs/img/banner.svg
@@ -1,83 +1,47 @@
-
\ No newline at end of file
diff --git a/docs/img/favicon.svg b/docs/img/favicon.svg
new file mode 100644
index 0000000..46138b1
--- /dev/null
+++ b/docs/img/favicon.svg
@@ -0,0 +1,8 @@
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/docs/img/logo.svg b/docs/img/logo.svg
new file mode 100644
index 0000000..f9aba88
--- /dev/null
+++ b/docs/img/logo.svg
@@ -0,0 +1,3 @@
+
+
+
\ No newline at end of file
diff --git a/docs/index.md b/docs/index.md
index 7a76534..81d1e68 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -1,93 +1,146 @@
-
-**AutoML that fuses GenAI with classical ML & Deep Learning: hexagonal, secure-by-default, native to the Firefly Framework.**
+
-> New here? Jump to the **[Tutorial](tutorial.md)** for a guided, runnable walkthrough, or
-> **[Configuring the LLM](llm-configuration.md)** to wire up GenAI.
->
-> **Business or transformation leader?** A polished
-> [Strategic Introduction (PDF)](brief/firefly-datascience-strategic-introduction.pdf) frames the value
-> (faster time-to-value, governed GenAI, no lock-in) without the engineering detail.
+**AutoML that fuses GenAI with classical ML & Deep Learning: hexagonal, secure-by-default, and
+native to the Firefly Framework.**
-`fireflyframework-datascience` is a state-of-the-art Python metaframework for AutoML. It pairs **GenAI**
-(built on [`fireflyframework-agentic`](https://github.com/fireflyframework/fireflyframework-agentic),
-which wraps [Pydantic AI](https://ai.pydantic.dev/)) with **traditional ML and Deep Learning**, so any
-team can apply data science to any project quickly, with production governance, hexagonal
+`fireflyframework-datascience` is a state-of-the-art Python metaframework for AutoML. It pairs
+**GenAI** (built on [`fireflyframework-agentic`](https://github.com/fireflyframework/fireflyframework-agentic),
+which wraps [Pydantic AI](https://ai.pydantic.dev/)) with **traditional ML and Deep Learning**, so
+any team can apply data science to any project quickly, with production governance, hexagonal
swappability, and security by default.
-The reproducible pattern: **the LLM proposes; a deterministic classical engine decides.** GenAI
-proposes code, features, pipelines and seeds; a classical engine trains, scores and selects; and every
-GenAI step is gated behind a measured improvement over a seeded classical baseline. GenAI is a
-governed, measurably-gated accelerator over a battle-tested classical core, never a black box.
+!!! firefly "The reproducible pattern: the LLM proposes; the classical engine decides"
+
+ GenAI proposes code, features, pipelines and seeds; a deterministic classical engine trains,
+ scores and selects; and **every GenAI step is gated behind a measured improvement over a seeded
+ classical baseline**. GenAI is a governed, measurably-gated accelerator over a battle-tested
+ classical core, never a black box.
-
+
-## The 7 pillars
-
-1. **Classical-first AutoML.** A deterministic engine trains, scores and selects models across
- scikit-learn, XGBoost, LightGBM, CatBoost, AutoGluon and TabPFN, reproducible from a seed.
-2. **GenAI as a gated accelerator.** The LLM proposes features and pipelines; nothing ships unless it
- beats the seeded classical baseline (`genai.cost_benefit_gate` is on by default).
-3. **The agentic ML-engineering loop.** Propose → train → score → select, driven by the agentic
- runtime, with measured improvement at every step.
-4. **Deep Learning, swappable.** PyTorch Lightning and HuggingFace sit behind the same ports as the
- classical adapters: tabular, text, vision, timeseries and multimodal.
-5. **Hexagonal & swappable.** Every ML/MLOps library (MLflow, Feast, BentoML, ...) is a swappable
- adapter behind a `Protocol` port; the core stays library-agnostic.
-6. **Secure by default.** LLM-generated code runs in a sandbox (`monty` by default) with timeouts and
- approval gates; GenAI is **off** until you enable it.
-7. **Firefly-native.** Auto-configuration, dependency injection, a startup banner + wiring summary,
- CalVer, and the same CI gates as the rest of the Firefly Framework.
-
-## Install
-
-```bash
-uv add fireflyframework-datascience # core
-uv add 'fireflyframework-datascience[automl-stack]' # + classical AutoML + tracking
-```
+!!! tip "Want the whole story in one document?"
-## End-to-end example
+ **[The Complete Guide (PDF)](brief/firefly-datascience-complete-guide.pdf)** combines the executive
+ summary and strategic case (faster time-to-value, governed GenAI, no lock-in) with the full
+ architecture, a hands-on tutorial, and the benchmark evidence: one document for both leaders and
+ engineers.
-Booting an application returns a started `ApplicationContext`: the loaded config plus the wired DI
-container.
+## Why Firefly DataScience?
-```python
-from fireflyframework_datascience import (
- FireflyDataScienceApplication,
- FireflyDataScienceConfig,
- Modality,
- TaskType,
-)
-
-# Boot: load config -> banner -> wire container -> wiring summary -> ready context
-app = FireflyDataScienceApplication.run()
-
-print(app.config.default_ml_framework) # "sklearn"
-print(app.bean_count) # number of wired beans
-print(app.applied_auto_configurations) # discovered auto-configurations
-
-# Core domain types stay importable with zero ML extras installed
-task = TaskType.BINARY
-assert task.is_classification()
-assert Modality.TABULAR in Modality
-```
+
+
+- :material-flask-outline:{ .lg .middle } __Classical-first AutoML__
+
+ ---
+
+ A deterministic engine trains, scores and selects across scikit-learn, XGBoost, LightGBM,
+ CatBoost, AutoGluon and TabPFN, reproducible from a seed.
+
+ [:octicons-arrow-right-24: Classical AutoML](automl.md)
+
+- :material-creation-outline:{ .lg .middle } __GenAI as a gated accelerator__
+
+ ---
+
+ The LLM proposes features and pipelines; nothing ships unless it beats the seeded classical
+ baseline (`genai.cost_benefit_gate` is on by default).
+
+ [:octicons-arrow-right-24: GenAI features](genai-features.md)
+
+- :material-sync:{ .lg .middle } __The agentic ML-engineering loop__
+
+ ---
+
+ Propose → train → score → select, driven by the agentic runtime, with a measured improvement
+ required at every step.
+
+ [:octicons-arrow-right-24: Agentic loop](agentic-loop.md)
+
+- :material-layers-triple-outline:{ .lg .middle } __Deep Learning, swappable__
+
+ ---
+
+ PyTorch Lightning and HuggingFace sit behind the same ports as the classical adapters: tabular,
+ text, vision, timeseries and multimodal.
+
+ [:octicons-arrow-right-24: Deep Learning](deep-learning.md)
+
+- :material-hexagon-outline:{ .lg .middle } __Hexagonal & swappable__
+
+ ---
+
+ Every ML/MLOps library (MLflow, Feast, BentoML, ...) is a swappable adapter behind a `Protocol`
+ port; the core stays library-agnostic.
+
+ [:octicons-arrow-right-24: Architecture](architecture.md)
+
+- :material-shield-lock-outline:{ .lg .middle } __Secure by default__
+
+ ---
+
+ LLM-generated code runs in a sandbox (`monty` by default) with timeouts and approval gates;
+ GenAI is **off** until you enable it.
+
+ [:octicons-arrow-right-24: Security model](security.md)
+
+
+
+## Get started in 30 seconds
-Configuration is a `pydantic-settings` model. Values resolve (highest precedence first) from
-constructor kwargs → `FIREFLY_DATASCIENCE_*` env vars → `.env` → profile YAML overlays →
-`firefly-datascience.yaml` → field defaults. GenAI is classical-first and **off by default**:
+=== "Install"
+
+ ```bash
+ uv add fireflyframework-datascience # core (ports, app, DI; no heavy ML libs)
+ uv add 'fireflyframework-datascience[automl-stack]' # + classical AutoML + tracking
+ ```
+
+ Requires **Python 3.13+**. Extras compose, e.g. `[tabular,tracking,genai]`.
+
+=== "Boot the app"
+
+ ```python
+ from fireflyframework_datascience import FireflyDataScienceApplication
+
+ # load config -> print banner -> wire DI container -> wiring summary -> ready context
+ app = FireflyDataScienceApplication.run()
+
+ print(app.bean_count) # number of wired beans
+ print(app.config.default_ml_framework) # "sklearn"
+ print(app.applied_auto_configurations) # discovered auto-configurations
+ ```
+
+=== "CLI"
+
+ ```bash
+ firefly-ds doctor # check your environment & installed adapters
+ firefly-ds introspect # boot the app and show discovered auto-configurations
+ ```
+
+[Full quick start :octicons-arrow-right-24:](quickstart.md)
+
+GenAI is classical-first and **off by default**; opt in, and require a measured win, explicitly:
```python
-config = FireflyDataScienceConfig(
- app_name="lumen-credit-risk",
- default_ml_framework="lightgbm",
- profiles=["prod"],
-)
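+from fireflyframework_datascience import FireflyDataScienceApplication, FireflyDataScienceConfig
+
+config = FireflyDataScienceConfig(app_name="lumen-credit-risk", default_ml_framework="lightgbm")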
config.genai.enabled = True # opt in to the GenAI accelerator
config.genai.cost_benefit_gate = True # require a measured win over baseline
config.execution.sandbox = "docker" # sandbox LLM-generated code
@@ -95,40 +148,68 @@ config.execution.sandbox = "docker" # sandbox LLM-generated code
app = FireflyDataScienceApplication.run(config=config)
```
-You can also boot from a config directory and active profiles directly:
+## Explore the docs
-```python
-app = FireflyDataScienceApplication.run(config_dir="./config", profiles=["prod"])
-```
+
-## CLI
+- :material-hexagon-outline:{ .middle } __[Architecture](architecture.md)__
-```bash
-firefly-ds doctor # check your environment & installed adapters
-firefly-ds introspect # boot the app and show discovered auto-configurations
-```
+ ---
+ Hexagonal ports/adapters, the DI container, and auto-configuration.
+
+- :material-rocket-launch-outline:{ .middle } __[Quick Start](quickstart.md)__
+
+ ---
+ Install, boot an `ApplicationContext`, run your first AutoML job.
+
+- :material-tune:{ .middle } __[Configuration](configuration.md)__
+
+ ---
+ `FireflyDataScienceConfig`, profiles, env vars, YAML overlays.
+
+- :material-database-outline:{ .middle } __[Datasets](datasets.md)__
+
+ ---
+ Dataset backends (pandas, ...) and `Modality`.
+
+- :material-flask-outline:{ .middle } __[Classical AutoML](automl.md)__
+
+ ---
+ The classical-first engine: train, score, select.
+
+- :material-creation-outline:{ .middle } __[GenAI features](genai-features.md)__
+
+ ---
+ The gated GenAI accelerator and the cost-benefit gate.
+
+- :material-sync:{ .middle } __[Agentic loop](agentic-loop.md)__
+
+ ---
+ Propose → train → score → select on the agentic runtime.
+
+- :material-layers-triple-outline:{ .middle } __[Deep Learning](deep-learning.md)__
+
+ ---
+ PyTorch Lightning & HuggingFace behind the ports.
+
+- :material-server-network:{ .middle } __[Serving](serving.md)__
+
+ ---
+ Model registry, feature store, and BentoML serving.
+
+- :material-shield-lock-outline:{ .middle } __[Security](security.md)__
+
+ ---
+ Sandboxed code execution, approval gates, secure defaults.
+
+- :material-chart-line:{ .middle } __[Benchmarks](benchmarks.md)__
+
+ ---
+ Reproducible measurement of GenAI vs. classical baselines.
+
+- :material-bank-outline:{ .middle } __[Use case: Lumen](use-case-lumen.md)__
+
+ ---
+ End-to-end lending vertical worked example.
-## Documentation
-
-| Page | What it covers |
-| --- | --- |
-| [Architecture](architecture.md) | Hexagonal ports/adapters, the DI container, auto-configuration. |
-| [Quickstart](quickstart.md) | Install, boot an `ApplicationContext`, run your first AutoML job. |
-| [Configuration](configuration.md) | `FireflyDataScienceConfig`, profiles, env vars, YAML overlays. |
-| [Datasets](datasets.md) | Dataset backends (pandas, ...) and `Modality`. |
-| [AutoML](automl.md) | The classical-first engine: train, score, select. |
-| [GenAI features](genai-features.md) | The gated GenAI accelerator and the cost-benefit gate. |
-| [Agentic loop](agentic-loop.md) | Propose → train → score → select on the agentic runtime. |
-| [Deep Learning](deep-learning.md) | PyTorch Lightning & HuggingFace behind the ports. |
-| [Serving](serving.md) | Model registry, feature store, and BentoML serving. |
-| [Security](security.md) | Sandboxed code execution, approval gates, secure defaults. |
-| [Benchmarks](benchmarks.md) | Reproducible measurement of GenAI vs. classical baselines. |
-| [Use case: Lumen](use-case-lumen.md) | End-to-end lending vertical worked example. |
-
-## See also
-
-- [Architecture](architecture.md)
-- [Quickstart](quickstart.md)
-- [Configuration](configuration.md)
-- [AutoML](automl.md)
-- [GenAI features](genai-features.md)
+
+
+- :material-map-outline:{ .lg .middle } __`tutorial.py`: the full guided tour__
+
+ ---
+
+ The whole framework end-to-end on a synthetic credit-risk dataset: boot → validate → classical
+ AutoML → GenAI feature engineering → agentic loop → serve. Runs **offline**; it uses deterministic
+ stand-in proposers and prints exactly how to switch on a real LLM. Needs the `tabular` extra.
+
+ ```bash
+ uv run python samples/tutorial.py
+ ```
+
+ [:octicons-arrow-right-24: Walkthrough](tutorial.md)
+
+- :material-bank-outline:{ .lg .middle } __`lumen_credit_risk.py`: a focused use case__
+
+ ---
+
+ One realistic (synthetic) lending dataset where default risk is driven by debt-to-income.
+ GenAI feature engineering discovers `debt_to_income`, the gate keeps it because it measurably
+ lifts the score, AutoML selects the winner, and it is served to score a new applicant. Runs
+ **offline** (`StaticFeatureProposer`). Needs the `tabular` extra.
+
+ ```bash
+ uv run python samples/lumen_credit_risk.py
+ ```
+
+ [:octicons-arrow-right-24: Use case: Lumen](use-case-lumen.md)
+
+- :material-database-outline:{ .lg .middle } __`industry_showcase.py`: real public data__
+
+ ---
+
+ The full pipeline (load → validate → AutoML → holdout evaluation) on genuine, mixed-type data
+ with categorical features, loaded straight from OpenML, with no Kaggle account needed. Two cases:
+ `credit-g` (German credit risk) and `bank-marketing` (campaign conversion). Needs the `tabular`
+ and `data` extras and network access.
+
+ ```bash
+ uv run python samples/industry_showcase.py
+ ```
+
+ [:octicons-arrow-right-24: Datasets](datasets.md)
+
+- :material-creation-outline:{ .lg .middle } __`genai_llm_showcase.py`: a real LLM__
+
+ ---
+
+ The only sample that calls a real model (Claude / GPT / ...) for **both** GenAI feature engineering
+ and the agentic ML-engineering loop. Model and credentials come from the environment; nothing is
+ hard-coded. Needs the `tabular` and `genai` extras and an LLM key.
+
+ ```bash
+ export ANTHROPIC_API_KEY=sk-ant-... # (1)!
+ uv run python samples/genai_llm_showcase.py
+ ```
+
+ [:octicons-arrow-right-24: Configuring the LLM](llm-configuration.md)
+
+
+
+1. Or `OPENAI_API_KEY` / `GEMINI_API_KEY`. The model string defaults to
+ `anthropic:claude-haiku-4-5`; override it with
+ `export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=...`. The script exits cleanly with a message if
+ no key is set.
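
The double underscore in `FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL` reads as a nesting delimiter: prefix, then the `genai` section, then the `default_model` field. The sketch below shows that mapping in plain Python under that assumption. `nested_overrides` is a hypothetical helper written for illustration; it is not the framework's loader, which delegates to its settings model.

```python
from typing import Any, Mapping


def nested_overrides(
    environ: Mapping[str, str],
    prefix: str = "FIREFLY_DATASCIENCE_",
    delimiter: str = "__",
) -> dict[str, Any]:
    """Turn PREFIX_SECTION__FIELD=value pairs into a nested dict of overrides."""
    result: dict[str, Any] = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # unrelated variables are ignored
        path = key[len(prefix):].lower().split(delimiter)
        node = result
        for part in path[:-1]:
            node = node.setdefault(part, {})  # descend, creating sections as needed
        node[path[-1]] = value
    return result


example_env = {
    "FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL": "anthropic:claude-haiku-4-5",
    "FIREFLY_DATASCIENCE_APP_NAME": "lumen-credit-risk",
    "PATH": "/usr/bin",
}
print(nested_overrides(example_env))
```

A single underscore stays inside a field name (`app_name`, `default_model`); only the double underscore separates a section from its field in this sketch.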
## Run them
-```bash
-uv run python samples/tutorial.py # offline, ~5 s
-uv run python samples/lumen_credit_risk.py # offline, ~10 s
-uv run python samples/industry_showcase.py # real OpenML data (network)
+=== "Offline (no key)"
-# real LLM: set a key first (see Configuring the LLM)
-export ANTHROPIC_API_KEY=sk-ant-...
-export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=anthropic:claude-haiku-4-5
-uv run python samples/genai_llm_showcase.py
-```
+ ```bash
+ uv run python samples/tutorial.py # the full guided tour
+ uv run python samples/lumen_credit_risk.py # focused credit-risk use case
+ uv run python samples/industry_showcase.py # real OpenML data (needs network)
+ ```
+
+=== "Real LLM"
+
+ ```bash
+ export ANTHROPIC_API_KEY=sk-ant-... # or OPENAI_API_KEY=...
+ export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=anthropic:claude-haiku-4-5 # optional; this is the default
+ uv run python samples/genai_llm_showcase.py
+ ```
+
+!!! tip "The offline samples are the place to start"
+
+ `tutorial.py` and `lumen_credit_risk.py` run the *same* cost/benefit gate as the real-LLM
+ showcase; the only difference is who proposes the features and models. Both finish in a few
+ seconds offline (roughly 5-10 s on a laptop), so you can see the full governance loop without spending a token.
## What the real-LLM showcase produces
@@ -41,14 +125,20 @@ A representative run with `anthropic:claude-haiku-4-5`:
```
The model proposes; the deterministic engine measures; the gate keeps only what is proven. Nothing
-unverified is adopted, which is exactly the governance described in [GenAI Feature Engineering](genai-features.md)
-and the [Agentic Loop](agentic-loop.md).
+unverified is adopted, which is exactly the governance described in
+[GenAI Feature Engineering](genai-features.md) and the [Agentic Loop](agentic-loop.md).
+
+!!! note "LLM output is non-deterministic"
+ The exact feature names, gains and attempt counts vary run to run, because the LLM is free to propose
+ anything. What is invariant is the gate: a proposal is accepted only if it lifts a seeded,
+ cross-validated baseline, so the *decision* is always reproducible even when the *proposal* is not.
## Benchmark scenarios
-The [`benchmarks/`](https://github.com/fireflyframework/fireflyframework-datascience/tree/main/benchmarks)
-directory holds the evaluation harnesses; all results live in
+For evaluation rather than demonstration, the
+[`benchmarks/`](https://github.com/fireflyframework/fireflyframework-datascience/tree/main/benchmarks)
+directory holds the harnesses; all results live in
[`benchmarks/RESULTS.md`](https://github.com/fireflyframework/fireflyframework-datascience/blob/main/benchmarks/RESULTS.md).
| Script | What it measures |
@@ -61,5 +151,8 @@ directory holds the evaluation harnesses; all results live in
## See also
-- [Tutorial](tutorial.md) Β· [Configuring the LLM](llm-configuration.md) Β· [Benchmarks](benchmarks.md) Β·
- [Use Case: Lumen](use-case-lumen.md)
+- [Tutorial](tutorial.md): the step-by-step walkthrough behind `tutorial.py`.
+- [Use case: Lumen](use-case-lumen.md): the credit-risk story behind `lumen_credit_risk.py`.
+- [Configuring the LLM](llm-configuration.md): providers, model strings and keys for the real-LLM sample.
+- [Datasets](datasets.md): the OpenML loader used by `industry_showcase.py`.
+- [Benchmarks](benchmarks.md): the evaluation harnesses and their results.
diff --git a/docs/security.md b/docs/security.md
index d3bca2b..ebd3244 100644
--- a/docs/security.md
+++ b/docs/security.md
@@ -4,6 +4,9 @@
The GenAI accelerators (CAAFE-style automated feature engineering, agentic analysis) ask a model to *write Python that runs against your data*. That is an attack surface. The framework's job is to make the default path safe even when the model is wrong, compromised, or steered by adversarial data. This page describes the trust model, the controls that enforce it, and, importantly, where those controls stop.
+!!! warning "GenAI is off until you enable it"
+ Firefly is classical-first. `genai.enabled` defaults to **`False`** ([`GenAIConfig`](configuration.md)), so none of the code-generating accelerators run unless you opt in. Until then there is no LLM, no generated code, and no executor invocation: the secure default is *nothing executes*. Everything below describes the controls that engage **once you turn GenAI on**.
+
@@ -18,6 +21,9 @@ The model is **not** trusted. We assume any of:
We *do* trust the host process, the installed libraries (`pandas`, `numpy`), and the configuration. The goal is: a wrong or hostile model snippet cannot do more than fail loudly.
+!!! firefly "The LLM proposes; the executor decides"
+ The model only ever produces *candidate text*. It cannot import, open files, reach the network, or mutate your data. The classical executor, `FeatureCodeExecutor`, is the part that actually runs anything, and it does so under a static analysis pass, a minimal builtins allowlist, and a strict result contract. The proposal is cheap and untrusted; the decision to run is governed.
+
## Layer 1 β static safety analysis
Before a single byte of model output executes, `FeatureCodeExecutor` runs it through the static analyzer reused from `fireflyframework_agentic.execution`. The policy denies dangerous modules, dangerous builtins, and **all dunder access** (which is how sandbox escapes are typically built):
@@ -49,24 +55,35 @@ executor = FeatureCodeExecutor()
# Rejected statically β never runs:
try:
- executor.execute("import os; os.system('rm -rf /')", X)
+ executor.execute("import os; os.system('rm -rf /')", X) # (1)!
except FeatureExecutionError as exc:
print(exc) # Unsafe feature code rejected: ...
```
+1. `os` is in `denied_modules`, so `analyze_code` reports the snippet as unsafe and `execute` raises before the `exec` call is ever reached.
+
Internally the executor calls `analyze_code(code, policy)` and refuses to proceed unless `report.safe` is true, surfacing each `report.violations[*].message` in the raised `FeatureExecutionError`.
+!!! success "Expected"
+ ```text
+ Unsafe feature code rejected: ...
+ ```
+ A rejected snippet raises a typed `FeatureExecutionError` (a subclass of `FireflyDataScienceError`): it does not run, does not return a partial frame, and does not leak a stack trace from inside the model's code.
+
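To make the static pass concrete, here is a minimal sketch of the idea using only the standard-library `ast` module. It is not the real `analyze_code` from `fireflyframework_agentic.execution`; `find_violations` and `DENIED_MODULES` are hypothetical names, and the real policy is likely broader.

```python
import ast

DENIED_MODULES = {"os", "sys", "subprocess", "socket"}  # hypothetical deny-list

def find_violations(code: str) -> list[str]:
    """Return human-readable violations; an empty list means the snippet passed."""
    violations = []
    for node in ast.walk(ast.parse(code)):   # parsing never executes the snippet
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            names = []
        for name in names:
            if name.split(".")[0] in DENIED_MODULES:
                violations.append(f"import of denied module '{name}'")
        # Dunder access (obj.__class__, __builtins__, ...) is rejected outright.
        if isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            violations.append(f"dunder attribute '{node.attr}'")
        if isinstance(node, ast.Name) and node.id.startswith("__"):
            violations.append(f"dunder name '{node.id}'")
    return violations

hostile = find_violations("import os; os.system('rm -rf /')")
escape = find_violations("df.__class__.__mro__")
benign = find_violations("df['ratio'] = df['a'] / df['b']")
```

The point of working on the syntax tree is that the hostile snippet is rejected without ever being run; a plain pandas transform produces no violations and moves on to the next layer.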
## Layer 2 β restricted execution
Code that passes static analysis is still not trusted. It runs via `exec` with a **minimal `__builtins__` allowlist** and a namespace that exposes only the dataframe and the two numeric libraries:
```python
# Inside FeatureCodeExecutor.execute, conceptually:
-namespace = {"df": X.copy(), "pd": pd, "np": np}
-exec(compile(code, "", "exec"), {"__builtins__": _SAFE_BUILTINS}, namespace)
+namespace = {"df": X.copy(), "pd": pd, "np": np} # (1)!
+exec(compile(code, "", "exec"), {"__builtins__": _SAFE_BUILTINS}, namespace) # (2)!
```
-`_SAFE_BUILTINS` is a hand-picked set β `abs`, `min`, `max`, `sum`, `round`, `len`, `range`, `zip`, `map`, `filter`, `sorted`, the numeric/collection constructors, and `pow`. There is no `open`, no `__import__`, no I/O. Key properties:
+1. The frame is a **copy** (`X.copy()`), so model code mutates its own `df` in place and never the caller's original data.
+2. The global `__builtins__` is replaced by `_SAFE_BUILTINS`, so the usual escape hatches (`__import__`, `open`, `eval`) simply do not exist in scope.
+
+`_SAFE_BUILTINS` is a hand-picked set: `abs`, `min`, `max`, `sum`, `round`, `len`, `range`, `enumerate`, `zip`, `map`, `filter`, `sorted`, the numeric/collection constructors (`float`, `int`, `bool`, `str`, `list`, `dict`, `tuple`, `set`), and `pow`. There is no `open`, no `__import__`, no I/O. Key properties:
- The frame is a **copy** (`X.copy()`), so model code cannot mutate the caller's data in place.
- The contract is **pandas/numpy transforms only**, never arbitrary capability. This is the CAAFE pattern.
@@ -78,6 +95,8 @@ code = "df['amount_per_day'] = df['amount'] / df['tenure_days'].clip(lower=1)"
X_enriched = executor.execute(code, X)
```
+The post-conditions are enforced in order: a non-`DataFrame` result raises ``Feature code must leave a pandas DataFrame in `df` ``; a frame with no new columns raises `Feature code added no new column`; a non-numeric new column raises `New feature is not numeric`. Every failure mode is a typed `FeatureExecutionError`, so downstream estimators never receive a malformed frame.
+
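The effect of swapping out `__builtins__` can be seen with a standard-library-only sketch. It assumes a plain `dict` standing in for the dataframe and a much smaller allowlist than the real `_SAFE_BUILTINS`; `run_restricted` and `SAFE_BUILTINS` are hypothetical names, not the executor's API.

```python
SAFE_BUILTINS = {"abs": abs, "min": min, "max": max, "sum": sum, "len": len, "range": range}

def run_restricted(code: str, row: dict) -> dict:
    """Run code against a copy of row with only SAFE_BUILTINS in scope."""
    namespace = {"df": dict(row)}   # copy: the caller's mapping is never mutated
    exec(compile(code, "<feature>", "exec"), {"__builtins__": SAFE_BUILTINS}, namespace)
    return namespace["df"]

original = {"amount": 120.0, "tenure_days": 0}
enriched = run_restricted(
    "df['amount_per_day'] = df['amount'] / max(df['tenure_days'], 1)", original
)

# Capabilities outside the allowlist are simply absent from scope.
blocked = []
for snippet in ("open('/etc/passwd')", "import os"):
    try:
        run_restricted(snippet, original)
    except (NameError, ImportError) as exc:
        blocked.append(type(exc).__name__)
```

In CPython, `open` fails with `NameError` because the name is not defined, and `import os` fails with `ImportError` because the import statement cannot find `__import__`. This is an in-process restriction only, which is why the next layer exists.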
## Layer 3 β the tiered sandbox
Layers 1 and 2 run **in-process**. They block the obvious capabilities, but a determined escape against a CPython process is never something to bet sensitive data on. For untrusted data, escalate isolation with `execution.sandbox` in `ExecutionConfig`:
@@ -100,23 +119,34 @@ The tiers, from least to most isolated:
| `docker` | OS-level container, no host network/FS | untrusted data |
| `e2b` | Remote ephemeral microVM | untrusted data at higher assurance |
-Set it via env or YAML β never hardcode `local` for production:
+The literal type for `sandbox` is exactly `Literal["monty", "docker", "e2b", "local"]`, defaulting to `"monty"`; any other value fails validation at load time. Set it via env or YAML, and never hardcode `local` for production:
-```bash
-export FIREFLY_DATASCIENCE_EXECUTION__SANDBOX=docker
-export FIREFLY_DATASCIENCE_EXECUTION__TIMEOUT_SECONDS=30
-```
+=== "Environment variables"
-```yaml
-# firefly-datascience-prod.yaml
-execution:
- sandbox: e2b
- timeout_seconds: 30
- require_approval: true
-```
+ ```bash
+ export FIREFLY_DATASCIENCE_EXECUTION__SANDBOX=docker # (1)!
+ export FIREFLY_DATASCIENCE_EXECUTION__TIMEOUT_SECONDS=30
+ ```
+
+ 1. The `FIREFLY_DATASCIENCE_` prefix and the `__` nested delimiter come straight from `SettingsConfigDict` on `FireflyDataScienceConfig`. `EXECUTION__SANDBOX` maps onto `config.execution.sandbox`.
+
+=== "Profile YAML"
+
+ ```yaml
+ # firefly-datascience-prod.yaml
+ execution:
+ sandbox: e2b
+ timeout_seconds: 30
+ require_approval: true
+ ```
+
+ Profile overlays outrank the base `firefly-datascience.yaml`, so a `prod` profile can tighten isolation without touching the base file. See [Configuration](configuration.md) for the full precedence order.
Beyond the strongest sandbox sits **HITL** (human-in-the-loop): when `execution.require_approval` is `True` (the default), generated code is surfaced for human approval before it runs. This is the final tier: a person, not a policy, signs off.
+!!! note "Defaults are the safe end of every axis"
+ Out of the box, `sandbox = "monty"` (in-process restricted interpreter), `timeout_seconds = 60`, and `require_approval = True`. You loosen these deliberately, and only `local` removes isolation entirely.
+
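The prefix and nested-delimiter mapping, together with the `Literal` check, can be approximated without any settings library. The sketch below is an assumption-laden stand-in for what `FireflyDataScienceConfig` does through `SettingsConfigDict`: `nested_settings` and `validated_sandbox` are hypothetical helpers, and values stay as strings here rather than being coerced.

```python
from typing import Literal, get_args

Sandbox = Literal["monty", "docker", "e2b", "local"]
PREFIX, DELIMITER = "FIREFLY_DATASCIENCE_", "__"

def nested_settings(env: dict) -> dict:
    """Fold PREFIX-ed variables into a nested dict, splitting keys on DELIMITER."""
    settings: dict = {}
    for key, value in env.items():
        if not key.startswith(PREFIX):
            continue                      # unrelated variables are ignored
        *parents, leaf = key[len(PREFIX):].lower().split(DELIMITER)
        node = settings
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return settings

def validated_sandbox(settings: dict) -> str:
    """Return the configured sandbox, defaulting to 'monty'; reject unknown tiers."""
    value = settings.get("execution", {}).get("sandbox", "monty")
    if value not in get_args(Sandbox):
        raise ValueError(f"sandbox must be one of {get_args(Sandbox)}, got {value!r}")
    return value

env = {"FIREFLY_DATASCIENCE_EXECUTION__SANDBOX": "docker", "PATH": "/usr/bin"}
settings = nested_settings(env)
sandbox = validated_sandbox(settings)
```

An unknown tier such as `"chroot"` raises at load time in this sketch, mirroring the fail-fast validation described above.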
## Prompt-injection-via-data defense
The subtle attack is not the model going rogue on its own; it is a **column value or header that steers the model** into writing malicious code. Firefly's answer is *defense in depth that does not trust the model's intent*:
@@ -126,7 +156,8 @@ The subtle attack is not the model going rogue on its own; it is a **column valu
3. **The numeric-new-column contract** means injected code that tries to do anything other than add a numeric feature fails the post-conditions.
4. **Sandboxing + HITL** mean that for genuinely untrusted data you route to `docker`/`e2b` and require approval, so injection cannot silently reach a capability.
-The framework cannot inspect or sanitize the *semantics* of your data. Treat data of unknown provenance as untrusted input: raise `execution.sandbox` and keep `require_approval` on.
+!!! warning "The framework does not read your data's meaning"
+ Firefly cannot inspect or sanitize the *semantics* of your data. Prompt-injection defense rests on capability restriction and sandboxing, not on detecting malicious text. Treat data of unknown provenance as untrusted input: raise `execution.sandbox` and keep `require_approval` on.
## Governance β the CostBenefitGate
@@ -135,7 +166,7 @@ GenAI is **off by default** (`genai.enabled = False`) β Firefly is classical-f
```python
config.genai.enabled # False by default
config.genai.cost_benefit_gate # True: gate LLM spend on expected benefit
-config.genai.budget_usd # optional hard ceiling, e.g. 5.00
+config.genai.budget_usd # optional hard ceiling (float | None), e.g. 5.00
```
```yaml
@@ -147,7 +178,8 @@ genai:
budget_usd: 5.00
```
-The gate is a *governance* control, not a security control: it limits spend and runaway agentic loops, not capability. Keep both axes in mind β `cost_benefit_gate` governs **how much** the model runs; the executor/sandbox govern **what its output may do**.
+!!! firefly "Two orthogonal gates: how much, and what"
+ The `CostBenefitGate` is a *governance* control, not a security control: it limits spend and runaway agentic loops, not capability. Keep both axes in mind: `cost_benefit_gate` governs **how much** the model runs; the executor and sandbox govern **what its output may do**. Neither substitutes for the other.
## Limits of the trust model
@@ -159,11 +191,13 @@ Be precise about what these controls do and do not give you:
- `require_approval` is only as strong as the human approving. Do not rubber-stamp generated code.
- Secrets in the host environment are visible to `local`/`monty` execution. Do not run untrusted-data jobs in a process holding production credentials.
-**Secure default:** `genai.enabled = False`; when enabled, `sandbox = "monty"`, `require_approval = True`, `cost_benefit_gate = True`. Relax deliberately, per profile, never globally.
+!!! tip "Secure default, stated once"
+ `genai.enabled = False`; when enabled, `sandbox = "monty"`, `require_approval = True`, `cost_benefit_gate = True`. Relax deliberately, per profile, never globally.
## See also
-- [Configuration](configuration.md)
-- [Feature Engineering](genai-features.md)
-- [GenAI Accelerators](genai-features.md)
-- [Getting Started](quickstart.md)
+- [Configuration](configuration.md): `ExecutionConfig`, `GenAIConfig`, and the resolution precedence
+- [LLM configuration](llm-configuration.md): wiring a model once GenAI is enabled
+- [GenAI features](genai-features.md): the CAAFE accelerator the executor protects
+- [Agentic loop](agentic-loop.md): where generated code and the `CostBenefitGate` meet
+- [Getting started](quickstart.md): the classical-first default path
diff --git a/docs/serving.md b/docs/serving.md
index dad03a2..f7b6710 100644
--- a/docs/serving.md
+++ b/docs/serving.md
@@ -1,15 +1,24 @@
# Serving & Lineage
-**Serve a trained model in-process by default, track experiments, and emit lineage β all behind narrow ports you can swap.**
+**Serve a trained model in-process by default, track experiments, and emit lineage, all behind narrow ports you can swap without touching your code.**
-Firefly DataScience keeps the core dependency-free. A fitted `Model` is served by a `ModelServerPort`; experiment runs go through a `TrackerPort`; data/model lineage flows through a `LineagePort`. Each port ships a zero-dependency default and an opt-in adapter behind an extra.
+Firefly DataScience keeps the core dependency-free. A fitted `Model` is served by a `ModelServerPort`; experiment runs go through a `TrackerPort`; data and model lineage flows through a `LineagePort`. Each port ships a zero-dependency default (registered automatically by the container) and an opt-in adapter behind an extra. Because every adapter implements the same port, moving from the in-process default to MLflow, BentoML, or OpenLineage is a configuration change, not a rewrite.
+
+!!! firefly "The same gate, applied to operations"
+
+ The training loop trusts only measured improvement; serving and lineage extend that discipline to
+ production. The defaults are deterministic and dependency-free, so a run that scored well in
+ development behaves identically when served. Heavier backends are opt-in and swapped behind a port:
+ you adopt MLflow or OpenLineage when they earn their keep, never by default.
-
+
## The model-server port
+A `ModelServerPort` loads a fitted `Model` and answers prediction requests. It is a `runtime_checkable` `Protocol`, so any object with a `name` attribute plus `load` and `predict` methods satisfies it; there is no base class to inherit.
+
```python
from typing import Any, Protocol, runtime_checkable
from fireflyframework_datascience.models import Model
@@ -21,7 +30,7 @@ class ModelServerPort(Protocol):
def predict(self, X: Any) -> Any: ...
```
-Any object with a `name`, `load`, and `predict` satisfies the port β no base class to inherit.
+The container registers a server for you. `ServingAutoConfiguration` provides `LocalModelServer` as the primary `ModelServerPort` bean, and only when no other bean of that type is already present (`@conditional_on_missing_bean`). Register your own adapter and it wins; otherwise the in-process default applies.
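As a rough illustration of the structural typing involved, the sketch below restates the port locally so it runs on its own. The `load` signature is an assumption (the real one takes a `Model`), and `EchoServer` and `NotAServer` are hypothetical classes, not part of the framework.

```python
from typing import Any, Protocol, runtime_checkable

@runtime_checkable
class ModelServerPort(Protocol):  # restated locally so the sketch is self-contained
    name: str
    def load(self, model: Any) -> None: ...
    def predict(self, X: Any) -> Any: ...

class EchoServer:
    """Hypothetical adapter: no base class, just the three members the port names."""
    name = "echo"

    def __init__(self) -> None:
        self._model: Any = None

    def load(self, model: Any) -> None:
        self._model = model

    def predict(self, X: Any) -> Any:
        if self._model is None:
            raise RuntimeError("No model loaded")
        return [self._model(x) for x in X]

class NotAServer:
    name = "broken"  # has a name but neither load nor predict

server = EchoServer()
server.load(lambda x: x * 2)   # any callable stands in for a fitted model here
predictions = server.predict([1, 2, 3])
satisfies = isinstance(server, ModelServerPort)
rejected = isinstance(NotAServer(), ModelServerPort)
```

Note that a `runtime_checkable` `isinstance` check only confirms that the members exist; it does not verify their signatures, so a type checker is still the stronger guarantee.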
## LocalModelServer (default)
@@ -36,7 +45,13 @@ preds = server.predict(X_test)
proba = server.predict_proba(X_test) # if the estimator supports it
```
-Calling `predict` before `load` raises `FireflyDataScienceError`. The loaded model is available via `server.model`.
+Calling `predict` (or `predict_proba`) before `load` raises `FireflyDataScienceError("No model loaded β call load(model) first")`. The loaded model is available via the read-only `server.model` property, which is `None` until you call `load`.
+
+!!! note "predict_proba is not part of the port"
+
+ `predict_proba` is an extra on `LocalModelServer`, not on `ModelServerPort`. Code written against the
+ port should rely on `name`, `load`, and `predict` only. Under the hood `Model.predict_proba` raises
+ `AttributeError` if the wrapped estimator does not implement it.
### Loading a trained model and serving predictions
@@ -49,7 +64,7 @@ from fireflyframework_datascience.serving import LocalModelServer
# After training elsewhere, the Model was saved:
# model.save("artifacts/churn.joblib")
-# Load it back (only from trusted, first-party locations β joblib uses pickle):
+# Load it back (only from trusted, first-party locations; joblib uses pickle): # (1)!
model = Model.load("artifacts/churn.joblib")
server = LocalModelServer()
@@ -59,29 +74,51 @@ predictions = server.predict(X_new)
print(predictions)
```
+1. `Model.load` uses `joblib`, which uses `pickle`, and unpickling can execute arbitrary code. Load
+ models only from trusted, first-party locations (your own registry or artifact store), never from
+ untrusted input. See [Security](security.md) for the threat model.
+
Because `LocalModelServer` simply delegates to `Model.predict` / `Model.predict_proba`, the served output matches what the estimator produces directly.
## BentoMLModelServer (gated)
-For packaging/deployment to a BentoML service, use `BentoMLModelServer` from the adapters module. It requires the `serving` extra.
+For packaging and deployment to a BentoML service, use `BentoMLModelServer` from the adapters module. It requires the `serving` extra.
-```bash
-pip install "fireflyframework-datascience[serving]"
-```
+=== "In-process (default)"
-```python
-from fireflyframework_datascience.serving.adapters import BentoMLModelServer
+ ```python
+ from fireflyframework_datascience.serving import LocalModelServer
-server = BentoMLModelServer() # raises AdapterUnavailableError without the extra
-server.load(model)
-preds = server.predict(X_new)
-```
+ server = LocalModelServer() # name == "local", no extra dependency
+ server.load(model)
+ preds = server.predict(X_new)
+ ```
+
+=== "BentoML (gated)"
+
+ ```bash
+ pip install "fireflyframework-datascience[serving]"
+ ```
+
+ ```python
+ from fireflyframework_datascience.serving.adapters import BentoMLModelServer
-Without `bentoml` installed, construction raises `AdapterUnavailableError("BentoMLModelServer", "serving")`. Both servers expose the same `name`/`load`/`predict` surface, so swapping is a one-line change.
+ server = BentoMLModelServer() # name == "bentoml"; raises AdapterUnavailableError without the extra
+ server.load(model)
+ preds = server.predict(X_new)
+ ```
+
+Without `bentoml` installed, construction raises `AdapterUnavailableError("BentoMLModelServer", "serving")`. Both servers expose the same `name`/`load`/`predict` surface, so swapping is a one-line change. Calling `predict` before `load` on `BentoMLModelServer` raises `FireflyDataScienceError("No model loaded")`.
+
+!!! warning "BentoML packaging is a deployment concern"
+
+ `BentoMLModelServer` wraps a fitted model and integrates with BentoML's runner API when available;
+ full service packaging (bentos, runners, the HTTP server) lives in your deployment pipeline, outside
+ this reference adapter.
## Experiment tracking
-The `TrackerPort` records params, metrics, and model artifacts for a run.
+The `TrackerPort` records params, metrics, and model artifacts for a run. `start_run` returns a `RunHandle`, an opaque dataclass with `run_id` and `name`.
```python
from collections.abc import Mapping
@@ -98,6 +135,8 @@ class TrackerPort(Protocol):
def end_run(self) -> None: ...
```
+`TrackingAutoConfiguration` registers `NoOpTracker` by default (`@conditional_on_missing_bean`), and swaps in `MLflowTracker` as the primary bean only when the `tracking_enabled` config property is `True` (it defaults to `False`). You opt in to MLflow through configuration; nothing in your training code changes.
+
### NoOpTracker (default)
Records nothing (logs at debug level) and keeps the core dependency-free.
@@ -113,6 +152,11 @@ tracker.log_model(model.estimator, artifact_name="churn")
tracker.end_run()
```
+!!! note "RunHandle naming"
+
+ `NoOpTracker.start_run` always returns `run_id="noop"`. The handle's `name` is the `run_name` you pass,
+ falling back to `"run"` when you pass none.
+
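The naming rule can be pictured with a small stand-in. This is a sketch under assumptions, not the library's code: `RunHandle` is restated locally with only the two fields named above, and `NoOpTrackerSketch` is a hypothetical class covering `start_run` alone.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunHandle:          # restated locally; the real dataclass may carry more
    run_id: str
    name: str

class NoOpTrackerSketch:
    """Hypothetical stand-in that mirrors only the documented start_run behaviour."""
    name = "noop"

    def start_run(self, run_name: Optional[str] = None) -> RunHandle:
        # run_id is constant; the name falls back to "run" when none is given.
        return RunHandle(run_id="noop", name=run_name or "run")

tracker = NoOpTrackerSketch()
default_handle = tracker.start_run()
named_handle = tracker.start_run("churn-v2")
```

In this sketch an empty string also falls back to `"run"`; whether the real tracker treats `""` the same way is not stated in the source.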
### MLflowTracker (opt-in)
Logs to an MLflow backend. Requires the `tracking` extra.
@@ -135,7 +179,28 @@ tracker.log_model(model.estimator, artifact_name="churn")
tracker.end_run()
```
-Construction raises `AdapterUnavailableError("MLflowTracker", "tracking")` when `mlflow` is not installed. The API is identical to `NoOpTracker`, so code written against the port runs unchanged with either tracker.
+Construction raises `AdapterUnavailableError("MLflowTracker", "tracking")` when `mlflow` is not installed. The API is identical to `NoOpTracker`, so code written against the port runs unchanged with either tracker. `tracking_uri` defaults to `None` (MLflow's local store) and `experiment` defaults to `"firefly-datascience"`; under the hood `MLflowTracker` calls `mlflow.set_experiment(...)` on construction and `mlflow.sklearn.log_model(...)` for `log_model`.
+
+!!! tip "Same code, two backends"
+
+ Write against `TrackerPort`, develop with `NoOpTracker`, then flip `tracking_enabled` to `True` (and
+ install the `tracking` extra) to capture the very same runs in MLflow, with no call-site edits.
+
+## Model registry
+
+The `RegistryPort` persists and retrieves models by name and version. It is a separate port from tracking, so a registry adapter can be swapped independently.
+
+```python
+from typing import Any, Protocol, runtime_checkable
+
+@runtime_checkable
+class RegistryPort(Protocol):
+    name: str
+    def register(self, model: Any, name: str) -> str: ...
+    def load(self, name: str, version: str | None = None) -> Any: ...
+```
+
+`register` returns the assigned version identifier; `load` resolves the latest version when `version` is `None`. Treat the registry as the trusted source for `Model.load`: only first-party artifact stores are safe to deserialize, because `joblib` uses `pickle`.
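A toy adapter makes the `register`/`load` contract easier to see. `InMemoryRegistry` is hypothetical and exists only for this sketch; its version scheme (`"1"`, `"2"`, ...) is an assumption, since the source does not say what version identifiers look like.

```python
from typing import Any, Optional

class InMemoryRegistry:
    """Hypothetical RegistryPort adapter that keeps every version in a dict."""
    name = "memory"

    def __init__(self) -> None:
        self._versions: dict = {}

    def register(self, model: Any, name: str) -> str:
        versions = self._versions.setdefault(name, [])
        versions.append(model)
        return str(len(versions))          # versions are "1", "2", ... in this sketch

    def load(self, name: str, version: Optional[str] = None) -> Any:
        versions = self._versions[name]    # KeyError for an unknown model name
        index = len(versions) if version is None else int(version)
        return versions[index - 1]

registry = InMemoryRegistry()
first = registry.register({"weights": [0.1]}, "churn")
second = registry.register({"weights": [0.2]}, "churn")
latest = registry.load("churn")                 # version=None resolves the newest
pinned = registry.load("churn", version="1")
```

Nothing is serialized here, so the `pickle` caveat does not arise; a persistent adapter would have to restrict itself to trusted storage.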
## Lineage
@@ -155,6 +220,8 @@ lineage = NoOpLineage() # default; lineage.name == "noop"
lineage.emit(event)
```
+`LineageEvent` is a dataclass whose `inputs`, `outputs`, and `metadata` all default to empty; only `name` is required. `LineageAutoConfiguration` registers `NoOpLineage` as the primary `LineagePort` bean by default, so lineage is always on but emits nowhere (it logs at debug level) until you provide a real backend.
+
### OpenLineageEmitter (gated)
Emits to an OpenLineage backend such as Marquez. Requires the `lineage` extra.
@@ -170,13 +237,21 @@ lineage = OpenLineageEmitter(
url="http://localhost:5000",
namespace="firefly-datascience",
)
-lineage.emit(event)
+lineage.emit(event) # lineage.name == "openlineage"
```
-Without the `openlineage` client installed, construction raises `AdapterUnavailableError("OpenLineageEmitter", "lineage")`.
+Without the `openlineage` client installed, construction raises `AdapterUnavailableError("OpenLineageEmitter", "lineage")`. `url` defaults to `"http://localhost:5000"` and `namespace` to `"firefly-datascience"`; the emitter constructs an `OpenLineageClient(url=url)` and forwards events to it.
+
+!!! success "Expected"
+
+ With the default `NoOpLineage`, `emit` returns `None` and writes a debug log line, so your pipeline runs
+ unchanged whether or not a lineage backend is wired up. Swap in `OpenLineageEmitter` (same `emit`
+ surface) to send the same events to Marquez.
## See also
-- [Models & Training](automl.md)
-- [Tuning](automl.md)
-- [Getting Started](quickstart.md)
+- [Architecture](architecture.md): the ports-and-adapters design these servers, trackers, and emitters plug into.
+- [AutoML](automl.md): how a `Model` is trained, scored, and selected before you serve it.
+- [Configuration](configuration.md): toggles such as `tracking_enabled` that pick the adapter.
+- [Security](security.md): why `Model.load` must read only from trusted, first-party locations.
+- [Quickstart](quickstart.md): train and serve a first model end to end.
diff --git a/docs/stylesheets/firefly.css b/docs/stylesheets/firefly.css
new file mode 100644
index 0000000..6ef96e7
--- /dev/null
+++ b/docs/stylesheets/firefly.css
@@ -0,0 +1,233 @@
+/* Firefly DataScience: docs theme.
+ *
+ * Brand: cyan/teal = the classical/data core; amber firefly = the gated GenAI accelerator.
+ * Boldness is spent in one place: amber appears ONLY in the `firefly` admonition that marks the
+ * recurring thesis ("the LLM proposes; the classical engine decides"). Everything else is cyan.
+ *
+ * Typography: Maven Pro (display/headings/nav) · Inter (body, via theme.font) · JetBrains Mono
+ * (code, via theme.font). Maven Pro is imported here and applied to headings only.
+ */
+
+@import url('https://fonts.googleapis.com/css2?family=Maven+Pro:wght@500;600;700;800&display=swap');
+
+/* ---- Brand tokens --------------------------------------------------------------------------- */
+:root {
+ --ff-display: "Maven Pro", -apple-system, BlinkMacSystemFont, "Segoe UI", sans-serif;
+ --ff-cyan: #06b6d4;
+ --ff-cyan-deep: #0891b2;
+ --ff-cyan-light: #67e8f9;
+ --ff-cyan-xlight: #d6fbff;
+ --ff-amber: #f6a821;
+ --ff-sky-1: #06121a;
+ --ff-sky-2: #0a1b22;
+ --ff-sky-3: #071419;
+}
+
+/* Cyan accent for both schemes (primary is set to `cyan` in mkdocs.yml; we refine the accents). */
+[data-md-color-scheme="default"] {
+ --md-accent-fg-color: var(--ff-cyan-deep);
+ --md-typeset-a-color: var(--ff-cyan-deep);
+ --ff-card-bg: #ffffff;
+ --ff-card-border: #cfe9ef;
+ --ff-rule: linear-gradient(90deg, var(--ff-cyan-light), rgba(34, 211, 238, 0.05));
+}
+[data-md-color-scheme="slate"] {
+ --md-accent-fg-color: var(--ff-cyan-light);
+ --md-typeset-a-color: var(--ff-cyan-light);
+ --md-default-bg-color: #0b1a20;
+ --md-code-bg-color: #0a1820;
+ --ff-card-bg: #0e2129;
+ --ff-card-border: #1d3a44;
+ --ff-rule: linear-gradient(90deg, var(--ff-cyan), rgba(34, 211, 238, 0.04));
+}
+
+/* ---- Typography ----------------------------------------------------------------------------- */
+.md-typeset h1,
+.md-typeset h2,
+.md-typeset h3,
+.md-typeset h4,
+.md-typeset h5,
+.md-typeset h6,
+.md-header__title,
+.md-nav__title,
+.md-ellipsis {
+ font-family: var(--ff-display);
+ letter-spacing: -0.01em;
+}
+.md-typeset h1 {
+ font-weight: 800;
+ letter-spacing: -0.02em;
+ color: var(--md-default-fg-color);
+}
+.md-typeset h2 {
+ font-weight: 700;
+ margin-top: 2.4em;
+ padding-bottom: 0.25em;
+}
+/* Signature: a short cyan accent rule under h2, echoing the banner's accent rule. */
+.md-typeset h2::after {
+ content: "";
+ display: block;
+ width: 2.4rem;
+ height: 2.5px;
+ margin-top: 0.45rem;
+ border-radius: 2px;
+ background: var(--ff-rule);
+}
+.md-typeset h3 { font-weight: 700; }
+/* The header site-title carries the brand voice. */
+.md-header__title { font-weight: 700; font-size: 1.05rem; }
+
+/* ---- Hero (homepage) ------------------------------------------------------------------------ */
+/* Visually-hidden but accessible: gives the homepage a real <h1> (so MkDocs doesn't inject a
+ * redundant "Home" title) while the banner serves as the visual masthead. */
+.firefly-sr-only {
+ position: absolute !important;
+ width: 1px;
+ height: 1px;
+ padding: 0;
+ margin: -1px;
+ overflow: hidden;
+ clip: rect(0, 0, 0, 0);
+ white-space: nowrap;
+ border: 0;
+}
+.firefly-hero {
+ margin: 0.4rem 0 2.2rem;
+}
+.firefly-hero img {
+ width: 100%;
+ height: auto;
+ border-radius: 14px;
+ box-shadow: 0 14px 40px -18px rgba(6, 182, 212, 0.55), 0 2px 0 rgba(255, 255, 255, 0.03) inset;
+ display: block;
+}
+.firefly-cta {
+ display: flex;
+ flex-wrap: wrap;
+ gap: 0.7rem;
+ margin-top: 1.3rem;
+ align-items: center;
+}
+/* Primary CTA: filled cyan; secondary: outlined. */
+.md-typeset .firefly-cta .md-button {
+ border-radius: 9px;
+ font-family: var(--ff-display);
+ font-weight: 600;
+ letter-spacing: 0;
+ transition: transform 0.15s ease, box-shadow 0.15s ease, background-color 0.15s ease;
+}
+.md-typeset .firefly-cta .md-button--primary {
+ background: var(--ff-cyan-deep);
+ border-color: var(--ff-cyan-deep);
+ color: #fff;
+}
+.md-typeset .firefly-cta .md-button:hover {
+ transform: translateY(-1px);
+ box-shadow: 0 8px 22px -10px rgba(6, 182, 212, 0.7);
+ background: var(--ff-cyan);
+ border-color: var(--ff-cyan);
+ color: #fff;
+}
+
+/* ---- Grid cards ----------------------------------------------------------------------------- */
+.md-typeset .grid.cards > ul > li,
+.md-typeset .grid.cards > ol > li,
+.md-typeset .grid > .card {
+ border: 1px solid var(--ff-card-border);
+ border-radius: 13px;
+ padding: 1.05rem 1.15rem;
+ background: var(--ff-card-bg);
+ transition: transform 0.16s ease, box-shadow 0.16s ease, border-color 0.16s ease;
+}
+.md-typeset .grid.cards > ul > li:hover,
+.md-typeset .grid.cards > ol > li:hover,
+.md-typeset .grid > .card:hover {
+ /* Firefly glow on hover. */
+ transform: translateY(-3px);
+ border-color: var(--ff-cyan);
+ box-shadow: 0 12px 30px -16px rgba(6, 182, 212, 0.6);
+}
+.md-typeset .grid.cards .twemoji,
+.md-typeset .grid.cards :is(.lg, .xl) {
+ color: var(--ff-cyan-deep);
+}
+[data-md-color-scheme="slate"] .md-typeset .grid.cards .twemoji {
+ color: var(--ff-cyan-light);
+}
+.md-typeset .grid.cards > ul > li > hr {
+ margin: 0.7rem 0;
+ border-color: var(--ff-card-border);
+}
+
+/* ---- Code, tables, blockquotes -------------------------------------------------------------- */
+.md-typeset pre > code {
+ border-radius: 10px;
+}
+.md-typeset code {
+ border-radius: 5px;
+}
+.md-typeset table:not([class]) {
+ border-radius: 10px;
+ overflow: hidden;
+ border: 1px solid var(--ff-card-border);
+ box-shadow: none;
+}
+.md-typeset table:not([class]) th {
+ background: rgba(6, 182, 212, 0.08);
+ font-family: var(--ff-display);
+ font-weight: 700;
+}
+.md-typeset table:not([class]) tr:hover {
+ background: rgba(6, 182, 212, 0.04);
+}
+.md-typeset blockquote {
+ border-left-color: var(--ff-cyan);
+}
+
+/* ---- Signature: the `firefly` admonition (amber lightbulb) ---------------------------------- *
+ * Use `!!! firefly "…"` (or `??? firefly`) for the "LLM proposes; classical decides" insight. */
+:root {
+ --md-admonition-icon--firefly: url('data:image/svg+xml;charset=utf-8,');
+}
+.md-typeset .admonition.firefly,
+.md-typeset details.firefly {
+ border-color: var(--ff-amber);
+ box-shadow: 0 4px 18px -12px rgba(246, 168, 33, 0.6);
+}
+.md-typeset .firefly > .admonition-title,
+.md-typeset .firefly > summary {
+ background-color: rgba(246, 168, 33, 0.12);
+}
+.md-typeset .firefly > .admonition-title::before,
+.md-typeset .firefly > summary::before {
+ background-color: var(--ff-amber);
+ -webkit-mask-image: var(--md-admonition-icon--firefly);
+ mask-image: var(--md-admonition-icon--firefly);
+}
+
+/* ---- Footer / misc -------------------------------------------------------------------------- */
+.md-footer-meta {
+ background-color: var(--ff-sky-1);
+}
+.md-typeset hr {
+ border-bottom-color: var(--ff-card-border);
+}
+
+/* ---- Quality floor: focus + reduced motion -------------------------------------------------- */
+.md-typeset .firefly-cta .md-button:focus-visible,
+.md-typeset .grid.cards > ul > li:focus-within {
+ outline: 2px solid var(--ff-cyan);
+ outline-offset: 2px;
+}
+@media (prefers-reduced-motion: reduce) {
+ .md-typeset .firefly-cta .md-button,
+ .md-typeset .grid.cards > ul > li,
+ .md-typeset .grid.cards > ol > li {
+ transition: none;
+ }
+ .md-typeset .firefly-cta .md-button:hover,
+ .md-typeset .grid.cards > ul > li:hover {
+ transform: none;
+ }
+}
diff --git a/docs/tutorial.md b/docs/tutorial.md
index a836664..e56bc61 100644
--- a/docs/tutorial.md
+++ b/docs/tutorial.md
@@ -2,9 +2,10 @@
**A guided, end-to-end tour of Firefly DataScience, from booting the app to serving a model.**
-This tutorial mirrors the runnable script [`samples/tutorial.py`](https://github.com/fireflyframework/fireflyframework-datascience/blob/main/samples/tutorial.py), which is
-covered by a test, so everything here is guaranteed to work. It runs **offline with no LLM key** β the
-GenAI steps use deterministic stand-ins, and we show how to switch on a real LLM at the end.
+This tutorial mirrors the runnable script [`samples/tutorial.py`](https://github.com/fireflyframework/fireflyframework-datascience/blob/main/samples/tutorial.py),
+which is covered by a test (`tests/samples/test_tutorial.py`), so every step shown here is exercised
+by the test suite. It runs **offline with no LLM key**: the GenAI steps use deterministic stand-ins, and we show
+how to switch on a real LLM at the end.
```bash
uv add 'fireflyframework-datascience[tabular]'
@@ -14,31 +15,103 @@ uv run python samples/tutorial.py
We use a synthetic **credit-risk** dataset whose default risk is driven by *debt-to-income*, a ratio
deliberately withheld from the model, so feature engineering has something real to discover.
+!!! firefly "The pattern every step rests on: the LLM proposes; the classical engine decides"
+
+    Generative AI only ever *proposes* candidates here: feature code (step 4) and model choices
+    (step 5). A deterministic classical engine cross-validates each one and a cost-benefit gate keeps
+    it only if it measurably beats the current baseline. The LLM never touches the score; the data
+    does. That is why the tour runs identically with or without an API key.
+
## 1. Boot the application
```python
from fireflyframework_datascience import FireflyDataScienceApplication
-app = FireflyDataScienceApplication.run()
+app = FireflyDataScienceApplication.run() # (1)!
```
+1. Pass `print_output=False` to suppress the banner (the script does this so its test output stays clean).
+
This prints the banner and a wiring summary, loads configuration, builds the dependency-injection
container, and discovers every adapter via entry-point auto-configuration. `app.bean_count` and
-`app.applied_auto_configurations` tell you what got wired. See [Architecture](architecture.md).
+`app.applied_auto_configurations` tell you what got wired.
+
+!!! success "Expected"
+
+    On a fresh `[tabular]` install the container wires a couple dozen beans from roughly a dozen
+    auto-configurations (exact counts depend on which extras are installed):
+
+    ```text
+    [1] App booted: 21 beans, 12 auto-configurations
+    ```
-## 2. Load and validate the data
+See [Architecture](architecture.md).
+
+## 2. Build, load, and validate the data
+
+The script generates the credit dataset with `make_credit_dataset()`. Default risk is a logistic
+function of `debt_to_income = loan_amount / income`, plus prior defaults and employment, but only the
+four raw columns are handed to the model; `debt_to_income` is the hidden driver that feature
+engineering must rediscover.
```python
+import numpy as np
+import pandas as pd
+
+from fireflyframework_datascience.core.types import TaskType
+from fireflyframework_datascience.datasets import Dataset
from fireflyframework_datascience.validation.adapters import BasicValidator
-dataset, validation = ... # build the credit dataset (see the script)
+
+def make_credit_dataset(n: int = 800, seed: int = 11) -> Dataset:
+    rng = np.random.RandomState(seed)
+    income = rng.normal(60_000, 18_000, n).clip(15_000, None)
+    loan_amount = rng.normal(18_000, 10_000, n).clip(1_000, None)
+    employment_years = rng.uniform(0, 30, n).round(1)
+    num_prior_defaults = rng.poisson(0.6, n)
+    dti = loan_amount / income  # (1)!
+    logit = -2.6 + 5.0 * dti + 1.3 * num_prior_defaults - 0.05 * employment_years + rng.normal(0, 0.25, n)
+    default = (rng.uniform(0, 1, n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
+    X = pd.DataFrame(
+        {
+            "income": income.round(2),
+            "loan_amount": loan_amount.round(2),
+            "employment_years": employment_years,
+            "num_prior_defaults": num_prior_defaults,  # (2)!
+        }
+    )
+    return Dataset(
+        "credit_applicants",
+        X,
+        pd.Series(default, name="default"),
+        task=TaskType.BINARY,
+        target_name="default",
+        feature_names=list(X.columns),
+    )
+
+
+dataset = make_credit_dataset()
report = BasicValidator().validate(dataset.X, dataset.y)
-assert report.ok # no all-null columns, no null target, etc.
-train, test = dataset.train_test_split(test_size=0.25, random_state=0)
+assert report.ok # no all-null columns, no null target, etc.
+train, test = dataset.train_test_split(test_size=0.25, random_state=0) # (3)!
```
+1. `dti` drives the label but is **never** put into `X`; that is the signal step 4 has to recover.
+2. Only these four raw columns reach the model; `debt_to_income` is deliberately absent.
+3. `train_test_split` stratifies on the target for classification; here it yields 600 train / 200 test rows.
+
The `BasicValidator` catches empty data, all-null/constant columns, duplicate rows, and null targets
-before you waste time training. See [Datasets](datasets.md).
+before you waste time training.
+
+!!! success "Expected"
+
+    ```text
+    [2] Data validated: ok=True
+    ```
+
+    The dataset is 800 rows × 4 features, a `TaskType.BINARY` task; the split gives 600 train / 200 test.
+
+See [Datasets](datasets.md).
## 3. Classical AutoML
@@ -47,13 +120,32 @@ from fireflyframework_datascience.automl import AutoML
result = AutoML(cv=4).fit(train)
print(result.leaderboard_table())
-print(result.evaluate(test)) # holdout metrics
+print(result.evaluate(test)) # holdout metrics
```
-AutoML cross-validates each candidate trainer (RandomForest, Linear, HistGradientBoosting, and the
-boosting libraries if installed), ranks them on a task-appropriate metric (`roc_auc` for binary), and
-refits the winner. Expected: a leaderboard topped by `linear` at **roc_auc ≈ 0.85** on holdout. See
-[Classical AutoML](automl.md).
+`AutoML` cross-validates each candidate trainer (`RandomForestTrainer`, `LinearTrainer`,
+`HistGradientBoostingTrainer`, plus the boosting libraries if installed), ranks them on a
+task-appropriate metric (`roc_auc` for binary), and refits the winner on the full training set. Each
+candidate is wrapped in an impute-and-scale preprocessing pipeline before scoring.
+
+!!! success "Expected"
+
+    A leaderboard topped by `linear`, and a holdout `roc_auc ≈ 0.85`:
+
+    ```text
+    linear roc_auc=0.7867
+    random_forest roc_auc=0.7493
+    hist_gradient_boosting roc_auc=0.7335
+    EvaluationResult(primary=roc_auc=0.8498; accuracy=0.7900, f1=0.7853, precision=0.7851, recall=0.7900, roc_auc=0.8498, log_loss=0.4500)
+    ```
+
+!!! note
+
+    The leaderboard prints **cross-validation** scores on the training data, while `evaluate(test)`
+    reports metrics on the untouched holdout, so the two figures differ: here the holdout `roc_auc`
+    (≈0.85) comes out above the CV figure (≈0.79). Both are real; they measure different things.
+
+See [Classical AutoML](automl.md).
## 4. GenAI feature engineering
@@ -62,19 +154,50 @@ from fireflyframework_datascience.features import StaticFeatureProposer, Feature
from fireflyframework_datascience.features.genai import GenAIFeatureEngineer
proposer = StaticFeatureProposer([
-    FeatureProposal("debt_to_income", "df['debt_to_income'] = df['loan_amount'] / (df['income'] + 1)", "DTI"),
-    FeatureProposal("noise", "df['noise'] = 0.0", "should be rejected"),
+    FeatureProposal("debt_to_income", "df['debt_to_income'] = df['loan_amount'] / (df['income'] + 1)", "DTI"),  # (1)!
+    FeatureProposal("noise", "df['noise'] = 0.0", "should be rejected"),  # (2)!
])
engineered = GenAIFeatureEngineer(proposer, cv=4).engineer(train)
print(engineered.summary())
```
-The loop is **propose → execute (safely) → measure CV lift → gate**. `debt_to_income` (the hidden
-driver) is **accepted** because it lifts the score; the constant `noise` feature is **rejected**. The
-LLM never decides; the measured score does. See [GenAI Feature Engineering](genai-features.md).
+1. The hidden driver: its CV lift clears `CostBenefitGate(min_gain=0.0)`, so it is **accepted**.
+2. A constant column adds nothing, so the gate **rejects** it; the LLM never overrides that decision.
+
+The loop is **propose → execute (safely) → measure CV lift → gate**. `debt_to_income` is accepted
+because it lifts the score; the constant `noise` feature is rejected. The LLM never decides; the
+measured score does.
+
+!!! success "Expected"
-> Here a `StaticFeatureProposer` stands in for the LLM so the tutorial runs offline. With a real model
-> you'd use `AgentFeatureProposer(model="openai:gpt-4o")` β see [Configuring the LLM](llm-configuration.md).
+    ```text
+    GenAI feature engineering: 1 accepted, 1 rejected; roc_auc 0.7875 -> 0.7889 (lift +0.0013)
+    ```
+
+    `engineered.accepted` lists `debt_to_income`; `engineered.rejected` lists `noise` with the reason
+    `no lift (0.7889 <= 0.7889)`. The lift is small but **positive and real**: the gate rejects
+    anything that does not strictly beat the running baseline.
+
+=== "Static (no LLM)"
+
+    `StaticFeatureProposer` stands in for the LLM so the tutorial runs offline with a fixed,
+    reproducible set of proposals, exactly what the snippet above uses.
+
+=== "Agent (LLM)"
+
+    With a real model you swap in `AgentFeatureProposer`, which wraps a `FireflyAgent` and is built
+    lazily (no LLM client is created at startup):
+
+    ```python
+    from fireflyframework_datascience.features.genai import AgentFeatureProposer, GenAIFeatureEngineer
+
+    proposer = AgentFeatureProposer(model="openai:gpt-4o")
+    engineered = GenAIFeatureEngineer(proposer, cv=4).engineer(train)
+    ```
+
+    See [Configuring the LLM](llm-configuration.md).
+
+See [GenAI Feature Engineering](genai-features.md).
## 5. The agentic ML-engineering loop
@@ -84,13 +207,30 @@ from fireflyframework_datascience.engineering.loop import AgenticAutoML
proposer = SequenceProposer([SolutionCandidate("linear"), SolutionCandidate("random_forest"),
SolutionCandidate("hist_gradient_boosting")])
-run = AgenticAutoML(proposer, cv=3).solve(train)
+run = AgenticAutoML(proposer, cv=3, max_iterations=4).solve(train) # (1)!
print(run.summary())
```
-Each candidate is trained, cross-validated, and **verified**: it must beat a trivial baseline, not
-merely run (the "correctness ≠ ran" principle), before the best one is selected. `run.attempts` is the
-full audited trail. See [Agentic Loop](agentic-loop.md).
+1. `AgenticAutoML` seeds the population, then reflects on the attempt history up to `max_iterations`
+ times; a `patience` budget (default 3) stops the search once it stalls.
+
+Each candidate is trained, cross-validated, and **verified** by a `DeterministicVerifier`: it must
+beat a trivial `DummyClassifier(strategy="prior")` baseline, not merely run (the "correctness ≠ ran"
+principle), before the best one is selected. `run.attempts` is the full audited trail and
+`run.valid_attempts` lists the attempts that passed verification.
+
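The "beat a trivial baseline" rule can be sketched in plain Python. This is a minimal illustration, not the framework's `DeterministicVerifier`: `roc_auc` and `verify` are hypothetical helpers written for this example. It shows why a predictor that gives every row the same score (which is what a prior-only dummy amounts to for ranking) lands at `roc_auc = 0.5`, and why only a strictly better candidate passes.

```python
from itertools import product


def roc_auc(y_true, scores):
    """AUC as the chance that a random positive outranks a random negative (a tie counts 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in product(pos, neg))
    return wins / (len(pos) * len(neg))


def verify(candidate_score, baseline_score):
    """Hypothetical rule: a candidate counts only if it strictly beats the baseline."""
    return candidate_score > baseline_score


y = [0, 0, 1, 1, 0, 1]
prior = sum(y) / len(y)

# A constant score ties every positive/negative pair, so the AUC is exactly 0.5.
baseline = roc_auc(y, [prior] * len(y))

# Made-up candidate scores that rank most positives above most negatives.
candidate = roc_auc(y, [0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

print(baseline, round(candidate, 4), verify(candidate, baseline))
```

A candidate that merely matches the baseline fails `verify`, which is the distinction the text draws between code that ran and a result that is correct.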
+
+
+!!! success "Expected"
+
+    ```text
+    Agentic AutoML: 3 attempts (3 verified); best=linear roc_auc=0.7897 (baseline 0.5000)
+    ```
+
+    All three seeded candidates clear the `roc_auc=0.5000` trivial baseline, so all three are verified;
+    `linear` wins.
+
+See [Agentic Loop](agentic-loop.md).
## 6. Serve the model
@@ -99,23 +239,49 @@ from fireflyframework_datascience.serving import LocalModelServer
server = LocalModelServer()
server.load(result.best_model)
-prediction = server.predict(test.X.iloc[[0]]) # score one applicant
+prediction = server.predict(test.X.iloc[[0]]) # score one applicant
+print(int(prediction[0]))
```
+`LocalModelServer` is the default, dependency-free server: it loads a fitted `Model` in the host
+process and answers `predict` / `predict_proba`. Heavier servers (e.g. `BentoMLModelServer`) live
+behind the `serving` extra.
+
+!!! success "Expected"
+
+    The first holdout applicant is scored as a non-default:
+
+    ```text
+    [6] Served prediction for one applicant: default=0
+    ```
+
See [Serving & Lineage](serving.md).
## Turn on a real LLM
+Steps 4 and 5 ran offline with deterministic stand-ins. To let a real model do the proposing, set
+your key and enable GenAI, then swap in the agent-backed proposers:
+
```bash
-export OPENAI_API_KEY=sk-... # or ANTHROPIC_API_KEY=...
+export OPENAI_API_KEY=sk-... # or ANTHROPIC_API_KEY=...
export FIREFLY_DATASCIENCE_GENAI__ENABLED=true
-export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=openai:gpt-4o
+export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=openai:gpt-4o # or anthropic:claude-sonnet-4-5
+```
+
+```python
+from fireflyframework_datascience.features.genai import AgentFeatureProposer
+from fireflyframework_datascience.engineering.loop import AgentSolutionProposer
```
-Then use `AgentFeatureProposer` / `AgentSolutionProposer` in place of the stand-ins. The full guide,
-including providers, keys, cost gating, and secure execution, is in
+Use `AgentFeatureProposer` in place of `StaticFeatureProposer` (step 4) and `AgentSolutionProposer` in
+place of `SequenceProposer` (step 5). Nothing else changes: the cost-benefit gate and the verifier
+still decide. The full guide, including providers, keys, cost gating, and secure execution, is in
[Configuring the LLM](llm-configuration.md).
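
The double underscore in `FIREFLY_DATASCIENCE_GENAI__ENABLED` reads like a nesting delimiter, the convention several settings libraries use to map flat environment variables onto nested configuration. The sketch below assumes that convention; `parse_env` is a hypothetical helper, not the framework's loader, and the real loader may coerce types or validate keys differently.

```python
def parse_env(environ, prefix="FIREFLY_DATASCIENCE_", delimiter="__"):
    """Fold PREFIX_SECTION__KEY=value pairs into a nested dict (assumed convention)."""
    config = {}
    for key, value in environ.items():
        if not key.startswith(prefix):
            continue  # unrelated variables such as OPENAI_API_KEY are ignored here
        path = key[len(prefix):].lower().split(delimiter)
        node = config
        for part in path[:-1]:
            node = node.setdefault(part, {})
        node[path[-1]] = value  # values stay strings; no type coercion in this sketch
    return config


env = {
    "OPENAI_API_KEY": "sk-...",
    "FIREFLY_DATASCIENCE_GENAI__ENABLED": "true",
    "FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL": "openai:gpt-4o",
}
settings = parse_env(env)
print(settings)
```

Under that assumption both variables land in one `genai` section, while the provider key stays a separate, unprefixed variable.
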
## See also
-- [Quick Start](quickstart.md) · [Configuration](configuration.md) · [Use Case: Lumen](use-case-lumen.md)
+- [Quick Start](quickstart.md)
+- [Samples](samples.md)
+- [Configuration](configuration.md)
+- [Agentic Loop](agentic-loop.md)
+- [Use Case: Lumen](use-case-lumen.md)
diff --git a/docs/use-case-lumen.md b/docs/use-case-lumen.md
index ca29b42..496767c 100644
--- a/docs/use-case-lumen.md
+++ b/docs/use-case-lumen.md
@@ -1,8 +1,8 @@
-# Use Case: Lumen Lending Credit Risk
+# Use case: Lumen Lending credit risk
-**An end-to-end walkthrough of the Firefly DataScience stack (GenAI feature engineering, classical AutoML, and in-process serving) on a realistic synthetic lending dataset.**
+**An end-to-end walkthrough of the Firefly DataScience stack (GenAI feature engineering, classical AutoML, and in-process serving) on a realistic synthetic lending dataset, with no LLM API key required.**
-The `samples/lumen_credit_risk.py` sample tells one focused story: a credit-risk model where default is *secretly* driven by **debt-to-income (DTI)**, a feature the model is never handed directly. We watch the framework rediscover it from raw columns, reject a useless noise feature, let AutoML pick a winner, and score a live applicant, all without an LLM API key.
+The `samples/lumen_credit_risk.py` sample tells one focused story: a credit-risk model where default is *secretly* driven by **debt-to-income (DTI)**, a feature the model is never handed directly. We watch the framework rediscover it from raw columns, reject a useless noise feature, let AutoML pick a winner, and score a live applicant.
The pipeline has three acts:
@@ -10,13 +10,22 @@ The pipeline has three acts:
2. **Classical AutoML** trains several models and selects the best by cross-validation.
3. **Serving** loads the winner in-process and scores a new applicant.
+!!! firefly "The reproducible pattern: the LLM proposes; the classical engine decides"
+
+    A proposer (an LLM in production, a deterministic stub here) suggests `debt_to_income`,
+    `loan_per_year_employed`, and `noise`. None of them is trusted on faith: each is executed,
+    cross-validated, and kept only if it clears the [`CostBenefitGate`](genai-features.md). The
+    measured score decides, not the proposer.
+
## Run it
```bash
python samples/lumen_credit_risk.py # needs the `tabular` extra
```
-## 1. Synthetic lending data
+The script's `run()` function returns a report dict; `main()` prints it. Every step below maps to one block of that function.
+
+## 1. Synthesize the lending data
`make_lending_dataset` builds a `Dataset` of raw applicant columns. Crucially, DTI (`loan_amount / income`) is the *latent* driver of default: it shapes the labels but is **not** a column the model sees.
@@ -35,7 +44,7 @@ def make_lending_dataset(n: int = 800, seed: int = 7) -> Dataset:
    # ... age, employment_years, credit_history_length, num_prior_defaults ...
    dti = loan_amount / income  # latent driver, NOT given to the model
-    logit = -3.0 + 4.0 * dti + 0.8 * num_prior_defaults - 0.05 * employment_years
+    logit = -3.0 + 4.0 * dti + 0.8 * num_prior_defaults - 0.05 * employment_years  # (1)!
    prob_default = 1.0 / (1.0 + np.exp(-logit))
    default = (rng.uniform(0, 1, n) < prob_default).astype(int)
@@ -50,16 +59,23 @@ def make_lending_dataset(n: int = 800, seed: int = 7) -> Dataset:
    )
```
-A `Dataset` carries its `X`, `y`, `task`, and `feature_names` together. Split it the usual way:
+1. The real logit also adds `- 0.02 * credit_history_length` and a small `rng.normal(0, 0.5, n)` noise term. DTI carries the largest coefficient (`4.0`), so it dominates the label, yet it never appears as a column in `X`.
+
+The six raw columns handed to the model are `income`, `loan_amount`, `age`, `employment_years`, `credit_history_length`, and `num_prior_defaults`. A `Dataset` carries its `X`, `y`, `task`, and `feature_names` together. Split it the usual way:
```python
dataset = make_lending_dataset()
train, test = dataset.train_test_split(test_size=0.25, random_state=0)
```
+!!! note "The task is `TaskType.BINARY`"
+
+    Because the task is binary classification, the framework's default selection metric is `roc_auc`;
+    that is the score the gate and AutoML maximize throughout this run.
+
## 2. GenAI feature engineering: discover DTI, reject noise
-A *feature proposer* emits `FeatureProposal`s: a name, a line of Python that mutates a DataFrame `df`, and a rationale. In production a `GenAIFeatureProposer` asks an LLM; the sample uses `StaticFeatureProposer` so it runs with **no API key**, while exercising the exact same gate.
+A *feature proposer* emits `FeatureProposal`s: a name, a line of Python that mutates a DataFrame `df`, and a rationale. In production an `AgentFeatureProposer` asks an LLM; the sample uses `StaticFeatureProposer` so it runs with **no API key**, while exercising the exact same gate.
```python
from fireflyframework_datascience.features import FeatureProposal, StaticFeatureProposer
@@ -82,7 +98,7 @@ proposer = StaticFeatureProposer(
)
```
-`GenAIFeatureEngineer` executes each proposal's code, measures the cross-validated lift against a baseline scorer, and **accepts a feature only if it helps**:
+`GenAIFeatureEngineer` runs the propose → execute → measure → gate loop. For each proposal it executes the code (via `FeatureCodeExecutor`), measures the cross-validated score against the current baseline, and **accepts the feature only if the gate accepts the lift**:
```python
from sklearn.linear_model import LogisticRegression
@@ -92,7 +108,7 @@ def _logistic_scorer(task):
    return LogisticRegression(max_iter=1000)
-fe = GenAIFeatureEngineer(proposer, scorer_estimator=_logistic_scorer, cv=4)
+fe = GenAIFeatureEngineer(proposer, scorer_estimator=_logistic_scorer, cv=4) # (1)!
engineered = fe.engineer(train)
print([a.proposal.name for a in engineered.accepted]) # ['debt_to_income', ...]
@@ -100,17 +116,31 @@ print([r.proposal.name for r in engineered.rejected]) # ['noise']
print(f"lift = {engineered.lift:+.4f}")
```
+1. `scorer_estimator` swaps the default `HistGradientBoosting*` scorer for a fast `LogisticRegression`, and `cv=4` sets the number of cross-validation folds used to measure each proposal's lift.
+
+The gate is a `CostBenefitGate` with a default `min_gain` of `0.0`: a candidate is accepted only if `candidate_score - current_score > min_gain`. That is why a feature must *strictly improve* the cross-validated score to be kept.
+
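The acceptance rule can be sketched as a small greedy loop. This is a simplified illustration, not the framework's `CostBenefitGate`: `run_gate` and `GateResult` are hypothetical names, the scores are supplied up front instead of being cross-validated, and the numbers are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class GateResult:
    baseline_score: float
    final_score: float
    accepted: list = field(default_factory=list)
    rejected: list = field(default_factory=list)

    @property
    def lift(self):
        return self.final_score - self.baseline_score


def run_gate(baseline_score, candidates, min_gain=0.0):
    """candidates: (name, score_with_feature_added) pairs, evaluated in order."""
    result = GateResult(baseline_score=baseline_score, final_score=baseline_score)
    current = baseline_score
    for name, score in candidates:
        gain = score - current
        if gain > min_gain:  # strict inequality: equal scores are rejected
            result.accepted.append((name, gain))
            current = score  # an accepted feature raises the bar for the next one
        else:
            result.rejected.append((name, f"no lift ({score:.4f} <= {current:.4f})"))
    result.final_score = current
    return result


result = run_gate(0.7875, [("debt_to_income", 0.7889), ("noise", 0.7889)])
print([name for name, _ in result.accepted], result.rejected, f"{result.lift:+.4f}")
```

Because the baseline moves after each acceptance, a constant column that leaves the score unchanged cannot clear a `min_gain` of `0.0`.
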
The result object exposes:
- `engineered.dataset`: the train set with accepted features added.
-- `engineered.accepted` / `engineered.rejected`: each wraps `.proposal` (so `.proposal.name`, `.proposal.code`).
-- `engineered.lift`: the net cross-validated improvement.
+- `engineered.accepted` / `engineered.rejected`: `accepted` items wrap `.proposal` plus `.score` and `.gain`; `rejected` items wrap `.proposal` plus a `.reason` (and `.score`).
+- `engineered.lift`: the net cross-validated improvement (`final_score - baseline_score`).
+- `engineered.summary()`: a one-line audit string of the whole step.
+
+`debt_to_income` is accepted because it reconstructs the latent driver and lifts the score; `noise` (a constant `0.0` column) executes fine but is rejected by the gate because it adds no lift.
+
+!!! warning "Proposed code runs through a safety analysis first"
-`debt_to_income` is accepted because it reconstructs the latent driver and lifts the score; `noise` (a constant column) is rejected because it adds nothing.
+    `FeatureCodeExecutor` is not `eval`-on-faith. Before any snippet runs it goes through a static
+    safety analysis that denies imports, dunder access, and dangerous builtins (`eval`, `exec`,
+    `open`, `__import__`, ...). The code then runs in a restricted namespace exposing only `df`,
+    `pd`, and `np`. A snippet that is unsafe, errors, adds no new column, or produces a non-numeric
+    column is turned into a `RejectedFeature` rather than crashing the loop. See
+    [Security](security.md).
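
A static check of this kind can be sketched with the standard-library `ast` module. This is an illustration of the idea only, not the code inside `FeatureCodeExecutor`: `find_violations` is a hypothetical helper, and `DENIED_NAMES` holds just the four builtins named above, whereas the real deny list is longer.

```python
import ast

# Only the builtins the text names; the framework's actual list is assumed to be longer.
DENIED_NAMES = {"eval", "exec", "open", "__import__"}


def find_violations(code: str) -> list[str]:
    """Return the reasons a snippet would be refused; an empty list means it passed."""
    try:
        tree = ast.parse(code)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]
    violations = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            violations.append("import statement")
        elif isinstance(node, ast.Attribute) and node.attr.startswith("__"):
            violations.append(f"dunder attribute: {node.attr}")
        elif isinstance(node, ast.Name) and (node.id in DENIED_NAMES or node.id.startswith("__")):
            violations.append(f"denied name: {node.id}")
    return violations


print(find_violations("df['debt_to_income'] = df['loan_amount'] / (df['income'] + 1)"))
print(find_violations("import os"))
print(find_violations("open('/etc/passwd')"))
```

A syntactic filter like this narrows what a snippet can express; it does not replace the restricted namespace, which is why the executor applies both.
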
## 3. Classical AutoML on the engineered features
-`AutoML.fit` trains its trainers over the engineered dataset and ranks them by cross-validation.
+`AutoML.fit` trains its trainers over the engineered dataset and ranks them by cross-validation. The default trainer set is `random_forest`, `linear`, and `hist_gradient_boosting`, so `result.best_model.name` is one of those three.
```python
from fireflyframework_datascience.automl import AutoML
@@ -118,11 +148,11 @@ from fireflyframework_datascience.automl import AutoML
automl = AutoML(cv=4)
result = automl.fit(engineered.dataset)
-print(result.best_model.name) # the winning trainer
+print(result.best_model.name) # the winning trainer (e.g. 'hist_gradient_boosting')
print(result.leaderboard_table()) # ranked comparison
```
-To evaluate on the held-out test set, the **accepted** feature code must be re-applied so train and test stay consistent. The sample uses `FeatureCodeExecutor` for this, then evaluates:
+To evaluate on the held-out test set, the **accepted** feature code must be re-applied so train and test stay consistent. The sample reuses `FeatureCodeExecutor` for this, then evaluates with `AutoMLResult.evaluate`:
```python
from fireflyframework_datascience.features.executor import FeatureCodeExecutor
@@ -137,6 +167,15 @@ evaluation = result.evaluate(engineered_test)
print(evaluation.metrics) # holdout metrics dict
```
+!!! tip "Why re-apply the code, not the values"
+
+    The accepted features were measured on the train fold. Re-executing the *same code* on the test
+    frame is what keeps train and test schemas identical: the model trained on `debt_to_income`
+    would fail on a test frame that does not have it. `engineered.accepted` is the audit trail that
+    makes this replay exact.
+
+For a binary task the holdout `evaluation.metrics` dict contains `accuracy`, `f1`, `precision`, and `recall`, plus `roc_auc` and `log_loss` when the winning model exposes `predict_proba`.
+
## 4. Serve the winner and score an applicant
`LocalModelServer` runs the winning model in-process (no network, no container), so you can score immediately.
@@ -152,28 +191,41 @@ prediction = server.predict(applicant)
print(int(prediction[0])) # 0 = no default, 1 = default
```
+=== "In-process (the sample)"
+
+    `LocalModelServer` is the default, dependency-free server. `load` holds the fitted `Model`;
+    `predict` (and `predict_proba`) delegate straight to it. Nothing leaves the host process.
+
+=== "Heavier servers"
+
+    For production deployment, BentoML/KServe (and vLLM/TGI for LLMs) adapters live behind the
+    `serving` extra. The port (`ModelServerPort`) is identical, so swapping the server does not
+    change calling code. See [Serving](serving.md).
+
## Expected output
Running the sample prints a report like:
-```text
-=== Lumen Lending - credit-risk AutoML ===
-accepted features : ['debt_to_income', 'loan_per_year_employed']
-rejected features : ['noise']
-feature-eng lift : +0.0XYZ
-winning model :
-leaderboard:
-
-holdout metrics : {'accuracy': ..., 'roc_auc': ...}
-applicant predicted default = 0
-```
+!!! success "Expected"
+
+    ```text
+    === Lumen Lending - credit-risk AutoML ===
+    accepted features : ['debt_to_income', 'loan_per_year_employed']
+    rejected features : ['noise']
+    feature-eng lift : +0.0XYZ
+    winning model :
+    leaderboard:
+
+    holdout metrics : {'accuracy': ..., 'roc_auc': ...}
+    applicant predicted default = 0
+    ```
-Exact numbers vary with your scikit-learn version, but the shape is stable: **`debt_to_income` is accepted, `noise` is rejected, the lift is positive, and a winner is served.** That is the whole point: the framework rediscovers the signal you deliberately hid.
+Exact numbers vary with your scikit-learn version, but the shape is stable: **`debt_to_income` is accepted, `noise` is rejected, the lift is positive, and a winner is served.** That is the whole point: the framework rediscovers the signal you deliberately hid, and the cost-benefit gate throws away the feature that adds nothing.
## See also
-- [GenAI Feature Engineering](genai-features.md)
+- [GenAI feature engineering](genai-features.md)
- [AutoML](automl.md)
- [Datasets](datasets.md)
- [Serving](serving.md)
-- [Getting Started](quickstart.md)
+- [Getting started](quickstart.md)
diff --git a/mkdocs.yml b/mkdocs.yml
index 6825b78..8b4d6d4 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -11,8 +11,11 @@ use_directory_urls: false
theme:
  name: material
-  logo: img/banner.svg
-  favicon: img/banner.svg
+  logo: img/logo.svg
+  favicon: img/favicon.svg
+  font:
+    text: Inter
+    code: JetBrains Mono
  palette:
    - media: "(prefers-color-scheme: light)"
      scheme: default
@@ -29,14 +32,21 @@ theme:
        icon: material/weather-sunny
        name: Switch to light mode
  features:
+    - navigation.tabs
+    - navigation.tabs.sticky
    - navigation.sections
+    - navigation.indexes
    - navigation.top
+    - navigation.tracking
    - navigation.footer
    - navigation.instant
+    - navigation.instant.progress
    - content.code.copy
    - content.code.annotate
+    - content.tooltips
    - search.suggest
    - search.highlight
+    - search.share
    - toc.follow
  icon:
    repo: fontawesome/brands/github
@@ -55,10 +65,33 @@ markdown_extensions:
  - pymdownx.details
  - pymdownx.tabbed:
      alternate_style: true
+  - pymdownx.tasklist:
+      custom_checkbox: true
+  - pymdownx.emoji:
+      emoji_index: !!python/name:material.extensions.emoji.twemoji
+      emoji_generator: !!python/name:material.extensions.emoji.to_svg
+  - pymdownx.keys
+  - footnotes
plugins:
  - search
+extra_css:
+  - stylesheets/firefly.css
+
+extra:
+  generator: false
+  social:
+    - icon: fontawesome/brands/github
+      link: https://github.com/fireflyframework/fireflyframework-datascience
+      name: Firefly DataScience on GitHub
+    - icon: material/rocket-launch-outline
+      link: https://github.com/fireflyframework/fireflyframework-agentic
+      name: Firefly Agentic - the GenAI substrate
+    - icon: fontawesome/brands/python
+      link: https://pypi.org/project/fireflyframework-datascience/
+      name: fireflyframework-datascience on PyPI
+
nav:
  - Home: index.md
  - Getting started: