diff --git a/CHANGELOG.md b/CHANGELOG.md
index 4a95c01..45422a6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -19,6 +19,11 @@ All notable changes to `fireflyframework-datascience` are documented here. The p
   generated diagram set (8 diagrams: architecture, hexagonal, automl-loop, genai-fusion, agentic-loop, auto-configuration, security, ecosystem) under `docs/img/`.
 - **Polished README** (compelling 5-line quick start, docs-site link) and a new **`CONTRIBUTING.md`**.
+- **Fully diagrammed** — all 8 diagrams embedded across the README ("how it works" visual tour) and the
+  docs pages; a `docs/README.md` table-of-contents for GitHub folder browsing.
+- **Repository metadata** — description, homepage (docs site), and 20 topics set via `gh`.
+- **Fix:** the `.gitignore` rule `datasets/` was excluding the `datasets` source module from git (it had
+  never been committed); anchored the data-artifact ignores to the repo root and formatted the module.

 ### AMLB benchmark (Tier-1)

diff --git a/README.md b/README.md
index 751fa96..f0bfc82 100644
--- a/README.md
+++ b/README.md
@@ -80,25 +80,68 @@ Add a real LLM for GenAI feature engineering and the agentic loop — see
 [Configuring the LLM](docs/llm-configuration.md). The full guided walkthrough is the
 [Tutorial](docs/tutorial.md).

-## Architecture
+## How it works

-Five acyclic layers, mirroring `fireflyframework-agentic` with a **DataScience** layer inserted. Every
-ML/MLOps library is a swappable adapter behind a `Protocol` port, registered by **entry-point
-auto-configuration** and resolved through a type-hint **dependency-injection container**.
+### Layered architecture
+
+Five acyclic layers, mirroring `fireflyframework-agentic` with a **DataScience** layer inserted:
+`Core → Agent (reused) → DataScience → Intelligence → Orchestration`.

- Firefly DataScience layered architecture + Firefly DataScience layered architecture

-```
-Core → Agent (reused: agentic) → DataScience → Intelligence → Orchestration
-```
+### Hexagonal ports & adapters
+
+Every ML/MLOps library (scikit-learn, XGBoost, AutoGluon, TabPFN, PyTorch Lightning, HuggingFace,
+MLflow, BentoML, …) is a swappable adapter behind a `Protocol` port. The core stays library-agnostic.
+
+

+ Hexagonal ports and adapters +
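
To illustrate the ports-and-adapters wording in this hunk, here is a minimal sketch in plain Python. `TrainerPort`, `MeanBaselineAdapter`, and `train_and_predict` are hypothetical names chosen for illustration; they are not taken from the framework, whose real ports may look different.

```python
from typing import Protocol, runtime_checkable


@runtime_checkable
class TrainerPort(Protocol):
    """Hypothetical port: anything that can fit on rows and then predict."""

    def fit(self, X: list[list[float]], y: list[float]) -> None: ...

    def predict(self, X: list[list[float]]) -> list[float]: ...


class MeanBaselineAdapter:
    """Hypothetical adapter: predicts the training mean, no third-party imports."""

    def __init__(self) -> None:
        self._mean = 0.0

    def fit(self, X: list[list[float]], y: list[float]) -> None:
        self._mean = sum(y) / len(y)

    def predict(self, X: list[list[float]]) -> list[float]:
        return [self._mean for _ in X]


def train_and_predict(trainer: TrainerPort, X, y, X_new):
    # Core code depends only on the port, never on a concrete library.
    trainer.fit(X, y)
    return trainer.predict(X_new)


preds = train_and_predict(MeanBaselineAdapter(), [[0.0], [1.0]], [2.0, 4.0], [[5.0]])
```

Because `Protocol` matching is structural, the adapter does not inherit from the port; an adapter wrapping a real library would only need the same two methods.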

+
+### Auto-configuration
+
+Adapters self-register via entry points and are wired by a type-hint dependency-injection container,
+gated by `@conditional_on_*` — exactly like Spring Boot / pyfly.
+
+

+ Entry-point auto-configuration +

+
+### Classical AutoML
+
+

+ Classical AutoML pipeline +

+
+### Governed GenAI × classical fusion
+
+The LLM proposes code/features; a deterministic engine measures; a **cost/benefit gate** keeps only
+what beats the seeded baseline. The LLM never decides — the measured score does.
+
+

+ Governed GenAI and classical fusion +
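
The gating rule described in this hunk can be sketched as a pure function over measured scores. This is an assumption-laden illustration, not the framework's `CostBenefitGate`: `Candidate`, `gate`, `min_gain`, and `max_cost` are hypothetical names, and the numbers are made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    name: str
    score: float  # measured by the deterministic engine; higher is better
    cost: float   # hypothetical cost of producing the candidate (e.g. LLM spend)


def gate(candidates, baseline_score, min_gain=0.0, max_cost=float("inf")):
    """Keep candidates whose measured gain over the baseline exceeds min_gain
    and whose cost stays within max_cost. The proposer has no vote."""
    return [
        c for c in candidates
        if c.score - baseline_score > min_gain and c.cost <= max_cost
    ]


proposals = [
    Candidate("ratio_feature", score=0.84, cost=0.02),   # beats baseline, cheap
    Candidate("noise_feature", score=0.79, cost=0.01),   # worse than baseline
    Candidate("pricey_feature", score=0.90, cost=5.0),   # good, but over budget
]
kept = gate(proposals, baseline_score=0.80, min_gain=0.01, max_cost=1.0)
kept_names = [c.name for c in kept]
```

Only the measured `score` and `cost` enter the decision, which is the point of the "LLM never decides" sentence above.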

+
+### The agentic ML-engineering loop
+
+Propose → execute (sandboxed) → observe → **verify** (correctness ≠ ran) → reflect → select.
+
+

+ Agentic ML-engineering loop +

+
+### Secure by default
+
+

+ Secure-by-default execution tiers +

-The GenAI ↔ classical fusion is governed: the LLM proposes code; the classical engine measures; a
-cost/benefit gate keeps only what beats the baseline.
+### Where it fits

- Governed GenAI and classical fusion + Firefly ecosystem

 ## Documentation
diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..9aec24a
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,57 @@
+# Firefly DataScience — Documentation
+
+**The complete documentation set.** Browse it as a rendered site at
+****, or read the Markdown here.
+
+

+ Firefly DataScience +

+
+## Table of contents
+
+### Getting started
+| Page | What it covers |
+|---|---|
+| [Home / Overview](index.md) | what the framework is, the 7 pillars, the architecture at a glance |
+| [Quick Start](quickstart.md) | install, boot, your first AutoML run, the `firefly-ds` CLI |
+| [Tutorial](tutorial.md) | the guided, runnable end-to-end walkthrough (offline, tested) |
+| [Configuration](configuration.md) | env vars, `.env`, YAML, and profile precedence |
+| [Configuring the LLM](llm-configuration.md) | providers, API keys, model selection, cost & budget gating |
+
+### Concepts
+| Page | What it covers |
+|---|---|
+| [Architecture](architecture.md) | the five layers, hexagonal ports/adapters, the DI container, auto-configuration |
+| [Datasets](datasets.md) | the `Dataset` container, loaders, `train_test_split`, task inference |
+| [Classical AutoML](automl.md) | the `AutoML` facade, trainers, search policies, metrics, the leaderboard |
+| [GenAI Feature Engineering](genai-features.md) | propose → execute → measure → gate; the `CostBenefitGate` |
+| [Agentic ML-Engineering Loop](agentic-loop.md) | propose → train → verify → reflect → select |
+| [Deep Learning & TabFM](deep-learning.md) | sklearn-MLP, PyTorch Lightning, HuggingFace, TabPFN |
+| [Serving & Lineage](serving.md) | the in-process server, gated backends, lineage |
+| [Security Model](security.md) | secure code execution, sandbox tiers, prompt-injection defense |
+| [Benchmarks](benchmarks.md) | the three-tier AMLB/OpenML-anchored evaluation strategy |
+
+### Use case
+| Page | What it covers |
+|---|---|
+| [Lumen Lending — Credit Risk](use-case-lumen.md) | a full, realistic walkthrough end to end |
+
+## Diagrams
+
+All diagrams are generated (WeasyPrint-safe SVG, teal palette) by
+[`assets/tools/gen_diagrams.py`](../assets/tools/gen_diagrams.py) into [`img/`](img):
+
+| Diagram | |
+|---|---|
+| [Architecture](img/architecture.svg) | the five-layer design |
+| [Hexagonal ports](img/hexagonal.svg) | ports & adapters around a library-agnostic core |
+| [Auto-configuration](img/auto-configuration.svg) | entry-point discovery → conditions → beans |
+| [AutoML pipeline](img/automl-loop.svg) | the classical AutoML flow |
+| [GenAI × classical fusion](img/genai-classical-fusion.svg) | the governed fusion |
+| [Agentic loop](img/agentic-loop.svg) | propose → verify → reflect → select |
+| [Security tiers](img/security.svg) | the secure-by-default execution model |
+| [Ecosystem](img/ecosystem.svg) | how this sits beside Agentic and PyFly |
+
+---
+
+Copyright 2026 Firefly Software Foundation · Licensed under the Apache License 2.0
diff --git a/docs/agentic-loop.md b/docs/agentic-loop.md
index 211d8e8..e601db2 100644
--- a/docs/agentic-loop.md
+++ b/docs/agentic-loop.md
@@ -10,6 +10,10 @@ an iteration and patience budget.

 The whole cycle is: **propose → train/CV → verify → reflect → select**.

+

+ The agentic ML-engineering loop +

+
 ## The pieces

 | Type | Role |
diff --git a/docs/architecture.md b/docs/architecture.md
index 3e2b80a..b139057 100644
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -168,3 +168,11 @@ Passing `auto_configurations=[...]` **replaces** discovery entirely (handy for h
 - [Configuration](./configuration.md)
 - [Ports and adapters reference](index.md)
 - [Writing an auto-configuration](index.md)
+
+## Auto-configuration flow
+
+Adapters self-register via the `firefly_datascience.auto_configuration` entry-point group; the application context discovers them, evaluates their conditions, and registers the surviving beans.
+
+

+ Entry-point auto-configuration +

diff --git a/docs/datasets.md b/docs/datasets.md
index 646d50f..afa0b55 100644
--- a/docs/datasets.md
+++ b/docs/datasets.md
@@ -8,6 +8,10 @@ scikit-learn are imported lazily), so the `Dataset` type and the `DatasetLoaderP
 usable without the `tabular` extra installed. Concrete loaders live in
 `fireflyframework_datascience.datasets.adapters`.

+

+ Hexagonal ports and adapters +

+
 ## The `Dataset` container

 `Dataset` is a dataclass. The only required fields are `name` and `X`.
diff --git a/docs/security.md b/docs/security.md
index 994c60f..d3bca2b 100644
--- a/docs/security.md
+++ b/docs/security.md
@@ -4,6 +4,10 @@
 The GenAI accelerators (CAAFE-style automated feature engineering, agentic analysis) ask a model to *write Python that runs against your data*. That is an attack surface. The framework's job is to make the default path safe even when the model is wrong, compromised, or steered by adversarial data. This page describes the trust model, the controls that enforce it, and — importantly — where those controls stop.

+

+ Secure-by-default execution tiers +

+
 ## Threat model

 The model is **not** trusted. We assume any of:
diff --git a/docs/serving.md b/docs/serving.md
index 6558e3b..dad03a2 100644
--- a/docs/serving.md
+++ b/docs/serving.md
@@ -4,6 +4,10 @@
 Firefly DataScience keeps the core dependency-free. A fitted `Model` is served by a `ModelServerPort`; experiment runs go through a `TrackerPort`; data/model lineage flows through a `LineagePort`. Each port ships a zero-dependency default and an opt-in adapter behind an extra.

+

+ Firefly ecosystem +

+
 ## The model-server port

 ```python
diff --git a/mkdocs.yml b/mkdocs.yml
index bb76a0a..fafdfc2 100644
--- a/mkdocs.yml
+++ b/mkdocs.yml
@@ -75,5 +75,8 @@ nav:
   - Benchmarks: benchmarks.md
   - Use case — Lumen: use-case-lumen.md

-not_in_nav: |
-  /superpowers/
+# README.md is the GitHub folder index (table of contents); index.md is the site home. Keep the
+# former in the repo but exclude it (and the local-only specs) from the built site.
+exclude_docs: |
+  README.md
+  superpowers/