Merged
41 changes: 41 additions & 0 deletions .github/workflows/docs.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
name: Docs

on:
  push:
    branches: [main]
  workflow_dispatch:

permissions:
  contents: read
  pages: write
  id-token: write

# Allow only one concurrent deployment.
concurrency:
  group: pages
  cancel-in-progress: false

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: astral-sh/setup-uv@v5
      - uses: actions/setup-python@v5
        with:
          python-version: "3.13"
      - run: uv sync --only-group docs
      - run: uv run mkdocs build --strict
      - uses: actions/upload-pages-artifact@v3
        with:
          path: site

  deploy:
    needs: build
    runs-on: ubuntu-latest
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4
3 changes: 3 additions & 0 deletions .gitignore
@@ -38,6 +38,9 @@ wandb/
# Brand asset build cache
assets/.tools/.cache/

# mkdocs build output
site/

# OS
.DS_Store

15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,21 @@ All notable changes to `fireflyframework-datascience` are documented here. The p

## [Unreleased]

### Documentation & developer experience

- **A tested, runnable [tutorial](docs/tutorial.md)** (`samples/tutorial.py`) — a guided end-to-end tour
(boot → load/validate → AutoML → GenAI feature engineering → agentic loop → serve) that runs offline
with no LLM key. A test guarantees it works.
- **A thorough [LLM-configuration guide](docs/llm-configuration.md)** — providers + model strings, API
keys, enabling GenAI, cost/budget gating, secure execution, and offline/test usage.
- **A professional [mkdocs Material docs site](https://fireflyframework.github.io/fireflyframework-datascience/)**
(`mkdocs.yml`, `docs` dependency group) — builds clean under `--strict`; deployed to GitHub Pages by a
new `Docs` workflow. All internal links fixed.
- **Better visuals** — a refined `assets/banner.svg` (eyebrow, data-constellation motif) and an expanded
generated diagram set (8 diagrams: architecture, hexagonal, automl-loop, genai-fusion, agentic-loop,
auto-configuration, security, ecosystem) under `docs/img/`.
- **Polished README** (compelling 5-line quick start, docs-site link) and a new **`CONTRIBUTING.md`**.

### AMLB benchmark (Tier-1)

- **`benchmarks/amlb_benchmark.py`** — runs the AutoML facade across real OpenML-CC18 tasks (with
74 changes: 74 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,74 @@
# Contributing to Firefly DataScience

Thanks for helping build the framework. This guide gets you from clone to green PR.

## Development setup

Requires **Python 3.13** and [`uv`](https://docs.astral.sh/uv/). On macOS, the boosting libraries need
OpenMP:

```bash
brew install libomp # macOS only (xgboost / lightgbm)
git clone https://github.com/fireflyframework/fireflyframework-datascience
cd fireflyframework-datascience
uv sync --extra tabular --extra data --extra validation --group dev
```

The single framework dependency, `fireflyframework-agentic`, resolves from its public git repo.
Add extras as you work on them: `--extra dl` (PyTorch), `--extra nlp` (HuggingFace), `--extra tabfm`
(TabPFN), `--extra genai` (the agentic accelerators), or `--extra full`.

## The quality gate

Every change must pass the same gate CI runs:

```bash
uv run ruff check src/ tests/ # lint
uv run ruff format --check src/ tests/ # format
uv run pyright # type-check
uv run pytest # tests (integration/nightly excluded by default)
```

- **`-m integration`** runs network/heavy tests (OpenML, HuggingFace downloads).
- **`-m nightly`** runs long-running suites (full AMLB, GPU).
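
The gate above can also be scripted so that it stops at the first failing step. The following is a minimal sketch, not a script that ships with the repository: `GATE`, `run_gate`, and `fake_runner` are hypothetical names, and the commands are copied from the list above. The runner is injectable so the control flow can be exercised without `uv` installed.

```python
import subprocess
from typing import Callable, Sequence

# The four gate commands listed above, in the same order.
GATE: list[list[str]] = [
    ["uv", "run", "ruff", "check", "src/", "tests/"],
    ["uv", "run", "ruff", "format", "--check", "src/", "tests/"],
    ["uv", "run", "pyright"],
    ["uv", "run", "pytest"],
]


def _run(cmd: Sequence[str]) -> int:
    """Default runner: execute the command and return its exit code."""
    return subprocess.run(list(cmd), check=False).returncode


def run_gate(runner: Callable[[Sequence[str]], int] = _run) -> int:
    """Run each step in order and stop at the first non-zero exit code."""
    for cmd in GATE:
        code = runner(cmd)
        if code != 0:
            print("gate failed at:", " ".join(cmd))
            return code
    print("gate passed")
    return 0


def fake_runner(cmd: Sequence[str]) -> int:
    """Dry-run stand-in (hypothetical): pretend only pyright fails."""
    return 1 if "pyright" in cmd else 0


exit_code = run_gate(fake_runner)
print(exit_code)
```

Inside a checkout with `uv` available, calling `run_gate()` with no arguments would run the real commands.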

## Conventions

- **CalVer** `YY.MM.PATCH`; the version lives in `src/fireflyframework_datascience/_version.py`.
- **Apache-2.0 header** on every `.py`: `# Copyright 2026 Firefly Software Foundation.`
- **Hexagonal**: each module consists of a light `__init__.py` (ports = `Protocol`s + DTOs), a heavy
`adapters.py` (concrete implementations that lazy-import optional libraries), and a light
`auto_configuration.py` that registers beans via `@bean`, gated by `@conditional_on_class` /
`@conditional_on_property`.
- **Lazy imports** of optional heavy dependencies are deliberate: they keep the core importable without
any extra installed. Ruff's `PLC0415` rule (import outside top level) is therefore relaxed for the
DataScience subtree.
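
The gating idea behind the Hexagonal convention can be pictured with a small stand-in. This is a sketch under assumptions, not the framework's implementation: the real `@bean` and `@conditional_on_class` decorators come from the framework and their signatures are not shown here. The `conditional_on_class` and `REGISTRY` below are hypothetical and only probe whether a module is importable via `importlib.util.find_spec`.

```python
import importlib.util
from typing import Callable

# Hypothetical registry standing in for the framework's bean container.
REGISTRY: dict[str, Callable[[], object]] = {}


def conditional_on_class(module_name: str):
    """Register the factory only if `module_name` is importable (stand-in)."""

    def decorator(factory: Callable[[], object]) -> Callable[[], object]:
        try:
            available = importlib.util.find_spec(module_name) is not None
        except (ImportError, ValueError):
            available = False
        if available:
            REGISTRY[factory.__name__] = factory
        return factory

    return decorator


@conditional_on_class("json")  # stdlib module, always present
def json_codec() -> object:
    import json  # lazy import inside the factory

    return json


@conditional_on_class("definitely_not_installed_lib")  # hypothetical name
def missing_adapter() -> object:
    raise RuntimeError("never registered, so never called")


print(sorted(REGISTRY))
```

Only the factory whose library is present ends up registered, so importing the configuration module never fails because an extra is missing.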

## Adding an adapter

1. Define (or reuse) the `Protocol` port in the module's `__init__.py`.
2. Implement the adapter in `adapters.py`, importing the heavy library *inside* the method and raising
`AdapterUnavailableError("MyAdapter", "<extra>")` when it is missing.
3. Register it in `auto_configuration.py` behind `@conditional_on_class("<library>")`.
4. Add the entry point under `[project.entry-points."firefly_datascience.auto_configuration"]`.
5. Add the optional dependency to `[project.optional-dependencies]`.
6. Write a test (mark it `integration`/`nightly` if it needs network/GPU).
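
Steps 1 and 2 can be sketched in plain Python. This is an illustration only: `AdapterUnavailableError` is re-declared locally as a stand-in for the framework's exception (its real constructor and message may differ), and `Scorer`, `FancyScorer`, `try_score`, and the `fancy_ml_lib` package are hypothetical names.

```python
from typing import Protocol


class AdapterUnavailableError(ImportError):
    """Local stand-in for the framework's exception (assumed signature)."""

    def __init__(self, adapter: str, extra: str) -> None:
        super().__init__(f"{adapter} requires the optional extra '{extra}'")
        self.adapter = adapter
        self.extra = extra


class Scorer(Protocol):
    """Hypothetical port: anything with a matching score() satisfies it."""

    def score(self, values: list[float]) -> float: ...


class FancyScorer:
    """Hypothetical adapter backed by an optional heavy library."""

    def score(self, values: list[float]) -> float:
        try:
            # Imported inside the method so this module stays importable
            # when the optional dependency is not installed.
            import fancy_ml_lib  # hypothetical package name
        except ImportError as exc:
            raise AdapterUnavailableError("FancyScorer", "fancy") from exc
        return float(fancy_ml_lib.score(values))


def try_score(scorer: Scorer, values: list[float]) -> str:
    """Call the port and report a missing extra instead of crashing."""
    try:
        return f"score={scorer.score(values)}"
    except AdapterUnavailableError as err:
        return f"unavailable: {err}"


print(try_score(FancyScorer(), [0.1, 0.9]))
```

Because the import happens inside `score`, the failure surfaces only when the adapter is actually used, and the error names the extra to install.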

## Docs

Docs are an [mkdocs Material](https://squidfunk.github.io/mkdocs-material/) site under `docs/`.

```bash
uv sync --only-group docs
uv run mkdocs serve # live preview at http://127.0.0.1:8000
uv run mkdocs build --strict # must pass (no broken links)
```

Diagrams are generated — edit `assets/tools/gen_diagrams.py`, run it, and commit the SVGs:

```bash
uv run python assets/tools/gen_diagrams.py # writes docs/img/*.svg
```

## Commits & PRs

Keep the gate green, write a clear commit message, open a PR against `main`, and make sure CI passes.
Thank you! 🐝
39 changes: 27 additions & 12 deletions README.md
@@ -27,11 +27,11 @@

---

> **Status:** active build. Delivered and green (ruff + pyright + 87 tests): **SP0** Foundation and
> Firefly DNA · **SP1** classical tabular AutoML · **SP2** GenAI feature engineering · **SP3** the
> agentic ML-engineering loop · **SP4** deep-learning / TabFM ports (verified sklearn-MLP; gated
> Torch/TabPFN) · **SP5** serving, lineage and the Lumen credit-risk sample. **SP6** (documentation
> book) is in progress. See [`docs/`](docs/index.md) for the full guide.
> **Status:** all sub-projects delivered and green (ruff · pyright · 90+ tests). Classical tabular
> AutoML · GenAI feature engineering · the agentic ML-engineering loop · deep learning (PyTorch
> Lightning) + NLP (HuggingFace) + vision · TabFM · serving · the OpenML-AMLB benchmark harness.
> **New here? Start with the [Tutorial](docs/tutorial.md)** or browse the
> **[documentation site](https://fireflyframework.github.io/fireflyframework-datascience/)**.

## What is this?

@@ -53,30 +53,41 @@ swappability, and security by default.
## Quick start

```bash
uv add fireflyframework-datascience # core
uv add 'fireflyframework-datascience[automl-stack]' # + classical AutoML + tracking
uv add 'fireflyframework-datascience[tabular]' # classical AutoML
# or: uv add 'fireflyframework-datascience[automl-stack]' # + TabPFN, MLflow, OpenML
```

Train, rank, and evaluate models in five lines:

```python
from fireflyframework_datascience import FireflyDataScienceApplication
from fireflyframework_datascience.automl import AutoML
from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader

app = FireflyDataScienceApplication.run() # prints banner + wiring summary
print(app.config.default_ml_framework)
train, test = SklearnDatasetLoader().load("breast_cancer").train_test_split()
result = AutoML().fit(train) # cross-validates candidates, picks the winner
print(result.leaderboard_table()) # random_forest / linear / hist_gradient_boosting …
print(result.evaluate(test)) # holdout roc_auc ≈ 0.98
```
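
For readers new to the idea, "cross-validates candidates, picks the winner" can be illustrated with the standard library alone. This sketch is not how `AutoML().fit` is implemented. The toy data, the two candidate trainers, and every name below are assumptions made for illustration.

```python
from statistics import mean
from typing import Callable

Row = tuple[float, int]  # (feature, label)
Model = Callable[[float], int]
Trainer = Callable[[list[Row]], Model]


def train_majority(rows: list[Row]) -> Model:
    """Baseline: always predict the most common training label."""
    ones = sum(label for _, label in rows)
    majority = 1 if ones * 2 >= len(rows) else 0
    return lambda x: majority


def train_threshold(rows: list[Row]) -> Model:
    """Predict 1 when the feature exceeds the midpoint of the class means."""
    pos = [x for x, y in rows if y == 1]
    neg = [x for x, y in rows if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x > cut else 0


def cv_accuracy(trainer: Trainer, rows: list[Row], k: int = 4) -> float:
    """Mean accuracy over k folds (fold i holds out every k-th row)."""
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        model = trainer(train)
        scores.append(mean(1.0 if model(x) == y else 0.0 for x, y in test))
    return mean(scores)


# Toy data: the label is 1 when the feature is large.
data: list[Row] = [(float(i), 1 if i >= 10 else 0) for i in range(20)]
candidates = {"majority": train_majority, "threshold": train_threshold}

# Rank candidates by cross-validated accuracy, best first.
leaderboard = sorted(
    ((name, cv_accuracy(trainer, data)) for name, trainer in candidates.items()),
    key=lambda item: item[1],
    reverse=True,
)
winner = leaderboard[0][0]
print(leaderboard, winner)
```

The leaderboard is simply the candidates sorted by their mean out-of-fold score, and the winner is the first entry.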

Boot it as a Firefly application (auto-configuration + dependency injection), or use the CLI:

```bash
firefly-ds doctor # check your environment & installed adapters
firefly-ds introspect # boot the app and show discovered auto-configurations
```

Add a real LLM for GenAI feature engineering and the agentic loop — see
[Configuring the LLM](docs/llm-configuration.md). The full guided walkthrough is the
[Tutorial](docs/tutorial.md).

## Architecture

Five acyclic layers, mirroring `fireflyframework-agentic` with a **DataScience** layer inserted. Every
ML/MLOps library is a swappable adapter behind a `Protocol` port, registered by **entry-point
auto-configuration** and resolved through a type-hint **dependency-injection container**.

<p align="center">
<img src="assets/diagrams/architecture.svg" alt="Firefly DataScience layered architecture" width="70%">
<img src="docs/img/architecture.svg" alt="Firefly DataScience layered architecture" width="70%">
</p>
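
The combination of `Protocol` ports and type-hint injection can be sketched with a toy container. This is a minimal illustration under assumptions, not the framework's container: `Container`, `Tracker`, `ListTracker`, and `Trainer` are hypothetical names, and the resolution strategy (reading constructor annotations with `typing.get_type_hints`) is one plausible approach.

```python
from typing import Any, Protocol, get_type_hints


class Tracker(Protocol):
    """Hypothetical port."""

    def log(self, message: str) -> None: ...


class ListTracker:
    """Hypothetical adapter that stores messages in memory."""

    def __init__(self) -> None:
        self.messages: list[str] = []

    def log(self, message: str) -> None:
        self.messages.append(message)


class Trainer:
    """Depends on the port, not on a concrete adapter."""

    def __init__(self, tracker: Tracker) -> None:
        self.tracker = tracker

    def fit(self) -> None:
        self.tracker.log("fit done")


class Container:
    """Toy container: maps a port type to a registered instance."""

    def __init__(self) -> None:
        self._beans: dict[type, Any] = {}

    def register(self, port: type, instance: Any) -> None:
        self._beans[port] = instance

    def resolve(self, cls: type) -> Any:
        if cls in self._beans:
            return self._beans[cls]
        # Build constructor arguments from the __init__ type hints.
        hints = get_type_hints(cls.__init__)
        hints.pop("return", None)
        kwargs = {name: self.resolve(hint) for name, hint in hints.items()}
        return cls(**kwargs)


container = Container()
tracker = ListTracker()
container.register(Tracker, tracker)  # swap the adapter here, nowhere else
trainer = container.resolve(Trainer)
trainer.fit()
print(tracker.messages)
```

Swapping the adapter means changing the single `register` call; `Trainer` never names a concrete class.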

```
@@ -87,14 +98,18 @@ The GenAI ↔ classical fusion is governed: the LLM proposes code; the classical
cost/benefit gate keeps only what beats the baseline.

<p align="center">
<img src="assets/diagrams/genai-classical-fusion.svg" alt="Governed GenAI and classical fusion" width="70%">
<img src="docs/img/genai-classical-fusion.svg" alt="Governed GenAI and classical fusion" width="70%">
</p>

## Documentation

📖 **Full docs site:** <https://fireflyframework.github.io/fireflyframework-datascience/>

| Guide | |
|---|---|
| [Tutorial](docs/tutorial.md) | the guided end-to-end walkthrough (runs offline; tested) |
| [Quick Start](docs/quickstart.md) | install, boot, first AutoML run, the `firefly-ds` CLI |
| [Configuring the LLM](docs/llm-configuration.md) | providers, API keys, model selection, cost gating |
| [Architecture](docs/architecture.md) | layers, hexagonal ports, auto-configuration, the DI container |
| [Configuration](docs/configuration.md) | env / `.env` / YAML / profiles precedence |
| [Datasets](docs/datasets.md) | the `Dataset` container and loaders |
101 changes: 47 additions & 54 deletions assets/banner.svg