AutoML that fuses GenAI with classical ML & Deep Learning: hexagonal, secure-by-default, native to the Firefly Framework.
The LLM proposes; a deterministic classical engine decides. GenAI is a governed, measurably-gated accelerator over a battle-tested classical core, never a black box.
Copyright 2026 Firefly Software Foundation · Licensed under the Apache License 2.0
Status: all sub-projects delivered and green (ruff · pyright · 90+ tests). Classical tabular AutoML · GenAI feature engineering · the agentic ML-engineering loop · deep learning (PyTorch Lightning) + NLP (HuggingFace) + vision · TabFM · serving · the OpenML-AMLB benchmark harness. New here? Start with the Tutorial or browse the documentation site.
fireflyframework-datascience is a state-of-the-art Python metaframework for AutoML. It combines
GenAI (built on fireflyframework-agentic,
which wraps Pydantic AI) with traditional ML and Deep Learning, so any
team can apply data science to any project quickly, with production governance, hexagonal
swappability, and security by default.
- One reproducible pattern. The LLM proposes code/features/pipelines/seeds; a deterministic classical engine trains, scores, and selects; every GenAI step is gated behind a measured improvement over a seeded classical baseline.
- Hexagonal & swappable. Each ML/MLOps library sits behind a Protocol port, so the core stays library-agnostic. Adapters that ship today: scikit-learn, XGBoost, LightGBM, CatBoost, TabPFN, PyTorch Lightning, HuggingFace, and MLflow. Ports with reference or planned adapters (AutoGluon, Feast, BentoML packaging, a model registry) are marked as such in the docs: the seams exist; the adapters are landing.
- Firefly-native. Auto-configuration, dependency injection, a startup banner + wiring summary, CalVer, and the same CI gates as the rest of the Firefly Framework.
Beyond the engineering, Firefly DataScience is designed to change five things that decide whether data science actually delivers business value:
- Faster time-to-value: AutoML chooses and tunes the model and an agentic loop iterates, so a benchmarked, production-grade model is days of work, not quarters.
- Governed GenAI: the LLM proposes, a deterministic engine measures, and a cost/benefit gate keeps only what beats the baseline. Every decision is logged and auditable; no unproven AI output ships.
- No vendor lock-in: open (Apache-2.0) and hexagonal. Every ML library and every LLM provider is a swappable adapter, and the whole framework is self-hostable.
- Lower cost & risk: classical-first (cheap, reproducible) with secure-by-default execution of any generated code; generative AI is used only where it measurably pays.
- Production-ready: serving, data validation, lineage and real benchmarks are built in, not bolted on.
Proven, not promised: unbiased and significance-tested. Under nested cross-validation (no
selection bias), Firefly's AutoML significantly beats a single LogisticRegression (Δ +0.029, p = 0.046)
and a single XGBoost (Δ +0.030, p = 7.5e-6), and is statistically on par with RandomForest,
adapting per dataset, up to +0.15 on non-linear phoneme. With a real LLM (claude-haiku-4-5),
governed GenAI feature engineering adds a significant +0.021 lift on a linear model (p = 0.0039) by
rediscovering a withheld driver (revenue = price × units) from the schema, and the cost/benefit gate
guarantees it never regresses, at < $0.01. Every number is reproducible; see the
benchmark results.
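
The text above does not say which significance test produced its p-values. As one plausible illustration only, the sketch below runs an exact paired sign-flip permutation test on per-fold score differences between two models. The function name `paired_sign_flip_p` and the input numbers are hypothetical and made up for the example; they are not benchmark results.

```python
from itertools import product


def paired_sign_flip_p(deltas: list[float]) -> float:
    """Exact two-sided p-value for H0: paired differences are symmetric around 0.

    Under H0 each difference is equally likely to be positive or negative,
    so we enumerate every sign assignment and count how often the absolute
    sum is at least as extreme as the observed one.
    """
    if not deltas:
        raise ValueError("need at least one paired difference")
    observed = abs(sum(deltas))
    tol = 1e-12  # guards against floating-point ties
    hits = 0
    total = 0
    for signs in product((1.0, -1.0), repeat=len(deltas)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, deltas)))
        if stat >= observed - tol:
            hits += 1
    return hits / total


# Made-up per-fold score differences (candidate minus baseline), illustration only.
toy_deltas = [0.01, 0.02, 0.03, 0.04, 0.05]
print(f"p = {paired_sign_flip_p(toy_deltas):.4f}")
```

With only five paired folds the smallest achievable two-sided p-value is 2/32, which is why significance claims of this kind usually pool many folds or datasets. The enumeration is exponential in the number of pairs, so a sampled variant would be needed beyond roughly twenty.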
The whole story in one document: The Complete Guide (PDF) combines the executive summary and strategic case with the architecture, a full hands-on tutorial, and the benchmark evidence, for both leaders and engineers.

    uv add 'fireflyframework-datascience[tabular]'             # classical AutoML
    # or: uv add 'fireflyframework-datascience[automl-stack]'  # + TabPFN, MLflow, OpenML

Train, rank, and evaluate models in five lines:

    from fireflyframework_datascience.automl import AutoML
    from fireflyframework_datascience.datasets.adapters import SklearnDatasetLoader

    train, test = SklearnDatasetLoader().load("breast_cancer").train_test_split()
    result = AutoML().fit(train)        # cross-validates candidates, picks the winner
    print(result.leaderboard_table())   # random_forest / linear / hist_gradient_boosting …
    print(result.evaluate(test))        # holdout roc_auc ≈ 0.98

Boot it as a Firefly application (auto-configuration + dependency injection), or use the CLI:

    firefly-ds doctor       # check your environment & installed adapters
    firefly-ds introspect   # boot the app and show discovered auto-configurations

Add a real LLM for GenAI feature engineering and the agentic loop: see Configuring the LLM. The full guided walkthrough is the Tutorial.

Five acyclic layers, mirroring fireflyframework-agentic with a DataScience layer inserted:
Core → Agent (reused) → DataScience → Intelligence → Orchestration.
Each ML/MLOps library sits behind a Protocol port, so the core stays library-agnostic. Shipping
adapters today: scikit-learn, XGBoost, LightGBM, CatBoost, TabPFN, PyTorch Lightning, HuggingFace,
MLflow. AutoGluon, Feast, BentoML packaging and a model registry are ports with reference/planned
adapters.
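
To make the port/adapter idea concrete, here is a minimal sketch using `typing.Protocol`. The names `TrainerPort`, `MajorityClassTrainer`, and `holdout_accuracy` are hypothetical and are not taken from the framework's API; a real adapter would wrap scikit-learn, XGBoost, or another library behind the same method signatures.

```python
from collections import Counter
from typing import Optional, Protocol, Sequence, runtime_checkable


@runtime_checkable
class TrainerPort(Protocol):
    """Hypothetical port: the only surface the core is allowed to depend on."""

    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[int]) -> None: ...

    def predict(self, rows: Sequence[Sequence[float]]) -> list[int]: ...


class MajorityClassTrainer:
    """Toy adapter. It satisfies the port structurally, without inheriting from it."""

    def __init__(self) -> None:
        self._label: Optional[int] = None

    def fit(self, rows: Sequence[Sequence[float]], labels: Sequence[int]) -> None:
        # Remember the most frequent training label.
        self._label = Counter(labels).most_common(1)[0][0]

    def predict(self, rows: Sequence[Sequence[float]]) -> list[int]:
        if self._label is None:
            raise RuntimeError("fit() must be called before predict()")
        return [self._label for _ in rows]


def holdout_accuracy(
    trainer: TrainerPort,
    train: tuple[Sequence[Sequence[float]], Sequence[int]],
    test: tuple[Sequence[Sequence[float]], Sequence[int]],
) -> float:
    """Core-side code: it sees only the port, never a concrete library."""
    trainer.fit(*train)
    predictions = trainer.predict(test[0])
    correct = sum(int(p == y) for p, y in zip(predictions, test[1]))
    return correct / len(test[1])


train_set = ([[0.0], [1.0], [2.0]], [1, 1, 0])
test_set = ([[3.0], [4.0]], [1, 0])
print(holdout_accuracy(MajorityClassTrainer(), train_set, test_set))
```

Because the port is structural, swapping in another library means writing one more class with the same two methods; `holdout_accuracy` does not change.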
Adapters self-register via entry points and are wired by a type-hint dependency-injection container,
gated by @conditional_on_*, exactly like Spring Boot / pyfly.
The LLM proposes code/features; a deterministic engine measures; a cost/benefit gate keeps only what beats the seeded baseline. The LLM never decides: the measured score does.
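
A gate of this kind reduces to a small pure function. The sketch below is an assumption about its shape, not the framework's implementation: `GateDecision`, `cost_benefit_gate`, and the `min_lift` / `max_cost_usd` thresholds are hypothetical, and the scores passed in are illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    keep: bool
    lift: float
    reason: str


def cost_benefit_gate(
    baseline_score: float,
    candidate_score: float,
    cost_usd: float,
    *,
    min_lift: float = 0.0,       # hypothetical threshold
    max_cost_usd: float = 0.05,  # hypothetical budget
) -> GateDecision:
    """Keep a GenAI proposal only if it is affordable and measurably better."""
    lift = candidate_score - baseline_score
    if cost_usd > max_cost_usd:
        return GateDecision(False, lift, "over budget")
    if lift <= min_lift:
        return GateDecision(False, lift, "no measured improvement")
    return GateDecision(True, lift, "beats baseline")


# Illustrative inputs only: both scores come from the deterministic engine,
# evaluated with the same seed and the same folds.
print(cost_benefit_gate(0.900, 0.921, 0.008))
print(cost_benefit_gate(0.900, 0.900, 0.001))
```

Since a rejected proposal leaves the baseline pipeline in place, the kept result can never score below the baseline on the measured metric, which is the sense in which such a gate prevents regressions.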
Propose → execute (sandboxed) → observe → verify (correctness ≠ ran) → reflect → select.
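
The loop can be sketched with a stub in place of the LLM. Everything here is hypothetical (`stub_proposer`, `Trial`, `run_loop`, the toy data), and execution is a plain in-process call rather than a sandbox. The point it illustrates is the verify step: a candidate that runs without error is not thereby correct.

```python
from dataclasses import dataclass
from typing import Callable, Optional

# Toy (price, units) rows; the target is the withheld driver price * units.
ROWS = [(2.0, 3.0), (5.0, 0.0), (1.5, 4.0)]
TARGETS = [p * u for p, u in ROWS]


@dataclass
class Trial:
    name: str
    ran: bool
    verified: bool
    error: float
    note: str


def stub_proposer(feedback: list[Trial]) -> tuple[str, Callable[[float, float], float]]:
    """Stand-in for the LLM: walks a fixed menu, one candidate per round."""
    menu = [
        ("ratio", lambda p, u: p / u),    # crashes when units == 0
        ("sum", lambda p, u: p + u),      # runs, but is wrong
        ("product", lambda p, u: p * u),  # runs and is correct
    ]
    return menu[len(feedback) % len(menu)]


def run_loop(max_rounds: int = 3, tol: float = 1e-9) -> tuple[Optional[Trial], list[Trial]]:
    feedback: list[Trial] = []
    for _ in range(max_rounds):
        name, fn = stub_proposer(feedback)                          # propose
        try:
            outputs = [fn(p, u) for p, u in ROWS]                   # execute (NOT sandboxed here)
        except Exception as exc:                                    # observe a failure
            note = f"crashed: {type(exc).__name__}"
            feedback.append(Trial(name, False, False, float("inf"), note))  # reflect
            continue
        error = max(abs(o - t) for o, t in zip(outputs, TARGETS))   # observe
        verified = error <= tol                                     # verify: ran is not enough
        note = "ok" if verified else "ran but wrong"
        feedback.append(Trial(name, True, verified, error, note))   # reflect
    verified_trials = [t for t in feedback if t.verified]
    best = min(verified_trials, key=lambda t: t.error) if verified_trials else None  # select
    return best, feedback


best, trials = run_loop()
for t in trials:
    print(t.name, t.note)
print("selected:", best.name if best else None)
```

In a real system the feedback list would be fed back into the next prompt, and the execute step would run generated code under the sandbox tiers described in the Security Model guide.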
Full docs site: https://fireflyframework.github.io/fireflyframework-datascience/
| Guide | What it covers |
|---|---|
| Tutorial | the guided end-to-end walkthrough (runs offline; tested) |
| Samples | runnable demos: tutorial, real-LLM showcase, finance/retail |
| Quick Start | install, boot, first AutoML run, the firefly-ds CLI |
| Configuring the LLM | providers, API keys, model selection, cost gating |
| Architecture | layers, hexagonal ports, auto-configuration, the DI container |
| Configuration | env / .env / YAML / profiles precedence |
| Datasets | the Dataset container and loaders |
| Classical AutoML | the AutoML facade, trainers, search, metrics, calibration, ensembling, PR-AUC selection & CV strategies |
| Explainability | deterministic global + local feature importances (permutation, SHAP) |
| GenAI Feature Engineering | propose → execute → measure → gate; the persisted audit trail |
| Agentic ML-Engineering Loop | propose → verify → reflect → select |
| Deep Learning & TabFM | MLP, TabPFN, the PyTorch integration point |
| Serving & Lineage | in-process and gated servers, lineage |
| Security Model | secure code execution, sandbox tiers, prompt-injection defense |
| Benchmarks | the three-tier AMLB-anchored evaluation strategy |
| Use Case: Lumen Lending | the end-to-end credit-risk walkthrough |
Apache-2.0. Copyright 2026 Firefly Software Foundation.