Tutorial

A guided, end-to-end tour of Firefly DataScience — from booting the app to serving a model.

This tutorial mirrors the runnable script samples/tutorial.py, which is covered by a test (tests/samples/test_tutorial.py), so everything here is guaranteed to work. It runs offline with no LLM key — the GenAI steps use deterministic stand-ins, and we show how to switch on a real LLM at the end.

uv add 'fireflyframework-datascience[tabular]'
uv run python samples/tutorial.py

We use a synthetic credit-risk dataset whose default risk is driven by debt-to-income — a ratio deliberately withheld from the model, so feature engineering has something real to discover.

!!! firefly "The pattern every step rests on — the LLM proposes; the classical engine decides"

Generative AI only ever *proposes* candidates here: feature code (step 4) and model choices
(step 5). A deterministic classical engine cross-validates each one and a cost-benefit gate keeps
it only if it measurably beats the current baseline. The LLM never touches the score — the data
does. That is why the tour runs identically with or without an API key.

1. Boot the application

from fireflyframework_datascience import FireflyDataScienceApplication

app = FireflyDataScienceApplication.run()         # (1)!

Pass print_output=False to suppress the banner (the script does this so its test output stays clean).

This prints the banner and a wiring summary, loads configuration, builds the dependency-injection container, and discovers every adapter via entry-point auto-configuration. app.bean_count and app.applied_auto_configurations tell you what got wired.

!!! success "Expected"

On a fresh `[tabular]` install the container wires a couple dozen beans from roughly a dozen
auto-configurations (exact counts depend on which extras are installed):

```text
[1] App booted: 21 beans, 12 auto-configurations
```

See Architecture.

2. Build, load, and validate the data

The script generates the credit dataset with make_credit_dataset(). Default risk is a logistic function of debt_to_income = loan_amount / income, plus prior defaults and employment, but only the four raw columns are handed to the model — debt_to_income is the hidden driver feature engineering must rediscover.

import numpy as np
import pandas as pd

from fireflyframework_datascience.core.types import TaskType
from fireflyframework_datascience.datasets import Dataset
from fireflyframework_datascience.validation.adapters import BasicValidator


def make_credit_dataset(n: int = 800, seed: int = 11) -> Dataset:
    rng = np.random.RandomState(seed)
    income = rng.normal(60_000, 18_000, n).clip(15_000, None)
    loan_amount = rng.normal(18_000, 10_000, n).clip(1_000, None)
    employment_years = rng.uniform(0, 30, n).round(1)
    num_prior_defaults = rng.poisson(0.6, n)
    dti = loan_amount / income                                            # (1)!
    logit = -2.6 + 5.0 * dti + 1.3 * num_prior_defaults - 0.05 * employment_years + rng.normal(0, 0.25, n)
    default = (rng.uniform(0, 1, n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)
    X = pd.DataFrame(
        {
            "income": income.round(2),
            "loan_amount": loan_amount.round(2),
            "employment_years": employment_years,
            "num_prior_defaults": num_prior_defaults,                      # (2)!
        }
    )
    return Dataset(
        "credit_applicants",
        X,
        pd.Series(default, name="default"),
        task=TaskType.BINARY,
        target_name="default",
        feature_names=list(X.columns),
    )


dataset = make_credit_dataset()
report = BasicValidator().validate(dataset.X, dataset.y)
assert report.ok                                  # no all-null columns, no null target, etc.
train, test = dataset.train_test_split(test_size=0.25, random_state=0)     # (3)!

dti drives the label but is never put into X — that is the signal step 4 has to recover.
Only these four raw columns reach the model; debt_to_income is deliberately absent.
train_test_split stratifies on the target for classification; here it yields 600 train / 200 test rows.

The BasicValidator catches empty data, all-null/constant columns, duplicate rows, and null targets before you waste time training.

!!! success "Expected"

```text
[2] Data validated: ok=True
```

The dataset is 800 rows × 4 features, a `TaskType.BINARY` task; the split gives 600 train / 200 test.

See Datasets.

3. Classical AutoML

from fireflyframework_datascience.automl import AutoML

result = AutoML(cv=4).fit(train)
print(result.leaderboard_table())
print(result.evaluate(test))                      # holdout metrics

AutoML cross-validates each candidate trainer (RandomForestTrainer, LinearTrainer, HistGradientBoostingTrainer, plus the boosting libraries if installed), ranks them on a task-appropriate metric (roc_auc for binary), and refits the winner on the full training set. Each candidate is wrapped in an impute-and-scale preprocessing pipeline before scoring.

!!! success "Expected"

A leaderboard topped by `linear`, and a holdout `roc_auc ≈ 0.85`:

```text
linear                   roc_auc=0.7867
random_forest            roc_auc=0.7493
hist_gradient_boosting   roc_auc=0.7335
EvaluationResult(primary=roc_auc=0.8498; accuracy=0.7900, f1=0.7853, precision=0.7851, recall=0.7900, roc_auc=0.8498, log_loss=0.4500)
```

!!! note

The leaderboard prints **cross-validation** scores on the training data, while `evaluate(test)`
reports metrics on the untouched holdout — so the headline `roc_auc` (≈0.85) is higher than the
CV figure (≈0.79). Both are real; they measure different things.

See Classical AutoML.

4. GenAI feature engineering

from fireflyframework_datascience.features import StaticFeatureProposer, FeatureProposal
from fireflyframework_datascience.features.genai import GenAIFeatureEngineer

proposer = StaticFeatureProposer([
    FeatureProposal("debt_to_income", "df['debt_to_income'] = df['loan_amount'] / (df['income'] + 1)", "DTI"),  # (1)!
    FeatureProposal("noise", "df['noise'] = 0.0", "should be rejected"),                                       # (2)!
])
engineered = GenAIFeatureEngineer(proposer, cv=4).engineer(train)
print(engineered.summary())

The hidden driver: its CV lift clears CostBenefitGate(min_gain=0.0), so it is accepted.
A constant column adds nothing, so the gate rejects it — the LLM never overrides that decision.

The loop is propose → execute (safely) → measure CV lift → gate. debt_to_income is accepted because it lifts the score; the constant noise feature is rejected. The LLM never decides — the measured score does.

!!! success "Expected"

```text
GenAI feature engineering: 1 accepted, 1 rejected; roc_auc 0.7875 -> 0.7889 (lift +0.0013)
```

`engineered.accepted` lists `debt_to_income`; `engineered.rejected` lists `noise` with the reason
`no lift (0.7889 <= 0.7889)`. The lift is small but **positive and real** — the gate rejects
anything that does not strictly beat the running baseline.

=== "Static (no LLM)"

`StaticFeatureProposer` stands in for the LLM so the tutorial runs offline with a fixed,
reproducible set of proposals — exactly what the snippet above uses.

=== "Agent (LLM)"

With a real model you swap in `AgentFeatureProposer`, which wraps a `FireflyAgent` and is built
lazily (no LLM client is created at startup):

```python
from fireflyframework_datascience.features.genai import AgentFeatureProposer, GenAIFeatureEngineer

proposer = AgentFeatureProposer(model="openai:gpt-4o")
engineered = GenAIFeatureEngineer(proposer, cv=4).engineer(train)
```

See [Configuring the LLM](llm-configuration.md).

See GenAI Feature Engineering.

5. The agentic ML-engineering loop

from fireflyframework_datascience.engineering import SequenceProposer, SolutionCandidate
from fireflyframework_datascience.engineering.loop import AgenticAutoML

proposer = SequenceProposer([SolutionCandidate("linear"), SolutionCandidate("random_forest"),
                             SolutionCandidate("hist_gradient_boosting")])
run = AgenticAutoML(proposer, cv=3, max_iterations=4).solve(train)        # (1)!
print(run.summary())

AgenticAutoML seeds the population, then reflects on the attempt history up to max_iterations times; a patience budget (default 3) stops the search once it stalls.

Each candidate is trained, cross-validated, and verified by a DeterministicVerifier — it must beat a trivial DummyClassifier(strategy="prior") baseline, not merely run (the "correctness ≠ ran" principle) — before the best one is selected. run.attempts is the full audited trail and run.valid_attempts are the ones that passed verification.

!!! success "Expected"

```text
Agentic AutoML: 3 attempts (3 verified); best=linear roc_auc=0.7897 (baseline 0.5000)
```

All three seeded candidates clear the `roc_auc=0.5000` trivial baseline, so all three are verified;
`linear` wins.

See Agentic Loop.

6. Serve the model

from fireflyframework_datascience.serving import LocalModelServer

server = LocalModelServer()
server.load(result.best_model)
prediction = server.predict(test.X.iloc[[0]])     # score one applicant
print(int(prediction[0]))

LocalModelServer is the default, dependency-free server: it loads a fitted Model in the host process and answers predict / predict_proba. Heavier servers (e.g. BentoMLModelServer) live behind the serving extra.

!!! success "Expected"

The first holdout applicant is scored as a non-default:

```text
[6] Served prediction for one applicant: default=0
```

See Serving & Lineage.

Turn on a real LLM

Steps 4 and 5 ran offline with deterministic stand-ins. To let a real model do the proposing, set your key and enable GenAI, then swap in the agent-backed proposers:

export OPENAI_API_KEY=sk-...                      # or ANTHROPIC_API_KEY=...
export FIREFLY_DATASCIENCE_GENAI__ENABLED=true
export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=openai:gpt-4o   # or anthropic:claude-sonnet-4-5

from fireflyframework_datascience.features.genai import AgentFeatureProposer
from fireflyframework_datascience.engineering.loop import AgentSolutionProposer

Use AgentFeatureProposer in place of StaticFeatureProposer (step 4) and AgentSolutionProposer in place of SequenceProposer (step 5). Nothing else changes — the cost-benefit gate and the verifier still decide. The full guide, including providers, keys, cost gating, and secure execution, is in Configuring the LLM.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Tutorial

1. Boot the application

2. Build, load, and validate the data

3. Classical AutoML

4. GenAI feature engineering

5. The agentic ML-engineering loop

6. Serve the model

Turn on a real LLM

See also

Uh oh!

FilesExpand file tree

tutorial.md

Latest commit

History

tutorial.md

File metadata and controls

Tutorial

1. Boot the application

2. Build, load, and validate the data

3. Classical AutoML

4. GenAI feature engineering

5. The agentic ML-engineering loop

6. Serve the model

Turn on a real LLM

See also