Samples

Every sample is a single runnable script covered by a test, so each one is verified to work.

The five scripts live in samples/. Most run offline with no LLM key (the GenAI steps use deterministic stand-in proposers); two use a real LLM when a key is present. Pick the card that matches what you want to see, copy its run command, and go.

!!! firefly "The pattern every sample demonstrates: the LLM proposes; the classical engine decides"

GenAI proposes feature code and model candidates; a deterministic engine cross-validates and
measures; and the cost/benefit gate keeps only what beats a seeded baseline. The offline samples
swap the LLM for a fixed proposer so the exact same gate runs without a key; see
[GenAI Feature Engineering](genai-features.md) and the [Agentic Loop](agentic-loop.md).
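
The propose/measure/decide pattern can be sketched with plain scikit-learn. This is an illustrative sketch under stated assumptions, not the framework's implementation: the `gate` function, the `min_gain` threshold and the two candidate features are hypothetical names invented for this example. It scores a seeded, cross-validated baseline and keeps a candidate feature only if adding it raises the mean ROC AUC; each candidate is judged on its own against the baseline.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

SEED = 0  # fixed seed: every evaluation sees the same folds


def cv_score(X, y):
    """Mean ROC AUC over seeded, stratified 5-fold cross-validation."""
    model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
    return float(cross_val_score(model, X, y, cv=folds, scoring="roc_auc").mean())


def gate(X, y, candidates, min_gain=0.0):
    """Hypothetical gate: accept a candidate only if it beats the baseline."""
    baseline = cv_score(X, y)
    decisions = []
    for name, build in candidates.items():
        X_new = np.column_stack([X, build(X)])  # baseline columns + one candidate
        gain = cv_score(X_new, y) - baseline
        decisions.append((name, gain, gain > min_gain))
    return baseline, decisions


data = load_breast_cancer()
X, y = data.data, data.target
col = {name: i for i, name in enumerate(data.feature_names)}

# Stand-in "proposals"; in the samples these would come from a proposer.
candidates = {
    "area_to_perimeter": lambda X: X[:, col["mean area"]] / (X[:, col["mean perimeter"]] + 1),
    "pure_noise": lambda X: np.random.default_rng(SEED).normal(size=len(X)),
}

baseline, decisions = gate(X, y, candidates)
print(f"baseline roc_auc={baseline:.4f}")
for name, gain, accepted in decisions:
    print(f"{'accepted' if accepted else 'rejected'} {name} gain={gain:+.4f}")
```

Because the folds and the model are seeded, running `gate` twice on the same candidates yields the same gains and the same decisions.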

The five samples

  • :material-map-outline:{ .lg .middle } tutorial.py: the full guided tour


    The whole framework end-to-end on a synthetic credit-risk dataset: boot → validate → classical AutoML → GenAI feature engineering → agentic loop → serve. Runs offline; it uses deterministic stand-in proposers and prints exactly how to switch on a real LLM. Needs the tabular extra.

    uv run python samples/tutorial.py

    :octicons-arrow-right-24: Walkthrough

  • :material-bank-outline:{ .lg .middle } lumen_credit_risk.py: a focused use case


    One realistic (synthetic) lending dataset where default risk is driven by debt-to-income. GenAI feature engineering discovers debt_to_income, the gate keeps it because it measurably lifts the score, AutoML selects the winner, and it is served to score a new applicant. Runs offline (StaticFeatureProposer). Needs the tabular extra.

    uv run python samples/lumen_credit_risk.py

    :octicons-arrow-right-24: Use case: Lumen

  • :material-tune-vertical:{ .lg .middle } advanced_automl.py: production-grade selection & trust


    Every modeling/trust feature on the real breast_cancer data: a StratifiedKFold splitter, PR-AUC selection, a calibrated stacking ensemble, deterministic explainability (global importances), and a persisted audit trail of every GenAI gate decision. Runs offline, and automatically uses a real LLM to propose features when a key is present. Needs the tabular extra (the explain extra adds SHAP).

    uv run python samples/advanced_automl.py

    :octicons-arrow-right-24: Classical AutoML

  • :material-database-outline:{ .lg .middle } industry_showcase.py: real public data


    The full pipeline (load → validate → AutoML → holdout evaluation) on genuine, mixed-type data with categorical features, loaded straight from OpenML, so no Kaggle account is needed. Two cases: credit-g (German credit risk) and bank-marketing (campaign conversion). Needs the tabular and data extras and network access.

    uv run python samples/industry_showcase.py

    :octicons-arrow-right-24: Datasets

  • :material-creation-outline:{ .lg .middle } genai_llm_showcase.py: a real LLM


    The only sample that requires a real model (Claude / GPT / …), used for both GenAI feature engineering and the agentic ML-engineering loop. Model and credentials come from the environment; nothing is hard-coded. Needs the tabular and genai extras and an LLM key.

    export ANTHROPIC_API_KEY=sk-ant-...  # (1)!
    uv run python samples/genai_llm_showcase.py

    :octicons-arrow-right-24: Configuring the LLM

  1. Or OPENAI_API_KEY / GEMINI_API_KEY. The model string defaults to anthropic:claude-haiku-4-5; override it with export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=.... The script exits cleanly with a message if no key is set.
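
The environment-driven configuration described in this note could look roughly like the sketch below. It is a minimal illustration, not the script's actual code: the variable names and the default model string come from this page, while `resolve_llm_config`, `main`, the lookup order of the key variables and the wording of the message are assumptions made for the example.

```python
import os

# Variable names taken from this page; the lookup order is an assumption.
KEY_VARS = ("ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GEMINI_API_KEY")
MODEL_VAR = "FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL"
DEFAULT_MODEL = "anthropic:claude-haiku-4-5"


def resolve_llm_config(env):
    """Return (model, key_var), or None when no key variable is set."""
    key_var = next((name for name in KEY_VARS if env.get(name)), None)
    if key_var is None:
        return None
    # An empty or missing override falls back to the default model string.
    return env.get(MODEL_VAR) or DEFAULT_MODEL, key_var


def main(env=os.environ):
    """Hypothetical entry point: report the config or stop without an error."""
    config = resolve_llm_config(env)
    if config is None:
        print("No LLM key found; set one of: " + ", ".join(KEY_VARS))
        return 0  # clean exit code instead of a traceback
    model, key_var = config
    print(f"Using model {model} (credentials from {key_var})")
    return 0


main()
```

The sketch does not check that the key belongs to the same provider as the model string; a real implementation would presumably validate that pairing before making a call.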

Run them

=== "Offline (no key)"

```bash
uv run python samples/tutorial.py            # the full guided tour
uv run python samples/lumen_credit_risk.py   # focused credit-risk use case
uv run python samples/advanced_automl.py     # calibration, ensembling, PR-AUC, explainability, audit
uv run python samples/industry_showcase.py   # real OpenML data (needs network)
```

=== "Real LLM"

```bash
export ANTHROPIC_API_KEY=sk-ant-...                                       # or OPENAI_API_KEY=...
export FIREFLY_DATASCIENCE_GENAI__DEFAULT_MODEL=anthropic:claude-haiku-4-5  # optional; this is the default
uv run python samples/genai_llm_showcase.py   # real LLM: feature engineering + agentic loop
uv run python samples/advanced_automl.py      # real LLM proposes features; calibrated ensemble + explainability
```

!!! tip "The offline samples are the place to start"

`tutorial.py` and `lumen_credit_risk.py` run the *same* cost/benefit gate as the real-LLM
showcase; the only difference is who proposes the features and models. Both finish in a few
seconds offline (≈5–10 s on a laptop), so you can see the full governance loop without spending a token.
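
One way to picture "the only difference is who proposes" is a small interface with interchangeable proposers feeding one decision rule. The sketch below is hypothetical throughout: `Proposal`, `Proposer`, `FixedProposer`, `CallableProposer`, `run_gate`, the column names and the gains in `TOY_GAINS` are all invented for illustration and are not the framework's classes or measured results.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


@dataclass(frozen=True)
class Proposal:
    name: str
    code: str  # feature code as text; never executed in this sketch


class Proposer(Protocol):
    def propose(self, columns: list[str]) -> list[Proposal]: ...


class FixedProposer:
    """Offline stand-in: returns the same proposals on every call."""

    def __init__(self, proposals: list[Proposal]) -> None:
        self._proposals = list(proposals)

    def propose(self, columns: list[str]) -> list[Proposal]:
        return list(self._proposals)


class CallableProposer:
    """Adapter for any function, for example one that queries an LLM."""

    def __init__(self, ask: Callable[[list[str]], list[Proposal]]) -> None:
        self._ask = ask

    def propose(self, columns: list[str]) -> list[Proposal]:
        return self._ask(columns)


def run_gate(proposer: Proposer, columns: list[str],
             measure_gain: Callable[[Proposal], float],
             min_gain: float = 0.0) -> dict[str, bool]:
    """Apply one decision rule to whatever the proposer returns."""
    return {p.name: measure_gain(p) > min_gain for p in proposer.propose(columns)}


# Made-up gains for illustration; a real gate would measure them with CV.
TOY_GAINS = {"debt_to_income": 0.01, "row_id_squared": -0.002}


def toy_gain(proposal: Proposal) -> float:
    return TOY_GAINS.get(proposal.name, 0.0)


dti = Proposal("debt_to_income", "df['debt'] / (df['income'] + 1)")
junk = Proposal("row_id_squared", "df['row_id'] ** 2")
columns = ["debt", "income", "row_id"]

offline = run_gate(FixedProposer([dti, junk]), columns, toy_gain)
online = run_gate(CallableProposer(lambda cols: [junk, dti]), columns, toy_gain)
print(offline)
print(online)
```

Both proposers pass through the same `run_gate`, so the decision for a given proposal does not depend on where it came from.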

What the real-LLM showcase produces

A representative run with anthropic:claude-haiku-4-5:

[1] GenAI feature engineering - the LLM proposes, the gate decides:
    ✓ accepted loan_to_income_ratio     gain=+0.0013   df['loan_to_income_ratio'] = df['loan_amount'] / (df['income'] + 1)
    ✓ accepted default_risk_score       gain=+0.0006   df['default_risk_score'] = df['num_prior_defaults'] * 100 + ...
    ✗ rejected employment_to_loan_ratio (no measured lift)
    → 2 accepted, 4 rejected; roc_auc 0.7875 -> 0.7895

[2] Agentic ML-engineering loop - the LLM reflects, the engine verifies:
    · linear  {}  score=0.9939   ...   · linear {'C': 0.15, 'penalty': 'l1', ...} score=0.9955
    → 9 attempts (9 verified); best=linear roc_auc=0.9955

The model proposes; the deterministic engine measures; the gate keeps only what is proven. Nothing unverified is adopted, which is exactly the governance described in GenAI Feature Engineering and the Agentic Loop.

!!! note "LLM output is non-deterministic"

The exact feature names, gains and attempt counts vary from run to run because the LLM is free to propose
anything. What is invariant is the gate: a proposal is accepted only if it lifts a seeded,
cross-validated baseline, so the *decision* is always reproducible even when the *proposal* is not.

Benchmark scenarios

For evaluation rather than demonstration, the benchmarks/ directory holds the harnesses; all results live in benchmarks/RESULTS.md.

| Script | What it measures |
| --- | --- |
| automl_benchmark.py | Tier-2 offline smoke suite (scikit-learn datasets) |
| amlb_benchmark.py | Tier-1 OpenML-CC18 (AMLB-style), real categorical data |
| scientific_eval.py | Nested 5-fold CV vs fixed single models + Wilcoxon significance (unbiased) |
| genai_value.py | Controlled ablation of GenAI feature engineering with a real LLM (+ cost) |
| beat_baseline.py | A quick cross-validated head-to-head vs a default baseline |
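
The idea of nested cross-validation with a Wilcoxon significance test can be sketched with scikit-learn and SciPy. This is a generic illustration of the technique, not the contents of scientific_eval.py: the dataset, the tuned logistic-regression pipeline, the fixed decision tree and the parameter grid are all assumptions chosen to keep the example small.

```python
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Outer folds estimate generalisation; inner folds are used only for tuning,
# so the outer score never sees data that influenced hyperparameter choice.
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)

tuned = GridSearchCV(
    make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    param_grid={"logisticregression__C": [0.1, 1.0, 10.0]},
    cv=inner,
    scoring="roc_auc",
)
fixed = DecisionTreeClassifier(max_depth=3, random_state=0)  # fixed single model

# Both models are scored on the same seeded outer folds, so scores are paired.
tuned_scores = cross_val_score(tuned, X, y, cv=outer, scoring="roc_auc")
fixed_scores = cross_val_score(fixed, X, y, cv=outer, scoring="roc_auc")

result = wilcoxon(tuned_scores, fixed_scores)  # paired signed-rank test
print("tuned per-fold roc_auc:", tuned_scores.round(4))
print("fixed per-fold roc_auc:", fixed_scores.round(4))
print(f"wilcoxon statistic={result.statistic:.3f} p={result.pvalue:.4f}")
```

With only five paired folds the test has little power: the smallest attainable exact two-sided p-value is 0.0625, so a single dataset cannot reach the usual 0.05 threshold. Pooling paired scores across several datasets or repeated splits is a common way to get a meaningful test.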

See also

  • Tutorial: the step-by-step walkthrough behind tutorial.py.
  • Use case: Lumen, the credit-risk story behind lumen_credit_risk.py.
  • Configuring the LLM: providers, model strings and keys for the real-LLM sample.
  • Datasets: the OpenML loader used by industry_showcase.py.
  • Benchmarks: the evaluation harnesses and their results.