Real-LLM testing, benchmark proof, finance/retail samples by ancongui · Pull Request #6 · fireflyframework/fireflyframework-datascience

ancongui · 2026-06-25T12:44:40Z

Tested for real with a live model (`anthropic:claude-haiku-4-5`)

GenAI feature engineering — Claude proposed features; the gate accepted only those that lifted the score and rediscovered debt_to_income from the schema alone.
Agentic loop — Claude reflected on the leaderboard and tuned the best model to 0.9955.
Added samples/genai_llm_showcase.py plus integration tests, which are skipped when ANTHROPIC_API_KEY is not set.
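The accept-only-if-it-lifts-the-score gate described above can be sketched roughly as follows. This is a minimal illustration, not the framework's actual implementation: `gate_features`, `score_fn`, `min_gain`, and the toy scorer with its made-up weights are all hypothetical names and values introduced here. In practice `score_fn` would presumably be a cross-validated metric computed on the real dataset.

```python
from typing import Callable, List, Sequence, Tuple


def gate_features(
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    base_features: Sequence[str] = (),
    min_gain: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedily keep a proposed feature only if it improves the score.

    Hypothetical helper: each candidate is scored on top of the features
    accepted so far and is kept only when the gain exceeds ``min_gain``.
    """
    accepted = list(base_features)
    best = score_fn(accepted)
    for name in candidates:
        trial = accepted + [name]
        score = score_fn(trial)
        if score - best > min_gain:
            accepted, best = trial, score
    return accepted, best


def toy_score(features: List[str]) -> float:
    """Stand-in scorer with invented weights, for illustration only."""
    made_up_gain = {"debt_to_income": 0.05}
    return 0.70 + sum(made_up_gain.get(f, 0.0) for f in features)


# An LLM might propose both of these; only the one that helps survives.
accepted, best = gate_features(["debt_to_income", "row_id_squared"], toy_score)
print(accepted, round(best, 2))
```

A strict `>` comparison means a feature that leaves the score unchanged is rejected, which keeps the feature set from growing without evidence of benefit.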

Beats a baseline, honestly (`benchmarks/beat_baseline.py`, 5-fold CV ROC-AUC)

Firefly AutoML matches-or-beats a default LogisticRegression on 6/6 datasets — +0.149 on phoneme (0.962 vs 0.813), +0.029 mean gain. benchmarks/RESULTS.md exposes every executed number.
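The headline figures above (per-dataset delta, matches-or-beats count, mean gain) are simple arithmetic over paired scores. A small sketch of that bookkeeping, assuming each dataset maps to an `(automl, baseline)` ROC-AUC pair; `summarize` is a hypothetical helper, and only the phoneme pair quoted in this PR is used, so the mean printed here is not the PR's six-dataset mean of +0.029.

```python
from statistics import mean
from typing import Dict, Tuple


def summarize(results: Dict[str, Tuple[float, float]]) -> dict:
    """Summarize (automl, baseline) score pairs per dataset.

    Hypothetical helper: returns the per-dataset delta, how many datasets
    the AutoML score matches or beats the baseline on, and the mean gain.
    """
    deltas = {
        name: round(automl - baseline, 3)
        for name, (automl, baseline) in results.items()
    }
    return {
        "deltas": deltas,
        "matches_or_beats": sum(d >= 0 for d in deltas.values()),
        "n_datasets": len(deltas),
        "mean_gain": round(mean(deltas.values()), 3),
    }


# Only the phoneme numbers are quoted in the PR text; the other five
# datasets live in benchmarks/RESULTS.md and are not reproduced here.
summary = summarize({"phoneme": (0.962, 0.813)})
print(summary)
```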

Real finance & retail data

samples/industry_showcase.py — OpenML credit-g 0.82, bank-marketing 0.92 (no Kaggle needed).

Docs & cleanup

Benchmark results added to the docs, new docs/samples.md (+nav), README 'Proven, not promised'. Reworded the two 'stub'/'placeholder' comments; there are no real stubs or TODOs.

Gate: ruff clean · pyright 0 · mkdocs --strict ok · 93 tests pass (7 integration deselected). No API key in the repo.
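One way to keep live-model tests out of the default run when no key is present is to skip them at runtime. The sketch below uses only the standard library's `unittest`; the repository's actual mechanism is not shown in this PR (the "deselected" wording suggests a test-runner marker instead), so `has_api_key` and `LiveModelSmokeTest` are hypothetical names.

```python
import os
import unittest
from typing import Mapping, Optional

API_KEY_VAR = "ANTHROPIC_API_KEY"


def has_api_key(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when a non-blank API key is present in the environment."""
    env = os.environ if env is None else env
    return bool(env.get(API_KEY_VAR, "").strip())


class LiveModelSmokeTest(unittest.TestCase):
    """Hypothetical integration test that needs a live model."""

    def test_live_model_roundtrip(self) -> None:
        if not has_api_key():
            self.skipTest(f"{API_KEY_VAR} is not set")
        # A real test would call the live model here; this placeholder
        # only confirms that the key check passed.
        self.assertTrue(has_api_key())
```

Reading the key from the environment at test time, rather than from a file in the repository, is consistent with the "No API key in the repo" note above.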

Validated for real with a live model (anthropic:claude-haiku-4-5):
- GenAI feature engineering: Claude proposed features and the gate accepted only the ones that lifted the score; it rediscovered debt_to_income from the schema
- agentic loop: Claude reflected on history and tuned the best model (0.9955)
- samples/genai_llm_showcase.py + integration tests (skip w/o ANTHROPIC_API_KEY)

Benchmark proof (benchmarks/beat_baseline.py, 5-fold CV ROC-AUC):
- Firefly AutoML matches-or-beats a default LogisticRegression on 6/6 datasets; +0.149 on phoneme (0.962 vs 0.813), +0.029 mean gain. Honest CV protocol.
- benchmarks/RESULTS.md exposes all executed numbers (Tier-1/2 + the head-to-head)

Real-data samples: industry_showcase.py (OpenML credit-g 0.82, bank-marketing 0.92).
Docs: benchmarks results, docs/samples.md (+nav), README 'Proven, not promised'.
Cleanup: reworded the two 'stub'/'placeholder' comments (no real stubs exist).
Gate: ruff clean, pyright 0, mkdocs --strict ok, 93 tests pass (7 integration deselected).

ancongui merged commit 817144b into main Jun 25, 2026

ancongui deleted the feat/real-llm-benchmarks-samples branch June 25, 2026 12:44

ancongui merged 1 commit into main from feat/real-llm-benchmarks-samples
