Real-LLM testing, benchmark proof, finance/retail samples #6

Merged: ancongui merged 1 commit into main from feat/real-llm-benchmarks-samples on Jun 25, 2026

@ancongui

Contributor

Tested for real with a live model (anthropic:claude-haiku-4-5)

  • GenAI feature engineering — Claude proposed features; the gate accepted only those that lifted the score and rediscovered debt_to_income from the schema alone.
  • Agentic loop — Claude reflected on the leaderboard and tuned the best model to 0.9955.
  • samples/genai_llm_showcase.py + integration tests (skip without ANTHROPIC_API_KEY).
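The acceptance gate described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `gate_features`, `toy_score`, and the weights are hypothetical, and the only name taken from the PR is `debt_to_income`. In a real run the scorer would presumably be a cross-validated metric rather than a lookup table.

```python
from typing import Callable, Iterable, List, Sequence


def gate_features(
    baseline: Sequence[str],
    candidates: Iterable[str],
    score_fn: Callable[[Sequence[str]], float],
    min_lift: float = 0.0,
) -> List[str]:
    """Greedy gate: keep a proposed feature only if adding it raises the score."""
    accepted = list(baseline)
    best = score_fn(accepted)
    for name in candidates:
        trial = accepted + [name]
        score = score_fn(trial)
        if score - best > min_lift:
            # The candidate lifted the score, so it becomes part of the set.
            accepted, best = trial, score
    return accepted


def toy_score(features: Sequence[str]) -> float:
    """Stand-in scorer with made-up weights (illustration only, not results)."""
    weights = {
        "income": 0.10,
        "debt": 0.10,
        "debt_to_income": 0.05,   # helpful ratio feature
        "row_id_squared": -0.01,  # noise feature that hurts the score
    }
    return 0.5 + sum(weights.get(f, 0.0) for f in features)


if __name__ == "__main__":
    kept = gate_features(
        baseline=["income", "debt"],
        candidates=["debt_to_income", "row_id_squared"],
        score_fn=toy_score,
    )
    print(kept)
```

The gate is order-dependent and greedy: each candidate is judged against the best set so far, which keeps the number of scoring calls linear in the number of proposals.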

Beats a baseline, honestly (benchmarks/beat_baseline.py, 5-fold CV ROC-AUC)

Firefly AutoML matches or beats a default LogisticRegression on 6/6 datasets, including +0.149 on phoneme (0.962 vs 0.813), with a +0.029 mean gain. benchmarks/RESULTS.md lists every executed number.
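For readers who want to check the arithmetic, the head-to-head summary can be sketched as below. `summarize` is a hypothetical helper, not something in benchmarks/beat_baseline.py. Only the phoneme pair is quoted in this PR, so it is the only row used here; the +0.029 mean covers all six datasets, whose numbers are in benchmarks/RESULTS.md and are not reproduced.

```python
from statistics import mean
from typing import Dict, Tuple


def summarize(results: Dict[str, Tuple[float, float]]) -> dict:
    """Summarize a head-to-head; results maps dataset -> (automl_auc, baseline_auc)."""
    gains = {name: round(automl - base, 3) for name, (automl, base) in results.items()}
    return {
        "gains": gains,
        # A dataset counts as "matches or beats" when the gain is not negative.
        "matches_or_beats": sum(g >= 0 for g in gains.values()),
        "datasets": len(gains),
        "mean_gain": round(mean(gains.values()), 3),
    }


if __name__ == "__main__":
    # The single pair quoted in the PR description: 0.962 vs 0.813 on phoneme.
    print(summarize({"phoneme": (0.962, 0.813)}))
```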

Real finance & retail data

samples/industry_showcase.py — OpenML credit-g 0.82, bank-marketing 0.92 (no Kaggle needed).

Docs & cleanup

Benchmark results added to the docs, new docs/samples.md (+nav), README 'Proven, not promised'. Reworded the two 'stub'/'placeholder' comments; there are no real stubs or TODOs.

Gate: ruff clean · pyright 0 · mkdocs --strict ok · 93 tests pass (7 integration deselected). No API key in the repo.
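The "skip without ANTHROPIC_API_KEY" behaviour can be sketched like this. It is a stdlib `unittest` stand-in for illustration only: the PR's own integration tests are deselected by the test runner and may be wired differently, and `live_key_available`, `LiveModelTest`, and `run_suite` are hypothetical names.

```python
import os
import unittest


def live_key_available(env=None) -> bool:
    """Return True when a non-empty ANTHROPIC_API_KEY is present."""
    env = os.environ if env is None else env
    return bool(env.get("ANTHROPIC_API_KEY", "").strip())


class LiveModelTest(unittest.TestCase):
    def setUp(self):
        # Checked at run time, so the suite stays green on machines without a key.
        if not live_key_available():
            self.skipTest("ANTHROPIC_API_KEY not set; skipping live-model test")

    def test_live_call(self):
        # Placeholder body: a real test would call the model here.
        self.assertTrue(live_key_available())


def run_suite() -> unittest.TestResult:
    """Run the live-model tests and return the raw result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LiveModelTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


if __name__ == "__main__":
    outcome = run_suite()
    print(f"run={outcome.testsRun} skipped={len(outcome.skipped)}")
```

Checking the key inside `setUp` rather than at import time means the decision follows the environment of the actual run, which also keeps any key out of the repository.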

Validated for real with a live model (anthropic:claude-haiku-4-5):
- GenAI feature engineering: Claude proposed features and the gate accepted only
  the ones that lifted the score — it rediscovered debt_to_income from the schema
- agentic loop: Claude reflected on history and tuned the best model (0.9955)
- samples/genai_llm_showcase.py + integration tests (skip w/o ANTHROPIC_API_KEY)

Benchmark proof (benchmarks/beat_baseline.py, 5-fold CV ROC-AUC):
- Firefly AutoML matches-or-beats a default LogisticRegression on 6/6 datasets;
  +0.149 on phoneme (0.962 vs 0.813), +0.029 mean gain. Honest CV protocol.
- benchmarks/RESULTS.md exposes all executed numbers (Tier-1/2 + the head-to-head)

Real-data samples: industry_showcase.py (OpenML credit-g 0.82, bank-marketing 0.92).
Docs: benchmarks results, docs/samples.md (+nav), README 'Proven, not promised'.
Cleanup: reworded the two 'stub'/'placeholder' comments (no real stubs exist).

Gate: ruff clean, pyright 0, mkdocs --strict ok, 93 tests pass (7 integration deselected).

@ancongui merged commit 817144b into main on Jun 25, 2026
@ancongui deleted the feat/real-llm-benchmarks-samples branch on June 25, 2026 at 12:44