Real-LLM testing, benchmark proof, finance/retail samples #6
Merged
Validated for real with a live model (anthropic:claude-haiku-4-5):
- GenAI feature engineering: Claude proposed features and the gate accepted only the ones that lifted the score; it rediscovered debt_to_income from the schema
- agentic loop: Claude reflected on history and tuned the best model (0.9955)
- samples/genai_llm_showcase.py + integration tests (skip w/o ANTHROPIC_API_KEY)

Benchmark proof (benchmarks/beat_baseline.py, 5-fold CV ROC-AUC):
- Firefly AutoML matches-or-beats a default LogisticRegression on 6/6 datasets; +0.149 on phoneme (0.962 vs 0.813), +0.029 mean gain. Honest CV protocol.
- benchmarks/RESULTS.md exposes all executed numbers (Tier-1/2 + the head-to-head)

Real-data samples: industry_showcase.py (OpenML credit-g 0.82, bank-marketing 0.92).
Docs: benchmarks results, docs/samples.md (+nav), README 'Proven, not promised'.
Cleanup: reworded the two 'stub'/'placeholder' comments (no real stubs exist).
Gate: ruff clean, pyright 0, mkdocs --strict ok, 93 tests pass (7 integration deselected).
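For readers unfamiliar with the idea, the feature gate mentioned above ("accepted only the ones that lifted the score") can be sketched roughly as follows. This is a minimal illustration and not the project's code: gate_features, score_with, min_lift, and the toy lift numbers are all hypothetical, and the real gate presumably scores candidates with cross-validation rather than a lookup table.

```python
from typing import Callable, List, Sequence, Tuple


def gate_features(
    candidates: Sequence[str],
    score_with: Callable[[List[str]], float],
    baseline: float,
    min_lift: float = 0.0,
) -> Tuple[List[str], float]:
    """Greedily keep each proposed feature only if it lifts the current best score."""
    accepted: List[str] = []
    best = baseline
    for name in candidates:
        # Score the model with the already accepted features plus this candidate.
        trial = score_with(accepted + [name])
        if trial - best > min_lift:
            accepted.append(name)
            best = trial
    return accepted, best


# Toy scorer (hypothetical numbers, for illustration only): each feature adds
# a fixed lift to a baseline score of 0.80.
TOY_LIFT = {"debt_to_income": 0.02, "random_noise": -0.01}


def toy_score(features: List[str]) -> float:
    return 0.80 + sum(TOY_LIFT[f] for f in features)


accepted, best = gate_features(["debt_to_income", "random_noise"], toy_score, baseline=0.80)
print(accepted, round(best, 2))
```

With the toy scorer, the helpful feature is kept and the harmful one is rejected. A nonzero min_lift would be one way to avoid accepting features whose gain is within noise.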
Tested for real with a live model (anthropic:claude-haiku-4-5)
- GenAI feature engineering: Claude proposed features and the gate accepted only the ones that lifted the score; it rediscovered debt_to_income from the schema alone.
- samples/genai_llm_showcase.py + integration tests (skip without ANTHROPIC_API_KEY).

Beats a baseline, honestly (benchmarks/beat_baseline.py, 5-fold CV ROC-AUC)
- Firefly AutoML matches-or-beats a default LogisticRegression on 6/6 datasets: +0.149 on phoneme (0.962 vs 0.813), +0.029 mean gain.
- benchmarks/RESULTS.md exposes every executed number.

Real finance & retail data
- samples/industry_showcase.py: OpenML credit-g 0.82, bank-marketing 0.92 (no Kaggle needed).

Docs & cleanup
- Benchmark results in docs, new docs/samples.md (+nav), README "Proven, not promised".
- Reworded the two 'stub'/'placeholder' comments; there are no real stubs/TODOs.

Gate: ruff clean · pyright 0 · mkdocs --strict ok · 93 tests pass (7 integration deselected). No API key in the repo.
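As a rough illustration of how live-model tests can skip themselves when ANTHROPIC_API_KEY is absent, here is a standard-library sketch. It uses unittest rather than whatever runner and markers the repository actually uses (the "deselected" wording suggests marker-based selection), and the LiveModelTest class is hypothetical.

```python
import os
import unittest

# The environment variable name comes from the PR; everything else is illustrative.
HAS_KEY = bool(os.environ.get("ANTHROPIC_API_KEY"))


@unittest.skipUnless(HAS_KEY, "ANTHROPIC_API_KEY not set; skipping live-model test")
class LiveModelTest(unittest.TestCase):
    def test_key_is_present(self) -> None:
        # A real integration test would call the model here; this placeholder
        # only confirms the key is available when the test is not skipped.
        self.assertTrue(os.environ["ANTHROPIC_API_KEY"])


def run_live_tests() -> unittest.TestResult:
    """Run the live-model suite and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LiveModelTest)
    result = unittest.TestResult()
    suite.run(result)
    return result


result = run_live_tests()
print("skipped:", len(result.skipped), "failures:", len(result.failures))
```

Reading the key from the environment at test time is also what keeps it out of the repository: without the variable the suite reports a skip, not a failure.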