Scientific evaluation (nested CV), GenAI-value ablation, boosting tests; fix Pages images #7
Merged
…s; fix Pages images

- benchmarks/scientific_eval.py: nested 5-fold CV (unbiased: inner CV selects, outer fold scores) vs LogReg/RandomForest/XGBoost, with a Wilcoxon test. Firefly beats single LogReg (p=0.046) and single XGBoost (p=7.5e-6), and is on par with RandomForest; adapts per dataset.
- benchmarks/genai_value.py: controlled ablation with a real LLM. GenAI feature engineering lifts a linear model by +0.0205 ROC-AUC (p=0.0039) by rediscovering revenue=price*units; the gate guarantees no regression; cost < $0.01. Significant, metered, honest.
- tests/models/test_boosting.py: explicit XGBoost/LightGBM/CatBoost fit/predict/params.
- Integration tests for both harnesses; RESULTS.md + docs updated with rigorous numbers.
- FIX: mkdocs use_directory_urls=false so raw-HTML <img src='img/..'> tags render on GitHub Pages (the agentic-loop and other diagrams were 404ing under directory URLs).

Gate: ruff clean, pyright 0, mkdocs --strict ok, 100 tests pass (9 integration deselected).
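For readers unfamiliar with the pattern, nested CV as described above can be sketched roughly as follows. This is not the code in benchmarks/scientific_eval.py: the synthetic dataset, the RandomForest stand-in for the tuned model, the parameter grid, and the fold counts in the inner loop are all assumptions chosen for illustration.

```python
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic data (assumption: the real harness uses its own datasets).
X, y = make_classification(n_samples=400, n_features=20, n_informative=5, random_state=0)

inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: model selection sees only the outer-training portion of each split.
tuned = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    param_grid={"max_depth": [3, None]},  # illustrative grid only
    scoring="roc_auc",
    cv=inner,
)

# Outer loop: each held-out fold scores a model that was selected without it.
tuned_scores = cross_val_score(tuned, X, y, cv=outer, scoring="roc_auc")
baseline_scores = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=outer, scoring="roc_auc"
)

# Paired Wilcoxon signed-rank test on the per-fold ROC-AUC scores.
stat, p_value = wilcoxon(tuned_scores, baseline_scores)
print(tuned_scores.mean(), baseline_scores.mean(), p_value)
```

With only five paired scores, a two-sided exact Wilcoxon test cannot go below p = 0.0625, so p-values as small as those reported here presumably come from pooling many more pairs (for example, folds across several datasets).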
Scientific, unbiased evaluation
- scientific_eval.py: nested 5-fold CV (inner CV selects, the untouched outer fold scores, so there is no selection bias) vs LogReg / RandomForest / XGBoost, with a Wilcoxon test. Firefly AutoML beats single LogReg (p=0.046) and single XGBoost (p=7.5e-6), and is on par with RandomForest; adapts per dataset.
- genai_value.py: controlled ablation with a real LLM. GenAI feature engineering lifts a linear model by +0.0205 ROC-AUC (p=0.0039) by rediscovering revenue = price × units; the gate guarantees no regression; cost < $0.01.
- tests/models/test_boosting.py: explicit XGBoost / LightGBM / CatBoost fit, predict, params.
- RESULTS.md + docs updated with the rigorous, honest numbers.

Fix: diagrams not rendering on GitHub Pages

- use_directory_urls: false so raw-HTML <img src="img/.."> tags resolve from the site root on every page (the agentic-loop and other diagrams were 404ing under directory URLs).

Gate: ruff clean · pyright 0 · mkdocs --strict ok · 100 tests pass (9 integration deselected). No API key in the repo.
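
Why the setting matters can be seen from plain relative-URL resolution. The sketch below is illustrative only: the site URL, page name, and image filename are hypothetical, and it assumes a top-level docs page whose raw-HTML img tag uses a path relative to the docs root.

```python
from urllib.parse import urljoin

SITE = "https://example.github.io/project/"  # hypothetical Pages URL
IMG_SRC = "img/agentic-loop.png"             # hypothetical file under docs/img/


def page_url(page: str, use_directory_urls: bool) -> str:
    """URL at which a top-level docs/<page>.md is served."""
    # Directory URLs publish <page>/index.html; otherwise <page>.html.
    suffix = f"{page}/" if use_directory_urls else f"{page}.html"
    return urljoin(SITE, suffix)


def resolved_img(page: str, use_directory_urls: bool) -> str:
    """Where the browser looks for a raw-HTML relative img src on that page."""
    return urljoin(page_url(page, use_directory_urls), IMG_SRC)


broken = resolved_img("architecture", use_directory_urls=True)
fixed = resolved_img("architecture", use_directory_urls=False)
print(broken)  # https://example.github.io/project/architecture/img/agentic-loop.png
print(fixed)   # https://example.github.io/project/img/agentic-loop.png
```

Under directory URLs the page lives one level deeper, so the browser requests the image from a folder that does not exist. Markdown-syntax image links are rewritten by MkDocs at build time, but raw HTML is passed through unchanged, which is why only the raw-HTML diagrams were affected.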