Scientific evaluation (nested CV), GenAI-value ablation, boosting tests; fix Pages images#7

Merged
ancongui merged 1 commit into main from feat/scientific-eval on Jun 25, 2026

@ancongui

Scientific, unbiased evaluation

  • scientific_eval.py: nested 5-fold CV (the inner CV selects, the untouched outer fold scores, so there is no selection bias) vs LogReg / RandomForest / XGBoost, with a Wilcoxon test. Firefly AutoML beats single LogReg (p=0.046) and single XGBoost (p=7.5e-6) and is on par with RandomForest; it adapts per dataset.
  • genai_value.py: controlled ablation with a real LLM. GenAI feature engineering lifts a linear model by +0.0205 ROC-AUC (p=0.0039) by rediscovering revenue = price × units; the gate guarantees no regression; metered cost < $0.01.
  • tests/models/test_boosting.py — explicit XGBoost / LightGBM / CatBoost fit, predict, params.
  • Integration tests for both harnesses (verified live). RESULTS.md + docs updated with the rigorous, honest numbers.
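
The nested-CV idea in the first bullet can be illustrated with a small, dependency-free sketch. This is not the contents of scientific_eval.py: the toy dataset, the k-NN model, and every helper name here (`make_data`, `k_folds`, `select_k`, `nested_cv`) are assumptions made for illustration. The point it shows is structural: the hyperparameter is chosen by an inner CV that only sees the outer training split, and each outer fold is scored once, by a model that never saw it during selection.

```python
import random
from statistics import mean

def make_data(n=120, seed=0):
    """Toy binary dataset: two noisy 2-D blobs (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        center = 1.5 * label
        rows.append(((rng.gauss(center, 1.0), rng.gauss(center, 1.0)), label))
    rng.shuffle(rows)
    return rows

def k_folds(rows, k):
    """Split rows into k disjoint folds; yield (train, test) pairs."""
    folds = [rows[i::k] for i in range(k)]
    for i in range(k):
        train = [r for j, f in enumerate(folds) if j != i for r in f]
        yield train, folds[i]

def knn_predict(train, x, k):
    """Majority vote among the k nearest training points."""
    nearest = sorted(
        train, key=lambda r: (r[0][0] - x[0]) ** 2 + (r[0][1] - x[1]) ** 2
    )[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)

def accuracy(train, test, k):
    """Fraction of test points classified correctly by k-NN fit on train."""
    return sum(knn_predict(train, x, k) == y for x, y in test) / len(test)

def select_k(train, candidates, inner_folds=5):
    """Inner CV: pick k by mean accuracy, using the outer training split only."""
    def inner_score(k):
        return mean(accuracy(tr, va, k) for tr, va in k_folds(train, inner_folds))
    return max(candidates, key=inner_score)

def nested_cv(rows, candidates=(1, 5, 15), outer_folds=5):
    """Outer CV: score each held-out fold with a k chosen without seeing it."""
    scores = []
    for train, test in k_folds(rows, outer_folds):
        best_k = select_k(train, candidates)
        scores.append(accuracy(train, test, best_k))
    return scores

scores = nested_cv(make_data())
print("outer-fold accuracies:", [round(s, 3) for s in scores])
print(f"nested-CV estimate: {mean(scores):.3f}")
```

Running the same outer folds for two methods yields paired per-fold scores, which is the kind of input a paired test such as `scipy.stats.wilcoxon` accepts; whether the harness pairs by fold or by dataset is not stated in this PR.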

Fix: diagrams not rendering on GitHub Pages

Set use_directory_urls: false in the MkDocs config so that raw-HTML <img src="img/.."> paths resolve from the site root on every page (the agentic-loop and other diagrams were returning 404 under directory URLs).
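
Why the setting matters can be checked with the standard library's URL resolution. The sketch below assumes a hypothetical Pages base URL, a hypothetical top-level page named `loop`, and a hypothetical image file name; only the relative `img/` prefix comes from this PR.

```python
from urllib.parse import urljoin

SITE = "https://example.github.io/project/"  # hypothetical Pages base URL
IMG = "img/agentic-loop.png"                 # hypothetical image file name

# use_directory_urls: true  -> docs/loop.md is served as .../loop/
directory_page = urljoin(SITE, "loop/")
# use_directory_urls: false -> docs/loop.md is served as .../loop.html
flat_page = urljoin(SITE, "loop.html")

# A browser resolves the relative src against the page URL.
broken = urljoin(directory_page, IMG)
fixed = urljoin(flat_page, IMG)

print(broken)  # https://example.github.io/project/loop/img/agentic-loop.png
print(fixed)   # https://example.github.io/project/img/agentic-loop.png
```

With directory URLs the relative path picks up the extra `loop/` segment, which is where the 404 comes from; with flat `.html` URLs a top-level page resolves `img/` next to the site root.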

Gate: ruff clean · pyright 0 · mkdocs --strict ok · 100 tests pass (9 integration deselected). No API key in the repo.

…s; fix Pages images

- benchmarks/scientific_eval.py: NESTED 5-fold CV (unbiased — inner CV selects, outer
  fold scores) vs LogReg/RandomForest/XGBoost + Wilcoxon. Firefly beats single LogReg
  (p=0.046) and single XGBoost (p=7.5e-6), on par with RandomForest; adapts per dataset.
- benchmarks/genai_value.py: controlled ablation w/ real LLM — GenAI feature engineering
  lifts a linear model +0.0205 ROC-AUC (p=0.0039) by rediscovering revenue=price*units;
  gate guarantees no regression; <$0.01. Significant, metered, honest.
- tests/models/test_boosting.py: explicit XGBoost/LightGBM/CatBoost fit/predict/params.
- integration tests for both harnesses; RESULTS.md + docs updated with rigorous numbers.
- FIX: mkdocs use_directory_urls=false so raw-HTML <img src='img/..'> render on GitHub
  Pages (the agentic-loop and other diagrams were 404ing under directory URLs).
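
The "gate guarantees no regression" claim for genai_value.py can be pictured with a small stdlib-only sketch. Everything here is an assumption for illustration: the synthetic data, the hand-rolled logistic regression, the single train/validation split, and the `gate` function are not taken from the benchmark, and how the real gate is scored is not stated in this PR. The sketch only shows the mechanism: a candidate feature (price × units) is kept when it beats the baseline score and discarded otherwise, so the kept score can never be lower than the baseline.

```python
import math
import random

def make_rows(n, rng):
    """Toy data: the label depends on price * units, which no single column exposes."""
    rows = []
    for _ in range(n):
        price, units = rng.uniform(1, 10), rng.uniform(1, 10)
        label = int(price * units + rng.gauss(0, 3) > 25)
        rows.append((price, units, label))
    return rows

def features(row, with_revenue):
    price, units, _ = row
    return [price, units, price * units] if with_revenue else [price, units]

def fit_logreg(X, y, epochs=300, lr=0.5):
    """Batch gradient descent on standardized features; returns a scoring function."""
    d, n = len(X[0]), len(X)
    mu = [sum(x[j] for x in X) / n for j in range(d)]
    sd = [math.sqrt(sum((x[j] - mu[j]) ** 2 for x in X) / n) or 1.0 for j in range(d)]
    Z = [[(x[j] - mu[j]) / sd[j] for j in range(d)] for x in X]
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gw, gb = [0.0] * d, 0.0
        for z, t in zip(Z, y):
            p = 1 / (1 + math.exp(-(b + sum(wi * zi for wi, zi in zip(w, z)))))
            for j in range(d):
                gw[j] += (p - t) * z[j]
            gb += p - t
        w = [wi - lr * g / n for wi, g in zip(w, gw)]
        b -= lr * gb / n
    return lambda x: b + sum(wi * (xj - m) / s for wi, xj, m, s in zip(w, x, mu, sd))

def roc_auc(scores, labels):
    """Probability that a random positive outranks a random negative (ties count half)."""
    pos = [s for s, t in zip(scores, labels) if t == 1]
    neg = [s for s, t in zip(scores, labels) if t == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def validation_auc(train, valid, with_revenue):
    model = fit_logreg([features(r, with_revenue) for r in train], [r[2] for r in train])
    return roc_auc([model(features(r, with_revenue)) for r in valid], [r[2] for r in valid])

def gate(baseline, candidate):
    """Keep the candidate feature set only if it strictly beats the baseline."""
    return ("candidate", candidate) if candidate > baseline else ("baseline", baseline)

rng = random.Random(0)
train, valid = make_rows(300, rng), make_rows(150, rng)
base = validation_auc(train, valid, with_revenue=False)
cand = validation_auc(train, valid, with_revenue=True)
choice, kept = gate(base, cand)
print(f"baseline={base:.4f} candidate={cand:.4f} kept={choice}")
```

The guarantee is relative to the data the gate is scored on; a separate held-out evaluation is still needed to report the gain without optimism.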

Gate: ruff clean, pyright 0, mkdocs --strict ok, 100 tests pass (9 integration deselected).
@ancongui merged commit 24f7444 into main on Jun 25, 2026
@ancongui deleted the feat/scientific-eval branch on June 25, 2026 13:03