From 40e42011cd3871e5e8350ba585f06d16f6d54a14 Mon Sep 17 00:00:00 2001
From: Eric Boothe
Date: Thu, 28 May 2026 17:16:53 -0600
Subject: [PATCH 1/8] feat(bench): Outpost accuracy benchmark + per-sheet-eval fixes + searchByLabel lazy numerics
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Keystone of the next-wave effort: a repeatable accuracy + efficacy benchmark
over the real ~200 MB Outpost models, plus fixes to the eval tooling that was
silently broken on them.

Benchmark (benchmarks/outpost-bench.mjs, `npm run bench:outpost`):
- Wraps eval/per-sheet-eval.mjs (live engine-vs-ground-truth) per model;
  reports overall accuracy + per-sheet pass/skip + timings. Aggregate-only
  results -> committed benchmarks/BASELINE.md; full detail -> gitignored
  benchmarks/results/. No cell value/label is ever committed.
- Baseline: outpost-a1 84.3%, outpost-a2 85.5% on standalone sheets
  (17-sheet cluster + 190 MB PP&E skipped for now).

per-sheet-eval (wasn't in CI -> bugs went unnoticed):
- Windows crash FIXED: it imported each sheet's compute() by a bare absolute
  path ("C:\..."), which Node ESM rejects on Windows -> every sheet crashed at
  load (0% accuracy) on Windows + the real engines. Now uses pathToFileURL().
  New tests/cli/test-per-sheet-eval.mjs (6) guards it; CI runs on
  windows-latest.
- --skip-clusters flag: record circular-cluster sheets as skipped (the current
  convergence re-runs the whole cluster once per member -> O(cluster²),
  infeasible on big models) pending the single-pass orchestrator eval.

searchByLabel (query/carry): probe the matched row's columns on demand instead
of scanning the whole GT per row, with a directed caseColumn probe so a far
scenario column is never missed. Behavior-preserving.

Findings (now in ROADMAP): the cluster is 17/21 sheets + redundantly
evaluated; the 190 MB PP&E sheet exceeds the 150 MB limit;
_computed-values.json is a byte-identical GT copy (not an accuracy source).
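
Illustration of the Windows fix described above (a minimal sketch, not the
code in eval/per-sheet-eval.mjs): Node's ESM loader accepts file:// URLs but
rejects a bare Windows absolute path such as "C:\...", so the path is
converted with pathToFileURL() before the dynamic import. The helper names
toImportSpecifier and loadCompute, and the throwaway sheet module, are
assumptions made up for this example.

```javascript
import { mkdtemp, writeFile, rm } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { pathToFileURL } from 'node:url';

// Hypothetical helper (not from the patch): turn an absolute filesystem path
// into a file:// URL string that dynamic import() accepts on every platform.
function toImportSpecifier(absPath) {
  return pathToFileURL(absPath).href;
}

// Hypothetical loader: import a sheet module and return its compute() export.
async function loadCompute(absPath) {
  const mod = await import(toImportSpecifier(absPath));
  return mod.compute;
}

// Demo: write a throwaway sheet module, then load it through its file:// URL.
const dir = await mkdtemp(join(tmpdir(), 'sheet-'));
const sheetPath = join(dir, 'sheet.mjs');
await writeFile(sheetPath, 'export function compute() { return 42; }\n');
const compute = await loadCompute(sheetPath);
const demoResult = compute();
await rm(dir, { recursive: true, force: true });
```

On POSIX systems a bare absolute path happens to import fine, which is
presumably why a bug of this kind only surfaces on Windows runners.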
gitignore: engines/ (already), benchmarks/results/, _eval_tmp/.
Full `npm test` (incl. new per-sheet-eval guard) + smoke green.

Co-Authored-By: Claude Opus 4.8
---
 .gitignore                        |   8 ++
 CHANGELOG.md                      |  49 +++++++++
 PLAN.md                           |  24 +++++
 ROADMAP.md                        |  33 ++++--
 benchmarks/BASELINE.md            |  24 +++++
 benchmarks/outpost-bench.mjs      | 172 ++++++++++++++++++++++++++++++
 eval/per-sheet-eval.mjs           |  21 +++-
 lib/manifest.mjs                  |  41 +++++--
 package.json                      |   3 +-
 tests/cli/test-per-sheet-eval.mjs |  74 +++++++++++++
 10 files changed, 426 insertions(+), 23 deletions(-)
 create mode 100644 benchmarks/BASELINE.md
 create mode 100644 benchmarks/outpost-bench.mjs
 create mode 100644 tests/cli/test-per-sheet-eval.mjs

diff --git a/.gitignore b/.gitignore
index 8c6cf07..5735539 100644
--- a/.gitignore
+++ b/.gitignore
@@ -17,6 +17,14 @@ tests/eval-results.json
 # covered by *.xlsx above; this also excludes the parsed engine output dirs.
 engines/
 
+# Benchmark run detail (per-sheet/per-cell results derived from the real models
+# — may contain real values/labels). Only the aggregate-only benchmarks/BASELINE.md
+# is committed; raw per-run detail stays local.
+benchmarks/results/
+
+# per-sheet-eval scratch dir (transient child-process scripts + per-sheet GT)
+_eval_tmp/
+
 # Transient test artifacts (scenario save/load test writes here on every run)
 tests/cli/fixtures/scenarios/
 
diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0bffafc..903078c 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -1,5 +1,54 @@
 # excel-to-engine — Changelog
 
+## 2026-05-28 — Outpost accuracy benchmark + eval-tooling fixes
+
+Stood up a repeatable accuracy + efficacy benchmark over the real ~200 MB
+Outpost models so improvements can be tracked over time, and fixed the eval
+tooling that was silently broken on them.
+
+### Benchmark (`benchmarks/outpost-bench.mjs`, `npm run bench:outpost`)
+
+- Wraps `eval/per-sheet-eval.mjs` (live engine-vs-ground-truth) for every model
+  under a root dir; reports overall accuracy, per-sheet pass/skip counts, and
+  timings. **Aggregate-only** results go to the committed `benchmarks/BASELINE.md`;
+  full per-sheet detail stays in the gitignored `benchmarks/results/`. No cell
+  value or label is ever committed.
+- **Baseline (2026-05-28):** outpost-a1 **84.3%**, outpost-a2 **85.5%** on the
+  standalone sheets. (The 17-sheet circular cluster and the 190 MB PP&E sheet are
+  skipped for now — see below.)
+
+### per-sheet-eval fixes (it wasn't in CI, so these went unnoticed)
+
+- **Windows crash fixed.** The generated per-sheet wrapper imported each sheet's
+  `compute()` by a bare absolute path (`"C:\\..."`), which Node ESM rejects on
+  Windows — so *every* sheet "crashed" at load (0% accuracy) on Windows and on
+  the real engines. Now uses `pathToFileURL()`. New `tests/cli/test-per-sheet-eval.mjs`
+  (6) guards it; CI runs it on **windows-latest** too.
+- **`--skip-clusters`** flag: record circular-cluster sheets as skipped instead
+  of evaluating them. The current convergence path re-runs the *whole* cluster
+  once per member sheet (O(cluster²)), which is infeasible on big models; this
+  yields a fast, real number for the standalone sheets while the single-pass
+  orchestrator eval is built (ROADMAP).
+
+### searchByLabel: lazy numerics (query / carry)
+
+`searchByLabel` previously scanned the entire ground truth once per matched row
+to collect adjacent numerics. It now probes the row's columns on demand (same
+approach as the refiner), with a directed `caseColumn` lookup probing its exact
+cell so a far scenario column is never missed. Behavior-preserving (query/carry/
+ai-interface suites green).
+
+### Findings that scope the accuracy-blocker work
+
+- The 190 MB PP&E sheet exceeds the 150 MB per-sheet limit → **large-sheet eval**
+  blocker confirmed.
+- The circular cluster is **17 of 21 sheets** and is evaluated redundantly
+  (once per member) → the concrete reason behind "circular-cluster won't
+  evaluate." Single-pass orchestrator eval is the fix.
+- `_computed-values.json` in these engines is **byte-identical to ground truth**
+  (a seeded copy), so it is not a valid accuracy source — accuracy must come from
+  live recompute.
+
 ## 2026-05-28 — `init` parses the ground truth once (shared across the pipeline)
 
 The real driver behind the "~2.5 min" refine loop wasn't one command — it was
diff --git a/PLAN.md b/PLAN.md
index 8d36e2a..f16b337 100644
--- a/PLAN.md
+++ b/PLAN.md
@@ -1,5 +1,29 @@
 # excel-to-engine — Plan
 
+## Status: Outpost accuracy benchmark + eval fixes — in progress 2026-05-28
+
+Standing up the multi-wave "next wave" effort on `feat/next-wave`, keystone
+first: a repeatable accuracy + efficacy benchmark over the real Outpost models
+(`benchmarks/outpost-bench.mjs` → `benchmarks/BASELINE.md`, aggregate-only).
+
+**Baseline:** outpost-a1 84.3%, outpost-a2 85.5% on standalone sheets; the
+17-sheet circular cluster and the 190 MB PP&E sheet are skipped pending deeper
+fixes. Landed alongside: a **Windows crash fix** in `per-sheet-eval` (bare
+absolute ESM import → `pathToFileURL`; it had silently zeroed accuracy on
+Windows/real engines and wasn't in CI — now guarded by `test-per-sheet-eval`),
+a `--skip-clusters` flag, and the **searchByLabel lazy-numerics** wave
+(query/carry stop scanning the full GT for adjacent values).
+
+**Wave status (this branch):**
+- ✅ Keystone benchmark + baseline; ✅ searchByLabel (query/carry).
+- 🔜 Accuracy blockers — now precisely diagnosed: single-pass orchestrator eval
+  for the 17-sheet cluster (it's re-run once per member today), large-sheet eval
+  (190 MB PP&E > 150 MB limit), array formulas (the Headcount sheet lives inside
+  the cluster). `_computed-values.json` is a GT copy — not an accuracy source.
+- 🔜 Manifest-pipeline perf (generate detectors / maps cell-types on ~6M cells).
+- 🔜 Polish→Publish (lib/ unit tests, npm publish prep, example project,
+  contributing guide).
+
 ## Status: single GT parse per init — landed 2026-05-28
 
 `ete init` now loads the ground truth once and shares the parsed object across
diff --git a/ROADMAP.md b/ROADMAP.md
index c6786bb..86be414 100644
--- a/ROADMAP.md
+++ b/ROADMAP.md
@@ -96,8 +96,10 @@ when we next touch the monitor server or auth surface.
     ~70 KB) but extracting it cheaply would couple the parser to refine's
     metric vocabulary. Not worth it on these models; revisit only if a
     genuinely giant-grid model (mostly unlabeled numeric grids) shows up.
-  - **Still open:** apply the same lazy-numerics path to `searchByLabel`
-    (`query` / `carry`) so they stop scanning the GT for adjacent values.
+  - **Done (2026-05-28):** applied the same lazy-numerics path to `searchByLabel`
+    (`query` / `carry`) — probes the matched row's columns instead of scanning
+    the whole GT, with a directed `caseColumn` probe so a far scenario column is
+    never missed.
 - Manifest migration tooling for model updates (vN → vN+1 shape diff).
 
 ---
@@ -147,14 +149,29 @@ when we next touch the monitor server or auth surface.
 - **Pref compounding for long holds** — 12-year 8% compound pref = 2.52x hurdle, which exceeds many MOIC targets. Need to detect when models use quarterly cash flow waterfalls vs bullet maturity and adjust accordingly.
 
 ### Eval System
-- Increase blind eval question diversity (computed questions, cross-sheet aggregations)
-- Add time-period-aware questions ("What was X in Q3 2025?")
-- Profile and optimize per-sheet eval for sheets >150MB
+- **Done (2026-05-28):** repeatable accuracy + efficacy benchmark over the real
+  Outpost models — `benchmarks/outpost-bench.mjs` → `benchmarks/BASELINE.md`
+  (aggregate-only). Baseline: a1 84.3%, a2 85.5% on standalone sheets. Also
+  **fixed a Windows crash** in `per-sheet-eval` (bare absolute ESM import →
+  `pathToFileURL`; it had zeroed accuracy on Windows/real engines and wasn't in
+  CI — now guarded by `test-per-sheet-eval`, run on windows-latest).
+- **Large-sheet eval (190 MB PP&E):** confirmed it exceeds the 150 MB per-sheet
+  limit and is skipped. Needs streaming/sharded per-sheet eval or a higher limit
+  with chunked compute. The standalone sheets at ~85% also need attention (array
+  formulas / wide-sheet disambiguation) — visible now that the eval runs.
+- Increase blind eval question diversity; add time-period-aware questions.
 
 ### Convergence Loop Accuracy
-- The 62-sheet circular cluster in the large model is the biggest accuracy blocker
-- Investigate running eval through the orchestrator (not per-sheet isolation) for circular sheets
-- Consider lazy subgraph evaluation (only compute transitive closure of target cells)
+- **Diagnosed (2026-05-28):** on the real models the circular cluster is **17 of
+  21 sheets**, and `per-sheet-eval` re-runs the *entire* cluster convergence once
+  per member sheet (O(cluster²)) — that's why clustered big models "won't
+  evaluate." The array-formula Headcount sheet lives inside this cluster, so it's
+  unmeasurable until this is fixed. `--skip-clusters` skips them for now.
+- **Fix:** single-pass orchestrator eval — converge the cluster once, then score
+  every member sheet from that converged state (then drop `--skip-clusters` from
+  the benchmark). Also scope the convergence diff to written cells (it currently
+  diffs all ~6M seeded cells per iteration).
+- Consider lazy subgraph evaluation (only compute transitive closure of targets).
 
 ## Near-Term
 
diff --git a/benchmarks/BASELINE.md b/benchmarks/BASELINE.md
new file mode 100644
index 0000000..e2a72bd
--- /dev/null
+++ b/benchmarks/BASELINE.md
@@ -0,0 +1,24 @@
+# Outpost benchmark — baseline & history
+
+Real accuracy: each standalone sheet recomputed live vs ground truth via
+`eval/per-sheet-eval.mjs` (numbers within 1% rel. tol, strings exact).
+Circular-cluster sheets and oversized sheets are **skipped** for now (see
+the Skipped column + blockers below) pending the single-pass orchestrator
+eval; run with `--with-clusters` once that lands. Aggregate-only — no cell
+values or full sheet inventory. Regenerate:
+`node benchmarks/outpost-bench.mjs --root `. Full per-sheet detail
+lands in the gitignored `benchmarks/results/`.
+
+_Last run: baseline-2026-05-28_
+
+| Model | Accuracy | Cells matched | Sheets ≥95% | Skipped | Eval time | GT |
+|-------|---------:|------:|:-----------:|:-------:|----------:|---:|
+| outpost-a1 | 84.33% | 1491/1768 | 1/3 | 17 | 45s | 201.5 MB |
+| outpost-a2 | 85.54% | 1686/1971 | 2/4 | 17 | 48s | 211 MB |
+
+## Known blocker categories
+
+Tracked by name because PLAN.md already calls them out; values are accuracy %, not financials.
+
+- **outpost-a1**: 1/3 sheets clean; blockers: Owned Asset PP&E (skipped: module too large (190MB > 150MB limit)); Headcount (skipped: circular cluster (--skip-clusters; needs single-pass orchestrator eval))
+- **outpost-a2**: 2/4 sheets clean; blockers: Owned Asset PP&E (skipped: module too large (190MB > 150MB limit)); Headcount (skipped: circular cluster (--skip-clusters; needs single-pass orchestrator eval))
diff --git a/benchmarks/outpost-bench.mjs b/benchmarks/outpost-bench.mjs
new file mode 100644
index 0000000..23f8187
--- /dev/null
+++ b/benchmarks/outpost-bench.mjs
@@ -0,0 +1,172 @@
+#!/usr/bin/env node
+/**
+ * Outpost benchmark — repeatable accuracy + efficacy tracking over the real
+ * models, so we can see whether each improvement actually moves the needle.
+ *
+ * It wraps `eval/per-sheet-eval.mjs` (which runs each sheet module live against
+ * ground truth, converges circular clusters, and skips/handles oversized
+ * sheets) for every model under a root dir, then aggregates. Accuracy is real
+ * (engine recompute vs ground truth, 1% rel. tol / exact strings) — NOT the
+ * `_computed-values.json` snapshot, which is a copy of ground truth (trivially
+ * 100%), nor the full-engine `run()`, which is infeasible on these models (the
+ * 190 MB PP&E sheet).
+ *
+ * Privacy: the real models are proprietary (gitignored). Full per-sheet detail
+ * (incl. per-cell failures) stays in the gitignored `benchmarks/results/`. Only
+ * AGGREGATE, non-identifying metrics — overall accuracy, sheet pass/skip counts,
+ * timings, and the already-public blocker categories (PP&E, Headcount) — go to
+ * the committed `benchmarks/BASELINE.md`. No cell value or label is ever
+ * printed or committed.
+ *
+ * Usage:
+ *   node benchmarks/outpost-bench.mjs [--root ] [--concurrency 3] [--stamp