Next-wave: Outpost accuracy benchmark + eval fixes + searchByLabel + lib tests by ebootheee · Pull Request #21 · ebootheee/excel-to-engine

ebootheee · 2026-05-28T23:18:01Z

Summary

The "next wave" effort, keystone first: a repeatable accuracy + efficacy benchmark over the real ~200 MB Outpost models, the eval-tooling fixes needed to make it produce real numbers, plus the searchByLabel and lib-unit-tests waves. Not self-merging — bringing this for your review (per "come back when it's ready for a main merge"). CI green on ubuntu + windows.

Built branch-per-improvement (each an atomic, revertable commit gated on the benchmark + tests): feat/bench → lib-tests → cluster-eval, all folded here.

Baseline (the A/B/C test) — `npm run bench:outpost`

benchmarks/outpost-bench.mjs wraps eval/per-sheet-eval.mjs (live engine-vs-ground-truth). Aggregate-only → committed benchmarks/BASELINE.md; full detail → gitignored benchmarks/results/. No cell value/label ever committed.

| Model | Accuracy (standalone sheets) | Skipped | Eval time |
| --- | --- | --- | --- |
| outpost-a1 | 84.3% | 17-sheet cluster + 190 MB PP&E | 45s |
| outpost-a2 | 85.5% | 17-sheet cluster + PP&E | 48s |
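
To illustrate the aggregate-only rule, here is a minimal sketch of reducing per-sheet detail to a committable summary. The result shape (`sheets`, `passed`, `total`, `mismatches`) and the `toAggregate` name are assumptions for illustration, not the actual fields of benchmarks/outpost-bench.mjs.

```javascript
// Hypothetical per-model detail; the real benchmark's fields may differ.
const detail = {
  model: 'model-a',
  evalSeconds: 45,
  sheets: [
    { name: 'Sheet1', passed: 90, total: 100, skipped: false,
      mismatches: [{ cell: 'B2', expected: 1, actual: 2 }] },
    { name: 'Sheet2', passed: 0, total: 0, skipped: true, mismatches: [] },
  ],
};

// Reduce full detail to counts and percentages only.
// No sheet name, cell value, or label survives into the aggregate.
function toAggregate(d) {
  const scored = d.sheets.filter((s) => !s.skipped);
  const passed = scored.reduce((n, s) => n + s.passed, 0);
  const total = scored.reduce((n, s) => n + s.total, 0);
  return {
    model: d.model,
    accuracyPct: total === 0 ? null : Math.round((passed / total) * 1000) / 10,
    sheetsScored: scored.length,
    sheetsSkipped: d.sheets.length - scored.length,
    evalSeconds: d.evalSeconds,
  };
}

const aggregate = toAggregate(detail);
console.log(aggregate);
```

In this sketch the full `detail` object would go to the gitignored results directory and only `aggregate` would feed the committed baseline.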

What landed (all tested)

- per-sheet-eval Windows crash FIXED: it imported each sheet's compute() by a bare absolute path ("C:\..."), which Node ESM rejects on Windows, so every sheet crashed at load (0% accuracy) on Windows and on the real engines. Now uses pathToFileURL(). It wasn't in CI; now guarded by tests/cli/test-per-sheet-eval.mjs on windows-latest.
- Scoped cluster-convergence diff: the convergence loop diffed all ~6M seeded cells per iteration; it now diffs only the cluster's written cells (ctx._written). Also adds the first circular-cluster test and fixture (tests/cli/fixtures/cluster-model/).
- --skip-clusters flag (the benchmark uses it; clusters are pending cluster-once eval).
- searchByLabel lazy numerics: query/carry probe the matched row's columns instead of scanning the whole GT, with a directed caseColumn probe so a far column is never missed. Behavior-preserving.
- lib/ unit tests: tests/lib/test-lib.mjs (43 known-answer cases: irr/NPV/XIRR, waterfall American/European/MOIC-hurdle + conservation invariant, calibration, sensitivity). Polish→Publish's first item.
- gitignore: engines/ (real models), benchmarks/results/, _eval_tmp/.
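
For reference, the Windows fix boils down to handing `import()` a `file://` URL instead of a raw drive path. A minimal sketch, assuming each sheet module exports `compute`; the path and the helper names `toImportSpecifier` / `loadSheetCompute` are hypothetical:

```javascript
import path from 'node:path';
import { pathToFileURL } from 'node:url';

// Convert an absolute filesystem path into a file:// URL string.
// Dynamic import() accepts this on every platform, whereas a bare
// "C:\..." path is rejected by Node's ESM loader on Windows.
function toImportSpecifier(absPath) {
  return pathToFileURL(absPath).href;
}

// Load a sheet module and return its compute() export.
// The module layout is an assumption for illustration.
async function loadSheetCompute(absPath) {
  const mod = await import(toImportSpecifier(absPath));
  return mod.compute;
}

// Hypothetical location of a generated sheet module.
const sheetPath = path.resolve('engines', 'example', 'sheets', 'Sheet1.mjs');
const specifier = toImportSpecifier(sheetPath);
console.log(specifier);
```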

Wave status

✅ benchmark + baseline · ✅ searchByLabel · ✅ lib unit tests
🟡 Accuracy blockers — diagnosed + partially fixed (eval Windows fix, scoped-diff, fixture). The keystone remaining = cluster-once eval: the cluster is 17/21 sheets and per-sheet-eval re-runs the whole convergence once per member (17×) → won't finish on the real model. Fix = one task per cluster (converge once, score all members); the fixture is the test oracle. Then the array-formula Headcount sheet (inside the cluster) and the 190 MB PP&E large-sheet eval become reachable.
🔜 Pipeline perf (generate detectors / maps cell-types on ~6M cells)
🔜 Polish→Publish remainder (npm publish prep, example project, contributing guide)
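
The cluster-once idea from the accuracy-blockers item (converge once, score all members) could look roughly like the sketch below. The two-sheet cluster, the flat `ctx` object, and all function names are hypothetical stand-ins, not the real per-sheet-eval internals.

```javascript
// Hypothetical circular cluster: each member computes its cells from the shared context.
const members = [
  { name: 'SheetA', compute: (ctx) => ({ 'SheetA!A1': 0.5 * ctx['SheetB!A1'] + 25 }) },
  { name: 'SheetB', compute: (ctx) => ({ 'SheetB!A1': 0.5 * ctx['SheetA!A1'] + 25 }) },
];
const groundTruth = { 'SheetA!A1': 50, 'SheetB!A1': 50 };

let convergenceRuns = 0; // counts how often the expensive loop is entered

// Iterate the whole cluster until no computed cell moves by more than tol.
function converge(cluster, ctx, maxIters = 200, tol = 1e-9) {
  convergenceRuns += 1;
  for (let iter = 0; iter < maxIters; iter++) {
    let maxDelta = 0;
    for (const sheet of cluster) {
      for (const [key, value] of Object.entries(sheet.compute(ctx))) {
        maxDelta = Math.max(maxDelta, Math.abs(value - ctx[key]));
        ctx[key] = value;
      }
    }
    if (maxDelta < tol) return true;
  }
  return false;
}

// Score one member against ground truth using the already-converged context.
function scoreMember(sheet, ctx, gt, tol = 1e-6) {
  const keys = Object.keys(gt).filter((k) => k.startsWith(`${sheet.name}!`));
  const passed = keys.filter((k) => Math.abs(ctx[k] - gt[k]) <= tol).length;
  return { sheet: sheet.name, passed, total: keys.length };
}

// One task per cluster: converge once, then score every member.
function evalClusterOnce(cluster, gt) {
  const ctx = { ...gt }; // context seeded from ground truth
  const converged = converge(cluster, ctx);
  return { converged, scores: cluster.map((s) => scoreMember(s, ctx, gt)) };
}

const result = evalClusterOnce(members, groundTruth);
console.log(result, { convergenceRuns });
```

The point of the counter is that `convergenceRuns` stays at 1 regardless of cluster size, instead of growing with the number of member sheets.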

Validation

Full npm test (now 387 JS assertions incl. lib 43, per-sheet-eval 10, refine 14, init-shared-gt 8) + smoke 78/78 green. CI ubuntu + windows green on every commit.

🤖 Generated with Claude Code

…chByLabel lazy numerics

Keystone of the next-wave effort: a repeatable accuracy + efficacy benchmark over the real ~200 MB Outpost models, plus fixes to the eval tooling that was silently broken on them.

Benchmark (benchmarks/outpost-bench.mjs, `npm run bench:outpost`):
- Wraps eval/per-sheet-eval.mjs (live engine-vs-ground-truth) per model; reports overall accuracy + per-sheet pass/skip + timings. Aggregate-only results -> committed benchmarks/BASELINE.md; full detail -> gitignored benchmarks/results/. No cell value/label is ever committed.
- Baseline: outpost-a1 84.3%, outpost-a2 85.5% on standalone sheets (17-sheet cluster + 190 MB PP&E skipped for now).

per-sheet-eval (wasn't in CI -> bugs went unnoticed):
- Windows crash FIXED: it imported each sheet's compute() by a bare absolute path ("C:\..."), which Node ESM rejects on Windows -> every sheet crashed at load (0% accuracy) on Windows + the real engines. Now uses pathToFileURL(). New tests/cli/test-per-sheet-eval.mjs (6) guards it; CI runs on windows-latest.
- --skip-clusters flag: record circular-cluster sheets as skipped (the current convergence re-runs the whole cluster once per member -> O(cluster²), infeasible on big models) pending the single-pass orchestrator eval.

searchByLabel (query/carry): probe the matched row's columns on demand instead of scanning the whole GT per row, with a directed caseColumn probe so a far scenario column is never missed. Behavior-preserving.

Findings (now in ROADMAP): the cluster is 17/21 sheets + redundantly evaluated; the 190 MB PP&E sheet exceeds the 150 MB limit; _computed-values.json is a byte-identical GT copy (not an accuracy source).

gitignore: engines/ (already), benchmarks/results/, _eval_tmp/.

Full `npm test` (incl. new per-sheet-eval guard) + smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
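
A rough sketch of the lazy row probe described for searchByLabel, under loud assumptions: the ground truth is modeled as a `Map` keyed by `Sheet!A1`-style addresses, the `window` size is invented for illustration, and `rowNumerics` is a hypothetical name. The values in the example are made up.

```javascript
// Convert a 1-based column index to spreadsheet letters (1 -> A, 27 -> AA).
function colLetters(n) {
  let s = '';
  while (n > 0) {
    const rem = (n - 1) % 26;
    s = String.fromCharCode(65 + rem) + s;
    n = Math.floor((n - 1) / 26);
  }
  return s;
}

// Probe only the matched row: a bounded window of columns plus, if given,
// the explicitly requested case column, so a far column is still read.
// This replaces a scan over every ground-truth entry.
function rowNumerics(gt, sheet, row, { window = 30, caseColumn } = {}) {
  const cols = new Set();
  for (let c = 1; c <= window; c++) cols.add(colLetters(c));
  if (caseColumn) cols.add(caseColumn);
  const out = {};
  for (const col of cols) {
    const v = gt.get(`${sheet}!${col}${row}`);
    if (typeof v === 'number') out[col] = v;
  }
  return out;
}

// Made-up ground truth: a label in column A, one near and one far numeric.
const gt = new Map([
  ['Returns!A5', 'Example label'],
  ['Returns!C5', 0.2],
  ['Returns!BZ5', 0.3],
]);

const near = rowNumerics(gt, 'Returns', 5);
const withCase = rowNumerics(gt, 'Returns', 5, { caseColumn: 'BZ' });
console.log(near, withCase);
```

The cost per matched label is bounded by the window size plus one directed lookup, independent of how many cells the ground truth holds.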

…itivity

The shared financial libs had no direct coverage. Add tests/lib/test-lib.mjs (43 assertions), wired into `npm test` (runs first):
- irr.mjs: NPV identities; IRR of classic series (-100->+150=50%, -1000 then 200x8 ~= 11.89%, 3y bullet); Newton == bisection; NPV(IRR) ~= 0; null on no sign change; XIRR on dated flows.
- waterfall.mjs: American 80/20 + 8% pref + catch-up (LP/GP splits, carry %), no-catch-up, loss case, flat-MOIC-hurdle promote (+ hold-period invariance), European builder; LP+GP = distributed conservation across structures.
- calibration.mjs: nested get/set; validateOutputs pass/fail + suggested factor.
- sensitivity.mjs: flattenOutputs group/type filtering.

Polish->Publish: first item (lib/ unit tests) done; CI guards them now.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
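
As an illustration of what a known-answer case looks like, here is a self-contained sketch with its own `npv` and bisection IRR. These are stand-in implementations written for this example, not the code in lib/irr.mjs, and the bracket `[-0.99, 10]` is an assumed default.

```javascript
// Net present value of periodic cash flows; the first flow is at t = 0.
function npv(rate, flows) {
  return flows.reduce((sum, cf, t) => sum + cf / (1 + rate) ** t, 0);
}

// IRR by bisection on [lo, hi].
// Returns null when NPV does not change sign across the bracket.
function irrBisection(flows, lo = -0.99, hi = 10, tol = 1e-10) {
  let fLo = npv(lo, flows);
  const fHi = npv(hi, flows);
  if (fLo * fHi > 0) return null;
  for (let i = 0; i < 200 && hi - lo > tol; i++) {
    const mid = (lo + hi) / 2;
    const fMid = npv(mid, flows);
    if (fLo * fMid <= 0) {
      hi = mid; // root lies in the lower half
    } else {
      lo = mid; // root lies in the upper half
      fLo = fMid;
    }
  }
  return (lo + hi) / 2;
}

// Known answer: invest 100, receive 150 one period later -> 50%.
const simple = irrBisection([-100, 150]);
// Identity: NPV evaluated at the IRR is approximately zero.
const residual = npv(simple, [-100, 150]);
// All-positive flows have no sign change, so no IRR exists.
const none = irrBisection([100, 50]);
console.log({ simple, residual, none });
```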

…luster test

The cluster convergence loop in per-sheet-eval checked for a fixed point by diffing EVERY cell in the context each iteration, and the context is seeded with the full (multi-million-cell) ground truth, so it was O(all cells) × up to 200 iters. Now it tracks the cells compute() writes (ctx._written) and diffs only those (the cluster's own outputs). Behavior-preserving.

Adds the first circular-cluster coverage: tests/cli/fixtures/cluster-model/ (a synthetic SheetA<->SheetB model converging to a=50,b=50,c=100,d=100) and a case in test-per-sheet-eval that runs it through the convergence loop, asserting 100%.

Measured on the real model: scoped-diff alone is NOT enough. per-sheet-eval re-runs the whole cluster convergence once per member sheet (17x), and engine inaccuracies keep clusters from converging (200 iters). Remaining key fix is single-pass orchestrator eval (converge once, score all members); the fixture is the ready test oracle. Benchmark still uses --skip-clusters until then.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
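
The scoped diff can be sketched as follows. Only the `_written` name comes from the commit message; the context API (`get`/`set`), the formulas, and the seed cells are hypothetical and are not the formulas of the real fixture. They are merely chosen so the cluster settles near 50/50/100.

```javascript
// Context seeded with a large ground truth; set() records which keys compute() wrote.
function makeContext(seed) {
  const values = new Map(Object.entries(seed));
  const written = new Set();
  return {
    get: (k) => values.get(k) ?? 0,
    set: (k, v) => { values.set(k, v); written.add(k); },
    _written: written,
  };
}

// Hypothetical two-sheet circular cluster (SheetA depends on SheetB and vice versa).
const cluster = [
  (ctx) => {
    ctx.set('SheetA!A1', 0.5 * ctx.get('SheetB!A1') + 25);
    ctx.set('SheetA!A2', ctx.get('SheetA!A1') + ctx.get('SheetB!A1'));
  },
  (ctx) => ctx.set('SheetB!A1', 0.5 * ctx.get('SheetA!A1') + 25),
];

// Fixed-point loop that diffs only the cells the cluster has written,
// never the full seeded context.
function convergeScoped(sheets, ctx, maxIters = 200, tol = 1e-9) {
  let comparisons = 0;
  for (let iter = 1; iter <= maxIters; iter++) {
    const before = new Map([...ctx._written].map((k) => [k, ctx.get(k)]));
    for (const compute of sheets) compute(ctx);
    let maxDelta = 0;
    for (const k of ctx._written) {
      comparisons += 1;
      // A cell written for the first time always counts as changed.
      const delta = before.has(k) ? Math.abs(ctx.get(k) - before.get(k)) : Infinity;
      maxDelta = Math.max(maxDelta, delta);
    }
    if (maxDelta < tol) return { converged: true, iters: iter, comparisons };
  }
  return { converged: false, iters: maxIters, comparisons };
}

// Seed: 1000 unrelated cells plus zeroed cluster cells.
const seed = { 'SheetA!A1': 0, 'SheetB!A1': 0 };
for (let r = 1; r <= 1000; r++) seed[`Other!A${r}`] = r;

const ctx = makeContext(seed);
const outcome = convergeScoped(cluster, ctx);
console.log(outcome, ctx.get('SheetA!A1'), ctx.get('SheetB!A1'), ctx.get('SheetA!A2'));
```

Each iteration compares three cells rather than the 1002 in the context, which is the same saving the commit describes at multi-million-cell scale.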

…onsumer

The downstream Mippy agent regenerated both Outpost engines from main and reported back. New ROADMAP section "Now — Outpost regeneration findings":

Confirmed wins vs the old build:
- dates fixed (old leaked ExcelDateTime debug strings, 2,686 in A-1; new = serial numbers, 0 leaks)
- ~42-45% smaller (model-map.json + GT-copy _computed-values.json gone)
- contract maps emitted
- circular refs converge
- a golden-master PASS (regenerated GT reproduces the hand-port's canonical A-1 returns to full float precision, Version Tracker row 22)

New follow-ups:
- generation robustness on big models (dep-graph OOM + init 10-min timeout, issue #23)
- --output-profile to scope artifacts (#22)
- 11,813 _fn() unsupported-function fallbacks per engine (transpiler-coverage accuracy suspect, added to Transpiler Coverage)
- refiner mis-mapping returns to a "UW Comparison" tab (needs canonical-returns-tab recognition)
- empty named-inputs.json when no formula-referenced defined-names exist
- MIP-as-output (#7)

Noted a ready golden-master CI assert (diff committed named-outputs.baseCaseValue).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Fresh-agent entry point: priority-ordered backlog (P0 cluster-once eval -> generation robustness #23 -> _fn() transpiler coverage -> refiner UW-Comparison mis-map -> golden-master CI -> output-profile/large-sheet/pipeline-perf -> Polish), current state (PR #21 open with the foundation; baseline a1 84.3% / a2 85.5% standalone), run commands, and the gotchas. PLAN points to it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ture set

Amendment to make the e2e agent's job explicit: make the full model a reliable calibration oracle: runnable, MIP coefficients exposed as named-outputs, no stubbed value cells.

New priority order (issues on ebootheee/excel-to-engine):
- P1 · #23 + #24 reliably emit a runnable engine.js (fix dep-graph OOM; fail loud, never a partial artifact; lock layout + content hash)
- P2 · #25 pin value-bearing cells (MIP Proceeds, hurdle, participation%, equity basis, valuation/shares) as named-outputs
- P2 · #26 emit _fn-fallbacks.json; assert no value cell uses a stub
- P3 · #22 output-cone scoping (nice-to-have)

Supporting (trustworthiness, off critical path): golden-master CI, refiner UW-Comparison fix, deeper _fn coverage, cluster-once eval.

HANDOFF.md now leads with this; ROADMAP gets a "Now — Mippy calibration oracle" section; project_mippy_contract memory updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rule) Removed the specific Outpost A-1 gross/net MOIC & IRR values, the UW-comparison multiple, and the MIP dollar figure from HANDOFF.md + ROADMAP.md. main is a public OSS repo and CLAUDE.md forbids committing real financials. The findings (golden-master match on Version Tracker row 22; refiner mis-maps to UW Comparison; MIP is a hand-port calibration) are kept; only the figures are removed. Canonical values stay in the gitignored artifacts + local project memory and feed the golden-master test from there. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Public repo: no real financials or participant names (CLAUDE.md). Before merging the next-wave PR:
- Renamed benchmarks/outpost-bench.mjs -> benchmarks/bench.mjs (npm script `bench`); the benchmark now anonymizes model identity in printed + committed output (Model A/B); real dir names stay only in the gitignored detail JSON.
- Scrubbed the real model name from all committed docs (HANDOFF/ROADMAP/PLAN/CHANGELOG/BASELINE) -> "the real PE models" / "Model A/B". Kept the test-e2e4-fixes scrub-guard that asserts template names are generic.
- Removed the real return figures (MOIC/IRR, UW multiple, MIP $) committed earlier; findings stay, numbers live only in gitignored artifacts + local notes.

Full npm test (387) + smoke green; BASELINE.md regenerated anonymized.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
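
One way the Model A/B anonymization and a scrub guard could work, sketched under assumptions: the `dir` and `accuracyPct` fields, the placeholder directory names, and both function names are hypothetical and do not reflect the actual benchmarks/bench.mjs.

```javascript
// Hypothetical raw results keyed by real directory name (placeholders here).
const raw = [
  { dir: 'real-dir-one', accuracyPct: 84.3 },
  { dir: 'real-dir-two', accuracyPct: 85.5 },
];

// Replace each real identity with a positional label: Model A, Model B, ...
// The real directory name is dropped from the committed record entirely.
function anonymize(results) {
  return results.map(({ dir, ...rest }, i) => ({
    model: `Model ${String.fromCharCode(65 + i)}`,
    ...rest,
  }));
}

// Scrub guard: fail if any forbidden name appears in text headed for commit.
function assertScrubbed(text, forbiddenNames) {
  const lower = text.toLowerCase();
  const hit = forbiddenNames.find((name) => lower.includes(name.toLowerCase()));
  if (hit) throw new Error(`forbidden name in committed output: ${hit}`);
}

const committed = anonymize(raw);
assertScrubbed(JSON.stringify(committed), raw.map((r) => r.dir));
console.log(committed);
```

Positional labels keep the committed baseline comparable between runs as long as the models are listed in a stable order.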

ebootheee and others added 3 commits May 28, 2026 17:16

ebootheee changed the title ~~Outpost accuracy benchmark + eval-tooling fixes + searchByLabel (next-wave pt 1)~~ Next-wave: Outpost accuracy benchmark + eval fixes + searchByLabel + lib tests May 28, 2026

ebootheee and others added 5 commits May 28, 2026 19:57

ebootheee merged commit c4f0c61 into main May 29, 2026
2 checks passed

ebootheee mentioned this pull request May 29, 2026

chore(privacy): remove residual real model name from CHANGELOG #27

Merged

ebootheee merged 8 commits into main from feat/next-wave
