
Next-wave: Outpost accuracy benchmark + eval fixes + searchByLabel + lib tests #21

Merged
ebootheee merged 8 commits into main from feat/next-wave on May 29, 2026

ebootheee commented May 28, 2026

Summary

The "next wave" effort, keystone first: a repeatable accuracy + efficacy benchmark over the real ~200 MB Outpost models, the eval-tooling fixes needed to make it produce real numbers, plus the searchByLabel and lib-unit-tests waves. Not self-merging — bringing this for your review (per "come back when it's ready for a main merge"). CI green on ubuntu + windows.

Built branch-per-improvement (each an atomic, revertable commit gated on the benchmark + tests): feat/bench, lib-tests, cluster-eval, all folded here.

Baseline (the A/B/C test) — npm run bench:outpost

benchmarks/outpost-bench.mjs wraps eval/per-sheet-eval.mjs (live engine-vs-ground-truth). Aggregate-only → committed benchmarks/BASELINE.md; full detail → gitignored benchmarks/results/. No cell value/label ever committed.

| Model | Accuracy (standalone sheets) | Skipped | Eval time |
|---|---|---|---|
| outpost-a1 | 84.3% | 17-sheet cluster + 190 MB PP&E | 45s |
| outpost-a2 | 85.5% | 17-sheet cluster + PP&E | 48s |

What landed (all tested)

  • per-sheet-eval Windows crash FIXED — it imported sheet compute() by bare absolute path ("C:\..."), which Node ESM rejects on Windows → every sheet crashed at 0% on Windows + the real engines. Now pathToFileURL(). Wasn't in CI; now guarded by tests/cli/test-per-sheet-eval.mjs on windows-latest.
  • Scoped cluster-convergence diff — the convergence loop diffed all ~6M seeded cells per iteration; now only the cluster's written cells (ctx._written). + first-ever circular-cluster test + fixture (tests/cli/fixtures/cluster-model/).
  • --skip-clusters flag: records circular-cluster sheets as skipped. The benchmark uses it until cluster-once eval lands.
  • searchByLabel lazy numerics — query/carry probe the matched row's columns instead of scanning the whole GT; directed caseColumn probe so a far column is never missed. Behavior-preserving.
  • lib/ unit tests: tests/lib/test-lib.mjs (43 known-answer cases: irr/NPV/XIRR, waterfall American/European/MOIC-hurdle + conservation invariant, calibration, sensitivity). Polish→Publish's first item.
  • gitignore: engines/ (real models), benchmarks/results/, _eval_tmp/.
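
For reviewers who have not hit the Windows ESM issue: a dynamic `import()` of a bare absolute path such as `C:\...` is rejected by Node because the drive letter is parsed as a URL scheme, while a `file://` URL works on every platform. A minimal sketch of the fix pattern follows. `loadCompute` and the throwaway sheet module are hypothetical names for illustration, not the actual per-sheet-eval code.

```javascript
import { mkdtempSync, writeFileSync } from 'node:fs';
import { tmpdir } from 'node:os';
import { join } from 'node:path';
import { pathToFileURL } from 'node:url';

// Hypothetical helper: load a sheet module's compute() from an absolute path.
// Converting the path to a file:// URL first keeps import() portable, since
// Node ESM on Windows rejects a bare "C:\..." specifier.
async function loadCompute(absPath) {
  const mod = await import(pathToFileURL(absPath).href);
  return mod.compute;
}

// Demo: write a throwaway sheet module to a temp dir and load it by absolute path.
const dir = mkdtempSync(join(tmpdir(), 'sheet-'));
const sheetPath = join(dir, 'sheet.mjs');
writeFileSync(sheetPath, 'export function compute(ctx) { return ctx.x * 2; }\n');

const compute = await loadCompute(sheetPath);
console.log(compute({ x: 21 })); // 42
```

The same call is a no-op in spirit on Linux and macOS, which is why the bug only surfaced once the eval ran on windows-latest.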

Wave status

  • ✅ benchmark + baseline · ✅ searchByLabel · ✅ lib unit tests
  • 🟡 Accuracy blockers — diagnosed + partially fixed (eval Windows fix, scoped-diff, fixture). The keystone remaining = cluster-once eval: the cluster is 17/21 sheets and per-sheet-eval re-runs the whole convergence once per member (17×) → won't finish on the real model. Fix = one task per cluster (converge once, score all members); the fixture is the test oracle. Then the array-formula Headcount sheet (inside the cluster) and the 190 MB PP&E large-sheet eval become reachable.
  • 🔜 Pipeline perf (generate detectors / maps cell-types on ~6M cells)
  • 🔜 Polish→Publish remainder (npm publish prep, example project, contributing guide)
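
The cluster-once idea described above (converge once, score all members) could look roughly like the sketch below. Everything here is an assumption for illustration: the sheet shape (`name`, `outputs`, `compute`), the `Map` context, and the toy formulas are invented and do not mirror per-sheet-eval or the fixture.

```javascript
// Iterate every member's compute() until no output moves by more than `tol`.
function convergeCluster(sheets, ctx, { maxIters = 200, tol = 1e-9 } = {}) {
  for (let iter = 1; iter <= maxIters; iter++) {
    let maxDelta = 0;
    for (const sheet of sheets) {
      for (const [key, value] of Object.entries(sheet.compute(ctx))) {
        const old = ctx.get(key);
        const delta = old === undefined ? Infinity : Math.abs(value - old);
        maxDelta = Math.max(maxDelta, delta);
        ctx.set(key, value);
      }
    }
    if (maxDelta <= tol) return { converged: true, iters: iter };
  }
  return { converged: false, iters: maxIters };
}

// One task per cluster: run the expensive convergence once, then score
// every member sheet against ground truth from the same converged context.
function evalClusterOnce(sheets, groundTruth, { scoreTol = 1e-6, ...opts } = {}) {
  const ctx = new Map();
  const convergence = convergeCluster(sheets, ctx, opts);
  const scores = sheets.map((sheet) => {
    const hits = sheet.outputs.filter(
      (k) => Math.abs(ctx.get(k) - groundTruth[k]) <= scoreTol,
    ).length;
    return { sheet: sheet.name, accuracy: hits / sheet.outputs.length };
  });
  return { convergence, scores };
}

// Toy circular pair (made-up formulas) with fixed point a = 50, b = 50.
const sheets = [
  { name: 'SheetA', outputs: ['a'], compute: (ctx) => ({ a: 0.5 * (ctx.get('b') ?? 0) + 25 }) },
  { name: 'SheetB', outputs: ['b'], compute: (ctx) => ({ b: 0.5 * (ctx.get('a') ?? 0) + 25 }) },
];
const result = evalClusterOnce(sheets, { a: 50, b: 50 });
console.log(result.convergence.converged, result.scores);
```

With N member sheets this runs the convergence loop once instead of N times, which is the whole point of the proposed change.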

Validation

Full npm test (now 387 JS assertions incl. lib 43, per-sheet-eval 10, refine 14, init-shared-gt 8) + smoke 78/78 green. CI ubuntu + windows green on every commit.

🤖 Generated with Claude Code

ebootheee and others added 3 commits May 28, 2026 17:16
…chByLabel lazy numerics

Keystone of the next-wave effort: a repeatable accuracy + efficacy benchmark
over the real ~200 MB Outpost models, plus fixes to the eval tooling that was
silently broken on them.

Benchmark (benchmarks/outpost-bench.mjs, `npm run bench:outpost`):
- Wraps eval/per-sheet-eval.mjs (live engine-vs-ground-truth) per model;
  reports overall accuracy + per-sheet pass/skip + timings. Aggregate-only
  results -> committed benchmarks/BASELINE.md; full detail -> gitignored
  benchmarks/results/. No cell value/label is ever committed.
- Baseline: outpost-a1 84.3%, outpost-a2 85.5% on standalone sheets (17-sheet
  cluster + 190 MB PP&E skipped for now).

per-sheet-eval (wasn't in CI -> bugs went unnoticed):
- Windows crash FIXED: it imported each sheet's compute() by a bare absolute
  path ("C:\..."), which Node ESM rejects on Windows -> every sheet crashed at
  load (0% accuracy) on Windows + the real engines. Now uses pathToFileURL().
  New tests/cli/test-per-sheet-eval.mjs (6) guards it; CI runs on windows-latest.
- --skip-clusters flag: record circular-cluster sheets as skipped (the current
  convergence re-runs the whole cluster once per member -> O(cluster²),
  infeasible on big models) pending the single-pass orchestrator eval.

searchByLabel (query/carry): probe the matched row's columns on demand instead
of scanning the whole GT per row, with a directed caseColumn probe so a far
scenario column is never missed. Behavior-preserving.
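
A rough sketch of the lazy probe described above, under loudly stated assumptions: the ground truth is modeled as a `Map` keyed by `Sheet!A1`-style addresses, and `rowNumerics`, `maxCols`, and the key format are hypothetical, not the real searchByLabel internals.

```javascript
// Convert a 1-based column index to spreadsheet letters (1 -> A, 27 -> AA).
function colLetter(n) {
  let s = '';
  while (n > 0) {
    const rem = (n - 1) % 26;
    s = String.fromCharCode(65 + rem) + s;
    n = Math.floor((n - 1) / 26);
  }
  return s;
}

// Probe only the matched row: the first `maxCols` columns, plus the
// caseColumn explicitly so a far-right scenario column is still read.
function rowNumerics(gt, sheet, row, { maxCols = 30, caseColumn } = {}) {
  const cols = [];
  for (let c = 1; c <= maxCols; c++) cols.push(colLetter(c));
  if (caseColumn && !cols.includes(caseColumn)) cols.push(caseColumn);

  const out = [];
  for (const col of cols) {
    const v = gt.get(`${sheet}!${col}${row}`);
    if (typeof v === 'number') out.push({ col, value: v });
  }
  return out;
}

// Assumed ground-truth shape for the demo.
const gt = new Map([
  ['Summary!A5', 'Net IRR'],
  ['Summary!C5', 0.12],
  ['Summary!BZ5', 0.2], // far scenario column, outside the default window
]);
console.log(rowNumerics(gt, 'Summary', 5, { caseColumn: 'BZ' }));
```

The cost per matched row is bounded by the number of probed columns rather than by the size of the ground truth.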

Findings (now in ROADMAP): the cluster is 17/21 sheets + redundantly evaluated;
the 190 MB PP&E sheet exceeds the 150 MB limit; _computed-values.json is a
byte-identical GT copy (not an accuracy source).

gitignore: engines/ (already), benchmarks/results/, _eval_tmp/.

Full `npm test` (incl. new per-sheet-eval guard) + smoke green.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…itivity

The shared financial libs had no direct coverage. Add tests/lib/test-lib.mjs
(43 assertions), wired into `npm test` (runs first):

- irr.mjs: NPV identities; IRR of classic series (-100->+150=50%, -1000 then
  200x8 ~= 11.89%, 3y bullet); Newton == bisection; NPV(IRR) ~= 0; null on no
  sign change; XIRR on dated flows.
- waterfall.mjs: American 80/20 + 8% pref + catch-up (LP/GP splits, carry %),
  no-catch-up, loss case, flat-MOIC-hurdle promote (+ hold-period invariance),
  European builder; LP+GP = distributed conservation across structures.
- calibration.mjs: nested get/set; validateOutputs pass/fail + suggested factor.
- sensitivity.mjs: flattenOutputs group/type filtering.
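
For context on what a known-answer IRR case looks like, here is a minimal sketch using plain bisection. It is not the code in irr.mjs; the function names, bracket, and tolerances are assumptions chosen for the example.

```javascript
// Net present value of periodic flows, with flows[0] at t = 0.
function npv(rate, flows) {
  return flows.reduce((sum, cf, t) => sum + cf / Math.pow(1 + rate, t), 0);
}

// IRR by bisection on [lo, hi]. Returns null when the flows have no sign
// change, or when the bracket does not straddle a root.
function irr(flows, { lo = -0.99, hi = 10, tol = 1e-10, maxIters = 200 } = {}) {
  const hasPos = flows.some((cf) => cf > 0);
  const hasNeg = flows.some((cf) => cf < 0);
  if (!hasPos || !hasNeg) return null;

  let fLo = npv(lo, flows);
  const fHi = npv(hi, flows);
  if (fLo * fHi > 0) return null;

  for (let i = 0; i < maxIters; i++) {
    const mid = (lo + hi) / 2;
    const fMid = npv(mid, flows);
    if (Math.abs(fMid) < tol || (hi - lo) / 2 < tol) return mid;
    if (fLo * fMid < 0) {
      hi = mid; // root lies in [lo, mid]
    } else {
      lo = mid; // root lies in [mid, hi]
      fLo = fMid;
    }
  }
  return (lo + hi) / 2;
}

const rate = irr([-100, 150]); // classic case: invest 100, receive 150 a year later
console.log(rate, npv(rate, [-100, 150])); // rate is approximately 0.5, NPV approximately 0
console.log(irr([100, 50])); // null: no sign change
```

The three checks shown (known answer, NPV(IRR) near zero, null on no sign change) correspond to the kinds of identities listed for irr.mjs above.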

Polish->Publish: first item (lib/ unit tests) done; CI guards them now.
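
To illustrate the conservation invariant (LP + GP = distributed) checked across waterfall structures, here is a simplified single-deal sketch with an 80/20 split, 8% preferred return, and full GP catch-up. It is an assumption-laden toy, not the waterfall.mjs API: the parameter names and tier mechanics are chosen for the example only.

```javascript
// Simplified waterfall: return of capital, compounded pref, 100% GP
// catch-up, then a residual split. All names here are hypothetical.
function waterfall({ contributed, distributed, prefRate = 0.08, carry = 0.2, years = 1 }) {
  let remaining = distributed;
  let lp = 0;
  let gp = 0;

  // Tier 1: return of capital to the LP.
  const roc = Math.min(remaining, contributed);
  lp += roc;
  remaining -= roc;

  // Tier 2: preferred return to the LP.
  const prefDue = contributed * (Math.pow(1 + prefRate, years) - 1);
  const pref = Math.min(remaining, prefDue);
  lp += pref;
  remaining -= pref;

  // Tier 3: GP catch-up until the GP holds `carry` of profit paid so far.
  // Solving gp = carry * (pref + gp) gives gp = carry * pref / (1 - carry).
  const catchUp = Math.min(remaining, (carry * pref) / (1 - carry));
  gp += catchUp;
  remaining -= catchUp;

  // Tier 4: residual split between LP and GP.
  lp += remaining * (1 - carry);
  gp += remaining * carry;
  return { lp, gp };
}

const gain = waterfall({ contributed: 100, distributed: 200 });
const loss = waterfall({ contributed: 100, distributed: 80 });
console.log(gain, loss);
```

In the gain case the GP ends with about 20% of the 100 profit, in the loss case the GP receives nothing, and in both the two shares sum to the amount distributed up to floating-point rounding.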

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…luster test

The cluster convergence loop in per-sheet-eval checked for a fixed point by
diffing EVERY cell in the context each iteration — and the context is seeded
with the full (multi-million-cell) ground truth, so it was O(all cells) × up to
200 iters. Now it tracks the cells compute() writes (ctx._written) and diffs
only those (the cluster's own outputs). Behavior-preserving.
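
A minimal sketch of the scoped-diff idea, assuming a context whose setter records every key it writes. Only the `_written` name comes from this commit; `makeContext`, `converge`, and the toy formulas are hypothetical and do not reproduce per-sheet-eval or the fixture.

```javascript
// Context seeded with ground truth; set() records which keys compute() wrote.
function makeContext(seed) {
  const values = new Map(Object.entries(seed));
  const written = new Set();
  return {
    _written: written,
    get: (key) => values.get(key),
    set: (key, value) => {
      values.set(key, value);
      written.add(key);
    },
  };
}

// Fixed-point loop that diffs only the written keys, not every seeded cell.
function converge(ctx, computeFns, { maxIters = 200, tol = 1e-9 } = {}) {
  const prev = new Map();
  for (let iter = 1; iter <= maxIters; iter++) {
    for (const fn of computeFns) fn(ctx);
    let maxDelta = 0;
    for (const key of ctx._written) {
      const cur = ctx.get(key);
      const old = prev.get(key);
      const delta = old === undefined ? Infinity : Math.abs(cur - old);
      if (delta > maxDelta) maxDelta = delta;
      prev.set(key, cur);
    }
    if (maxDelta <= tol) return { converged: true, iters: iter };
  }
  return { converged: false, iters: maxIters };
}

// Seed 10,000 unrelated cells to stand in for the large ground truth.
const seed = {};
for (let i = 0; i < 10000; i++) seed[`Other!A${i + 1}`] = i;

const ctx = makeContext(seed);
const sheetA = (c) => c.set('A!a', 0.5 * (c.get('B!b') ?? 0) + 25);
const sheetB = (c) => c.set('B!b', 0.5 * (c.get('A!a') ?? 0) + 25);

const outcome = converge(ctx, [sheetA, sheetB]);
console.log(outcome.converged, ctx._written.size, ctx.get('A!a'), ctx.get('B!b'));
```

Each iteration compares two cells here instead of 10,002, so the per-iteration cost scales with the cluster's outputs rather than with the seeded context.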

Adds the first circular-cluster coverage: tests/cli/fixtures/cluster-model/ (a
synthetic SheetA<->SheetB model converging to a=50,b=50,c=100,d=100) and a case
in test-per-sheet-eval that runs it through the convergence loop, asserting 100%.

Measured on the real model: scoped-diff alone is NOT enough — per-sheet-eval
re-runs the whole cluster convergence once per member sheet (17x), and engine
inaccuracies keep clusters from converging (200 iters). Remaining key fix is
single-pass orchestrator eval (converge once, score all members); the fixture
is the ready test oracle. Benchmark still uses --skip-clusters until then.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ebootheee changed the title from "Outpost accuracy benchmark + eval-tooling fixes + searchByLabel (next-wave pt 1)" to "Next-wave: Outpost accuracy benchmark + eval fixes + searchByLabel + lib tests" on May 28, 2026
ebootheee and others added 5 commits May 28, 2026 19:57
…onsumer

The downstream Mippy agent regenerated both Outpost engines from main and
reported back. New ROADMAP section "Now — Outpost regeneration findings":

Confirmed wins vs the old build — dates fixed (old leaked ExcelDateTime debug
strings, 2,686 in A-1; new = serial numbers, 0 leaks), ~42-45% smaller
(model-map.json + GT-copy _computed-values.json gone), contract maps emitted,
circular refs converge, and a golden-master PASS (regenerated GT reproduces the
hand-port's canonical A-1 returns to full float precision, Version Tracker row 22).

New follow-ups: generation robustness on big models (dep-graph OOM + init 10-min
timeout, issue #23), --output-profile to scope artifacts (#22), 11,813 _fn()
unsupported-function fallbacks per engine (transpiler-coverage accuracy suspect,
added to Transpiler Coverage), refiner mis-mapping returns to a "UW Comparison"
tab (needs canonical-returns-tab recognition), empty named-inputs.json when no
formula-referenced defined-names exist, MIP-as-output (#7). Noted a ready
golden-master CI assert (diff committed named-outputs.baseCaseValue).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fresh-agent entry point: priority-ordered backlog (P0 cluster-once eval ->
generation robustness #23 -> _fn() transpiler coverage -> refiner UW-Comparison
mis-map -> golden-master CI -> output-profile/large-sheet/pipeline-perf ->
Polish), current state (PR #21 open with the foundation; baseline a1 84.3% /
a2 85.5% standalone), run commands, and the gotchas. PLAN points to it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ture set

Amendment to make the e2e agent's job explicit: make the full model a reliable
calibration oracle — runnable, MIP coefficients exposed as named-outputs, no
stubbed value cells. New priority order (issues on ebootheee/excel-to-engine):

  P1 · #23 + #24  reliably emit a runnable engine.js (fix dep-graph OOM; fail
                  loud, never a partial artifact; lock layout + content hash)
  P2 · #25        pin value-bearing cells (MIP Proceeds, hurdle, participation%,
                  equity basis, valuation/shares) as named-outputs
  P2 · #26        emit _fn-fallbacks.json; assert no value cell uses a stub
  P3 · #22        output-cone scoping (nice-to-have)

Supporting (trustworthiness, off critical path): golden-master CI, refiner
UW-Comparison fix, deeper _fn coverage, cluster-once eval.

HANDOFF.md now leads with this; ROADMAP gets a "Now — Mippy calibration oracle"
section; project_mippy_contract memory updated.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rule)

Removed the specific Outpost A-1 gross/net MOIC & IRR values, the UW-comparison
multiple, and the MIP dollar figure from HANDOFF.md + ROADMAP.md. main is a
public OSS repo and CLAUDE.md forbids committing real financials. The findings
(golden-master match on Version Tracker row 22; refiner mis-maps to UW
Comparison; MIP is a hand-port calibration) are kept; only the figures are
removed. Canonical values stay in the gitignored artifacts + local project
memory and feed the golden-master test from there.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Public repo: no real financials or participant names (CLAUDE.md). Before merging
the next-wave PR:

- Renamed benchmarks/outpost-bench.mjs -> benchmarks/bench.mjs (npm script `bench`);
  the benchmark now anonymizes model identity in printed + committed output
  (Model A/B) — real dir names stay only in the gitignored detail JSON.
- Scrubbed the real model name from all committed docs (HANDOFF/ROADMAP/PLAN/
  CHANGELOG/BASELINE) -> "the real PE models" / "Model A/B". Kept the
  test-e2e4-fixes scrub-guard that asserts template names are generic.
- Removed the real return figures (MOIC/IRR, UW multiple, MIP $) committed
  earlier; findings stay, numbers live only in gitignored artifacts + local notes.

Full npm test (387) + smoke green; BASELINE.md regenerated anonymized.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ebootheee merged commit c4f0c61 into main on May 29, 2026
2 checks passed