Next-wave: Outpost accuracy benchmark + eval fixes + searchByLabel + lib tests #21
Merged
…chByLabel lazy numerics
Keystone of the next-wave effort: a repeatable accuracy + efficacy benchmark
over the real ~200 MB Outpost models, plus fixes to the eval tooling that was
silently broken on them.
Benchmark (benchmarks/outpost-bench.mjs, `npm run bench:outpost`):
- Wraps eval/per-sheet-eval.mjs (live engine-vs-ground-truth) per model;
reports overall accuracy + per-sheet pass/skip + timings. Aggregate-only
results -> committed benchmarks/BASELINE.md; full detail -> gitignored
benchmarks/results/. No cell value/label is ever committed.
- Baseline: outpost-a1 84.3%, outpost-a2 85.5% on standalone sheets (17-sheet
cluster + 190 MB PP&E skipped for now).
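
A minimal sketch of the aggregate-only idea above, assuming a hypothetical per-sheet result shape (`status`, `matched`, `total`, `ms`, `cells`); the real fields of eval/per-sheet-eval.mjs are not shown in this message. The point is that the committed summary carries only counts and timings, never cell values or labels.

```javascript
// Hypothetical per-sheet result shape (an assumption for illustration):
//   { sheet, status: 'pass' | 'fail' | 'skip', matched, total, ms, cells: [...] }
// `cells` stands for the full detail that must stay in the gitignored results dir.
function aggregate(modelLabel, sheetResults) {
  const scored = sheetResults.filter((r) => r.status !== 'skip');
  const matched = scored.reduce((n, r) => n + r.matched, 0);
  const total = scored.reduce((n, r) => n + r.total, 0);
  return {
    model: modelLabel,
    // One decimal place; null when nothing was scored.
    accuracyPct: total ? Math.round((1000 * matched) / total) / 10 : null,
    sheets: {
      pass: sheetResults.filter((r) => r.status === 'pass').length,
      fail: sheetResults.filter((r) => r.status === 'fail').length,
      skip: sheetResults.filter((r) => r.status === 'skip').length,
    },
    totalMs: sheetResults.reduce((n, r) => n + r.ms, 0),
  };
}

// Synthetic data only; none of these numbers come from a real model.
const summary = aggregate('Model A', [
  { sheet: 's1', status: 'pass', matched: 90, total: 100, ms: 12, cells: [{ v: 1 }] },
  { sheet: 's2', status: 'fail', matched: 75, total: 100, ms: 20, cells: [{ v: 2 }] },
  { sheet: 's3', status: 'skip', matched: 0, total: 0, ms: 0, cells: [] },
]);
console.log(JSON.stringify(summary));
```

Skipped sheets are excluded from the accuracy denominator here, which is one reasonable reading of "accuracy on standalone sheets"; the actual benchmark may count them differently.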
per-sheet-eval (wasn't in CI -> bugs went unnoticed):
- Windows crash FIXED: it imported each sheet's compute() by a bare absolute
path ("C:\..."), which Node ESM rejects on Windows -> every sheet crashed at
load (0% accuracy) on Windows + the real engines. Now uses pathToFileURL().
New tests/cli/test-per-sheet-eval.mjs (6) guards it; CI runs on windows-latest.
- --skip-clusters flag: record circular-cluster sheets as skipped (the current
convergence re-runs the whole cluster once per member -> O(cluster²),
infeasible on big models) pending the single-pass orchestrator eval.
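
The Windows fix above can be illustrated as follows. This is a sketch, not the code in per-sheet-eval: `loadCompute` and the throwaway SheetA.mjs module are hypothetical names introduced for the demo. `pathToFileURL` is the real `node:url` function the fix relies on.

```javascript
import { mkdtemp, writeFile } from 'node:fs/promises';
import { tmpdir } from 'node:os';
import path from 'node:path';
import { pathToFileURL } from 'node:url';

// Hypothetical helper: load a sheet module's compute() from an absolute path.
async function loadCompute(absPath) {
  // import("C:\\...") fails on Windows because the ESM loader parses "c:" as a
  // URL scheme. Converting to a file:// URL works on every platform.
  const mod = await import(pathToFileURL(absPath).href);
  return mod.compute;
}

// Demo with a throwaway module written to the OS temp directory.
const dir = await mkdtemp(path.join(tmpdir(), 'sheet-'));
const sheetPath = path.join(dir, 'SheetA.mjs');
await writeFile(sheetPath, 'export function compute(ctx) { return ctx.x * 2; }\n');

const compute = await loadCompute(sheetPath);
const demoResult = compute({ x: 21 });
console.log(demoResult); // 42
```

On POSIX systems a bare absolute path happens to work with `import()`, which is why a bug like this can go unnoticed until a Windows runner exercises it.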
searchByLabel (query/carry): probe the matched row's columns on demand instead
of scanning the whole GT per row, with a directed caseColumn probe so a far
scenario column is never missed. Behavior-preserving.
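
A rough sketch of on-demand row probing, under stated assumptions: the ground truth is modeled as a `Map` keyed by `Sheet!A1`-style addresses, and `probeRowNumerics`, `window`, and the option shape are hypothetical names, not the searchByLabel API. Only `caseColumn` comes from the text above.

```javascript
// Convert a 1-based column index to A1-style letters (1 -> A, 27 -> AA).
function colLetters(n) {
  let s = '';
  while (n > 0) {
    const r = (n - 1) % 26;
    s = String.fromCharCode(65 + r) + s;
    n = Math.floor((n - 1) / 26);
  }
  return s;
}

// Probe only the matched row: a small window of leading columns plus one
// directed lookup at caseColumn, instead of scanning every GT entry per row.
function probeRowNumerics(gt, sheet, row, { window = 12, caseColumn } = {}) {
  const hits = [];
  const seen = new Set();
  const probe = (col) => {
    if (seen.has(col)) return;
    seen.add(col);
    const value = gt.get(`${sheet}!${col}${row}`);
    if (typeof value === 'number') hits.push({ col, value });
  };
  for (let c = 1; c <= window; c++) probe(colLetters(c));
  if (caseColumn) probe(caseColumn); // the far scenario column is never missed
  return hits;
}

// Synthetic ground truth; labels and values are made up.
const gt = new Map([
  ['Returns!A5', 'Some label'],
  ['Returns!C5', 0.2],
  ['Returns!AZ5', 0.31],
]);
const hits = probeRowNumerics(gt, 'Returns', 5, { caseColumn: 'AZ' });
console.log(hits);
```

Each probe is a constant-time map lookup, so the cost per matched row depends on the window size rather than on the size of the ground truth.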
Findings (now in ROADMAP): the cluster is 17/21 sheets + redundantly evaluated;
the 190 MB PP&E sheet exceeds the 150 MB limit; _computed-values.json is a
byte-identical GT copy (not an accuracy source).
gitignore: engines/ (already), benchmarks/results/, _eval_tmp/.
Full `npm test` (incl. new per-sheet-eval guard) + smoke green.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…itivity
The shared financial libs had no direct coverage. Add tests/lib/test-lib.mjs (43 assertions), wired into `npm test` (runs first):
- irr.mjs: NPV identities; IRR of classic series (-100 -> +150 = 50%, -1000 then 200x8 ~= 11.81%, 3y bullet); Newton == bisection; NPV(IRR) ~= 0; null on no sign change; XIRR on dated flows.
- waterfall.mjs: American 80/20 + 8% pref + catch-up (LP/GP splits, carry %), no-catch-up, loss case, flat-MOIC-hurdle promote (+ hold-period invariance), European builder; LP + GP = distributed conservation across structures.
- calibration.mjs: nested get/set; validateOutputs pass/fail + suggested factor.
- sensitivity.mjs: flattenOutputs group/type filtering.
Polish->Publish: first item (lib/ unit tests) done; CI guards them now.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
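
To make the known-answer style of the irr.mjs cases concrete, here is a self-contained sketch. `npv` and `irrBisect` are written for this example and are not the functions exported by irr.mjs; the bracket of -99% to 1000% is an assumption.

```javascript
// Net present value of periodic cash flows; flows[0] occurs at t = 0.
function npv(rate, flows) {
  return flows.reduce((acc, cf, t) => acc + cf / (1 + rate) ** t, 0);
}

// IRR by bisection. Returns null when the flows never change sign or the
// assumed bracket [lo, hi] does not straddle a root.
function irrBisect(flows, lo = -0.99, hi = 10, tol = 1e-12) {
  const hasPos = flows.some((cf) => cf > 0);
  const hasNeg = flows.some((cf) => cf < 0);
  if (!hasPos || !hasNeg) return null;
  let fLo = npv(lo, flows);
  const fHi = npv(hi, flows);
  if (fLo * fHi > 0) return null;
  for (let i = 0; i < 200; i++) {
    const mid = (lo + hi) / 2;
    const fMid = npv(mid, flows);
    if (Math.abs(fMid) < tol || (hi - lo) / 2 < tol) return mid;
    if (fLo * fMid < 0) {
      hi = mid; // root lies in the lower half
    } else {
      lo = mid; // root lies in the upper half
      fLo = fMid;
    }
  }
  return (lo + hi) / 2;
}

// Known answer: -100 now, +150 in one period, so 150 / (1 + r) = 100 and r = 50%.
const simple = irrBisect([-100, 150]);
// Identity check: NPV evaluated at the IRR should be ~0.
const annuity = [-1000, ...Array(8).fill(200)];
const annuityIrr = irrBisect(annuity);
const residual = npv(annuityIrr, annuity);
const noSignChange = irrBisect([100, 100]);
console.log(simple.toFixed(6), Math.abs(residual) < 1e-6, noSignChange);
```

Checking `NPV(IRR) ~= 0` alongside a closed-form case is useful because it validates the solver on series where no hand-computed answer is available.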
…luster test
The cluster convergence loop in per-sheet-eval checked for a fixed point by diffing EVERY cell in the context each iteration, and the context is seeded with the full (multi-million-cell) ground truth, so it was O(all cells) × up to 200 iters. Now it tracks the cells compute() writes (ctx._written) and diffs only those (the cluster's own outputs). Behavior-preserving.
Adds the first circular-cluster coverage: tests/cli/fixtures/cluster-model/ (a synthetic SheetA<->SheetB model converging to a=50, b=50, c=100, d=100) and a case in test-per-sheet-eval that runs it through the convergence loop, asserting 100%.
Measured on the real model: scoped-diff alone is NOT enough. per-sheet-eval re-runs the whole cluster convergence once per member sheet (17x), and engine inaccuracies keep clusters from converging (200 iters). Remaining key fix is single-pass orchestrator eval (converge once, score all members); the fixture is the ready test oracle. Benchmark still uses --skip-clusters until then.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
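
The scoped fixed-point check described above might look roughly like this. Everything except the `ctx._written` idea and the target values (a=50, b=50, c=100, d=100) is an assumption: `makeCtx`, `converge`, and the two sheet formulas are invented for the sketch, since the fixture's actual formulas are not shown here.

```javascript
// Hypothetical context: values seeded from ground truth, plus a record of
// every cell that compute() functions write.
function makeCtx(seed) {
  const values = new Map(Object.entries(seed));
  const written = new Set();
  return {
    values,
    _written: written,
    get: (key) => values.get(key) ?? 0,
    set(key, value) {
      values.set(key, value);
      written.add(key);
    },
  };
}

// Iterate until the cells the cluster itself wrote stop changing. Only
// ctx._written is snapshotted and compared, never the whole context.
function converge(ctx, computes, { maxIters = 200, tol = 1e-9 } = {}) {
  for (let iter = 1; iter <= maxIters; iter++) {
    const before = new Map([...ctx._written].map((k) => [k, ctx.values.get(k)]));
    for (const compute of computes) compute(ctx);
    let changed = false;
    for (const key of ctx._written) {
      const prev = before.get(key);
      if (prev === undefined || Math.abs(ctx.values.get(key) - prev) > tol) {
        changed = true;
        break;
      }
    }
    if (!changed) return { converged: true, iters: iter };
  }
  return { converged: false, iters: maxIters };
}

// Invented circular formulas with fixed point a=50, b=50, c=100, d=100.
const sheetA = (ctx) => {
  const a = (ctx.get('SheetB!b') + 50) / 2;
  ctx.set('SheetA!a', a);
  ctx.set('SheetA!c', a + ctx.get('SheetB!b'));
};
const sheetB = (ctx) => {
  const b = (ctx.get('SheetA!a') + 50) / 2;
  ctx.set('SheetB!b', b);
  ctx.set('SheetB!d', 2 * b);
};

// Seed with many unrelated cells to mimic a large ground truth.
const seed = { 'SheetA!a': 0, 'SheetB!b': 0 };
for (let i = 0; i < 10000; i++) seed[`Other!x${i}`] = i;
const ctx = makeCtx(seed);
const result = converge(ctx, [sheetA, sheetB]);
console.log(result.converged, ctx._written.size);
```

The per-iteration comparison touches four cells regardless of how many cells were seeded, which is the cost reduction the commit describes. It does not address the separate problem of re-running the loop once per member sheet.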
…onsumer
The downstream Mippy agent regenerated both Outpost engines from main and reported back. New ROADMAP section "Now — Outpost regeneration findings".
Confirmed wins vs the old build:
- dates fixed (old leaked ExcelDateTime debug strings, 2,686 in A-1; new = serial numbers, 0 leaks)
- ~42-45% smaller (model-map.json + GT-copy _computed-values.json gone)
- contract maps emitted
- circular refs converge
- golden-master PASS (regenerated GT reproduces the hand-port's canonical A-1 returns to full float precision, Version Tracker row 22)
New follow-ups:
- generation robustness on big models (dep-graph OOM + init 10-min timeout, issue #23)
- --output-profile to scope artifacts (#22)
- 11,813 _fn() unsupported-function fallbacks per engine (transpiler-coverage accuracy suspect, added to Transpiler Coverage)
- refiner mis-mapping returns to a "UW Comparison" tab (needs canonical-returns-tab recognition)
- empty named-inputs.json when no formula-referenced defined-names exist
- MIP-as-output (#7)
Noted a ready golden-master CI assert (diff committed named-outputs.baseCaseValue).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fresh-agent entry point: priority-ordered backlog (P0 cluster-once eval -> generation robustness #23 -> _fn() transpiler coverage -> refiner UW-Comparison mis-map -> golden-master CI -> output-profile/large-sheet/pipeline-perf -> Polish), current state (PR #21 open with the foundation; baseline a1 84.3% / a2 85.5% standalone), run commands, and the gotchas. PLAN points to it.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ture set
Amendment to make the e2e agent's job explicit: make the full model a reliable calibration oracle, meaning runnable, MIP coefficients exposed as named-outputs, no stubbed value cells.
New priority order (issues on ebootheee/excel-to-engine):
- P1 · #23 + #24: reliably emit a runnable engine.js (fix dep-graph OOM; fail loud, never a partial artifact; lock layout + content hash)
- P2 · #25: pin value-bearing cells (MIP Proceeds, hurdle, participation%, equity basis, valuation/shares) as named-outputs
- P2 · #26: emit _fn-fallbacks.json; assert no value cell uses a stub
- P3 · #22: output-cone scoping (nice-to-have)
Supporting (trustworthiness, off critical path): golden-master CI, refiner UW-Comparison fix, deeper _fn coverage, cluster-once eval.
HANDOFF.md now leads with this; ROADMAP gets a "Now — Mippy calibration oracle" section; project_mippy_contract memory updated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…rule)
Removed the specific Outpost A-1 gross/net MOIC & IRR values, the UW-comparison multiple, and the MIP dollar figure from HANDOFF.md + ROADMAP.md. main is a public OSS repo and CLAUDE.md forbids committing real financials. The findings (golden-master match on Version Tracker row 22; refiner mis-maps to UW Comparison; MIP is a hand-port calibration) are kept; only the figures are removed. Canonical values stay in the gitignored artifacts + local project memory and feed the golden-master test from there.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Public repo: no real financials or participant names (CLAUDE.md). Before merging the next-wave PR:
- Renamed benchmarks/outpost-bench.mjs -> benchmarks/bench.mjs (npm script `bench`); the benchmark now anonymizes model identity in printed + committed output (Model A/B). Real dir names stay only in the gitignored detail JSON.
- Scrubbed the real model name from all committed docs (HANDOFF/ROADMAP/PLAN/CHANGELOG/BASELINE) -> "the real PE models" / "Model A/B". Kept the test-e2e4-fixes scrub-guard that asserts template names are generic.
- Removed the real return figures (MOIC/IRR, UW multiple, MIP $) committed earlier; findings stay, numbers live only in gitignored artifacts + local notes.
Full npm test (387) + smoke green; BASELINE.md regenerated anonymized.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
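
One way the Model A/B anonymization could work, as a sketch only: `anonymizeModels`, `toCommittedSummary`, and the `modelDir` field are hypothetical names, and the directory names below are placeholders rather than real model names.

```javascript
// Map real model directory names to stable generic labels (Model A, Model B, ...).
// Sorting first keeps the labels deterministic across runs. This simple scheme
// only covers up to 26 models.
function anonymizeModels(dirNames) {
  const sorted = [...dirNames].sort();
  return new Map(sorted.map((name, i) => [name, `Model ${String.fromCharCode(65 + i)}`]));
}

// Build the record that is printed and committed: the real directory name is
// dropped and replaced by its label. Full detail would be written elsewhere,
// to a gitignored file.
function toCommittedSummary(result, labels) {
  const { modelDir, ...rest } = result;
  return { model: labels.get(modelDir), ...rest };
}

// Placeholder names for illustration.
const labels = anonymizeModels(['zeta-fund', 'alpha-fund']);
const committed = toCommittedSummary({ modelDir: 'alpha-fund', accuracyPct: 90 }, labels);
console.log(JSON.stringify(committed));
```

A scrub-guard test can then assert that the serialized committed output contains none of the real directory names.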
Summary
The "next wave" effort, keystone first: a repeatable accuracy + efficacy benchmark over the real ~200 MB Outpost models, the eval-tooling fixes needed to make it produce real numbers, plus the searchByLabel and lib-unit-tests waves. Not self-merging — bringing this for your review (per "come back when it's ready for a main merge"). CI green on ubuntu + windows.
Built branch-per-improvement (each an atomic, revertable commit gated on the benchmark + tests):
feat/bench → lib-tests → cluster-eval, all folded here.
Baseline (the A/B/C test)
`npm run bench:outpost`: benchmarks/outpost-bench.mjs wraps eval/per-sheet-eval.mjs (live engine-vs-ground-truth). Aggregate-only → committed benchmarks/BASELINE.md; full detail → gitignored benchmarks/results/. No cell value/label ever committed.
What landed (all tested)
- per-sheet-eval Windows crash: it imported each sheet's compute() by bare absolute path ("C:\..."), which Node ESM rejects on Windows → every sheet crashed at 0% on Windows + the real engines. Now pathToFileURL(). Wasn't in CI; now guarded by tests/cli/test-per-sheet-eval.mjs on windows-latest.
- Cluster convergence diff scoped to the cells compute() writes (ctx._written), plus the first-ever circular-cluster test + fixture (tests/cli/fixtures/cluster-model/).
- --skip-clusters flag (benchmark uses it; clusters pending cluster-once).
- searchByLabel: on-demand probing of the matched row's columns, with a directed caseColumn probe so a far column is never missed. Behavior-preserving.
- tests/lib/test-lib.mjs (43 known-answer cases: irr/NPV/XIRR, waterfall American/European/MOIC-hurdle + conservation invariant, calibration, sensitivity). Polish→Publish's first item.
- gitignore: engines/ (real models), benchmarks/results/, _eval_tmp/.
Wave status
Validation
Full `npm test` (now 387 JS assertions incl. lib 43, per-sheet-eval 10, refine 14, init-shared-gt 8) + smoke 78/78 green. CI ubuntu + windows green on every commit.
🤖 Generated with Claude Code