ebootheee · ebootheee · May 29, 2026 · May 28, 2026 · May 28, 2026 · May 28, 2026
diff --git a/.gitignore b/.gitignore
@@ -17,6 +17,14 @@ tests/eval-results.json
 # covered by *.xlsx above; this also excludes the parsed engine output dirs.
 engines/
 
+# Benchmark run detail (per-sheet/per-cell results derived from the real models
+# — may contain real values/labels). Only the aggregate-only benchmarks/BASELINE.md
+# is committed; raw per-run detail stays local.
+benchmarks/results/
+
+# per-sheet-eval scratch dir (transient child-process scripts + per-sheet GT)
+_eval_tmp/
+
 # Transient test artifacts (scenario save/load test writes here on every run)
 tests/cli/fixtures/scenarios/
 

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,11 +1,168 @@
 # excel-to-engine — Changelog
 
+## 2026-05-28 — Privacy scrub: genericize the real model name + figures
+
+This repo is public; CLAUDE.md forbids committing real financials or participant
+names. Two cleanups before merging the next-wave PR:
+
+- **Removed the real return figures** (gross/net MOIC & IRR, the UW-comparison
+  multiple, the MIP dollar amount) from the committed docs. The findings stay
+  (golden-master match on Version Tracker row 22; refiner UW-Comparison mis-map;
+  MIP is a hand-port calibration) — only the numbers are gone. Canonical values
+  live in the gitignored artifacts + local notes and feed the golden-master test
+  from there.
+- **Genericized the real model name** out of all committed files: renamed
+  `benchmarks/outpost-bench.mjs` → `benchmarks/bench.mjs` (npm script `bench`),
+  and the benchmark now **anonymizes model identity** in printed + committed
+  output (Model A, Model B, …) — real dir names stay only in the gitignored
+  detail JSON. Prose in HANDOFF/ROADMAP/PLAN/CHANGELOG now says "the real PE
+  models" / "Model A/B". (The `test-e2e4-fixes` scrub-guard that asserts template
+  names are generic is intentionally kept.)
+
+## 2026-05-28 — Mippy calibration-oracle feature set (priority amendment)
+
+Refined the "fully ready for Mippy" target: the e2e agent's job is to make the
+full model a **reliable calibration oracle** — runnable, MIP coefficients exposed
+as named-outputs, no stubbed value cells. Documented the priority order in
+ROADMAP ("Now — Mippy calibration oracle") and HANDOFF.md, and in the
+`project_mippy_contract` memory:
+
+- **P1 · #23 + #24** — reliably emit a runnable `engine.js` (fix dep-graph OOM;
+  fail loud, never a partial artifact; lock layout + content hash).
+- **P2 · #25** — pin value-bearing cells (per-class MIP Proceeds, hurdle,
+  participation %, equity basis, valuation/shares) as named-outputs.
+- **P2 · #26** — emit `_fn-fallbacks.json`; assert no value cell uses an
+  unsupported-function stub.
+- **P3 · #22** — output-cone scoping (nice-to-have).
+
+Supporting/trustworthiness (off critical path): golden-master CI, refiner
+UW-Comparison fix, deeper `_fn` coverage, cluster-once eval.
+
+## 2026-05-28 — HANDOFF.md (fresh-agent entry point)
+
+Added `HANDOFF.md` — the prioritized next-session plan (P0 cluster-once eval →
+generation robustness #23 → `_fn()` transpiler coverage → refiner UW-Comparison
+fix → golden-master CI → output-profile/large-sheet/perf → Polish), with current
+state, run commands, and the gotchas (gitignored real models, the GT-copy
+`_computed-values.json`, the per-sheet-eval Windows fix, the bench
+`discoverModels` gate vs the `-v2` regen). PLAN points to it.
+
+## 2026-05-28 — Roadmap: PE-model regeneration findings (Mippy consumer)
+
+The downstream Mippy agent regenerated both PE engines from `main` and
+reported back. Captured the findings in ROADMAP.md ("Now — PE-model regeneration
+findings"). Confirmed wins vs the old build: **dates fixed** (old leaked
+`ExcelDateTime { … }` debug strings — 2,686 in A-1; new emits serial numbers, 0
+leaks), **~42–45% smaller** (model-map.json + the GT-copy `_computed-values.json`
+gone), contract maps emitted, circular refs converge, and a **golden-master PASS**
+— the regenerated ground truth reproduces the hand-port's canonical A-1 returns
+to full float precision (Version Tracker row 22). New follow-ups: generation
+robustness on big models (dep-graph OOM + `init` 10-min timeout — issue #23),
+`--output-profile` to scope artifacts (#22), the **11,813 `_fn()`
+unsupported-function fallbacks** per engine (transpiler-coverage accuracy suspect), the
+refiner mis-mapping returns to a "UW Comparison" tab, empty `named-inputs.json`
+when no formula-referenced defined-names exist, and MIP-as-output (#7). A
+ready-made golden-master CI assert (diff committed `named-outputs.baseCaseValue`)
+is noted.
+
+## 2026-05-28 — Circular-cluster eval: scoped convergence diff + first cluster test
+
+Progress on the circular-cluster accuracy blocker (the 17-of-21-sheet cluster on
+the real models that wouldn't evaluate).
+
+- **Scoped convergence diff.** The cluster convergence loop in
+  `per-sheet-eval.mjs` checked for a fixed point by diffing **every** cell in the
+  context each iteration, and the context is seeded with the full
+  (multi-million-cell) ground truth, so that was O(all cells) × up to 200 iterations. It now
+  tracks the cells `compute()` actually writes (`ctx._written`) and diffs only
+  those (the cluster's own outputs). Behavior-preserving; large constant-factor
+  win on big clusters.
+- **First circular-cluster test + fixture.** `tests/cli/fixtures/cluster-model/`
+  is a synthetic 2-sheet circular model (SheetA ↔ SheetB, converges to
+  a=50,b=50,c=100,d=100). `tests/cli/test-per-sheet-eval.mjs` now evaluates it
+  through the convergence loop and asserts 100% — the cluster path had no
+  coverage before, and this guards the scoped-diff change.
+
+**Still the key fix (cluster-once):** measured on the real model, scoped-diff
+alone is *not* enough — `per-sheet-eval` re-runs the entire cluster convergence
+**once per member sheet** (17×), and engine inaccuracies keep some clusters from
+converging, so they run to the 200-iteration cap. The remaining work is single-pass orchestrator eval:
+converge the cluster once, then score every member from that converged state
+(one task per cluster, not per sheet). The fixture above is the ready-made test
+oracle. Until then the benchmark runs with `--skip-clusters`.
+
+## 2026-05-28 — Unit tests for lib/ (Polish→Publish)
+
+The shared financial libraries had no direct coverage. Added
+`tests/lib/test-lib.mjs` (43 known-answer assertions), wired into `npm test`
+(runs first) so CI guards them on every push:
+
+- **`lib/irr.mjs`** — NPV/NPV-derivative identities; IRR of classic cash-flow
+  series (−100→+150 = 50%, −1000 then 200×8 ≈ 11.89%, 3-year bullet); Newton ≡
+  bisection agreement; NPV(IRR) ≈ 0; null on no-sign-change; XIRR on dated flows.
+- **`lib/waterfall.mjs`** — American 80/20 + 8% pref + catch-up (LP/GP splits,
+  carry %), no-catch-up variant, loss case (no carry), the flat-MOIC-hurdle
+  promote (incl. the hold-period-independence invariant), European builder; the
+  LP+GP = distributed conservation invariant across structures.
+- **`lib/calibration.mjs`** — nested get/set; `validateOutputs` pass/fail +
+  suggested corrective factor.
+- **`lib/sensitivity.mjs`** — `flattenOutputs` group/type filtering.
+
+## 2026-05-28 — PE-model accuracy benchmark + eval-tooling fixes
+
+Stood up a repeatable accuracy + efficacy benchmark over the real ~200 MB
+PE models so improvements can be tracked over time, and fixed the eval
+tooling that was silently broken on them.
+
+### Benchmark (`benchmarks/bench.mjs`, `npm run bench`)
+
+- Wraps `eval/per-sheet-eval.mjs` (live engine-vs-ground-truth) for every model
+  under a root dir; reports overall accuracy, per-sheet pass/skip counts, and
+  timings. **Aggregate-only** results go to the committed `benchmarks/BASELINE.md`;
+  full per-sheet detail stays in the gitignored `benchmarks/results/`. No cell
+  value or label is ever committed.
+- **Baseline (2026-05-28):** Model A **84.3%**, Model B **85.5%** on the
+  standalone sheets. (The 17-sheet circular cluster and the 190 MB PP&E sheet are
+  skipped for now — see below.)
+
+### per-sheet-eval fixes (it wasn't in CI, so these went unnoticed)
+
+- **Windows crash fixed.** The generated per-sheet wrapper imported each sheet's
+  `compute()` by a bare absolute path (`"C:\\..."`), which Node ESM rejects on
+  Windows — so *every* sheet "crashed" at load (0% accuracy) on Windows and on
+  the real engines. Now uses `pathToFileURL()`. New `tests/cli/test-per-sheet-eval.mjs`
+  (6) guards it; CI runs it on **windows-latest** too.
+- **`--skip-clusters`** flag: record circular-cluster sheets as skipped instead
+  of evaluating them. The current convergence path re-runs the *whole* cluster
+  once per member sheet (O(cluster²)), which is infeasible on big models; this
+  yields a fast, real number for the standalone sheets while the single-pass
+  orchestrator eval is built (ROADMAP).
+
+### searchByLabel: lazy numerics (query / carry)
+
+`searchByLabel` previously scanned the entire ground truth once per matched row
+to collect adjacent numerics. It now probes the row's columns on demand (same
+approach as the refiner), with a directed `caseColumn` lookup probing its exact
+cell so a far scenario column is never missed. Behavior-preserving (the query,
+carry, and ai-interface suites stay green).
+
+### Findings that scope the accuracy-blocker work
+
+- The 190 MB PP&E sheet exceeds the 150 MB per-sheet limit → **large-sheet eval**
+  blocker confirmed.
+- The circular cluster is **17 of 21 sheets** and is evaluated redundantly
+  (once per member) → the concrete reason behind "circular-cluster won't
+  evaluate." Single-pass orchestrator eval is the fix.
+- `_computed-values.json` in these engines is **byte-identical to ground truth**
+  (a seeded copy), so it is not a valid accuracy source — accuracy must come from
+  live recompute.
+
 ## 2026-05-28 — `init` parses the ground truth once (shared across the pipeline)
 
 The real driver behind the "~2.5 min" refine loop wasn't one command — it was
 that `ete init` runs **generate → refine → doctor → maps** in sequence and
 **each independently re-read and re-parsed the full ground truth** from disk. On
-the real ~200 MB Outpost models that's four parses of a 200 MB+ file at ~3.6 s
+the real ~200 MB PE models that's four parses of a 200 MB+ file at ~3.6 s
 each, plus each command's own O(N) scan.
 
 ### What changed
@@ -23,7 +180,7 @@ each, plus each command's own O(N) scan.
 
 ### Why not the row-values artifact (Tier B)
 
-Measured on both real ~200 MB Outpost models: they're **dense-label** (≈90% of
+Measured on both real ~200 MB PE models: they're **dense-label** (≈90% of
 rows labeled, ≈93% of numerics on labeled rows), not the giant-grid case Tier B's
 big win assumed. A general row-values artifact would be ≈30% of GT (≈60% of the
 post-#17 compact GT) — only ~1.6× on refine while inflating output ~60%, fighting
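
The scoped convergence diff from the circular-cluster entry in this CHANGELOG diff can be illustrated with a minimal sketch. It assumes a hypothetical context shape (`values` plus a `_written` set) and a hypothetical `setCell` helper; only the `ctx._written` name comes from the changelog. The two-sheet formulas are invented so that they settle at the same fixed point the fixture reports (a=50, b=50, c=100, d=100). This is not the code in `per-sheet-eval.mjs`.

```javascript
// Hypothetical context: seeded cell values plus the set of keys compute() wrote.
function makeContext(seed) {
  return { values: { ...seed }, _written: new Set() };
}

// Hypothetical write helper: every write is recorded in ctx._written.
function setCell(ctx, key, value) {
  ctx.values[key] = value;
  ctx._written.add(key);
}

// Invented circular model: SheetA reads SheetB!c, SheetB reads SheetA.
function compute(ctx) {
  const v = ctx.values;
  setCell(ctx, 'SheetA!a', 0.5 * v['SheetB!c']);
  setCell(ctx, 'SheetA!b', v['SheetA!a']);
  setCell(ctx, 'SheetB!c', v['SheetA!a'] + 50);
  setCell(ctx, 'SheetB!d', 2 * v['SheetA!b']);
}

// Iterate to a fixed point, diffing only the written cells between passes
// instead of every seeded cell in the context.
function converge(ctx, maxIters = 200, tol = 1e-9) {
  let previous = null; // snapshot of written cells from the prior pass
  for (let iter = 1; iter <= maxIters; iter++) {
    compute(ctx);
    const current = new Map();
    for (const key of ctx._written) current.set(key, ctx.values[key]);
    if (previous !== null) {
      let maxDelta = 0;
      for (const [key, value] of current) {
        // A key with no prior value yields NaN, which blocks convergence.
        const prev = previous.has(key) ? previous.get(key) : NaN;
        maxDelta = Math.max(maxDelta, Math.abs(value - prev));
      }
      if (maxDelta <= tol) return { converged: true, iterations: iter };
    }
    previous = current;
  }
  return { converged: false, iterations: maxIters };
}

// The seed carries an unrelated cell that the diff never touches.
const ctx = makeContext({ 'SheetB!c': 0, 'Unrelated!x': 42 });
const result = converge(ctx);
console.log(result.converged, ctx._written.size);
```

The per-pass cost here scales with the number of written cells (four), not with the size of the seed, which is the constant-factor win the entry describes.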

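The known-answer style described for `tests/lib/test-lib.mjs` in the CHANGELOG diff can be sketched as follows. The function names and signatures below are hypothetical stand-ins, not the exports of `lib/irr.mjs`; the sketch only shows the identities the entry lists (a closed-form IRR, Newton and bisection agreeing, NPV at the IRR being near zero, and `null` when the flows never change sign).

```javascript
// Net present value of period-indexed cash flows at a given rate.
function npv(rate, flows) {
  return flows.reduce((sum, cf, t) => sum + cf / Math.pow(1 + rate, t), 0);
}

// Derivative of NPV with respect to the rate.
function npvDerivative(rate, flows) {
  return flows.reduce((sum, cf, t) => sum - (t * cf) / Math.pow(1 + rate, t + 1), 0);
}

// An IRR can only exist if there is at least one inflow and one outflow.
function hasSignChange(flows) {
  return flows.some((cf) => cf > 0) && flows.some((cf) => cf < 0);
}

// Newton's method on NPV(rate) = 0.
function irrNewton(flows, guess = 0.1, tol = 1e-12, maxIters = 100) {
  if (!hasSignChange(flows)) return null;
  let rate = guess;
  for (let i = 0; i < maxIters; i++) {
    const slope = npvDerivative(rate, flows);
    if (slope === 0) return null;
    const next = rate - npv(rate, flows) / slope;
    if (Math.abs(next - rate) < tol) return next;
    rate = next;
  }
  return null;
}

// Bisection on a bracket assumed wide enough for the examples here.
function irrBisection(flows, lo = -0.99, hi = 10, tol = 1e-12) {
  if (!hasSignChange(flows)) return null;
  let fLo = npv(lo, flows);
  if (fLo * npv(hi, flows) > 0) return null; // root not bracketed
  while (hi - lo > tol) {
    const mid = (lo + hi) / 2;
    const fMid = npv(mid, flows);
    if (fLo * fMid <= 0) {
      hi = mid; // root lies in [lo, mid]
    } else {
      lo = mid; // root lies in [mid, hi]
      fLo = fMid;
    }
  }
  return (lo + hi) / 2;
}

// -100 now, +150 one period later: -100 + 150 / (1 + r) = 0 gives r = 0.5.
const simple = [-100, 150];
console.log(irrNewton(simple), irrBisection(simple), irrNewton([100, 50]));
```

Checking the two solvers against each other, rather than only against a hard-coded figure, catches a bug in either one without depending on a rounded reference value.
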
diff --git a/HANDOFF.md b/HANDOFF.md
@@ -0,0 +1,131 @@
+# HANDOFF — excel-to-engine next session
+
+Start-here doc for a fresh agent. Read this, then `ROADMAP.md` (full backlog),
+`PLAN.md` (status), `benchmarks/BASELINE.md` (accuracy numbers), and your two
+project memory files (the Mippy contract + the real-model shape/baseline notes,
+auto-loaded from your memory index).
+
+_Last updated: 2026-05-28._
+
+## The job, in one line
+
+**Make the full PE model a reliable Mippy calibration oracle: runnable,
+with the MIP coefficients exposed as named-outputs, and no stubbed value cells.**
+Everything Mippy-specific stays in Mippy — this repo just produces a trustworthy,
+sample-able engine + contract.
+
+## Where things stand
+
+**Merged to `main` this session:** artifact slimming (#17), GitHub Actions CI
+(#18, ubuntu+windows), `refine` consumes `_labels.json` + lazy numerics (#19),
+single-GT-parse per `init` (#20).
+
+**Open PR — review/merge first:** **#21 `feat/next-wave`** (CI green). Contains
+the PE-model accuracy **benchmark + baseline**, a **per-sheet-eval Windows crash
+fix**, `searchByLabel` lazy numerics, **lib/ unit tests** (43), the
+**scoped cluster-convergence diff** + the first circular-cluster fixture/test,
+and the Mippy regeneration findings in ROADMAP. **If #21 isn't merged yet, branch
+off `feat/next-wave`; otherwise off `main`.**
+
+**Baseline (real models, `npm run bench`):** Model A **84.3%**,
+Model B **85.5%** — standalone sheets only (cluster + 190 MB PP&E skipped).
+
+## How to run
+
+```bash
+npm test                 # full JS suite (387 assertions)
+npm run smoke            # chunked-engine accuracy 78/78
+npm run bench -- --root "<abs path>/engines"   # accuracy + efficacy on the real models
+node eval/per-sheet-eval.mjs <chunkedDir> --concurrency 3 [--skip-clusters]
+cd pipelines/rust && cargo build --release   # the parser
+```
+
+The real PE models live in the **gitignored** `engines/` dir (proprietary —
+never commit values/labels). The Mippy agent's fresh regen is in the regenerated
+`-v2` engine dirs (the *better* build: dates fixed, slimmed), alongside the old
+model dirs under `engines/`.
+
+## P1–P3 — Mippy calibration-oracle feature set (do in this order)
+
+All filed on ebootheee/excel-to-engine. Done-criteria are the contract.
+
+### P1 · #23 + #24 — reliably emit a runnable `engine.js` ★ blocks everything
+A clean `ete init` on a real model currently **does not finish**: the Rust parser
+is OOM-killed at the cell-level dependency-graph step, and `ete init` hits its
+10-min `spawnSync` cap → `engine.js` (the `run()` orchestrator) + the
+`dependency-graph.json` closures **don't land** (written after the OOM step).
+- **Done =** `chunked/engine.js` with `export function run()` exists on **every**
+  build; the build **errors hard** if it can't — **never a partial artifact**.
+- #24 also: **lock the artifact layout + emit a content hash** so downstream
+  consumes without per-version reconciliation.
+- Without a runnable engine we can't sample MIP to calibrate/validate — this
+  gates everything below.
+- Files: `pipelines/rust/` (dep-graph build: stream/incrementalize or raise
+  headroom; fail-loud), `cli/commands/init.mjs` (configurable timeout; don't
+  swallow a failed emit).
+
+### P2 · #25 — pin the value-bearing cells as named-outputs
+Per-class **MIP Proceeds**, **hurdle/threshold**, **participation %**, **equity
+basis**, **valuation / shares** — not just MOIC/IRR.
+- **Done =** those appear in `named-outputs.json` with base-case values. **These
+  ARE the parametric coefficients Mippy calibrates against.**
+- Files: `lib/manifest-maps.mjs` (`enumerateOutputCells` — extend beyond
+  MOIC/IRR/TV/carry; `customCells` is the current escape hatch),
+  `cli/commands/manifest*.mjs`. Pin per-model (the auto-manifest mis-maps —
+  see the refiner fix under "supporting").
+
+### P2 · #26 — `_fn` fallback audit: emit `_fn-fallbacks.json` (correctness gate)
+- **Done =** we can **assert no MIP / value / return cell resolves through an
+  unsupported-function stub.** (Auditing/gating the value cells — distinct from
+  fixing all 11,813 fallbacks, which is the deeper transpiler work below.)
+- Files: `pipelines/rust/` (emit the audit during transpile) + a check that the
+  P2/#25 named-output cells aren't in it.
+
+### P3 (nice-to-have) · #22 — output-cone scoping
+Scope generated artifacts to the consumer's need (skip the ~752 MB per-sheet
+emit). Makes the oracle cheaper to run; **not required** — we don't ship the blob.
+
+## Supporting work — makes the oracle *trustworthy* (after P1, alongside P2/P3)
+
+These aren't on Mippy's critical path but back the "reliable" in "reliable
+calibration oracle":
+- **Golden-master CI assert** — A-1's regenerated GT matches the hand-port's
+  canonical gross/net MOIC & IRR (Version Tracker row 22) to full float
+  precision. Add a CI test diffing those `named-outputs.baseCaseValue`s. The
+  canonical figures live in the gitignored `named-outputs.json` + project memory
+  — **do NOT commit the figures to this public repo.** Pairs with #25/#26.
+- **Refiner mis-maps returns to a "UW Comparison" tab** instead of the canonical
+  Version Tracker returns — `SUMMARY_SHEET_PATTERN` over-ranks it. Fix so #25's
+  value cells pin to canonical/Version-Tracker tabs without manual per-model
+  pinning. Add a manifest invariant. File: `cli/commands/manifest-refine.mjs`.
+- **Deeper transpiler coverage** — the 11,813 `_fn()` offenders behind #26's
+  audit; inventory by frequency, implement top ones. `pipelines/rust/src/`.
+- **Cluster-once eval** (our accuracy harness, not Mippy's path): the 17-sheet
+  cluster is unmeasured because `per-sheet-eval` re-runs the whole convergence
+  once per member (17×). Make it one task per cluster (converge once, score all),
+  then drop `--skip-clusters` and re-baseline. Lets us *verify* the oracle's
+  cluster math. Fixture oracle ready: `tests/cli/fixtures/cluster-model/`. (The
+  shipped `engine.js` `run()` converges clusters itself — this is measurement.)
+- **Large-sheet eval** (190 MB PP&E > 150 MB limit) and **manifest-pipeline
+  perf** (generate detectors / maps cell-types / refine fallback on ~6M cells).
+
+## Polish → Publish
+lib/ unit tests done. Remaining: npm publish prep (`bin`, `files`, metadata),
+synthetic example project, contributing guide. Lower priority: empty `named-inputs.json`
+fallback (no formula-referenced defined-names in the PE workbooks);
+MIP-as-output beyond the pinned cells is a model-owner question.
+
+## Gotchas (will bite you)
+
+- **`engines/` is gitignored** (real financials). Read-only; aggregate metrics
+  only. `_eval_tmp/` + `benchmarks/results/` are gitignored too.
+- **`_computed-values.json` in these engines is a byte-identical COPY of ground
+  truth** (seeded). NOT a valid accuracy source — use live recompute.
+- **per-sheet-eval was Windows-broken** (bare absolute ESM import → `pathToFileURL`
+  fix; guarded by `tests/cli/test-per-sheet-eval.mjs` on windows CI). Don't
+  reintroduce bare absolute `import` paths.
+- **`benchmarks/bench.mjs` `discoverModels()` gates on `engine.js`** — but
+  the `-v2` regen dirs may LACK it (the #23 OOM) while having `_graph.json` +
+  `sheets/`. If the bench skips `-v2`, relax the gate. (Fixing #23 makes this moot.)
+- **CI runs ubuntu + windows** — child-process/path/parser code must work on both.
+- After any change, update CHANGELOG/PLAN/ROADMAP per CLAUDE.md.
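
The #26 done-criterion in HANDOFF.md (no value cell resolves through an unsupported-function stub) could be gated with a check along these lines. Both schemas are assumptions: the diff names `named-outputs.json`, `baseCaseValue`, and `_fn-fallbacks.json` but does not define their shapes, so the `cell` and `fn` fields, the function names, and the sample data are hypothetical. The sketch also only catches outputs whose own cell is listed in the audit; a real gate would presumably need to walk each output's dependency closure as well.

```javascript
// Assumed shapes (not defined in the diff):
//   namedOutputs: { [name]: { cell: 'Sheet!A1', baseCaseValue: number } }
//   fallbacks:    [{ cell: 'Sheet!A1', fn: 'FUNCTION_NAME' }, ...]

// Return every named output whose cell appears in the fallback audit.
function findStubbedOutputs(namedOutputs, fallbacks) {
  const stubbed = new Map(); // cell address -> set of stubbed function names
  for (const entry of fallbacks) {
    if (!stubbed.has(entry.cell)) stubbed.set(entry.cell, new Set());
    stubbed.get(entry.cell).add(entry.fn);
  }
  const offenders = [];
  for (const [name, output] of Object.entries(namedOutputs)) {
    const fns = stubbed.get(output.cell);
    if (fns) offenders.push({ name, cell: output.cell, functions: [...fns].sort() });
  }
  return offenders;
}

// Fail loudly instead of letting a stubbed value cell pass silently.
function assertNoStubbedOutputs(namedOutputs, fallbacks) {
  const offenders = findStubbedOutputs(namedOutputs, fallbacks);
  if (offenders.length > 0) {
    const detail = offenders
      .map((o) => `${o.name} (${o.cell}): ${o.functions.join(', ')}`)
      .join('; ');
    throw new Error(`named outputs resolve through _fn() stubs: ${detail}`);
  }
}

// Placeholder data only; no real model values or labels.
const sampleOutputs = {
  outputOne: { cell: 'Summary!B2', baseCaseValue: 1 },
  outputTwo: { cell: 'Summary!B3', baseCaseValue: 2 },
};
const sampleFallbacks = [
  { cell: 'Summary!B3', fn: 'XLOOKUP' },
  { cell: 'Detail!Z9', fn: 'LET' },
];
console.log(findStubbedOutputs(sampleOutputs, sampleFallbacks));
```

Splitting the lookup (`findStubbedOutputs`) from the throwing wrapper keeps the offender list available for a report while the CI gate stays a one-line assertion.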