Skip to content

fix(parser): eliminate 12h partition-phase hang on large models#30

Merged
ebootheee merged 1 commit into
mainfrom
claude/eloquent-easley-09c529
May 29, 2026
Merged

fix(parser): eliminate 12h partition-phase hang on large models#30
ebootheee merged 1 commit into
mainfrom
claude/eloquent-easley-09c529

Conversation

@ebootheee
Copy link
Copy Markdown
Owner

What hung

A clean ete init on the full real models hung ~12h in the chunked emitter, right after [chunked] Partitioning 20 sheets.... That log line brackets partition_sheets() — which never returned. The next phase's "Emitting N sheet modules" log was never reached, so the stall was entirely inside partitioning (a misdiagnosis trap: "hung after Partitioning…" means the partition call itself, not the later par_iter sheet-module builder).

Root cause — a Round-2 regression

partition_sheets only needs sheet-level edges (which other sheets a formula references), but it called the range-expanding extract_refs. Commit 59810f6 (the cell-level dependency-graph feature) taught is_cell_ref to accept ranges — correct for that graph — but as a side effect extract_refs now expands every range to up to 1000 individual cell strings, pushes them, and de-dupes them. partition then throws away all the same-sheet ones. On the 1.62M-formula PP&E sheet (one rayon task = one core) that's ~10⁹ throwaway allocations → swap thrash → the 12h stall. The same trap sat one step downstream in detect_intra_sheet_cycles (would've been the next wall in the sheet-module phase).

Fix

  • collect_sheet_deps() (new) — scans for Name! / 'Sheet Name'! tokens and records just the sheet name; no range expansion, no per-cell allocation. Tokenization mirrors extract_refs + the same cell-ref gate, so the detected dependency set is provably identical (parity test). partition_sheets now uses it.
  • extract_refs_shallow() (new, top-left endpoint only) — used by detect_intra_sheet_cycles, restoring the pre-Round-2 behavior the known-good engines were built with (and avoiding spurious self-cycles from B10=SUM(B1:B10)).
  • write_dependency_graph keeps the full expander — the dependency-graph.json / closures contract is unchanged. Added per-sheet progress to it (the heaviest remaining sequential step) so a slow-but-progressing pass no longer reads as a hang.

Behavior-preserving for the sheet-dependency set and engine output; ~2000× faster on range-heavy formulas (synthetic parity test: 2.66s → 1.3ms for 2,000 formulas in debug, gap widens under the memory pressure that caused the real hang).

Validation

  • cargo test 17/17 (incl. parity + scale demonstration)
  • npm run smoke 78/78 (engine accuracy, regenerated via the patched binary)
  • test:depgraph 11/11 (full-expansion contract intact), test:runnable 20/20, test:engine 21/21 (cross-sheet cluster convergence intact)
  • full npm test; clean build, zero warnings

Not addressed here (separate, still-open scaling walls → #22)

This unblocks the build. Making the result cheap to run as a calibration oracle is the next piece:

  • engine.js eagerly imports ~800 MB of sheet modules (Node load-time wall)
  • partition_sheets still clone()s every cell (peak-memory doubling)

Both fold into #22 (output-cone scoping / lazy sheet loading); ROADMAP/PLAN/HANDOFF updated to record them.

Note for re-running the regen: rebuild the release parser (cd pipelines/rust && cargo build --release) before ete init — the fix lives in the binary.

🤖 Generated with Claude Code

A clean `ete init` on the full real models hung ~12h in the chunked emitter,
right after `[chunked] Partitioning N sheets...`. The stall was inside
partition_sheets (before the dependency-graph step P1/#23 fixed), so P1's
streaming fix simply exposed it as the next wall.

partition_sheets only needs sheet-level edges, but called the range-expanding
extract_refs. After Round 2 (59810f6) taught is_cell_ref to accept ranges, that
exploded every range to <=1000 cell strings per formula, pushed and de-duped
them, then discarded all the same-sheet ones -- ~1e9 throwaway allocations on the
1.62M-formula PP&E sheet, single-core -> swap thrash.

Fix: new collect_sheet_deps() records just referenced sheet names (no expansion,
no per-cell allocation); detect_intra_sheet_cycles switched to a new
extract_refs_shallow() (top-left only) -- the same blowup would have hit the
sheet-module phase next. write_dependency_graph keeps the full expander, so the
dependency-graph.json contract is unchanged. Added per-sheet progress to that
step so a slow-but-progressing pass no longer reads as a hang.

~2000x faster on range-heavy formulas (synthetic parity test). Validated: cargo
test 17/17, smoke 78/78, test:depgraph/runnable/engine 11/20/21, full npm test.
Residual scaling walls (cell cloning, ~800MB engine.js import) tracked under #22.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ebootheee ebootheee merged commit a944b7f into main May 29, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant