fix(parser): eliminate 12h partition-phase hang on large models#30
Merged
Conversation
A clean `ete init` on the full real models hung ~12h in the chunked emitter, right after `[chunked] Partitioning N sheets...`. The stall was inside partition_sheets (before the dependency-graph step P1/#23 fixed), so P1's streaming fix simply exposed it as the next wall. partition_sheets only needs sheet-level edges, but called the range-expanding extract_refs. After Round 2 (59810f6) taught is_cell_ref to accept ranges, that exploded every range to <=1000 cell strings per formula, pushed and de-duped them, then discarded all the same-sheet ones -- ~1e9 throwaway allocations on the 1.62M-formula PP&E sheet, single-core -> swap thrash. Fix: new collect_sheet_deps() records just referenced sheet names (no expansion, no per-cell allocation); detect_intra_sheet_cycles switched to a new extract_refs_shallow() (top-left only) -- the same blowup would have hit the sheet-module phase next. write_dependency_graph keeps the full expander, so the dependency-graph.json contract is unchanged. Added per-sheet progress to that step so a slow-but-progressing pass no longer reads as a hang. ~2000x faster on range-heavy formulas (synthetic parity test). Validated: cargo test 17/17, smoke 78/78, test:depgraph/runnable/engine 11/20/21, full npm test. Residual scaling walls (cell cloning, ~800MB engine.js import) tracked under #22. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What hung
A clean
ete initon the full real models hung ~12h in the chunked emitter, right after[chunked] Partitioning 20 sheets.... That log line bracketspartition_sheets()— which never returned. The next phase's "Emitting N sheet modules" log was never reached, so the stall was entirely inside partitioning (a misdiagnosis trap: "hung after Partitioning…" means the partition call itself, not the later par_iter sheet-module builder).Root cause — a Round-2 regression
partition_sheetsonly needs sheet-level edges (which other sheets a formula references), but it called the range-expandingextract_refs. Commit59810f6(the cell-level dependency-graph feature) taughtis_cell_refto accept ranges — correct for that graph — but as a side effectextract_refsnow expands every range to up to 1000 individual cell strings, pushes them, and de-dupes them. partition then throws away all the same-sheet ones. On the 1.62M-formula PP&E sheet (one rayon task = one core) that's ~10⁹ throwaway allocations → swap thrash → the 12h stall. The same trap sat one step downstream indetect_intra_sheet_cycles(would've been the next wall in the sheet-module phase).Fix
collect_sheet_deps()(new) — scans forName!/'Sheet Name'!tokens and records just the sheet name; no range expansion, no per-cell allocation. Tokenization mirrorsextract_refs+ the same cell-ref gate, so the detected dependency set is provably identical (parity test).partition_sheetsnow uses it.extract_refs_shallow()(new, top-left endpoint only) — used bydetect_intra_sheet_cycles, restoring the pre-Round-2 behavior the known-good engines were built with (and avoiding spurious self-cycles fromB10=SUM(B1:B10)).write_dependency_graphkeeps the full expander — thedependency-graph.json/ closures contract is unchanged. Added per-sheet progress to it (the heaviest remaining sequential step) so a slow-but-progressing pass no longer reads as a hang.Behavior-preserving for the sheet-dependency set and engine output; ~2000× faster on range-heavy formulas (synthetic parity test: 2.66s → 1.3ms for 2,000 formulas in debug, gap widens under the memory pressure that caused the real hang).
Validation
cargo test17/17 (incl. parity + scale demonstration)npm run smoke78/78 (engine accuracy, regenerated via the patched binary)test:depgraph11/11 (full-expansion contract intact),test:runnable20/20,test:engine21/21 (cross-sheet cluster convergence intact)npm test; clean build, zero warningsNot addressed here (separate, still-open scaling walls → #22)
This unblocks the build. Making the result cheap to run as a calibration oracle is the next piece:
engine.jseagerly imports ~800 MB of sheet modules (Node load-time wall)partition_sheetsstillclone()s every cell (peak-memory doubling)Both fold into #22 (output-cone scoping / lazy sheet loading); ROADMAP/PLAN/HANDOFF updated to record them.
🤖 Generated with Claude Code