perf(codegen): chunk the ENTIRE module-init function, not just the string loop (#5391) #5418
… just strings) (#5391) The string loop was chunked but the ~250K closure/class/function registration calls still dumped into the single __perry_init_strings block (405K instrs, one basic block, ~32MB), and clang -O0 is catastrophically superlinear on a single huge block (~36 min). Route ALL init ops — string allocation AND every registration loop — through InitChunker, which spills them into small *_chunkN functions called in order from the entry. All ops are independent (each writes its own global / a runtime registry; no SSA crosses), so chunking is safe and order-preserving.
Walkthrough: InitChunker-based chunk splitting in emit_string_pool.
Problem
#5407 split the string-allocation loop of __perry_init_strings_<prefix> into chunk functions. But that function also emits every closure/class/function registration call (js_register_closure_length, js_register_function_source, js_register_class_id, js_build_class_keys_array, …), and those ~250K calls were still emitted into the function's single basic block.

On a large bundle that left __perry_init_strings as one ~32MB basic block of ~405K instructions (55K register_closure_length + 55K register_closure_strict + 55K register_closure_arity + 43K register_function_source + …). clang -O0, which is forced for oversized modules (#4880), is catastrophically superlinear on a single huge basic block: this one function alone took ~36 minutes to compile, dominating the entire build and defeating the codegen-unit memory win (it is one indivisible function, so unit-splitting can't touch it).

Empirically isolated (synthetic clang -c -O0 timings): a 10.7MB single-basic-block function took 15s and was climbing superlinearly, versus ~2s for the same byte size split across branchy functions. The init function is the pathological single-block case.

Fix
Introduce InitChunker and route all init operations (string allocation and every one of the 21 registration loops) through it, spilling them into a sequence of small __perry_init_strings_<prefix>_chunkN functions (default 4000 ops/chunk, PERRY_STRING_INIT_CHUNK_SIZE). The entry function just calls the chunks in order.

Every init op is independent: each writes its own global or a runtime registry, and no SSA value flows between ops. Chunking at op boundaries is therefore safe and order-preserving (chunks run in sequence; ops run in order within a chunk).
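
The chunking scheme described above can be sketched in a few lines. This is a language-neutral illustration in Python, not the actual InitChunker from the PR: the class layout, the method names (push, finish), and the @demo_register callee are all assumptions made for the example, and the output is only IR-shaped text (no declarations are emitted).

```python
from dataclasses import dataclass, field


@dataclass
class InitChunker:
    """Hypothetical sketch: buffers independent init ops and spills them
    into numbered chunk functions once chunk_size ops have accumulated."""
    entry_name: str
    chunk_size: int = 4000  # default quoted in the PR description
    chunks: list = field(default_factory=list)   # finished chunks (lists of op lines)
    current: list = field(default_factory=list)  # ops of the chunk being filled

    def push(self, op: str) -> None:
        # Ops are only ever split at op boundaries, never inside an op.
        self.current.append(op)
        if len(self.current) >= self.chunk_size:
            self._spill()

    def _spill(self) -> None:
        if self.current:
            self.chunks.append(self.current)
            self.current = []

    def finish(self) -> str:
        """Emit every chunk function, then an entry that calls them in order."""
        self._spill()
        out = []
        for i, ops in enumerate(self.chunks):
            out.append(f"define void @{self.entry_name}_chunk{i}() {{")
            out.extend(f"  {op}" for op in ops)
            out.append("  ret void")
            out.append("}")
        out.append(f"define void @{self.entry_name}() {{")
        for i in range(len(self.chunks)):
            out.append(f"  call void @{self.entry_name}_chunk{i}()")
        out.append("  ret void")
        out.append("}")
        return "\n".join(out)


# Usage: five placeholder ops with a tiny chunk size yield three chunks.
# @demo_register is a made-up callee, not a real runtime function.
ops = [f"call void @demo_register(i32 {n})" for n in range(5)]
chunker = InitChunker("__perry_init_strings_demo", chunk_size=2)
for op in ops:
    chunker.push(op)
ir = chunker.finish()
```

Because each op is self-contained, moving it into a different function changes nothing except which basic block it lives in; the original order is recovered by reading the chunks in sequence.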
Result (measured on a 13MB bundle)
__perry_init_strings main function: 32MB → 0.01MB (now 104 chunk calls); 104 *_chunkN functions of ~1.4MB each.

Follow-up to #5407 (same #5391 effort).
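
For anyone wanting to sanity-check the single-block versus chunked compile-time difference locally, a rough harness might look like the following. It is a sketch under assumptions, not the script behind the numbers in this PR: it generates C source rather than the compiler's own IR, the reg callee and the function names are invented, and whether C input reproduces the same superlinear behaviour is not confirmed here.

```python
import shutil
import subprocess
import tempfile
import time
from pathlib import Path


def single_block_source(n_ops: int) -> str:
    """One function whose body is n_ops straight-line calls (a single basic block)."""
    lines = ["void reg(int);", "void init(void) {"]
    lines += [f"  reg({i});" for i in range(n_ops)]
    lines.append("}")
    return "\n".join(lines) + "\n"


def chunked_source(n_ops: int, chunk_size: int) -> str:
    """The same calls, split into small functions invoked in order from init()."""
    lines = ["void reg(int);"]
    names = []
    for c, start in enumerate(range(0, n_ops, chunk_size)):
        name = f"init_chunk{c}"
        names.append(name)
        lines.append(f"static void {name}(void) {{")
        stop = min(start + chunk_size, n_ops)
        lines += [f"  reg({i});" for i in range(start, stop)]
        lines.append("}")
    lines.append("void init(void) {")
    lines += [f"  {name}();" for name in names]
    lines.append("}")
    return "\n".join(lines) + "\n"


def time_clang_o0(source: str):
    """Return seconds taken by `clang -c -O0`, or None if clang is not installed."""
    clang = shutil.which("clang")
    if clang is None:
        return None
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "init.c"
        obj = Path(tmp) / "init.o"
        src.write_text(source)
        start = time.perf_counter()
        subprocess.run([clang, "-c", "-O0", str(src), "-o", str(obj)], check=True)
        return time.perf_counter() - start
```

Calling time_clang_o0(single_block_source(n)) and time_clang_o0(chunked_source(n, 4000)) for increasing n gives two curves to compare; the absolute numbers will depend on the clang version and machine.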