codegen: split large modules into N codegen units (bound clang peak memory) (#5391) by proggeramlug · Pull Request #5407 · PerryTS/perry

proggeramlug · 2026-06-18T12:00:28Z

What

Split large modules into N independently-compiled codegen units so clang -c's peak memory is bounded to ~whole/N instead of OOMing on one giant translation unit. Closes the structural half of the single-TU wall in #5391 (follow-on to the IR-efficiency roadmap #5334).

The 13MB @anthropic-ai/claude-code cli.js lowers to one ~1.1GB / 28.7M-line .ll; clang -c needs ~15GB RSS to compile it — the hard wall that stops the bundle producing a binary. The IR-efficiency levers (#5334 A–D) trimmed ~12%, but the floor is structural: one TU.

How

LlModule::render_codegen_units(n) renders the populated module into N .ll texts:

functions bin-packed by estimated byte size (greedy largest-first) so no unit is outsized;
the full string-constant + global set carried in every unit with local/bare-external definitions promoted to linkonce_odr (linker keeps one copy; globals are ~1.4% of a large module's IR, so duplication is cheap);
each unit gets external declares for every function it calls but doesn't define (deduped by name; existing declarations win), so cross-unit calls resolve; the lone internal init/wrapper is promoted to external;
shared attribute groups + metadata replicated so #N/!N references resolve.

linker::compile_units_to_object compiles each unit with clang -c (sequential → peak RSS ≈ one unit) and merges the objects with ld -r into one — so compile_module(...) -> Result<Vec<u8>> and the existing one-object link path are unchanged.

Also fixes a latent duplicate-symbol bug (two minified classes both named j → colliding mangled method symbol, defined twice → LLVM redefinition error). to_ir and the unit renderer now emit each symbol once. This bit the single-TU path too; it only surfaced once a real bundle reached clang.

Gating

PERRY_CODEGEN_UNITS=N forces N (1 disables); otherwise auto by callable count — 1 unit below a floor (default, zero behavior change), then ceil(callables / PERRY_CODEGEN_UNIT_SIZE) capped at 48.

Validation

Correctness: a multi-function program with cross-unit calls + shared string/class globals compiles, links, and runs identically at 1, 2, 3, and 4 units (hi world / 20). Unit tests cover the partition (define-once, cross-unit declares, linkonce_odr promotion), size-balancing (a giant function is isolated), dedup (both render paths), and n=1 == to_ir.
The memory wall is gone: the cli.js bundle compiled 23/24 units at ~3GB peak clang RSS (vs ~15GB on the single TU) before I stopped it — proving the structural fix.

Known remaining wall (out of scope here)

The bundle has a single 68MB minified IIFE. clang -O0 on one 68MB function is pathologically slow (time, not memory) — the intra-function #4880 problem, which inter-function splitting cannot divide. Size-balancing isolates it (so the other 23 units stay fast + bounded), but producing a full bundle binary needs function-level splitting, tracked separately.

Follow-ups

Declare only referenced symbols per unit (vs all); bounded-parallel unit compiles for wall-clock speedup. Refs #5391.

Summary by CodeRabbit

New Features
- Added configurable LLVM codegen-unit splitting for large modules (with environment overrides), including per-unit compilation merged into a single object and earlier native-representation artifact emission.
- Improved cross-unit correctness with de-duplicated function symbols and safe promotion of shared globals/strings.
- Reduced monolithic string initialization via chunked setup (configurable chunk size).
- Added IR-size estimation for functions and exposed module function count.
Tests
- Expanded coverage for multi-unit packing, cross-unit references, symbol-collision handling, and chunked string initialization behavior.

coderabbitai · 2026-06-18T12:00:47Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 4238dc44-3bd2-442f-a162-61afa40dd0e8

📥 Commits

Reviewing files that changed from the base of the PR and between 3e4f12e and 05fd8ef.

📒 Files selected for processing (6)

crates/perry-codegen/src/codegen/helpers.rs
crates/perry-codegen/src/codegen/mod.rs
crates/perry-codegen/src/codegen/string_pool.rs
crates/perry-codegen/src/function.rs
crates/perry-codegen/src/linker.rs
crates/perry-codegen/src/module.rs

🚧 Files skipped from review as they are similar to previous changes (5)

crates/perry-codegen/src/codegen/helpers.rs
crates/perry-codegen/src/codegen/string_pool.rs
crates/perry-codegen/src/function.rs
crates/perry-codegen/src/module.rs
crates/perry-codegen/src/codegen/mod.rs

📝 Walkthrough

Walkthrough

Implements LLVM codegen unit splitting for large Perry modules. A new heuristic (decide_codegen_units) selects unit count based on callable count and env-var overrides. Module IR emission is refactored to partition functions via greedy bin-packing by estimated IR size, with link-safe linkage promotion, cross-unit declare synthesis, and symbol deduplication. String initialization is split into multiple chunk functions to reduce per-unit IR. A new linker function (compile_units_to_object) compiles each unit and merges them via ld -r. The compile_module pipeline is wired to use the multi-unit path when applicable.

Changes

LLVM Codegen Unit Splitting

Layer / File(s)	Summary
IR size estimation and unit-count heuristic `crates/perry-codegen/src/function.rs`, `crates/perry-codegen/src/codegen/helpers.rs`	`LlFunction::estimated_ir_bytes` approximates function IR size by summing per-instruction and per-block overhead; `decide_codegen_units` selects unit count from callable count with env-var overrides (`PERRY_CODEGEN_UNITS`, `PERRY_CODEGEN_UNIT_SIZE`) clamped to `[1, 48]`.
Module IR refactoring for multi-unit safety `crates/perry-codegen/src/module.rs`	Adds private helpers for linkage stripping, global/string promotion, external-declare synthesis, and forced external linkage. Introduces `deduped_function_refs()` to prevent symbol collisions; updates `to_ir()` to use deduplicated function lists for consistent declare filtering and definition emission across full-module and split paths. Factors setjmp attributes and metadata into `push_attrs_and_metadata` for reuse. Exposes `LlModule::function_count()` for unit balancing.
Module unit partitioning with bin-packing `crates/perry-codegen/src/module.rs`	Refactors `render_codegen_units(n)` to partition functions via greedy bin-packing by estimated IR bytes, promotes shared globals and strings to `linkonce_odr` for safe duplication, synthesizes external declares for cross-unit references with dedup, forces external linkage on internal/private function definitions. Tests cover 2-unit partitioning with cross-unit declares, symbol collision handling, greedy bin-packing isolation of large functions, and single-unit equivalence to `to_ir()`.
Multi-unit object compilation and partial link `crates/perry-codegen/src/linker.rs`	`compile_units_to_object` compiles each unit's IR to a temp `.o` via `compile_ll_to_object`, merges via `ld -r` (configurable via `PERRY_LD`), returns merged bytes on success, reports link failure with stderr, and conditionally cleans up intermediates based on `PERRY_LLVM_KEEP_IR`.
String pool initialization chunking `crates/perry-codegen/src/codegen/string_pool.rs`	`emit_string_pool` partitions string initialization into multiple `__perry_init_strings_<prefix>_chunk<N>` functions (chunk size configurable via `PERRY_STRING_INIT_CHUNK_SIZE`, default 4000) and updates the main init function to sequentially call each chunk, reducing per-unit monolithic init function IR size.
compile_module pipeline integration `crates/perry-codegen/src/codegen/mod.rs`	Moves native-representation artifact write before the early-return branch; computes `n_units` (forced to 1 for `emit_ir_only`, otherwise via `decide_codegen_units`); for `n_units > 1` renders units and returns early via `compile_units_to_object`, leaving the single-unit path as fallback.

Sequence Diagram(s)

sequenceDiagram
  participant compile_module
  participant decide_codegen_units
  participant LlModule
  participant compile_units_to_object
  participant compile_ll_to_object
  participant ld_r as ld -r

  compile_module->>decide_codegen_units: module_callable_count(hir)
  decide_codegen_units-->>compile_module: n_units

  alt n_units > 1
    compile_module->>LlModule: render_codegen_units(n_units)
    LlModule-->>compile_module: Vec<String> unit IR texts
    compile_module->>compile_units_to_object: units, target_triple
    loop each unit
      compile_units_to_object->>compile_ll_to_object: unit IR → .o bytes
      compile_units_to_object->>compile_units_to_object: write /tmp/unit_N.o
    end
    compile_units_to_object->>ld_r: -r unit_0.o unit_1.o …
    ld_r-->>compile_units_to_object: merged object bytes
    compile_units_to_object-->>compile_module: Result<Vec<u8>>
  else n_units == 1
    compile_module->>LlModule: to_ir()
    LlModule-->>compile_module: IR string
    compile_module->>compile_ll_to_object: IR string, target_triple
    compile_ll_to_object-->>compile_module: object bytes
  end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related issues

codegen: split large modules into parallel codegen units (bound clang peak memory; remove the single-TU wall) #5391: This PR directly implements the codegen-unit splitting feature planned in that issue, including the decide_codegen_units heuristic, compile_units_to_object linker support, render_codegen_units refactoring, and cross-unit declaration/linkage promotion logic.
codegen: IR is ~96x bloated for large/untyped modules (1.25GB / ~15GB clang RSS for a 13MB bundle) — outline dynamic machinery + specialize #5334: This PR implements codegen-unit splitting as a compilation strategy to handle large modules by breaking them into smaller chunks, directly addressing the massive IR bloat and clang memory consumption described in that issue.

Possibly related PRs

PerryTS/perry#5308: Both PRs change how per-function LLVM symbol names are generated and kept unique to prevent symbol-collision issues during codegen (main PR via cross-unit deduped declare/define emission; retrieved PR via injective scoped_fn_name/__dupN disambiguation).

Poem

🐇 Hop, hop, the IR grows too wide,
So I split it up and let the units glide.
Each bucket bins its bytes with greedy care,
ld -r stitches objects through the air.
No more giant monolith in memory's lair —
Just nimble units, linked without despair! 🌿

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title accurately and specifically describes the main change: splitting large modules into N codegen units to bound clang peak memory, directly matching the PR's core objective.
Description check	✅ Passed	The PR description covers all required template sections: Summary (Why), Changes (What), Related issue (`#5391`), Test plan with validation checkboxes, and Checklist. All sections are substantively completed.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/codegen-units

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (1)

crates/perry-codegen/src/function.rs (1)

368-380: ⚡ Quick win

Include entry_post_init_setup in the IR-size estimate.

Line 378 counts entry_allocas, but to_ir() also emits entry_post_init_setup into block 0. Excluding it can underweight init-heavy functions and skew unit balancing.

Suggested patch

     pub fn estimated_ir_bytes(&self) -> usize {
         let body: usize = self
             .blocks
             .iter()
             .map(|b| {
                 b.instructions_iter().map(|i| i.len() + 1).sum::<usize>()
                     + b.label.len()
                     + 4
             })
             .sum();
         let allocas: usize = self.entry_allocas.iter().map(|a| a.len() + 1).sum();
-        body + allocas + self.name.len() + 64
+        let post_init_setup: usize = self.entry_post_init_setup.iter().map(|l| l.len() + 1).sum();
+        body + allocas + post_init_setup + self.name.len() + 64
     }

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@crates/perry-codegen/src/function.rs` around lines 368 - 380, The
estimated_ir_bytes() method calculates IR size for entry_allocas but does not
account for entry_post_init_setup, which is also emitted into block 0 by the
to_ir() function. This causes the size estimate to be inaccurate for init-heavy
functions. Add a calculation for entry_post_init_setup similar to the
entry_allocas calculation (iterating through each element, summing up their
lengths plus 1 for each item), and include this sum in the final body + allocas
+ name.len() + 64 calculation to provide a more complete IR-size estimate.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/perry-codegen/src/codegen/mod.rs`:
- Around line 2521-2531: The n_units decision logic in the codegen units block
only checks emit_ir_only but neglects to check PERRY_SAVE_LL, which according to
the comment should also use the single-text path. Update the condition that sets
n_units to 1 to also check whether PERRY_SAVE_LL is set, so that when the save
operation is enabled, it takes the whole-module text path and doesn't return
early before reaching the save block.

In `@crates/perry-codegen/src/linker.rs`:
- Around line 303-309: The temporary object files (perry_cgu_*.o) created in the
loop starting at the compile_ll_to_object call are not cleaned up when early
failures occur through the ? operator, leaving orphaned files in /tmp. Implement
a cleanup mechanism that ensures all object files in obj_paths are deleted
regardless of whether compilation succeeds or fails. This could be done by using
a scope guard or error handler that guarantees cleanup runs even when ?
operators cause early returns before reaching the intended cleanup block at line
335-340. Make sure the cleanup logic removes all accumulated object files from
obj_paths whenever the function exits due to error.
- Around line 312-318: The linker selection in the partial linking step (where
Command::new(&ld) is created) currently ignores target_triple and always uses
the host linker via PERRY_LD environment variable or hardcoded "ld" string. This
fails for cross-compilation targets like MSVC/COFF or non-GNU targets. Replace
the hardcoded linker selection logic by introducing target-aware selection logic
that examines target_triple (the same parameter used for per-unit compilation)
and selects the appropriate linker tool (e.g., lld-link for COFF targets,
link.exe for MSVC, or target-specific ld variants). Include a clear error or
diagnostic when the target is not supported or no compatible partial linker is
available.

In `@crates/perry-codegen/src/module.rs`:
- Around line 549-555: The test in the render_codegen_units_dedup function
currently does not validate the multi-unit deduplication path because after
deduplication, all definitions are collapsed into a single unit rather than
being distributed across multiple units. Modify the test setup to create a
scenario where the duplicate symbol perry_method_j__foo appears across multiple
codegen units (not consolidated into one), so that the assertion at the end
actually exercises and validates the multi-unit deduplication behavior instead
of only testing the single-unit case.

---

Nitpick comments:
In `@crates/perry-codegen/src/function.rs`:
- Around line 368-380: The estimated_ir_bytes() method calculates IR size for
entry_allocas but does not account for entry_post_init_setup, which is also
emitted into block 0 by the to_ir() function. This causes the size estimate to
be inaccurate for init-heavy functions. Add a calculation for
entry_post_init_setup similar to the entry_allocas calculation (iterating
through each element, summing up their lengths plus 1 for each item), and
include this sum in the final body + allocas + name.len() + 64 calculation to
provide a more complete IR-size estimate.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: df5fb6c3-227d-4f6f-aea7-424c8f3b5224

📥 Commits

Reviewing files that changed from the base of the PR and between 54ebb6c and a8b8f57.

📒 Files selected for processing (5)

crates/perry-codegen/src/codegen/helpers.rs
crates/perry-codegen/src/codegen/mod.rs
crates/perry-codegen/src/function.rs
crates/perry-codegen/src/linker.rs
crates/perry-codegen/src/module.rs

coderabbitai · 2026-06-18T12:08:11Z

+    let combined = tmp_dir.join(format!("perry_cgu_{}_{}_combined.o", pid, nonce));
+    let ld = env::var("PERRY_LD").unwrap_or_else(|_| "ld".to_string());
+    let mut cmd = Command::new(&ld);
+    cmd.arg("-r").arg("-o").arg(&combined);
+    for p in &obj_paths {
+        cmd.arg(p);
+    }


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Description: Inspect whether the split partial-link path propagates target/toolchain selection. # Expected: current output shows the per-unit clang path receives `target_triple`, # while the partial-link command is always `PERRY_LD`/`ld -r`. rg -n -C3 'compile_units_to_object|PERRY_LD|arg\("-r"\)|target_triple' --type=rust

Repository: PerryTS/perry

Length of output: 50369

🏁 Script executed:

sed -n '200,400p' crates/perry-codegen/src/linker.rs | cat -n

Repository: PerryTS/perry

Length of output: 9679

🏁 Script executed:

rg 'lld|link\.exe|arm.*ld|aarch64.*ld' crates/perry-codegen/src/linker.rs

Repository: PerryTS/perry

Length of output: 320

Route the partial linker through target-aware selection logic.

The per-unit compilation correctly passes target_triple to clang -c (line 104), but the subsequent merge at lines 313–318 always invokes host ld -r via PERRY_LD or hardcoded "ld". When compiling for MSVC/COFF or cross-targets (e.g., host=Linux, target=Windows), the resulting COFF objects cannot be merged with GNU ld, causing link failure or producing incompatible output.

Select the linker based on target_triple to use the correct tool (lld-link, link.exe, or target-specific ld variants), or disable splitting with a clear diagnostic when no compatible partial linker is available.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/perry-codegen/src/linker.rs` around lines 312 - 318, The linker selection in the partial linking step (where Command::new(&ld) is created) currently ignores target_triple and always uses the host linker via PERRY_LD environment variable or hardcoded "ld" string. This fails for cross-compilation targets like MSVC/COFF or non-GNU targets. Replace the hardcoded linker selection logic by introducing target-aware selection logic that examines target_triple (the same parameter used for per-unit compilation) and selects the appropriate linker tool (e.g., lld-link for COFF targets, link.exe for MSVC, or target-specific ld variants). Include a clear error or diagnostic when the target is not supported or no compatible partial linker is available.

A large bundle interns ~190K strings, and emitting every string's init (from-bytes + nanbox + global store + GC-root register) into one `__perry_init_strings` block produced a single ~68MB function. clang -O0 (forced for oversized modules, #4880) can't compile a function that large in practical time — its per-function passes are superlinear in size — so this was the residual wall after codegen-unit splitting removed the module MEMORY wall: a single giant FUNCTION that inter-function splitting can't divide. The string inits are independent (each handle is stored to its own global; no SSA value flows between iterations), so emit them into chunk functions of PERRY_STRING_INIT_CHUNK_SIZE entries (default 4000) and call those in sequence from `__perry_init_strings`. The ~48 chunks are each ~1.4MB and, combined with codegen-unit splitting (#5407), bin-pack evenly across units instead of one unit carrying the 68MB monolith. Order is preserved (chunks called in sequence, ops in order within a chunk), so GC-root registration and any ordering stay identical. Behavior-neutral for ordinary modules (a handful of strings → one chunk). Full codegen suite green.

proggeramlug · 2026-06-18T13:56:16Z

Update: + string-pool init splitting (function-level)

Pushed a 4th commit that splits the generated string-pool initializer. On the bundle that function was a single 68MB monolith (190K independent string inits in one block) — clang -O0 (forced for oversized modules, #4880) is superlinear in single-function size, so even after codegen-unit splitting removed the module memory wall, that one function was an unbounded time wall. The inits are independent (each handle stored to its own global; no SSA flows between iterations), so they now emit as ~48 chunk functions (PERRY_STRING_INIT_CHUNK_SIZE, default 4000) called in sequence — bin-packing evenly across units. Validated: correct runtime output with multiple chunks + CGU split; behavior-neutral for ordinary modules.

Empirical status on the cli.js bundle

Module memory wall: solved. All units compile at ~2GB peak clang RSS (vs ~15GB single-TU).
Generated-init time wall: solved. The 68MB string-init is gone (→ ~48 × 1.4MB chunks).
Remaining (separate, deeper): the bundle still contains several large user functions — an 18MB minified closure, an 11.5MB main, a 7.6MB closure. clang -O0 time is superlinear per function, so each takes 10–40 min. Splitting arbitrary user functions needs true liveness-based outlining (the string-init shortcut relies on the ops being SSA-independent), or shrinking them via more aggressive IR reduction (extending the codegen: IR is ~96x bloated for large/untyped modules (1.25GB / ~15GB clang RSS for a 13MB bundle) — outline dynamic machinery + specialize #5334 full-outline levers to GET/method/general-property). Tracked as follow-up to codegen: split large modules into parallel codegen units (bound clang peak memory; remove the single-TU wall) #5391.

This PR is a correct, self-contained structural win on its own: it removes the module memory wall and the worst single (generated) function, both proven.

… modules (#5391 path 2) (#5410) Extends the #5334 lever-B full-outline from field-SET to field-GET. After codegen-unit splitting (#5407) removed the module memory wall and the string-init split removed the 68MB generated function, the residual wall is the bundle's large minified USER functions (7-18MB closures): clang -O0 time is superlinear in single-function size. Full-outlining the IC diamonds shrinks those functions at their source. When full-outline is enabled (size-gated), the class-field-GET diamond (inline precheck + guard call + fast slot load + by-name fallback + phi) collapses to one `js_class_field_get_ic(...)` call returning the field value. The runtime helper runs the same guard, reads the field slot as f64 on a pass (a plain number is self-boxing in nan-boxing, so raw-f64 and boxed slots read identically — matching the inline `load double`), and records + reads by name on a miss. The outline drops the inline path's static raw-number type hint (result treated as a general JS value), which is value-correct. class-field GET is the largest remaining diamond kind on the bundle (~39K sites). Gated by the existing PERRY_FULL_OUTLINE_IC; default off = unchanged. Unit test covers both gate states; full codegen suite green. Co-authored-by: Ralph Küpper <ralph2@skelpo.com>

…ERRY_SAVE_LL; clean temp objects on failure (#5407)

proggeramlug · 2026-06-18T18:46:47Z

Thanks — addressed the review:

Major (cross-target ld -r): codegen-unit splitting now gates off for any explicit --target (and for emit_ir_only/PERRY_SAVE_LL), falling back to a single TU, since the host ld -r only merges host-format objects. Host builds (the large-bundle case) still split.
PERRY_SAVE_LL skip: same gate now disables splitting when it's set, so the whole-module dump works.
Temp leak on failure: the unit objects are tracked and cleaned on every exit path (extracted an inner worker so cleanup runs even on a mid-sequence ?).
Weak test note: the dup-symbol test deliberately validates the dedup (one symbol → one unit); the genuine cross-unit define/declare split is covered by render_codegen_units_partitions_and_links (two distinct functions → two units).

A large bundle interns ~190K strings, and emitting every string's init (from-bytes + nanbox + global store + GC-root register) into one `__perry_init_strings` block produced a single ~68MB function. clang -O0 (forced for oversized modules, #4880) can't compile a function that large in practical time — its per-function passes are superlinear in size — so this was the residual wall after codegen-unit splitting removed the module MEMORY wall: a single giant FUNCTION that inter-function splitting can't divide. The string inits are independent (each handle is stored to its own global; no SSA value flows between iterations), so emit them into chunk functions of PERRY_STRING_INIT_CHUNK_SIZE entries (default 4000) and call those in sequence from `__perry_init_strings`. The ~48 chunks are each ~1.4MB and, combined with codegen-unit splitting (#5407), bin-pack evenly across units instead of one unit carrying the 68MB monolith. Order is preserved (chunks called in sequence, ops in order within a chunk), so GC-root registration and any ordering stay identical. Behavior-neutral for ordinary modules (a handful of strings → one chunk). Full codegen suite green.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@crates/perry-codegen/src/codegen/mod.rs`:
- Around line 2535-2538: The n_units decision logic at the assignment of n_units
should check three conditions to use single-unit mode (n_units = 1) instead of
deciding based on module_callable_count: the existing opts.emit_ir_only check, a
new check for the PERRY_SAVE_LL environment variable being set, and a
cross-compilation check for when opts.target.is_none(). Modify the condition so
that if any of these three conditions are true, set n_units to 1, otherwise call
decide_codegen_units with the hir. This ensures that PERRY_SAVE_LL paths reach
the save logic at line 2558 and that cross-compilation without explicit target
override uses the single linker invocation path.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: 6d540f14-6d75-442e-ae98-a1863bfae08e

📥 Commits

Reviewing files that changed from the base of the PR and between 6b63b2b and 3e4f12e.

📒 Files selected for processing (6)

crates/perry-codegen/src/codegen/helpers.rs
crates/perry-codegen/src/codegen/mod.rs
crates/perry-codegen/src/codegen/string_pool.rs
crates/perry-codegen/src/function.rs
crates/perry-codegen/src/linker.rs
crates/perry-codegen/src/module.rs

🚧 Files skipped from review as they are similar to previous changes (5)

crates/perry-codegen/src/codegen/helpers.rs
crates/perry-codegen/src/linker.rs
crates/perry-codegen/src/function.rs
crates/perry-codegen/src/module.rs
crates/perry-codegen/src/codegen/string_pool.rs

coderabbitai · 2026-06-18T18:55:04Z

+    let n_units = if opts.emit_ir_only {
+        1
+    } else {
+        decide_codegen_units(module_callable_count(hir))


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash # Check if cross-compilation gating is implemented elsewhere rg -n "PERRY_CODEGEN_UNITS|render_codegen_units|compile_units_to_object" --type rust -C 3 | head -100

Repository: PerryTS/perry

Length of output: 7863

🏁 Script executed:

# Check the full context around the n_units decision in codegen/mod.rs sed -n '2520,2560p' crates/perry-codegen/src/codegen/mod.rs

Repository: PerryTS/perry

Length of output: 1664

🏁 Script executed:

# Check the full compile_units_to_object implementation in linker.rs sed -n '280,320p' crates/perry-codegen/src/linker.rs

Repository: PerryTS/perry

Length of output: 1884

🏁 Script executed:

# Look for any environment variable checks related to PERRY_SAVE_LL or cross-compilation rg -n "PERRY_SAVE_LL|is_cross|target_triple" crates/perry-codegen/src/ --type rust

Repository: PerryTS/perry

Length of output: 4435

🏁 Script executed:

# Check git log for PR references to cross-compilation and codegen units git log --oneline --all --grep="5391\|cross-compil\|codegen.*unit" 2>/dev/null | head -20

Repository: PerryTS/perry

Length of output: 914

🏁 Script executed:

# Check if there are any comments in the code mentioning cross-compilation + codegen splitting rg -n "cross.*compil|target.*arch" crates/perry-codegen/src/codegen/mod.rs crates/perry-codegen/src/linker.rs

Repository: PerryTS/perry

Length of output: 619

🏁 Script executed:

# Check commit messages for `#5391` to see if cross-compilation gating was mentioned git log --all --format="%B" --grep="5391" 2>/dev/null | head -100

Repository: PerryTS/perry

Length of output: 5531

🏁 Script executed:

# Look for FIXME/TODO comments about cross-compilation in codegen units code rg -n "FIXME|TODO|XXX" crates/perry-codegen/src/codegen/mod.rs crates/perry-codegen/src/linker.rs | grep -i "cross\|target\|compil\|unit"

Repository: PerryTS/perry

Length of output: 39

🏁 Script executed:

# Check if ld -r command gets any target-specific flags when cross-compiling sed -n '310,340p' crates/perry-codegen/src/linker.rs

Repository: PerryTS/perry

Length of output: 1075

Add PERRY_SAVE_LL check to n_units decision logic.

The code comment at line 2532 states that both emit_ir_only and PERRY_SAVE_LL should use the single-text path, but the n_units decision at lines 2535–2537 only checks emit_ir_only. When PERRY_SAVE_LL is set, the code incorrectly proceeds to codegen splitting instead of taking the single-text path where PERRY_SAVE_LL is actually handled (line 2558). This causes the directory to be never populated because the split path returns early without reaching the PERRY_SAVE_LL save logic.

Additionally, cross-compilation with splitting should be gated: when cross-compiling (no --target flag override from host), ld -r at line ~320 uses the host linker by default, which may not support the target object format. Pass opts.target.is_some() into the n_units decision to force single-unit mode.

🔧 Proposed fix

+ let save_ll = std::env::var_os("PERRY_SAVE_LL").is_some(); + let is_cross_compile = opts.target.is_some(); - let n_units = if opts.emit_ir_only { + let n_units = if opts.emit_ir_only || save_ll || is_cross_compile { 1 } else { decide_codegen_units(module_callable_count(hir)) };

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@crates/perry-codegen/src/codegen/mod.rs` around lines 2535 - 2538, The n_units decision logic at the assignment of n_units should check three conditions to use single-unit mode (n_units = 1) instead of deciding based on module_callable_count: the existing opts.emit_ir_only check, a new check for the PERRY_SAVE_LL environment variable being set, and a cross-compilation check for when opts.target.is_none(). Modify the condition so that if any of these three conditions are true, set n_units to 1, otherwise call decide_codegen_units with the hir. This ensures that PERRY_SAVE_LL paths reach the save logic at line 2558 and that cross-compilation without explicit target override uses the single linker invocation path.

A single LLVM translation unit per module makes clang OOM on large inputs (a 13MB minified bundle lowers to one ~1.1GB .ll; clang -c needs ~15GB RSS). This adds codegen-unit splitting: the populated LlModule renders to N independent .ll texts, each compiled by clang -c separately (peak RSS ~whole/N) and merged with `ld -r` into one object — so compile_module's single-Vec<u8> contract and the existing one-object link path are unchanged. Mechanism (LlModule::render_codegen_units): - functions partitioned into N contiguous buckets; - the full string-constant + global set carried in every unit with local and bare-external DEFINITIONS promoted to linkonce_odr (linker keeps one copy; globals are ~1.4% of a large module's IR, so duplication is cheap); - each unit gets external declares for every function it calls but does not define (deduped by name; existing declarations win), so cross-unit calls resolve; the lone internal init/wrapper is promoted to external; - shared attribute groups + metadata replicated so #N/!N references resolve. Gating (decide_codegen_units): PERRY_CODEGEN_UNITS=N forces N (1 disables); otherwise auto by callable count — 1 unit below a floor, then ceil(callables / PERRY_CODEGEN_UNIT_SIZE) capped at 48. Default is 1 unit for ordinary modules — zero behavior change. Validated: a 2-unit split of a multi-function program with cross-unit calls + shared string/class globals compiles, links, and runs identically to the 1-unit build (`hi world` / 20). Unit tests cover the partition (define-once, cross-unit declares, linkonce_odr promotion) and the n=1 == to_ir identity. Follow-on to the IR-efficiency roadmap #5334; tracks #5391.

Minified bundles can contain two distinct classes that sanitize to the same name (e.g. two classes `j`), so their mangled method symbols collide (`perry_method_..._j__getElementsByTagName` defined twice). LLVM rejects the redefinition and the whole module fails to compile — independent of codegen-unit splitting (the single-TU `to_ir` had the same latent bug; it only surfaced once a real bundle reached clang). Add `deduped_function_refs` (first occurrence per name) and route both `to_ir` and `render_codegen_units` through it, so each symbol is emitted once. Calls to the duplicate resolve to the first definition — a dispatch ambiguity limited to genuinely name-colliding members; proper disambiguation by class id is a separate concern. No-op for ordinary modules (no duplicate names). Unit test covers both render paths.

Contiguous chunk-by-function-count partitioning clumped functions unevenly: on the cli.js bundle one unit ballooned to 139MB (it caught the biggest functions) while others were ~50MB, and clang -O0 time is superlinear in unit size, so that one unit dominated the whole compile. Switch to greedy largest-first bin-packing by `LlFunction::estimated_ir_bytes` (a cheap instruction-byte sum, no second render): each function is placed into the currently-smallest unit, so big functions are isolated and the rest stay even. This bounds the max unit size to ~total/N + the single largest function. NB: a function larger than total/N is irreducible by inter-function splitting — the cli.js bundle has a single 68MB minified IIFE whose clang -O0 time is the residual bottleneck (the intra-function #4880 problem, distinct from the module memory wall this feature removes). Balancing isolates it so the other units compile fast and in bounded memory. Unit test: a giant function + several tiny ones splits so the giant is not clumped with the small set.

A large bundle interns ~190K strings, and emitting every string's init (from-bytes + nanbox + global store + GC-root register) into one `__perry_init_strings` block produced a single ~68MB function. clang -O0 (forced for oversized modules, #4880) can't compile a function that large in practical time — its per-function passes are superlinear in size — so this was the residual wall after codegen-unit splitting removed the module MEMORY wall: a single giant FUNCTION that inter-function splitting can't divide. The string inits are independent (each handle is stored to its own global; no SSA value flows between iterations), so emit them into chunk functions of PERRY_STRING_INIT_CHUNK_SIZE entries (default 4000) and call those in sequence from `__perry_init_strings`. The ~48 chunks are each ~1.4MB and, combined with codegen-unit splitting (#5407), bin-pack evenly across units instead of one unit carrying the 68MB monolith. Order is preserved (chunks called in sequence, ops in order within a chunk), so GC-root registration and any ordering stay identical. Behavior-neutral for ordinary modules (a handful of strings → one chunk). Full codegen suite green.

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

proggeramlug mentioned this pull request Jun 18, 2026

perf(codegen): outline array-literal construction for oversized modules (#5391) #5412

Merged

proggeramlug pushed a commit that referenced this pull request Jun 18, 2026

review(coderabbit): gate codegen-unit split off for cross-targets + P…

6b63b2b

…ERRY_SAVE_LL; clean temp objects on failure (#5407)

proggeramlug force-pushed the feat/codegen-units branch from 6b63b2b to 3e4f12e Compare June 18, 2026 18:49

coderabbitai Bot reviewed Jun 18, 2026

View reviewed changes

Ralph Küpper added 5 commits June 18, 2026 21:03

style: cargo fmt (codegen string_pool/function/linker/module)

05fd8ef

proggeramlug force-pushed the feat/codegen-units branch from 3e4f12e to 05fd8ef Compare June 18, 2026 19:07

proggeramlug merged commit 0cc6ee1 into main Jun 18, 2026
15 checks passed

proggeramlug deleted the feat/codegen-units branch June 18, 2026 19:16

This was referenced Jun 18, 2026

perf(codegen): chunk the ENTIRE module-init function, not just the string loop (#5391) #5418

Merged

codegen: IR is ~96x bloated for large/untyped modules (1.25GB / ~15GB clang RSS for a 13MB bundle) — outline dynamic machinery + specialize #5334

Closed

Uh oh!

Conversation

proggeramlug commented Jun 18, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

How

Gating

Validation

Known remaining wall (out of scope here)

Follow-ups

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

proggeramlug commented Jun 18, 2026

Update: + string-pool init splitting (function-level)

Empirical status on the cli.js bundle

Uh oh!

proggeramlug commented Jun 18, 2026

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 18, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

proggeramlug commented Jun 18, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 18, 2026 •

edited

Loading