
codegen: IR is ~96x bloated for large/untyped modules (1.25GB / ~15GB clang RSS for a 13MB bundle) — outline dynamic machinery + specialize #5334

@proggeramlug


Summary

Compiling a large minified bundle (the 13MB @anthropic-ai/claude-code cli.js) generates ~31.7M lines / 1.25GB of LLVM IR — a ~96× expansion over the 13MB source. clang needs ~15GB RSS to compile that single module, even at -O0. This is not sustainable: the IR volume (and clang memory/time) scales with the inline-everything codegen, not with the program's real complexity.
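
The headline ratios can be rechecked from the figures quoted above. This is a minimal sketch that assumes decimal units (1MB = 10^6 bytes), which the issue does not state; the variable names are illustrative only.

```python
# Figures quoted in the summary; decimal units are an assumption.
SOURCE_BYTES = 13e6      # 13MB cli.js
IR_BYTES = 1.25e9        # 1.25GB of LLVM IR
IR_LINES = 31.7e6        # ~31.7M IR lines
CLANG_RSS_BYTES = 15e9   # ~15GB clang RSS

expansion = IR_BYTES / SOURCE_BYTES                  # IR bytes per source byte
bytes_per_ir_line = IR_BYTES / IR_LINES              # average IR line length
ir_lines_per_source_byte = IR_LINES / SOURCE_BYTES   # IR lines per source byte
rss_per_ir_byte = CLANG_RSS_BYTES / IR_BYTES         # clang memory per IR byte

print(f"expansion:                {expansion:.1f}x")
print(f"bytes per IR line:        {bytes_per_ir_line:.1f}")
print(f"IR lines per source byte: {ir_lines_per_source_byte:.2f}")
print(f"clang RSS per IR byte:    {rss_per_ir_byte:.1f}")
```

Under those assumptions the bundle expands about 96x, an average IR line is about 39 bytes, every source byte produces roughly 2.4 IR lines, and clang holds about 12 bytes of RSS per byte of IR.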

This issue tracks making perry's IR architecturally efficient for large/dynamic modules, with a hard constraint: no runtime performance regression on real code.

Where the 1.25GB goes (measured on the real bundle IR)

| pattern | count | what it is |
|---|---|---|
| js_typed_feedback_*_guard | 365,158 | inline IC-guard diamond at every property/field access |
| class_field_set.fast | 238,282 | inline field-set fast-path blocks |
| js_object_set_field_by_name | 117,462 | their fallback arms |
| js_gc_note_slot_layout + js_write_barrier_slot | 310,541 | GC write-barrier machinery at every store |
| bitcast double | 1,571,150 | nan-box conversions on ~every value |
| call @js_* | 4,436,767 | runtime calls |
| define (function bodies) | 91,869 | the actual code (a small fraction of the total) |
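
A breakdown like the one above can be reproduced by scanning the `.ll` file line by line. The sketch below is one way to do it; the regexes only approximate the table's patterns (the exact matching rules behind the measured numbers are not stated here), and the sample IR is a made-up fragment, not real perry output.

```python
import re
from collections import Counter

# Regexes approximating the table's patterns. These are assumptions, not the
# rules used for the measurements in this issue.
PATTERNS = {
    "typed_feedback_guard": re.compile(r"js_typed_feedback_\w+_guard"),
    "class_field_set_fast": re.compile(r"class_field_set\.fast"),
    "set_field_by_name": re.compile(r"js_object_set_field_by_name"),
    "gc_barrier": re.compile(r"js_gc_note_slot_layout|js_write_barrier_slot"),
    "bitcast_double": re.compile(r"bitcast double"),
    "runtime_call": re.compile(r"\bcall\b.*@js_"),
    "define": re.compile(r"^define\b"),
}


def count_patterns(lines):
    """Count, per pattern, how many IR lines contain at least one match."""
    counts = Counter()
    for line in lines:
        for name, rx in PATTERNS.items():
            if rx.search(line):
                counts[name] += 1
    return counts


# Made-up IR fragment for illustration only.
SAMPLE_IR = """\
define double @f(double %x) {
  %bits = bitcast double %x to i64
  %sum = call double @js_add(double %x, double %x)
  call void @js_write_barrier_slot(i64 %bits)
  ret double %sum
}
"""

counts = count_patterns(SAMPLE_IR.splitlines())
for name, n in sorted(counts.items()):
    print(f"{name:22} {n}")
```

For a real module, passing an open file object (`count_patterns(open(path))`) streams the IR instead of loading 1.25GB into memory. Note that one line can count toward several patterns, as the write-barrier call does here.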

Root cause: dynamic codegen of untyped code. Minified JS has no type annotations, so static_type_of resolves to Any almost everywhere and every operation lowers to the dynamic form — nan-box + inline cache + runtime call — inlined at every site.

Roadmap (prioritized by impact × feasibility × perf-neutrality)

Tier 1 — outline the dynamic machinery (works on untyped/minified JS; biggest line counts)

  • A. Outline the cold IC paths, always. The guard-miss + js_object_set_field_by_name fallback + merge blocks are never hot. Collapse them to one @js_field_{get,set}_slow call; keep the fast store/load inline. Zero perf cost (cold-only). Lowest-risk, pure-win first step.
  • B. Adaptive full-outline for oversized modules. When a module crosses a size threshold (mirroring the existing ll_o0_threshold_bytes), outline the entire IC (including the fast path) into one call. A full outline measured ~1% slower on a 30M-iteration field-write micro-benchmark, a cost that should not be visible on real I/O-bound code and that applies only to pathologically large modules. Net: zero perf impact on normal apps, compact IR on bundles. This is the same size/speed policy perry already uses for -O.
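
The difference between the current shape, option A, and option B can be sketched with a toy emitter. Everything below is pseudo-IR for illustration: the per-shape line counts are invented, `@js_typed_feedback_set_guard`, `@js_field_set_outlined`, and the threshold parameter are hypothetical names, and only `@js_field_set_slow` and `js_object_set_field_by_name` come from this issue.

```python
from typing import List

MODES = ("inline", "outline_cold", "full_outline")


def emit_field_set(site: int, mode: str) -> List[str]:
    """Return toy pseudo-IR lines for one field store under an outlining mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    if mode == "full_outline":
        # Option B: guard, fast path and fallback all live in one shared helper.
        return [f"call void @js_field_set_outlined(%obj{site}, %key{site}, %val{site})"]

    # Guard plus inline fast store, shared by "inline" and "outline_cold".
    lines = [
        f"%hit{site} = call i1 @js_typed_feedback_set_guard(%obj{site})",
        f"br i1 %hit{site}, label %fast{site}, label %slow{site}",
        f"fast{site}:",
        f"store double %val{site}, ptr %slot{site}",
        f"br label %merge{site}",
        f"slow{site}:",
    ]
    if mode == "inline":
        # Current shape: the whole fallback arm is repeated at every site.
        # These three lines are placeholders for that arm.
        lines += [
            f"%name{site} = load ptr, ptr %key{site}",
            f"call void @js_object_set_field_by_name(%obj{site}, %name{site}, %val{site})",
            f"call void @js_gc_note_slot_layout(%obj{site})",
        ]
    else:
        # Option A: the cold arm collapses to a single slow-path call.
        lines.append(f"call void @js_field_set_slow(%obj{site}, %key{site}, %val{site})")
    lines += [f"br label %merge{site}", f"merge{site}:"]
    return lines


def choose_mode(module_bytes: int, full_outline_threshold_bytes: int) -> str:
    """Option B policy: fully outline only when the module is oversized."""
    if module_bytes > full_outline_threshold_bytes:
        return "full_outline"
    return "outline_cold"


for mode in MODES:
    total = sum(len(emit_field_set(i, mode)) for i in range(1000))
    print(f"{mode:13} {total} lines for 1000 sites")
```

The point of the sketch is the ordering, not the absolute numbers: the cold-only outline keeps the fast store inline and trims each site a little, while the full outline reduces each site to one line at the cost of a call on the hot path.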

Tier 2 — emit less scaffolding (perf-neutral or a win; directly helps the -O0 path that huge modules force)

  • C. Nan-box round-trip elimination. 1.57M bitcast double (~5% of all IR lines). Untyped lowering boxes→unboxes→boxes the same value across adjacent ops. Track box-state at codegen and skip the round-trips. -O2 folds these; -O0 (which oversized modules force) does not, so this is a pure win for exactly the case that matters.
  • D. GC-barrier elimination for non-pointer stores. 310K gc_note+write_barrier. Raw-f64/primitive fields can't hold pointers → no barrier needed. perry already knows class_field_declared_type (the raw-f64 path); extend it to suppress the barrier. Correctness-gated by field type.
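
Both Tier 2 ideas amount to remembering something the code generator already knows. A minimal sketch, assuming a toy emitter in which boxing and unboxing are plain bitcasts: the `Emitter` class, its method names, and the `"number"` type tag are hypothetical, and the caching is only valid while the cached SSA value dominates the reuse (for example within one basic block).

```python
from typing import Dict, List, Optional


class Emitter:
    """Toy emitter that remembers which SSA name is the boxed/raw twin of another."""

    def __init__(self) -> None:
        self.lines: List[str] = []
        self.boxed_of: Dict[str, str] = {}  # raw name -> boxed name
        self.raw_of: Dict[str, str] = {}    # boxed name -> raw name
        self.counter = 0

    def _fresh(self) -> str:
        self.counter += 1
        return f"%t{self.counter}"

    def box(self, raw: str) -> str:
        """Return the boxed form of raw, emitting a bitcast only the first time."""
        if raw in self.boxed_of:
            return self.boxed_of[raw]
        out = self._fresh()
        self.lines.append(f"{out} = bitcast i64 {raw} to double")
        self.boxed_of[raw], self.raw_of[out] = out, raw
        return out

    def unbox(self, boxed: str) -> str:
        """Return the raw form of boxed, reusing the original value if known."""
        if boxed in self.raw_of:
            return self.raw_of[boxed]
        out = self._fresh()
        self.lines.append(f"{out} = bitcast double {boxed} to i64")
        self.raw_of[boxed], self.boxed_of[out] = out, boxed
        return out


# Field types assumed unable to hold a heap pointer. Only the raw-f64 case is
# named in this issue; which other primitives qualify is left open here.
NON_POINTER_FIELD_TYPES = frozenset({"number"})


def needs_write_barrier(declared_type: Optional[str]) -> bool:
    """Keep the barrier unless the field type proves no pointer is stored."""
    return declared_type not in NON_POINTER_FIELD_TYPES


e = Emitter()
boxed = e.box("%x")       # emits one bitcast
raw = e.unbox(boxed)      # reuses %x, emits nothing
boxed_again = e.box(raw)  # reuses the first bitcast, emits nothing
print(len(e.lines), raw, boxed_again == boxed)
print(needs_write_barrier("number"), needs_write_barrier(None))
```

The box→unbox→box sequence costs one bitcast instead of three, and an unknown field type conservatively keeps its barrier.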

Tier 3 — type-directed specialization (the long-term architecture; a perf win; limited on minified JS)

  • E. Propagate types → native ops. Where a value is statically a number/string/known-shape (real .ts, tsgo, local inference), emit fadd/fcmp/direct-slot instead of js_add/IC/call @js_*. Attacks the 4.4M calls + 365K guards at the source. Makes typed programs both smaller and faster.
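
As an illustration of E, here is a toy lowering for `a + b` that picks the native instruction only when both operand types are statically known to be numbers. The function name, the type tags, and the `@js_add` signature shown are assumptions made for this sketch; it is not perry's lowering code.

```python
def lower_add(lhs_type: str, rhs_type: str) -> str:
    """Return toy IR for `a + b` given the static types of the operands."""
    if lhs_type == "number" and rhs_type == "number":
        # JS number + number is an IEEE-754 double add, so fadd is exact here.
        return "%r = fadd double %a, %b"
    # Anything else (string concatenation, Any, mixed types) stays dynamic.
    return "%r = call double @js_add(double %a, double %b)"


sites = [("number", "number"), ("any", "number"), ("string", "string")]
lowered = [lower_add(lhs, rhs) for lhs, rhs in sites]
runtime_calls = sum("call" in line for line in lowered)
print(f"{runtime_calls} of {len(sites)} sites still call the runtime")
```

On minified JS nearly every site falls into the dynamic branch, which is why this tier is expected to pay off mainly on typed sources.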

Principle

perry's codegen is "dynamic everything." Sustainable = specialize where the type is known (Tier 3), outline the residual dynamic machinery into shared helpers (Tier 1), and stop emitting redundant scaffolding (Tier 2).

Sequencing

For untyped bundles, Tier 1+2 do the heavy lifting and need no type info: A → C → D → B. Each step is independently shippable and is measured on two axes: IR-line reduction, and a hot-loop benchmark that must stay flat. Tier 3 is the strategic follow-on.

Context: surfaced while taking a real 13MB app all the way through perry's pipeline (parse → lower → transform → codegen → clang). Every discrete/correctness wall is already fixed (separate PRs); this issue is the remaining efficiency frontier.

Metadata

Labels: performance (Runtime or compile-time performance)
