Summary
Compiling a large minified bundle (the 13MB @anthropic-ai/claude-code cli.js) generates ~31.7M lines / 1.25GB of LLVM IR — a ~96× expansion over the 13MB source. clang needs ~15GB RSS to compile that single module, even at -O0. This is not sustainable: the IR volume (and clang memory/time) scales with the inline-everything codegen, not with the program's real complexity.
This issue tracks making perry's IR architecturally efficient for large/dynamic modules, with a hard constraint: no runtime performance regression on real code.
Where the 1.25GB goes (measured on the real bundle IR)
| pattern | count | what it is |
|---|---|---|
| js_typed_feedback_*_guard | 365,158 | inline IC-guard diamond at every property/field access |
| class_field_set.fast | 238,282 | inline field-set fast-path blocks |
| js_object_set_field_by_name | 117,462 | their fallback arms |
| js_gc_note_slot_layout + js_write_barrier_slot | 310,541 | GC write-barrier machinery at every store |
| bitcast double | 1,571,150 | nan-box conversions on ~every value |
| call @js_* | 4,436,767 | runtime calls |
| define (function bodies) | 91,869 | the actual code (a small fraction of the total) |
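To put the counts in proportion, here is a small arithmetic sketch. It uses only the figures quoted in this issue (31.7M lines, 1.25GB, 13MB, and the table counts) and makes one simplifying assumption: each pattern occurrence is treated as one IR line, so the percentages are rough shares rather than exact measurements.

```python
# Figures copied from this issue; all are approximate.
TOTAL_IR_LINES = 31_700_000   # ~31.7M lines of LLVM IR
IR_BYTES = 1.25e9             # ~1.25GB of LLVM IR
SOURCE_BYTES = 13e6           # ~13MB cli.js

pattern_counts = {
    "js_typed_feedback_*_guard": 365_158,
    "class_field_set.fast": 238_282,
    "js_object_set_field_by_name": 117_462,
    "js_gc_note_slot_layout + js_write_barrier_slot": 310_541,
    "bitcast double": 1_571_150,
    "call @js_*": 4_436_767,
    "define (function bodies)": 91_869,
}


def share_of_ir(count, total=TOTAL_IR_LINES):
    """Occurrences as a percentage of total IR lines.

    Assumption: one occurrence corresponds to one IR line.
    """
    return 100.0 * count / total


expansion = IR_BYTES / SOURCE_BYTES
lines_per_function = TOTAL_IR_LINES / pattern_counts["define (function bodies)"]

# Largest contributors first.
for name, count in sorted(pattern_counts.items(), key=lambda kv: -kv[1]):
    print(f"{name:48s} {count:>10,d}  {share_of_ir(count):5.2f}%")
print(f"expansion over source: ~{expansion:.0f}x")
print(f"average IR lines per function: ~{lines_per_function:.0f}")
```

Under that assumption, bitcast double comes out just under 5% of all lines, consistent with the figure quoted in Tier 2, and the average function body spans a few hundred IR lines.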
Root cause: dynamic codegen of untyped code. Minified JS has no type annotations, so static_type_of resolves to Any almost everywhere and every operation lowers to the dynamic form — nan-box + inline cache + runtime call — inlined at every site.
Roadmap (prioritized by impact × feasibility × perf-neutrality)
Tier 1 — outline the dynamic machinery (works on untyped/minified JS; biggest line counts)
- A. Outline the cold IC paths, always. The guard-miss + js_object_set_field_by_name fallback + merge blocks are never hot. Collapse them to one @js_field_{get,set}_slow call; keep the fast store/load inline. Zero perf cost (cold-only). Lowest-risk, pure-win first step.
- B. Adaptive full-outline for oversized modules. When a module crosses a size threshold (mirroring the existing ll_o0_threshold_bytes), outline the entire IC (incl. the fast path) to one call. A full outline measured ~1% slower on a 30M-iteration field-write micro-benchmark, but that cannot manifest on real I/O-bound code, and it only applies to pathologically large modules. Net: zero perf impact on normal apps, compact IR on bundles. Same size/speed policy perry already uses for -O.
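The A/B split above can be sketched as a size-gated policy. This is an illustration only, not perry's implementation: the threshold value, the names IcStrategy, choose_ic_strategy, emit_field_set and @js_field_set_ic, and the pseudo-IR strings are all hypothetical. Only @js_field_set_slow corresponds to a helper named in this issue.

```python
from enum import Enum


class IcStrategy(Enum):
    INLINE_FAST_OUTLINE_COLD = "A"  # fast path inline, miss path is one call
    FULL_OUTLINE = "B"              # the whole IC collapses to one call


# Hypothetical value; the issue only says it should mirror ll_o0_threshold_bytes.
FULL_OUTLINE_THRESHOLD_BYTES = 8 * 1024 * 1024


def choose_ic_strategy(module_bytes, threshold=FULL_OUTLINE_THRESHOLD_BYTES):
    """Pick the per-module IC shape from the module's size."""
    if module_bytes >= threshold:
        return IcStrategy.FULL_OUTLINE
    return IcStrategy.INLINE_FAST_OUTLINE_COLD


def emit_field_set(strategy):
    """Return illustrative pseudo-IR lines for one field store."""
    if strategy is IcStrategy.FULL_OUTLINE:
        # Hypothetical helper that contains guard, fast store and fallback.
        return ["call void @js_field_set_ic(obj, key, value)"]
    return [
        "guard = shape_matches(obj, cached_shape)",
        "br guard, fast, slow",
        "fast: store value, slot(obj, cached_offset)",
        "slow: call void @js_field_set_slow(obj, key, value)",
        "merge:",
    ]


small = choose_ic_strategy(200_000)          # ordinary app module
bundle = choose_ic_strategy(13 * 1024 * 1024)  # 13MB minified bundle
print(small.name, len(emit_field_set(small)), "lines per site")
print(bundle.name, len(emit_field_set(bundle)), "lines per site")
```

Keeping the decision per module means ordinary modules never see the fully outlined form, which is the property the ~1% micro-benchmark cost is traded against.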
Tier 2 — emit less scaffolding (perf-neutral or a win; directly helps the -O0 path huge modules force)
- C. Nan-box round-trip elimination. 1.57M bitcast double (~5% of all IR). Untyped lowering boxes→unboxes→boxes the same value across adjacent ops. Track box-state at codegen and skip the round-trips. -O2 folds these; -O0 (which oversized modules force) does not, so this is a pure win for exactly the case that matters.
- D. GC-barrier elimination for non-pointer stores. 310K gc_note+write_barrier. Raw-f64/primitive fields can't hold pointers → no barrier needed. perry already knows class_field_declared_type (the raw-f64 path); extend it to suppress the barrier. Correctness-gated by field type.
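Both Tier 2 items can be illustrated on a toy SSA-style op list. This is a sketch under stated assumptions, not perry's codegen: the tuple op format, the opcode names box/unbox, the function names, and the set of pointer-free field types are all invented for the example.

```python
def eliminate_box_round_trips(ops):
    """Forward unbox(box(x)) to x and drop boxes that end up unused.

    ops: list of (opcode, dest, args) tuples in SSA form (toy format).
    """
    boxed_from = {}  # boxed value name -> raw value it was boxed from
    alias = {}       # eliminated name -> name that replaces it
    out = []
    for opcode, dest, args in ops:
        args = tuple(alias.get(a, a) for a in args)
        if opcode == "box":
            boxed_from[dest] = args[0]
        elif opcode == "unbox" and args[0] in boxed_from:
            # The raw value is already available; skip the round-trip.
            alias[dest] = boxed_from[args[0]]
            continue
        out.append((opcode, dest, args))
    used = {a for _, _, args in out for a in args}
    return [op for op in out if not (op[0] == "box" and op[1] not in used)]


# Assumption: field types that can never hold a heap pointer.
POINTER_FREE_FIELD_TYPES = {"f64", "i32", "bool"}


def needs_write_barrier(declared_field_type):
    """Keep the barrier unless the field type is known to be pointer-free.

    None stands for an unknown/Any field and conservatively keeps the barrier.
    """
    return declared_field_type not in POINTER_FREE_FIELD_TYPES


ops = [
    ("fadd", "t0", ("a", "b")),
    ("box", "t1", ("t0",)),
    ("unbox", "t2", ("t1",)),   # round-trip: t2 is just t0
    ("fmul", "t3", ("t2", "c")),
    ("box", "t4", ("t3",)),     # still needed: the boxed result escapes
    ("ret", None, ("t4",)),
]
optimized = eliminate_box_round_trips(ops)
print(len(ops), "ops ->", len(optimized), "ops")
```

The barrier predicate defaults to keeping the barrier, which matches the "correctness-gated by field type" constraint: only a positively known primitive field skips it.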
Tier 3 — type-directed specialization (the long-term architecture; a perf win; limited on minified JS)
- E. Propagate types → native ops. Where a value is statically a number/string/known-shape (real .ts, tsgo, local inference), emit fadd/fcmp/direct-slot instead of js_add/IC/call @js_*. Attacks the 4.4M calls + 365K guards at the source. Makes typed programs both smaller and faster.
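As a minimal sketch of what type-directed lowering means for one operator: pick the native instruction when both operand types are statically number, otherwise fall back to the runtime call. The function names, the string type tags, and the exact signature shown for @js_add are assumptions made for illustration; perry's real lowering is not shown in this issue.

```python
def lower_add(lhs_type, rhs_type):
    """Return one illustrative IR line for `lhs + rhs`."""
    if lhs_type == "number" and rhs_type == "number":
        # Statically numeric: a native float add, no boxing, no call.
        return "fadd double %lhs, %rhs"
    # Any/unknown operand: dynamic form via the runtime (signature assumed).
    return "call double @js_add(double %lhs, double %rhs)"


def count_runtime_calls(sites):
    """Count how many add sites still need the dynamic runtime call."""
    return sum(1 for lhs, rhs in sites if lower_add(lhs, rhs).startswith("call"))


typed_sites = [("number", "number"), ("number", "number"), ("number", "Any")]
minified_sites = [("Any", "Any")] * 3

print(count_runtime_calls(typed_sites), "of", len(typed_sites), "typed sites stay dynamic")
print(count_runtime_calls(minified_sites), "of", len(minified_sites), "untyped sites stay dynamic")
```

The second case shows why this tier is limited on minified JS: with every operand resolving to Any, nothing specializes and Tiers 1 and 2 have to carry the reduction.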
Principle
perry's codegen is "dynamic everything." Sustainable = specialize where the type is known (Tier 3), outline the residual dynamic machinery into shared helpers (Tier 1), and stop emitting redundant scaffolding (Tier 2).
Sequencing
For untyped bundles, Tier 1+2 do the heavy lifting and need no type info: A → C → D → B. Each is independently shippable and measured for IR-line reduction + a flat hot-loop benchmark. Tier 3 is the strategic follow-on.
Context: surfaced while taking a real 13MB app all the way through perry's pipeline (parse → lower → transform → codegen → clang). Every discrete/correctness wall is already fixed (separate PRs); this issue is the remaining efficiency frontier.