codegen: IR is ~96x bloated for large/untyped modules (1.25GB / ~15GB clang RSS for a 13MB bundle); outline dynamic machinery + specialize (#5334)

## Summary

Compiling a large minified bundle (the 13MB `@anthropic-ai/claude-code` `cli.js`) generates **~31.7M lines / 1.25GB of LLVM IR**, a ~96× expansion over the 13MB source. clang needs **~15GB RSS** to compile that single module, even at `-O0`. This is not sustainable: IR volume, and with it clang memory and compile time, scales with the inline-everything codegen rather than with the program's real complexity.

This issue tracks making perry's IR **architecturally efficient** for large/dynamic modules, with a hard constraint: **no runtime performance regression on real code.**

## Where the 1.25GB goes (measured on the real bundle IR)

| pattern | count | what it is |
|---|---|---|
| `js_typed_feedback_*_guard` | 365,158 | inline IC-guard diamond at every property/field access |
| `class_field_set.fast` | 238,282 | inline field-set fast-path blocks |
| `js_object_set_field_by_name` | 117,462 | their fallback arms |
| `js_gc_note_slot_layout` + `js_write_barrier_slot` | 310,541 | GC write-barrier machinery at every store |
| `bitcast double` | 1,571,150 | nan-box conversions on ~every value |
| `call @js_*` | 4,436,767 | runtime calls |
| `define` (function bodies) | 91,869 | the actual code (a small fraction of the total) |
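
A breakdown like the one above can be reproduced by streaming the textual IR and counting pattern occurrences. The following is a minimal sketch, not the script used for the measurements: the regexes are assumptions about how each name appears in the `.ll` text, and it counts occurrences rather than lines.

```python
import re
from collections import Counter
from typing import Iterable

# Pattern names come from the table; the regexes are assumptions about the IR text.
PATTERNS = {
    "typed_feedback_guard": re.compile(r"js_typed_feedback_\w*_guard"),
    "class_field_set.fast": re.compile(r"class_field_set\.fast"),
    "set_field_by_name": re.compile(r"js_object_set_field_by_name"),
    "gc_barrier": re.compile(r"js_gc_note_slot_layout|js_write_barrier_slot"),
    "bitcast double": re.compile(r"bitcast double"),
    "call @js_*": re.compile(r"call [^@]*@js_"),
    "define": re.compile(r"^define "),
}


def count_patterns(lines: Iterable[str]) -> Counter:
    """Count total lines and per-pattern occurrences in one streaming pass."""
    counts = Counter()
    for line in lines:
        counts["lines"] += 1
        for name, rx in PATTERNS.items():
            counts[name] += len(rx.findall(line))
    return counts


# Usage (the path is hypothetical); iterating the file object keeps memory flat
# even for a multi-gigabyte module:
#
#     with open("module.ll", encoding="utf-8", errors="replace") as f:
#         for name, n in count_patterns(f).most_common():
#             print(f"{n:>12,}  {name}")
```

Streaming line by line matters here because the file being measured is itself too large to load comfortably.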

Root cause: **dynamic codegen of untyped code.** Minified JS has no type annotations, so `static_type_of` resolves to `Any` almost everywhere and every operation lowers to the dynamic form — nan-box + inline cache + runtime call — inlined at every site.

## Roadmap (prioritized by impact × feasibility × perf-neutrality)

### Tier 1 — outline the dynamic machinery (works on untyped/minified JS; biggest line counts)
- **A. Outline the cold IC paths, always.** The guard-miss + `js_object_set_field_by_name` fallback + merge blocks are never hot. Collapse them to one `@js_field_{get,set}_slow` call; keep the fast store/load **inline**. Zero perf cost (cold-only). *Lowest-risk, pure-win first step.*
- **B. Adaptive full-outline for oversized modules.** When a module crosses a size threshold (mirroring the existing `ll_o0_threshold_bytes`), outline the *entire* IC, including the fast path, into one call. A full outline measured ~1% slower on a 30M-iteration field-write micro-benchmark; that cost should not be observable on real I/O-bound code, and it applies only to pathologically large modules. Net: zero perf impact on normal apps, compact IR on bundles. This is the same size/speed policy perry already uses for `-O`.
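
To make the difference between A and B concrete, here is a toy emitter for a single field store under three shapes. It is a sketch under stated assumptions, not perry's codegen: the IR lines are schematic, `@guard`, `@record_miss`, `@key.N` and `@js_field_set` are hypothetical names, and the per-site line counts only illustrate the relative sizes.

```python
from enum import Enum


class IcMode(Enum):
    INLINE_ALL = "inline_all"      # current shape: guard, fast path and fallback all inline
    OUTLINE_COLD = "outline_cold"  # step A: fast path inline, miss path is one call
    OUTLINE_FULL = "outline_full"  # step B: the whole IC is one call


def choose_mode(module_bytes: int, threshold_bytes: int) -> IcMode:
    """Step B policy: only oversized modules give up the inline fast path."""
    if module_bytes > threshold_bytes:
        return IcMode.OUTLINE_FULL
    return IcMode.OUTLINE_COLD


def emit_field_set(mode: IcMode, site: int) -> list[str]:
    """Return schematic IR lines for one field store at call site `site`."""
    if mode is IcMode.OUTLINE_FULL:
        # Hypothetical helper that contains guard, fast path and fallback.
        return [f"call void @js_field_set(ptr %obj, i32 {site}, double %val)"]
    lines = [
        f"%hit{site} = call i1 @guard(ptr %obj, i32 {site})",
        f"br i1 %hit{site}, label %fast{site}, label %slow{site}",
        f"fast{site}:",
        f"  store double %val, ptr %slot{site}",
        f"  br label %done{site}",
        f"slow{site}:",
    ]
    if mode is IcMode.OUTLINE_COLD:
        lines.append(f"  call void @js_field_set_slow(ptr %obj, i32 {site}, double %val)")
    else:
        lines += [
            f"  %key{site} = load ptr, ptr @key.{site}",
            f"  call void @js_object_set_field_by_name(ptr %obj, ptr %key{site}, double %val)",
            f"  call void @record_miss(i32 {site})",
        ]
    lines += [f"  br label %done{site}", f"done{site}:"]
    return lines
```

In this toy model the inline store survives in `OUTLINE_COLD`, which is why A is expected to be perf-neutral, while `OUTLINE_FULL` trades that store for a call in exchange for a single line per site.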

### Tier 2 — emit less scaffolding (perf-neutral or a win; directly helps the `-O0` path huge modules force)
- **C. Nan-box round-trip elimination.** 1.57M `bitcast double` (~5% of all IR). Untyped lowering boxes→unboxes→boxes the same value across adjacent ops. Track box-state at codegen and skip the round-trips. `-O2` folds these; **`-O0` (which oversized modules force) does not** — so this is pure win for exactly the case that matters.
- **D. GC-barrier elimination for non-pointer stores.** 310K `js_gc_note_slot_layout` + `js_write_barrier_slot` calls. Raw-f64/primitive fields can't hold pointers, so they need no barrier. perry already knows `class_field_declared_type` (the raw-f64 path); extend it to suppress the barrier. Correctness-gated by field type.
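
Both C and D amount to remembering something at codegen time and skipping an emission. The sketch below illustrates the idea on a toy builder; it is not perry's implementation. It assumes SSA-style value names within straight-line code, uses a schematic `bitcast` as the box/unbox operation, and the type names in `POINTER_FREE_TYPES` are hypothetical stand-ins for whatever `class_field_declared_type` reports.

```python
from typing import Optional

# Assumed names for declared field types that can never hold a heap pointer.
POINTER_FREE_TYPES = {"number", "boolean"}


class Builder:
    """Toy IR builder that tracks which values already exist in both forms."""

    def __init__(self) -> None:
        self.ir: list[str] = []
        self.boxed_of: dict[str, str] = {}  # raw name -> boxed name
        self.raw_of: dict[str, str] = {}    # boxed name -> raw name
        self._next = 0

    def _fresh(self) -> str:
        self._next += 1
        return f"%t{self._next}"

    def _link(self, raw: str, boxed: str) -> None:
        self.boxed_of[raw] = boxed
        self.raw_of[boxed] = raw

    def box(self, raw: str) -> str:
        if raw in self.boxed_of:  # C: round-trip, reuse the existing value
            return self.boxed_of[raw]
        out = self._fresh()
        self.ir.append(f"{out} = bitcast i64 {raw} to double")
        self._link(raw, out)
        return out

    def unbox(self, boxed: str) -> str:
        if boxed in self.raw_of:  # C: round-trip, reuse the existing value
            return self.raw_of[boxed]
        out = self._fresh()
        self.ir.append(f"{out} = bitcast double {boxed} to i64")
        self._link(out, boxed)
        return out


def emit_store(b: Builder, slot: str, boxed_val: str, declared_type: Optional[str]) -> None:
    """D: emit the barrier unless the declared type proves the slot is pointer-free."""
    b.ir.append(f"store double {boxed_val}, ptr {slot}")
    if declared_type not in POINTER_FREE_TYPES:  # unknown type keeps the barrier
        b.ir.append(f"call void @js_write_barrier_slot(ptr {slot})")
```

The conservative default matters for D: an undeclared or unknown field type keeps its barrier, so the elision can only trigger where the type information is positive.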

### Tier 3 — type-directed specialization (the long-term architecture; a perf *win*; limited on minified JS)
- **E. Propagate types → native ops.** Where a value is statically a number/string/known-shape (real `.ts`, `tsgo`, local inference), emit `fadd`/`fcmp`/direct-slot instead of `js_add`/IC/`call @js_*`. Attacks the 4.4M calls + 365K guards at the source. Makes typed programs both smaller and faster.
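
The shape of E can be sketched as a lowering function that consults the static type before choosing an instruction. This is an illustration only: representing types as the strings `"number"` and `"any"` is an assumption, and the `@js_add` signature shown is schematic.

```python
def lower_add(lhs_type: str, rhs_type: str) -> str:
    """Pick a native op when both operand types are statically numbers."""
    if lhs_type == "number" and rhs_type == "number":
        return "%r = fadd double %a, %b"
    # Anything else, including "any", keeps the dynamic runtime call.
    return "%r = call double @js_add(double %a, double %b)"


def lower_less_than(lhs_type: str, rhs_type: str) -> str:
    """Same decision for a comparison; the runtime helper name is hypothetical."""
    if lhs_type == "number" and rhs_type == "number":
        return "%c = fcmp olt double %a, %b"
    return "%c = call double @js_lt(double %a, double %b)"
```

On minified JS almost every operand arrives as `"any"`, so both functions fall through to the call, which is why this tier is expected to help typed programs far more than bundles.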

## Principle
perry's codegen is **"dynamic everything."** Sustainable = **specialize where the type is known (Tier 3), outline the residual dynamic machinery into shared helpers (Tier 1), and stop emitting redundant scaffolding (Tier 2).**

## Sequencing
For untyped bundles, Tier 1+2 do the heavy lifting and need no type info: **A → C → D → B.** Each step is independently shippable and is measured on two things: IR-line reduction and a hot-loop benchmark that must stay flat. Tier 3 is the strategic follow-on.
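
The per-step acceptance check could look like the following. It is a hypothetical helper, not an existing script, and the slowdown tolerance is left as a required parameter because the issue does not fix one.

```python
def step_is_acceptable(
    ir_lines_before: int,
    ir_lines_after: int,
    bench_before_s: float,
    bench_after_s: float,
    max_slowdown: float,
) -> tuple[bool, float, float]:
    """Return (ok, ir_reduction, slowdown) for one roadmap step.

    ir_reduction is the fraction of IR lines removed; slowdown is the relative
    change in hot-loop benchmark time (positive means slower).
    """
    ir_reduction = 1.0 - ir_lines_after / ir_lines_before
    slowdown = bench_after_s / bench_before_s - 1.0
    ok = ir_reduction > 0.0 and slowdown <= max_slowdown
    return ok, ir_reduction, slowdown
```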

Context: surfaced while taking a real 13MB app all the way through perry's pipeline (parse → lower → transform → codegen → clang). Every discrete/correctness wall is already fixed (separate PRs); this issue is the remaining *efficiency* frontier.

