Seam-inlining not firing on the gust hot path (mix/transition survive as calls) #226

# Proposal for `pulseengine/loom`: seam-inlining is not firing on the gust hot path

**Context.** Benchmarked loom v1.1.14 as the wasm→wasm stage in the gust dissolve pipeline
(meld → loom → synth → cortex-m3). Honest finding up front: **the loom output is already
good at the wasm level.** The 3.9× `gust_poll` size gap vs native LLVM is overwhelmingly
synth's ARM-lowering problem (separate proposal), not loom's. This proposal is the smaller,
loom-scoped opportunity.

## What loom did well (so we don't regress it)

The `$gust_poll` body loom emitted is compact and competitive with the LLVM IR shape:

- **8 wasm locals**, no obvious redundant locals.
- **Branch-free wrap via `select`** — matches LLVM's `it ne; movne` if-conversion:
  ```wat
  local.get 5  i32.const 1  i32.add  local.tee 5
  local.get 5  i32.const 6  i32.eq   select        ;; (idx+1 == 6) ? 0 : idx+1
  ```
- **Offset loads** preserved: `i32.load offset=52`, `i32.load16_u offset=4`,
  `i32.load8_u offset=4` — the structure synth *should* fold into addressing modes.
- `i32.rem_u` left intact (LLVM strength-reduces it; that's a synth lowering choice).

No dead code, no obvious local bloat. Good.
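
For reference, the `select` wrap above corresponds to a source-level pattern along these lines. This is a minimal sketch, not the gust source: `wrap_next`, `wrap_next_rem`, and `LEN` are hypothetical names, and only the constant 6 is taken from the wat above. It also illustrates why a compare-and-select and an `i32.rem_u` are interchangeable as long as the index is already in range.

```rust
/// Hypothetical ring length; 6 is the constant visible in the wat above.
const LEN: u32 = 6;

/// Advance a ring index by one, wrapping to 0 at LEN.
/// Assumes idx < LEN on entry, so a compare-and-select is enough
/// and no division is needed.
fn wrap_next(idx: u32) -> u32 {
    let next = idx + 1;
    if next == LEN { 0 } else { next }
}

/// The same step written with a remainder, which would lower to `i32.rem_u`.
fn wrap_next_rem(idx: u32) -> u32 {
    (idx + 1) % LEN
}
```

Both functions agree for every `idx` in `0..LEN`; they differ only for out-of-range inputs, which is why the choice between them is a lowering decision rather than a correctness one.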

## The loom-scoped gap: the two call seams are NOT inlined

`$gust_poll` still contains three real `call` sites (two to `transition`, one to `mix`) that survived loom:

```wat
call $_ZN10kiln_async4task18TaskTable$LT$_$GT$10transition…   ;; called TWICE (id=1, id=2)
call $_ZN15gust_wasm_scout3mix…                               ;; the failsafe mixer
```

(confirmed by `wasm-tools print` of `gust.loom.wasm`: `transition`, `mix`,
`panic_bounds_check`, and `panic_fmt` are all still separate functions, and the
corresponding relocations `R_ARM_THM_CALL func_0/func_1/func_4/func_5` survive into the
final `.o`.)

This matters two ways:

1. **`mix` is `#[inline]` in the source and is tiny** (a few adds, a shift, two clamps).
   LLVM inlined it into `gust_poll`: the native `gust_poll` has *no* call to a separate
   mixer, and what would have been a `bl` in the 0x64/0xb0/0xc2 region is inlined
   arithmetic plus the closure. In the dissolved path it stays an out-of-line `call`,
   which (a) keeps a second function body alive in the kernel `.text` and (b) blocks synth
   from allocating registers across the seam.

2. **The two `transition` calls** are the scheduler closure seam — exactly the
   "seam-dissolve / full inlining" loom#219 is meant to handle. Until they inline, synth
   has to set up a full call frame (spill live values around the `bl`) at each site, which
   amplifies synth's already-poor spill behavior.
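
To make item 1 concrete, here is a sketch of the *shape* of callee being described. It is not the real `gust_wasm_scout::mix`: the name `mix_sketch`, the signature, the parameter names, and the 0..=1000 output range are all assumptions, chosen only to match the "few adds, a shift, two clamps" description.

```rust
/// Hypothetical stand-in for a small `#[inline]` mixer.
/// Everything here (inputs, output range) is assumed for illustration.
#[inline]
fn mix_sketch(throttle: i32, roll: i32, pitch: i32) -> (i32, i32) {
    // One shift and one add to form the common term.
    let base = throttle + (pitch >> 1);
    // Two more adds/subs, then two clamps into the assumed output range.
    let left = (base + roll).clamp(0, 1000);
    let right = (base - roll).clamp(0, 1000);
    (left, right)
}
```

A body of this size is typically smaller than the argument shuffle and spill sequence needed to call it, which is the argument for inlining it at the wasm level before synth ever sees the seam.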

### Evidence — the call frames synth is forced to build around the un-inlined seams

Because the `mix`/`transition` calls survive, synth materializes ABI argument shuffles and
spills at each site, e.g. around the second `transition` call (dissolved 0x49c–0x4c4):

```asm
add.w  r2, r8, #0x4
ldr.w  r1, [sp, #0xc]
ldr.w  r4, [sp, #0x1c]
ldr.w  r5, [sp, #0x20]
movw   r6, #0x2
str.w  r6, [sp]          ; 5th arg via stack
str.w  r0, [sp, #0x28]   ; caller-save spill across the call
mov    r0, r2            ; ABI argument shuffle into r0-r3
mov    r1, r1            ; redundant self-move
mov    r2, r4
mov    r3, r5
bl     <transition>
```

If `mix` were inlined by loom, synth would never emit the `gust_mix`-style 44 B wrapper
frame inside the hot path, and the cross-seam values could stay in registers.

## Recommendation

1. **Make seam-inlining actually fire for `#[inline]`-annotated and small-leaf callees on
   the hot path (loom#219).** At minimum, inline `mix`: it's `#[inline]`, single-caller
   from `gust_poll`, and trivially small. The size win in the `.o` is the out-of-line call
   overhead for `mix` plus the wrapper frame synth builds for it.
2. **Inline the two `transition` closure-seam calls** so synth sees a single straight-line
   loop body — this is the prerequisite that lets synth's (future) regalloc keep loop
   state in registers across what are currently call boundaries.
3. **Add a dissolve-pipeline check / report** that flags surviving `call`s to
   `#[inline]`-marked or single-use small functions on exported hot-path entries, so this
   regression is visible in CI rather than only in a hand-run `compare-codegen.sh`.
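
The check in item 3 could start as a plain text scan over `wasm-tools print` output. The following is a rough sketch under stated assumptions: it expects the name section to be present so that calls print as `call $name`, one instruction per line, with each function starting on a `(func $name` line. `surviving_calls` and its parameters are hypothetical, not an existing loom or wasm-tools API, and the substring match on callee names is deliberately crude.

```rust
/// Scan WAT text and return the targets of `call` instructions inside the
/// function named `entry` whose names contain any `must_inline` substring.
/// Assumes one instruction per line and `(func $name ...` function headers.
fn surviving_calls(wat: &str, entry: &str, must_inline: &[&str]) -> Vec<String> {
    let mut in_entry = false;
    let mut found = Vec::new();
    for line in wat.lines() {
        let t = line.trim_start();
        if t.starts_with("(func ") {
            // New function header: track whether this is the hot-path entry.
            in_entry = t.split_whitespace().nth(1) == Some(entry);
            continue;
        }
        if !in_entry {
            continue;
        }
        let mut toks = t.split_whitespace();
        while let Some(tok) = toks.next() {
            if tok == "call" {
                if let Some(target) = toks.next() {
                    // Tolerate a trailing ')' from folded-style output.
                    let target = target.trim_end_matches(')');
                    if must_inline.iter().any(|m| target.contains(m)) {
                        found.push(target.to_string());
                    }
                }
            }
        }
    }
    found
}
```

A CI step could fail when the returned list is non-empty for `$gust_poll`, with `must_inline` set to something like `["transition", "mix"]`. A sturdier version would parse the module properly and key off function size or use count rather than names.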

**Honest sizing caveat:** because the dominant 3.9× cost is synth-side spilling, inlining
alone will *not* close the gap — but it is the necessary enabler. With synth's regalloc
fix (separate proposal) **plus** seam inlining, the cross-seam register pressure relief
compounds: the single biggest synth win (eliminating per-call spill/reload of loop-live
values) is only reachable once loom removes the call boundaries inside the loop.


---
_Filed from gale's `gust` dissolve benchmark (`benches/gust/compare-codegen.sh`). Reproduce: build native thumbv7m vs wasm→loom→synth→cortex-m3, disassemble with `llvm-objdump -d --triple=thumbv7m`. Measured on synth v0.11.50 / loom v1.1.14._
