Seam-inlining not firing on the gust hot path (mix/transition survive as calls) #226

Description

@avrabe

Proposal for pulseengine/loom: seam-inlining is not firing on the gust hot path

Context. Benchmarked loom v1.1.14 as the wasm→wasm stage in the gust dissolve pipeline
(meld → loom → synth → cortex-m3). Honest finding up front: the loom output is already
good at the wasm level. The 3.9× gust_poll size gap vs native LLVM is overwhelmingly
synth's ARM-lowering problem (separate proposal), not loom's. This proposal is the smaller,
loom-scoped opportunity.

What loom did well (so we don't regress it)

The $gust_poll body loom emitted is compact and competitive with the LLVM IR shape:

  • 8 wasm locals, no obvious redundant locals.
  • Branch-free wrap via select — matches LLVM's it ne; movne if-conversion:
    local.get 5  i32.const 1  i32.add  local.tee 5
    local.get 5  i32.const 6  i32.eq   select        ;; (idx+1 == 6) ? 0 : idx+1
  • Offset loads preserved: i32.load offset=52, i32.load16_u offset=4,
    i32.load8_u offset=4 — the structure synth should fold into addressing modes.
  • i32.rem_u left intact (LLVM strength-reduces it; that's a synth lowering choice).

No dead code, no obvious local bloat. Good.
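
For readers comparing the select form against the source-level wrap, here is a minimal Rust sketch of the two shapes. The ring length 6 is taken from the `i32.const 6` in the listing above; the function names are hypothetical and are not from the gust source.

```rust
/// Ring length, taken from the `i32.const 6` in the wasm listing.
const RING_LEN: u32 = 6;

/// Wrap as loom emitted it: compare-and-select, no division.
/// (Hypothetical name; mirrors `i32.eq` + `select`.)
fn next_idx_select(idx: u32) -> u32 {
    let next = idx + 1;
    // (idx+1 == 6) ? 0 : idx+1
    if next == RING_LEN { 0 } else { next }
}

/// The same wrap written with a remainder, as source code often does.
/// (Hypothetical name.)
fn next_idx_rem(idx: u32) -> u32 {
    (idx + 1) % RING_LEN
}
```

The two forms agree only while idx stays in 0..6; for an out-of-range idx the select form does not wrap, so the equivalence relies on the index invariant holding.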

The loom-scoped gap: the two call seams are NOT inlined

$gust_poll still contains three call sites (two distinct callees) that survived loom:

call $_ZN10kiln_async4task18TaskTable$LT$_$GT$10transition  ;; called TWICE (id=1, id=2)
call $_ZN15gust_wasm_scout3mix                              ;; the failsafe mixer

(confirmed by wasm-tools print of gust.loom.wasm: transition, mix,
panic_bounds_check, and panic_fmt are all still separate functions, and the
corresponding relocations R_ARM_THM_CALL func_0/func_1/func_4/func_5 survive into the
final .o.)

This matters two ways:

  1. mix is #[inline] in the source and is tiny (a few adds, a shift, two clamps).
    LLVM inlined it into gust_poll (the native gust_poll has no call to a separate
    mixer; it's the bl at 0x64/0xb0/0xc2 region that LLVM turned into inlined
    arithmetic + the closure). In the dissolved path it stays an out-of-line call,
    which (a) keeps a second function body alive in the kernel .text and (b) blocks
    synth from regalloc-ing across the seam.
  2. The two transition calls are the scheduler closure seam — exactly the
    "seam-dissolve / full inlining" loom#219 is meant to handle. Until they inline, synth
    has to set up a full call frame (spill live values around the bl) at each site, which
    amplifies synth's already-poor spill behavior.
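
To make the size class of callee concrete, here is a hypothetical stand-in for a mixer of the shape described in item 1 (a few adds, a shift, two clamps). This is not the gust_wasm_scout source: the signature, the 0..=1000 output range, and the arithmetic are all invented for illustration.

```rust
/// Hypothetical stand-in for a small failsafe mixer. NOT the real
/// gust_wasm_scout code; signature and constants are assumptions.
#[inline]
pub fn mix(throttle: i32, steer: i32, trim: i32) -> (i32, i32) {
    let base = (throttle + trim) >> 1;         // add + shift
    let left = (base + steer).clamp(0, 1000);  // add + clamp
    let right = (base - steer).clamp(0, 1000); // sub + clamp
    (left, right)
}

/// Hypothetical caller: `idx` is loop state that is live across the call
/// to `mix`, which is what forces a spill when the call is not inlined.
pub fn poll_step(idx: u32, throttle: i32, steer: i32) -> (u32, i32) {
    let (left, right) = mix(throttle, steer, 0);
    (idx + 1, left + right)
}
```

For a callee this small, inlining at the wasm level would replace the call with a handful of arithmetic instructions, so values such as idx would no longer need to survive a call boundary.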

Evidence — the call frames synth is forced to build around the un-inlined seams

Because the mix/transition calls survive, synth materializes ABI argument shuffles and
spills at each site, e.g. around the second transition call (dissolved 0x49c–0x4c4):

add.w  r2, r8, #0x4
ldr.w  r1, [sp, #0xc]
ldr.w  r4, [sp, #0x1c]
ldr.w  r5, [sp, #0x20]
movw   r6, #0x2
str.w  r6, [sp]          ; 5th arg via stack
str.w  r0, [sp, #0x28]   ; caller-save spill across the call
mov    r0, r2  / mov r1,r1 / mov r2,r4 / mov r3,r5
bl     <transition>

If mix were inlined by loom, synth would never emit the gust_mix-style 44 B wrapper
frame inside the hot path, and the cross-seam values could stay in registers.
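
The "5th arg via stack" store in the listing follows from the 32-bit ARM procedure call standard. A simplified Rust sketch of that placement rule, assuming every argument is a single 32-bit integer (64-bit and floating-point arguments are ignored here); the type and function names are made up for illustration:

```rust
/// Where a word-sized integer argument is passed under AAPCS32.
/// Simplified model: all arguments are assumed to be 32-bit integers.
#[derive(Debug, PartialEq)]
pub enum ArgLoc {
    /// Passed in core register r0..r3.
    Reg(u8),
    /// Passed on the stack at this byte offset from sp at the call.
    Stack(u32),
}

/// Location of the argument at zero-based position `index`.
pub fn arg_location(index: u32) -> ArgLoc {
    if index < 4 {
        ArgLoc::Reg(index as u8)
    } else {
        // Fifth and later arguments take consecutive 4-byte stack slots.
        ArgLoc::Stack((index - 4) * 4)
    }
}
```

Under this model the fifth argument (index 4) lands at [sp, #0], which matches the str.w r6, [sp] in the sequence above.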

Recommendation

  1. Make seam-inlining actually fire for #[inline]-annotated and small-leaf callees on
    the hot path (loom#219). At minimum, inline mix: it's #[inline], single-caller
    from gust_poll, and trivially small. The size win in the .o is the call overhead
    at the mix call site plus the wrapper frame synth builds for it.
  2. Inline the two transition closure-seam calls so synth sees a single straight-line
    loop body — this is the prerequisite that lets synth's (future) regalloc keep loop
    state in registers across what are currently call boundaries.
  3. Add a dissolve-pipeline check / report that flags surviving calls to
    #[inline]-marked or single-use small functions on exported hot-path entries, so this
    regression is visible in CI rather than only in a hand-run compare-codegen.sh.
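
One possible shape for the check in item 3 is a scan of the wasm-tools print output for direct calls left inside a named function. The sketch below is a rough illustration, not an existing loom feature. It assumes the text dump starts each function on a line beginning with `(func $name` and puts at most a few instructions per line; the function names are hypothetical, and indirect calls are not considered.

```rust
/// True if `line` is a `(func $name ...` header for exactly `func`.
fn is_header_for(line: &str, func: &str) -> bool {
    match line.strip_prefix("(func ") {
        Some(rest) => {
            rest.split(|c: char| c.is_whitespace() || c == ')').next() == Some(func)
        }
        None => false,
    }
}

/// Targets of direct `call` instructions inside `func`, in order,
/// with repeats kept so a callee called twice is reported twice.
/// Assumes a line-oriented text dump (see the lead-in above).
pub fn surviving_calls(wat: &str, func: &str) -> Vec<String> {
    let mut in_func = false;
    let mut calls = Vec::new();
    for line in wat.lines() {
        let trimmed = line.trim();
        if trimmed.starts_with("(func ") {
            // Each function header switches the scan on or off.
            in_func = is_header_for(trimmed, func);
        }
        if !in_func {
            continue;
        }
        // Ignore `;;` line comments before tokenizing.
        let code = trimmed.split(";;").next().unwrap_or("");
        let mut toks = code.split_whitespace();
        while let Some(tok) = toks.next() {
            if tok == "call" || tok == "(call" {
                if let Some(target) = toks.next() {
                    calls.push(target.trim_end_matches(')').to_string());
                }
            }
        }
    }
    calls
}
```

A CI step could compare the returned list for each exported hot-path entry against an allowlist (for example the panic helpers) and fail when an unexpected callee shows up.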

Honest sizing caveat: because the dominant 3.9× cost is synth-side spilling, inlining
alone will not close the gap — but it is the necessary enabler. With synth's regalloc
fix (separate proposal) plus seam inlining, the cross-seam register pressure relief
compounds: the single biggest synth win (eliminating per-call spill/reload of loop-live
values) is only reachable once loom removes the call boundaries inside the loop.


Filed from gale's gust dissolve benchmark (benches/gust/compare-codegen.sh). Reproduce: build native thumbv7m vs wasm→loom→synth→cortex-m3, disassemble with llvm-objdump -d --triple=thumbv7m. Measured on synth v0.11.50 / loom v1.1.14.
