Proposal for pulseengine/loom: seam-inlining is not firing on the gust hot path
Context. Benchmarked loom v1.1.14 as the wasm→wasm stage in the gust dissolve pipeline
(meld → loom → synth → cortex-m3). Honest finding up front: the loom output is already
good at the wasm level. The 3.9× gust_poll size gap vs native LLVM is overwhelmingly
synth's ARM-lowering problem (separate proposal), not loom's. This proposal is the smaller,
loom-scoped opportunity.
What loom did well (so we don't regress it)
The $gust_poll body loom emitted is compact and competitive with the LLVM IR shape:
- 8 wasm locals, no obvious redundant locals.
- Branch-free wrap via select, matching LLVM's it ne; movne if-conversion:
local.get 5 i32.const 1 i32.add local.tee 5
local.get 5 i32.const 6 i32.eq select ;; (idx+1 == 6) ? 0 : idx+1
- Offset loads preserved: i32.load offset=52, i32.load16_u offset=4, i32.load8_u offset=4. This is the structure synth should fold into addressing modes.
- i32.rem_u left intact (LLVM strength-reduces it; that's a synth lowering choice).
No dead code, no obvious local bloat. Good.
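For readers less familiar with the select idiom, here is a minimal sketch of what the wrap computes. The ring length of 6 is taken from the i32.const 6 in the snippet above; the function names and the remainder-based reference form are illustrative assumptions, not code from gust or loom.

```rust
/// Ring length, assumed from the `i32.const 6` in the wasm snippet.
const RING_LEN: u32 = 6;

/// Same result as the wasm `select` form: compute idx+1, then pick 0 or idx+1
/// depending on whether idx+1 reached the ring length. Both arms are
/// side-effect free, which is what lets a compiler lower this to a
/// select / conditional move instead of a branch.
fn wrap_select(idx: u32) -> u32 {
    let next = idx.wrapping_add(1); // wasm i32.add wraps on overflow
    if next == RING_LEN { 0 } else { next }
}

/// Remainder-based reference form (hypothetical, for comparison only).
/// It agrees with `wrap_select` only while idx < RING_LEN.
fn wrap_rem(idx: u32) -> u32 {
    idx.wrapping_add(1) % RING_LEN
}
```

The two forms agree for every in-range index (0 through 5), so the select version is a valid replacement for a remainder only when the index is known to stay below the ring length.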
The loom-scoped gap: the two call seams are NOT inlined
$gust_poll still contains three real call sites (two distinct callees) that survived loom:
call $_ZN10kiln_async4task18TaskTable$LT$_$GT$10transition… ;; called TWICE (id=1, id=2)
call $_ZN15gust_wasm_scout3mix… ;; the failsafe mixer
(confirmed by wasm-tools print of gust.loom.wasm: transition, mix,
panic_bounds_check, and panic_fmt are all still separate functions, and the
corresponding relocations R_ARM_THM_CALL func_0/func_1/func_4/func_5 survive into the
final .o.)
This matters two ways:
- mix is #[inline] in the source and is tiny (a few adds, a shift, two clamps). LLVM inlined it into gust_poll (the native gust_poll has no call to a separate mixer; it's the bl at the 0x64/0xb0/0xc2 region that LLVM turned into inlined arithmetic + the closure). In the dissolved path it stays an out-of-line call, which (a) keeps a second function body alive in the kernel .text and (b) blocks synth from regalloc-ing across the seam.
- The two transition calls are the scheduler closure seam, exactly the "seam-dissolve / full inlining" loom#219 is meant to handle. Until they inline, synth has to set up a full call frame (spill live values around the bl) at each site, which amplifies synth's already-poor spill behavior.
Evidence — the call frames synth is forced to build around the un-inlined seams
Because the mix/transition calls survive, synth materializes ABI argument shuffles and
spills at each site, e.g. around the second transition call (dissolved 0x49c–0x4c4):
add.w r2, r8, #0x4
ldr.w r1, [sp, #0xc]
ldr.w r4, [sp, #0x1c]
ldr.w r5, [sp, #0x20]
movw r6, #0x2
str.w r6, [sp] ; 5th arg via stack
str.w r0, [sp, #0x28] ; caller-save spill across the call
mov r0, r2 / mov r1,r1 / mov r2,r4 / mov r3,r5
bl <transition>
If mix were inlined by loom, synth would never emit the gust_mix-style 44 B wrapper
frame inside the hot path, and the cross-seam values could stay in registers.
Recommendation
- Make seam-inlining actually fire for #[inline]-annotated and small-leaf callees on the hot path (loom#219). At minimum, inline mix: it's #[inline], has a single caller (gust_poll), and is trivially small. The size win in the .o is the call overhead around mix plus the wrapper frame synth builds for it.
- Inline the two transition closure-seam calls so synth sees a single straight-line loop body. This is the prerequisite that lets synth's (future) regalloc keep loop state in registers across what are currently call boundaries.
- Add a dissolve-pipeline check / report that flags surviving calls to #[inline]-marked or single-use small functions on exported hot-path entries, so this regression is visible in CI rather than only in a hand-run compare-codegen.sh.
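The check in the last item could start as a plain text scan over the wasm-tools print output. The sketch below is one possible shape, not an existing loom or gale tool. It assumes the flat text layout where each function opens with a line beginning "(func $name" and each direct call sits on its own line as "call $callee". Because the wasm module does not record #[inline], the list of callees expected to be inlined is supplied by hand as name fragments. All function names here are hypothetical.

```rust
use std::collections::BTreeMap;

/// Maps each function name to the callees of the direct `call`
/// instructions found in its body, in order of appearance.
/// Assumes one instruction per line, as in flat WAT text output.
fn direct_calls(wat: &str) -> BTreeMap<String, Vec<String>> {
    let mut calls: BTreeMap<String, Vec<String>> = BTreeMap::new();
    let mut current: Option<String> = None;
    for line in wat.lines() {
        let t = line.trim();
        if let Some(rest) = t.strip_prefix("(func ") {
            // The first token after "(func " is treated as the function name.
            let name = rest
                .split_whitespace()
                .next()
                .unwrap_or("")
                .trim_end_matches(')')
                .to_string();
            calls.entry(name.clone()).or_default();
            current = Some(name);
        } else if let Some(rest) = t.strip_prefix("call ") {
            if let (Some(f), Some(callee)) = (current.as_ref(), rest.split_whitespace().next()) {
                calls
                    .entry(f.clone())
                    .or_default()
                    .push(callee.trim_end_matches(')').to_string());
            }
        }
    }
    calls
}

/// Returns the calls made from `entry` whose callee name contains any of
/// the hand-maintained `must_inline` fragments (an assumed allowlist).
fn surviving_seams(wat: &str, entry: &str, must_inline: &[&str]) -> Vec<String> {
    direct_calls(wat)
        .get(entry)
        .map(|cs| {
            cs.iter()
                .filter(|c| must_inline.iter().any(|m| c.contains(*m)))
                .cloned()
                .collect::<Vec<String>>()
        })
        .unwrap_or_default()
}
```

A CI step could run this over gust.loom.wasm's printed text with "$gust_poll" as the entry and fail when the returned list is non-empty. Substring matching on mangled names is deliberately loose and would need tightening before real use.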
Honest sizing caveat: because the dominant 3.9× cost is synth-side spilling, inlining
alone will not close the gap — but it is the necessary enabler. With synth's regalloc
fix (separate proposal) plus seam inlining, the cross-seam register pressure relief
compounds: the single biggest synth win (eliminating per-call spill/reload of loop-live
values) is only reachable once loom removes the call boundaries inside the loop.
Filed from gale's gust dissolve benchmark (benches/gust/compare-codegen.sh). Reproduce: build native thumbv7m vs wasm→loom→synth→cortex-m3, disassemble with llvm-objdump -d --triple=thumbv7m. Measured on synth v0.11.50 / loom v1.1.14.