Context: loom's --passes inline currently dissolves the gale C↔Rust decide seam into the kernel-primitive shims (sem/mutex/pipe/…). In practice it inlines the call but doesn't yet dissolve the seam: the decide's u64-packed-return ABI survives the inline as a pack-immediately-unpack round-trip. This is the bulk of our gap vs LLVM-LTO (sem handoff: 860 cyc wasm-cross-LTO vs 471 LLVM-LTO = 1.83×; G474RE silicon).
Evidence (wasm-IR, post-loom-inline — frozen repro attached)
repro-loom-seam-sroa/sem.loom.wasm (the dissolved z_impl_k_sem_give) still contains, on the same value:
i64.extend_i32_u ; i64.shl ; i64.or ← decide PACKS {action, new_count} into a u64
...
i64.and ; i64.shr_u ← shim UNPACKS it back to scalars
(local i32 i32 i64) ← the dead i64 carrier local
The u64 is constructed from two i32s and immediately decomposed — textbook SROA / scalar-forwarding. LLVM-LTO sees through this and it vanishes; we keep it.
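To make the round-trip concrete, here is a minimal Rust sketch of the shape described above. The `Decision` struct, the function names, and the bit layout (action in the low 8 bits, new_count in the upper 32) are assumptions for illustration only; the real gale decide ABI may pack the fields differently.

```rust
/// Hypothetical decide result: the two scalars that get packed.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Decision {
    pub action: u32,
    pub new_count: u32,
}

// ASSUMED layout, not taken from the gale sources.
const ACTION_MASK: u64 = 0xff;
const COUNT_SHIFT: u32 = 32;

/// Decide side: i64.extend_i32_u ; i64.shl ; i64.or
pub fn pack(d: Decision) -> u64 {
    (d.action as u64) | ((d.new_count as u64) << COUNT_SHIFT)
}

/// Shim side: i64.and ; i64.shr_u
pub fn unpack(packed: u64) -> Decision {
    Decision {
        action: (packed & ACTION_MASK) as u32,
        new_count: (packed >> COUNT_SHIFT) as u32,
    }
}

/// Inlined but not dissolved: the u64 carrier is built and torn down at once.
pub fn give_via_carrier(action: u32, new_count: u32) -> (u32, u32) {
    let carrier = pack(Decision { action, new_count });
    let d = unpack(carrier);
    (d.action, d.new_count)
}

/// Dissolved: the scalars are forwarded and no u64 ever exists.
pub fn give_forwarded(action: u32, new_count: u32) -> (u32, u32) {
    (action & 0xff, new_count)
}
```

Both functions return the same pair for every input, which is why the carrier is pure overhead once the call has been inlined.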
Cost it produces (synth ARM backend, z_impl_k_sem_give = 83 insns / 540 B object)
- u64 round-trip: str.w r0,[sp,#0x8]; str.w r1,[sp,#0xc] (spill the packed halves) then and.w r4,r0,#0xff (unpack action); pure ABI residue.
- wasm-local reload churn: 19/83 insns (23%) are [sp] traffic; the pointer arg is reloaded 5× from [sp,#0x68] (each local.get → memory).
- const re-materialization: 11 movw/movt (the 0xff/0x1 unpack masks + linmem base, not hoisted).
Recommended passes (all wasm-IR → benefit ARM and RISC-V backends; complementary to synth#209's backend regalloc)
- SROA / scalar-forwarding through the inlined seam (highest value): when an i64 is built by extend/shl/or and consumed by and/shr_u with no escape, forward the scalar components directly and drop the carrier local. Kills the pack/unpack outright. This is what turns "inlined" into "dissolved."
- wasm-local mem2reg / promotion + coalescing: promote non-escaping single-assignment locals and shrink the live set so the 5× pointer reload + store-then-reload chains collapse to register keeps.
- const dedup / hoisting: materialize each constant once (also addresses the documented 61% const-redundancy in flat_flight).
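The first pass can be sketched on a toy expression IR. Everything below is hypothetical: `Expr`, `as_pack`, `forward`, and `uses_carrier` are illustrative names, not loom's real IR or API, and the sketch assumes a fixed 32-bit shift for the high half. Because the toy IR is a pure expression tree, the "no escape" condition holds trivially; a real pass would have to prove that the carrier local has no other uses.

```rust
/// Toy i64 expression IR (NOT loom's real IR).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Local32(u32),             // i32 value read from local N
    ExtendU(Box<Expr>),       // i64.extend_i32_u
    Shl(Box<Expr>, u32),      // i64.shl by a constant
    Or(Box<Expr>, Box<Expr>), // i64.or
    And(Box<Expr>, u64),      // i64.and with a constant mask
    ShrU(Box<Expr>, u32),     // i64.shr_u by a constant
}

/// Matches `extend(lo) | (extend(hi) << 32)` and returns (lo, hi).
fn as_pack(e: &Expr) -> Option<(&Expr, &Expr)> {
    if let Expr::Or(l, r) = e {
        if let (Expr::ExtendU(lo), Expr::Shl(inner, 32)) = (&**l, &**r) {
            if let Expr::ExtendU(hi) = &**inner {
                return Some((&**lo, &**hi));
            }
        }
    }
    None
}

/// Forwards scalar components through a pack that is immediately unpacked.
pub fn forward(e: &Expr) -> Expr {
    match e {
        // Low-half unpack: a mask within bits 0..32 never sees `hi`.
        Expr::And(inner, mask) => match as_pack(inner) {
            Some((lo, _hi)) if *mask <= u32::MAX as u64 => {
                Expr::And(Box::new(Expr::ExtendU(Box::new(forward(lo)))), *mask)
            }
            _ => Expr::And(Box::new(forward(inner)), *mask),
        },
        // High-half unpack: shifting right by 32 discards `lo` entirely.
        Expr::ShrU(inner, 32) => match as_pack(inner) {
            Some((_lo, hi)) => Expr::ExtendU(Box::new(forward(hi))),
            None => Expr::ShrU(Box::new(forward(inner)), 32),
        },
        Expr::ShrU(inner, n) => Expr::ShrU(Box::new(forward(inner)), *n),
        Expr::Shl(inner, n) => Expr::Shl(Box::new(forward(inner)), *n),
        Expr::Or(l, r) => Expr::Or(Box::new(forward(l)), Box::new(forward(r))),
        Expr::ExtendU(inner) => Expr::ExtendU(Box::new(forward(inner))),
        Expr::Local32(i) => Expr::Local32(*i),
    }
}

/// True if any shl/or node (part of the carrier construction) survives.
pub fn uses_carrier(e: &Expr) -> bool {
    match e {
        Expr::Or(..) | Expr::Shl(..) => true,
        Expr::ExtendU(x) | Expr::And(x, _) | Expr::ShrU(x, _) => uses_carrier(x),
        Expr::Local32(_) => false,
    }
}

/// Reference evaluator, used only to check that the rewrite preserves values.
pub fn eval(e: &Expr, locals: &[u32]) -> u64 {
    match e {
        Expr::Local32(i) => locals[*i as usize] as u64,
        Expr::ExtendU(x) => eval(x, locals) & 0xffff_ffff,
        Expr::Shl(x, n) => eval(x, locals) << *n,
        Expr::Or(l, r) => eval(l, locals) | eval(r, locals),
        Expr::And(x, m) => eval(x, locals) & *m,
        Expr::ShrU(x, n) => eval(x, locals) >> *n,
    }
}
```

Applied to `pack & 0xff` and `pack >> 32`, `forward` leaves only an extend (plus the mask for the low half), so the or/shl nodes and the i64 carrier they feed have no remaining users.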
Parity read
Bit-exact 1.0× vs LLVM-LTO is unlikely (their regalloc is decades-tuned), but within ~20% is achievable, and the in-context overhead is already +11% (composed flight_control bench). Levers 1+2 alone should take the sem body from 83 insns toward ~55–60 (drop the u64 round-trip + most [sp] churn) — a large chunk of the 1.83×.
Kill-criterion
On a build with pass #1: repro-loom-seam-sroa/sem.loom.wasm → synth shows no i64 pack/unpack in the dissolved body, the i64 local is gone, and the ARM body drops below ~70 insns. I'm the on-silicon gate (G474RE) — I'll re-measure sem 860 + mutex 472 + the composed-bench deltas the moment a build lands.
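The wasm-side half of this criterion could be automated with a coarse text scan. The sketch below is an assumption-laden helper, not part of loom or synth: `meets_kill_criterion` is a hypothetical name, it expects a textual disassembly of the dissolved function body, and it rejects any occurrence of the five i64 opcodes from the evidence section even if they belong to unrelated code. The 70-insn threshold is the approximate figure stated above.

```rust
/// i64 opcodes that make up the pack/unpack round-trip in the evidence.
const CARRIER_OPS: [&str; 5] = [
    "i64.extend_i32_u",
    "i64.shl",
    "i64.or",
    "i64.and",
    "i64.shr_u",
];

/// Approximate ARM instruction-count ceiling from the kill-criterion.
const MAX_ARM_INSNS: usize = 70;

/// Returns true if the disassembled body shows no i64 pack/unpack ops,
/// declares no i64 local, and the ARM body is under the ceiling.
pub fn meets_kill_criterion(wat_body: &str, arm_insns: usize) -> bool {
    for line in wat_body.lines() {
        // Split on whitespace and parentheses so "(local i32 i64)" tokenizes.
        let toks: Vec<&str> = line
            .split(|c: char| c.is_whitespace() || c == '(' || c == ')')
            .filter(|t| !t.is_empty())
            .collect();
        if toks.iter().any(|t| CARRIER_OPS.contains(t)) {
            return false; // pack/unpack opcode still present
        }
        if toks.first() == Some(&"local") && toks.contains(&"i64") {
            return false; // the i64 carrier local is still declared
        }
    }
    arm_insns < MAX_ARM_INSNS
}
```

This only gates the static shape of the output; the cycle counts still have to come from the on-silicon re-measurement.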
Repro: gale-smart-data/.../wasm-testbed/repro-loom-seam-sroa/ (sem.loom.wasm + shim). xref synth#209 (backend regalloc/const-CSE — the complementary lever).