
opt: dissolve the seam fully — post-inline SROA/scalar-forwarding eliminates the u64 ABI pack/unpack (+ wasm-local mem2reg) → toward LLVM-LTO parity #219

Description

@avrabe

Context: loom's --passes inline currently dissolves the gale C↔Rust decide seam into the kernel-primitive shims (sem/mutex/pipe/…), but only partially. It inlines the call but does not yet fully dissolve the seam: the decide's u64-packed-return ABI survives the inline as a pack-then-immediately-unpack round-trip. This is the bulk of our gap vs LLVM-LTO (sem handoff: 860 cyc wasm-cross-LTO vs 471 cyc LLVM-LTO = 1.83×; measured on G474RE silicon).

Evidence (wasm-IR, post-loom-inline — frozen repro attached)

repro-loom-seam-sroa/sem.loom.wasm (the dissolved z_impl_k_sem_give) still contains, on the same value:

i64.extend_i32_u ; i64.shl ; i64.or     ← decide PACKS {action, new_count} into a u64
...
i64.and ; i64.shr_u                      ← shim UNPACKS it back to scalars
(local i32 i32 i64)                      ← the dead i64 carrier local

The u64 is constructed from two i32s and immediately decomposed — textbook SROA / scalar-forwarding. LLVM-LTO sees through this and it vanishes; we keep it.
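
To make the round-trip concrete, here is a minimal Python model of the two halves of the seam. The bit layout is an assumption for illustration only (action in the low byte, matching the 0xff mask in the ARM output; new_count in the upper 32 bits); the real gale layout may differ, and the function names are hypothetical.

```python
MASK32 = 0xFFFF_FFFF
MASK64 = 0xFFFF_FFFF_FFFF_FFFF
ACTION_MASK = 0xFF   # matches the 0xff unpack mask seen in the ARM output
COUNT_SHIFT = 32     # ASSUMED: new_count lives in the upper half of the u64


def pack(action, new_count):
    """decide side: i64.extend_i32_u ; i64.shl ; i64.or"""
    lo = action & MASK32                        # i64.extend_i32_u
    hi = (new_count & MASK32) << COUNT_SHIFT    # i64.extend_i32_u ; i64.shl
    return (hi | lo) & MASK64                   # i64.or


def unpack(packed):
    """shim side: i64.and ; i64.shr_u"""
    action = packed & ACTION_MASK               # i64.and
    new_count = packed >> COUNT_SHIFT           # i64.shr_u
    return action, new_count


def shim_with_carrier(action, new_count):
    """What survives the inline today: build the u64, then take it apart."""
    return unpack(pack(action, new_count))


def shim_forwarded(action, new_count):
    """What scalar forwarding should leave: no u64 carrier at all."""
    return action & ACTION_MASK, new_count & MASK32
```

Under this assumed layout the two shims agree for every pair of i32 inputs, which is why the i64 carrier is removable once pack and unpack sit in the same function body.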

Cost it produces (synth ARM backend, z_impl_k_sem_give = 83 insns / 540 B object)

  • u64 round-trip: str.w r0,[sp,#0x8]; str.w r1,[sp,#0xc] (spill the packed halves) then and.w r4,r0,#0xff (unpack action) — pure ABI residue.
  • wasm-local reload churn: 19/83 insns (23%) are [sp] traffic; the pointer arg is reloaded from [sp,#0x68] (each local.get → memory).
  • const re-materialization: 11 movw/movt (the 0xff/0x1 unpack masks + linmem base, not hoisted).

Recommended passes (all wasm-IR → benefit ARM and RISC-V backends; complementary to synth#209's backend regalloc)

  1. SROA / scalar-forwarding through the inlined seam (highest value) — when an i64 is built by extend/shl/or and consumed by and/shr_u with no escape, forward the scalar components directly and drop the carrier local. Kills the pack/unpack outright. This is what turns "inlined" into "dissolved."
  2. wasm-local mem2reg / promotion + coalescing — promote non-escaping single-assignment locals and shrink the live set so the 5× pointer reload + store-then-reload chains collapse to register keeps.
  3. const dedup / hoisting — materialize each constant once (also addresses the documented 61% const-redundancy in flat_flight).
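
As a sketch of what pass 1 has to recognize, the following Python toy rewrites a pack/unpack pair on a small expression IR. The IR (nested tuples), the function names, and the fixed shift of 32 are all hypothetical and chosen for illustration; this is not loom's IR or API, and a real pass would also need the no-escape check on the carrier local.

```python
# Toy expression IR (hypothetical, not loom's): nested tuples.
#   ("i32", name)     an i32 scalar
#   ("ext", e)        i64.extend_i32_u
#   ("shl", e, k)     i64.shl by constant k
#   ("shr_u", e, k)   i64.shr_u by constant k
#   ("or", a, b)      i64.or
#   ("and", e, mask)  i64.and with constant mask
M32, M64 = (1 << 32) - 1, (1 << 64) - 1


def match_pack(e):
    """Return (hi, lo) if e is or(shl(ext(hi), 32), ext(lo)), else None."""
    if e[0] != "or":
        return None
    for x, y in ((e[1], e[2]), (e[2], e[1])):
        if x[0] == "shl" and x[2] == 32 and x[1][0] == "ext" and y[0] == "ext":
            return x[1][1], y[1]
    return None


def forward(e):
    """Bottom-up rewrite that forwards scalars through a pack/unpack pair."""
    if e[0] == "i32":
        return e
    if e[0] == "ext":
        return ("ext", forward(e[1]))
    if e[0] == "or":
        return ("or", forward(e[1]), forward(e[2]))
    op, sub, k = e[0], forward(e[1]), e[2]   # shl / shr_u / and
    parts = match_pack(sub)
    if parts:
        hi, lo = parts
        if op == "shr_u" and k == 32:
            return ("ext", hi)               # upper half is exactly hi
        if op == "and" and 0 <= k <= M32:
            return ("and", ("ext", lo), k)   # mask only touches the low half
    return (op, sub, k)


def evaluate(e, env):
    """Reference semantics, used to check that a rewrite preserves values."""
    op = e[0]
    if op == "i32":
        return env[e[1]] & M32
    if op == "ext":
        return evaluate(e[1], env)
    if op == "or":
        return evaluate(e[1], env) | evaluate(e[2], env)
    v = evaluate(e[1], env)
    if op == "shl":
        return (v << e[2]) & M64
    if op == "shr_u":
        return v >> e[2]
    if op == "and":
        return v & e[2]
    raise ValueError(op)


action, count = ("i32", "action"), ("i32", "new_count")
packed = ("or", ("shl", ("ext", count), 32), ("ext", action))
unpack_action = ("and", packed, 0xFF)   # shim reads the action byte
unpack_count = ("shr_u", packed, 32)    # shim reads new_count
```

Here `forward(unpack_action)` reduces to an `and` on the extended action alone and `forward(unpack_count)` reduces to the extended count, so neither use references `packed` any more. Once every use has been forwarded this way, the carrier local is dead and ordinary dead-code elimination can drop it.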

Parity read

Exact 1.0× parity with LLVM-LTO is unlikely (their regalloc is decades-tuned), but within ~20% is achievable, and in context the overhead is already only +11% (composed flight_control bench). Levers 1+2 alone should take the sem body from 83 insns toward ~55–60 (drop the u64 round-trip + most [sp] churn), which accounts for a large chunk of the 1.83×.

Kill-criterion

On a build with pass #1: repro-loom-seam-sroa/sem.loom.wasm → synth shows no i64 pack/unpack in the dissolved body, the i64 local is gone, and the ARM body drops below ~70 insns. I'm the on-silicon gate (G474RE) — I'll re-measure sem 860 + mutex 472 + the composed-bench deltas the moment a build lands.
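
The wasm-IR half of this gate could be automated with a check along these lines. It is a coarse sketch that assumes a text disassembly of the function body is available (for example from wasm-tools print or wasm2wat) and that the dissolved body has no other legitimate i64 use; the function names and the default limit of 70 are illustrative, not part of loom or synth.

```python
import re

# The five ops that make up the pack/unpack round-trip in the repro.
PACK_UNPACK_OPS = ("i64.extend_i32_u", "i64.shl", "i64.or", "i64.and", "i64.shr_u")


def seam_residue(wat_body):
    """Return a list of seam residue found in one function's text disassembly."""
    findings = []
    for line in wat_body.splitlines():
        code = line.split(";;", 1)[0]   # drop WAT line comments
        for op in PACK_UNPACK_OPS:
            # Match the mnemonic as a whole token, not as a prefix of another.
            if re.search(r"(?<![\w.])" + re.escape(op) + r"(?![\w.])", code):
                findings.append(op)
        for m in re.finditer(r"\(local\b([^)]*)\)", code):
            if "i64" in m.group(1).split():
                findings.append("i64 local")
    return findings


def gate_passes(wat_body, arm_insn_count, limit=70):
    """Kill-criterion: no pack/unpack, no i64 local, ARM body under the limit."""
    return not seam_residue(wat_body) and arm_insn_count < limit
```

Because it flags any of the five i64 ops, this check can produce false positives on functions that use i64 arithmetic for other reasons; it is only meant for bodies like the dissolved z_impl_k_sem_give, where the seam is the sole i64 user.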

Repro: gale-smart-data/.../wasm-testbed/repro-loom-seam-sroa/ (sem.loom.wasm + shim). xref synth#209 (backend regalloc/const-CSE — the complementary lever).
