opt: dissolve the seam fully — post-inline SROA/scalar-forwarding eliminates the u64 ABI pack/unpack (+ wasm-local mem2reg) → toward LLVM-LTO parity

**Context:** loom's `--passes inline` currently inlines the gale C↔Rust decide seam into the kernel-primitive shims (sem/mutex/pipe/…). It **inlines the call** but doesn't yet **dissolve the seam**: the decide's `u64`-packed-return ABI survives the inline as a pack-then-immediately-unpack round-trip. This accounts for the bulk of our gap vs LLVM-LTO (sem handoff: **860 cyc wasm-cross-LTO vs 471 LLVM-LTO = 1.83×**, measured on G474RE silicon).

## Evidence (wasm-IR, post-loom-inline — frozen repro attached)
`repro-loom-seam-sroa/sem.loom.wasm` (the dissolved `z_impl_k_sem_give`) still contains, on the **same value**:
```
i64.extend_i32_u ; i64.shl ; i64.or     ← decide PACKS {action, new_count} into a u64
...
i64.and ; i64.shr_u                      ← shim UNPACKS it back to scalars
(local i32 i32 i64)                      ← the dead i64 carrier local
```
The `u64` is constructed from two i32s and immediately decomposed — textbook SROA / scalar-forwarding. LLVM-LTO sees through this and it vanishes; we keep it.
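For readers who want the round-trip at source level, here is a minimal Rust sketch of the same shape. The bit layout is an assumption: the `0xff` action mask comes from the ARM listing in this issue, but the 32-bit shift for `new_count` and all function names are hypothetical and not taken from the gale sources.

```rust
// Assumed layout (hypothetical): action in the low 8 bits, new_count in the high 32 bits.
const ACTION_MASK: u64 = 0xff;
const COUNT_SHIFT: u32 = 32;

/// What the decide side does: i64.extend_i32_u ; i64.shl ; i64.or
fn pack(action: u32, new_count: u32) -> u64 {
    ((new_count as u64) << COUNT_SHIFT) | (action as u64 & ACTION_MASK)
}

/// What the shim side does: i64.and ; i64.shr_u
fn unpack(packed: u64) -> (u32, u32) {
    let action = (packed & ACTION_MASK) as u32;
    let new_count = (packed >> COUNT_SHIFT) as u32;
    (action, new_count)
}

/// The post-inline body today: build the u64 carrier, then take it apart.
fn round_trip(action: u32, new_count: u32) -> (u32, u32) {
    unpack(pack(action, new_count))
}

/// The same result with the carrier removed: the scalars are forwarded directly.
fn forwarded(action: u32, new_count: u32) -> (u32, u32) {
    (action & 0xff, new_count)
}
```

`round_trip` and `forwarded` return the same pair for every input under this layout, which is the equivalence a scalar-forwarding pass would rely on.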

## Cost it produces (synth ARM backend, `z_impl_k_sem_give` = 83 insns / 540 B object)
- **u64 round-trip**: `str.w r0,[sp,#0x8]; str.w r1,[sp,#0xc]` (spill the packed halves) then `and.w r4,r0,#0xff` (unpack action) — pure ABI residue.
- **wasm-local reload churn**: 19/83 insns (23%) are `[sp]` traffic; the pointer arg is reloaded **5×** from `[sp,#0x68]` (each `local.get` → memory).
- **const re-materialization**: 11 `movw/movt` (the `0xff`/`0x1` unpack masks + linmem base, not hoisted).

## Recommended passes (all wasm-IR → benefit ARM **and** RISC-V backends; complementary to synth#209's backend regalloc)
1. **SROA / scalar-forwarding through the inlined seam** *(highest value)* — when an `i64` is built by `extend/shl/or` and consumed by `and/shr_u` with no escape, forward the scalar components directly and drop the carrier local. Kills the pack/unpack outright. **This is what turns "inlined" into "dissolved."**
2. **wasm-local mem2reg / promotion + coalescing** — promote non-escaping single-assignment locals and shrink the live set so the 5× pointer reload + store-then-reload chains collapse to register keeps.
3. **const dedup / hoisting** — materialize each constant once (also addresses the documented 61% const-redundancy in `flat_flight`).
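As a rough illustration of pass #1, the sketch below runs the forwarding rewrite on a toy expression tree. This is not loom's IR: the `Expr` type, `as_pack`, `forward`, and the fixed 32-bit split are all hypothetical, and the tree form sidesteps the escape check (a real pass would have to prove the carrier local has no other uses).

```rust
#[derive(Clone, Debug, PartialEq)]
enum Expr {
    Local(u32),               // an i32 local, by index
    ExtendU(Box<Expr>),       // i64.extend_i32_u
    Shl(Box<Expr>, u32),      // i64.shl by a constant
    Or(Box<Expr>, Box<Expr>), // i64.or
    And(Box<Expr>, u64),      // i64.and with a constant mask
    ShrU(Box<Expr>, u32),     // i64.shr_u by a constant
}

/// True when the value is known to fit in the low 32 bits.
fn fits_in_32(e: &Expr) -> bool {
    match e {
        Expr::ExtendU(_) => true,
        Expr::And(_, m) => *m <= u32::MAX as u64,
        _ => false,
    }
}

/// Recognize `(extend(hi) << 32) | lo` and return (extend(hi), lo).
fn as_pack(e: &Expr) -> Option<(&Expr, &Expr)> {
    if let Expr::Or(a, b) = e {
        if let Expr::Shl(hi, 32) = &**a {
            if matches!(**hi, Expr::ExtendU(_)) && fits_in_32(b) {
                return Some((&**hi, &**b));
            }
        }
    }
    None
}

/// Forward scalars through a pack that is immediately unpacked.
fn forward(e: &Expr) -> Expr {
    match e {
        // pack >> 32  ==>  extend(hi)
        Expr::ShrU(inner, 32) => match as_pack(inner) {
            Some((hi, _)) => hi.clone(),
            None => e.clone(),
        },
        // pack & mask (mask within the low 32 bits)  ==>  lo & mask
        Expr::And(inner, mask) if *mask <= u32::MAX as u64 => match as_pack(inner) {
            Some((_, lo)) => Expr::And(Box::new(lo.clone()), *mask),
            None => e.clone(),
        },
        _ => e.clone(),
    }
}

/// Reference evaluator, used only to check that the rewrite preserves values.
fn eval(e: &Expr, locals: &[u32]) -> u64 {
    match e {
        Expr::Local(i) => locals[*i as usize] as u64,
        Expr::ExtendU(x) => eval(x, locals) & 0xffff_ffff,
        Expr::Shl(x, s) => eval(x, locals) << *s,
        Expr::Or(a, b) => eval(a, locals) | eval(b, locals),
        Expr::And(x, m) => eval(x, locals) & *m,
        Expr::ShrU(x, s) => eval(x, locals) >> *s,
    }
}
```

After the rewrite neither consumer refers to the `Or` node, so the `i64` carrier becomes dead and ordinary dead-code elimination can drop it along with its local.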

## Parity read
Bit-exact 1.0× vs LLVM-LTO is unlikely (their regalloc is decades-tuned), but **within ~20% is achievable**, and the **in-context overhead is already +11%** (composed flight_control bench). Levers 1+2 alone should take the sem body from 83 insns toward ~55–60 (drop the u64 round-trip + most `[sp]` churn) — a large chunk of the 1.83×.

## Kill-criterion
On a build with pass #1: `repro-loom-seam-sroa/sem.loom.wasm` → synth shows **no `i64` pack/unpack** in the dissolved body, the `i64` local is gone, and the ARM body drops below ~70 insns. I'm the on-silicon gate (G474RE) — I'll re-measure sem 860 + mutex 472 + the composed-bench deltas the moment a build lands.

**Repro:** `gale-smart-data/.../wasm-testbed/repro-loom-seam-sroa/` (`sem.loom.wasm` + shim). xref synth#209 (backend regalloc/const-CSE — the complementary lever).
