feat(vcr-ra): dead-frame elimination for promoted-local leaves, flag-off (#390, #242) #481
Merged
…off (#390, #242)

`compute_local_layout` reserves a frame slot (`sub sp,#N` / `add sp,#N`) for every non-param wasm local it sees. Local promotion (v0.14.0) then homes the eligible i32 locals in registers, so for a function whose locals all promote (and which neither spills, calls, nor touches i64 / stack-passed params) those frame bytes are never accessed. The `sub`/`add sp` pair is then pure overhead (~2-3 cyc on a small leaf), and because it writes SP it makes `shrink_callee_saved_saves` decline (that pass bails on any SP def/use).

`elide_dead_frame` removes the pair when the body provably never touches SP, saving the two instructions and restoring the SP-untouched precondition the shrink pass needs, so the two passes compose. It runs BEFORE shrink in the relocatable pipeline.

Safe-by-construction: fires only when NO instruction reads/writes SP except the matched frame sub/add and the prologue Push / epilogue Pop. For wasm locals that guard is exact deadness: locals are not addressable, so every other SP consumer (spills, #204 param-backing, the i64 pair-spill area, the #359 outgoing-arg region, incoming stack params) manifests as an `[sp,#off]` access the guard sees. Any such access, or any unmodeled op whose SP effect can't be confirmed absent, declines and leaves the bytes unchanged. Removal-only: no instruction is added, rewritten, or reordered.

Flag-off (opt-in `SYNTH_DEAD_FRAME_ELIM=1`); default path byte-identical, so the frozen byte gate stays green. Default-on flip held for on-silicon validation, like the realloc/shrink levers.

Validation:
- 6 unit tests (removes / declines on sp-relative / unbalanced add-sp / unmodeled sp-effect / no-frame noop / multiple epilogues).
- leaf_dead_frame_differential.py: leaf3 under unicorn, flag-off == flag-on == wasmtime over 10 vectors (signed + i32-wrap edges); 36 B -> 28 B (-8 B). Both builds return cleanly via popped LR, confirming SP balance.

NOTE: the push stays {r4-r8,lr} here. a,b,c land in callee-saved r4,r5,r6 + scratch r7 = 4 saved regs, which shrink pads back to the even-count {r4-r8,lr}. Trimming the push needs the locals OUT of callee-saved (caller-saved leaf homing), tracked separately as #390.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
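The guard described in the commit message can be modeled on a toy instruction list. This is a minimal sketch, not the project's code: the `Insn` record, its fields, and the op names are hypothetical, and it treats every push/pop as the prologue/epilogue pair, which the real pass presumably checks more strictly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Insn:
    """Hypothetical toy instruction; not the project's real IR."""
    op: str                   # e.g. "push", "pop", "sub_sp", "add_sp", "ldr", "add"
    imm: int = 0              # frame size in bytes for sub_sp / add_sp
    touches_sp: bool = False  # reads or writes SP, including [sp,#off] addressing
    modeled: bool = True      # False: SP effect cannot be confirmed absent


def elide_dead_frame(body):
    """Return body without the frame sub/add pair, or body unchanged if unsafe."""
    subs = [i for i, ins in enumerate(body) if ins.op == "sub_sp"]
    adds = [i for i, ins in enumerate(body) if ins.op == "add_sp"]
    # No frame (or an unexpected shape): nothing to do.
    if len(subs) != 1 or not adds:
        return body
    frame = body[subs[0]].imm
    # Every epilogue must release exactly what the prologue reserved.
    if any(body[i].imm != frame for i in adds):
        return body
    for ins in body:
        if ins.op in ("sub_sp", "add_sp", "push", "pop"):
            continue  # the only SP users the guard tolerates
        if ins.touches_sp or not ins.modeled:
            return body  # live frame or unknown SP effect: decline
    drop = set(subs) | set(adds)
    # Removal-only: surviving instructions keep their order and contents.
    return [ins for i, ins in enumerate(body) if i not in drop]


leaf = [
    Insn("push", touches_sp=True),
    Insn("sub_sp", imm=8, touches_sp=True),
    Insn("add"),
    Insn("add_sp", imm=8, touches_sp=True),
    Insn("pop", touches_sp=True),
]
assert [i.op for i in elide_dead_frame(leaf)] == ["push", "add", "pop"]
```

Declining returns the input list untouched, which mirrors the "leaves the bytes unchanged" behavior: a single `[sp,#off]` load in the body is enough to keep the frame.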
VCR-RA-002 — lever 2 of the perf feature loop (uxth → 002 → #468 → #472)
`compute_local_layout` reserves a frame slot (`sub sp,#N` / `add sp,#N`) for every non-param wasm local it sees. Local promotion (v0.14.0) then homes the eligible i32 locals in registers, so when a function's locals all promote (and it neither spills, calls, nor touches i64 / stack-passed params) those frame bytes are never accessed. The `sub`/`add sp` pair is pure overhead (~2-3 cyc on a small leaf), and because it writes SP it makes `shrink_callee_saved_saves` decline (that pass bails on any SP def/use).

`elide_dead_frame` removes the pair when the body provably never touches SP, saving the two instructions and restoring the SP-untouched precondition the shrink pass needs, so the two passes compose. It runs before shrink in the relocatable pipeline.

Safe-by-construction
Fires only when no instruction reads/writes SP except the matched frame sub/add and the prologue `Push` / epilogue `Pop`. For wasm locals that guard is exact deadness: locals are not addressable, so every other SP consumer (spills, #204 param-backing, the i64 pair-spill area, the #359 outgoing-arg region, incoming stack params) manifests as an `[sp,#off]` access the guard sees. Any such access, or any unmodeled op whose SP effect can't be confirmed absent, declines and leaves the bytes unchanged. Removal-only: no instruction added, rewritten, or reordered.

Gating
Flag-off (opt-in `SYNTH_DEAD_FRAME_ELIM=1`); default path byte-identical, so the frozen byte gate stays green.

Validation
`leaf_dead_frame_differential.py`: `leaf3` under unicorn, flag-off == flag-on == wasmtime over 10 vectors (signed + i32-wrap edges); 36 B → 28 B (-8 B). Both builds return cleanly via popped LR, confirming SP balance.

Honest scope
The push stays `{r4-r8,lr}` on `leaf3`: a,b,c land in callee-saved r4,r5,r6 + scratch r7 = 4 saved regs, which shrink pads back to the even-count `{r4-r8,lr}`. Trimming the push needs the locals out of callee-saved (caller-saved leaf homing), tracked separately as #390.

🤖 Generated with Claude Code
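The even-count arithmetic in the scope note can be worked through with a small sketch. It assumes the count is taken over the whole push list including `lr`, with 4 bytes per register, so that an even count keeps SP 8-byte aligned. The helper name `padded_push_list` is hypothetical and is not the shrink pass's actual interface.

```python
def padded_push_list(callee_saved_used, keep_lr=True):
    """Pad a contiguous r4.. run with one extra register if the push count is odd.

    Hypothetical helper: assumes registers are allocated contiguously from r4
    and that lr counts toward the total.
    """
    regs = [f"r{n}" for n in range(4, 4 + callee_saved_used)]
    total = len(regs) + (1 if keep_lr else 0)
    if total % 2:
        # An odd count would move SP by 4 mod 8, so add the next register as padding.
        regs.append(f"r{4 + callee_saved_used}")
    return regs + (["lr"] if keep_lr else [])


# leaf3 as described: r4, r5, r6 for a, b, c plus scratch r7, and lr.
push = padded_push_list(4)
assert push == ["r4", "r5", "r6", "r7", "r8", "lr"]
assert len(push) * 4 == 24  # bytes pushed, a multiple of 8
```

Under these assumptions, four callee-saved registers plus `lr` make five, so the list grows to six. Removing the frame alone therefore cannot shorten the push on `leaf3`; only homing the locals in caller-saved registers would.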