feat(vcr-ra): base-CSE — hoist linear-memory base once per field-init, flag-off (#468, #242) #482
Merged
…init, flag-off (#468, #242)

Lever 3 of the perf feature loop. On the optimized (non-relocatable) path, `i32.store* (i32.const ADDR) V` lowers to `movw/movt ip,#base; add ip,ip,raddr; str V,[ip]`, re-materializing the loop-invariant linear-memory base before every store. A field initializer pays N base materializations where 1 would do.

`SYNTH_BASE_CSE` (opt-in, default off ⇒ byte-identical) hoists the base into R11 once at function entry and folds each constant store address into the access immediate: `str V,[R11,#ADDR]`. This drops both the per-access base re-materialization (#468's complaint) and the now-dead address materialization (the register-pressure relief that makes the reserved base a net win).

Key invariant: R11 is realloc-immune. `reallocate_function`'s pool is R0–R8 and it identity-preserves everything outside it, so the single entry materialization survives every straight-line segment untouched: no per-run re-materialization, no cross-segment remap hazard. R11 is also outside local promotion's R4–R8 pool and is not the encoder scratch (R12); it is reserved from every optimized-path allocator via `param_reserved_regs` + the const pool.

A standalone, unit-tested planner (`plan_base_cse`) decides activation: ≥2 foldable single-use const-address accesses (ADDR+offset ≤ imm12) and every opcode in the base-CSE-safe set. Any Branch/CondBranch (multi-block), Select, Global*, MemorySize/Grow, Call, i64, or unenumerated opcode declines the whole function (`None` → unchanged per-access codegen). v1 is thus confined to single-basic-block field initializers, which is #468's exact target, keeping it clear of the optimized path's separately-tracked multi-block lowering.

Result on `init_fields`: .text 336 B → 218 B (−118 B, −35 %), base materialized once, all 7 addresses folded, matching the relocatable path's `str [fp,#off]`.
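The activation rule described above can be modelled in a few lines. This is a minimal Python sketch, not the project's `plan_base_cse`: the `Op` record, the opcode names in `SAFE_OPS`, and the choice to decline the whole function on an imm12 overflow or a multi-use address are assumptions made for illustration, and `IMM12_MAX = 0xFFF` assumes an unsigned 12-bit store offset.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

IMM12_MAX = 0xFFF  # assumption: unsigned 12-bit immediate offset

# Hypothetical opcode names standing in for the base-CSE-safe set.
SAFE_OPS = {"I32Const", "I32Store", "I32Store8", "I32Store16", "Label"}
STORE_OPS = {"I32Store", "I32Store8", "I32Store16"}


@dataclass(frozen=True)
class Op:
    kind: str                   # opcode name
    addr: Optional[int] = None  # constant address feeding a store, if any
    offset: int = 0             # static offset carried by the store
    addr_uses: int = 1          # how many times the address value is used


def plan_base_cse_sketch(
    ops: List[Op], min_folds: int = 2
) -> Optional[List[Tuple[int, int]]]:
    """Return (op index, folded immediate) pairs, or None to decline."""
    folds = []
    for index, op in enumerate(ops):
        if op.kind not in SAFE_OPS:
            return None  # control flow, calls, i64, unknown ops: decline
        if op.kind in STORE_OPS:
            if op.addr is None or op.addr_uses != 1:
                return None  # address is not a single-use constant
            folded = op.addr + op.offset
            if folded > IMM12_MAX:
                return None  # would not fit the access immediate
            folds.append((index, folded))
    # Below the threshold the reserved base register does not pay off.
    return folds if len(folds) >= min_folds else None
```

A call such as `plan_base_cse_sketch([Op("I32Store", addr=16), Op("I32Store", addr=20, offset=4)])` yields one fold per store, while the same list with a `Call` op appended yields `None`, mirroring the "decline the whole function" behaviour.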
Oracle (this path has no cargo byte-gate; the frozen gate compiles only `--relocatable`):

- Flag-off bit-identical: explicit `.text` diff of a fixture corpus vs a pre-change baseline binary (4/4 identical) + full optimized-path suite (`wast_compile` et al.) green. base-CSE is `None` when the flag is unset → off byte-identical by construction.
- 8 planner unit tests (folds ≥2 / declines below threshold / on control flow / on disqualifying op / on imm12 overflow / on multi-use addr / allows structural label / folds static offset).
- `base_cse_differential.py` (unicorn): `init_fields` flag-off == flag-on == wasmtime by comparing linear memory (the fixture returns nothing); `init_branch` asserts flag-on `.text` byte-identical to flag-off (base-CSE correctly declines on control flow).

Found en route: the optimized path's block/br_if lowering miscompiles independent of base-CSE. This is a separate optimized-path bug, noted as a follow-up.

Default-on flip held for the on-silicon cycle gate, like the prior levers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
## VCR-RA lever 3 of the perf feature loop (uxth → 002 → #468 → #472)

On the optimized (non-relocatable) path, `i32.store* (i32.const ADDR) V` lowers to `movw/movt ip,#base; add ip,ip,raddr; str V,[ip]`, re-materializing the loop-invariant linear-memory base before every store. A field initializer pays N base materializations where 1 would do. `SYNTH_BASE_CSE` (opt-in, default off ⇒ byte-identical) hoists the base into R11 once at function entry and folds each constant store address into the access immediate: `str V,[R11,#ADDR]`. This drops both the per-access base re-materialization (#468's complaint) and the now-dead address materialization (the register-pressure relief that makes the reserved base a net win).

## Key invariant: R11 is realloc-immune

`reallocate_function`'s pool is `R0–R8` and it identity-preserves everything outside it, so the single entry materialization survives every straight-line segment untouched: no per-run re-materialization, no cross-segment remap hazard. R11 is also outside local promotion's `R4–R8` pool and isn't the encoder scratch (R12); it's reserved from every optimized-path allocator via `param_reserved_regs` + the const pool.

## Activation (standalone, unit-tested `plan_base_cse`)

≥2 foldable single-use const-address accesses (`ADDR+off ≤ imm12`) and every opcode in the base-CSE-safe set. Any `Branch`/`CondBranch` (multi-block), `Select`, `Global*`, `MemorySize`/`Grow`, `Call`, i64, or unenumerated opcode declines the whole function (`None` → unchanged per-access codegen). v1 is confined to single-basic-block field initializers, which is #468's exact target, keeping it clear of the optimized path's separately-tracked multi-block lowering.

## Result

`init_fields`: .text 336 B → 218 B (−118 B, −35 %), base materialized once, all 7 addresses folded, matching the relocatable path's `str [fp,#off]` shape.

## Oracle

This path has no cargo byte-gate (the frozen gate compiles only `--relocatable`).

- Flag-off bit-identical: explicit `.text` diff of a fixture corpus vs a pre-change baseline binary (4/4 identical) + full optimized-path suite (`wast_compile` et al.) green. base-CSE is `None` when the flag is unset → off byte-identical by construction.
- 8 planner unit tests (folds ≥2 / declines below threshold / on control flow / on disqualifying op / on imm12 overflow / on multi-use addr / allows structural label / folds static offset).
- `base_cse_differential.py` (unicorn): `init_fields` flag-off == flag-on == wasmtime by comparing linear memory (the fixture returns nothing); `init_branch` asserts flag-on `.text` byte-identical to flag-off (base-CSE correctly declines on control flow).

## Found en route

The optimized path's `block`/`br_if` lowering miscompiles independent of base-CSE (`init_branch` flag-off already disagrees with wasmtime). This is a separate optimized-path control-flow bug, noted as a follow-up; base-CSE declining on control flow keeps clear of it.

Default-on flip held for the on-silicon cycle gate, like the prior levers.
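Because the `init_fields` fixture returns nothing, the differential check has to compare linear memory rather than a return value. A minimal Python sketch of that comparison, assuming each run (flag-off, flag-on, wasmtime) has already been reduced to a `bytes` snapshot of the same memory region; `first_mismatch` and `disagreements` are hypothetical helper names, and the emulator and runtime plumbing of `base_cse_differential.py` is not shown.

```python
from typing import Dict, Optional


def first_mismatch(a: bytes, b: bytes) -> Optional[int]:
    """Return the first byte offset where a and b differ, or None if equal."""
    for index, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return index
    if len(a) != len(b):
        # Common prefix matches; the shorter snapshot ends here.
        return min(len(a), len(b))
    return None


def disagreements(snapshots: Dict[str, bytes], reference: str) -> Dict[str, int]:
    """Map each non-reference run to its first mismatching offset, if any."""
    ref = snapshots[reference]
    result = {}
    for name, memory in snapshots.items():
        if name == reference:
            continue
        offset = first_mismatch(ref, memory)
        if offset is not None:
            result[name] = offset
    return result
```

An empty result from `disagreements({"wasmtime": w, "flag_off": a, "flag_on": b}, "wasmtime")` corresponds to the "flag-off == flag-on == wasmtime" condition; a non-empty one names the run and the first differing offset.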
🤖 Generated with Claude Code