feat(vcr-ra): base-CSE — hoist linear-memory base once per field-init, flag-off (#468, #242)#482

Merged
avrabe merged 1 commit into main from vcr-ra/468-base-cse
Jun 25, 2026
@avrabe avrabe commented Jun 25, 2026

Contributor

VCR-RA lever 3 of the perf feature loop (uxth → 002 → #468#472)

On the optimized (non-relocatable) path, i32.store* (i32.const ADDR) V lowers to movw/movt ip,#base; add ip,ip,raddr; str V,[ip] — re-materializing the loop-invariant linear-memory base before every store. A field initializer pays N base materializations where 1 would do.

SYNTH_BASE_CSE (opt-in, default off ⇒ byte-identical) hoists the base into R11 once at function entry and folds each constant store address into the access immediate: str V,[R11,#ADDR] — dropping both the per-access base re-materialization (#468's complaint) and the now-dead address materialization (the register-pressure relief that makes the reserved base a net win).
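The difference between the two lowerings can be sketched as a toy instruction emitter. This is only an illustrative model, not the code in `optimizer_bridge.rs`: the function names, the placeholder registers `raddr` and `rv`, and the assumption that each constant address is materialized with a single `movw` are all hypothetical.

```rust
/// Toy model of the per-access lowering described above: the base is
/// rebuilt in ip before every store. `raddr` and `rv` are placeholder
/// register names, and a single movw per address constant is an assumption.
fn lower_per_access(base: u32, addrs: &[u32]) -> Vec<String> {
    let mut asm = Vec::new();
    for addr in addrs {
        asm.push(format!("movw raddr, #{}", addr)); // address constant
        asm.push(format!("movw ip, #{}", base & 0xffff)); // base, low half
        asm.push(format!("movt ip, #{}", base >> 16)); // base, high half
        asm.push("add ip, ip, raddr".to_string());
        asm.push("str rv, [ip]".to_string());
    }
    asm
}

/// Toy model of the base-CSE lowering: the base is built once in r11 and
/// every constant address becomes the store's immediate offset.
fn lower_base_cse(base: u32, addrs: &[u32]) -> Vec<String> {
    let mut asm = vec![
        format!("movw r11, #{}", base & 0xffff),
        format!("movt r11, #{}", base >> 16),
    ];
    for addr in addrs {
        asm.push(format!("str rv, [r11, #{}]", addr));
    }
    asm
}
```

For seven stores this model emits 35 instructions without the hoist and 9 with it. Those are counts in the toy model only; they are not meant to reproduce the measured `.text` sizes.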

Key invariant — R11 is realloc-immune

reallocate_function's pool is R0–R8 and it identity-preserves everything outside it, so the single entry materialization survives every straight-line segment untouched — no per-run re-materialization, no cross-segment remap hazard. R11 is also outside local promotion's R4–R8 pool and isn't the encoder scratch (R12); it's reserved from every optimized-path allocator via param_reserved_regs + the const pool.
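A minimal sketch of why a register outside the pool survives reallocation, assuming the remap behaves as described (renames only inside R0–R8, identity elsewhere). The `remap` helper and its `renames` table are hypothetical stand-ins, not the real `reallocate_function` interface.

```rust
use std::collections::HashMap;

/// Highest register in the reallocation pool (R0..=R8, per the PR text).
const POOL_MAX: u8 = 8;

/// Hypothetical stand-in for the remap step: registers inside the pool may
/// be renamed, registers outside it always map to themselves.
fn remap(reg: u8, renames: &HashMap<u8, u8>) -> u8 {
    if reg <= POOL_MAX {
        // In-pool register: apply the rename if one exists.
        *renames.get(&reg).unwrap_or(&reg)
    } else {
        // Out-of-pool register (e.g. R11, R12): identity-preserved,
        // even if the table happens to contain an entry for it.
        reg
    }
}
```

Under this model a value parked in R11 at function entry is still in R11 after any number of remap passes, which is the property the hoisted base relies on.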

Activation (standalone, unit-tested plan_base_cse)

Activation requires ≥2 foldable single-use const-address accesses (ADDR+off must fit the 12-bit store immediate, i.e. ≤ 4095) and every opcode in the base-CSE-safe set. Any Branch/CondBranch (multi-block), Select, Global*, MemorySize/Grow, Call, i64, or unenumerated opcode declines the whole function (None → unchanged per-access codegen). v1 is therefore confined to single-basic-block field initializers, which is #468's exact target, and stays clear of the optimized path's separately-tracked multi-block lowering.
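The activation rule can be sketched as a small all-or-nothing planner. This is a simplified model under stated assumptions, not the real `plan_base_cse`: the `Op` enum, its fields, and the name `plan_base_cse_sketch` are hypothetical, and treating an imm12 overflow or a multi-use address as a whole-function decline is inferred from the unit-test list rather than confirmed by the source.

```rust
/// Hypothetical, heavily reduced opcode set for illustration only.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Op {
    /// i32.store with a constant address; `uses` counts consumers of that address.
    StoreConst { addr: u32, offset: u32, uses: u32 },
    I32Const(i32),
    Label, // structural label, allowed
    Branch,
    CondBranch,
    Select,
    Call,
    Other, // stands in for any unenumerated opcode
}

/// Largest value a 12-bit unsigned immediate can hold.
const IMM12_MAX: u32 = 4095;

/// Returns the folded immediates if the whole function qualifies, else None.
fn plan_base_cse_sketch(ops: &[Op]) -> Option<Vec<u32>> {
    let mut folded = Vec::new();
    for op in ops {
        match op {
            Op::StoreConst { addr, offset, uses } => {
                // ADDR + static offset must fit the immediate; overflow declines.
                let imm = addr.checked_add(*offset)?;
                if *uses != 1 || imm > IMM12_MAX {
                    return None;
                }
                folded.push(imm);
            }
            Op::I32Const(_) | Op::Label => {} // in the safe set
            _ => return None, // control flow, calls, unknown ops: decline
        }
    }
    // Below two foldable accesses the reserved register is not worth it.
    if folded.len() >= 2 { Some(folded) } else { None }
}
```

Returning `None` for the whole function, rather than skipping individual accesses, mirrors the "unchanged per-access codegen" fallback described above and keeps the flag-on output identical to flag-off whenever the planner declines.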

Result

init_fields: .text 336 B → 218 B (−118 B, −35 %), base materialized once, all 7 addresses folded, matching the relocatable path's str [fp,#off] shape.
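The reported saving follows directly from the two `.text` sizes. A small helper (hypothetical name, shown only to make the arithmetic checkable) reproduces it:

```rust
/// Bytes saved and percentage reduction between two section sizes.
/// Assumes `after <= before`.
fn text_reduction(before: u32, after: u32) -> (u32, f64) {
    let saved = before - after;
    (saved, 100.0 * saved as f64 / before as f64)
}
```

`text_reduction(336, 218)` yields 118 bytes saved, and 118 / 336 rounds to 35 %.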

Oracle — this path has no cargo byte-gate (the frozen gate compiles only --relocatable)

  • Flag-off bit-identical — explicit .text diff of a fixture corpus vs a pre-change baseline binary (4/4 identical) + full optimized-path suite (wast_compile et al.) green. base-CSE is None when the flag is unset → off byte-identical by construction.
  • 8 planner unit tests (folds ≥2 / declines below threshold / on control flow / on disqualifying op / on imm12 overflow / on multi-use addr / allows structural label / folds static offset).
  • base_cse_differential.py (unicorn): init_fields flag-off == flag-on == wasmtime by comparing linear memory (the fixture returns nothing); init_branch asserts flag-on .text byte-identical to flag-off (base-CSE correctly declines on control flow).
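Because the fixture returns nothing, the differential check has to compare memory images rather than return values. The real harness is the Python script named above; the following is only a sketch of the comparison step in Rust, with hypothetical function names and with the memory snapshots assumed to be already captured as byte slices.

```rust
/// Offset of the first differing byte, or None if the images are identical.
/// A length mismatch with an identical common prefix is reported at the
/// shorter length.
fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    a.iter()
        .zip(b.iter())
        .position(|(x, y)| x != y)
        .or_else(|| {
            if a.len() != b.len() {
                Some(a.len().min(b.len()))
            } else {
                None
            }
        })
}

/// Three-way agreement: flag-off == flag-on == reference.
fn memories_agree(flag_off: &[u8], flag_on: &[u8], reference: &[u8]) -> bool {
    first_mismatch(flag_off, flag_on).is_none()
        && first_mismatch(flag_on, reference).is_none()
}
```

Reporting the first mismatching offset, rather than a bare boolean, makes it easier to map a failure back to the store that wrote the wrong field.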

Found en route

The optimized path's block/br_if lowering miscompiles independently of base-CSE (init_branch flag-off already disagrees with wasmtime). This is a separate optimized-path control-flow bug, noted as a follow-up; because base-CSE declines on control flow, it stays clear of it.

Default-on flip held for the on-silicon cycle gate, like the prior levers.

🤖 Generated with Claude Code

…init, flag-off (#468, #242)

Lever 3 of the perf feature loop. On the optimized (non-relocatable) path,
`i32.store* (i32.const ADDR) V` lowers to `movw/movt ip,#base; add ip,ip,raddr;
str V,[ip]` — re-materializing the loop-invariant linear-memory base before EVERY
store. A field initializer pays N base materializations where 1 would do.

`SYNTH_BASE_CSE` (opt-in, default off ⇒ byte-identical) hoists the base into R11
ONCE at function entry and folds each constant store address into the access
immediate: `str V,[R11,#ADDR]`. This drops both the per-access base re-
materialization (#468's complaint) AND the now-dead address materialization (the
register-pressure relief that makes the reserved base a net win).

Key invariant — R11 is realloc-immune: `reallocate_function`'s pool is R0–R8 and
it identity-preserves everything outside it, so the single entry materialization
survives every straight-line segment untouched — no per-run re-materialization,
no cross-segment remap hazard. R11 is also outside local promotion's R4–R8 pool
and is not the encoder scratch (R12); it is reserved from every optimized-path
allocator via `param_reserved_regs` + the const pool.

A standalone, unit-tested planner (`plan_base_cse`) decides activation: ≥2
foldable single-use const-address accesses (ADDR+offset ≤ imm12) AND every opcode
in the base-CSE-safe set. Any Branch/CondBranch (multi-block), Select, Global*,
MemorySize/Grow, Call, i64, or unenumerated opcode declines the whole function
(`None` → unchanged per-access codegen). v1 is thus confined to single-basic-block
field initializers — #468's exact target — keeping it clear of the optimized
path's separately-tracked multi-block lowering.

Result on `init_fields`: .text 336 B → 218 B (−118 B, −35 %), base materialized
once, all 7 addresses folded, matching the relocatable path's `str [fp,#off]`.

Oracle (this path has NO cargo byte-gate — the frozen gate compiles only
--relocatable):
- Flag-off bit-identical — explicit .text diff of a fixture corpus vs a pre-change
  baseline binary (4/4 identical) + full optimized-path suite (wast_compile et al.)
  green. base-CSE is None when the flag is unset → off byte-identical by
  construction.
- 8 planner unit tests (folds ≥2 / declines below threshold / on control flow /
  on disqualifying op / on imm12 overflow / on multi-use addr / allows structural
  label / folds static offset).
- base_cse_differential.py (unicorn): init_fields flag-off == flag-on == wasmtime
  by comparing LINEAR MEMORY (the fixture returns nothing); init_branch asserts
  flag-on .text byte-identical to flag-off (base-CSE correctly DECLINES on control
  flow). Found en route: the optimized path's block/br_if lowering miscompiles
  independent of base-CSE — a separate optimized-path bug, noted as a follow-up.

Default-on flip held for the on-silicon cycle gate, like the prior levers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@avrabe avrabe merged commit d56f1c1 into main Jun 25, 2026
14 checks passed
@avrabe avrabe deleted the vcr-ra/468-base-cse branch June 25, 2026 07:11
codecov Bot commented Jun 25, 2026

Codecov Report

❌ Patch coverage is 77.07182% with 83 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/synth-synthesis/src/optimizer_bridge.rs | 77.07% | 83 Missing ⚠️ |
