feat(vcr-ra): base-CSE — hoist linear-memory base once per field-init, flag-off (#468, #242) by avrabe · Pull Request #482 · pulseengine/synth

avrabe · 2026-06-25T06:40:36Z

VCR-RA lever 3 of the perf feature loop (uxth → 002 → #468 → #472)

On the optimized (non-relocatable) path, `i32.store* (i32.const ADDR) V` lowers to `movw/movt ip,#base; add ip,ip,raddr; str V,[ip]`, re-materializing the loop-invariant linear-memory base before every store. A field initializer pays N base materializations where 1 would do.

`SYNTH_BASE_CSE` (opt-in, default off ⇒ byte-identical) hoists the base into R11 once at function entry and folds each constant store address into the access immediate: `str V,[R11,#ADDR]`. This drops both the per-access base re-materialization (#468's complaint) and the now-dead address materialization; the latter is the register-pressure relief that makes the reserved base a net win.
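The two lowering shapes can be compared side by side with a minimal sketch. It is not the code in `optimizer_bridge.rs`: `ConstStore`, `lower_per_access`, and `lower_base_cse` are hypothetical names, the `r3` address register and the single-`movw` address materialization are assumptions, and the output is plain assembly text rather than encoded instructions.

```rust
/// One constant-address store, `i32.store (i32.const addr) V`.
/// Hypothetical type for illustration only.
#[derive(Clone, Copy)]
pub struct ConstStore {
    pub addr: u32,               // constant linear-memory address
    pub value_reg: &'static str, // register that already holds V
}

/// Per-access shape: the base is rebuilt in `ip` before every store.
pub fn lower_per_access(stores: &[ConstStore]) -> Vec<String> {
    let mut out = Vec::new();
    for s in stores {
        // Assumption: the constant address fits one movw into scratch r3.
        out.push(format!("movw r3, #{}", s.addr));
        out.push("movw ip, #base_lo".to_string());
        out.push("movt ip, #base_hi".to_string());
        out.push("add ip, ip, r3".to_string());
        out.push(format!("str {}, [ip]", s.value_reg));
    }
    out
}

/// Base-CSE shape: one entry materialization into r11, then each
/// constant address becomes the store's immediate offset.
pub fn lower_base_cse(stores: &[ConstStore]) -> Vec<String> {
    let mut out = vec![
        "movw r11, #base_lo".to_string(),
        "movt r11, #base_hi".to_string(),
    ];
    for s in stores {
        out.push(format!("str {}, [r11, #{}]", s.value_reg, s.addr));
    }
    out
}
```

Under these assumptions, N stores cost 5N instructions in the per-access shape and N + 2 in the hoisted shape, so the gap grows with the number of fields initialized.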

Key invariant — R11 is realloc-immune

`reallocate_function`'s pool is R0–R8, and it identity-preserves every register outside that pool, so the single entry materialization survives every straight-line segment untouched: no per-run re-materialization and no cross-segment remap hazard. R11 is also outside local promotion's R4–R8 pool and is not the encoder scratch (R12); it is reserved from every optimized-path allocator via `param_reserved_regs` + the const pool.
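The invariant can be modeled in a few lines. This is a sketch under assumptions, not the real `reallocate_function`: `remap`, `in_realloc_pool`, and `in_local_promotion_pool` are hypothetical helpers that only encode the register ranges stated above.

```rust
/// Register numbers taken from the description above.
pub const BASE_REG: u8 = 11; // hoisted linear-memory base
pub const ENCODER_SCRATCH: u8 = 12; // ip

/// Reallocation pool: R0..=R8.
pub fn in_realloc_pool(reg: u8) -> bool {
    reg <= 8
}

/// Local-promotion pool: R4..=R8.
pub fn in_local_promotion_pool(reg: u8) -> bool {
    (4..=8).contains(&reg)
}

/// Hypothetical model of "identity-preserves everything outside the
/// pool": pool registers are renamed through `assign`, all others
/// pass through unchanged.
pub fn remap(reg: u8, assign: &[u8; 9]) -> u8 {
    if in_realloc_pool(reg) {
        assign[reg as usize]
    } else {
        reg
    }
}
```

Whatever permutation a segment picks for R0–R8, `remap(BASE_REG, ..)` returns 11, which is why one entry materialization is enough in this model.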

Activation (standalone, unit-tested `plan_base_cse`)

The planner activates only when a function has ≥2 foldable single-use const-address accesses (ADDR+off ≤ imm12) and every opcode is in the base-CSE-safe set. Any Branch/CondBranch (multi-block), Select, Global*, MemorySize/Grow, Call, i64, or unenumerated opcode declines the whole function (`None` → unchanged per-access codegen). v1 is thus confined to single-basic-block field initializers, #468's exact target, which keeps it clear of the optimized path's separately-tracked multi-block lowering.
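The activation rule can be sketched as a small pure function. This is a simplified stand-in for `plan_base_cse`, not its actual signature: the `Op` enum, `plan_base_cse_sketch`, and `IMM12_MAX = 4095` (the largest value a 12-bit immediate can hold) are assumptions. The sketch also assumes that a non-foldable access is simply not counted; the real planner may decline the function outright in that case.

```rust
/// Assumption: "imm12" means a 12-bit unsigned store offset.
pub const IMM12_MAX: u32 = 4095;

/// Hypothetical, heavily reduced opcode model.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Op {
    /// Store to a constant address; `addr_uses` counts uses of the
    /// address value (it must be exactly 1 to fold).
    ConstAddrStore { addr: u32, offset: u32, addr_uses: u32 },
    /// Any other opcode in the base-CSE-safe set.
    Safe,
    /// Structural label, allowed.
    Label,
    /// Branch, CondBranch, Select, Global*, MemorySize/Grow, Call,
    /// i64, or anything unenumerated.
    Disqualifying,
}

/// Returns the (op index, folded immediate) pairs, or `None` to keep
/// the unchanged per-access codegen.
pub fn plan_base_cse_sketch(ops: &[Op]) -> Option<Vec<(usize, u32)>> {
    let mut folds = Vec::new();
    for (index, op) in ops.iter().enumerate() {
        match *op {
            // One disqualifying opcode declines the whole function.
            Op::Disqualifying => return None,
            Op::Safe | Op::Label => {}
            Op::ConstAddrStore { addr, offset, addr_uses } => {
                // Foldable only if ADDR+off fits the immediate.
                let imm = addr.checked_add(offset).filter(|&ea| ea <= IMM12_MAX);
                if let (Some(imm), 1) = (imm, addr_uses) {
                    folds.push((index, imm));
                }
            }
        }
    }
    // Below two folds the reserved base register does not pay off.
    if folds.len() >= 2 { Some(folds) } else { None }
}
```

Returning `Option` mirrors the `None` → unchanged-codegen contract: callers need no flag check beyond matching on the plan.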

Result

`init_fields`: `.text` 336 B → 218 B (−118 B, −35 %); the base is materialized once and all 7 addresses are folded, matching the relocatable path's `str [fp,#off]` shape.

Oracle — this path has no cargo byte-gate (the frozen gate compiles only `--relocatable`)

- Flag-off bit-identical: explicit `.text` diff of a fixture corpus vs a pre-change baseline binary (4/4 identical) + full optimized-path suite (`wast_compile` et al.) green. base-CSE is `None` when the flag is unset → off byte-identical by construction.
- 8 planner unit tests (folds ≥2 / declines below threshold / on control flow / on disqualifying op / on imm12 overflow / on multi-use addr / allows structural label / folds static offset).
- `base_cse_differential.py` (unicorn): `init_fields` flag-off == flag-on == wasmtime by comparing linear memory (the fixture returns nothing); `init_branch` asserts flag-on `.text` byte-identical to flag-off (base-CSE correctly declines on control flow).
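Because `init_fields` returns nothing, the oracle has to compare memory images rather than return values. The comparison step can be sketched as follows; this is not the contents of `base_cse_differential.py`, and `first_mismatch` and `memories_agree` are hypothetical helpers that assume the three linear-memory snapshots have already been captured as byte slices.

```rust
/// Index of the first differing byte, or `None` if the snapshots are
/// identical. A length difference counts as a mismatch at the end of
/// the shorter slice.
pub fn first_mismatch(a: &[u8], b: &[u8]) -> Option<usize> {
    a.iter()
        .zip(b)
        .position(|(x, y)| x != y)
        .or_else(|| (a.len() != b.len()).then(|| a.len().min(b.len())))
}

/// Checks flag-off == flag-on == reference over linear memory.
pub fn memories_agree(
    flag_off: &[u8],
    flag_on: &[u8],
    reference: &[u8],
) -> Result<(), String> {
    if let Some(i) = first_mismatch(flag_off, flag_on) {
        return Err(format!("flag-off vs flag-on differ at byte {i}"));
    }
    if let Some(i) = first_mismatch(flag_on, reference) {
        return Err(format!("flag-on vs reference differ at byte {i}"));
    }
    Ok(())
}
```

Reporting the first differing offset rather than a bare boolean makes a failure point directly at the field whose store was miscompiled.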

Found en route

The optimized path's block/br_if lowering miscompiles independent of base-CSE (`init_branch` flag-off already disagrees with wasmtime). This is a separate optimized-path control-flow bug, noted as a follow-up; because base-CSE declines on control flow, it stays clear of the bug.

Default-on flip held for the on-silicon cycle gate, like the prior levers.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov · 2026-06-25T07:11:38Z

Codecov Report

❌ Patch coverage is 77.07182% with 83 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
crates/synth-synthesis/src/optimizer_bridge.rs	77.07%	83 Missing ⚠️


avrabe merged commit d56f1c1 into main Jun 25, 2026
14 checks passed

avrabe deleted the vcr-ra/468-base-cse branch June 25, 2026 07:11
