
feat(vcr-ra): RV32 immediate-shift-fold — const shift amount into slli/srli/srai, flag-off (#472, #242)#487

Merged
avrabe merged 1 commit into main from feat/rv32-imm-shift-fold-472
Jun 25, 2026
@avrabe avrabe commented Jun 25, 2026

Contributor

What

Ports the first applicable ARM perf lever to the RV32 backend (the lever-by-lever scoping landed in #484). A constant shift

(i32.shl (val) (i32.const N))

lowers on RV32 as `addi tmp,zero,N ; sll rd,val,tmp`: the amount is materialized into a register, then consumed by the register-form shift. RV32 has immediate shift forms `slli`/`srli`/`srai` that carry the amount in the instruction, so folding a constant amount drops the `addi`, saving one instruction per constant shift.

How

fold_const_shift — a post-pass peephole mirroring the ARM fold_immediate_shifts / fold_uxth scaffolding. For each addi tmp,zero,N, a windowed scan finds the consuming register shift, rewrites it to the immediate form, and drops the addi as a dead store.

Soundness:

  • rs1 != tmp guard — dropping the addi must not remove the shift's input definition (the load-bearing guard).
  • The addi is removed only when it is a dead store: either the fold's destination is tmp (the slli redefines it, reading only rs1) or tmp is dead after the shift (rv_reg_dead_after, the RV32 analogue of ARM's reg_dead_by_redef; an unmodeled op ⇒ can't-prove ⇒ keep).
  • shamt = N & 31 reproduces the register sll's hardware low-5-bit mask = WASM's shift-mod-32, so amounts ≥ 32 and negative constants fold to identical behaviour.

Only the single-addi const form (N in -2048..=2047, covering every meaningful amount 0..31) folds; a large constant via lui+addi stays a register shift (out of v1 scope).
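The guards above can be illustrated with a small Python model. This is a minimal sketch, not the Rust pass in `selector.rs`: the `Insn` type, the `MODELED` set, the `window` default, and the helper names `reads`, `dead_after`, and `find_fold` are all hypothetical, and the toy treats "ran off the end of the block" as "cannot prove dead", which may be more conservative than `rv_reg_dead_after`.

```python
from dataclasses import dataclass
from typing import List, Optional

# Register-form shift -> immediate-form shift.
REG_TO_IMM = {"sll": "slli", "srl": "srli", "sra": "srai"}
# Ops this toy model understands; anything else is "unmodeled" and blocks the fold.
MODELED = set(REG_TO_IMM) | set(REG_TO_IMM.values()) | {"addi", "add"}


@dataclass(frozen=True)
class Insn:
    op: str
    rd: str
    rs1: str
    rs2: Optional[str] = None  # second register operand (register forms)
    imm: Optional[int] = None  # immediate operand (immediate forms)


def reads(insn: Insn, reg: str) -> bool:
    return reg in (insn.rs1, insn.rs2)


def dead_after(code: List[Insn], start: int, reg: str) -> bool:
    """True only if `reg` is provably redefined before any read from `start` on."""
    for insn in code[start:]:
        if insn.op not in MODELED or reads(insn, reg):
            return False  # unmodeled op or a read: cannot prove deadness
        if insn.rd == reg:
            return True  # redefined without an intervening read
    return False  # ran off the end without proof: keep the addi


def find_fold(code: List[Insn], i: int, window: int) -> Optional[int]:
    """Index of the shift that may absorb the constant defined at `i`, else None."""
    a = code[i]
    if a.op != "addi" or a.rs1 != "zero" or a.rd == "zero" or a.imm is None:
        return None
    if not -2048 <= a.imm <= 2047:  # single-addi constants only
        return None
    tmp = a.rd
    for j in range(i + 1, min(i + 1 + window, len(code))):
        s = code[j]
        if s.op in REG_TO_IMM and s.rs2 == tmp:
            input_ok = s.rs1 != tmp  # the shift's input must not be the addi's result
            addi_dead = s.rd == tmp or dead_after(code, j + 1, tmp)
            return j if input_ok and addi_dead else None
        if s.op not in MODELED or reads(s, tmp) or s.rd == tmp:
            return None  # tmp is used or clobbered before reaching a shift
    return None


def fold_const_shift(code: List[Insn], window: int = 4) -> List[Insn]:
    out = list(code)
    i = 0
    while i < len(out):
        j = find_fold(out, i, window)
        if j is None:
            i += 1
            continue
        shift = out[j]
        shamt = out[i].imm & 31  # same low-5-bit mask the register shift applies
        out[j] = Insn(REG_TO_IMM[shift.op], shift.rd, shift.rs1, imm=shamt)
        del out[i]  # the addi is now a dead store
    return out
```

For instance, `fold_const_shift([Insn("addi", "t0", "zero", imm=33), Insn("sll", "t0", "a1", "t0")])` yields a single `slli t0,a1,1` in this model, while `sll a0,t0,t0` is left alone because its input is the `addi` result.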

Frozen-safe

Flag-off behind SYNTH_RV_SHIFT_FOLD (default off). With the env unset the output is byte-identical to the pre-lever baseline, so the frozen RV32 fixtures (control_step / signed_div_const) are unchanged. The on-target cycle win is validated before the default-on flip — the same gated protocol as the ARM levers.

Oracle

scripts/repro/shift_fold.wat + shift_fold_riscv_differential.py: every exported function runs under unicorn UC_ARCH_RISCV in both flag states and matches wasmtime ground truth, including the mask cases (shl33: amount 33 acts as 1; shlneg: amount -1 acts as 31) and a variable shift (shlvar) that must NOT fold.

.text 168B -> 148B (-20B): 5 const shift(s) folded
ORACLE: PASS
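The mask cases can be restated as a tiny reference model of the WebAssembly i32 shift semantics. This is an illustrative sketch only: the function names are hypothetical and it is not the differential script, which compares emulated RV32 code against wasmtime.

```python
MASK32 = 0xFFFFFFFF


def folded_shamt(n: int) -> int:
    """Immediate encoded by the fold: the low 5 bits of the constant."""
    return n & 31


def wasm_i32_shl(val: int, amount: int) -> int:
    # WebAssembly takes the shift count modulo 32.
    return (val << (amount & 31)) & MASK32


def wasm_i32_shr_u(val: int, amount: int) -> int:
    return (val & MASK32) >> (amount & 31)


def wasm_i32_shr_s(val: int, amount: int) -> int:
    val &= MASK32
    signed = val - (1 << 32) if val & 0x80000000 else val  # reinterpret as signed
    return (signed >> (amount & 31)) & MASK32
```

Under this model a constant of 33 shifts by 1 and a constant of -1 shifts by 31, which is what the `shl33` and `shlneg` fixtures are described as checking.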

Non-vacuity: flag-on .text strictly smaller (−20B = exactly 5 folds); flag-off zero; shlvar unfolded.

Tests / gates

Part of the #472 RISC-V lever-parity slice under epic #242 (VCR-*). Next steps (separate gated PRs): const-address-fold, then local-promotion carrying the #474 promotion-exhaustion fallback.

🤖 Generated with Claude Code

…i/srli/srai, flag-off (#472, #242)

Port the first applicable ARM perf lever to the RV32 backend (scoped in #484).
A constant shift `i32.shl/shr_u/shr_s (val) (i32.const N)` lowers as
`addi tmp,zero,N ; sll/srl/sra rd,val,tmp` — the amount is materialized into a
register, then consumed by the register-form shift. RV32 has immediate shift
forms `slli/srli/srai` carrying the amount in the instruction, so folding a
constant amount drops the `addi` (one instruction per constant shift).

`fold_const_shift` is a post-pass peephole (mirrors the ARM `fold_immediate_shifts`
/ `fold_uxth` scaffolding): for each `addi tmp,zero,N`, the windowed scan finds
the consuming register shift and rewrites it to the immediate form, dropping the
`addi` as a dead store. Soundness:
  * `rs1 != tmp` guard — dropping the `addi` must not remove the shift's input
    definition;
  * the `addi` is removed only when it is a dead store — either the fold's
    destination IS `tmp` (the `slli` redefines it, reading only `rs1`) or `tmp`
    is dead after the shift (`rv_reg_dead_after`, the RV32 analogue of the ARM
    `reg_dead_by_redef`; an unmodeled op ⇒ can't-prove ⇒ keep);
  * `shamt = N & 31` reproduces the register `sll`'s hardware low-5-bit mask =
    WASM's shift-mod-32, so amounts ≥ 32 and negative constants fold identically.

Only the single-`addi` const form (N in -2048..=2047, covering every meaningful
amount 0..31) folds; a large constant via `lui+addi` stays a register shift.

Flag-off behind `SYNTH_RV_SHIFT_FOLD` (default off): with the env unset the
output is byte-identical to the pre-lever baseline, so the frozen RV32 fixtures
(control_step / signed_div_const) are unchanged — frozen-safe by construction.
The on-target cycle win is validated before the default-on flip.

Oracle (scripts/repro/shift_fold.wat + shift_fold_riscv_differential.py): every
exported function runs under unicorn UC_ARCH_RISCV in both flag states and
matches wasmtime — including the mask cases (`shl33` << 33→1, `shlneg` << -1→31)
and a VARIABLE shift (`shlvar`) that must NOT fold. Non-vacuity: flag-on `.text`
168B→148B (−20B = exactly 5 const shifts folded); flag-off zero. 6 unit tests
cover fold/decline (input-alias guard, live-after, dest==tmp), srl/sra, and the
mask. Full RV32 suite (184) + frozen byte gate (ARM+RV32) green; fmt + clippy
clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov Bot commented Jun 25, 2026


Codecov Report

❌ Patch coverage is 83.60000% with 41 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crates/synth-backend-riscv/src/selector.rs | 83.60% | 41 Missing ⚠️ |


@avrabe avrabe merged commit d0eda29 into main Jun 25, 2026
14 of 15 checks passed
@avrabe avrabe deleted the feat/rv32-imm-shift-fold-472 branch June 25, 2026 10:57
avrabe added a commit that referenced this pull request Jun 25, 2026
#472, #242) (#489)

* ci(vcr-oracle): CI-gate the RV32 immediate-shift-fold execution oracle (#472, #242)

VCR-ORACLE-001's deliverable is CI-gating the differential oracles, not just
shipping them as dev-time scripts. The RV32 immediate-shift-fold lever (PR #487,
landed flag-off behind SYNTH_RV_SHIFT_FOLD) came with a unicorn UC_ARCH_RISCV
differential (shift_fold_riscv_differential.py), but it only ran by hand. Since
the lever sits flag-off awaiting the on-silicon flip, nothing else exercises the
flag-on path — exactly the gap the cmp-select two-move oracle was added to close.

Adds an isolated `rv32-shift-fold-oracle` CI job mirroring the existing
`cmp-select-oracle` job: build synth, pip-install wasmtime+unicorn+pyelftools in
that job ONLY (the main `cargo test` gate is not taxed with the C-library build
graph), and run the differential. It executes every fixture function in BOTH flag
states under unicorn and asserts bit-identical-to-wasmtime — continuously
validating the slli/srli/srai folds, the `& 31` mask on >=32 and negative shift
amounts, and the variable-shift non-fold, plus non-vacuity (.text 168B->148B, 5
folds). The differential now honors a SYNTH env override (default release for
local dev; CI points it at the debug build for speed, like cmp-select).

Frozen-safe: no codegen change, no emitted bytes change — wires an
already-written, already-passing oracle into CI. Verified locally with the exact
CI invocation (debug binary via SYNTH=./target/debug/synth): ORACLE PASS. ci.yml
parses; new job well-formed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(oracle): read RV32 fixture symbols from the ELF symtab, not `synth disasm` text

The CI oracle job failed with `SYMBOL MISSING` on the fresh runner while passing
locally: the harness scraped function addresses out of `synth disasm` stdout with
a regex, and that text is host-dependent (the disasm backend even decodes RISC-V
bytes with an ARM decoder, and on the bare runner the symbol-line format differs
so the regex matched nothing). Read the addresses straight from the ELF symbol
table via pyelftools instead — the same backend-independent approach
base_cse_differential.py uses. synth emits the symtab with an empty section name,
so it's found by sh_type (SHT_SYMTAB), and addresses are made .text-relative by
subtracting sh_addr. Re-verified with the exact CI invocation (debug binary via
SYNTH env): ORACLE PASS, 5 folds, all 6 functions matched.
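
The symtab lookup described above can be sketched as follows. The harness itself uses pyelftools; this example swaps in a stdlib `struct` parser so it stays self-contained, handles little-endian ELF32 only, and makes each address relative to the section named by the symbol's `st_shndx` (an assumption standing in for "subtract the `.text` `sh_addr`"). The function name `read_func_offsets` is hypothetical.

```python
import struct

SHT_SYMTAB = 2  # section type of the symbol table
STT_FUNC = 2    # symbol type for functions (low nibble of st_info)


def read_func_offsets(elf: bytes) -> dict:
    """Map function name -> offset within its defining section (ELF32 LE only)."""
    if elf[:4] != b"\x7fELF" or elf[4] != 1 or elf[5] != 1:
        raise ValueError("expected a little-endian ELF32 image")
    (e_shoff,) = struct.unpack_from("<I", elf, 32)
    e_shentsize, e_shnum = struct.unpack_from("<HH", elf, 46)
    # Section header fields: name, type, flags, addr, offset, size, link, ...
    sections = [
        struct.unpack_from("<10I", elf, e_shoff + i * e_shentsize)
        for i in range(e_shnum)
    ]
    # Locate the symtab by type, not by name (the name may be empty).
    symtab = next((s for s in sections if s[1] == SHT_SYMTAB), None)
    if symtab is None:
        raise ValueError("no SHT_SYMTAB section")
    strtab = sections[symtab[6]]  # sh_link points at the string table
    result = {}
    for off in range(symtab[4], symtab[4] + symtab[5], 16):
        st_name, st_value, _size, st_info, _other, st_shndx = struct.unpack_from(
            "<IIIBBH", elf, off
        )
        if st_info & 0xF != STT_FUNC or not 0 < st_shndx < len(sections):
            continue
        start = strtab[4] + st_name
        name = elf[start : elf.index(b"\0", start)].decode()
        result[name] = st_value - sections[st_shndx][3]  # subtract sh_addr
    return result
```

Reading the table this way does not depend on any disassembler's text output, which is the property the fix relies on.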

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Jun 25, 2026
…ccess immediate off s11, flag-off (#472, #242) (#491)

Loop-4 step 2 of the RISC-V lever-parity port (#472), the RISC-V analogue of the
ARM base-CSE address half (#468). A `i32.load/store (i32.const ADDR) …` lowers as
`addi a,zero,ADDR; add tmp,s11,a; lw/sw _,off(tmp)`; when `ADDR+off` fits the
signed-12-bit access immediate, `fold_const_addr` collapses it to a single
`lw/sw _,(ADDR+off)(s11)`, dropping the `add` and the address `addi` — 2
instructions per constant-address access.

Post-pass peephole (the structural twin of the #487 shift fold). Soundness:
  * `ADDR+off` is range-checked as a SUM against [-2048, 2047] (each term
    already fits in 12 bits, but two in-range values can still sum out of range);
  * the `add` base must be s11 and its address operand a `addi a,zero,ADDR`
    (single-`addi` small constant; a `lui+addi` large address stays the `add`
    form, out of v1 scope);
  * 3->1 rewrite, so BOTH dropped temps must be dead — `tmp` (add result) read
    only by the access, and `a` (address constant) read only by the `add`
    (rv_reg_dead_after + an untouched-between-def-and-use check); a bounds check
    between the add and the access reads `a` and disqualifies the fold.
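
The sum range check in the first bullet can be sketched in a few lines. This is an illustration under assumptions, not the `fold_const_addr` code: the helper names `fits_imm12` and `folded_offset` are hypothetical.

```python
IMM12_MIN, IMM12_MAX = -2048, 2047  # signed 12-bit access immediate


def fits_imm12(value: int) -> bool:
    return IMM12_MIN <= value <= IMM12_MAX


def folded_offset(addr: int, off: int):
    """Combined access immediate, or None when the fold must be declined."""
    if not (fits_imm12(addr) and fits_imm12(off)):
        return None
    total = addr + off
    # Check the SUM: two in-range terms can still overflow the 12-bit field.
    return total if fits_imm12(total) else None
```

For example, `folded_offset(2047, 1)` returns `None` even though both terms fit on their own, while `folded_offset(16, 4)` returns `20`.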

Flag-off behind SYNTH_RV_ADDR_FOLD (default off => byte-identical to baseline, so
the frozen RV32 fixtures and `const_addr_store_not_folded_baseline_472` (#485)
stay green — frozen-safe). The on-target cycle win is validated before the flip.

Oracle (scripts/repro/const_addr_fold_riscv_differential.py, reusing the
redundant_base_materialization fixture): runs `init_fields` (7 constant-address
stores) under unicorn UC_ARCH_RISCV in both flag states; the resulting linear
MEMORY is bit-identical to wasmtime. Non-vacuity: .text 120B -> 64B (-56B = 14
instructions, 2 per store). CI-gated as an isolated `rv32-const-addr-fold-oracle`
job mirroring the shift-fold oracle. 5 unit tests (store/load fold, offset sum,
12-bit range guard, addr-reused decline). RV32 suite (189) + frozen byte gate
(ARM+RV32) green; fmt + clippy clean.

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>