feat(vcr-ra): RV32 immediate-shift-fold: const shift amount into slli/srli/srai, flag-off (#472, #242) #487
Merged
…i/srli/srai, flag-off (#472, #242)

Port the first applicable ARM perf lever to the RV32 backend (scoped in #484). A constant shift `i32.shl/shr_u/shr_s (val) (i32.const N)` lowers as `addi tmp,zero,N ; sll/srl/sra rd,val,tmp`: the amount is materialized into a register, then consumed by the register-form shift. RV32 has immediate shift forms `slli/srli/srai` carrying the amount in the instruction, so folding a constant amount drops the `addi` (one instruction per constant shift).

`fold_const_shift` is a post-pass peephole (mirrors the ARM `fold_immediate_shifts` / `fold_uxth` scaffolding): for each `addi tmp,zero,N`, the windowed scan finds the consuming register shift and rewrites it to the immediate form, dropping the `addi` as a dead store.

Soundness:

* `rs1 != tmp` guard: dropping the `addi` must not remove the shift's input definition;
* the `addi` is removed only when it is a dead store: either the fold's destination IS `tmp` (the `slli` redefines it, reading only `rs1`) or `tmp` is dead after the shift (`rv_reg_dead_after`, the RV32 analogue of the ARM `reg_dead_by_redef`; an unmodeled op ⇒ can't-prove ⇒ keep);
* `shamt = N & 31` reproduces the register `sll`'s hardware low-5-bit mask = WASM's shift-mod-32, so amounts ≥ 32 and negative constants fold identically.

Only the single-`addi` const form (N in -2048..=2047, covering every meaningful amount 0..31) folds; a large constant via `lui+addi` stays a register shift.

Flag-off behind `SYNTH_RV_SHIFT_FOLD` (default off): with the env unset the output is byte-identical to the pre-lever baseline, so the frozen RV32 fixtures (control_step / signed_div_const) are unchanged, frozen-safe by construction. The on-target cycle win is validated before the default-on flip.

Oracle (scripts/repro/shift_fold.wat + shift_fold_riscv_differential.py): every exported function runs under unicorn UC_ARCH_RISCV in both flag states and matches wasmtime, including the mask cases (`shl33` << 33→1, `shlneg` << -1→31) and a VARIABLE shift (`shlvar`) that must NOT fold. Non-vacuity: flag-on `.text` 168B→148B (−20B = exactly 5 const shifts folded); flag-off zero.

6 unit tests cover fold/decline (input-alias guard, live-after, dest==tmp), srl/sra, and the mask. Full RV32 suite (184) + frozen byte gate (ARM+RV32) green; fmt + clippy clean.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
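The fold and its three soundness guards can be sketched as a small Python model. This is an illustration only, not synth's implementation: the `Insn` record, the `reads` and `dead_after` helpers, and the `window` size of 4 are all hypothetical, and `dead_after` is a deliberately conservative stand-in for `rv_reg_dead_after` (running off the end of the listing counts as "can't prove", so the `addi` is kept).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Insn:
    """Hypothetical toy instruction record; not synth's real IR."""
    op: str
    rd: str
    rs1: str
    rs2: Optional[str] = None   # second source register, if any
    imm: Optional[int] = None   # immediate operand, if any


# Register-form shift -> immediate-form shift.
REG_TO_IMM = {"sll": "slli", "srl": "srli", "sra": "srai"}


def reads(insn: Insn) -> set:
    """Registers read by an instruction in this toy model."""
    return {r for r in (insn.rs1, insn.rs2) if r is not None}


def dead_after(code: List[Insn], start: int, reg: str) -> bool:
    """True only if `reg` is redefined before it is read again."""
    for insn in code[start:]:
        if reg in reads(insn):
            return False
        if insn.rd == reg:
            return True
    return False  # ran off the end: cannot prove dead, so treat as live


def fold_const_shift(code: List[Insn], window: int = 4) -> List[Insn]:
    out = list(code)
    i = 0
    while i < len(out):
        a = out[i]
        folded = False
        # Candidate: `addi tmp,zero,N` with a real destination register.
        if a.op == "addi" and a.rs1 == "zero" and a.rd != "zero" and a.imm is not None:
            tmp = a.rd
            for j in range(i + 1, min(i + 1 + window, len(out))):
                s = out[j]
                if s.op in REG_TO_IMM and s.rs2 == tmp:
                    input_ok = s.rs1 != tmp                       # rs1 != tmp guard
                    addi_dead = s.rd == tmp or dead_after(out, j + 1, tmp)
                    if input_ok and addi_dead:
                        # shamt = N & 31 matches the register shift's low-5-bit mask.
                        out[j] = Insn(REG_TO_IMM[s.op], s.rd, s.rs1, imm=a.imm & 31)
                        del out[i]
                        folded = True
                    break
                # Any other reader or redefinition of tmp ends the scan.
                if tmp in reads(s) or s.rd == tmp:
                    break
        if not folded:
            i += 1
    return out
```

In this model, `addi t0,zero,33 ; sll a0,a1,t0` followed by a redefinition of `t0` folds to `slli a0,a1,1`, while `sll a0,t0,t0` (input alias) and a sequence that reads `t0` again after the shift are left untouched.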
avrabe added a commit that referenced this pull request Jun 25, 2026
#472, #242) (#489)

* ci(vcr-oracle): CI-gate the RV32 immediate-shift-fold execution oracle (#472, #242)

VCR-ORACLE-001's deliverable is CI-gating the differential oracles, not just shipping them as dev-time scripts. The RV32 immediate-shift-fold lever (#487, PR landed flag-off behind SYNTH_RV_SHIFT_FOLD) came with a unicorn UC_ARCH_RISCV differential (shift_fold_riscv_differential.py) but it only ran by hand. Since the lever sits flag-off awaiting the on-silicon flip, nothing else exercises the flag-on path, exactly the gap the cmp-select two-move oracle was added to close.

Adds an isolated `rv32-shift-fold-oracle` CI job mirroring the existing `cmp-select-oracle` job: build synth, pip-install wasmtime+unicorn+pyelftools in that job ONLY (the main `cargo test` gate is not taxed with the C-library build graph), and run the differential. It executes every fixture function in BOTH flag states under unicorn and asserts bit-identical-to-wasmtime, continuously validating the slli/srli/srai folds, the `& 31` mask on >=32 and negative shift amounts, and the variable-shift non-fold, plus non-vacuity (.text 168B->148B, 5 folds).

The differential now honors a SYNTH env override (default release for local dev; CI points it at the debug build for speed, like cmp-select).

Frozen-safe: no codegen change, no emitted bytes change; wires an already-written, already-passing oracle into CI.

Verified locally with the exact CI invocation (debug binary via SYNTH=./target/debug/synth): ORACLE PASS. ci.yml parses; new job well-formed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

* fix(oracle): read RV32 fixture symbols from the ELF symtab, not `synth disasm` text

The CI oracle job failed with `SYMBOL MISSING` on the fresh runner while passing locally: the harness scraped function addresses out of `synth disasm` stdout with a regex, and that text is host-dependent (the disasm backend even decodes RISC-V bytes with an ARM decoder, and on the bare runner the symbol-line format differs so the regex matched nothing).

Read the addresses straight from the ELF symbol table via pyelftools instead, the same backend-independent approach base_cse_differential.py uses. synth emits the symtab with an empty section name, so it's found by sh_type (SHT_SYMTAB), and addresses are made .text-relative by subtracting sh_addr.

Re-verified with the exact CI invocation (debug binary via SYNTH env): ORACLE PASS, 5 folds, all 6 functions matched.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
avrabe added a commit that referenced this pull request Jun 25, 2026
…ccess immediate off s11, flag-off (#472, #242) (#491)

Loop-4 step 2 of the RISC-V lever-parity port (#472), the RISC-V analogue of the ARM base-CSE address half (#468). A `i32.load/store (i32.const ADDR) …` lowers as `addi a,zero,ADDR; add tmp,s11,a; lw/sw _,off(tmp)`; when `ADDR+off` fits the signed-12-bit access immediate, `fold_const_addr` collapses it to a single `lw/sw _,(ADDR+off)(s11)`, dropping the `add` and the address `addi`: 2 instructions per constant-address access. Post-pass peephole (the structural twin of the #487 shift fold).

Soundness:

* `ADDR+off` is range-checked as a SUM against [-2048, 2047] (each term already fits in 12 bits, but two in-range values can still sum out of range);
* the `add` base must be s11 and its address operand a `addi a,zero,ADDR` (single-`addi` small constant; a `lui+addi` large address stays the `add` form, out of v1 scope);
* 3->1 rewrite, so BOTH dropped temps must be dead: `tmp` (add result) read only by the access, and `a` (address constant) read only by the `add` (rv_reg_dead_after + an untouched-between-def-and-use check); a bounds check between the add and the access reads `a` and disqualifies the fold.

Flag-off behind SYNTH_RV_ADDR_FOLD (default off => byte-identical to baseline, so the frozen RV32 fixtures and `const_addr_store_not_folded_baseline_472` (#485) stay green, frozen-safe). The on-target cycle win is validated before the flip.

Oracle (scripts/repro/const_addr_fold_riscv_differential.py, reusing the redundant_base_materialization fixture): runs `init_fields` (7 constant-address stores) under unicorn UC_ARCH_RISCV in both flag states; the resulting linear MEMORY is bit-identical to wasmtime. Non-vacuity: .text 120B -> 64B (-56B = 14 instructions, 2 per store). CI-gated as an isolated `rv32-const-addr-fold-oracle` job mirroring the shift-fold oracle.

5 unit tests (store/load fold, offset sum, 12-bit range guard, addr-reused decline). RV32 suite (189) + frozen byte gate (ARM+RV32) green; fmt + clippy clean.
Co-authored-by: Claude Opus 4.8 <noreply@anthropic.com>
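The range rule for the const-address fold (check the sum, not just each term) can be illustrated with a short Python sketch. The helper names `fits_s12` and `folded_access_offset` are hypothetical and exist only for this example; the liveness conditions on `tmp` and `a` are out of its scope. The size arithmetic assumes 4-byte instruction encodings, which is consistent with the "-56B = 14 instructions" figure in the commit message.

```python
from typing import Optional

# Signed 12-bit immediate range of the RV32 load/store offset field.
S12_MIN, S12_MAX = -2048, 2047


def fits_s12(value: int) -> bool:
    """True if `value` is encodable as a signed 12-bit immediate."""
    return S12_MIN <= value <= S12_MAX


def folded_access_offset(addr: int, off: int) -> Optional[int]:
    """Return the immediate for `lw/sw _,(ADDR+off)(s11)`, or None to decline.

    Hypothetical helper: models only the range check, not the dead-temp checks.
    """
    # Each term must be a single-addi constant / existing access offset.
    if not (fits_s12(addr) and fits_s12(off)):
        return None
    # Two in-range terms can still overflow the field, so check the SUM.
    total = addr + off
    return total if fits_s12(total) else None


# Non-vacuity arithmetic from the commit message, assuming 4-byte encodings:
# 7 stores, 2 instructions dropped per store.
BYTES_PER_INSN = 4
SAVED_BYTES = 7 * 2 * BYTES_PER_INSN
```

For instance, `folded_access_offset(16, 4)` yields `20`, while `folded_access_offset(2000, 100)` declines because 2100 exceeds 2047 even though both terms fit on their own.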
What

Ports the first applicable ARM perf lever to the RV32 backend (the lever-by-lever scoping landed in #484). A constant shift `i32.shl/shr_u/shr_s (val) (i32.const N)` lowers on RV32 as `addi tmp,zero,N ; sll rd,val,tmp`: the amount is materialized into a register, then consumed by the register-form shift. RV32 has immediate shift forms `slli`/`srli`/`srai` carrying the amount in the instruction, so folding a constant amount drops the `addi`: one instruction saved per constant shift.

How

`fold_const_shift` is a post-pass peephole mirroring the ARM `fold_immediate_shifts` / `fold_uxth` scaffolding. For each `addi tmp,zero,N`, a windowed scan finds the consuming register shift, rewrites it to the immediate form, and drops the `addi` as a dead store.

Soundness:

* `rs1 != tmp` guard: dropping the `addi` must not remove the shift's input definition (the load-bearing guard).
* `addi` is removed only when it is a dead store: either the fold's destination is `tmp` (the `slli` redefines it, reading only `rs1`) or `tmp` is dead after the shift (`rv_reg_dead_after`, the RV32 analogue of ARM's `reg_dead_by_redef`; an unmodeled op ⇒ can't-prove ⇒ keep).
* `shamt = N & 31` reproduces the register `sll`'s hardware low-5-bit mask = WASM's shift-mod-32, so amounts ≥ 32 and negative constants fold to identical behaviour.

Only the single-`addi` const form (`N` in `-2048..=2047`, covering every meaningful amount `0..31`) folds; a large constant via `lui`+`addi` stays a register shift (out of v1 scope).

Frozen-safe

Flag-off behind `SYNTH_RV_SHIFT_FOLD` (default off). With the env unset the output is byte-identical to the pre-lever baseline, so the frozen RV32 fixtures (`control_step` / `signed_div_const`) are unchanged. The on-target cycle win is validated before the default-on flip, the same gated protocol as the ARM levers.

Oracle

`scripts/repro/shift_fold.wat` + `shift_fold_riscv_differential.py`: every exported function runs under unicorn `UC_ARCH_RISCV` in both flag states and matches wasmtime ground truth, including the mask cases (`shl33` << 33→1, `shlneg` << -1→31) and a variable shift (`shlvar`) that must NOT fold.

Non-vacuity: flag-on `.text` strictly smaller (−20B = exactly 5 folds); flag-off zero; `shlvar` unfolded.

Tests / gates

* 6 unit tests cover fold/decline (input-alias guard, live-after, `dest==tmp`), `srl`/`sra`, the `& 31` mask.
* `fmt` + `clippy` clean.

Part of the #472 RISC-V lever-parity slice under epic #242 (VCR-*). Next steps (separate gated PRs): const-address-fold, then local-promotion carrying the #474 promotion-exhaustion fallback.
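As a worked check of the mask claim, the following Python sketch compares `N & 31` against the WebAssembly rule (shift count taken modulo 32, with the `i32` constant read as an unsigned 32-bit value) over the whole single-`addi` range. The function names are made up for this example, and the byte arithmetic assumes 4-byte instruction encodings, consistent with "−20B = exactly 5 folds".

```python
MASK32 = 0xFFFFFFFF


def wasm_shift_count(n: int) -> int:
    """Effective i32 shift count in WebAssembly: the operand modulo 32."""
    return (n & MASK32) % 32


def folded_shamt(n: int) -> int:
    """Immediate emitted by the fold, per `shamt = N & 31`."""
    return n & 31


def wasm_i32_shl(val: int, n: int) -> int:
    """Reference i32.shl on 32-bit values."""
    return (val << wasm_shift_count(n)) & MASK32


def slli(val: int, shamt: int) -> int:
    """RV32 slli: shamt is a 5-bit field, so it must already be in 0..31."""
    assert 0 <= shamt <= 31
    return (val << shamt) & MASK32


# Every constant a single `addi` can materialize folds to the same count.
ALL_MATCH = all(
    folded_shamt(n) == wasm_shift_count(n) for n in range(-2048, 2048)
)

# Size delta for 5 folded shifts, assuming 4-byte encodings.
SAVED_BYTES = 5 * 4
```

With these definitions, the two mask cases from the oracle come out as `folded_shamt(33) == 1` and `folded_shamt(-1) == 31`, and `slli(x, folded_shamt(n))` agrees with `wasm_i32_shl(x, n)` for any 32-bit `x`.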
🤖 Generated with Claude Code