From eb85bbbd3d4e466a5e85185db7b5c0457a9a3256 Mon Sep 17 00:00:00 2001
From: Ralf Anton Beier <ralf_beier@me.com>
Date: Thu, 25 Jun 2026 09:27:17 +0200
Subject: [PATCH] =?UTF-8?q?docs(vcr-ra):=20RISC-V=20lever-parity=20scoping?=
 =?UTF-8?q?=20spike=20=E2=80=94=20map=20ARM=20perf=20levers=20to=20RV32=20?=
 =?UTF-8?q?(#472,=20#242)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Frozen-safe scoping spike (no codegen change) for the RISC-V lever port. Reads the
RV32IMAC backend source and measures the per-function overhead, mapping each ARM
perf lever to its RV32 status:

- cmp→select: N/A for RV32IMAC — no conditional-move (Zicond not in IMAC, no
  predication); `lower_select` is already the minimal branchy form.
- local-promotion: APPLIES (direct #390 analogue) — non-param i32 locals are
  always frame-spilled (sw/lw off(sp)); port to s-register homing, leaf-only,
  carrying the #474 promotion-exhaustion fallback from the start.
- immediate-shift-fold: APPLIES (RV form) — const shift amounts use the register
  sll/sra/srl (li tmp,N; sll); fold to slli/srli/srai (the ops already exist).
- const-address-fold: APPLIES (RISC-V-specific) — RV already holds the linmem base
  in s11 (no base re-materialization, so #468's base-hoist half is N/A), but const
  lw/sw addresses do `li addr; add tmp,s11,addr; lw/sw off(tmp)` instead of folding
  to `lw/sw (ADDR+off)(s11)`.

Scope-changing finding: the port is 2 levers + 1 RISC-V-specific fold, not a 1:1
port of all three named ARM levers (cmp→select does not apply to RV32IMAC).

Measured .text (RV32 vs ARM): redundant_base 120B/30insn (const-addr-fold headroom
~56B), leaf_caller_saved 104B (local-promotion), shifts 44B (imm-shift-fold ~8B).

Lays out the gated per-lever implementation plan (each flag-off → RV32 differential
→ qemu_riscv32/ESP32-C3 cycle gate → flip) and notes the oracle gap: the RV32 path
has no cargo byte-gate and no local RISC-V disassembler, so the differential needs
an RV32 execution harness + a small instruction decoder, built as part of step 1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
---
 scripts/repro/riscv_lever_parity_472.md | 113 ++++++++++++++++++++++++
 1 file changed, 113 insertions(+)
 create mode 100644 scripts/repro/riscv_lever_parity_472.md

diff --git a/scripts/repro/riscv_lever_parity_472.md b/scripts/repro/riscv_lever_parity_472.md
new file mode 100644
index 0000000..01081fb
--- /dev/null
+++ b/scripts/repro/riscv_lever_parity_472.md
@@ -0,0 +1,113 @@
+# #472 scoping spike — port the ARM perf levers to the RISC-V backend
+
+**Issue:** synth#472 · **Epic:** #242 (VCR-*) · north-star #390
+**Status:** SCOPING SPIKE (no codegen change — frozen-safe by construction).
+The byte-changing lever ports are the explicitly-separate next gated steps
+(flag-off → RV32 differential → qemu_riscv32 / ESP32-C3 cycle gate → default-on
+flip), exactly like the ARM cmp→select / local-promotion / imm-shift / dead-frame /
+base-CSE levers.
+
+## The gap
+
+The RV32 backend lags the ARM backend on dissolved code: the ARM perf levers that
+landed in v0.13–v0.15 (cmp→select, local-promotion, immediate-shift-fold, dead-frame
+elimination, base-CSE) have **no RV32 equivalent yet**. #472 asks: port the
+applicable levers to `synth-backend-riscv` so RV32 closes the same per-function
+overhead the ARM levers removed.
+
+**Load-bearing finding — the port is NOT a 1:1 mapping of all three named levers.**
+Read against the RV32IMAC backend source, one lever does not apply, one applies
+directly, one applies in a RISC-V-specific form, and there is a fourth,
+RISC-V-only opportunity the ARM base-CSE work surfaced.
+
+## Lever-by-lever (source-grounded in `crates/synth-backend-riscv/src/selector.rs`)
+
+### 1. cmp→select → **N/A for RV32IMAC**
+
+`lower_select` (selector.rs:930) emits the branchy form — `bne cond,zero,La;
+mv dst,b; j Lend; La: mv dst,a; Lend:` (5 instructions for an i32 select). This is
+**already the minimal correct lowering**: RV32IMAC has **no conditional-move**
+(`czero.*` is the Zicond extension, not in IMAC; there is no predication like ARM's
+IT block). The ARM lever fused cmp→select into IT-predicated moves — there is no
+RV32IMAC instruction to fuse into. **Conclusion:** skip for the base ISA. A future,
+separate item could exploit branchless idioms for *special* selects (select-vs-0
+via `seqz`/`neg`/`and` masking; or Zicond when targeting a core that has it), but
+that is not a port of the ARM lever and is out of #472's scope.
+
+### 2. local-promotion → **APPLIES (direct analogue of ARM #390)**
+
+Non-param i32 locals are **always frame-spilled**: `local.set` emits
+`sw src, off(sp)` and `local.get` emits `lw dst, off(sp)` (selector.rs:1070-1090,
+~1023). A leaf function pays `sw`+`lw` traffic for every local access where a
+callee-saved register would do, exactly the ARM #390 situation. RV32 has a
+generous register file (s1–s11 callee-saved, though `s11` is reserved for the
+linear-memory base; t0–t6 temps), so leaf locals can be register-promoted with no
+prologue cost beyond the existing `preserve_callee_saved` save (selector.rs:307,317),
+which already brackets touched callee-saved regs. **Port:** the ARM
+`compute_local_promotion` shape (i32, write-before-read, depth-0, ≥2 reads, leaf-only)
+maps directly; home eligible locals in s-registers instead of frame slots.
+
+### 3. immediate-shift-fold → **APPLIES (RISC-V-specific form)**
+
+`I32Shl/ShrS/ShrU` lower through `bin` (selector.rs:1184) to the **register** shift
+forms `sll/sra/srl` (selector.rs:739). A constant shift amount is first
+materialized into a register (`i32.const N` → `emit_load_imm` = `li tmp,N`), then
+consumed as `rs2`. RV32 has immediate shift forms `slli/srli/srai` (shamt in the
+instruction) — folding a constant shift amount into them drops the `li tmp,N`.
+**Port:** when the shift-amount operand is a known `i32.const` in `[0,31]`, emit
+`slli/srli/srai rd, rs1, N` instead of `li tmp,N; sll/srl/sra rd,rs1,tmp`, saving one
+instruction per constant shift. (`Slli/Srli/Srai` ops already exist in the op enum,
+selector.rs:233-235; only the selector wiring is missing.)
+
+### 4. const-address-fold → **APPLIES (RISC-V-only; the base-CSE analogue)**
+
+The ARM base-CSE lever (#468) had two halves: (a) hoist the re-materialized
+linear-memory base, and (b) fold constant store/load addresses into the access
+immediate. **Half (a) does not exist on RV32** — the base already lives
+persistently in `s11`/x27 (`LINEAR_MEM_BASE`, selector.rs:137); it is never
+re-materialized. But **half (b) is missing**: `lower_load_word`/`lower_store_word`
+(selector.rs:1493+) unconditionally emit `add tmp, s11, addr; lw/sw dst, off(tmp)`,
+where `addr` for a constant address is a `li`'d register. For a constant address
+with `ADDR+offset` in the 12-bit signed `lw/sw` immediate window (-2048 to 2047), this
+folds to `lw/sw dst, (ADDR+offset)(s11)`, dropping **both** the `li addr,ADDR` and
+the `add tmp,s11,addr` (2 instructions per constant-address access).
+
+## Measured sizes (`.text`, this spike's fixtures + the existing corpus)
+
+| fixture | ARM `.text` | RV32 `.text` | RV32 lever headroom |
+|---|---:|---:|---|
+| `redundant_base_materialization` (7 const-addr stores) | 336 B | 120 B (30 insn) | const-addr-fold: ~2 insn × 7 stores × 4 B = **~56 B** |
+| `leaf_caller_saved` (1 param + 3 i32 locals) | 200 B | 104 B (26 insn) | local-promotion: the `sw`/`lw` local traffic |
+| `shifts` (2 const shifts) | 188 B | 44 B (11 insn) | imm-shift-fold: 1 `li` × 2 shifts × 4 B = **~8 B** |
+
+(RV32 `.text` is already smaller than ARM in absolute bytes: the RV32 output here is
+all 4-byte encodings (no compressed forms) vs Thumb-2's mixed 2/4, and these leaves
+avoid ARM's `movw/movt` base materialization. The lever headroom is the RV32-vs-RV32
+win (the per-function overhead removed), which is what the on-target cycle ratio reflects.)
+
+## Must carry from the start: the #474 promotion-exhaustion fallback
+
+ARM local-promotion shipped a v0.15.1 fix (#474): promotion must **never** cause a
+compile failure — when register pressure exhausts the promotable pool, fall back to
+the frame-slot path rather than erroring. The RV32 local-promotion port must carry
+the same fallback by construction: if the s-register budget is exhausted, the local
+stays frame-slotted (the current, correct path). The RV32 selector already has an
+`alloc_exhausted` → `Unsupported` → skip-and-continue mechanism (#226) to model
+this against.
+
+## Gated implementation plan (each a separate PR)
+
+1. **imm-shift-fold** (smallest, self-contained): const shift amount →
+   `slli/srli/srai`. Flag-off → RV32 differential → cycle gate → flip.
+2. **const-address-fold**: const `lw/sw` address → fold into the access immediate
+   off `s11`. Flag-off → RV32 differential → cycle gate → flip.
+3. **local-promotion**: port `compute_local_promotion` to s-registers, leaf-only,
+   carrying the #474 fallback. Flag-off → RV32 differential → cycle gate → flip.
+4. (cmp→select: out of scope for RV32IMAC — see above.)
+
+**Oracle note:** the RV32 path has no cargo byte-gate and (unlike ARM) no local
+RISC-V disassembler on the dev host. The differential will need an RV32 execution
+harness (unicorn `UC_ARCH_RISCV` or qemu_riscv32, comparing against wasmtime), and
+a small RV32 instruction decoder for byte-level assertions — built as part of step 1.
+The final cycle win is validated on qemu_riscv32 / ESP32-C3, the same on-target
+protocol as the ARM cycle gate.