From 8a87289caf6837c2d30b40b7b7d04d217321bfd5 Mon Sep 17 00:00:00 2001 From: Juneyoung Lee Date: Tue, 12 May 2026 06:07:40 +0000 Subject: [PATCH] Add program_equivalence.md which explains the proof structures of program equivalence in s2n-bignum --- arm/tutorial/README.md | 2 + doc/program_equivalence.md | 727 +++++++++++++++++++++++++++++++++++++ 2 files changed, 729 insertions(+) create mode 100644 doc/program_equivalence.md diff --git a/arm/tutorial/README.md b/arm/tutorial/README.md index a96b4b776..92e59762c 100644 --- a/arm/tutorial/README.md +++ b/arm/tutorial/README.md @@ -24,3 +24,5 @@ To verify programs in x86, see `x86/tutorial`. 3. `rel_reordertac.ml`: Proving equivalence of two programs where the second one has instructions reordered from that of the first one. 4. `rel_loop.ml`: Proving equivalence of two simple loops. 5. `rel_veceq.ml`: Proving equivalence of scalar vs. vectorized 128x128->256-bit squaring. + +[doc/program_equivalence.md](doc/program_equivalence.md) has detailed explanations about the structures of program equivalence proofs in s2n-bignum. diff --git a/doc/program_equivalence.md b/doc/program_equivalence.md new file mode 100644 index 000000000..8315c2540 --- /dev/null +++ b/doc/program_equivalence.md @@ -0,0 +1,727 @@ +# Program Equivalence Proofs in s2n-bignum + +## 1. Introduction + +This file explains how program equivalence proofs are written for correctness of +NIST curves in s2n-bignum as well as the optimized Montgomery reduction algorithm. +The theoretical foundation for these proofs is the relational Hoare logic `L2` +described in Mazzucato et al., *"Relational Hoare Logic for Realistically Modelled +Machine Code"* ([arXiv 2505.14348](https://arxiv.org/abs/2505.14348), CAV'25) + +The [`arm/tutorial/`](https://github.com/awslabs/s2n-bignum/tree/main/arm/tutorial) +directory has a few small examples starting with `rel_`. The `x86/tutorial` +directory also has a few, but Arm has more because program equivalence has been +more extensively used in Arm's NEON vectorization proofs targeting Graviton2. + +## 2. Program Equivalence of Field Operations + +There are several proofs of linear optimizations for the field operations of the +P-256/384/521 elliptic curves: + +- [`arm/proofs/bignum_montmul_p256.ml`](arm/proofs/bignum_montmul_p256.ml), + [`bignum_montmul_p384.ml`](arm/proofs/bignum_montmul_p384.ml), + [`bignum_montmul_p521.ml`](arm/proofs/bignum_montmul_p521.ml), +- corresponding `bignum_montsqr_*` files, +- [`bignum_mul_p521.ml`](arm/proofs/bignum_mul_p521.ml) and + [`bignum_sqr_p521.ml`](arm/proofs/bignum_sqr_p521.ml). + +Each of these composes (1) NEON vectorization followed by (2) SLOTHY instruction +reordering. Each step proves a separate `ensures2` equivalence theorem; the two are +then composed into a single scalar↔optimized equivalence; and the composed +equivalence is combined with the pre-existing `ensures`-style functional correctness +of the scalar reference to yield functional correctness of the optimized program. + +We use `bignum_montmul_p256` as the running example and call out P384/P521 +differences in passing. All name references below are from +[`arm/proofs/bignum_montmul_p256.ml`](arm/proofs/bignum_montmul_p256.ml) unless +noted otherwise. + +### 2.1 The three machine-code versions + +Three versions of the routine coexist in the proof script: + +| variable | role | source | +|---|---|---| +| `bignum_montmul_p256_unopt_mc` | original, scalar reference | `arm/p256/unopt/bignum_montmul_p256_base.o` | +| `bignum_montmul_p256_interm1_mc` | NEON-vectorized intermediate | inline opcode list in the `.ml` proof script | +| `bignum_montmul_p256_mc` | final, NEON + SLOTHY-scheduled | `arm/p256/bignum_montmul_p256.o` | + +Each is stripped of function prologue/epilogue into a `_core_mc` slice (`bignum_montmul_p256_unopt_core_mc`, +`bignum_montmul_p256_interm1_core_mc`, `bignum_montmul_p256_core_mc`) used by the +`ensures2` statements. + +### 2.2 State equivalence relations + +Because the two programs manipulate the same scalar-level *input* and need to produce +the same scalar-level *output*, the input and output state relations are both small: + +``` +montmul_p256_eqin (s1,s1') x y z ⇔ + C_ARGUMENTS [z; x; y] s1 ∧ C_ARGUMENTS [z; x; y] s1' ∧ + (∃a. bignum_from_memory (x,4) s1 = a ∧ bignum_from_memory (x,4) s1' = a) ∧ + (∃b. bignum_from_memory (y,4) s1 = b ∧ bignum_from_memory (y,4) s1' = b) + +montmul_p256_eqout (s1,s1') z ⇔ + ∃a. bignum_from_memory (z,4) s1 = a ∧ bignum_from_memory (z,4) s1' = a +``` + +Three points are worth noting. + +1. **Pointers are concrete and equal** — `C_ARGUMENTS [z; x; y]` pins both states to + the same caller-provided addresses (and register assignments via the AArch64 ABI). +2. **Memory contents are abstract and equal** — we say "there exists `a` such that + both sides read `a` at `(x,4)`", rather than naming `a`. Existential quantification + over `a, b` is the canonical way to state "the two states happen to agree on this + memory", and it scales to the output relation without plumbing the actual modular + value through the equivalence proof. +3. **No other components are constrained by `eqin`/`eqout`.** Their free variation + is also encoded instead in the *frame* (`MAYCHANGE`) clauses of each `ensures2` goal + which will be explained later (e.g. `equiv_goal1` and `equiv_goal2` in the proof script). + The equivalence only ever asserts the scalar result deposited at `(z,4)` is the same on both + sides, and both sides are free to scribble on e.g., SIMD regs in any way they like. + +The *same* `eqin`/`eqout` pair is reused for both the scalar↔NEON and NEON↔SLOTHY +equivalences — a nontrivial fact that only works because the NEON intermediate +happens to agree with both the scalar reference *and* the SLOTHY-scheduled output on +the value deposited at `(z,4)`, despite the vast differences in their traces. + +**Building the full `ensures2` spec (`equiv_goal` term).** Each `equiv_goal1` / `equiv_goal2` is +constructed by the helper `mk_equiv_statement` (in `arm/proofs/equiv.ml`) — +this is the standard way to produce an `ensures2` goal in s2n-bignum. It +takes the following arguments: + +- A side-condition assumption (typically `nonoverlapping (…)` clauses); +- The input and output state relations `eqin` / `eqout`; +- The two executables, each together with its PC offset range + `(pc_ofs, pc_ofs_to)` identifying the slice of code to relate; +- The two `MAYCHANGE` frames, one per side; +- Two step-count functions, one per side. + +It assembles the universally quantified `ensures2` term with the expected +`aligned_bytes_loaded` / `read PC` boilerplate already in place. + +The convenience wrapper `mk_equiv_statement_simple` covers +the common case of "the whole of each `_mc`, start to end" without requiring +explicit PC offsets. The same helpers are used everywhere an `ensures2` goal +is built in the rest of this document. + +### 2.3 Proving the two equivalences + +``` +BIGNUM_MONTMUL_P256_CORE_EQUIV1 -- unopt_core ↔ interm1_core +BIGNUM_MONTMUL_P256_CORE_EQUIV2 -- interm1_core ↔ core (SLOTHY) +``` + +For each of the two optimizations there exists a proof whose definition is +`prove(equiv_goal, …)`. Their theorem statements are almost the same, but their +proofs use very different tactics. + +**Scalar ↔ NEON (`EQUIV1`).** After `EQUIV_INITIATE_TAC montmul_p256_eqin` unfolds +the input relation, the main work is done by + +``` +EQUIV_STEPS_TAC actions + BIGNUM_MONTMUL_P256_UNOPT_CORE_EXEC + BIGNUM_MONTMUL_P256_INTERM1_CORE_EXEC +``` + +`EQUIV_STEPS_TAC` traverses an `actions : (string * int * int * int * int) list` +describing per-range correspondences between the two programs: + +- `"equal"` ranges run as *lock-step* symbolic simulation — one step on each side, + both instructions produce the same symbolic output, and the common value is + immediately abbreviated by a fresh variable to prevent expression blow-up. +- `"replace"`, `"insert"`, `"delete"` ranges run *stuttering* simulation — one side + advances while the other does nothing (or vice versa), and the resulting output + expressions are matched up by rewriting. + +The hard case for NEON↔scalar is the `"replace"` one: a single vector instruction +computes the same 64-bit word as a group of scalar `mul`/`umulh`/carry-chain +instructions, but their *syntactic* symbolic expressions look nothing alike. This is +where *user-registered bit-vector equality theorems* come in. The proof script +temporarily extends `extra_word_CONV`: + +```ocaml +let _org_extra_word_CONV = !extra_word_CONV;; +extra_word_CONV := + [GEN_REWRITE_CONV I [WORD_BITMANIP_SIMP_LEMMAS; WORD_MUL64_LO; WORD_MUL64_HI]] + @ (!extra_word_CONV);; +``` + +and restores it after the `EQUIV1` proof. `WORD_MUL64_LO`/`WORD_MUL64_HI` (in +[`arm/proofs/neon_helper.ml`](arm/proofs/neon_helper.ml)) rewrite the NEON compound- +multiplication idiom (sequences of `UMULL`/`UZP`/`USRA`/`UMLAL` computing +`mul_lo`/`mul_hi` of two 64-bit lanes) into the corresponding scalar `MUL`/`UMULH` +outputs. `WORD_BITMANIP_SIMP_LEMMAS` cleans up small bit-permutation helpers +(`xtn`, `uzp1`, `rev64`, …). Once these rewrites close the stutter-step, the tactic +converges and proceeds. + +**NEON ↔ SLOTHY (`EQUIV2`).** Here the two programs contain *the same set of +instructions* in a permuted order. `EQUIV_STEPS_TAC` is not used; instead the proof +uses the two-phase tactic pair + +``` +ARM_N_STEPS_AND_ABBREV_TAC BIGNUM_MONTMUL_P256_INTERM1_CORE_EXEC (1--N) state_to_abbrevs None +ARM_N_STEPS_AND_REWRITE_TAC BIGNUM_MONTMUL_P256_CORE_EXEC (1--N) inst_map state_to_abbrevs None +``` + +The first tactic stuttering-simulates the left (`interm1`) program, abbreviating +each instruction's symbolic output and storing the mapping from state number to +abbreviation in the OCaml ref `state_to_abbrevs`. The second tactic then simulates +the right (SLOTHY) program one instruction at a time; at step `k` on the right it +looks up the abbreviation via `inst_map[k]` — the permutation from SLOTHY's schedule +back to original order — and discharges the step by equating the right output with +the stored abbreviation. `inst_map` is emitted by SLOTHY and pasted into the proof +script. + +### 2.4 Horizontal composition + +`BIGNUM_MONTMUL_P256_CORE_EQUIV` states `unopt_core ↔ core (SLOTHY)`. Its proof +is essentially a single call to `EQUIV_TRANS_TAC` — the tactic form of Lemma 4 +(*frame-condition composition*) of the paper: + +``` +EQUIV_TRANS_TAC + (BIGNUM_MONTMUL_P256_CORE_EQUIV1, BIGNUM_MONTMUL_P256_CORE_EQUIV2) + (montmul_p256_eqin, montmul_p256_eqin, montmul_p256_eqin) + montmul_p256_eqout_TRANS + (…_UNOPT_CORE_EXEC, …_INTERM1_CORE_EXEC, …_CORE_EXEC) +``` + +The first argument pairs the two stage theorems; the second declares that the *same* +input relation bridges all three programs; the third is a trivial transitivity +lemma for `eqout` (`montmul_p256_eqout_TRANS`); the fourth supplies the three +executables. A preceding `SUBGOAL_THEN` existentially introduces the intermediate +program's load address `pc3` and discharges its `nonoverlapping` obligations via +`FIND_HOLE_TAC`; this intermediate `pc3` does not appear in the final specification. + +**What is this `FIND_HOLE_TAC`?** To chain the two stage equivalences we +need a 64-bit address `pc3` at which the intermediate program (`interm1_mc`) can +be loaded such that `pc3` does not overlap any of the already-fixed regions — +the original program at `pc`, the SLOTHY program at `pc2`, the input/output +buffers at `x`, `y`, `z`, and so on. `FIND_HOLE_TAC` proves that such a `pc3` +exists by arithmetic on the given size bounds of all those regions: the total +occupied space has to be less than `2^64`, leaving a hole somewhere to slot the +intermediate program into. + +This 'find an empty memory area to place the intermediate machine code' is also +the reason why `bignum_emontredc_8n_cdiff` carries the extra +`k < 2^32` criterion (where k is the number of words) in its precondition, +which didn't exist in `bignum_emontredc_8n`. +Without that bound, the input/output buffers alone would be allowed to be so large +that no hole of the intermediate program's size would be guaranteed to exist in +the remaining 64-bit address space, and `FIND_HOLE_TAC` would not go through. + +### 2.5 Lifting the result to `ensures` on the optimized program + +The composed equivalence alone is *not* yet a functional-correctness theorem for the +optimized program; it says "the optimized program produces the same output as the +scalar one *if* the scalar program produces some output at all". We need to bridge +it with an `ensures_n` fact about the scalar program. This is the content of +Appendix D of the paper. + +**Step 1 — prove `eventually_n_at_pc` for the scalar core** +(`BIGNUM_MONTMUL_P256_EVENTUALLY_N_AT_PC`). The goal is stated against +`APPEND bignum_montmul_p256_unopt_core_mc barrier_inst_bytes` — the scalar code +followed by the *stopper* (a non-decodable 4-byte word). The stopper is essential: +it forces the number of steps to the terminal PC to be *unique*, which is exactly +the hypothesis of Lemma 6. The proof itself is mechanical, driven by +`PROVE_EVENTUALLY_IMPLIES_EVENTUALLY_N_TAC`. Note that +`PROVE_EVENTUALLY_IMPLIES_EVENTUALLY_N_TAC` only works when the program is +supposed to have a *constant* number of steps to finish — it cannot handle a +program containing a loop that iterates a symbolic `n` times, because it needs to +statically unroll every step. Routines with data-dependent loops (such as +`bignum_emontredc_8n_cdiff`'s main loop in §4) therefore require a different lifting +strategy. + +**Step 2 — promote `ensures` to `ensures_n`** +(`BIGNUM_MONTMUL_P256_UNOPT_CORE_CORRECT_N`). One call to `prove_ensures_n` +combines the scalar `ensures` theorem with the `eventually_n_at_pc` lemma and emits +the `ensures_n` version mechanically (Lemma 6). + +**Step 3 — combine equivalence with `ensures_n`**. The optimized-core correctness +theorem `BIGNUM_MONTMUL_P256_CORE_CORRECT` is proved with a single invocation of + +``` +PROVE_ENSURES_FROM_EQUIV_AND_ENSURES_N_TAC + BIGNUM_MONTMUL_P256_CORE_EQUIV + BIGNUM_MONTMUL_P256_UNOPT_CORE_CORRECT_N + (BIGNUM_MONTMUL_P256_UNOPT_CORE_EXEC, BIGNUM_MONTMUL_P256_CORE_EXEC) + (montmul_p256_eqin, montmul_p256_eqout) +``` + +The tactic instantiates the existentially quantified intermediate `pc` of the +scalar program (discharged by a `FIND_HOLE_TAC` just above) and applies the hybrid- +ensures theorem (Theorem 4 in the paper). Crucially, because `P`, `Q`, `R` appear +under existentials in that theorem, *the stopper condition disappears from the +final specification*: `BIGNUM_MONTMUL_P256_CORE_CORRECT` is a pure `ensures` +statement about `bignum_montmul_p256_core_mc`, with no trace of the barrier. + +Wrapping the core theorem back into the subroutine form +(`BIGNUM_MONTMUL_P256_SUBROUTINE_CORRECT`) is then a routine application of +`ARM_ADD_RETURN_NOSTACK_TAC`. + +**P384/P521.** The structure is identical. Same three-versions layout +(`…_unopt_core_mc`, `…_interm1_core_mc`, `…_core_mc`), same `eqin`/`eqout` (with +6-word and 9-word bignums respectively), same `EQUIV1`/`EQUIV2`/composed-`EQUIV` +names, same final `PROVE_ENSURES_FROM_EQUIV_AND_ENSURES_N_TAC`. + +## 3. Program Equivalence of Point Operations + +The proofs of [`p256_montjadd.ml`](arm/proofs/p256_montjadd.ml) and +[`p384_montjadd.ml`](arm/proofs/p384_montjadd.ml) follow the same master plan as §2 +but at a much larger scale: the unoptimized routine makes several `bl` calls to +Montgomery multiplication/squaring subroutines and has significant redundancy across +those call boundaries, which is exploited by two *extra* optimizations: + +1. **Function inlining.** Every `bl bignum_montmul_p256` etc. is expanded into the + callee's body, the trailing `ret` is dropped, and the argument-setup code is + merged with the callee's prologue. +2. **Redundant memory-access elimination.** Across the just-removed call boundary + the optimizer identifies store/reload pairs that are bypassable (the value is + still in a register) and deletes the memory access. + +These are followed by NEON vectorization and SLOTHY scheduling of the individual +field operations. +Importantly, SLOTHY is only applied to the individual inlined field operations +(`bignum_montmul_p*`, `bignum_montsqr_p*`, `bignum_sub_p*`), *not* to the whole +inlined point-operation routine: running SLOTHY on the entire inlined body was +infeasible because SLOTHY's scheduling search did not converge in a reasonable time. + +The optimizations themselves were produced by a custom (and ad-hoc) OCaml script, +but it is the `ensures2` equivalence proof that *verifies* the rewrites are +semantics-preserving. + +### 3.1 Proving equivalence after the two optimizations in one shot + +Unlike §2, only two machine-code definitions are introduced for each point +operation, because all rewrites above are bundled into a single monolithic +transformation in the final `.o`: + +| variable | role | +|---|---| +| `p256_montjadd_unopt_mc` | original, with `bl` calls | +| `p256_montjadd_mc` | final: inlined + mem-elim | + +Note that the two optimizations that this single `EQUIV_STEPS_TAC` actually +has to verify are only (1) function inlining and (2) redundant memory-access +elimination. The NEON vectorization and SLOTHY scheduling have already been +applied, and separately verified, at the level of the inlined field operations +(`BIGNUM_MONTSQR_P256_CORE_CORRECT`, `BIGNUM_MONTSQR_P256_CORE_CORRECT`, …, from §2); +what gets inlined into the point-op body. + +The `actions` argument consumed by the main `EQUIV_STEPS_TAC` invocation is +assembled progressively to match this decomposition. Each individual rewrite +contributes its own action list (`"call montsqr_p256"`, `"call montmul_p256"`, +`"call sub_p256"`, identity `"equal"` blocks, `"replace"`/`"insert"`/`"delete"` +for mem-elim), and the `merge_actions` helper in +[`arm/proofs/utils/p256_montjadd_params.ml`](arm/proofs/utils/p256_montjadd_params.ml) +composes them into a single `actions_merged` list that is passed to the main +equivalence proof. + +### 3.2 Input / output relation + +`p256_montjadd_eqin` again puts the three operand pointers in `C_ARGUMENTS` and +requires SP to match, then asks the 12-word Jacobian points at `p1` and `p2` to be +abstract-and-equal. `p256_montjadd_eqout` is carefully split into *three* 4-word +reads at offsets 0, 32, 64 of `p3` — one per projective coordinate (x, y, z). +Splitting this way makes the final postcondition easier to discharge because each +coordinate gets its own existential `a0`/`a1`/`a2`. + +### 3.3 For faster symbolic simulation: dead-value information and caching tricks + +The core equivalence `P256_MONTJADD_EQUIV` uses `EQUIV_STEPS_TAC` as in §2.3, but +its arguments are much richer: + +``` +EQUIV_STEPS_TAC + ~dead_value_info_left:p256_montjadd_unopt_dead_value_info + ~dead_value_info_right:p256_montjadd_dead_value_info + actions_merged P256_MONTJADD_UNOPT_CORE_EXEC P256_MONTJADD_EXEC +``` + +Two things to notice. + +- `dead_value_info_left`/`right` are *precomputed per-instruction liveness + annotations*. At each step, the symbolic simulator is told which registers and + memory locations are *dead* (not used on any path to the terminal PC). Dead + values are allowed to differ across the two states, which is what licenses the + removal of redundant stores: after the optimizer deletes a store, the right-hand + memory cell holds stale data while the left-hand one holds the freshly written + value — but since the cell is dead from that point, the simulator doesn't need + to reconcile the two. Without `dead_value_info`, redundant-store elimination + would induce spurious proof obligations that never close. +- A global result-cache ref `orthogonal_components_conv_custom_cache` is installed + before `P256_MONTJADD_EQUIV` and reset after it. It short-circuits expensive + component-orthogonality checks (the ones that decide whether two memory writes + can safely commute) by caching `(state-component, state-component)` results + across the whole symbolic simulation. Empirically this brings roughly a 20% + speedup on the point-op equivalence proofs. + +### 3.4 End-to-end correctness + +Same recipe as §2.5, just with a slightly different closing tactic: + +``` +VCGEN_EQUIV_TAC P256_MONTJADD_EQUIV P256_MONTJADD_UNOPT_CORE_CORRECT_N + [fst P256_MONTJADD_UNOPT_CORE_EXEC; fst P256_MONTJADD_EXEC] +``` + +`VCGEN_EQUIV_TAC` (in [`arm/proofs/equiv.ml`](arm/proofs/equiv.ml)) is a slightly +heavier cousin of `PROVE_ENSURES_FROM_EQUIV_AND_ENSURES_N_TAC`: it exposes three +remaining subgoals (precondition, postcondition, frame) that the caller then +discharges, instead of packaging them behind a canned pattern. For the point +operations the precondition subgoal in particular requires custom work because the +input-state relation mentions SP and a 224-byte stack scratch area, which +`PROVE_ENSURES_FROM_EQUIV_AND_ENSURES_N_TAC`'s canned strategy cannot anticipate. + +[`p384_montjadd.ml`](arm/proofs/p384_montjadd.ml) follows exactly the same pattern +with a different register set and 6×3-word Jacobian coordinates. + +## 4. Program Equivalence of `bignum_emontredc_8n_cdiff` + +`bignum_emontredc_8n_cdiff` is the extended Montgomery reduction of an arbitrary +bignum, used inside bignum modular multiplication. Compared to the existing +[`bignum_emontredc_8n`](arm/proofs/bignum_emontredc_8n.ml), the `_cdiff` variant +adds: + +1. **NEON vectorization** of the expensive `MUL`/`UMULH` quartets in both the outer- + loop prologue and the inner add-multiply loop (this is what `bignum_emontredc_8n_neon` + already achieves, but the `_cdiff` variant folds it in too). +2. **SLOTHY instruction reordering + software pipelining** of every basic block. + The software pipelining hoists work for iteration `i+1` up into iteration + `i`. +3. **Cached multiplied differences** — an ADK-style trick: pairwise differences of + the four-word outer-loop-iteration bignum digits are computed once and reused + by the inner loop, avoiding redundant subtractions. (This is what the `_cdiff` + suffix stands for; see §4.1 for details.) + +The software-pipelining part of (2) is new relative to §2 and §3: it requires +rotating the inner loop, which in turn forces us to introduce an extra +intermediate program beyond the plain scalar/NEON/SLOTHY stack. The +equivalence proof for the pipelined loop is not attempted in one shot; instead +§4.2 below will decompose it into a chain of smaller equivalences - +(1) partitioning the loop body into two smaller parts `[A; B]`, +(2) rotating the loop into `A; [B; A]; B`, and +(3) finally applying instruction reordering to the instructions in basic blocks +`A`, `[B; A]` and `B` separately. + +The proof is in +[`arm/proofs/bignum_emontredc_8n_cdiff.ml`](arm/proofs/bignum_emontredc_8n_cdiff.ml). +All name references in this section are from that file unless noted. + +### 4.1 Where are the cached multiplied differences computed? + +Computing the cached differences is done by a *precomputation loop* placed at the +very entry of the optimized routine. The loop walks over the modulus words +and fills a dedicated `m_precalc` buffer with pairwise absolute differences and +their sign bits, following the recursive definition + +``` +get_m_precalc n 0 = 0 +get_m_precalc n (i_div_4 + 1) = get_m_precalc n i_div_4 + + 2 EXP (64 * 12 * i_div_4) + * bignum_of_wordlist (precalc_diffs n (i_div_4+1)) +``` + +(`get_m_precalc`, `precalc_diffs`). The precomputation-loop correctness lemma +`BIGNUM_EMONTREDC_8N_CDIFF_PRECALCLOOP` says exactly that, after executing this +entry block, `m_precalc` holds `get_m_precalc m (k DIV 4 − 1)`. + +Because the original `bignum_emontredc_8n` has no such precomputation step, the +equivalence between original and optimized is necessarily conditional — the +`eqin` state relation of the main-loop equivalence assumes that the `m_precalc` +buffer is already populated with the cdiff content when the main loop is entered. +The unconditional top-level correctness of the optimized routine is then +recovered by sequentially composing `BIGNUM_EMONTREDC_8N_CDIFF_PRECALCLOOP` +(which establishes this assumption) with the main-loop equivalence (which +consumes it). + +### 4.2 Basic-block decomposition + +Each outer-loop body is split into three basic blocks, named uniformly via +`bignum_emontredc_8n_labels`, `outerloop_step{1,2,3}_labels`, and +`bignum_emontredc_8n_cdiff_labels`: + +``` +[ outerloop-prologue ] → [ maddloop inner loop ] → [ inner_loop_postamble ] +``` + +The outerloop prologue is the setup that runs once per outer iteration (loading +the four-word chunk of `z`, computing the Montgomery words, kicking off the inner +loop); the `maddloop` is the inner word-multiply-accumulate loop proper; the +`inner_loop_postamble` is the epilogue that absorbs the carry into the tail of +`z` and decrements the outer-loop counter. + +### 4.3 Five programs, one per stage + +All intermediate programs encode only the body of the outer loop plus its inner +maddloop plus its inner-loop postamble — i.e. one iteration of the outer loop. + +| name in the file | role | +|---|---| +| `bignum_emontredc_8n_mc` (step 0 / "original") | unoptimized scalar reference, from `unopt/bignum_emontredc_8n_base.o` | +| `outerloop_step1_mc` (step 1) | NEON-vectorized outer-loop body; nothing else changed | +| `outerloop_step2_mc` (step 2) | step 1 with inner-loop instructions reordered so that instructions that are to be pipelined sits at the top of the basic block | +| `outerloop_step3_mc` (step 3) | step 2 with the inner loop rotated: a `maddloop_prologue` precedes the body and an epilogue (`inner_loop_postamble_noepilogue`) trails it | +| `bignum_emontredc_8n_cdiff_mc` (step 4 / final) | individual basic blocks SLOTHY-reordered, from `arm/fastmul/bignum_emontredc_8n_cdiff.o` | + +An auxiliary `maddloop_step2_x30_mc` is used to bridge step 1 and step 2 — it +is step 2's inner loop with the `x30` base register re-based, so that the +address-register abbreviations computed on step 1 can be textually matched against +step 2. + + +Control-flow-graph-wise, each of the five versions of the outer loop has the +following shape. Rectangular nodes are basic blocks; arrows are control transfers; +the back-edge on `maddloop` is the inner loop's own back-edge, and the back-edge +on `inner_loop_postamble` is the outer loop's. + +`` and `` label the two halves of step 1's maddloop body: `` is the +prefix of instructions that step 3 peels out into `maddloop_prologue` (and, +starting from step 2, gets reordered to the top of the basic block), and +`` is the remaining suffix that step 3 peels out into +`inner_loop_postamble_noepilogue`. In the real code, `` corresponds roughly +to the first ~44 instructions of the 149-instruction body (the NEON +`ldr q …`/`xtn`/`umull`/… setup for iteration `i+1`'s next `m` word), and +`` corresponds to the rest (the multiply-accumulate chain that consumes the +values loaded in ``). + +``` +original (step 0) step 1 (NEON) step 2 step 3 (rotate) step 4 (SLOTHY final) + (reorder maddloop) +================ ================ ================ ================== ================ + ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ + │ outerloop │ │ outerloop │ │ outerloop │ │ outerloop │ │ outerloop │ + │ prologue │ │ prologue │ │ prologue │ │ prologue │ │ prologue │ + │ (scalar) │ │ (NEON) │ │ (NEON, │ │ (NEON, + one │ │ (SLOTHY- │ + │ │ │ │ │ identical │ │ extra x27-=1 │ │ scheduled) │ + │ │ │ │ │ to step 1) │ │ for rotation)│ │ │ + └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ + │ │ │ │ │ + │ │ │ v v + │ │ │ ┌──────────────┐ ┌──────────────┐ + │ │ │ │ maddloop_ │ │ maddloop_ │ + │ │ │ │ prologue │ │ prologue │ + │ │ │ │ │ │ (SLOTHY- │ + │ │ │ │ │ │ scheduled) │ + │ │ │ └──────┬───────┘ └──────┬───────┘ + v v v v v + ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ + │ maddloop │ │ maddloop_ │ │ maddloop_ │ │ maddloop_ │ │ maddloop_ │ + │ (scalar, │<┐ │ neon (NEON, │<┐ │ neon │<┐ │ neon │<┐ │ neon │<┐ + │ k/4−1 iters) │ │ │ k/4−1 iters) │ │ │ │ │ │ │ │ │ (SLOTHY- │ │ + │ │ │ │ │ │ │--------------│ │ |--------------│ │ │ scheduled, │ │ + │ │ │ │ │ │ │ │ │ │ │ │ │ k/4−2 iters) │ │ + │ │─┘ │ │─┘ │ │─┘ │ (k/4−2 iters)│─┘ │ │─┘ + └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ + │ │ │ │ │ + │ │ │ v v + │ │ │ ┌──────────────┐ ┌──────────────┐ + │ │ │ │ inner_loop_ │ │ inner_loop_ │ + │ │ │ │ postamble_ │ │ postamble_ │ + │ │ │ │ noepilogue │ │ noepilogue │ + │ │ │ │ │ │ (SLOTHY- │ + │ │ │ │ │ │ scheduled) │ + │ │ │ └──────┬───────┘ └──────┬───────┘ + v v v v v + ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ + │ inner_loop_ │ │ inner_loop_ │ │ inner_loop_ │ │ inner_loop_ │ │ inner_loop_ │ + │ postamble │ │ postamble │ │ postamble │ │ postamble │ │ postamble │ + │ (scalar) │ │ (NEON) │ │ (NEON) │ │ │ │ (SLOTHY- │ + │ │ │ │ │ │ │ │ │ scheduled) │ + │ │ │ │ │ │ │ │ │ │ + └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘ + │ │ │ │ │ + v v v v v + (outer) (outer) (outer) (outer) (outer) +``` + +For each pair of adjacent steps, **three per-block equivalences** are proved and +then **vertically composed** (by `prove_equiv_seq_composition`, i.e. Lemma 3 of the +paper) into one `OUTERLOOP_FULL_STEP{i}_STEP{i+1}_EQUIV`. Then the four +`OUTERLOOP_FULL_*` lemmas are **horizontally composed** (by `EQUIV_TRANS_GEN_TAC`, +Lemma 4) into `OUTERLOOP_FULL_EQUIV`. Finally the outer-loop back-edge is wrapped +by `ENSURES2_WHILE_PAUP_TAC` to yield `MAINLOOP_EQUIV`. + +### 4.4 Per-stage vertical equivalences + +#### (a) Original ↔ Step 1 — NEON vectorization + +| lemma | code compared (outerloop block ↔ step-1 block) | +|---|---| +| `OUTERLOOP_ORIGINAL_STEP1_EQUIV` | original `outerloop..maddloop` ↔ step1 prologue | +| `MADDLOOP_ORIGINAL_STEP1_EQUIV` | original `maddloop..madddone` ↔ step1 `maddloop_neon..inner_loop_postamble` | +| `INNER_LOOP_POSTAMBLE_ORIGINAL_STEP1_EQUIV` | original `madddone..outerloop_end` ↔ step1 `inner_loop_postamble..end` | +| `OUTERLOOP_FULL_ORIGINAL_STEP1_EQUIV` | sequential composition of the three above | + +This stage is NEON vs. scalar, so the symbolic simulation relies on the same +`extra_word_CONV` hook as in §2.3. The `MADDLOOP_ORIGINAL_STEP1_EQUIV` proof saves +and restores `_org_extra_word_CONV` and temporarily registers `ITE_WORD_NOT_XOR`. +Because the two inner loops share an *identical iteration count* (`k/4 − 1`), this +stage uses `ENSURES2_WHILE_PAUP_TAC` with the obvious 1–1 body correspondence. + +#### (b) Step 1 ↔ Step 2 — instruction reordering + +Step 2 reshuffles instructions *within* the `maddloop_neon` basic block +to hide latencies, without crossing any basic-block boundary. The outer- +loop prologue and the `inner_loop_postamble` are byte-for-byte identical +to step 1's, the loop still iterates `k/4 − 1` times, and the maddloop +body is a true permutation of step 1's (with a few destination-register +renames to break false dependencies). See the worked example in §4.3 +for a concrete illustration of what moves and what doesn't. + +Because the two `maddloop_neon` bodies are a true permutation of the same +instruction sequence, the inner-loop equivalence is split via the auxiliary +program `maddloop_step2_x30_mc`: + +| lemma | code compared | +|---|---| +| `OUTERLOOP_STEP1_STEP2_EQUIV` | outer-loop prologues (identical instructions) | +| `MADDLOOP_STEP1_STEP2_X30_EQUIV` | step1 maddloop ↔ `maddloop_step2_x30_mc` | +| `MADDLOOP_STEP2_X30_STEP2_EQUIV` | `maddloop_step2_x30_mc` ↔ step2 maddloop | +| `MADDLOOP_STEP1_STEP2_EQUIV` | horizontal composition of the two above | +| `INNER_LOOP_POSTAMBLE_STEP1_STEP2_EQUIV` | postambles (identical) | +| `OUTERLOOP_FULL_STEP1_STEP2_EQUIV` | sequential composition | + +`MADDLOOP_STEP1_STEP2_EQUIV` recomposes the two half-stages using `EQUIV_TRANS_TAC` +— the same horizontal-composition tactic we saw in §2.4, but *inside* a single +stage. This is why the `_X30` auxiliary has to be introduced: without it, step 1's +and step 2's address-register usage does not align, and the symbolic abbreviations +cannot be cross-referenced. + +The proofs of the "identical" blocks use `ARM_N_STEPS_AND_ABBREV_TAC` + +`ARM_N_STEPS_AND_REWRITE_TAC` with the identity permutation as `inst_map`, in the +style of §2.3's `EQUIV2`. + +#### (c) Step 2 ↔ Step 3 — inner-loop rotation + +The step 2's single `maddloop_neon` block is split in three +places to produce, in step 3, a prologue (`maddloop_prologue`) before +the loop, a loop body (iterating `k/4 − 2` times instead of +`k/4 − 1`), and an epilogue (`inner_loop_postamble_noepilogue`) after the +loop. + +| lemma | code compared | +|---|---| +| `OUTERLOOP_STEP2_STEP3_EQUIV` | step2 prologue ↔ step3 prologue (identical instructions) | +| `PROLOG_STEP2_STEP3_EQUIV` | step2 `maddloop_neon..maddloop_rotated` ↔ step3 `maddloop_prologue..maddloop_neon`: "first-iteration work pulled out of the loop" | +| `MADDLOOP_STEP2_STEP3_EQUIV` | step2 `maddloop_rotated..maddloop_rotated` (k/4 − 2 iterations) ↔ step3 `maddloop_neon..inner_loop_postamble` | +| `EPILOG_STEP2_STEP3_EQUIV` | step2 final rotated-body copy ↔ step3 `inner_loop_postamble..inner_loop_postamble_noepilogue` | +| `INNER_LOOP_POSTAMBLE_STEP2_STEP3_EQUIV` | step2 postamble ↔ step3 postamble (identical) | +| `OUTERLOOP_FULL_STEP2_STEP3_EQUIV` | sequential composition: outerloop → prolog → rotated body → epilog → postamble | + +Because the *number of loop iterations differs* between step 2 and step 3 +(`k/4 − 1` on step 2, `k/4 − 2` on step 3), `ENSURES2_WHILE_PAUP_TAC` cannot use a +1–1 body correspondence; the "missing" first iteration becomes the prolog and +the "missing" last iteration becomes the epilog, both hoisted out into their own +`ensures2` lemmas and composed around the middle. + +#### (d) Step 3 ↔ Step 4 — SLOTHY rescheduling + +Same shape as §2.3's `EQUIV2`, but applied independently to each of the three +blocks: + +| lemma | code compared | +|---|---| +| `OUTERLOOP_STEP3_STEP4_EQUIV` | step3 prologue ↔ final `outerloop..maddloop_neon` | +| `MADDLOOP_STEP3_STEP4_EQUIV` | step3 rotated maddloop ↔ final `maddloop_neon..inner_loop_postamble` | +| `INNER_LOOP_POSTAMBLE_STEP3_STEP4_EQUIV` | step3 postamble ↔ final `inner_loop_postamble..outerloop_end` | +| `OUTERLOOP_FULL_STEP3_STEP4_EQUIV` | sequential composition | + +Each uses a hand-tuned `inst_map` emitted by SLOTHY, driving +`ARM_N_STEPS_AND_ABBREV_TAC` / `ARM_N_STEPS_AND_REWRITE_TAC`. + +### 4.5 Preventing abbreviation of memory addresses and loop counter + +Abbreviating a register that holds a +memory-access base address (e.g. `word_add z (word (8 * 4 * i))`) or a loop +counter (e.g. `word (k4 - (i + 1))`) would make subsequent symbolic simulation of +`ldr`/`str` unable to identify from which memory invariants values should be loaded. +To avoid the destructive abbreviations, the step-abbreviate/rewrite tactic pair +(`ARM_N_STEPS_AND_ABBREV_TAC` / `ARM_N_STEPS_AND_REWRITE_TAC`) takes two extra arguments, +`regs_no_abbrev_left` and `regs_no_abbrev_right`, that list per instruction +slot the registers whose symbolic expression must not be replaced by a fresh +variable. + +Both lists are typically computed by `backward_taint_analysis` over each +program: starting from "sinks" (the base register of every load/store, plus +`X27`/`X0`/`X1`/`X2`/`SP`/`X30` at the block exit, which appear in the +postcondition) and walking the instruction stream backwards, it marks every +register that data-flow-feeds a sink. The two lists can differ because the +two programs have different schedules. The same mechanism is reused in +`MADDLOOP_STEP2_STEP3_EQUIV` and `MADDLOOP_STEP1_STEP2_X30_EQUIV`. + +For `EQUIV_STEPS_TAC` (used in §2.3 and §3.3), the analogous escape hatch is +the `"replace"` action: a range marked `"replace"` skips the default +abbreviation and instead asks the tactic to close the stutter by rewriting +the two sides' symbolic outputs against each other, so any pointer or counter +expressions survive the range unaltered. + +Manually writing the `"replace"` ranges by hand needs efforts however, and +the helper `break_equal_loads` (in `arm/proofs/equiv.ml`) post-processes an +`actions` list and converts every +contiguous run of `arm_LD*` instructions inside an `"equal"` range into a +`"replace"` range. + +### 4.6 Horizontal composition across stages + +The four `OUTERLOOP_FULL_STEP{i}_STEP{i+1}_EQUIV` lemmas have *different* input +state relations (each stage's `_eqin` is tailored to that stage) and slightly +different frame conditions. Before chaining them transitively, the proof performs +three kinds of bookkeeping: + +- `add_Q29_to_precond` injects an additional `mk_equiv_regs [Q29]` into the + precondition of the STEP1–STEP2, STEP2–STEP3 and STEP3–STEP4 equivs. This is + needed because the original–STEP1 proof establishes Q29 as a specific constant + (`word_join 0xffffffff 0xffffffff`) on the right-hand side, which subsequent + stages expect. +- `extend_first_equiv_for_seq_composition` enriches each stage's input relation + with register equalities that the *next* stage expects but this stage doesn't + mention — this is where the `_EXT` suffix in names like + `OUTERLOOP_FULL_STEP1_STEP2_EQUIV_EXT` comes from. +- Canonicalized `maychange_eq_thm`s rewrite each stage's MAYCHANGE set into a + common form, without which Lemma 4 cannot line the stages up. + +The three transitive composition steps are then: + +``` +OUTERLOOP_FULL_ORIGINAL_STEP2_EQUIV -- original ↔ step2 +OUTERLOOP_FULL_ORIGINAL_STEP3_EQUIV -- original ↔ step3 +OUTERLOOP_FULL_EQUIV -- original ↔ step4 +``` + +each proved by a single `EQUIV_TRANS_GEN_TAC` invocation, with an `eqout_TRANS` +helper that propagates Q29 through the chain. + +### 4.7 Adding the outer-loop backedge + +`OUTERLOOP_FULL_EQUIV` concerns one iteration of the outer loop. To lift it to +the full main loop, `OUTERLOOP_FULL_EQUIV_EXT` specializes `z ↦ z + 32·i` and adds +frame equalities for `X28` and the counter memory; `MAINLOOP_EQUIV` then applies +`ENSURES2_WHILE_PAUP_TAC` to wrap the per-iteration equivalence into an equivalence +of the actual main loop between the labels `main..main_end` on both programs. + +### 4.8 From equivalence to functional correctness + +The pattern from §2.5 / §3.4 is reused, with one twist forced by the fact that +`bignum_emontredc_8n`'s main loop iterates a *symbolic* number of times (so +`PROVE_EVENTUALLY_IMPLIES_EVENTUALLY_N_TAC` from §2.5 is not applicable): + +1. The functional correctness `BIGNUM_EMONTREDC_8N_MAINLOOP_ENSURES_N` is proved directly for the scalar + main loop — the script does not re-derive it via `prove_ensures_n` from a + pre-existing `ensures`-style theorem, because the main loop has a + data-dependent iteration count. Instead, the `ensures_n` proof is written from + scratch using symbolic-simulation tactics that track the step count through the + loop. (This is what the header comment at the top of the file refers to as + "step 1 of the proof structure".) +2. `BIGNUM_EMONTREDC_8N_CDIFF_MAINLOOP_CORRECT` combines the two via + `VCGEN_EQUIV_TAC MAINLOOP_EQUIV BIGNUM_EMONTREDC_8N_MAINLOOP_ENSURES_N_NSTEP_REWRITTEN`, + in the point-op style of §3.4. +3. `BIGNUM_EMONTREDC_8N_CDIFF_CORE_CORRECT` prepends the precomputation-loop + correctness `BIGNUM_EMONTREDC_8N_CDIFF_PRECALCLOOP` (see §4.1) to obtain the + full routine's spec: the conditional `eqin` precondition "`m_precalc` already + holds `get_m_precalc m (k DIV 4 − 1)`" is discharged by the precomputation + loop's output, yielding an unconditional correctness theorem. Finally + `BIGNUM_EMONTREDC_8N_CDIFF_CORRECT` / `_SUBROUTINE_CORRECT` wrap the prologue/ + epilogue and ABI frame to yield the top-level subroutine correctness theorem.