
Implement Small-Value Sum-Check Optimization (Algorithm 6)#98

Open
wu-s-john wants to merge 108 commits into microsoft:main from wu-s-john:feat/procedure-9-accumulator

Conversation


@wu-s-john wu-s-john commented Dec 18, 2025

Implement Small-Value Sum-Check Optimization (Algorithm 6)

Summary

This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers (fitting in i32/i64), enabling significant prover speedups by replacing expensive field multiplications with cheaper native integer operations.

Key Insight

In the sum-check protocol, round 1 computations involve only small values (the original witness evaluations). From round 2 onward, evaluations become "large" due to binding to random verifier challenges. Algorithm 6 delays this binding using Lagrange interpolation, computing accumulators over small values in the first ℓ₀ rounds before switching to the standard linear-time prover.

Multiplication Cost Hierarchy:

  • ss (small × small): Native i32/i64 multiplication (~1 cycle)
  • sl (small × large): Barrett-optimized multiplication (~9 base mults)
  • ll (large × large): Full Montgomery multiplication (~32 base mults)

For Spartan with degree-2 polynomials, Algorithm 6 reduces ll multiplications from O(N) to O(N/2^ℓ₀) at the cost of O((3/2)^ℓ₀ · N) ss multiplications.
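To make the tradeoff concrete, here is a back-of-the-envelope cost model (illustrative only, not code from this PR) using the base-multiplication weights above. It only models the first-rounds work, not total prove time, but it shows why a small ℓ₀ such as 3 is attractive:

```rust
/// Rough per-instance cost (in base-multiplication equivalents) of the first
/// l0 rounds under Algorithm 6, using the weights quoted above: ss ≈ 1 and
/// ll ≈ 32 base mults. l0 = 0 models the baseline of pure large×large work.
fn relative_cost(n: u64, l0: u32) -> f64 {
    let (ss, ll) = (1.0_f64, 32.0_f64);
    let n = n as f64;
    // O((3/2)^l0 · N) small×small mults plus O(N / 2^l0) large×large mults.
    1.5_f64.powi(l0 as i32) * n * ss + n / 2.0_f64.powi(l0 as i32) * ll
}
```

In this model, ℓ₀ = 3 beats ℓ₀ = 0 by more than 4× on the first-rounds work, while very large ℓ₀ loses again because the (3/2)^ℓ₀ small-multiplication term eventually dominates.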

Benchmarks

Measured on M1 Max MacBook Pro (10 cores, 64GB RAM) with jemalloc.
Note: halo2curves/asm is not enabled (unavailable on Apple Silicon).

Headline Result: 1.83× Speedup on BN254

At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.83× speedup on the BN254 scalar field (30 trials):

sumcheck_bench_bn254-fr_n26_l03
| Percentile | Base prove (ms) | Small-value prove (ms) | Speedup |
|---|---|---|---|
| 25% | 2,588 | 1,418 | 1.79× |
| 50% (median) | 2,616 | 1,429 | 1.82× |
| 75% | 2,662 | 1,473 | 1.84× |
| 90% | 2,895 | 1,577 | 1.88× |
  • Mean speedup: 1.83× (base: 2,671ms → small-value: 1,464ms)
  • Lower variance in optimized version (std: 73ms vs 151ms)
  • Consistent speedup across all percentiles

Scaling Across Problem Sizes

cargo run --release --example sumcheck_sweep -- --field bn254-fr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | small-value prove (µs) | speedup |
|---|---|---|---|---|
| 16 | 65,536 | 9,459 | 4,619 | 2.05× |
| 17 | 131,072 | 8,795 | 4,516 | 1.95× |
| 18 | 262,144 | 13,582 | 7,567 | 1.80× |
| 19 | 524,288 | 24,293 | 14,199 | 1.71× |
| 20 | 1,048,576 | 49,350 | 25,117 | 1.97× |
| 21 | 2,097,152 | 96,088 | 62,382 | 1.54× |
| 22 | 4,194,304 | 176,852 | 94,437 | 1.87× |
| 23 | 8,388,608 | 349,647 | 190,513 | 1.84× |
| 24 | 16,777,216 | 679,342 | 365,720 | 1.86× |
| 25 | 33,554,432 | 1,409,206 | 729,882 | 1.93× |
| 26 | 67,108,864 | 2,799,654 | 1,493,126 | 1.88× |
| 27 | 134,217,728 | 5,671,207 | 3,066,167 | 1.85× |

Key observations:

  • Consistent 1.8-2.0× speedup across all problem sizes on BN254
  • Speedup remains stable even at n = 2²⁷ (134M constraints)
  • Peak speedup of 2.05× at n = 2¹⁶

Delayed Modular Reduction Impact

To isolate the impact of delayed modular reduction (DMR), we compare performance with and without DMR enabled.

Accumulator Building Phase

The accumulator building phase (Procedure 9) benefits most dramatically from DMR, as it performs many small×small multiplications that would otherwise require modular reduction after each operation.

cargo run --release --example accum_bench -- --field bn254-fr --l0 3 range-sweep --min 16 --max 27
| num_vars | n | DMR accum (µs) | no-DMR accum (µs) | accum speedup |
|---|---|---|---|---|
| 16 | 65,536 | 1,058 | 3,983 | 3.76× |
| 17 | 131,072 | 1,595 | 8,668 | 5.43× |
| 18 | 262,144 | 2,434 | 18,357 | 7.54× |
| 19 | 524,288 | 3,831 | 15,732 | 4.11× |
| 20 | 1,048,576 | 7,708 | 29,625 | 3.84× |
| 21 | 2,097,152 | 13,739 | 55,788 | 4.06× |
| 22 | 4,194,304 | 25,401 | 120,680 | 4.75× |
| 23 | 8,388,608 | 55,910 | 223,901 | 4.00× |
| 24 | 16,777,216 | 116,574 | 458,829 | 3.94× |
| 25 | 33,554,432 | 242,699 | 935,360 | 3.85× |
| 26 | 67,108,864 | 516,107 | 1,891,151 | 3.66× |
| 27 | 134,217,728 | 1,024,774 | 3,794,770 | 3.70× |

Key observations:

  • DMR provides 3.7-7.5× speedup on accumulator building with l0=3
  • Peak speedup of 7.54× at n = 2¹⁸
  • This is the primary source of performance gains in the small-value optimization

Time Breakdown: First l0 Rounds vs Remaining Rounds

With l0=3, the work is balanced between the accumulator-based first rounds and the standard sumcheck remaining rounds:

| num_vars | n | first l0 (ms) | remaining (ms) | ratio (first:remaining) |
|---|---|---|---|---|
| 16 | 65,536 | 1.2 | 3.4 | 0.34 : 1 |
| 17 | 131,072 | 1.7 | 2.8 | 0.60 : 1 |
| 18 | 262,144 | 2.5 | 5.0 | 0.50 : 1 |
| 19 | 524,288 | 3.9 | 10.3 | 0.38 : 1 |
| 20 | 1,048,576 | 7.8 | 17.3 | 0.45 : 1 |
| 21 | 2,097,152 | 13.8 | 48.6 | 0.28 : 1 |
| 22 | 4,194,304 | 25.5 | 68.9 | 0.37 : 1 |
| 23 | 8,388,608 | 56.0 | 134.5 | 0.42 : 1 |
| 24 | 16,777,216 | 116.6 | 249.1 | 0.47 : 1 |
| 25 | 33,554,432 | 242.8 | 487.1 | 0.50 : 1 |
| 26 | 67,108,864 | 516.2 | 977.0 | 0.53 : 1 |
| 27 | 134,217,728 | 1,024.8 | 2,041.3 | 0.50 : 1 |

Key observations:

  • First l0 rounds (accumulators + l0 round proofs) take ~1/3 to 1/2 of total prove time
  • Remaining rounds (l0+1 to n) dominate, taking ~2× longer than first l0
  • Ratio stabilizes around 1:2 for large instances (n ≥ 2²⁴)
  • This balanced split indicates l0=3 is a good choice for BN254

Split-Eq Sumcheck with DMR

For the split-eq sumcheck (which uses pre-split eq-polynomial tables), DMR provides additional speedup by delaying modular reductions in the remaining rounds.

cargo run --release --example sumcheck_sweep -- --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq-DMR prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 9,546 | 5,113 | 1.87× |
| 17 | 131,072 | 8,278 | 5,698 | 1.45× |
| 18 | 262,144 | 13,570 | 9,973 | 1.36× |
| 19 | 524,288 | 26,253 | 18,446 | 1.42× |
| 20 | 1,048,576 | 44,574 | 41,793 | 1.07× |
| 21 | 2,097,152 | 95,211 | 86,503 | 1.10× |
| 22 | 4,194,304 | 172,328 | 136,639 | 1.26× |
| 23 | 8,388,608 | 327,351 | 271,947 | 1.20× |
| 24 | 16,777,216 | 630,354 | 525,370 | 1.20× |
| 25 | 33,554,432 | 1,255,856 | 1,058,761 | 1.19× |
| 26 | 67,108,864 | 2,506,240 | 2,171,037 | 1.15× |
| 27 | 134,217,728 | 4,904,540 | 5,304,525 | 0.93× |

Key observations:

  • Split-eq with DMR provides 1.15-1.87× speedup for most instance sizes
  • At n = 2²⁷, there is a slight slowdown (0.93×), likely due to increased memory pressure from DMR state
  • The sweet spot is around n = 2¹⁶ where the speedup peaks at 1.87×
  • For very large instances (n ≥ 2²⁵), the speedup stabilizes around 1.15-1.19×

SHA-256 Chain Benchmark

To demonstrate real-world applicability, we benchmark proving SHA-256 hash chains. This workload approximates a major component of Solana light client verification.

cargo run --release --no-default-features --example sha256_chain_benchmark
| chain_length | num_vars | log₂(constraints) | num_constraints | witness_ms | orig_sumcheck_ms | small_sumcheck_ms | total_ms | speedup | witness_pct |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 16 | 16 | 65,536 | 14 | 5 | 3 | 20 | 1.67× | 70.0% |
| 8 | 18 | 18 | 262,144 | 55 | 16 | 11 | 75 | 1.45× | 73.3% |
| 32 | 20 | 20 | 1,048,576 | 229 | 48 | 32 | 301 | 1.50× | 76.1% |
| 128 | 22 | 22 | 4,194,304 | 1,260 | 163 | 109 | 1,547 | 1.50× | 81.4% |
| 512 | 24 | 24 | 16,777,216 | 5,686 | 609 | 395 | 6,743 | 1.54× | 84.3% |
| 2048 | 26 | 26 | 67,108,864 | 17,015 | 2,857 | 1,677 | 22,116 | 1.70× | 76.9% |

Key observations:

  • 2048 SHA-256 hashes proven in ~22 seconds
  • Witness generation dominates at 70-84% of total proving time
  • Small-value sumcheck achieves consistent 1.45-1.70× speedup

Solana Light Client Comparison

A Solana light client verifying block finality requires:

| Component | Hash Function | Count |
|---|---|---|
| Vote signature verification | SHA-512 (Ed25519 internal) | ~21 to ~1,588 |
| Merkle shred verification | SHA-256 | ~108 to ~1,206 |
  • Ed25519 uses SHA-512 internally for challenge hashing
  • Finality requires ≥2/3 supermajority stake (~21-530 validators)
  • SHA-512 is ~1.5-2× more expensive than SHA-256 per hash

SHA-256 equivalent cost:

  • Solana SHA-256: ~1,206 hashes
  • Solana SHA-512: ~1,588 × 1.5-2 = ~2,382-3,176 SHA-256 equivalent
  • Total: ~3,588-4,382 SHA-256 equivalent
  • Our 2048-chain benchmark covers ~47-57% of Solana's worst-case proving requirement

Implementation

Core Components

  1. SmallValueField trait (src/small_field.rs)

    • Defines SmallValue (i32) and IntermediateSmallValue (i64) types
    • Barrett-optimized sl_mul and isl_mul for BN254/BLS12-381 (~3× faster than ll)
    • Overflow analysis ensuring correctness for typical witness bounds
  2. Lagrange Domain Extension (src/lagrange.rs)

    • LagrangeEvaluatedMultilinearPolynomial<T, D> for extending boolean evaluations to U_d = {∞, 0, 1, ..., d-1}
    • Zero-allocation extend_in_place with ping-pong buffers
    • gather_prefix_evals for efficient prefix collection (Procedure 6)
  3. Accumulator Data Structures (src/accumulators.rs, src/accumulator_index.rs)

    • SmallValueAccumulators<S, D> storing A_i(v, u) with O(1) indexing via UdTuple
    • idx4 mapping (Definition A.5) for distributing products to correct accumulators
    • Type-safe UdEvaluations and UdHatEvaluations wrappers
  4. Procedure 9 Implementation (src/accumulators.rs)

    • build_accumulators_spartan: Optimized for Spartan's Az·Bz structure
    • build_accumulators: Generic version for arbitrary polynomial products
    • Parallel fold-reduce with thread-local scratch buffers
  5. Thread-Local Buffer Reuse (src/thread_state_accumulators.rs)

    • SpartanThreadState and GenericThreadState eliminate O(num_x_out) allocations
    • Reduces allocator contention in parallel workloads
  6. Sum-Check Integration (src/sumcheck.rs)

    • SmallValueSumCheck::from_accumulators factory method
    • Round-by-round Lagrange coefficient multiplication (R_{i+1} = R_i ⊗ L_{U_d}(r_i))
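
As a toy illustration of the Lagrange domain extension (Procedure 6) — a sketch over plain i64 values, not the crate's generic types — here is the extension of the first (top/MSB) variable of a multilinear polynomial from {0,1} to the extra point ∞, whose value is the leading coefficient:

```rust
/// Extend evaluations of a multilinear polynomial over its first (MSB)
/// variable from {0,1} to U_2's extra point ∞. Since p is linear in X₁,
/// p(∞, x) is the leading coefficient p(1, x) - p(0, x).
fn extend_first_var(evals: &[i64]) -> (Vec<i64>, Vec<i64>, Vec<i64>) {
    assert!(evals.len().is_power_of_two());
    let half = evals.len() / 2;
    let p0 = evals[..half].to_vec(); // p(0, x): MSB = 0
    let p1 = evals[half..].to_vec(); // p(1, x): MSB = 1
    let p_inf: Vec<i64> = p0.iter().zip(&p1).map(|(a, b)| b - a).collect();
    (p_inf, p0, p1)
}
```

Because the witness values are small, all three tables stay small integers; repeating this per variable over the first ℓ₀ variables yields the extension to U_d^{ℓ₀}.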

Algorithm Flow

┌─────────────────────────────────────────────────────────────────────────┐
│  Precomputation: Build accumulators A_i(v, u) for i ∈ [ℓ₀]              │
│                                                                         │
│  For each x_out ∈ {0,1}^{ℓ/2-ℓ₀}:                                       │
│    For each x_in ∈ {0,1}^{ℓ/2}:                                         │
│      ein = eq(w_R, x_in) · eq(w_L, x_out)                              │
│      Extend Az/Bz prefixes to U_d^{ℓ₀} via Lagrange                    │
│      Accumulate products weighted by ein into A_i(v, u)                │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Rounds 1..ℓ₀: Compute s_i(X) = ⟨R_i, A_i(·, u)⟩ for u ∈ Û_d           │
│                R_{i+1} = R_i ⊗ (L_{U_d,k}(r_i))_{k∈U_d}                 │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Round ℓ₀+1: Streaming round (Algorithm 2) to bind to r_{1:ℓ₀}         │
│  Rounds ℓ₀+2..ℓ: Standard linear-time sum-check (Algorithm 1)          │
└─────────────────────────────────────────────────────────────────────────┘
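
The round update in the middle box can be sketched as follows (with f64 stand-ins for field elements and hypothetical free functions, not the PR's actual types). Over U_2 = {∞, 0, 1}, a degree-≤2 polynomial satisfies p(r) = p(∞)·(r² − r) + p(0)·(1 − r) + p(1)·r, which gives the per-round coefficients:

```rust
/// Lagrange-style weights over U_2 = {∞, 0, 1} for degree-≤2 polynomials:
/// p(r) = p(∞)·(r² - r) + p(0)·(1 - r) + p(1)·r.
fn lagrange_u2(r: f64) -> [f64; 3] {
    [r * r - r, 1.0 - r, r]
}

/// One round of the tensor update R_{i+1} = R_i ⊗ L_{U_2}(r_i).
fn tensor_update(r_prev: &[f64], r: f64) -> Vec<f64> {
    let l = lagrange_u2(r);
    r_prev
        .iter()
        .flat_map(|&c| l.iter().map(move |&w| c * w))
        .collect()
}
```

Starting from R_1 = [1], after i rounds R has 3^i entries, matching the accumulator index space over U_2 prefixes.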

Test Plan

  • cargo test test_build_accumulators - Verifies accumulator construction
  • cargo test test_small_value - SmallValueField arithmetic correctness
  • cargo test lagrange - Lagrange extension and interpolation
  • cargo test sumcheck - Full sum-check protocol equivalence
  • cargo clippy - No warnings
  • examples/sumcheck_sha256_equivalence.rs - Verifies new method produces identical proofs to baseline
  • examples/sha256_chain_benchmark.rs - SHA-256 chain proving with CSV output

References

Introduce UdPoint, UdHatPoint, UdTuple, and ValueOneExcluded types in src/lagrange.rs for representing the evaluation domains U_d and Û_d used in the small-value sumcheck optimization (Algorithm 6). Implements LagrangeEvaluatedMultilinearPolynomial with a from_multilinear() factory method that extends evaluations from {0,1}^n to U_d^n.

Introduces RoundAccumulator and SmallValueAccumulators for the
small-value sumcheck optimization. Uses flat Vec<[Scalar; D]> storage
with const generic D for cache efficiency and vectorizable merge
operations in parallel fold-reduce.
Parameterize UdPoint, UdHatPoint, UdTuple, and
LagrangeEvaluatedMultilinearPolynomial with const generic D to enable:

- Compile-time enforcement that domain types match accumulator degree
- Debug assertions for bounds checking (v < D in constructors)
- Elimination of runtime base parameter from to_flat_index()

This prevents mixing domain sizes at compile time and catches
out-of-bounds errors in debug builds.
Implement AccumulatorPrefixIndex and compute_idx4() which maps
evaluation prefixes β ∈ U_d^ℓ₀ to accumulator contributions by
decomposing β into prefix v, coordinate u ∈ Û_d, and binary suffix y.
gather_prefix_evals extracts strided polynomial evaluations for all binary prefixes b ∈ {0,1}^ℓ₀ given a fixed suffix, bridging full polynomials to Procedure 6 (Lagrange extension).
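
A minimal sketch of that strided gather (plain i64 slice, MSB-first indexing assumed, consistent with the MSB-first checks described below; the actual method is generic):

```rust
/// Collect p(b, suffix) for all binary prefixes b ∈ {0,1}^l0, with the
/// prefix occupying the top (MSB) bits of the evaluation index.
fn gather_prefix_evals(evals: &[i64], l0: u32, suffix: usize) -> Vec<i64> {
    assert!(evals.len().is_power_of_two());
    let n = evals.len().trailing_zeros();
    let stride = 1usize << (n - l0); // number of suffixes per prefix
    assert!(suffix < stride);
    (0..1usize << l0).map(|b| evals[b * stride + suffix]).collect()
}
```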
Added a parallel build_accumulators that binds suffixes, extends
prefixes to the Ud domain, applies the ∞/Cz rule, and routes
contributions via cached idx4 with E_in/E_out weighting. Expanded
accumulator tests with a naive cross-check, ∞ handling, and binary-β
zero behavior to validate correctness. Cleaned up dead-code allowances
now that the code paths are used.
Added explicit MSB-first checks for eq table generation, gather_prefix_evals stride/pattern, and bind_poly_var_top to ensure "top" binds the MSB. These tests catch silent index/order regressions across components.
@wu-s-john wu-s-john changed the title from "Implement Algorithm 6 Foundation — Procedure 9 Accumulator Builder" to "Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder" on Dec 18, 2025
Compute ℓ_i(X) = eqe(w[<i], r[<i]) · eqe(w_i, X) values for sum-check rounds: with α_i = eqe(w[<i], r[<i]), these are ℓ_i(0) = α_i(1−w_i), ℓ_i(1) = α_i·w_i, and ℓ_i(∞) = α_i(2w_i−1).
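
These three values are just the linear factor eq(w, X) = w·X + (1−w)(1−X) scaled by α, with the ∞ slot holding the leading coefficient. A quick sanity check (f64 stand-ins, hypothetical helper names):

```rust
/// eq(w, X) = w·X + (1-w)·(1-X), linear in X.
fn eq_linear(w: f64, x: f64) -> f64 {
    w * x + (1.0 - w) * (1.0 - x)
}

/// (ℓ(0), ℓ(1), ℓ(∞)) = (α·(1-w), α·w, α·(2w-1)): the values at 0 and 1
/// plus the leading coefficient, all scaled by α.
fn ell(alpha: f64, w: f64) -> (f64, f64, f64) {
    (alpha * (1.0 - w), alpha * w, alpha * (2.0 * w - 1.0))
}
```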
Replace range-indexed loops and a redundant closure with iterator forms
Add eq-round linear factor utilities and accumulator evaluation to
derive t_i and build s_i polynomials.
Track R_i and ℓ_i state to compare accumulator evals with EqSumCheckInstance rounds.

Switch Spartan t_i to D=2 aliases/tests, precompute idx4 prefix/suffix indexing data, and flatten accumulator caches to cut allocations.
Csr (Compressed Sparse Row) stores variable-length lists with 2
allocations instead of N+1, improving cache locality. Replaces ad-hoc
offsets/entries arrays in build_accumulators
- Add prove_cubic_with_three_inputs_small_value combining small-value
  optimization for first ℓ₀ rounds with eq-poly optimization for
  remaining
- Introduce SPARTAN_T_DEGREE constant to centralize polynomial degree
  parameter
- Add sumcheck_sweep.rs examples for performance comparison
The new from_boolean_evals_with_buffer_reusing method used in build_accumulators takes caller-provided scratch buffers and alternates between them during extension. This reduces allocations from O(num_x_in × num_x_out) per call to O(num_threads) buffers allocated once per thread.
Spartan and generic build_accumulators variants: the Spartan version (D=2) skips binary betas since satisfying witnesses have Az·Bz = Cz on {0,1}^n; the generic version supports arbitrary polynomial products.
Adds a new example that tests prove_cubic_with_three_inputs and
prove_cubic_with_three_inputs_small_value produce identical proofs when
used with a real SHA256 circuit (Algorithm 6 validation).

Changes:
- Add PartialEq, Eq derive to SumcheckProof for proof comparison
- Add extract_outer_sumcheck_inputs helper to SpartanSNARK
- Add examples/sumcheck_sha256_equivalence.rs
Implement the small × large multiplication optimization from "Speeding
Up Sum-Check Proving" using Barrett reduction for ~3× speedup over naive
field multiplication.

Key changes:
  - Add SmallValueField trait for type-safe i32/i64 small-value
    operations
  - Implement Barrett reduction for Pallas Fp and Fq (sl_mul, isl_mul)
  - Add SpartanAccumulatorInput trait to unify field and i32 witness
    handling
  - Make LagrangeEvaluatedMultilinearPolynomial generic over element
    type
  - Update sumcheck prover to accept separate i32 witness polynomials
  - Clean up MultilinearPolynomial<i32>: remove unused
    from_u32/from_u64/from_field
@wu-s-john wu-s-john force-pushed the feat/procedure-9-accumulator branch from 2828f04 to 67674c4 on December 23, 2025 19:33
Replace raw arrays and ad-hoc structs with proper abstractions for the U_d = {∞, 0, 1, ..., D-1} and Û_d = U_d \ {1} evaluation domains. Remove EqRoundValues in favor of UdEvaluations<F, 2>.
- Delete unused constructor/predicate methods from UdPoint and
  UdHatPoint
- Move test-only methods (alpha, prefix_len, suffix_len,
  extend_from_boolean) to cfg(test) impl blocks
- Add CachedPrefixIndex struct with From impl to accumulator_index.rs
- Remove unused QuadraticTAccumulatorPrefixIndex type alias
- Delete unused eq_factor_alpha method from sumcheck
Hoist scratch buffers to thread-local state in
build_accumulators_spartan and build_accumulators. Previously, 5 vectors
were allocated on every x_out iteration; now allocations happen once per
Rayon thread subdivision.

- Add extend_in_place to LagrangeEvaluatedMultilinearPolynomial (avoids
  .to_vec())
- Add SpartanThreadState and GenericThreadState structs for buffer reuse
- Extract thread state structs to thread_state_accumulators module

Reduces allocations from O(num_x_out × num_x_in) to O(num_threads).
Move the witness polynomial abstraction trait from accumulators.rs to
its own module for better code organization. Rename from
SpartanAccumulatorInput to SpartanAccumulatorInputPolynomial to clarify
that it abstracts over multilinear polynomial representations (field
elements vs small values).
@wu-s-john wu-s-john changed the title from "Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder" to "Implement Small-Value Sum-Check Optimization (Algorithm 6)" on Dec 23, 2025
@wu-s-john wu-s-john marked this pull request as ready for review December 23, 2025 23:56
  Extract WideMul trait (small × small → product) into wide_mul.rs,
  separating widening multiplication from delayed reduction. This
  enables cleaner generic bounds where only one capability is needed.

  Key changes:
  - Add WideMul trait with i32→i64 and i64→i128 implementations
  - Refactor DelayedReduction to be generic over value type
    (i32/i64/i128/F)
  - Rename multiply_accumulate → unreduced_multiply_accumulate for
    clarity
  - Remove LagrangeAccumulatorField convenience trait (use explicit
    bounds)
  - Make multiply_vec_small and fold functions generic over SmallValue
  - Delete impls.rs, consolidating implementations into trait modules
  - Update documentation: fix outdated terminology and add limb size
    context

  The trait hierarchy is now:
  - WideMul: small × small → product (pure integer widening)
  - SmallValueField: field ↔ small conversions
  - DelayedReduction<T>: field × T accumulation without reduction

  - Replace BenchmarkBox wrapper with direct enum dispatch
  - Move run logic to standalone run_benchmark<E, B> function
  - Simplify SumcheckBenchmark trait to just convert() and prove()
  - Remove build_benchmarks function in favor of direct dispatch
  - Add i32 support for BN254 (SupportsSmallI32 marker trait)
  - Fix clippy warning for unnecessary cast in small_field
- Add timed() helper to reduce timing boilerplate across 5 call sites
- Extract TRANSCRIPT_LABEL constant (was hardcoded 3 times)
- Rename run_chain_benchmark → run_sumcheck_benchmark for clarity
- Remove outdated comment referencing PallasHyraxEngine
- Reorganize imports into spartan2 block
- Extract FieldChoice into shared cli module
- Consolidate timing phases into (name, short_name) tuples with HashMap
  storage
- Simplify NeutronNova benchmark: remove rounds param and subcommands
- Clean up phase names: remove prep_ml, add nifs_prove, unify span names
- Make sha256.rs generic over engine type
 Delete examples/sumcheck_sha256_equivalence.rs and add equivalent test
 in src/small_sumcheck.rs. Tests that prove_cubic_small_value produces
 identical output to prove_cubic_with_three_inputs using
 SmallSha256Circuit
- ExtensionBound<SV, D> precomputes max_safe bound once for reuse
- Make prove_small_value and helpers generic over SV (i32/i64)
- Remove unused Debug bounds from SV and SV::Product
- Update docs to reference ExtensionBound
- Remove `build_accumulators` generic function (was marked dead_code)
- Remove `gather_prefix_evals` from MultilinearPolynomial
- Add standalone `extend_to_lagrange_domain` function (Procedure 6)
- Remove helper functions only used by deleted code:
  - `fill_eyx`, `scatter_beta_contributions`
- Clean up unused fields and imports across lagrange_accumulator module
- Remove tests that depended on deleted functions
- Change `LagrangeAccumulators::rounds` from `pub` to `pub(crate)`
- Change `LagrangeEvals::infinity` and `finite` from `pub` to
  `pub(crate)`
- Fix stale comment referencing non-existent `cz_ext` and `cz_pref`
  variables
- Use `data_mut()` accessor instead of direct field access for
  consistency
 Change eq_cache from [round][y * num_x + x] to [round][x * num_y + y]
 so each parallel task (fixed x_out) accesses a contiguous memory block.
 Also remove unnecessary clone of e_in by borrowing directly.
@wu-s-john wu-s-john force-pushed the feat/procedure-9-accumulator branch from 4d49748 to f614127 on February 6, 2026 15:09
@wu-s-john
Contributor Author

Split-Eq Sumcheck with Delayed Modular Reduction

For the split-eq sumcheck (which uses pre-split eq-polynomial tables), delayed modular reduction provides additional speedup by batching modular reductions in the remaining rounds. The speedup comes from inner-product-like multiplications with two streams of large field elements—performing a single reduction for the entire inner-product is faster than reducing after each multiplication term.
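
The effect can be demonstrated in miniature (toy 31-bit modulus, not the crate's wide-limb accumulators): accumulate full-width products and reduce once at the end, versus reducing after every term. Inputs are assumed already reduced into [0, P):

```rust
const P: i64 = 2_147_483_647; // toy prime modulus (2^31 - 1), not BN254

/// Delayed reduction: accumulate double-width products in i128, reduce once.
fn inner_product_dmr(a: &[i64], b: &[i64]) -> i64 {
    let acc: i128 = a.iter().zip(b).map(|(&x, &y)| x as i128 * y as i128).sum();
    acc.rem_euclid(P as i128) as i64
}

/// Eager reduction: reduce after every multiply-accumulate, as in the
/// baseline field arithmetic.
fn inner_product_eager(a: &[i64], b: &[i64]) -> i64 {
    a.iter().zip(b).fold(0i64, |acc, (&x, &y)| {
        let term = ((x as i128 * y as i128) % P as i128) as i64;
        (acc + term) % P
    })
}
```

The real implementation does the analogous thing with 576-bit (WideLimbs) accumulators so that the single Montgomery-style reduction is amortized over the whole inner product.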

Scaling Across Problem Sizes
sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 8,082 | 5,853 | 1.38× |
| 17 | 131,072 | 10,018 | 7,615 | 1.32× |
| 18 | 262,144 | 17,714 | 13,015 | 1.36× |
| 19 | 524,288 | 26,564 | 21,801 | 1.22× |
| 20 | 1,048,576 | 51,821 | 43,759 | 1.18× |
| 21 | 2,097,152 | 104,513 | 95,216 | 1.10× |
| 22 | 4,194,304 | 195,987 | 159,347 | 1.23× |
| 23 | 8,388,608 | 383,391 | 330,527 | 1.16× |
| 24 | 16,777,216 | 759,441 | 590,406 | 1.29× |
| 25 | 33,554,432 | 1,313,572 | 1,125,700 | 1.17× |
| 26 | 67,108,864 | 2,615,161 | 2,246,468 | 1.16× |
| 27 | 134,217,728 | 5,292,385 | 5,259,143 | 1.01× |
Statistical Analysis at n = 2²⁶ (30 trials)
sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --trials 30 --methods base,split-eq-dmr single 26
| Statistic | Base prove (ms) | Split-eq prove (ms) | Speedup |
|---|---|---|---|
| Min | 2,595 | 2,252 | 1.15× |
| 25% | 2,603 | 2,278 | 1.14× |
| 50% (median) | 2,627 | 2,284 | 1.15× |
| 75% | 2,649 | 2,299 | 1.15× |
| Max | 2,743 | 2,443 | 1.12× |
| Mean | 2,641 | 2,294 | 1.15× |

Key observations:

  • Split-eq with delayed modular reduction provides 1.10-1.38× speedup for most instance sizes
  • Consistent 1.15× mean speedup at n = 2²⁶ across 30 trials with low variance
  • At n = 2²⁷, speedup drops to 1.01×, likely due to increased memory pressure
  • The sweet spot is around n = 2¹⁶ where the speedup peaks at 1.38×
sumcheck_split_eq_dmr

@wu-s-john
Contributor Author

NeutronNova NIFS Benchmark

NeutronNova folds multiple R1CS instances using a NIFS (Non-Interactive Folding Scheme) sumcheck. This benchmark measures the end-to-end proving time with detailed phase breakdown.

8 instances × 128 SHA-256 hashes each (4,194,304 constraints per instance):

NIFS Folding Only

sudo nice -n -20 cargo run --release --example neutronnova_sha256_benchmark -- --mode nifs --instances 8 --chain-length 128 --field bn254-fr
| Phase | small (ms) | large (ms) | Speedup |
|---|---|---|---|
| shared_syn | 12 | 9 | 0.75× |
| precom_syn | 1,047 | 1,240 | 1.18× |
| commit_pre | 1,240 | 1,148 | 0.93× |
| mat_vec | 550 | 618 | 1.12× |
| nifs_fold_sc | 156 | 2,014 | 12.91× |
| fold_W | 38 | 112 | 2.95× |
| fold_U | 305 | 285 | 0.93× |
| nifs_prove | 1,416 | 3,211 | 2.27× |
| end_to_end | 5,429 | 7,157 | 1.32× |

Full ZK Prove

sudo nice -n -20 cargo run --release --example neutronnova_sha256_benchmark -- --mode zk-prove --instances 8 --chain-length 128 --field bn254-fr
| Phase | small (ms) | large (ms) | Speedup |
|---|---|---|---|
| prep | 3,899 | 4,484 | 1.15× |
| shared_syn | 12 | 10 | 0.83× |
| precom_syn | 1,470 | 1,472 | 1.00× |
| commit_pre | 1,210 | 1,101 | 0.91× |
| rerand | 838 | 901 | 1.08× |
| gen_inst | 91 | 74 | 0.81× |
| nifs_sc | 155 | 1,984 | 12.80× |
| fold_W | 37 | 115 | 3.11× |
| fold_U | 271 | 273 | 1.01× |
| nifs | 1,416 | 3,198 | 2.26× |
| outer_sc | 1,695 | 1,684 | 0.99× |
| inner_sc | 794 | 771 | 0.97× |
| pcs | 67 | 67 | 1.00× |
| zk_prove | 6,014 | 7,709 | 1.28× |
| end_to_end | 10,026 | 12,294 | 1.23× |

Key observations:

  • NIFS sumcheck achieves 12.8-12.9× speedup from the small-value optimization, the highest speedup of any phase
  • NIFS prove time: 1.4s (small) vs 3.2s (large) = 2.27× speedup
  • Full ZK prove: 6.0s (small) vs 7.7s (large) = 1.28× speedup
  • End-to-end with prep: 10.0s (small) vs 12.3s (large) = 1.23× overall speedup
  • The optimization has successfully shifted the bottleneck away from the sumcheck

Future optimizations with native small-value types:

If the R1CS matrices and witness are stored natively as small-value types (rather than field elements), additional speedups are possible:

  1. Matrix-vector multiply (mat_vec): Currently 550ms. With small-value matrix entries and witness, this becomes small×small multiplication—essentially free compared to field operations.

  2. Witness synthesis (precom_syn): Currently 1.0-1.5s. Small-value arithmetic during witness generation would significantly reduce this cost.

Remaining bottlenecks and GPU acceleration opportunities:

The phase breakdown reveals that the NIFS sumcheck (nifs_sc) is now only ~1.5% of total time. The bottleneck has shifted to preprocessing, commitment, and witness generation:

  1. Preprocessing (prep): 3.9s — the largest single cost, including circuit compilation and setup.

  2. Commitment (commit_pre): 1.2s. Hyrax PCS committing to binary witnesses via MSM. GPU-accelerated MSM implementations routinely achieve 10-50× speedup over CPU.

@wu-s-john
Contributor Author

End-to-End Spartan Proving Breakdown

Full Spartan proving pipeline with detailed phase breakdown, measured with sudo nice -n -20 for minimal scheduling noise on M1 Max.

Single SHA-256 Hash (varying message sizes)

sudo nice -n -20 cargo run --release --no-default-features --example sha256

msg=1024B, constraints=1,048,576:

| Phase | small (ms) | large (ms) | Speedup |
|---|---|---|---|
| synth_pre | 127 | 156 | 1.23× |
| commit_pre | 24 | 22 | 0.92× |
| r1cs_rest | 17 | 16 | 0.94× |
| commit_rest | 12 | 11 | 0.92× |
| mat_vec | 15 | 13 | 0.87× |
| outer_sc | 19 | 44 | 2.32× |
| eval_rx | 4 | 6 | 1.50× |
| eval_sparse | 41 | 41 | 1.00× |
| poly_ABC | 20 | 19 | 0.95× |
| poly_z | 3 | 2 | 0.67× |
| inner_sc | 39 | 47 | 1.21× |
| pcs | 33 | 27 | 0.82× |
| prove_total | 216 | 235 | 1.09× |

msg=2048B, constraints=2,097,152:

| Phase | small (ms) | large (ms) | Speedup |
|---|---|---|---|
| synth_pre | 240 | 326 | 1.36× |
| commit_pre | 46 | 43 | 0.93× |
| r1cs_rest | 36 | 34 | 0.94× |
| commit_rest | 26 | 23 | 0.88× |
| mat_vec | 33 | 23 | 0.70× |
| outer_sc | 36 | 84 | 2.33× |
| eval_rx | 11 | 12 | 1.09× |
| eval_sparse | 85 | 82 | 0.96× |
| poly_ABC | 41 | 38 | 0.93× |
| poly_z | 6 | 9 | 1.50× |
| inner_sc | 78 | 79 | 1.01× |
| pcs | 40 | 41 | 1.02× |
| prove_total | 411 | 437 | 1.06× |

@wu-s-john
Contributor Author

At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.91× speedup on the BN254 scalar field (30 trials):

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --trials 30 --methods base,i64 single 26
| Percentile | Base prove (ms) | Small-value prove (ms) | Speedup |
|---|---|---|---|
| 25% | 2,601 | 1,360 | 1.91× |
| 50% (median) | 2,610 | 1,367 | 1.91× |
| 75% | 2,640 | 1,385 | 1.91× |
| 90% | 2,683 | 1,468 | 1.83× |
  • Mean speedup: 1.91× (base: 2,662ms → small-value: 1,390ms)
  • Lower variance in optimized version
  • Consistent speedup across all percentiles

Scaling Across Problem Sizes

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --methods base,i64 range-sweep --min 16 --max 28
| num_vars | n | base prove (µs) | small-value prove (µs) | speedup |
|---|---|---|---|---|
| 16 | 65,536 | 9,636 | 4,321 | 2.23× |
| 17 | 131,072 | 10,008 | 4,618 | 2.17× |
| 18 | 262,144 | 14,634 | 7,468 | 1.96× |
| 19 | 524,288 | 33,984 | 12,477 | 2.72× |
| 20 | 1,048,576 | 46,076 | 22,357 | 2.06× |
| 21 | 2,097,152 | 92,270 | 41,879 | 2.20× |
| 22 | 4,194,304 | 203,444 | 84,042 | 2.42× |
| 23 | 8,388,608 | 343,588 | 172,579 | 1.99× |
| 24 | 16,777,216 | 697,024 | 341,831 | 2.04× |
| 25 | 33,554,432 | 1,335,911 | 687,493 | 1.94× |
| 26 | 67,108,864 | 2,626,311 | 1,404,205 | 1.87× |
| 27 | 134,217,728 | 5,211,726 | 2,842,161 | 1.83× |
| 28 | 268,435,456 | 10,554,689 | 5,962,386 | 1.77× |

Key observations:

  • Consistent 1.8-2.7× speedup across all problem sizes on BN254
  • Speedup remains stable even at n = 2²⁸ (268M constraints)
  • Peak speedup of 2.72× at n = 2¹⁹
sumcheck_speedup

- Replace bare .unwrap() with .expect() containing descriptive messages
  for ilog2 conversions and field inversions in spartan.rs,
  spartan_zk.rs, neutronnova_zk.rs, and zk.rs

- Convert field inversions in neutronnova_zk.rs to return Result with
  ProofVerifyError instead of panicking, protecting against zero
  challenges

- Add overflow bounds documentation to all DelayedReduction accumulator
  types in delayed_reduction.rs explaining bit capacities and max
  accumulation counts

- Promote debug_assert to assert for critical invariants in extension.rs
  that would silently produce incorrect results if violated in release

- Add # Panics documentation to enforce_sc_claim in zk.rs

- Add descriptive expect message to batch_invert_array in basis.rs
@wu-s-john
Contributor Author

Cross-Field Comparison: Pallas and Vesta

The optimization provides consistent speedups across different field implementations, demonstrating that the benefit is not specific to BN254's field arithmetic.

BN254 (Fr):

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 7,409 | 4,825 | 1.54× |
| 17 | 131,072 | 10,481 | 8,704 | 1.20× |
| 18 | 262,144 | 19,211 | 15,555 | 1.24× |
| 19 | 524,288 | 28,453 | 27,045 | 1.05× |
| 20 | 1,048,576 | 63,309 | 49,089 | 1.29× |
| 21 | 2,097,152 | 111,088 | 97,782 | 1.14× |
| 22 | 4,194,304 | 224,928 | 184,728 | 1.22× |
| 23 | 8,388,608 | 419,936 | 337,845 | 1.24× |
| 24 | 16,777,216 | 966,793 | 842,851 | 1.15× |
| 25 | 33,554,432 | 1,688,485 | 1,378,239 | 1.23× |
| 26 | 67,108,864 | 3,492,649 | 3,024,462 | 1.15× |
| 27 | 134,217,728 | 6,785,269 | 6,637,933 | 1.02× |

Pallas (Fq):

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field pallas-fq --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 7,379 | 5,386 | 1.37× |
| 17 | 131,072 | 10,325 | 7,975 | 1.29× |
| 18 | 262,144 | 15,592 | 13,999 | 1.11× |
| 19 | 524,288 | 31,104 | 25,899 | 1.20× |
| 20 | 1,048,576 | 61,710 | 45,470 | 1.36× |
| 21 | 2,097,152 | 101,028 | 101,547 | 1.00× |
| 22 | 4,194,304 | 219,411 | 182,989 | 1.20× |
| 23 | 8,388,608 | 392,407 | 326,981 | 1.20× |
| 24 | 16,777,216 | 872,222 | 718,890 | 1.21× |
| 25 | 33,554,432 | 1,793,937 | 1,421,947 | 1.26× |
| 26 | 67,108,864 | 3,617,611 | 2,912,601 | 1.24× |
| 27 | 134,217,728 | 6,801,901 | 5,351,520 | 1.27× |

Vesta (Fp):

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field vesta-fp --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 6,823 | 4,623 | 1.48× |
| 17 | 131,072 | 12,137 | 7,841 | 1.55× |
| 18 | 262,144 | 16,188 | 12,901 | 1.25× |
| 19 | 524,288 | 28,954 | 21,832 | 1.33× |
| 20 | 1,048,576 | 61,831 | 47,901 | 1.29× |
| 21 | 2,097,152 | 100,097 | 92,516 | 1.08× |
| 22 | 4,194,304 | 193,884 | 197,480 | 0.98× |
| 23 | 8,388,608 | 398,991 | 306,962 | 1.30× |
| 24 | 16,777,216 | 770,999 | 693,625 | 1.11× |
| 25 | 33,554,432 | 1,709,271 | 1,424,109 | 1.20× |
| 26 | 67,108,864 | 2,948,199 | 2,483,610 | 1.19× |
| 27 | 134,217,728 | 6,842,589 | 5,356,907 | 1.28× |

Cross-field observations:

  • Both Pasta curves show consistent improvement at scale, confirming the optimization is field-agnostic
  • Occasional near-1× results at n = 2²¹-2²² may be due to cache effects at these intermediate sizes

@wu-s-john
Contributor Author

Assembly Analysis: Why Delayed Reduction is Faster

To understand the performance difference at the instruction level, we analyze the ARM64 assembly for inner-product operations. The example at examples/asm_compare.rs demonstrates four approaches:

cargo asm --example asm_compare inner_product_field_field_base   # Eager reduction
cargo asm --example asm_compare inner_product_field_field_dmr    # Delayed reduction
cargo asm --example asm_compare inner_product_field_i64          # Field × i64
cargo asm --example asm_compare inner_product_field_i128         # Field × i128

Source Code (examples/asm_compare.rs):

use halo2curves::bn256::Fr as Bn254Fr;
use spartan2::small_field::{DelayedReduction, SignedWideLimbs, WideLimbs};

/// Function 1: Field × Field with EAGER reduction (base approach)
/// Each multiplication triggers a Montgomery reduction inside the loop.
#[inline(never)]
pub fn inner_product_field_field_base(a: &[Bn254Fr], b: &[Bn254Fr]) -> Bn254Fr {
    let mut acc = Bn254Fr::zero();
    for (ai, bi) in a.iter().zip(b.iter()) {
        acc += *ai * *bi; // Montgomery reduction happens here on each iteration
    }
    acc
}

/// Function 2: Field × Field with DELAYED modular reduction (DMR)
/// Accumulates in WideLimbs<9> (576 bits) and reduces once at the end.
#[inline(never)]
pub fn inner_product_field_field_dmr(a: &[Bn254Fr], b: &[Bn254Fr]) -> Bn254Fr {
    let mut acc = WideLimbs::<9>::default();
    for (ai, bi) in a.iter().zip(b.iter()) {
        // Wide accumulation - no modular reduction here
        <Bn254Fr as DelayedReduction<Bn254Fr>>::unreduced_multiply_accumulate(&mut acc, ai, bi);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<Bn254Fr>>::reduce(&acc)
}

/// Function 3: Field × i64 with delayed modular reduction
/// Uses SignedWideLimbs<6> (384 bits) for signed small values.
#[inline(never)]
pub fn inner_product_field_i64(fields: &[Bn254Fr], values: &[i64]) -> Bn254Fr {
    let mut acc = SignedWideLimbs::<6>::default();
    for (f, v) in fields.iter().zip(values.iter()) {
        // Fused multiply-accumulate into wide limbs
        <Bn254Fr as DelayedReduction<i64>>::unreduced_multiply_accumulate(&mut acc, f, v);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<i64>>::reduce(&acc)
}

/// Function 4: Field × i128 with delayed modular reduction
/// Uses SignedWideLimbs<7> (448 bits) for larger signed values.
#[inline(never)]
pub fn inner_product_field_i128(fields: &[Bn254Fr], values: &[i128]) -> Bn254Fr {
    let mut acc = SignedWideLimbs::<7>::default();
    for (f, v) in fields.iter().zip(values.iter()) {
        // Two-pass multiply-accumulate (low 64 bits, then high 64 bits)
        <Bn254Fr as DelayedReduction<i128>>::unreduced_multiply_accumulate(&mut acc, f, v);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<i128>>::reduce(&acc)
}

Instruction Count Per Multiply-Accumulate Operation:

| Function | Loop Instructions | mul | umulh | adds/adcs/adc | cinc/cset | Notes |
|---|---:|---:|---:|---:|---:|---|
| base | 250 | 36 | 32 | 89 | 53 | Includes Montgomery REDC |
| dmr | 109 | 16 | 16 | 49 | 22 | Wide accumulation only |
| i64 | 54 | 4 | 4 | 8 | 8 | 4×1 multiply + sign handling |
| i128 | 92 | 8 | 8 | 16 | 15 | 4×2 (two-pass) + sign handling |

Detailed Breakdown:

Base (250 instructions/iteration):

57× adds    - Addition with flags
50× cinc    - Conditional increment (carry propagation)
36× mul     - Low 64-bit multiply
32× umulh   - High 64-bit multiply
24× adcs    - Add with carry + set flags
 8× asr     - Arithmetic shift (modular reduction)
 8× and     - Mask operations (reduction)
 8× adc     - Add with carry
 4× ldp     - Load pair
 4× cmn     - Compare negative

The 36 muls = 16 (4×4 product) + 20 (Montgomery REDC)
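The 4×4 = 16 part of that count follows directly from schoolbook limb multiplication: a 256×256 → 512-bit product touches every pair of 64-bit limbs exactly once. A toy sketch in plain Rust (illustrative only, not the library's implementation) makes this concrete; each `u128` product in the inner loop compiles to one `mul` (low half) plus one `umulh` (high half) on ARM64:

```rust
/// Toy schoolbook 4x4-limb widening multiply (NOT the library's code).
/// Each u128 product below is one `mul` + one `umulh` on ARM64, so a full
/// 256x256 -> 512-bit product costs exactly 4*4 = 16 of each.
fn widening_mul_4x4(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
    let mut out = [0u64; 8];
    for i in 0..4 {
        let mut carry: u64 = 0;
        for j in 0..4 {
            // One limb-pair product, plus the partial sum and carry.
            // Cannot overflow u128: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1.
            let t = (a[i] as u128) * (b[j] as u128)
                + (out[i + j] as u128)
                + (carry as u128);
            out[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        out[i + 4] = carry; // this slot is still untouched, so no add needed
    }
    out
}
```

The remaining `adds`/`adcs`/`cset` instructions in the listing are the carry propagation that the `carry` variable models here.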

DMR (109 instructions/iteration):

32× adds    - Addition with flags
16× umulh   - High 64-bit multiply
16× mul     - Low 64-bit multiply  ← exactly 4×4 = 16
16× cset    - Conditional set (carry capture)
10× adc     - Add with carry
 7× adcs    - Add with carry + set flags
 6× cinc    - Conditional increment
 4× ldp     - Load pair

Pure 4×4 multiply + 9-limb accumulation. No reduction in loop.

i64 (54 instructions/iteration):

17× csel    - Conditional select (pos/neg accumulator)
 8× adds    - Addition with flags
 7× cinc    - Conditional increment
 4× umulh   - High 64-bit multiply  ← 4×1 = 4
 4× mul     - Low 64-bit multiply   ← 4×1 = 4
 3× tst     - Test sign bit
 2× ldp     - Load pair

Only 8 multiply instructions (4 mul + 4 umulh) for 4×1 product.
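The 4×1 count can be seen the same way: with a single 64-bit limb on the right-hand side, only four widening products are needed. A toy sketch (not the library's code; the real i64 path additionally folds the sign of `v` into a positive or negative accumulator, which is where the `csel`/`tst` instructions come from):

```rust
/// Toy 4-limb x 1-limb widening multiply (NOT the library's code).
/// Only 4*1 = 4 widening products are needed, matching the
/// 4 `mul` + 4 `umulh` counted in the listing above.
fn mul_4x1(a: [u64; 4], b: u64) -> [u64; 5] {
    let mut out = [0u64; 5];
    let mut carry: u64 = 0;
    for i in 0..4 {
        // One u128 product = one `mul` + one `umulh` on ARM64.
        let t = (a[i] as u128) * (b as u128) + (carry as u128);
        out[i] = t as u64;
        carry = (t >> 64) as u64;
    }
    out[4] = carry;
    out
}
```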

i128 (92 instructions/iteration):

20× csel    - Conditional select (pos/neg)
16× adds    - Addition with flags
15× cinc    - Conditional increment
 8× umulh   - High 64-bit multiply  ← 4×2 = 8
 8× mul     - Low 64-bit multiply   ← 4×2 = 8
 4× ldp     - Load pair
 3× tst     - Test sign bit
 2× eor     - XOR (sign handling)

16 multiply instructions (8 mul + 8 umulh) for two-pass 4×2 product.

Cost Comparison for n iterations:

| Method | Total Loop Cost | Reduction Cost | Formula |
|---|---:|---:|---|
| base | 250n | 0 (in loop) | 250n |
| dmr | 109n | ~200 (once) | 109n + 200 |
| i64 | 54n | ~150 (once) | 54n + 150 |
| i128 | 92n | ~150 (once) | 92n + 150 |

Key insight: The base method spends ~170 instructions (68%) on Montgomery reduction inside the loop. Delayed reduction moves this cost outside the loop, paying it only once regardless of iteration count.
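The pattern can be illustrated end-to-end with a toy delayed-reduction inner product. This is a sketch under simplifying assumptions, not the crate's implementation: the modulus is a stand-in 64-bit prime rather than the 254-bit BN254 scalar field, the wide accumulator is a single `u128` rather than `WideLimbs`, and inputs are assumed to fit in 32 bits so the accumulator cannot overflow:

```rust
// Stand-in 64-bit prime (Goldilocks); the real code reduces modulo the
// 254-bit BN254 scalar field instead.
const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Eager: reduce mod P after every product, like the `base` loop.
fn inner_product_eager(a: &[u64], b: &[u64]) -> u64 {
    let mut acc: u64 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        let prod = ((*x as u128) * (*y as u128) % (P as u128)) as u64;
        acc = ((acc as u128 + prod as u128) % (P as u128)) as u64;
    }
    acc
}

/// Delayed: accumulate wide, reduce once, like the `dmr` loop.
/// Assumes inputs fit in 32 bits, so each product fits in 64 bits and the
/// u128 accumulator can absorb ~2^64 terms before overflowing.
fn inner_product_delayed(a: &[u64], b: &[u64]) -> u64 {
    let mut acc: u128 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        acc += (*x as u128) * (*y as u128); // no reduction in the loop
    }
    (acc % (P as u128)) as u64 // single reduction at the end
}
```

Because sums commute with modular reduction, both variants return the same result; by the per-iteration costs tabulated above (250n vs. 109n + 200), the one-time reduction is amortized after only two iterations.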

Timing Benchmark (n = 65536, 10 trials, BN254):

cargo run --example asm_compare --release
Inner Product Implementations - Assembly Comparison (BN254)


Timing Benchmark (n = 65536, 10 trials)
----------------------------------------
  base (field×field eager):     1598.8 µs
  dmr  (field×field delayed):    640.0 µs  (2.50× faster)
  i64  (field×i64 delayed):      329.4 µs  (4.85× faster than base)
  i128 (field×i128 delayed):     506.2 µs  (3.16× faster than base)

Instruction counts per iteration (from assembly):
  base: 250 instrs  |  dmr: 109 instrs  |  i64: 54 instrs  |  i128: 92 instrs
