
Implement Small-Value Sum-Check Optimization (Algorithm 6)#98

Open
wu-s-john wants to merge 108 commits into microsoft:main from wu-s-john:feat/procedure-9-accumulator

Conversation


@wu-s-john wu-s-john commented Dec 18, 2025

Implement Small-Value Sum-Check Optimization (Algorithm 6)

Summary

This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers (fitting in i32/i64), enabling significant prover speedups by replacing expensive field multiplications with cheaper native integer operations.

Key Insight

In the sum-check protocol, round 1 computations involve only small values (the original witness evaluations). From round 2 onward, evaluations become "large" due to binding to random verifier challenges. Algorithm 6 delays this binding using Lagrange interpolation, computing accumulators over small values in the first ℓ₀ rounds before switching to the standard linear-time prover.

Multiplication Cost Hierarchy:

  • ss (small × small): Native i32/i64 multiplication (~1 cycle)
  • sl (small × large): Barrett-optimized multiplication (~9 base mults)
  • ll (large × large): Full Montgomery multiplication (~32 base mults)

For Spartan with degree-2 polynomials, Algorithm 6 reduces ll multiplications from O(N) to O(N/2^ℓ₀) at the cost of O((3/2)^ℓ₀ · N) ss multiplications.
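To make the tradeoff concrete, here is a back-of-the-envelope cost model (illustrative only, not code from this PR) using the base-multiplication weights above. It only models the first-rounds work, not total prove time, but it shows why a small ℓ₀ such as 3 is attractive:

```rust
/// Rough per-instance cost (in base-multiplication equivalents) of the first
/// l0 rounds under Algorithm 6, using the weights quoted above: ss ≈ 1 and
/// ll ≈ 32 base mults. l0 = 0 models the baseline of pure large×large work.
fn relative_cost(n: u64, l0: u32) -> f64 {
    let (ss, ll) = (1.0_f64, 32.0_f64);
    let n = n as f64;
    // O((3/2)^l0 · N) small×small mults plus O(N / 2^l0) large×large mults.
    1.5_f64.powi(l0 as i32) * n * ss + n / 2.0_f64.powi(l0 as i32) * ll
}
```

In this model, ℓ₀ = 3 beats ℓ₀ = 0 by more than 4× on the first-rounds work, while very large ℓ₀ loses again because the (3/2)^ℓ₀ small-multiplication term eventually dominates.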

Benchmarks

Measured on M1 Max MacBook Pro (10 cores, 64GB RAM) with jemalloc.
Note: halo2curves/asm is not enabled (unavailable on Apple Silicon).

Headline Result: 1.83× Speedup on BN254

At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.83× speedup on the BN254 scalar field (30 trials):

sumcheck_bench_bn254-fr_n26_l03
| Percentile | Base prove (ms) | Small-value prove (ms) | Speedup |
|---|---|---|---|
| 25% | 2,588 | 1,418 | 1.79× |
| 50% (median) | 2,616 | 1,429 | 1.82× |
| 75% | 2,662 | 1,473 | 1.84× |
| 90% | 2,895 | 1,577 | 1.88× |
  • Mean speedup: 1.83× (base: 2,671ms → small-value: 1,464ms)
  • Lower variance in optimized version (std: 73ms vs 151ms)
  • Consistent speedup across all percentiles

Scaling Across Problem Sizes

cargo run --release --example sumcheck_sweep -- --field bn254-fr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | small-value prove (µs) | speedup |
|---|---|---|---|---|
| 16 | 65,536 | 9,459 | 4,619 | 2.05× |
| 17 | 131,072 | 8,795 | 4,516 | 1.95× |
| 18 | 262,144 | 13,582 | 7,567 | 1.80× |
| 19 | 524,288 | 24,293 | 14,199 | 1.71× |
| 20 | 1,048,576 | 49,350 | 25,117 | 1.97× |
| 21 | 2,097,152 | 96,088 | 62,382 | 1.54× |
| 22 | 4,194,304 | 176,852 | 94,437 | 1.87× |
| 23 | 8,388,608 | 349,647 | 190,513 | 1.84× |
| 24 | 16,777,216 | 679,342 | 365,720 | 1.86× |
| 25 | 33,554,432 | 1,409,206 | 729,882 | 1.93× |
| 26 | 67,108,864 | 2,799,654 | 1,493,126 | 1.88× |
| 27 | 134,217,728 | 5,671,207 | 3,066,167 | 1.85× |

Key observations:

  • Consistent 1.8-2.0× speedup across all problem sizes on BN254
  • Speedup remains stable even at n = 2²⁷ (134M constraints)
  • Peak speedup of 2.05× at n = 2¹⁶

Delayed Modular Reduction Impact

To isolate the impact of delayed modular reduction (DMR), we compare performance with and without DMR enabled.

Accumulator Building Phase

The accumulator building phase (Procedure 9) benefits most dramatically from DMR, as it performs many small×small multiplications that would otherwise require modular reduction after each operation.

cargo run --release --example accum_bench -- --field bn254-fr --l0 3 range-sweep --min 16 --max 27
| num_vars | n | DMR accum (µs) | no-DMR accum (µs) | accum speedup |
|---|---|---|---|---|
| 16 | 65,536 | 1,058 | 3,983 | 3.76× |
| 17 | 131,072 | 1,595 | 8,668 | 5.43× |
| 18 | 262,144 | 2,434 | 18,357 | 7.54× |
| 19 | 524,288 | 3,831 | 15,732 | 4.11× |
| 20 | 1,048,576 | 7,708 | 29,625 | 3.84× |
| 21 | 2,097,152 | 13,739 | 55,788 | 4.06× |
| 22 | 4,194,304 | 25,401 | 120,680 | 4.75× |
| 23 | 8,388,608 | 55,910 | 223,901 | 4.00× |
| 24 | 16,777,216 | 116,574 | 458,829 | 3.94× |
| 25 | 33,554,432 | 242,699 | 935,360 | 3.85× |
| 26 | 67,108,864 | 516,107 | 1,891,151 | 3.66× |
| 27 | 134,217,728 | 1,024,774 | 3,794,770 | 3.70× |

Key observations:

  • DMR provides 3.7-7.5× speedup on accumulator building with l0=3
  • Peak speedup of 7.54× at n = 2¹⁸
  • This is the primary source of performance gains in the small-value optimization

Time Breakdown: First l0 Rounds vs Remaining Rounds

With l0=3, the work is balanced between the accumulator-based first rounds and the standard sumcheck remaining rounds:

| num_vars | n | first l0 (ms) | remaining (ms) | ratio (first:remaining) |
|---|---|---|---|---|
| 16 | 65,536 | 1.2 | 3.4 | 0.34 : 1 |
| 17 | 131,072 | 1.7 | 2.8 | 0.60 : 1 |
| 18 | 262,144 | 2.5 | 5.0 | 0.50 : 1 |
| 19 | 524,288 | 3.9 | 10.3 | 0.38 : 1 |
| 20 | 1,048,576 | 7.8 | 17.3 | 0.45 : 1 |
| 21 | 2,097,152 | 13.8 | 48.6 | 0.28 : 1 |
| 22 | 4,194,304 | 25.5 | 68.9 | 0.37 : 1 |
| 23 | 8,388,608 | 56.0 | 134.5 | 0.42 : 1 |
| 24 | 16,777,216 | 116.6 | 249.1 | 0.47 : 1 |
| 25 | 33,554,432 | 242.8 | 487.1 | 0.50 : 1 |
| 26 | 67,108,864 | 516.2 | 977.0 | 0.53 : 1 |
| 27 | 134,217,728 | 1,024.8 | 2,041.3 | 0.50 : 1 |

Key observations:

  • First l0 rounds (accumulators + l0 round proofs) take ~1/3 to 1/2 of total prove time
  • Remaining rounds (l0+1 to n) dominate, taking ~2× longer than first l0
  • Ratio stabilizes around 1:2 for large instances (n ≥ 2²⁴)
  • This balanced split indicates l0=3 is a good choice for BN254

Split-Eq Sumcheck with DMR

For the split-eq sumcheck (which uses pre-split eq-polynomial tables), DMR provides additional speedup by delaying modular reductions in the remaining rounds.

cargo run --release --example sumcheck_sweep -- --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq-DMR prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 9,546 | 5,113 | 1.87× |
| 17 | 131,072 | 8,278 | 5,698 | 1.45× |
| 18 | 262,144 | 13,570 | 9,973 | 1.36× |
| 19 | 524,288 | 26,253 | 18,446 | 1.42× |
| 20 | 1,048,576 | 44,574 | 41,793 | 1.07× |
| 21 | 2,097,152 | 95,211 | 86,503 | 1.10× |
| 22 | 4,194,304 | 172,328 | 136,639 | 1.26× |
| 23 | 8,388,608 | 327,351 | 271,947 | 1.20× |
| 24 | 16,777,216 | 630,354 | 525,370 | 1.20× |
| 25 | 33,554,432 | 1,255,856 | 1,058,761 | 1.19× |
| 26 | 67,108,864 | 2,506,240 | 2,171,037 | 1.15× |
| 27 | 134,217,728 | 4,904,540 | 5,304,525 | 0.93× |

Key observations:

  • Split-eq with DMR provides 1.15-1.87× speedup for most instance sizes
  • At n = 2²⁷, there is a slight slowdown (0.93×), likely due to increased memory pressure from DMR state
  • The sweet spot is around n = 2¹⁶ where the speedup peaks at 1.87×
  • For very large instances (n ≥ 2²⁵), the speedup stabilizes around 1.15-1.19×

SHA-256 Chain Benchmark

To demonstrate real-world applicability, we benchmark proving SHA-256 hash chains. This workload approximates a major component of Solana light client verification.

cargo run --release --no-default-features --example sha256_chain_benchmark
| chain_length | num_vars | log₂(constraints) | num_constraints | witness_ms | orig_sumcheck_ms | small_sumcheck_ms | total_ms | speedup | witness_pct |
|---|---|---|---|---|---|---|---|---|---|
| 2 | 16 | 16 | 65,536 | 14 | 5 | 3 | 20 | 1.67× | 70.0% |
| 8 | 18 | 18 | 262,144 | 55 | 16 | 11 | 75 | 1.45× | 73.3% |
| 32 | 20 | 20 | 1,048,576 | 229 | 48 | 32 | 301 | 1.50× | 76.1% |
| 128 | 22 | 22 | 4,194,304 | 1,260 | 163 | 109 | 1,547 | 1.50× | 81.4% |
| 512 | 24 | 24 | 16,777,216 | 5,686 | 609 | 395 | 6,743 | 1.54× | 84.3% |
| 2048 | 26 | 26 | 67,108,864 | 17,015 | 2,857 | 1,677 | 22,116 | 1.70× | 76.9% |

Key observations:

  • 2048 SHA-256 hashes proven in ~22 seconds
  • Witness generation dominates at 70-84% of total proving time
  • Small-value sumcheck achieves consistent 1.45-1.70× speedup

Solana Light Client Comparison

A Solana light client verifying block finality requires:

| Component | Hash Function | Count |
|---|---|---|
| Vote signature verification | SHA-512 (Ed25519 internal) | ~21 to ~1,588 |
| Merkle shred verification | SHA-256 | ~108 to ~1,206 |
  • Ed25519 uses SHA-512 internally for challenge hashing
  • Finality requires ≥2/3 supermajority stake (~21-530 validators)
  • SHA-512 is ~1.5-2× more expensive than SHA-256 per hash

SHA-256 equivalent cost:

  • Solana SHA-256: ~1,206 hashes
  • Solana SHA-512: ~1,588 × 1.5-2 = ~2,382-3,176 SHA-256 equivalent
  • Total: ~3,588-4,382 SHA-256 equivalent
  • Our 2048-chain benchmark covers ~47-57% of Solana's worst-case proving requirement

Implementation

Core Components

  1. SmallValueField trait (src/small_field.rs)

    • Defines SmallValue (i32) and IntermediateSmallValue (i64) types
    • Barrett-optimized sl_mul and isl_mul for BN254/BLS12-381 (~3× faster than ll)
    • Overflow analysis ensuring correctness for typical witness bounds
  2. Lagrange Domain Extension (src/lagrange.rs)

    • LagrangeEvaluatedMultilinearPolynomial<T, D> for extending boolean evaluations to U_d = {∞, 0, 1, ..., d-1}
    • Zero-allocation extend_in_place with ping-pong buffers
    • gather_prefix_evals for efficient prefix collection (Procedure 6)
  3. Accumulator Data Structures (src/accumulators.rs, src/accumulator_index.rs)

    • SmallValueAccumulators<S, D> storing A_i(v, u) with O(1) indexing via UdTuple
    • idx4 mapping (Definition A.5) for distributing products to correct accumulators
    • Type-safe UdEvaluations and UdHatEvaluations wrappers
  4. Procedure 9 Implementation (src/accumulators.rs)

    • build_accumulators_spartan: Optimized for Spartan's Az·Bz structure
    • build_accumulators: Generic version for arbitrary polynomial products
    • Parallel fold-reduce with thread-local scratch buffers
  5. Thread-Local Buffer Reuse (src/thread_state_accumulators.rs)

    • SpartanThreadState and GenericThreadState eliminate O(num_x_out) allocations
    • Reduces allocator contention in parallel workloads
  6. Sum-Check Integration (src/sumcheck.rs)

    • SmallValueSumCheck::from_accumulators factory method
    • Round-by-round Lagrange coefficient multiplication (R_{i+1} = R_i ⊗ L_{U_d}(r_i))
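
As a toy illustration of the Lagrange domain extension (Procedure 6) — a sketch over plain i64 values, not the crate's generic types — here is the extension of the first (top/MSB) variable of a multilinear polynomial from {0,1} to the extra point ∞, whose value is the leading coefficient:

```rust
/// Extend evaluations of a multilinear polynomial over its first (MSB)
/// variable from {0,1} to U_2's extra point ∞. Since p is linear in X₁,
/// p(∞, x) is the leading coefficient p(1, x) - p(0, x).
fn extend_first_var(evals: &[i64]) -> (Vec<i64>, Vec<i64>, Vec<i64>) {
    assert!(evals.len().is_power_of_two());
    let half = evals.len() / 2;
    let p0 = evals[..half].to_vec(); // p(0, x): MSB = 0
    let p1 = evals[half..].to_vec(); // p(1, x): MSB = 1
    let p_inf: Vec<i64> = p0.iter().zip(&p1).map(|(a, b)| b - a).collect();
    (p_inf, p0, p1)
}
```

Because the witness values are small, all three tables stay small integers; repeating this per variable over the first ℓ₀ variables yields the extension to U_d^{ℓ₀}.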

Algorithm Flow

┌─────────────────────────────────────────────────────────────────────────┐
│  Precomputation: Build accumulators A_i(v, u) for i ∈ [ℓ₀]              │
│                                                                         │
│  For each x_out ∈ {0,1}^{ℓ/2-ℓ₀}:                                       │
│    For each x_in ∈ {0,1}^{ℓ/2}:                                         │
│      ein = eq(w_R, x_in) · eq(w_L, x_out)                              │
│      Extend Az/Bz prefixes to U_d^{ℓ₀} via Lagrange                    │
│      Accumulate products weighted by ein into A_i(v, u)                │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Rounds 1..ℓ₀: Compute s_i(X) = ⟨R_i, A_i(·, u)⟩ for u ∈ Û_d           │
│                R_{i+1} = R_i ⊗ (L_{U_d,k}(r_i))_{k∈U_d}                 │
└─────────────────────────────────────────────────────────────────────────┘
                                    │
                                    ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Round ℓ₀+1: Streaming round (Algorithm 2) to bind to r_{1:ℓ₀}         │
│  Rounds ℓ₀+2..ℓ: Standard linear-time sum-check (Algorithm 1)          │
└─────────────────────────────────────────────────────────────────────────┘
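
The round update in the middle box can be sketched as follows (with f64 stand-ins for field elements and hypothetical free functions, not the PR's actual types). Over U_2 = {∞, 0, 1}, a degree-≤2 polynomial satisfies p(r) = p(∞)·(r² − r) + p(0)·(1 − r) + p(1)·r, which gives the per-round coefficients:

```rust
/// Lagrange-style weights over U_2 = {∞, 0, 1} for degree-≤2 polynomials:
/// p(r) = p(∞)·(r² - r) + p(0)·(1 - r) + p(1)·r.
fn lagrange_u2(r: f64) -> [f64; 3] {
    [r * r - r, 1.0 - r, r]
}

/// One round of the tensor update R_{i+1} = R_i ⊗ L_{U_2}(r_i).
fn tensor_update(r_prev: &[f64], r: f64) -> Vec<f64> {
    let l = lagrange_u2(r);
    r_prev
        .iter()
        .flat_map(|&c| l.iter().map(move |&w| c * w))
        .collect()
}
```

Starting from R_1 = [1], after i rounds R has 3^i entries, matching the accumulator index space over U_2 prefixes.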

Test Plan

  • cargo test test_build_accumulators - Verifies accumulator construction
  • cargo test test_small_value - SmallValueField arithmetic correctness
  • cargo test lagrange - Lagrange extension and interpolation
  • cargo test sumcheck - Full sum-check protocol equivalence
  • cargo clippy - No warnings
  • examples/sumcheck_sha256_equivalence.rs - Verifies new method produces identical proofs to baseline
  • examples/sha256_chain_benchmark.rs - SHA-256 chain proving with CSV output

References

Introduce UdPoint, UdHatPoint, UdTuple, and ValueOneExcluded types in src/lagrange.rs for representing the evaluation domains U_d and Û_d used in the small-value sumcheck optimization (Algorithm 6). Implements LagrangeEvaluatedMultilinearPolynomial with a from_multilinear() factory method that extends evaluations from {0,1}^n to U_d^n.

Introduces RoundAccumulator and SmallValueAccumulators for the
small-value sumcheck optimization. Uses flat Vec<[Scalar; D]> storage
with const generic D for cache efficiency and vectorizable merge
operations in parallel fold-reduce.
Parameterize UdPoint, UdHatPoint, UdTuple, and
LagrangeEvaluatedMultilinearPolynomial with const generic D to enable:

- Compile-time enforcement that domain types match accumulator degree
- Debug assertions for bounds checking (v < D in constructors)
- Elimination of runtime base parameter from to_flat_index()

This prevents mixing domain sizes at compile time and catches
out-of-bounds errors in debug builds.
Implement AccumulatorPrefixIndex and compute_idx4() which maps
evaluation prefixes β ∈ U_d^ℓ₀ to accumulator contributions by
decomposing β into prefix v, coordinate u ∈ Û_d, and binary suffix y.
gather_prefix_evals extracts strided polynomial evaluations for all binary prefixes b ∈ {0,1}^ℓ₀ given a fixed suffix, bridging full polynomials to Procedure 6 (Lagrange extension).
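
A minimal sketch of that strided gather (plain i64 slice, MSB-first indexing assumed, consistent with the MSB-first checks described below; the actual method is generic):

```rust
/// Collect p(b, suffix) for all binary prefixes b ∈ {0,1}^l0, with the
/// prefix occupying the top (MSB) bits of the evaluation index.
fn gather_prefix_evals(evals: &[i64], l0: u32, suffix: usize) -> Vec<i64> {
    assert!(evals.len().is_power_of_two());
    let n = evals.len().trailing_zeros();
    let stride = 1usize << (n - l0); // number of suffixes per prefix
    assert!(suffix < stride);
    (0..1usize << l0).map(|b| evals[b * stride + suffix]).collect()
}
```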
Added a parallel build_accumulators that binds suffixes, extends
prefixes to the Ud domain, applies the ∞/Cz rule, and routes
contributions via cached idx4 with E_in/E_out weighting. Expanded
accumulator tests with a naive cross-check, ∞ handling, and binary-β
zero behavior to validate correctness. Cleaned up dead-code allowances
now that the code paths are used.
Added explicit MSB-first checks for eq table generation, gather_prefix_evals stride/pattern, and bind_poly_var_top to ensure "top" binds the MSB. These tests catch silent index/order regressions across components.
@wu-s-john wu-s-john changed the title from "Implement Algorithm 6 Foundation — Procedure 9 Accumulator Builder" to "Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder" on Dec 18, 2025
Compute ℓ_i(X) = eqe(w[<i], r[<i]) · eqe(w_i, X) values for sum-check rounds: with α_i = eqe(w[<i], r[<i]), these are ℓ_i(0) = α_i(1−w_i), ℓ_i(1) = α_i·w_i, and ℓ_i(∞) = α_i(2w_i−1).
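
These three values are just the linear factor eq(w, X) = w·X + (1−w)(1−X) scaled by α, with the ∞ slot holding the leading coefficient. A quick sanity check (f64 stand-ins, hypothetical helper names):

```rust
/// eq(w, X) = w·X + (1-w)·(1-X), linear in X.
fn eq_linear(w: f64, x: f64) -> f64 {
    w * x + (1.0 - w) * (1.0 - x)
}

/// (ℓ(0), ℓ(1), ℓ(∞)) = (α·(1-w), α·w, α·(2w-1)): the values at 0 and 1
/// plus the leading coefficient, all scaled by α.
fn ell(alpha: f64, w: f64) -> (f64, f64, f64) {
    (alpha * (1.0 - w), alpha * w, alpha * (2.0 * w - 1.0))
}
```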
Replace range-indexed loops and a redundant closure with iterator forms
Add eq-round linear factor utilities and accumulator evaluation to
derive t_i and build s_i polynomials.
Track R_i and ℓ_i state to compare accumulator evals with EqSumCheckInstance rounds.

Switch Spartan t_i to D=2 aliases/tests, precompute idx4 prefix/suffix indexing data, and flatten accumulator caches to cut allocations.
Csr (Compressed Sparse Row) stores variable-length lists with 2
allocations instead of N+1, improving cache locality. Replaces ad-hoc
offsets/entries arrays in build_accumulators
- Add prove_cubic_with_three_inputs_small_value combining small-value
  optimization for first ℓ₀ rounds with eq-poly optimization for
  remaining
- Introduce SPARTAN_T_DEGREE constant to centralize polynomial degree
  parameter
- Add sumcheck_sweep.rs examples for performance comparison
The new from_boolean_evals_with_buffer_reusing method used in build_accumulators takes caller-provided scratch buffers and alternates between them during extension. This reduces allocations from O(num_x_in × num_x_out) per call to O(num_threads) buffers allocated once per thread.
Spartan and generic build_accumulators variants: the Spartan version (D=2) skips binary betas since satisfying witnesses have Az·Bz = Cz on {0,1}^n; the generic version supports arbitrary polynomial products.
Adds a new example that tests prove_cubic_with_three_inputs and
prove_cubic_with_three_inputs_small_value produce identical proofs when
used with a real SHA256 circuit (Algorithm 6 validation).

Changes:
- Add PartialEq, Eq derive to SumcheckProof for proof comparison
- Add extract_outer_sumcheck_inputs helper to SpartanSNARK
- Add examples/sumcheck_sha256_equivalence.rs
Implement the small × large multiplication optimization from "Speeding
Up Sum-Check Proving" using Barrett reduction for ~3× speedup over naive
field multiplication.

Key changes:
  - Add SmallValueField trait for type-safe i32/i64 small-value
    operations
  - Implement Barrett reduction for Pallas Fp and Fq (sl_mul, isl_mul)
  - Add SpartanAccumulatorInput trait to unify field and i32 witness
    handling
  - Make LagrangeEvaluatedMultilinearPolynomial generic over element
    type
  - Update sumcheck prover to accept separate i32 witness polynomials
  - Clean up MultilinearPolynomial<i32>: remove unused
    from_u32/from_u64/from_field
@wu-s-john wu-s-john force-pushed the feat/procedure-9-accumulator branch from 2828f04 to 67674c4 on December 23, 2025 19:33
Replace raw arrays and ad-hoc structs with proper abstractions for the U_d = {∞, 0, 1, ..., D-1} and Û_d = U_d \ {1} evaluation domains. Remove EqRoundValues in favor of UdEvaluations<F, 2>.
- Delete unused constructor/predicate methods from UdPoint and
  UdHatPoint
- Move test-only methods (alpha, prefix_len, suffix_len,
  extend_from_boolean) to cfg(test) impl blocks
- Add CachedPrefixIndex struct with From impl to accumulator_index.rs
- Remove unused QuadraticTAccumulatorPrefixIndex type alias
- Delete unused eq_factor_alpha method from sumcheck
Hoist scratch buffers to thread-local state in
build_accumulators_spartan and build_accumulators. Previously, 5 vectors
were allocated on every x_out iteration; now allocations happen once per
Rayon thread subdivision.

- Add extend_in_place to LagrangeEvaluatedMultilinearPolynomial (avoids
  .to_vec())
- Add SpartanThreadState and GenericThreadState structs for buffer reuse
- Extract thread state structs to thread_state_accumulators module

Reduces allocations from O(num_x_out × num_x_in) to O(num_threads).
Move the witness polynomial abstraction trait from accumulators.rs to
its own module for better code organization. Rename from
SpartanAccumulatorInput to SpartanAccumulatorInputPolynomial to clarify
that it abstracts over multilinear polynomial representations (field
elements vs small values).
@wu-s-john wu-s-john changed the title from "Implement Faster Sumcheck Algorithm — Procedure 9 Accumulator Builder" to "Implement Small-Value Sum-Check Optimization (Algorithm 6)" on Dec 23, 2025
@wu-s-john wu-s-john marked this pull request as ready for review December 23, 2025 23:56
  Extract WideMul trait (small × small → product) into wide_mul.rs,
  separating widening multiplication from delayed reduction. This
  enables cleaner generic bounds where only one capability is needed.

  Key changes:
  - Add WideMul trait with i32→i64 and i64→i128 implementations
  - Refactor DelayedReduction to be generic over value type
    (i32/i64/i128/F)
  - Rename multiply_accumulate → unreduced_multiply_accumulate for
    clarity
  - Remove LagrangeAccumulatorField convenience trait (use explicit
    bounds)
  - Make multiply_vec_small and fold functions generic over SmallValue
  - Delete impls.rs, consolidating implementations into trait modules
  - Update documentation: fix outdated terminology and add limb size
    context

  The trait hierarchy is now:
  - WideMul: small × small → product (pure integer widening)
  - SmallValueField: field ↔ small conversions
  - DelayedReduction<T>: field × T accumulation without reduction

  - Replace BenchmarkBox wrapper with direct enum dispatch
  - Move run logic to standalone run_benchmark<E, B> function
  - Simplify SumcheckBenchmark trait to just convert() and prove()
  - Remove build_benchmarks function in favor of direct dispatch
  - Add i32 support for BN254 (SupportsSmallI32 marker trait)
  - Fix clippy warning for unnecessary cast in small_field
- Add timed() helper to reduce timing boilerplate across 5 call sites
- Extract TRANSCRIPT_LABEL constant (was hardcoded 3 times)
- Rename run_chain_benchmark → run_sumcheck_benchmark for clarity
- Remove outdated comment referencing PallasHyraxEngine
- Reorganize imports into spartan2 block
- Extract FieldChoice into shared cli module
- Consolidate timing phases into (name, short_name) tuples with HashMap
  storage
- Simplify NeutronNova benchmark: remove rounds param and subcommands
- Clean up phase names: remove prep_ml, add nifs_prove, unify span names
- Make sha256.rs generic over engine type
 Delete examples/sumcheck_sha256_equivalence.rs and add equivalent test
 in src/small_sumcheck.rs. Tests that prove_cubic_small_value produces
 identical output to prove_cubic_with_three_inputs using
 SmallSha256Circuit
- ExtensionBound<SV, D> precomputes max_safe bound once for reuse
- Make prove_small_value and helpers generic over SV (i32/i64)
- Remove unused Debug bounds from SV and SV::Product
- Update docs to reference ExtensionBound
- Remove `build_accumulators` generic function (was marked dead_code)
- Remove `gather_prefix_evals` from MultilinearPolynomial
- Add standalone `extend_to_lagrange_domain` function (Procedure 6)
- Remove helper functions only used by deleted code:
  - `fill_eyx`, `scatter_beta_contributions`
- Clean up unused fields and imports across lagrange_accumulator module
- Remove tests that depended on deleted functions
- Change `LagrangeAccumulators::rounds` from `pub` to `pub(crate)`
- Change `LagrangeEvals::infinity` and `finite` from `pub` to
  `pub(crate)`
- Fix stale comment referencing non-existent `cz_ext` and `cz_pref`
  variables
- Use `data_mut()` accessor instead of direct field access for
  consistency
 Change eq_cache from [round][y * num_x + x] to [round][x * num_y + y]
 so each parallel task (fixed x_out) accesses a contiguous memory block.
 Also remove unnecessary clone of e_in by borrowing directly.
@wu-s-john wu-s-john force-pushed the feat/procedure-9-accumulator branch from 4d49748 to f614127 on February 6, 2026 15:09
@wu-s-john
Contributor Author

Split-Eq Sumcheck with Delayed Modular Reduction

For the split-eq sumcheck (which uses pre-split eq-polynomial tables), delayed modular reduction provides additional speedup by batching modular reductions in the remaining rounds. The speedup comes from inner-product-like multiplications with two streams of large field elements—performing a single reduction for the entire inner-product is faster than reducing after each multiplication term.
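
The effect can be demonstrated in miniature (toy 31-bit modulus, not the crate's wide-limb accumulators): accumulate full-width products and reduce once at the end, versus reducing after every term. Inputs are assumed already reduced into [0, P):

```rust
const P: i64 = 2_147_483_647; // toy prime modulus (2^31 - 1), not BN254

/// Delayed reduction: accumulate double-width products in i128, reduce once.
fn inner_product_dmr(a: &[i64], b: &[i64]) -> i64 {
    let acc: i128 = a.iter().zip(b).map(|(&x, &y)| x as i128 * y as i128).sum();
    acc.rem_euclid(P as i128) as i64
}

/// Eager reduction: reduce after every multiply-accumulate, as in the
/// baseline field arithmetic.
fn inner_product_eager(a: &[i64], b: &[i64]) -> i64 {
    a.iter().zip(b).fold(0i64, |acc, (&x, &y)| {
        let term = ((x as i128 * y as i128) % P as i128) as i64;
        (acc + term) % P
    })
}
```

The real implementation does the analogous thing with 576-bit (WideLimbs) accumulators so that the single Montgomery-style reduction is amortized over the whole inner product.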

Scaling Across Problem Sizes
sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 8,082 | 5,853 | 1.38× |
| 17 | 131,072 | 10,018 | 7,615 | 1.32× |
| 18 | 262,144 | 17,714 | 13,015 | 1.36× |
| 19 | 524,288 | 26,564 | 21,801 | 1.22× |
| 20 | 1,048,576 | 51,821 | 43,759 | 1.18× |
| 21 | 2,097,152 | 104,513 | 95,216 | 1.10× |
| 22 | 4,194,304 | 195,987 | 159,347 | 1.23× |
| 23 | 8,388,608 | 383,391 | 330,527 | 1.16× |
| 24 | 16,777,216 | 759,441 | 590,406 | 1.29× |
| 25 | 33,554,432 | 1,313,572 | 1,125,700 | 1.17× |
| 26 | 67,108,864 | 2,615,161 | 2,246,468 | 1.16× |
| 27 | 134,217,728 | 5,292,385 | 5,259,143 | 1.01× |
Statistical Analysis at n = 2²⁶ (30 trials)
sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --trials 30 --methods base,split-eq-dmr single 26
| Statistic | Base prove (ms) | Split-eq prove (ms) | Speedup |
|---|---|---|---|
| Min | 2,595 | 2,252 | 1.15× |
| 25% | 2,603 | 2,278 | 1.14× |
| 50% (median) | 2,627 | 2,284 | 1.15× |
| 75% | 2,649 | 2,299 | 1.15× |
| Max | 2,743 | 2,443 | 1.12× |
| Mean | 2,641 | 2,294 | 1.15× |

Key observations:

  • Split-eq with delayed modular reduction provides 1.10-1.38× speedup for most instance sizes
  • Consistent 1.15× mean speedup at n = 2²⁶ across 30 trials with low variance
  • At n = 2²⁷, speedup drops to 1.01×, likely due to increased memory pressure
  • The sweet spot is around n = 2¹⁶ where the speedup peaks at 1.38×
sumcheck_split_eq_dmr

@wu-s-john
Contributor Author

NeutronNova NIFS Benchmark

NeutronNova folds multiple R1CS instances using a NIFS (Non-Interactive Folding Scheme) sumcheck. This benchmark measures the end-to-end proving time with detailed phase breakdown.

8 instances × 128 SHA-256 hashes each (4,194,304 constraints per instance):

NIFS Folding Only

sudo nice -n -20 cargo run --release --example neutronnova_sha256_benchmark -- --mode nifs --instances 8 --chain-length 128 --field bn254-fr
| Phase | small (ms) | large (ms) | Speedup |
|---|---|---|---|
| shared_syn | 12 | 9 | 0.75× |
| precom_syn | 1,047 | 1,240 | 1.18× |
| commit_pre | 1,240 | 1,148 | 0.93× |
| mat_vec | 550 | 618 | 1.12× |
| nifs_fold_sc | 156 | 2,014 | 12.91× |
| fold_W | 38 | 112 | 2.95× |
| fold_U | 305 | 285 | 0.93× |
| nifs_prove | 1,416 | 3,211 | 2.27× |
| end_to_end | 5,429 | 7,157 | 1.32× |

Full ZK Prove

sudo nice -n -20 cargo run --release --example neutronnova_sha256_benchmark -- --mode zk-prove --instances 8 --chain-length 128 --field bn254-fr
| Phase | small (ms) | large (ms) | Speedup |
|---|---|---|---|
| prep | 3,899 | 4,484 | 1.15× |
| shared_syn | 12 | 10 | 0.83× |
| precom_syn | 1,470 | 1,472 | 1.00× |
| commit_pre | 1,210 | 1,101 | 0.91× |
| rerand | 838 | 901 | 1.08× |
| gen_inst | 91 | 74 | 0.81× |
| nifs_sc | 155 | 1,984 | 12.80× |
| fold_W | 37 | 115 | 3.11× |
| fold_U | 271 | 273 | 1.01× |
| nifs | 1,416 | 3,198 | 2.26× |
| outer_sc | 1,695 | 1,684 | 0.99× |
| inner_sc | 794 | 771 | 0.97× |
| pcs | 67 | 67 | 1.00× |
| zk_prove | 6,014 | 7,709 | 1.28× |
| end_to_end | 10,026 | 12,294 | 1.23× |

Key observations:

  • NIFS sumcheck achieves 12.8-12.9× speedup from the small-value optimization, the highest speedup of any phase
  • NIFS prove time: 1.4s (small) vs 3.2s (large) = 2.27× speedup
  • Full ZK prove: 6.0s (small) vs 7.7s (large) = 1.28× speedup
  • End-to-end with prep: 10.0s (small) vs 12.3s (large) = 1.23× overall speedup
  • The optimization has successfully shifted the bottleneck away from the sumcheck

Future optimizations with native small-value types:

If the R1CS matrices and witness are stored natively as small-value types (rather than field elements), additional speedups are possible:

  1. Matrix-vector multiply (mat_vec): Currently 550ms. With small-value matrix entries and witness, this becomes small×small multiplication—essentially free compared to field operations.

  2. Witness synthesis (precom_syn): Currently 1.0-1.5s. Small-value arithmetic during witness generation would significantly reduce this cost.

Remaining bottlenecks and GPU acceleration opportunities:

The phase breakdown reveals that the NIFS sumcheck (nifs_sc) is now only ~1.5% of total time. The bottleneck has shifted to preprocessing, commitment, and witness generation:

  1. Preprocessing (prep): 3.9s — the largest single cost, including circuit compilation and setup.

  2. Commitment (commit_pre): 1.2s. Hyrax PCS committing to binary witnesses via MSM. GPU-accelerated MSM implementations routinely achieve 10-50× speedup over CPU.

@wu-s-john
Contributor Author

End-to-End Spartan Proving Breakdown

Full Spartan proving pipeline with detailed phase breakdown, measured with sudo nice -n -20 for minimal scheduling noise on M1 Max.

Single SHA-256 Hash (varying message sizes)

sudo nice -n -20 cargo run --release --no-default-features --example sha256

msg=1024B, constraints=1,048,576:

| Phase | small (ms) | large (ms) | Speedup |
|---|---|---|---|
| synth_pre | 127 | 156 | 1.23× |
| commit_pre | 24 | 22 | 0.92× |
| r1cs_rest | 17 | 16 | 0.94× |
| commit_rest | 12 | 11 | 0.92× |
| mat_vec | 15 | 13 | 0.87× |
| outer_sc | 19 | 44 | 2.32× |
| eval_rx | 4 | 6 | 1.50× |
| eval_sparse | 41 | 41 | 1.00× |
| poly_ABC | 20 | 19 | 0.95× |
| poly_z | 3 | 2 | 0.67× |
| inner_sc | 39 | 47 | 1.21× |
| pcs | 33 | 27 | 0.82× |
| prove_total | 216 | 235 | 1.09× |

msg=2048B, constraints=2,097,152:

| Phase | small (ms) | large (ms) | Speedup |
|---|---|---|---|
| synth_pre | 240 | 326 | 1.36× |
| commit_pre | 46 | 43 | 0.93× |
| r1cs_rest | 36 | 34 | 0.94× |
| commit_rest | 26 | 23 | 0.88× |
| mat_vec | 33 | 23 | 0.70× |
| outer_sc | 36 | 84 | 2.33× |
| eval_rx | 11 | 12 | 1.09× |
| eval_sparse | 85 | 82 | 0.96× |
| poly_ABC | 41 | 38 | 0.93× |
| poly_z | 6 | 9 | 1.50× |
| inner_sc | 78 | 79 | 1.01× |
| pcs | 40 | 41 | 1.02× |
| prove_total | 411 | 437 | 1.06× |

@wu-s-john
Contributor Author

At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.91× speedup on the BN254 scalar field (30 trials):

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --trials 30 --methods base,i64 single 26
| Percentile | Base prove (ms) | Small-value prove (ms) | Speedup |
|---|---|---|---|
| 25% | 2,601 | 1,360 | 1.91× |
| 50% (median) | 2,610 | 1,367 | 1.91× |
| 75% | 2,640 | 1,385 | 1.91× |
| 90% | 2,683 | 1,468 | 1.83× |
  • Mean speedup: 1.91× (base: 2,662ms → small-value: 1,390ms)
  • Lower variance in optimized version
  • Consistent speedup across all percentiles

Scaling Across Problem Sizes

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --methods base,i64 range-sweep --min 16 --max 28
| num_vars | n | base prove (µs) | small-value prove (µs) | speedup |
|---|---|---|---|---|
| 16 | 65,536 | 9,636 | 4,321 | 2.23× |
| 17 | 131,072 | 10,008 | 4,618 | 2.17× |
| 18 | 262,144 | 14,634 | 7,468 | 1.96× |
| 19 | 524,288 | 33,984 | 12,477 | 2.72× |
| 20 | 1,048,576 | 46,076 | 22,357 | 2.06× |
| 21 | 2,097,152 | 92,270 | 41,879 | 2.20× |
| 22 | 4,194,304 | 203,444 | 84,042 | 2.42× |
| 23 | 8,388,608 | 343,588 | 172,579 | 1.99× |
| 24 | 16,777,216 | 697,024 | 341,831 | 2.04× |
| 25 | 33,554,432 | 1,335,911 | 687,493 | 1.94× |
| 26 | 67,108,864 | 2,626,311 | 1,404,205 | 1.87× |
| 27 | 134,217,728 | 5,211,726 | 2,842,161 | 1.83× |
| 28 | 268,435,456 | 10,554,689 | 5,962,386 | 1.77× |

Key observations:

  • Consistent 1.8-2.7× speedup across all problem sizes on BN254
  • Speedup remains stable even at n = 2²⁸ (268M constraints)
  • Peak speedup of 2.72× at n = 2¹⁹
sumcheck_speedup

- Replace bare .unwrap() with .expect() containing descriptive messages
  for ilog2 conversions and field inversions in spartan.rs,
  spartan_zk.rs, neutronnova_zk.rs, and zk.rs

- Convert field inversions in neutronnova_zk.rs to return Result with
  ProofVerifyError instead of panicking, protecting against zero
  challenges

- Add overflow bounds documentation to all DelayedReduction accumulator
  types in delayed_reduction.rs explaining bit capacities and max
  accumulation counts

- Promote debug_assert to assert for critical invariants in extension.rs
  that would silently produce incorrect results if violated in release

- Add # Panics documentation to enforce_sc_claim in zk.rs

- Add descriptive expect message to batch_invert_array in basis.rs
@wu-s-john
Contributor Author

Cross-Field Comparison: Pallas and Vesta

The optimization provides consistent speedups across different field implementations, demonstrating that the benefit is not specific to BN254's field arithmetic.

BN254 (Fr):

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field bn254-fr --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 7,409 | 4,825 | 1.54× |
| 17 | 131,072 | 10,481 | 8,704 | 1.20× |
| 18 | 262,144 | 19,211 | 15,555 | 1.24× |
| 19 | 524,288 | 28,453 | 27,045 | 1.05× |
| 20 | 1,048,576 | 63,309 | 49,089 | 1.29× |
| 21 | 2,097,152 | 111,088 | 97,782 | 1.14× |
| 22 | 4,194,304 | 224,928 | 184,728 | 1.22× |
| 23 | 8,388,608 | 419,936 | 337,845 | 1.24× |
| 24 | 16,777,216 | 966,793 | 842,851 | 1.15× |
| 25 | 33,554,432 | 1,688,485 | 1,378,239 | 1.23× |
| 26 | 67,108,864 | 3,492,649 | 3,024,462 | 1.15× |
| 27 | 134,217,728 | 6,785,269 | 6,637,933 | 1.02× |

Pallas (Fq):

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field pallas-fq --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 7,379 | 5,386 | 1.37× |
| 17 | 131,072 | 10,325 | 7,975 | 1.29× |
| 18 | 262,144 | 15,592 | 13,999 | 1.11× |
| 19 | 524,288 | 31,104 | 25,899 | 1.20× |
| 20 | 1,048,576 | 61,710 | 45,470 | 1.36× |
| 21 | 2,097,152 | 101,028 | 101,547 | 1.00× |
| 22 | 4,194,304 | 219,411 | 182,989 | 1.20× |
| 23 | 8,388,608 | 392,407 | 326,981 | 1.20× |
| 24 | 16,777,216 | 872,222 | 718,890 | 1.21× |
| 25 | 33,554,432 | 1,793,937 | 1,421,947 | 1.26× |
| 26 | 67,108,864 | 3,617,611 | 2,912,601 | 1.24× |
| 27 | 134,217,728 | 6,801,901 | 5,351,520 | 1.27× |

Vesta (Fp):

sudo nice -n -20 cargo run --release --no-default-features --example sumcheck_sweep -- --field vesta-fp --methods base,split-eq-dmr range-sweep --min 16 --max 27
| num_vars | n | base prove (µs) | split-eq prove (µs) | prove speedup |
|---|---|---|---|---|
| 16 | 65,536 | 6,823 | 4,623 | 1.48× |
| 17 | 131,072 | 12,137 | 7,841 | 1.55× |
| 18 | 262,144 | 16,188 | 12,901 | 1.25× |
| 19 | 524,288 | 28,954 | 21,832 | 1.33× |
| 20 | 1,048,576 | 61,831 | 47,901 | 1.29× |
| 21 | 2,097,152 | 100,097 | 92,516 | 1.08× |
| 22 | 4,194,304 | 193,884 | 197,480 | 0.98× |
| 23 | 8,388,608 | 398,991 | 306,962 | 1.30× |
| 24 | 16,777,216 | 770,999 | 693,625 | 1.11× |
| 25 | 33,554,432 | 1,709,271 | 1,424,109 | 1.20× |
| 26 | 67,108,864 | 2,948,199 | 2,483,610 | 1.19× |
| 27 | 134,217,728 | 6,842,589 | 5,356,907 | 1.28× |

Cross-field observations:

  • Both Pasta curves show consistent improvement at scale, confirming the optimization is field-agnostic
  • Occasional near-1× results at n = 2²¹-2²² may be due to cache effects at these intermediate sizes

@wu-s-john
Contributor Author

Assembly Analysis: Why Delayed Reduction is Faster

To understand the performance difference at the instruction level, we analyze the ARM64 assembly for inner-product operations. The example at examples/asm_compare.rs demonstrates four approaches:

cargo asm --example asm_compare inner_product_field_field_base   # Eager reduction
cargo asm --example asm_compare inner_product_field_field_dmr    # Delayed reduction
cargo asm --example asm_compare inner_product_field_i64          # Field × i64
cargo asm --example asm_compare inner_product_field_i128         # Field × i128

Source Code (examples/asm_compare.rs):

use halo2curves::bn256::Fr as Bn254Fr;
use spartan2::small_field::{DelayedReduction, SignedWideLimbs, WideLimbs};

/// Function 1: Field × Field with EAGER reduction (base approach)
/// Each multiplication triggers a Montgomery reduction inside the loop.
#[inline(never)]
pub fn inner_product_field_field_base(a: &[Bn254Fr], b: &[Bn254Fr]) -> Bn254Fr {
    let mut acc = Bn254Fr::zero();
    for (ai, bi) in a.iter().zip(b.iter()) {
        acc += *ai * *bi; // Montgomery reduction happens here on each iteration
    }
    acc
}

/// Function 2: Field × Field with DELAYED modular reduction (DMR)
/// Accumulates in WideLimbs<9> (576 bits) and reduces once at the end.
#[inline(never)]
pub fn inner_product_field_field_dmr(a: &[Bn254Fr], b: &[Bn254Fr]) -> Bn254Fr {
    let mut acc = WideLimbs::<9>::default();
    for (ai, bi) in a.iter().zip(b.iter()) {
        // Wide accumulation - no modular reduction here
        <Bn254Fr as DelayedReduction<Bn254Fr>>::unreduced_multiply_accumulate(&mut acc, ai, bi);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<Bn254Fr>>::reduce(&acc)
}

/// Function 3: Field × i64 with delayed modular reduction
/// Uses SignedWideLimbs<6> (384 bits) for signed small values.
#[inline(never)]
pub fn inner_product_field_i64(fields: &[Bn254Fr], values: &[i64]) -> Bn254Fr {
    let mut acc = SignedWideLimbs::<6>::default();
    for (f, v) in fields.iter().zip(values.iter()) {
        // Fused multiply-accumulate into wide limbs
        <Bn254Fr as DelayedReduction<i64>>::unreduced_multiply_accumulate(&mut acc, f, v);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<i64>>::reduce(&acc)
}

/// Function 4: Field × i128 with delayed modular reduction
/// Uses SignedWideLimbs<7> (448 bits) for larger signed values.
#[inline(never)]
pub fn inner_product_field_i128(fields: &[Bn254Fr], values: &[i128]) -> Bn254Fr {
    let mut acc = SignedWideLimbs::<7>::default();
    for (f, v) in fields.iter().zip(values.iter()) {
        // Two-pass multiply-accumulate (low 64 bits, then high 64 bits)
        <Bn254Fr as DelayedReduction<i128>>::unreduced_multiply_accumulate(&mut acc, f, v);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<i128>>::reduce(&acc)
}

Instruction Count Per Multiply-Accumulate Operation:

| Function | Loop Instructions | mul | umulh | adds/adcs/adc | cinc/cset | Notes |
|---|---:|---:|---:|---:|---:|---|
| base | 250 | 36 | 32 | 89 | 53 | Includes Montgomery REDC |
| dmr | 109 | 16 | 16 | 49 | 22 | Wide accumulation only |
| i64 | 54 | 4 | 4 | 8 | 8 | 4×1 multiply + sign handling |
| i128 | 92 | 8 | 8 | 16 | 15 | 4×2 (two-pass) + sign handling |

Detailed Breakdown:

Base (250 instructions/iteration):

57× adds    - Addition with flags
50× cinc    - Conditional increment (carry propagation)
36× mul     - Low 64-bit multiply
32× umulh   - High 64-bit multiply
24× adcs    - Add with carry + set flags
 8× asr     - Arithmetic shift (modular reduction)
 8× and     - Mask operations (reduction)
 8× adc     - Add with carry
 4× ldp     - Load pair
 4× cmn     - Compare negative

The 36 muls = 16 (4×4 product) + 20 (Montgomery REDC)
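The 4×4 = 16 part of that count follows directly from schoolbook limb multiplication: a 256×256 → 512-bit product touches every pair of 64-bit limbs exactly once. A toy sketch in plain Rust (illustrative only, not the library's implementation) makes this concrete; each `u128` product in the inner loop compiles to one `mul` (low half) plus one `umulh` (high half) on ARM64:

```rust
/// Toy schoolbook 4x4-limb widening multiply (NOT the library's code).
/// Each u128 product below is one `mul` + one `umulh` on ARM64, so a full
/// 256x256 -> 512-bit product costs exactly 4*4 = 16 of each.
fn widening_mul_4x4(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
    let mut out = [0u64; 8];
    for i in 0..4 {
        let mut carry: u64 = 0;
        for j in 0..4 {
            // One limb-pair product, plus the partial sum and carry.
            // Cannot overflow u128: (2^64-1)^2 + 2*(2^64-1) = 2^128 - 1.
            let t = (a[i] as u128) * (b[j] as u128)
                + (out[i + j] as u128)
                + (carry as u128);
            out[i + j] = t as u64;
            carry = (t >> 64) as u64;
        }
        out[i + 4] = carry; // this slot is still untouched, so no add needed
    }
    out
}
```

The remaining `adds`/`adcs`/`cset` instructions in the listing are the carry propagation that the `carry` variable models here.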

DMR (109 instructions/iteration):

32× adds    - Addition with flags
16× umulh   - High 64-bit multiply
16× mul     - Low 64-bit multiply  ← exactly 4×4 = 16
16× cset    - Conditional set (carry capture)
10× adc     - Add with carry
 7× adcs    - Add with carry + set flags
 6× cinc    - Conditional increment
 4× ldp     - Load pair

Pure 4×4 multiply + 9-limb accumulation. No reduction in loop.

i64 (54 instructions/iteration):

17× csel    - Conditional select (pos/neg accumulator)
 8× adds    - Addition with flags
 7× cinc    - Conditional increment
 4× umulh   - High 64-bit multiply  ← 4×1 = 4
 4× mul     - Low 64-bit multiply   ← 4×1 = 4
 3× tst     - Test sign bit
 2× ldp     - Load pair

Only 8 multiply instructions (4 mul + 4 umulh) for 4×1 product.
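The 4×1 count can be seen the same way: with a single 64-bit limb on the right-hand side, only four widening products are needed. A toy sketch (not the library's code; the real i64 path additionally folds the sign of `v` into a positive or negative accumulator, which is where the `csel`/`tst` instructions come from):

```rust
/// Toy 4-limb x 1-limb widening multiply (NOT the library's code).
/// Only 4*1 = 4 widening products are needed, matching the
/// 4 `mul` + 4 `umulh` counted in the listing above.
fn mul_4x1(a: [u64; 4], b: u64) -> [u64; 5] {
    let mut out = [0u64; 5];
    let mut carry: u64 = 0;
    for i in 0..4 {
        // One u128 product = one `mul` + one `umulh` on ARM64.
        let t = (a[i] as u128) * (b as u128) + (carry as u128);
        out[i] = t as u64;
        carry = (t >> 64) as u64;
    }
    out[4] = carry;
    out
}
```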

i128 (92 instructions/iteration):

20× csel    - Conditional select (pos/neg)
16× adds    - Addition with flags
15× cinc    - Conditional increment
 8× umulh   - High 64-bit multiply  ← 4×2 = 8
 8× mul     - Low 64-bit multiply   ← 4×2 = 8
 4× ldp     - Load pair
 3× tst     - Test sign bit
 2× eor     - XOR (sign handling)

16 multiply instructions (8 mul + 8 umulh) for two-pass 4×2 product.

Cost Comparison for n iterations:

| Method | Total Loop Cost | Reduction Cost | Formula |
|---|---:|---:|---|
| base | 250n | 0 (in loop) | 250n |
| dmr | 109n | ~200 (once) | 109n + 200 |
| i64 | 54n | ~150 (once) | 54n + 150 |
| i128 | 92n | ~150 (once) | 92n + 150 |

Key insight: The base method spends ~170 instructions (68%) on Montgomery reduction inside the loop. Delayed reduction moves this cost outside the loop, paying it only once regardless of iteration count.
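The pattern can be illustrated end-to-end with a toy delayed-reduction inner product. This is a sketch under simplifying assumptions, not the crate's implementation: the modulus is a stand-in 64-bit prime rather than the 254-bit BN254 scalar field, the wide accumulator is a single `u128` rather than `WideLimbs`, and inputs are assumed to fit in 32 bits so the accumulator cannot overflow:

```rust
// Stand-in 64-bit prime (Goldilocks); the real code reduces modulo the
// 254-bit BN254 scalar field instead.
const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Eager: reduce mod P after every product, like the `base` loop.
fn inner_product_eager(a: &[u64], b: &[u64]) -> u64 {
    let mut acc: u64 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        let prod = ((*x as u128) * (*y as u128) % (P as u128)) as u64;
        acc = ((acc as u128 + prod as u128) % (P as u128)) as u64;
    }
    acc
}

/// Delayed: accumulate wide, reduce once, like the `dmr` loop.
/// Assumes inputs fit in 32 bits, so each product fits in 64 bits and the
/// u128 accumulator can absorb ~2^64 terms before overflowing.
fn inner_product_delayed(a: &[u64], b: &[u64]) -> u64 {
    let mut acc: u128 = 0;
    for (x, y) in a.iter().zip(b.iter()) {
        acc += (*x as u128) * (*y as u128); // no reduction in the loop
    }
    (acc % (P as u128)) as u64 // single reduction at the end
}
```

Because sums commute with modular reduction, both variants return the same result; by the per-iteration costs tabulated above (250n vs. 109n + 200), the one-time reduction is amortized after only two iterations.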

Timing Benchmark (n = 65536, 10 trials, BN254):

cargo run --example asm_compare --release
Inner Product Implementations - Assembly Comparison (BN254)


Timing Benchmark (n = 65536, 10 trials)
----------------------------------------
  base (field×field eager):     1598.8 µs
  dmr  (field×field delayed):    640.0 µs  (2.50× faster)
  i64  (field×i64 delayed):      329.4 µs  (4.85× faster than base)
  i128 (field×i128 delayed):     506.2 µs  (3.16× faster than base)

Instruction counts per iteration (from assembly):
  base: 250 instrs  |  dmr: 109 instrs  |  i64: 54 instrs  |  i128: 92 instrs
