Implement Small-Value Sum-Check Optimization (Algorithm 6) #98
wu-s-john wants to merge 108 commits into microsoft:main
Introduce UdPoint, UdHatPoint, UdTuple, and ValueOneExcluded types in src/lagrange.rs for representing evaluation domains U_d and Û_d used in the small-value sumcheck optimization.
Implements LagrangeEvaluatedMultilinearPolynomial with a from_multilinear() factory method that extends evaluations from {0,1}^n to U_d^n.
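The per-variable extension step can be sketched as follows. This is an illustrative stand-in over i64 with an assumed [∞, 0, 1, ..., d-1] ordering; the real LagrangeEvaluatedMultilinearPolynomial works over field/small-value types and may order the domain differently.

```rust
/// Given evaluations [p(0), p(1)] of one multilinear variable, return its
/// evaluations on U_d = {∞, 0, 1, ..., d-1}. Since p is degree ≤ 1,
/// p(u) = p(0) + u·(p(1) - p(0)), and the "value at ∞" is the leading
/// coefficient p(1) - p(0).
fn extend_one_var(p0: i64, p1: i64, d: i64) -> Vec<i64> {
    let slope = p1 - p0;
    let mut out = Vec::with_capacity(d as usize + 1);
    out.push(slope); // value at ∞ (leading coefficient)
    for u in 0..d {
        out.push(p0 + u * slope); // p(u) for u = 0, 1, ..., d-1
    }
    out
}
```

Applying this step once per variable extends a table of 2^n boolean evaluations to U_d^n.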
Introduces RoundAccumulator and SmallValueAccumulators for the small-value sumcheck optimization (Algorithm 6). Uses flat Vec<[Scalar; D]> storage with const generic D for cache efficiency and vectorizable merge operations in parallel fold-reduce.
Parameterize UdPoint, UdHatPoint, UdTuple, and LagrangeEvaluatedMultilinearPolynomial with const generic D to enable:
- Compile-time enforcement that domain types match accumulator degree
- Debug assertions for bounds checking (v < D in constructors)
- Elimination of the runtime base parameter from to_flat_index()
This prevents mixing domain sizes at compile time and catches out-of-bounds errors in debug builds.
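The const-generic flat indexing can be sketched as mixed-radix digit packing; this is a hypothetical reconstruction (the real to_flat_index() may order digits or encode ∞ differently):

```rust
/// Hypothetical flat index for a U_d^n prefix with const-generic D: each
/// coordinate is a digit in base D+1 (∞ encoded as digit D), so no runtime
/// base parameter is needed and out-of-range digits are caught in debug builds.
fn to_flat_index<const D: usize>(digits: &[usize]) -> usize {
    digits.iter().fold(0, |acc, &d| {
        debug_assert!(d <= D, "digit out of U_d range");
        acc * (D + 1) + d
    })
}
```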
Implement AccumulatorPrefixIndex and compute_idx4() which maps evaluation prefixes β ∈ U_d^ℓ₀ to accumulator contributions by decomposing β into prefix v, coordinate u ∈ Û_d, and binary suffix y.
Extracts strided polynomial evaluations for all binary prefixes b ∈ {0,1}^ℓ₀ given a fixed suffix, bridging full polynomials to Procedure 6 (Lagrange extension).
Added a parallel build_accumulators that binds suffixes, extends prefixes to the Ud domain, applies the ∞/Cz rule, and routes contributions via cached idx4 with E_in/E_out weighting. Expanded accumulator tests with a naive cross-check, ∞ handling, and binary-β zero behavior to validate correctness. Cleaned up dead-code allowances now that the code paths are used.
Added explicit MSB-first checks for eq table generation, gather_prefix_evals stride/pattern, and bind_poly_var_top to ensure "top" binds the MSB. These tests catch silent index/order regressions across components.
Compute ℓ_i(X) = eq(w[<i], r[<i]) · eq(w_i, X) values for sum-check rounds: ℓ_i(0) = α_i(1−w_i), ℓ_i(1) = α_i·w_i, ℓ_i(∞) = α_i(2w_i−1), where α_i = eq(w[<i], r[<i]).
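A toy version of these three evaluations, using plain i64 in place of field elements (the factor shapes follow the formulas above; the real code works over a prime field):

```rust
/// Evaluations of the round factor ℓ(X) = α·eq(w, X) = α·((1−w)(1−X) + w·X):
/// ℓ(0) = α(1−w), ℓ(1) = α·w, and the leading coefficient ("value at ∞")
/// is ℓ(1) − ℓ(0) = α(2w−1).
fn eq_round_factor(alpha: i64, w: i64) -> (i64, i64, i64) {
    let at0 = alpha * (1 - w);
    let at1 = alpha * w;
    let at_inf = alpha * (2 * w - 1); // equals at1 - at0
    (at0, at1, at_inf)
}
```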
Replace range-indexed loops and a redundant closure with iterator forms
Add eq-round linear factor utilities and accumulator evaluation to derive t_i and build s_i polynomials.
Track R_i and ℓ_i state to compare accumulator evals with EqSumCheckInstance rounds.
Switch Spartan t_i to D=2 aliases/tests, precompute idx4 prefix/suffix data, and flatten accumulator caches to cut allocations.
CSR (Compressed Sparse Row) stores variable-length lists with 2 allocations instead of N+1, improving cache locality. Replaces the ad-hoc offsets/entries arrays in build_accumulators.
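A minimal sketch of the CSR layout described above (field and method names here are illustrative, not the crate's actual API): all entries live in one Vec, with offsets[i]..offsets[i+1] delimiting list i, so the structure needs exactly two allocations regardless of how many lists it holds.

```rust
/// Minimal CSR ("compressed sparse row") container for variable-length lists.
struct Csr<T> {
    offsets: Vec<usize>, // length = num_lists + 1
    entries: Vec<T>,     // all lists concatenated
}

impl<T> Csr<T> {
    fn from_lists(lists: Vec<Vec<T>>) -> Self {
        let mut offsets = Vec::with_capacity(lists.len() + 1);
        let mut entries = Vec::new();
        offsets.push(0);
        for list in lists {
            entries.extend(list);
            offsets.push(entries.len()); // record the end of each row
        }
        Csr { offsets, entries }
    }

    fn row(&self, i: usize) -> &[T] {
        &self.entries[self.offsets[i]..self.offsets[i + 1]]
    }
}
```

Rows are returned as slices into the single entries buffer, which is what gives the cache-locality benefit over N separate Vecs.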
- Add prove_cubic_with_three_inputs_small_value, combining the small-value optimization for the first ℓ₀ rounds with the eq-poly optimization for the remaining rounds
- Introduce a SPARTAN_T_DEGREE constant to centralize the polynomial degree parameter
- Add sumcheck_sweep.rs examples for performance comparison
The new from_boolean_evals_with_buffer_reusing method takes caller-provided scratch buffers and alternates between them during extension. This reduces allocations in build_accumulators from O(num_x_in × num_x_out) per call to O(num_threads) buffers allocated once per thread.
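The buffer-alternation ("ping-pong") pattern can be sketched as below. The extension step here is a placeholder (it just duplicates entries); the point is that the two caller-provided buffers swap roles each round, so no per-call allocation occurs.

```rust
/// Apply `rounds` extension steps to `input`, reusing two caller-provided
/// scratch buffers. Each round reads from one buffer and writes the other,
/// then the buffers swap roles via mem::swap.
fn double_each_round(input: &[i64], rounds: usize, a: &mut Vec<i64>, b: &mut Vec<i64>) -> Vec<i64> {
    a.clear();
    a.extend_from_slice(input);
    for _ in 0..rounds {
        b.clear();
        for &x in a.iter() {
            // Placeholder "extension": duplicate each entry.
            b.push(x);
            b.push(x);
        }
        std::mem::swap(a, b); // ping-pong: output becomes next round's input
    }
    a.clone()
}
```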
Add Spartan and generic build_accumulators variants. The Spartan version (D=2) skips binary betas, since satisfying witnesses have Az·Bz = Cz on {0,1}^n; the generic version supports arbitrary polynomial products.
Adds a new example testing that prove_cubic_with_three_inputs and prove_cubic_with_three_inputs_small_value produce identical proofs when used with a real SHA256 circuit (Algorithm 6 validation). Changes:
- Add PartialEq, Eq derives to SumcheckProof for proof comparison
- Add extract_outer_sumcheck_inputs helper to SpartanSNARK
- Add examples/sumcheck_sha256_equivalence.rs
Implement the small × large multiplication optimization from "Speeding Up Sum-Check Proving" using Barrett reduction for ~3× speedup over naive field multiplication.
Key changes:
- Add SmallValueField trait for type-safe i32/i64 small-value operations
- Implement Barrett reduction for Pallas Fp and Fq (sl_mul, isl_mul)
- Add SpartanAccumulatorInput trait to unify field and i32 witness handling
- Make LagrangeEvaluatedMultilinearPolynomial generic over element type
- Update sumcheck prover to accept separate i32 witness polynomials
- Clean up MultilinearPolynomial<i32>: remove unused from_u32/from_u64/from_field
Replace raw arrays and ad-hoc structs with proper abstractions for the U_d = {∞, 0, 1, ..., D-1} and Û_d = U_d \ {1} evaluation domains. Remove EqRoundValues in favor of UdEvaluations<F, 2>.
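One plausible shape for such a domain abstraction, shown as a sketch (the crate's UdPoint/UdHatPoint are const-generic over D and debug-assert the v < D invariant, per the commits above; the exact representation here is an assumption):

```rust
/// A point of U_d = {∞, 0, 1, ..., D-1}, with D fixed at compile time.
#[derive(Clone, Copy, PartialEq, Debug)]
enum UdPoint<const D: usize> {
    Infinity,
    Finite(usize), // invariant: value < D
}

impl<const D: usize> UdPoint<D> {
    /// Construct a finite point, checking the domain bound in debug builds.
    fn finite(v: usize) -> Self {
        debug_assert!(v < D, "finite value out of U_d domain");
        UdPoint::Finite(v)
    }
}
```

Encoding the domain size in the type means a U_2 point cannot be passed where a U_3 point is expected, which is the compile-time mixing guarantee the commits describe.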
- Delete unused constructor/predicate methods from UdPoint and UdHatPoint
- Move test-only methods (alpha, prefix_len, suffix_len, extend_from_boolean) to cfg(test) impl blocks
- Add CachedPrefixIndex struct with a From impl to accumulator_index.rs
- Remove unused QuadraticTAccumulatorPrefixIndex type alias
- Delete unused eq_factor_alpha method from sumcheck
Hoist scratch buffers to thread-local state in build_accumulators_spartan and build_accumulators. Previously, 5 vectors were allocated on every x_out iteration; now allocations happen once per Rayon thread subdivision.
- Add extend_in_place to LagrangeEvaluatedMultilinearPolynomial (avoids .to_vec())
- Add SpartanThreadState and GenericThreadState structs for buffer reuse
- Extract thread state structs into a thread_state_accumulators module
Reduces allocations from O(num_x_out × num_x_in) to O(num_threads).
Move the witness polynomial abstraction trait from accumulators.rs to its own module for better code organization. Rename from SpartanAccumulatorInput to SpartanAccumulatorInputPolynomial to clarify that it abstracts over multilinear polynomial representations (field elements vs small values).
Extract WideMul trait (small × small → product) into wide_mul.rs, separating widening multiplication from delayed reduction. This enables cleaner generic bounds where only one capability is needed.
Key changes:
- Add WideMul trait with i32→i64 and i64→i128 implementations
- Refactor DelayedReduction to be generic over value type (i32/i64/i128/F)
- Rename multiply_accumulate → unreduced_multiply_accumulate for clarity
- Remove LagrangeAccumulatorField convenience trait (use explicit bounds)
- Make multiply_vec_small and fold functions generic over SmallValue
- Delete impls.rs, consolidating implementations into trait modules
- Update documentation: fix outdated terminology and add limb size context
The trait hierarchy is now:
- WideMul: small × small → product (pure integer widening)
- SmallValueField: field ↔ small conversions
- DelayedReduction<T>: field × T accumulation without reduction
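The WideMul layer of that hierarchy can be sketched as follows; the trait name and impls mirror the commit message, but the details are an assumption about the actual code:

```rust
/// Widening multiply: small × small → a wider product type, with no
/// field or reduction logic involved.
trait WideMul {
    type Product;
    fn wide_mul(self, rhs: Self) -> Self::Product;
}

impl WideMul for i32 {
    type Product = i64;
    fn wide_mul(self, rhs: Self) -> i64 {
        // i32 × i32 fits in 62 bits, so the i64 product cannot overflow.
        (self as i64) * (rhs as i64)
    }
}

impl WideMul for i64 {
    type Product = i128;
    fn wide_mul(self, rhs: Self) -> i128 {
        (self as i128) * (rhs as i128)
    }
}
```

Keeping widening separate from DelayedReduction means code that only needs exact integer products can bound on WideMul alone, without dragging in field conversions.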
- Replace BenchmarkBox wrapper with direct enum dispatch
- Move run logic to a standalone run_benchmark<E, B> function
- Simplify the SumcheckBenchmark trait to just convert() and prove()
- Remove build_benchmarks function in favor of direct dispatch
- Add i32 support for BN254 (SupportsSmallI32 marker trait)
- Fix clippy warning for unnecessary cast in small_field
- Add a timed() helper to reduce timing boilerplate across 5 call sites
- Extract a TRANSCRIPT_LABEL constant (was hardcoded 3 times)
- Rename run_chain_benchmark → run_sumcheck_benchmark for clarity
- Remove an outdated comment referencing PallasHyraxEngine
- Reorganize imports into the spartan2 block
- Extract FieldChoice into a shared cli module
- Consolidate timing phases into (name, short_name) tuples with HashMap storage
- Simplify the NeutronNova benchmark: remove the rounds param and subcommands
- Clean up phase names: remove prep_ml, add nifs_prove, unify span names
- Make sha256.rs generic over engine type
Delete examples/sumcheck_sha256_equivalence.rs and add an equivalent test in src/small_sumcheck.rs. Tests that prove_cubic_small_value produces identical output to prove_cubic_with_three_inputs using SmallSha256Circuit.
- ExtensionBound<SV, D> precomputes the max_safe bound once for reuse
- Make prove_small_value and helpers generic over SV (i32/i64)
- Remove unused Debug bounds from SV and SV::Product
- Update docs to reference ExtensionBound
- Remove the `build_accumulators` generic function (was marked dead_code)
- Remove `gather_prefix_evals` from MultilinearPolynomial
- Add standalone `extend_to_lagrange_domain` function (Procedure 6)
- Remove helper functions only used by deleted code: `fill_eyx`, `scatter_beta_contributions`
- Clean up unused fields and imports across the lagrange_accumulator module
- Remove tests that depended on deleted functions
- Change `LagrangeAccumulators::rounds` from `pub` to `pub(crate)`
- Change `LagrangeEvals::infinity` and `finite` from `pub` to `pub(crate)`
- Fix a stale comment referencing non-existent `cz_ext` and `cz_pref` variables
- Use the `data_mut()` accessor instead of direct field access for consistency
Change eq_cache from [round][y * num_x + x] to [round][x * num_y + y] so each parallel task (fixed x_out) accesses a contiguous memory block. Also remove unnecessary clone of e_in by borrowing directly.
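The layout change boils down to which coordinate varies fastest; a plain-index sketch (not the real eq_cache type) makes the contiguity claim concrete:

```rust
/// x-major: entries for a fixed x (one parallel task) are consecutive.
fn x_major(x: usize, y: usize, num_y: usize) -> usize {
    x * num_y + y
}

/// y-major (the old layout): entries for a fixed x stride by num_x.
fn y_major(x: usize, y: usize, num_x: usize) -> usize {
    y * num_x + x
}
```

With x-major indexing, a task iterating y for fixed x touches one contiguous block of num_y entries instead of num_y cache lines scattered across the table.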
NeutronNova NIFS Benchmark

NeutronNova folds multiple R1CS instances using a NIFS (Non-Interactive Folding Scheme) sumcheck. This benchmark measures end-to-end proving time with a detailed phase breakdown. Workload: 8 instances × 128 SHA-256 hashes each (4,194,304 constraints per instance).

NIFS Folding Only

Full ZK Prove

Key observations:

Future optimizations with native small-value types: if the R1CS matrices and witness are stored natively as small-value types (rather than field elements), additional speedups are possible.

Remaining bottlenecks and GPU acceleration opportunities: the phase breakdown reveals that the NIFS sumcheck
End-to-End Spartan Proving Breakdown

Full Spartan proving pipeline with detailed phase breakdown. Single SHA-256 hash (varying message sizes).

msg=1024B, constraints=1,048,576:

msg=2048B, constraints=2,097,152:
At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.91× speedup on the BN254 scalar field (30 trials):
Scaling Across Problem Sizes
Key observations:
- Replace bare .unwrap() with .expect() containing descriptive messages for ilog2 conversions and field inversions in spartan.rs, spartan_zk.rs, neutronnova_zk.rs, and zk.rs
- Convert field inversions in neutronnova_zk.rs to return Result with ProofVerifyError instead of panicking, protecting against zero challenges
- Add overflow bounds documentation to all DelayedReduction accumulator types in delayed_reduction.rs, explaining bit capacities and max accumulation counts
- Promote debug_assert to assert for critical invariants in extension.rs that would silently produce incorrect results if violated in release
- Add # Panics documentation to enforce_sc_claim in zk.rs
- Add a descriptive expect message to batch_invert_array in basis.rs
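The panic-to-Result pattern for inversion can be illustrated with a toy modular inverse over a small prime (the real code returns ProofVerifyError from field inversions; the names and error type here are stand-ins):

```rust
/// Modular exponentiation by squaring over a small prime p.
fn pow_mod(mut base: u64, mut exp: u64, p: u64) -> u64 {
    let mut acc = 1u64;
    base %= p;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = acc * base % p;
        }
        base = base * base % p;
        exp >>= 1;
    }
    acc
}

/// Returns Err instead of panicking when asked to invert zero.
fn try_invert(x: u64, p: u64) -> Result<u64, String> {
    if x % p == 0 {
        return Err("cannot invert zero".to_string());
    }
    // Fermat's little theorem: x^(p-2) ≡ x^(-1) (mod p).
    Ok(pow_mod(x, p - 2, p))
}
```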
Cross-Field Comparison: Pallas and Vesta

The optimization provides consistent speedups across different field implementations, demonstrating that the benefit is not specific to BN254's field arithmetic.

BN254 (Fr):
Pallas (Fq):
Vesta (Fp):
Cross-field observations:
Assembly Analysis: Why Delayed Reduction is Faster

To understand the performance difference at the instruction level, we analyze the ARM64 assembly for inner-product operations:

```
cargo asm --example asm_compare inner_product_field_field_base  # Eager reduction
cargo asm --example asm_compare inner_product_field_field_dmr   # Delayed reduction
cargo asm --example asm_compare inner_product_field_i64         # Field × i64
cargo asm --example asm_compare inner_product_field_i128        # Field × i128
```

Source Code:

```rust
use ff::Field; // brings `zero()` into scope for Bn254Fr
use halo2curves::bn256::Fr as Bn254Fr;
use spartan2::small_field::{DelayedReduction, SignedWideLimbs, WideLimbs};

/// Function 1: Field × Field with EAGER reduction (base approach)
/// Each multiplication triggers a Montgomery reduction inside the loop.
#[inline(never)]
pub fn inner_product_field_field_base(a: &[Bn254Fr], b: &[Bn254Fr]) -> Bn254Fr {
    let mut acc = Bn254Fr::zero();
    for (ai, bi) in a.iter().zip(b.iter()) {
        acc += *ai * *bi; // Montgomery reduction happens here on each iteration
    }
    acc
}

/// Function 2: Field × Field with DELAYED modular reduction (DMR)
/// Accumulates in WideLimbs<9> (576 bits) and reduces once at the end.
#[inline(never)]
pub fn inner_product_field_field_dmr(a: &[Bn254Fr], b: &[Bn254Fr]) -> Bn254Fr {
    let mut acc = WideLimbs::<9>::default();
    for (ai, bi) in a.iter().zip(b.iter()) {
        // Wide accumulation - no modular reduction here
        <Bn254Fr as DelayedReduction<Bn254Fr>>::unreduced_multiply_accumulate(&mut acc, ai, bi);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<Bn254Fr>>::reduce(&acc)
}

/// Function 3: Field × i64 with delayed modular reduction
/// Uses SignedWideLimbs<6> (384 bits) for signed small values.
#[inline(never)]
pub fn inner_product_field_i64(fields: &[Bn254Fr], values: &[i64]) -> Bn254Fr {
    let mut acc = SignedWideLimbs::<6>::default();
    for (f, v) in fields.iter().zip(values.iter()) {
        // Fused multiply-accumulate into wide limbs
        <Bn254Fr as DelayedReduction<i64>>::unreduced_multiply_accumulate(&mut acc, f, v);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<i64>>::reduce(&acc)
}

/// Function 4: Field × i128 with delayed modular reduction
/// Uses SignedWideLimbs<7> (448 bits) for larger signed values.
#[inline(never)]
pub fn inner_product_field_i128(fields: &[Bn254Fr], values: &[i128]) -> Bn254Fr {
    let mut acc = SignedWideLimbs::<7>::default();
    for (f, v) in fields.iter().zip(values.iter()) {
        // Two-pass multiply-accumulate (low 64 bits, then high 64 bits)
        <Bn254Fr as DelayedReduction<i128>>::unreduced_multiply_accumulate(&mut acc, f, v);
    }
    // Single reduction at the end
    <Bn254Fr as DelayedReduction<i128>>::reduce(&acc)
}
```

Instruction Count Per Multiply-Accumulate Operation:
Detailed Breakdown:
- Base (250 instructions/iteration): the 36 muls = 16 (4×4 product) + 20 (Montgomery REDC)
- DMR (109 instructions/iteration): pure 4×4 multiply + 9-limb accumulation; no reduction in the loop
- i64 (54 instructions/iteration): only 8 multiply instructions (4 mul + 4 umulh) for the 4×1 product
- i128 (92 instructions/iteration): 16 multiply instructions (8 mul + 8 umulh) for the two-pass 4×2 product

Cost Comparison for n iterations:
Key insight: the base method spends ~170 instructions (68%) on Montgomery reduction inside the loop. Delayed reduction moves this cost outside the loop, paying it only once regardless of iteration count.

Timing Benchmark (n = 65536, 10 trials, BN254):

```
cargo run --example asm_compare --release
```


Summary
This PR implements Algorithm 6 ("Small-Value Sum-Check with Eq-Poly Optimization") from the paper "Speeding Up Sum-Check Proving" by Bagad, Dao, Domb, and Thaler. The optimization targets Spartan's first sum-check invocation where witness polynomial evaluations are small integers (fitting in i32/i64), enabling significant prover speedups by replacing expensive field multiplications with cheaper native integer operations.
Key Insight
In the sum-check protocol, round 1 computations involve only small values (the original witness evaluations). From round 2 onward, evaluations become "large" due to binding to random verifier challenges. Algorithm 6 delays this binding using Lagrange interpolation, computing accumulators over small values in the first ℓ₀ rounds before switching to the standard linear-time prover.
Multiplication Cost Hierarchy:
For Spartan with degree-2 polynomials, Algorithm 6 reduces large×large (ll) multiplications from O(N) to O(N/2^ℓ₀) at the cost of O((3/2)^ℓ₀ · N) small×small (ss) multiplications.
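A back-of-envelope count from this trade-off (a sketch of the asymptotic statement above, not the paper's exact operation accounting):

```rust
/// Rough multiplication counts for problem size n and small-value rounds l0:
/// large×large drops to n / 2^l0, while small×small grows as (3/2)^l0 · n.
fn mult_counts(n: u64, l0: u32) -> (u64, u64) {
    let ll = n >> l0; // n / 2^l0
    let ss = (n as f64 * 1.5f64.powi(l0 as i32)) as u64; // (3/2)^l0 · n
    (ll, ss)
}
```

For n = 2^26 and ℓ₀ = 3 this trades ~59M ll multiplications away for ~226M ss multiplications, which is a win whenever an ss multiply is much cheaper than an ll one.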
Benchmarks
Measured on M1 Max MacBook Pro (10 cores, 64GB RAM) with jemalloc. Note: halo2curves/asm is not enabled (unavailable on Apple Silicon).

Headline Result: 1.83× Speedup on BN254
At n = 2²⁶ (67M constraints) with ℓ₀ = 3, the small-value optimization achieves 1.83× speedup on the BN254 scalar field (30 trials):
Scaling Across Problem Sizes
Key observations:
Delayed Modular Reduction Impact
To isolate the impact of delayed modular reduction (DMR), we compare performance with and without DMR enabled.
Accumulator Building Phase
The accumulator building phase (Procedure 9) benefits most dramatically from DMR, as it performs many small×small multiplications that would otherwise require modular reduction after each operation.
Key observations:
Time Breakdown: First ℓ₀ Rounds vs Remaining Rounds
With ℓ₀ = 3, the work is balanced between the accumulator-based first rounds and the standard sumcheck remaining rounds:
Key observations:
Split-Eq Sumcheck with DMR
For the split-eq sumcheck (which uses pre-split eq-polynomial tables), DMR provides additional speedup by delaying modular reductions in the remaining rounds.
Key observations:
SHA-256 Chain Benchmark
To demonstrate real-world applicability, we benchmark proving SHA-256 hash chains. This workload approximates a major component of Solana light client verification.
Key observations:
Solana Light Client Comparison
A Solana light client verifying block finality requires:
SHA-256 equivalent cost:
Implementation
Core Components
SmallValueField trait (src/small_field.rs)
- SmallValue (i32) and IntermediateSmallValue (i64) types
- sl_mul and isl_mul for BN254/BLS12-381 (~3× faster than ll)

Lagrange Domain Extension (src/lagrange.rs)
- LagrangeEvaluatedMultilinearPolynomial<T, D> for extending boolean evaluations to U_d = {∞, 0, 1, ..., d-1}
- extend_in_place with ping-pong buffers
- gather_prefix_evals for efficient prefix collection (Procedure 6)

Accumulator Data Structures (src/accumulators.rs, src/accumulator_index.rs)
- SmallValueAccumulators<S, D> storing A_i(v, u) with O(1) indexing via UdTuple
- idx4 mapping (Definition A.5) for distributing products to correct accumulators
- UdEvaluations and UdHatEvaluations wrappers

Procedure 9 Implementation (src/accumulators.rs)
- build_accumulators_spartan: optimized for Spartan's Az·Bz structure
- build_accumulators: generic version for arbitrary polynomial products

Thread-Local Buffer Reuse (src/thread_state_accumulators.rs)
- SpartanThreadState and GenericThreadState eliminate O(num_x_out) allocations

Sum-Check Integration (src/sumcheck.rs)
- SmallValueSumCheck::from_accumulators factory method

Algorithm Flow
Test Plan
- cargo test test_build_accumulators: verifies accumulator construction
- cargo test test_small_value: SmallValueField arithmetic correctness
- cargo test lagrange: Lagrange extension and interpolation
- cargo test sumcheck: full sum-check protocol equivalence
- cargo clippy: no warnings
- examples/sumcheck_sha256_equivalence.rs: verifies the new method produces identical proofs to baseline
- examples/sha256_chain_benchmark.rs: SHA-256 chain proving with CSV output

References