perf: reduce allocation contention and STIR equality cache traffic #203
Barnadrot wants to merge 14 commits into
Rayon's into_par_iter().chunks(n) creates a heap-allocated Vec per chunk. compute_eval_eq_packed and compute_eval_eq_base_packed are called per STIR query (274 in round 1) on a 2^22-element buffer split into 16-element chunks, producing 9.09M allocations per proof (70% of pipeline total). par_chunks_exact() iterates over existing slice chunks with zero allocation. Alloc count: 12.94M -> 3.64M (-71.8%). Criterion: -2.12%, p=0.00. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
handle_parallel_batch creates ~1399 parallel segments via Rayon, each building a TableTrace with ~170 columns. Columns started at capacity 0 and grew via push(), triggering ~150 reallocations per column per segment (~210K total realloc cascades), all competing for glibc arena locks. Add TableTrace::with_column_capacity() and Trace::with_capacity() that pre-size all column Vecs using the first segment's actual row counts. Alloc count: 3.64M -> 3.44M (-5.7%). Criterion: -3.25%, p=0.00. The large wall-clock improvement relative to the small alloc count reduction indicates this was primarily an allocator-contention fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…trixView Default vertically_packed_row_rtl calls wrapping_row_slices -> row_slice_unchecked -> collect_vec(), allocating a Vec per row. With P::WIDTH=16, each call allocates 16 Vecs. Merkle tree construction calls this for every row across all tree levels. DenseMatrix override: index the flat buffer directly with precomputed row_offsets[i] = ((r + i) % height) * width. Zero allocation, zero modulo in the inner loop. FlatMatrixView override: dispatch to inner DenseMatrix's wrapping_row_slices (1 alloc per call instead of 17), accessing extension coefficients via as_basis_coefficients_slice(). Alloc count: 3.44M -> 1.77M (-48.5%). Criterion: -1.90%, p=0.00. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Instead of 274 sequential full sweeps of the ~84MB weight buffer (one per STIR query point), split the output into L2-friendly tiles (2^14 = 16K unpacked elements = 320KB) and process all 274 queries per tile before moving to the next. Each tile's temp buffer and output slice fit in L2 (~640KB total), so DRAM traffic drops from ~67GB to ~168MB. The eq polynomial's multiplicative structure allows computing a per-tile prefix scalar (O(n-k) multiplications) that captures the contribution of upper bit positions, then filling only the tile-sized sub-range. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks! I will split this PR into independent small changes.
first: 0e78ca3
second: c889474
There remain 2 independent ideas in the PR:
About A]: I was not able to reproduce the perf improvement. One explanation is that there is no reason for each segment of the parallel execution to have exactly the same number of cycles. So on average half of the segments will exceed the cycle count of the first segment, and will need to reallocate memory.
About B]: this is weird. On mac M4 max I can see an improvement. On hetzner ax42u I can see a massive improvement when running on avx512, but there is a regression on avx2. And avx2 turns out to be faster than avx512 on this machine (even after applying change (B) to avx512, avx2 is still faster). I will continue investigating.
Ok, I believe I found the bug.
Ok, coming back to B], now that the avx512 bug is fixed (I am now using avx512 for all the benchmarks on ax42u): M4 Max. Before: 38ms. After your PR: 27ms (improvement of almost 30%). I am investigating.
Ok, found the issue, I believe.
Co-Authored-By: Borna <94551425+Barnadrot@users.noreply.github.com> Co-authored-by: Copilot <copilot@github.com>
Only A] remains, but I am not sure we can really gain here?
Agree with the conclusion. My reframing of A]: this is just a symptom-level patch. The ongoing research into hardware-agnostic alloc pressure is more future-proof than this smaller fix.
perf: reduce allocation contention and STIR equality cache traffic
Summary
Targeted heap allocation reduction and cache-traffic optimization across the
proving pipeline, guided by heaptrack and custom GlobalAlloc instrumentation
(phase-aware counters, size-class distribution, per-site atomic counters).
DHAT was attempted initially but abandoned: Valgrind serializes threads,
hiding the Rayon contention that turned out to be the actual performance
mechanism.
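The PR's instrumentation is not shown here. As a minimal sketch of the general idea, a counting wrapper around the system allocator could look like the block below. The names CountingAlloc, ALLOC_CALLS, REALLOC_CALLS, ALLOC_BYTES and count_allocs are hypothetical, and the phase-aware and per-site counters mentioned above are not modelled.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical process-wide counters (not the PR's actual instrumentation).
pub static ALLOC_CALLS: AtomicU64 = AtomicU64::new(0);
pub static REALLOC_CALLS: AtomicU64 = AtomicU64::new(0);
pub static ALLOC_BYTES: AtomicU64 = AtomicU64::new(0);

/// Forwards every request to the system allocator and counts it on the way.
pub struct CountingAlloc;

unsafe impl GlobalAlloc for CountingAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        ALLOC_BYTES.fetch_add(layout.size() as u64, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }

    unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
        // Counted separately: Vec growth shows up here rather than in `alloc`.
        REALLOC_CALLS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.realloc(ptr, layout, new_size) }
    }
}

#[global_allocator]
static GLOBAL: CountingAlloc = CountingAlloc;

/// Runs `f` and returns (result, allocs, reallocs) observed while it ran.
/// The counters are process-wide, so activity on other threads is included.
pub fn count_allocs<R>(f: impl FnOnce() -> R) -> (R, u64, u64) {
    let allocs = ALLOC_CALLS.load(Ordering::Relaxed);
    let reallocs = REALLOC_CALLS.load(Ordering::Relaxed);
    // black_box keeps the optimizer from eliding the allocation being measured.
    let result = std::hint::black_box(f());
    (
        result,
        ALLOC_CALLS.load(Ordering::Relaxed) - allocs,
        REALLOC_CALLS.load(Ordering::Relaxed) - reallocs,
    )
}
```

Unlike a Valgrind-based tool, a wrapper of this kind leaves threads running concurrently, so it only adds counts and does not by itself reveal lock contention.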
Four changes survived the gate (>= 1.0%, p < 0.01):
(a) Equality polynomial: replace into_par_iter().chunks() with par_chunks_exact_mut() in equality polynomial computation, eliminating 9.09M Vec allocations from WHIR STIR queries.
(b) Trace column pre-allocation: pre-size column Vecs in parallel batch segments with capacity from first iteration, eliminating ~210K reallocation cascades across 1399 Rayon segments.
(c) Matrix row access: override vertically_packed_row_rtl on DenseMatrix and FlatMatrixView with zero-copy direct indexing using precomputed row offsets, eliminating 1.62M allocs in Merkle tree construction.
(d) Batched STIR equality updates: process all 274 query points per L2-sized output tile instead of 274 sequential full sweeps, reducing DRAM traffic from ~67GB to ~168MB.
Combined: ~22% Criterion, -10.3% production (xmss_leaf_1400sigs / fancy-aggregation), validated on AWS c7a.2xlarge (AMD EPYC Genoa, Zen 4, AVX-512).
Changes
(a) Equality polynomial: par_chunks_exact
crates/backend/poly/src/eq_mle.rs
compute_eval_eq_packed and compute_eval_eq_base_packed used into_par_iter().chunks(packing_width), which creates a heap-allocated Vec per chunk. With 274 STIR queries each calling these functions on a 2^22-element buffer split into 16-element chunks, this produced 9.09M allocations per proof, 70% of the pipeline total.
Fix: par_chunks_exact_mut() iterates over existing slice chunks with zero allocation. Alloc count: 12.94M -> 3.64M (-71.8%).
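To illustrate the difference between the two patterns, here is a sequential sketch using only the standard library. It uses chunks_exact_mut, the single-threaded counterpart of Rayon's par_chunks_exact_mut, and imitates the per-chunk Vec with an explicit to_vec(). The function names and the scaling operation are hypothetical and stand in for the real equality-polynomial kernel.

```rust
/// Allocating pattern: every chunk is materialised as its own Vec before use,
/// which is what an iterator-level `chunks(n)` adaptor has to do.
/// Returns the scaled data and the number of per-chunk Vecs that were built.
pub fn scale_collecting(data: &[u64], width: usize, factor: u64) -> (Vec<u64>, usize) {
    assert!(width > 0, "chunk width must be non-zero");
    let mut out = Vec::with_capacity(data.len());
    let mut chunk_vecs = 0;
    for start in (0..data.len()).step_by(width) {
        let end = (start + width).min(data.len());
        let chunk: Vec<u64> = data[start..end].to_vec(); // one heap allocation per chunk
        chunk_vecs += 1;
        out.extend(chunk.iter().map(|x| x.wrapping_mul(factor)));
    }
    (out, chunk_vecs)
}

/// Slice pattern: `chunks_exact_mut` hands out sub-slices of the existing
/// buffer, so the loop body performs no heap allocation at all.
pub fn scale_in_place(data: &mut [u64], width: usize, factor: u64) {
    assert!(width > 0 && data.len() % width == 0, "length must be a multiple of width");
    for chunk in data.chunks_exact_mut(width) {
        for x in chunk {
            *x = x.wrapping_mul(factor);
        }
    }
}
```

Both functions produce the same values; only the first pays one allocation per chunk, and under Rayon those allocations are issued from many worker threads at once.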
(b) Trace column pre-allocation
crates/lean_vm/src/execution/runner.rs, crates/lean_vm/src/tables/table_trait.rs
handle_parallel_batch creates 1399 parallel segments via Rayon, each building a TableTrace with ~170 columns. Columns start empty and grow via push(), triggering ~150 reallocations per column (average segment has ~1000 rows but columns start at capacity 0). Total: ~210K reallocation cascades.
Fix: TableTrace::with_column_capacity(n_columns, capacity) pre-sizes all column Vecs from the first segment's actual row count. Alloc count: 3.64M -> 3.44M (-5.7%). The large wall-clock improvement (-3.25% Criterion) relative to the small alloc count reduction indicates this was primarily an allocator-contention fix: ~210K reallocs issued concurrently across 1399 Rayon segments, all competing for glibc arena locks.
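A minimal sketch of the pre-sizing idea, assuming a simplified TableTrace that holds nothing but its columns. The struct layout, the u32 element type, push_row and fill are hypothetical; only the with_column_capacity(n_columns, capacity) signature comes from the description above.

```rust
/// Hypothetical stand-in for the PR's TableTrace: only column storage is modelled.
pub struct TableTrace {
    pub columns: Vec<Vec<u32>>,
}

impl TableTrace {
    /// Pre-sizes every column so that up to `capacity` rows fit without reallocating.
    pub fn with_column_capacity(n_columns: usize, capacity: usize) -> Self {
        Self {
            columns: (0..n_columns).map(|_| Vec::with_capacity(capacity)).collect(),
        }
    }

    /// Appends one row and returns how many columns had to grow their buffer.
    pub fn push_row(&mut self, row: &[u32]) -> usize {
        assert_eq!(row.len(), self.columns.len());
        let mut grown = 0;
        for (col, &value) in self.columns.iter_mut().zip(row) {
            let before = col.capacity();
            col.push(value);
            if col.capacity() != before {
                grown += 1; // the push outgrew the buffer and triggered a reallocation
            }
        }
        grown
    }
}

/// Fills `trace` with `rows` identical rows and returns the total number of growth events.
pub fn fill(trace: &mut TableTrace, rows: usize) -> usize {
    let row = vec![1u32; trace.columns.len()];
    (0..rows).map(|_| trace.push_row(&row)).sum()
}
```

A trace built with capacity 1000 absorbs 1000 rows with no growth events, while one built with capacity 0 grows repeatedly. A segment with more rows than the chosen capacity still reallocates, which is the limitation raised in the review about sizing from the first segment.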
(c) Matrix row access: zero-alloc vertically_packed_row_rtl
crates/whir/src/matrix.rs
Default vertically_packed_row_rtl calls wrapping_row_slices -> row_slice_unchecked -> collect_vec(), allocating a Vec per row. With P::WIDTH=16, each call allocates 16 Vecs. Merkle tree construction calls this once per row of the polynomial matrix, across all tree levels.
Fix: Override on DenseMatrix indexes the flat buffer directly with precomputed row_offsets[i] = ((r + i) % height) * width: zero allocation, zero modulo in the inner loop. Override on FlatMatrixView dispatches to the inner DenseMatrix's wrapping_row_slices (1 alloc per call instead of 17), accessing extension coefficients via as_basis_coefficients_slice. Alloc count: 3.44M -> 1.77M (-48.5%).
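The row-offset indexing can be sketched on a plain row-major buffer as follows. This is an illustration under assumptions, not the DenseMatrix override itself: LANES stands in for P::WIDTH (16 in the PR), packed values are modelled as fixed-size arrays, the function name is hypothetical, and any ordering implied by the rtl suffix is not modelled.

```rust
/// Stand-in for P::WIDTH (16 in the PR); 4 keeps the example small.
pub const LANES: usize = 4;

/// For each column c, gathers the values of rows r, r+1, ..., r+LANES-1
/// (wrapping around the matrix height) into one packed array, reading the
/// row-major buffer `values` directly and writing into a caller-owned slice.
pub fn vertically_packed_row(values: &[u32], width: usize, r: usize, out: &mut [[u32; LANES]]) {
    assert!(width > 0 && values.len() % width == 0);
    let height = values.len() / width;
    assert!(height > 0 && out.len() == width);

    // The modulo is paid once per lane here, never inside the per-column loop.
    let row_offsets: [usize; LANES] = std::array::from_fn(|i| ((r + i) % height) * width);

    for (c, packed) in out.iter_mut().enumerate() {
        // Direct indexing into the flat buffer: no per-row Vec is built.
        *packed = std::array::from_fn(|i| values[row_offsets[i] + c]);
    }
}
```

For a 3x2 matrix and r = 2, the four lanes read rows 2, 0, 1, 2, so the wrap-around is resolved entirely in row_offsets.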
(d) Batched STIR equality updates with L2-tiled output traversal
crates/backend/poly/src/eq_mle.rs, crates/whir/src/open.rs
add_new_base_equality loops over 274 STIR query points, each calling compute_eval_eq_base_packed, which allocates an 84MB temporary buffer, fills it with the equality polynomial, then packs into the 84MB output buffer. Total: 274 sequential full sweeps = ~67GB DRAM traffic, exceeding L3 cache (32MB on Zen 4) by ~2000x.
Fix: compute_eval_eq_base_packed_batched splits the output into 2^14-element tiles (320KB each, fits L2) and processes all 274 queries per tile before moving on. The equality polynomial's multiplicative structure allows computing a per-tile prefix scalar in O(n-k) multiplications, then filling only the tile-sized sub-range. Each Rayon thread reuses its tile buffer via for_each_init. DRAM traffic: ~67GB -> ~168MB (output read + writeback only).
Diff shape
Branch: exp6_alloc_reduction_clean on myfork. 4 clean commits on origin/main.
Results
Criterion steady-state (xmss_leaf_1400sigs)
All measurements on AWS c7a.2xlarge, glibc system allocator, 100 Criterion samples, eval_paired.sh. Iter 17 confirmed across two independent runs (-17.4% and -16.2%, both p=0.00).
Production workload (fancy-aggregation)
AWS measured with reproduce_prod.sh (cargo clean between builds, 3 runs each, median comparison). First run excluded as cold-start outlier (~368s on both sides: consistent, not a regression).
Dilution from Criterion (~22%) to production (-10.3%) is 2.1x. Tighter than
prior experiments (exp4: 3.6x) because the STIR tiling is a cache-traffic
optimization that hits equally in both Criterion and production (unlike
allocation contention which is amplified in tight Criterion loops).
Pending Hetzner validation. These results are AWS-only (16GB, shared
tenancy). The 3 allocation KEEPs are hardware-agnostic by design. The STIR
tiling depends on L2 cache size (1MB on Zen 4 Genoa, 1MB on Zen 4/
Hetzner 8700GE) — expected to transfer, but should be confirmed on Hetzner bare
metal before claiming universality.
Hardware
Correctness
cargo test --release --workspace: 56 tests pass, 0 failures. Includes 3 end-to-end proof generation + verification tests (test_xmss_signature, test_recursive_aggregation, test_aggregation), 2 WHIR tests (test_run_whir, test_eval_dft), 4 ZK VM tests, 12 compiler tests.
(test_batched_eq_base_packed_matches_sequential, experiment branch) confirms compute_eval_eq_base_packed_batched produces identical output to 274 sequential compute_eval_eq_base_packed calls across 7 buffer sizes (n_vars 8–22), including the production size (22).
fancy-aggregation via reproduce_prod.sh generates and verifies a full proof end-to-end (3 runs each for baseline and candidate).
proof-of-work bits are unchanged.
identical outputs, fewer heap operations. Change (d) restructures traversal order for cache locality: identical arithmetic, identical outputs, reduced DRAM traffic.
Why the allocation surface is exhausted (and what replaced it)
Iters 1-16: 16 iterations targeting allocation reduction, 3 KEEPs, 12
consecutive discards. Iter 16 proved the allocation surface is exhausted:
WHIR's separate FlatMatrixView::vertically_packed_row_rtl.
1.6M allocs x 10ns = 16ms = 0.3% of runtime, below measurement noise.
The three allocation KEEPs worked because they targeted allocator contention
(threads blocking on glibc arena locks during concurrent realloc cascades), not
raw alloc/dealloc speed. Once contention sites are fixed, the remaining allocs
flow through tcache with no cross-thread interaction.
Iter 17 shifted to a different surface: cache-traffic reduction. The STIR
equality update was the single largest remaining bottleneck, and its cost was
dominated by DRAM bandwidth, not allocation. L2 tiling gave -16.8% Criterion —
larger than all 3 allocation KEEPs combined.
Relationship to mimalloc (PR #200)
mimalloc (PR #200) gives -24% production / -33% Criterion on AWS c7a.2xlarge
(16GB), but +3.6% regression on Hetzner AX42-U (64GB). The allocation
reduction in this PR is complementary but addresses a different layer:
The remaining gap between this PR and mimalloc is likely memory
fragmentation under pressure. With 12-13GB peak RSS on a 16GB AWS
machine (80% of RAM), glibc's arena-based allocation scatters live objects
across pages, inflating RSS and competing with the OS for physical memory.
mimalloc's per-thread size-class segregated pages keep objects compact,
reducing page faults and TLB misses.
Evidence: on 64GB Hetzner (12-13GB = 20% of RAM, no pressure), mimalloc
regresses — its aggressive page retention becomes overhead when fragmentation
is irrelevant. This PR's source-code allocation reduction is
hardware-agnostic (helps on both machines) but cannot fix glibc's memory
layout; only the allocator's page-management strategy can.
A perf stat -e page-faults,minor-faults comparison (glibc vs mimalloc) would confirm this hypothesis but has not been run.
Batched STIR queries: how iter 17 works
add_new_base_equality was the single largest remaining bottleneck: 274 STIR query points, each sweeping the full ~84MB weight buffer = ~67GB DRAM traffic, far exceeding L3 cache (32MB on Zen 4).
The fix exploits the equality polynomial's multiplicative structure: for a tile of 2^14 elements (320KB, fits L2), all elements share the same "prefix" product from the upper bit positions. compute_eval_eq_base_packed_batched processes all 274 queries per tile before moving on, so both the temp buffer and output slice stay in L2 across all queries. DRAM traffic drops to ~168MB (output read + writeback only). Estimated 8-13%, measured -16.8% Criterion (iter 17 only).
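The prefix-scalar trick can be sketched over a toy prime field, assuming eq(z, x) is the product over i of (z_i if bit i of x is set, else 1 - z_i), with z[0] as the most significant bit. Everything in the block is illustrative: the modulus, fill_eq, accumulate_sequential and accumulate_tiled are hypothetical names, and packing, extension-field elements and Rayon's for_each_init are left out. The sketch only aims to show that the tile-major loop order computes the same sums as one full sweep per query.

```rust
/// Toy prime modulus (2^31 - 2^24 + 1); the PR works on packed field elements instead.
pub const P: u64 = 2_130_706_433;

fn add(a: u64, b: u64) -> u64 { (a + b) % P }
fn sub(a: u64, b: u64) -> u64 { (a + P - b) % P }
fn mul(a: u64, b: u64) -> u64 { (a * b) % P }

/// Writes scalar * eq(z, x) for every x in {0,1}^len(z) into `buf`; z[0] is the top bit of x.
pub fn fill_eq(z: &[u64], scalar: u64, buf: &mut [u64]) {
    assert_eq!(buf.len(), 1usize << z.len());
    buf[0] = scalar % P;
    let mut size = 1;
    for &zi in z {
        let zi = zi % P;
        // Descending order so that entries are read before they are overwritten.
        for j in (0..size).rev() {
            let hi = mul(buf[j], zi);
            buf[2 * j + 1] = hi;           // bit set:   v * z_i
            buf[2 * j] = sub(buf[j], hi);  // bit clear: v * (1 - z_i)
        }
        size *= 2;
    }
}

/// Baseline: one full-size temporary and one full sweep of `out` per query.
pub fn accumulate_sequential(queries: &[(Vec<u64>, u64)], n_vars: usize, out: &mut [u64]) {
    assert_eq!(out.len(), 1usize << n_vars);
    for (z, coeff) in queries {
        assert_eq!(z.len(), n_vars);
        let mut temp = vec![0u64; out.len()];
        fill_eq(z, *coeff, &mut temp);
        for (o, t) in out.iter_mut().zip(&temp) {
            *o = add(*o, *t);
        }
    }
}

/// Tiled: for each tile of 2^tile_vars outputs, process every query before moving on.
pub fn accumulate_tiled(queries: &[(Vec<u64>, u64)], n_vars: usize, tile_vars: usize, out: &mut [u64]) {
    assert!(tile_vars <= n_vars);
    assert_eq!(out.len(), 1usize << n_vars);
    let upper = n_vars - tile_vars;
    let mut temp = vec![0u64; 1usize << tile_vars]; // one tile-sized buffer, reused throughout
    for (t, out_tile) in out.chunks_exact_mut(1usize << tile_vars).enumerate() {
        for (z, coeff) in queries {
            assert_eq!(z.len(), n_vars);
            // Prefix scalar: the upper bits of every index in this tile equal t,
            // so their contribution is a single product of `upper` factors.
            let mut prefix = coeff % P;
            for (i, &zi) in z[..upper].iter().enumerate() {
                let bit = (t >> (upper - 1 - i)) & 1;
                prefix = mul(prefix, if bit == 1 { zi % P } else { sub(1, zi % P) });
            }
            fill_eq(&z[upper..], prefix, &mut temp);
            for (o, v) in out_tile.iter_mut().zip(&temp) {
                *o = add(*o, *v);
            }
        }
    }
}
```

In this form the working set of the inner loops is the tile buffer plus one output tile, independent of the number of queries, which is the property the L2 sizing above relies on.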
Experimentally ruled out (17 iterations total)
This PR is the surviving output of a 17-iteration profiling-guided optimization
campaign (experiment 6) targeting heap allocation reduction (iters 1-16) and
cache-traffic optimization (iter 17).
Discarded iterations
Key findings
where Rayon threads competed for glibc arena locks. The 13 discards targeted
sites where alloc count was high but contention was absent (tcache-served).
allocs (128-640B each) produced zero measurable wall-clock improvement.
large allocations; eliminating a few large allocs has negligible effect.
all regressed because the alternative code path (indexed access, atomics,
array fill) was slower than the original alloc+collect pattern.