perf: reduce allocation contention and STIR equality cache traffic by Barnadrot · Pull Request #203 · leanEthereum/leanVM

Barnadrot · 2026-04-23T17:08:03Z

perf: reduce allocation contention and STIR equality cache traffic

Summary

Targeted heap allocation reduction and cache-traffic optimization across the
proving pipeline, guided by heaptrack and custom GlobalAlloc instrumentation
(phase-aware counters, size-class distribution, per-site atomic counters).
DHAT was attempted initially but abandoned — Valgrind serializes threads,
hiding the Rayon contention that turned out to be the actual performance
mechanism.

Four changes survived the gate (>= 1.0%, p < 0.01):

eq_mle par_chunks_exact (-2.12%): Replace Rayon into_par_iter().chunks()
with par_chunks_exact_mut() in equality polynomial computation, eliminating
9.09M Vec allocations from WHIR STIR queries.
Trace column pre-allocation (-3.25%): Pre-allocate trace column Vecs in
parallel batch segments with capacity from first iteration, eliminating ~210K
reallocation cascades across 1399 Rayon segments.
Backend matrix zero-alloc row access (-1.90%): Override
vertically_packed_row_rtl on DenseMatrix and FlatMatrixView with
zero-copy direct indexing using precomputed row offsets, eliminating 1.62M
allocs in Merkle tree construction.
Batched STIR equality with L2 tiling (-16.8%): Process all 274 STIR
query points per L2-sized output tile instead of 274 sequential full sweeps,
reducing DRAM traffic from ~67GB to ~168MB.

Combined: ~22% Criterion, -10.3% production (xmss_leaf_1400sigs /
fancy-aggregation), validated on AWS c7a.2xlarge (AMD EPYC Genoa, Zen 4,
AVX-512).

Changes

(a) Equality polynomial: par_chunks_exact

crates/backend/poly/src/eq_mle.rs

compute_eval_eq_packed and compute_eval_eq_base_packed used
into_par_iter().chunks(packing_width), which creates a heap-allocated
Vec per chunk. With 274 STIR queries each calling these functions on a
2^22-element buffer split into 16-element chunks, this produced 9.09M
allocations per proof — 70% of the pipeline total.

Fix: par_chunks_exact_mut() iterates over existing slice chunks with zero
allocation. Alloc count: 12.94M -> 3.64M (-71.8%).

(b) Trace column pre-allocation

crates/lean_vm/src/execution/runner.rs, crates/lean_vm/src/tables/table_trait.rs

handle_parallel_batch creates 1399 parallel segments via Rayon, each building
a TableTrace with ~170 columns. Columns start empty and grow via push(),
triggering ~150 reallocations per column (average segment has ~1000 rows but
columns start at capacity 0). Total: ~210K reallocation cascades.

Fix: TableTrace::with_column_capacity(n_columns, capacity) pre-sizes all
column Vecs from the first segment's actual row count. Alloc count:
3.64M -> 3.44M (-5.7%). The large improvement (-3.25%) relative to the small
alloc count reduction indicates this was primarily an allocator-contention fix:
210K concurrent reallocs from 1399 Rayon threads competing for glibc arena locks.

(c) Matrix row access: zero-alloc vertically_packed_row_rtl

crates/whir/src/matrix.rs

Default vertically_packed_row_rtl calls wrapping_row_slices -> row_slice_unchecked
-> collect_vec(), allocating a Vec per row. With P::WIDTH=16, each call
allocates 16 Vecs. Merkle tree construction calls this once per row of the
polynomial matrix, across all tree levels.

Fix: Override on DenseMatrix indexes the flat buffer directly with precomputed
row_offsets[i] = ((r + i) % height) * width — zero allocation, zero modulo
in the inner loop. Override on FlatMatrixView dispatches to the inner
DenseMatrix's wrapping_row_slices (1 alloc per call instead of 17,
accessing extension coefficients via as_basis_coefficients_slice). Alloc count:
3.44M -> 1.77M (-48.5%).

(d) Batched STIR equality updates with L2-tiled output traversal

crates/backend/poly/src/eq_mle.rs, crates/whir/src/open.rs

add_new_base_equality loops over 274 STIR query points, each calling
compute_eval_eq_base_packed which allocates an 84MB temporary buffer,
fills it with the equality polynomial, then packs into the 84MB output
buffer. Total: 274 sequential full sweeps = ~67GB DRAM traffic, exceeding
L3 cache (32MB on Zen 4) by ~2000x.

Fix: compute_eval_eq_base_packed_batched splits the output into 2^14-element
tiles (320KB each, fits L2) and processes all 274 queries per tile before
moving on. The equality polynomial's multiplicative structure allows computing
a per-tile prefix scalar in O(n-k) multiplications, then filling only the
tile-sized sub-range. Each Rayon thread reuses its tile buffer via
for_each_init. DRAM traffic: ~67GB -> ~168MB (output read + writeback only).

Diff shape

 crates/backend/poly/src/eq_mle.rs        | 100 +++++++++++++++++++++++++++--
 crates/lean_vm/src/execution/runner.rs   |  20 ++++++-
 crates/lean_vm/src/tables/table_trait.rs |   8 +++
 crates/whir/src/matrix.rs                |  47 +++++++++++++++
 crates/whir/src/open.rs                  |  16 ++---
 5 files changed, 176 insertions(+), 15 deletions(-)

Branch: exp6_alloc_reduction_clean on myfork. 4 clean commits on origin/main.

Results

Criterion steady-state (`xmss_leaf_1400sigs`)

Keep	Delta	p-value	Mechanism
Iter 4 (eq_mle)	-2.12%	0.00	Allocation reduction: 12.94M -> 3.64M (-71.8%)
Iter 7 (trace columns)	-3.25%	0.00	Allocation reduction: 3.64M -> 3.44M (-5.7%)
Iter 8 (matrix rows)	-1.90%	0.00	Allocation reduction: 3.44M -> 1.77M (-48.5%)
Iter 17 (STIR tiling)	-16.8%	0.00	Cache-traffic reduction: ~67GB -> ~168MB DRAM
Combined	~22%	—

All measurements on AWS c7a.2xlarge, glibc system allocator, 100 Criterion
samples, eval_paired.sh. Iter 17 confirmed across two independent runs
(-17.4% and -16.2%, both p=0.00).

Production workload (`fancy-aggregation`)

	AWS c7a.2xlarge	Hetzner AX42-U
Median	50.89s → 45.64s	pending
Delta	-10.32%	pending
Runs	3	—
Binary hash	eaaf5d8a / 71b6858f	—

AWS measured with reproduce_prod.sh (cargo clean between builds,
3 runs each, median comparison). First run excluded as cold-start outlier
(~368s both sides — consistent, not a regression).

Dilution from Criterion (~22%) to production (-10.3%) is 2.1x. Tighter than
prior experiments (exp4: 3.6x) because the STIR tiling is a cache-traffic
optimization that hits equally in both Criterion and production (unlike
allocation contention which is amplified in tight Criterion loops).

Pending Hetzner validation. These results are AWS-only (16GB, shared
tenancy). The 3 allocation KEEPs are hardware-agnostic by design. The STIR
tiling depends on L2 cache size (1MB on Zen 4 Genoa, 1MB on Zen 4/
Hetzner 8700GE) — expected to transfer, but should be confirmed on Hetzner bare
metal before claiming universality.

Hardware

	AWS c7a.2xlarge	Hetzner AX42-U
CPU	AMD EPYC Genoa (Zen 4)	AMD Ryzen 7 PRO 8700GE (Zen 4)
Cores	8 vCPU (shared tenancy)	8C/16T (dedicated)
RAM	16 GB	64 GB DDR5
AVX-512	Yes	Yes

Correctness

Full workspace test suite (cargo test --release --workspace): 56
tests pass, 0 failures. Includes 3 end-to-end proof generation +
verification tests (test_xmss_signature, test_recursive_aggregation,
test_aggregation), 2 WHIR tests (test_run_whir, test_eval_dft),
4 ZK VM tests, 12 compiler tests.
Iter 17 bitwise equivalence: dedicated test
(test_batched_eq_base_packed_matches_sequential, experiment branch)
confirms compute_eval_eq_base_packed_batched produces identical output
to 274 sequential compute_eval_eq_base_packed calls across 7 buffer
sizes (n_vars 8–22), including the production size (22).
Production proof verified: fancy-aggregation via reproduce_prod.sh
generates and verifies a full proof end-to-end (3 runs each for baseline
and candidate).
No security parameter changes. FRI query count, blowup factor, and
proof-of-work bits are unchanged.
No API or interface changes.
Changes (a)–(c) are pure allocation strategy — identical computation,
identical outputs, fewer heap operations. Change (d) restructures
traversal order for cache locality — identical arithmetic, identical
outputs, reduced DRAM traffic.

Why the allocation surface is exhausted (and what replaced it)

Iters 1-16: 16 iterations targeting allocation reduction, 3 KEEPs, 12
consecutive discards. Iter 16 proved the allocation surface is exhausted:

Eliminated 1.6M additional allocs (-12.4% of remaining) by overriding
WHIR's separate FlatMatrixView::vertically_packed_row_rtl.
Alloc counter measurement: 12,940K -> 11,334K allocs, 468MB less total.
Wall-clock: 5136ms vs 5146ms — zero effect.
glibc's per-thread tcache serves remaining small allocs in ~10ns each.
1.6M allocs x 10ns = 16ms = 0.3% of runtime, below measurement noise.

The three allocation KEEPs worked because they targeted allocator contention
(threads blocking on glibc arena locks during concurrent realloc cascades), not
raw alloc/dealloc speed. Once contention sites are fixed, the remaining allocs
flow through tcache with no cross-thread interaction.

Iter 17 shifted to a different surface: cache-traffic reduction. The STIR
equality update was the single largest remaining bottleneck, and its cost was
dominated by DRAM bandwidth, not allocation. L2 tiling gave -16.8% Criterion —
larger than all 3 allocation KEEPs combined.

Relationship to mimalloc (PR #200)

mimalloc (PR #200) gives -24% production / -33% Criterion on AWS c7a.2xlarge
(16GB), but +3.6% regression on Hetzner AX42-U (64GB). The allocation
reduction in this PR is complementary but addresses a different layer:

	This PR	mimalloc
Mechanism	Eliminate contention allocs + cache tiling	Replace allocator entirely
Criterion effect	-22%	-33%
Production effect	-10.3%	-24%
Hardware-dependent	No	Yes (regresses on 64GB Hetzner)
Captures contention	Yes (3 targeted sites)	Yes (all sites, via thread-local heaps)
Captures layout/fragmentation	No	Yes

The remaining gap between this PR and mimalloc is likely memory
fragmentation under pressure. With 12-13GB peak RSS on a 16GB AWS
machine (80% of RAM), glibc's arena-based allocation scatters live objects
across pages, inflating RSS and competing with the OS for physical memory.
mimalloc's per-thread size-class segregated pages keep objects compact,
reducing page faults and TLB misses.

Evidence: on 64GB Hetzner (12-13GB = 20% of RAM, no pressure), mimalloc
regresses — its aggressive page retention becomes overhead when fragmentation
is irrelevant. This PR's source-code allocation reduction is
hardware-agnostic (helps on both machines) but cannot fix glibc's memory
layout; only the allocator's page-management strategy can.

perf stat -e page-faults,minor-faults comparison (glibc vs mimalloc)
would confirm this hypothesis but has not been run.

Batched STIR queries: how iter 17 works

add_new_base_equality was the single largest remaining bottleneck: 274
STIR query points, each sweeping the full ~84MB weight buffer = ~67GB
DRAM traffic, far exceeding L3 cache (32MB on Zen 4).

The fix exploits the equality polynomial's multiplicative structure: for a
tile of 2^14 elements (320KB, fits L2), all elements share the same "prefix"
product from the upper bit positions. compute_eval_eq_base_packed_batched
processes all 274 queries per tile before moving on — both the temp buffer
and output slice stay in L2 across all queries. DRAM traffic drops to
~168MB (output read + writeback only). Estimated 8-13%, measured -16.8% Criterion (iter 17 only).

out.par_chunks_exact_mut(tile_packed_size)
    .enumerate()
    .for_each_init(
        || uninitialized_vec(tile_unpacked_size),  // one buffer per thread
        |tile_buf, (tile_idx, out_tile)| {
            for (eval, scalar) in evals.zip(scalars) {
                let prefix = tile_prefix(eval, tile_idx, scalar);  // O(8) muls
                eval_eq_basic(eval_tail, tile_buf, prefix);         // fill 320KB
                pack_and_accumulate(out_tile, tile_buf);             // += into output
            }
        },
    );

Experimentally ruled out (17 iterations total)

Details

This PR is the surviving output of a 17-iteration profiling-guided optimization
campaign (experiment 6) targeting heap allocation reduction (iters 1-16) and
cache-traffic optimization (iter 17).

Discarded iterations

Iter	Target	Delta	Why it failed
1	Allocation profiling	—	Profile-only iteration (heaptrack + custom GlobalAlloc counters; DHAT abandoned — Valgrind serializes threads). 12.94M allocs, 16.7GB total. Top: Rayon par_collect, WHIR eq_poly, Merkle row slices
2	Fused unpack+bit_reverse_chunks	-0.28%	Only 2 large allocs eliminated; glibc handles large allocs efficiently
3	Sumcheck eval_fn Vec->slice	-0.62%	Real but below 1.0% gate. Small n_cols (2-40 elements) = cheap allocs
5	Backend matrix vertically_packed_row_rtl	-0.47%	Real but below 1.0% gate. Each eliminated alloc was ~128B
6	In-place bit_reverse_chunks	-0.22%	20 large allocs eliminated. Large alloc/dealloc is fast in glibc
9	Poseidon get_slice -> individual gets	+1.13%	Per-element trait dispatch slower than batch collect
10	Fused read+trace in extension_op	+0.71%	Extra indexing overhead exceeded alloc savings
11	FlatMatrixView flat_buffer zero-alloc	+0.31%	Only 14K allocs eliminated — too few to matter
12	Thread-local Vec pool for exec_multi_row	+0.08%	TLS lookup overhead equals alloc savings; tcache already optimal
13	get_slice_into buffer API for Poseidon	+0.05%	1M allocs eliminated but each Vec was 32B — tcache near-zero cost
14	Parallel memory_acc/bytecode_acc with atomics	+0.31%	lock xadd (~20-30ns) exceeds sequential load+add+store (~15ns)
15	Array-based point/diff buffers in sumcheck	+0.46%	push() with pre-allocated capacity already optimized by compiler
16	WHIR FlatMatrixView vertically_packed_row_rtl	0.00%	1.6M allocs eliminated, zero wall-clock effect. tcache absorbs all

Key findings

Contention is the mechanism, not alloc count. All 3 KEEPs targeted sites
where Rayon threads competed for glibc arena locks. The 13 discards targeted
sites where alloc count was high but contention was absent (tcache-served).
glibc tcache is effectively free for small allocs. Eliminating up to 1.6M
allocs (128-640B each) produced zero measurable wall-clock improvement.
Large alloc reduction doesn't help either. glibc uses mmap/mremap for
large allocations; eliminating a few large allocs has negligible effect.
Code restructuring to avoid allocs often regresses. Iters 9, 10, 14, 15
all regressed because the alternative code path (indexed access, atomics,
array fill) was slower than the original alloc+collect pattern.

Rayon's into_par_iter().chunks(n) creates a heap-allocated Vec per chunk. compute_eval_eq_packed and compute_eval_eq_base_packed are called per STIR query (274 in round 1) on a 2^22-element buffer split into 16-element chunks, producing 9.09M allocations per proof (70% of pipeline total). par_chunks_exact() iterates over existing slice chunks with zero allocation. Alloc count: 12.94M -> 3.64M (-71.8%). Criterion: -2.12%, p=0.00. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

handle_parallel_batch creates ~1399 parallel segments via Rayon, each building a TableTrace with ~170 columns. Columns started at capacity 0 and grew via push(), triggering ~150 reallocations per column per segment (~210K total realloc cascades), all competing for glibc arena locks. Add TableTrace::with_column_capacity() and Trace::with_capacity() that pre-size all column Vecs using the first segment's actual row counts. Alloc count: 3.64M -> 3.44M (-5.7%). Criterion: -3.25%, p=0.00. The large wall-clock improvement relative to the small alloc count reduction indicates this was primarily an allocator-contention fix. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…trixView Default vertically_packed_row_rtl calls wrapping_row_slices -> row_slice_unchecked -> collect_vec(), allocating a Vec per row. With P::WIDTH=16, each call allocates 16 Vecs. Merkle tree construction calls this for every row across all tree levels. DenseMatrix override: index the flat buffer directly with precomputed row_offsets[i] = ((r + i) % height) * width. Zero allocation, zero modulo in the inner loop. FlatMatrixView override: dispatch to inner DenseMatrix's wrapping_row_slices (1 alloc per call instead of 17), accessing extension coefficients via as_basis_coefficients_slice(). Alloc count: 3.44M -> 1.77M (-48.5%). Criterion: -1.90%, p=0.00. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Instead of 274 sequential full sweeps of the ~84MB weight buffer (one per STIR query point), split the output into L2-friendly tiles (2^14 = 16K unpacked elements = 320KB) and process all 274 queries per tile before moving to the next. Each tile's temp buffer and output slice fit in L2 (~640KB total), so DRAM traffic drops from ~67GB to ~168MB. The eq polynomial's multiplicative structure allows computing a per-tile prefix scalar (O(n-k) multiplications) that captures the contribution of upper bit positions, then filling only the tile-sized sub-range. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…clean

TomWambsgans · 2026-04-25T09:31:48Z

Thanks! I will split this PR in independent small changes

TomWambsgans · 2026-04-25T09:32:12Z

first: 0e78ca3

TomWambsgans · 2026-04-25T09:49:41Z

second: c889474

TomWambsgans · 2026-04-25T11:04:17Z

There remains 2 independent ideas in the PR:

A] the "with_capacity" in the parallel (batch} execution of the leanISA bytecode (at witness generation)
B]compute_eval_eq_base_packed_batched

About A] I was not able to repoduce the perf improvement. One explanation is that there is no reason each segment of the parallel execution will share exactly the same numbers of cycles. So on average half on the segments will be above the nb of cycles of the first segment, and will need re-allocation of memory.

About B] this is weird. On mac M4 max I can see an improvement. On hetzner ax42u I can see a massive improvement when running on avx512, but there is an regession on avx2. And avx2 turns out to be faster than avx512 on this machine (and even after applying the change (B) to avx512, avx2 is still faster). I will continue investigating

TomWambsgans · 2026-04-25T12:06:15Z

Ok I believe I found the bug

TomWambsgans · 2026-04-25T12:45:32Z

Ok coming back to B], now that the avx512 bug iis fixed (I am now using avx512 for all the benchmarks on ax42u):
running:
cargo run --release -- xmss --n-signatures 1400 --tracing
and looking at the first 'add_new_base_equality' (the other ones are negligible, only the first really matter):

M4 Max. Before: 38ms. After your PR: 27ms (improvement of almost 30%)
ax42u: Before 26ms. After your PR: 59ms (more than 2x downgrade).

I am investigating

TomWambsgans · 2026-04-25T12:55:37Z

Ok found the issue I believe

Co-Authored-By: Borna <94551425+Barnadrot@users.noreply.github.com> Co-authored-by: Copilot <copilot@github.com>

TomWambsgans · 2026-04-25T13:32:59Z

third: 5d6fe62

So B] is done
(actually we can likely do even better: cf 79e7b15)

TomWambsgans · 2026-04-25T13:44:35Z

Only A] remains, but I am not sure we can really gain here?
If you confirm I will close this branch, since all the other optis have been merged, tks again!

Barnadrot · 2026-04-26T10:57:49Z

agree with the conclusion, my framing for the reframed A] this is just a symptom level patch.

The ongoing research into hardware agnostic alloc pressure is more future proof than this smaller fix.

Barnadrot and others added 7 commits April 23, 2026 15:15

fix: rustfmt + clippy needless_range_loop lint

0c483dd

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fix: clippy needless_range_loop in matrix row_offsets

4a8e4ee

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

avoid unnecessary allocation in initial Merkle tree

54b849b

TomWambsgans force-pushed the exp6_alloc_reduction_clean branch from 2d2dff2 to 426de76 Compare April 25, 2026 08:54

Merge remote-tracking branch 'origin/main' into exp6_alloc_reduction_…

5aafee1

…clean

TomWambsgans force-pushed the exp6_alloc_reduction_clean branch from 426de76 to 5aafee1 Compare April 25, 2026 08:58

TomWambsgans force-pushed the main branch 2 times, most recently from 4cc57d4 to 9b44b59 Compare April 25, 2026 09:11

Merge branch 'main' into exp6_alloc_reduction_clean

0d74286

remove usused vertically_packed_row_rtl

29a8ac2

Merge branch 'main' into exp6_alloc_reduction_clean

c25a81b

Merge branch 'main' into exp6_alloc_reduction_clean

9e4138c

TomWambsgans and others added 2 commits April 25, 2026 15:30

batch stir querries computations (compute_eval_eq_base_packed_batched)

c7e5cda

Co-Authored-By: Borna <94551425+Barnadrot@users.noreply.github.com> Co-authored-by: Copilot <copilot@github.com>

Merge branch 'main' into exp6_alloc_reduction_clean

91ad238

TomWambsgans force-pushed the main branch from c7e5cda to 5d6fe62 Compare April 25, 2026 13:33

TomWambsgans closed this Apr 26, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: reduce allocation contention and STIR equality cache traffic #203

perf: reduce allocation contention and STIR equality cache traffic #203
Barnadrot wants to merge 14 commits into
leanEthereum:mainfrom
Barnadrot:exp6_alloc_reduction_clean

Barnadrot commented Apr 23, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026 •

edited

Loading

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026 •

edited

Loading

Uh oh!

TomWambsgans commented Apr 25, 2026 •

edited

Loading

Uh oh!

Barnadrot commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Barnadrot commented Apr 23, 2026

perf: reduce allocation contention and STIR equality cache traffic

Summary

Changes

(a) Equality polynomial: par_chunks_exact

(b) Trace column pre-allocation

(c) Matrix row access: zero-alloc vertically_packed_row_rtl

(d) Batched STIR equality updates with L2-tiled output traversal

Diff shape

Results

Criterion steady-state (xmss_leaf_1400sigs)

Production workload (fancy-aggregation)

Hardware

Correctness

Experimentally ruled out (17 iterations total)

Discarded iterations

Key findings

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026

Uh oh!

TomWambsgans commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

TomWambsgans commented Apr 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Barnadrot commented Apr 26, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Criterion steady-state (`xmss_leaf_1400sigs`)

Production workload (`fancy-aggregation`)

TomWambsgans commented Apr 25, 2026 •

edited

Loading

TomWambsgans commented Apr 25, 2026 •

edited

Loading

TomWambsgans commented Apr 25, 2026 •

edited

Loading