feat/benchmark_new_version by Barnadrot · Pull Request #65 · Barnadrot/zk-autoresearch

Barnadrot · 2026-05-14T15:39:27Z

Redesigned benchmarking to improve experiment gate.

Session result: 1 confirmed real win (h5 orphan, -0.83% wall-clock, p=8e-06, n=24, reproduced at n=40), 9 discards, and 1 dead end (D.2, already applied). The counter stood at 9/12; the session ended early because all further attempts on the SIMD Poseidon surface produced LLVM-neutralized null results and the h5 orphan is shippable on its own.

h5 mechanism: manual unroll of the 15-iteration rank-1 update in the partial-round SIMD body. This eliminated the 0x780(%rsp,%rsi,1) indexed-stack spill that perf annotate showed at 2.28% + 1.31% = 3.6% on baseline. Codegen verified: the instance count of vmovdqa64 0x780(%rsp,%rsi,1) went from 2 (baseline) to 0 (h5 binary).

Status: held on branch as an orphan (commit 943b65a1 on leanMultisig:pw4_2-2026-05-13). Recommendation: brain should manually promote it to shipped, since the gain is real, statistically robust, codegen-verified, and free. Bundle attempts h6-h11 all failed to push the cumulative gain over the 1.0% gate threshold, but h5 alone is worth shipping.

The verdict and pr_body include the full hypothesis table, the confirmed dead ends (loop unrolls outside the h5 case, SSA-hoist, lambda fusion, x2 batching, prefetch, rayon pairing; all LLVM-neutralized), and a recommendation for follow-on profiling post-h5.
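For readers who want to sanity-check a claim like "-0.83% wall-clock, p=8e-06, n=24", here is a minimal sketch of a two-sided permutation test on two sets of wall-clock samples. The PR does not say which statistical test produced the reported p-value, so this is an assumed stand-in rather than the project's method; the function name `permutation_p_value` and its parameters are hypothetical.

```python
import random
from statistics import mean


def permutation_p_value(base, cand, n_resamples=10000, seed=0):
    """Two-sided permutation test on the difference of means.

    Hypothetical helper: the PR does not state which test it used.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(mean(cand) - mean(base))
    pooled = list(base) + list(cand)
    n_base = len(base)
    hits = 0
    for _ in range(n_resamples):
        # Reassign samples to the two sides at random.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[n_base:]) - mean(pooled[:n_base]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)
```

The smallest p-value this estimator can report is 1/(n_resamples + 1), so resolving a value near 8e-06 would need far more resamples than the default, or a parametric test instead.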

Post pw4 audit: baseline drift inflated per-iter deltas by 5-10x due to position bias, low sample count, and page cache asymmetry.

This commit addresses all identified noise sources:

- N=1→N=3 default (12 samples/side, ~85% power vs ~40%)
- Counterbalanced round ordering (odd=base→cand, even=cand→base)
- Core pinning via taskset (one thread per physical core)
- Page cache drop between sides (sync + drop_caches)
- A-vs-A noise floor calibration (soft-fail, no loop disruption)
- New eval_steady_state.sh for production-regime measurement (proof 200+)
- Auto-chain steady-state on keeps exceeding 2%

Tested on Hetzner AX42-U: noise floor 0.035%, counterbalancing visibly corrects position bias (round 2 Δ=-0.23% vs rounds 1,3 Δ=-0.48%).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
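The counterbalanced ordering described in this commit message (odd rounds run base then cand, even rounds run cand then base) can be sketched as follows. This is an illustration under assumptions, not the repository's harness: the names `round_order`, `build_schedule`, and `relative_delta_pct` are hypothetical, and rounds are assumed to be numbered from 1.

```python
from statistics import mean


def round_order(round_index: int) -> tuple[str, str]:
    """Return the side order for a 1-indexed round.

    Odd rounds: base first, then cand. Even rounds: cand first, then base.
    """
    return ("base", "cand") if round_index % 2 == 1 else ("cand", "base")


def build_schedule(rounds: int) -> list[tuple[int, str]]:
    """Flatten the per-round orders into one list of (round, side) runs."""
    schedule = []
    for r in range(1, rounds + 1):
        for side in round_order(r):
            schedule.append((r, side))
    return schedule


def relative_delta_pct(base_times, cand_times) -> float:
    """Relative change of the candidate's mean time versus base, in percent.

    Negative values mean the candidate is faster.
    """
    b = mean(base_times)
    c = mean(cand_times)
    return (c - b) / b * 100.0
```

With three rounds, `build_schedule(3)` yields base, cand, then cand, base, then base, cand. Alternating which side goes first means any penalty or advantage tied to running second is spread across both sides instead of always landing on the candidate.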

pw4_2 agent and others added 2 commits May 14, 2026 10:21

Barnadrot merged commit f76ce46 into main May 14, 2026
2 checks passed

Barnadrot merged 2 commits into main from pw4_2-2026-05-13

Barnadrot commented May 14, 2026
