zk-autoresearch — Autonomous ZK prover research

Profile-guided experiments across multiple proving systems, using Claude as the optimization agent.

Method: For each target, an agent receives a focused program (constraints, eval gates, writable scope), proposes one change per iteration, and keeps it only if it passes both the correctness and the performance gates. All iterations are logged.
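The keep-or-discard rule in the method can be sketched as a small loop. This is a minimal illustration, not the harness's actual code: `Candidate`, `LogEntry`, `run_iteration_loop`, the `min_gain_pct` threshold, and the two callbacks are hypothetical names, and the real gates live in the per-target eval scripts.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Candidate:
    """One proposed change (hypothetical structure, for illustration only)."""
    name: str


@dataclass
class LogEntry:
    name: str
    delta_pct: float  # negative means faster than the current baseline
    decision: str     # "keep" or "discard"


def run_iteration_loop(
    candidates: Iterable[Candidate],
    is_correct: Callable[[Candidate], bool],
    measure_ms: Callable[[Candidate], float],
    baseline_ms: float,
    min_gain_pct: float = 1.0,  # assumed threshold, not taken from the repo
) -> List[LogEntry]:
    """Keep a change only if it is correct and beats the baseline by min_gain_pct."""
    log: List[LogEntry] = []
    for cand in candidates:
        if not is_correct(cand):
            # Correctness gate failed: skip the benchmark, log as discarded.
            log.append(LogEntry(cand.name, 0.0, "discard"))
            continue
        elapsed = measure_ms(cand)
        delta_pct = (elapsed - baseline_ms) / baseline_ms * 100.0
        if delta_pct <= -min_gain_pct:
            # Performance gate passed: the kept change becomes the new baseline.
            baseline_ms = elapsed
            log.append(LogEntry(cand.name, delta_pct, "keep"))
        else:
            log.append(LogEntry(cand.name, delta_pct, "discard"))
    return log
```

Every candidate produces a log entry whether or not it is kept, which mirrors the "all iterations are logged" rule. Updating the baseline after each kept change is a design assumption of this sketch.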

Hardware: Hetzner AX42-U (AMD Ryzen 7 PRO 8700GE, 8C/16T, 64GB DDR5), Hetzner CCX33 (AMD EPYC, 8C, AVX512), and AWS c7a.2xlarge (AMD EPYC Genoa, 8 vCPU, AVX512).

Results

The tables below report optimization gains measured and merged in upstream repos. If you believe any measurement methodology could be improved, please open an issue with a suggested adjustment.

Plonky3

Target: Plonky3 — ZK proving framework. Optimization target: coset_lde_batch NTT/DFT on BabyBear 2^20 × 256, Radix2DitParallel.

Experiment	Optimization	Result	Status
NTT butterfly (PR #1492)	Butterfly micro-optimizations for Radix2DitParallel	2.1%–10.4% across sizes	Merged
Bench fix (PR #1575)	`iter_batched` to exclude clone cost from DFT measurement	42% of measured time was `Vec::clone`, not FFT	Merged
AVX-512 Montgomery + butterfly (PR #1555)	`vpminud` reduction, drop `confuse_compiler`, manual unroll	~3.96% faster `coset_lde_batch` on Zen 4	Merged

leanMultisig

Target: leanMultisig — XMSS signature aggregation prover (Plonky3/WHIR-based, BabyBear field).

Experiment	Optimization	Result	Status
Inline quintic extension (PR #197)	`#[inline(always)]` on quintic field arithmetic	-3.6% on `xmss_leaf_1400sigs`	Merged
Degree-split AIR sumcheck (PR #202)	Skip partial-round constraints at high z-points	-7.64% on `fancy-aggregation` (Hetzner AX42-U)	Merged
Alloc contention + STIR tiling (PR #203)	Eliminate alloc contention, L2-tiled STIR equality	-10.3% on `fancy-aggregation` (3/4 changes merged as independent commits)	Merged
zk-alloc arena allocator (PR #205)	Bump+reset arena allocator	-27% warm proof (3.3s → 2.3s)	Merged
Sumcheck, GKR & WHIR proving optimizations (PR #235)	Bump+reset arena allocator	-6% warm proof (AVX-512) (2.3s → 2.2s)	Merged

Vortex / gnark-crypto

Target: Linea Vortex prover (KoalaBear field) and gnark-crypto (Go, upstream dependency).

Experiment	Optimization	Result (microbench, c7a.2xlarge)	Status
LinearCombination + commitment hashing (PR #2898)	MulAccByElement, eliminate copy, MDHasher buffer reuse, Compressx16 SIMD	-72% LinearCombination, -17% commitment hashing, -99.9% allocs	Merged
FFT kernels + SIS LimbIterator (gnark-crypto PR #834)	Unrolled FFT64/128 kernels, inline small-m stages, LimbIterator devirtualization	-56% SIS ns/op, -98% SIS allocs	Merged

Production infrastructure operates at significantly larger scale and was not available for end-to-end benchmarking.

Repository Structure

zk-autoresearch/
├── harness/                       Benchmark + correctness tooling per target
│   ├── plonky3/
│   │   ├── bench/                 Plonky3 benchmark crate (Poseidon1/2, Keccak)
│   │   ├── correctness/           Bitwise-identical DFT validation crate
│   │   └── scripts/               eval.sh, correctness.sh
│   ├── leanmultisig/
│   │   ├── bench/                 prove_loop + Criterion benchmarks
│   │   ├── correctness/           correctness.sh, test_integrity.sha256
│   │   └── scripts/               eval_paired.sh, eval_gate.sh, config.env, ...
│   ├── vortex/
│   │   ├── correctness/           correctness.sh
│   │   └── scripts/               eval_bench.sh, noise_floor.sh, config.env
│   └── gnark-crypto/              Placeholder (currently benchmarked via Vortex)
│
├── experiment_logs/               Audit trail — append-only, never delete
│   ├── Plonky3/NTT/              NTT/DFT optimization experiments
│   ├── leanMultisig/             Sumcheck, Poseidon, LogUp, allocator experiments
│   ├── linea/                    Vortex/KoalaBear experiments
│   └── zk-alloc/                 Arena allocator research (multi-prover)
│
├── scripts/
│   ├── setup/                    Server provisioning (server.sh, zk_alloc.sh, ...)
│   ├── run_benchmark.sh          Cross-branch Criterion comparison
│   └── watch.py                  Live experiment monitor (iters.tsv + jsonl)
│
└── .github/workflows/            CI: build harness crates, regression gates

External repos (cloned locally, gitignored)

Directory	Repo	Role
`plonky3/`	Plonky3/Plonky3	Optimization target
`leanMultisig/`	leanEthereum/leanMultisig	Optimization target
`linea-monorepo/`	Consensys/linea-monorepo	Optimization target
`gnark-crypto/`	Consensys/gnark-crypto	Optimization target (upstream dependency)
`jolt/`	a16z/jolt	Benchmarked (zk-alloc null result)
`zk-alloc/`	Barnadrot/zk-alloc	Standalone arena allocator crate
`mimalloc/`, `snmalloc/`, `glibc-malloc/`	Reference allocators	Study material
`sp1/`	succinctlabs/sp1	Study material

Running Experiments

Experiments run via the Claude Code CLI in a tmux session. Each experiment has:

program.md — Agent instructions (role, target, constraints, eval gates)
iters.tsv — Iteration log (hash, delta, decision, rationale)
Eval scripts in harness/<project>/scripts/

# Start a tmux session for the experiment
tmux new-session -s autoresearch

# Run Claude Code with the experiment program
claude --dangerously-skip-permissions \
  -p "Read experiment_logs/<project>/<experiment>/program.md and start the experiment"

Monitor from another terminal:

python3 scripts/watch.py experiment_logs/<project>/<experiment>/iters.tsv
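For a quick offline summary of a finished run, the iteration log can also be tallied with a few lines of Python. This is a sketch under stated assumptions, not how `scripts/watch.py` works: it assumes `iters.tsv` starts with a header row naming the four columns listed above (hash, delta, decision, rationale) and is tab-separated. The function name `summarize_iters` is hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed column names, taken from the iters.tsv description above.
EXPECTED_FIELDS = ["hash", "delta", "decision", "rationale"]


def summarize_iters(tsv_text: str) -> Counter:
    """Count iterations per decision value in an iters.tsv-style log."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    present = reader.fieldnames or []
    missing = [name for name in EXPECTED_FIELDS if name not in present]
    if missing:
        # Fail loudly if the file does not match the assumed layout.
        raise ValueError(f"missing columns: {missing}")
    return Counter(row["decision"] for row in reader)
```

A call such as `summarize_iters(open(path).read())` returns a `Counter` keyed by whatever decision labels the log uses. The sketch does not parse the delta column, since its exact format is not specified here.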

Development

Enable the pre-commit hook:

git config core.hooksPath .githooks

Critical: RUSTFLAGS for benchmarking

Always set RUSTFLAGS="-C target-cpu=native" (or the equivalent explicit -C target-cpu= value for your machine) when benchmarking. Without it, the build does not use AVX-512 and measurements are silently 2x slower.
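Because the failure is silent, a pre-flight check before each benchmark run can help. The sketch below is an illustration rather than part of the harness: the function names are hypothetical, it inspects only the `RUSTFLAGS` environment variable (not flags set through Cargo configuration files), and it assumes a Linux host where `/proc/cpuinfo` lists CPU features on a `flags` line.

```python
import re
from typing import List, Mapping, Optional


def target_cpu_from_rustflags(rustflags: str) -> Optional[str]:
    """Return the value given to `-C target-cpu=` in a RUSTFLAGS string, or None."""
    # Matches both "-C target-cpu=native" and "-Ctarget-cpu=native".
    match = re.search(r"-C\s*target-cpu=(\S+)", rustflags)
    return match.group(1) if match else None


def cpu_has_avx512f(cpuinfo_text: str) -> bool:
    """True if the first `flags` line of /proc/cpuinfo text lists avx512f."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags") and ":" in line:
            return "avx512f" in line.split(":", 1)[1].split()
    return False


def preflight_warnings(env: Mapping[str, str], cpuinfo_text: str) -> List[str]:
    """Collect reasons why a benchmark run might silently miss AVX-512."""
    warnings: List[str] = []
    if target_cpu_from_rustflags(env.get("RUSTFLAGS", "")) is None:
        warnings.append("RUSTFLAGS does not set -C target-cpu")
    if not cpu_has_avx512f(cpuinfo_text):
        warnings.append("CPU does not report the avx512f flag")
    return warnings
```

A wrapper script could call `preflight_warnings(os.environ, open("/proc/cpuinfo").read())` and refuse to benchmark if the returned list is non-empty.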

Prior Art

Karpathy's autoresearch pattern: LLM + benchmark feedback loop for nanoGPT kernel optimization
Gassmann et al. (2025): autotuned LLVM flags for SP1/RISC Zero → ~17% improvement
Gap this fills: source-level autoresearch on a production ZK prover (first known application)

Inspired by Karpathy's autoresearch pattern. First known application to a production ZK prover.
