Hardware-accelerated proof generation in Zerokit via ICICLE #386

vinhtc27 · 2026-02-24T14:45:26Z

Introduction

A month ago I opened an experimental PR, on the gpu-acceleration-using-icicle branch, that uses ICICLE for hardware-accelerated proof generation in Zerokit. This document explains the technical details of that integration and its benchmark results.

TL;DR: The benchmarks show that the ICICLE integration does not yield a significant speedup in proof generation time over the existing CPU-based implementation. This report documents the findings so that the effort is not duplicated and focus can shift to other optimization avenues.

Overview

Currently, Zerokit uses Groth16 over the BN254 curve. Given a proving key and a witness, the prover computes three group elements (A, B, C) that together form the proof. Each element is assembled from a mix of scalar-point multiplications, blinding terms, and polynomial commitments, but the dominant cost in all three comes from Multi-Scalar Multiplication (MSM).

A single Groth16 proof requires 5 MSM operations, which account for roughly 70-80% of total proving time. The full cost breaks down as:

~70-80%: Multi-Scalar Multiplication across 5 operations
~15-25%: QAP polynomial evaluation via 4 FFTs, sequential on CPU
~5%: Witness calculation, proof assembly, and blinding
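
Since only the MSM share is offloaded, this breakdown already bounds what any accelerator can deliver. A minimal sketch using Amdahl's law, taking the 70-80% MSM share above as the accelerated fraction: even an infinitely fast MSM caps the whole-proof speedup at about 3.3x to 5x. The function names are illustrative, and any concrete MSM speedup factor passed in (such as 10x) is a hypothetical input, not a measurement.

```rust
/// Whole-proof speedup when a fraction `p` of the runtime is accelerated
/// by a factor `s` and the rest is unchanged (Amdahl's law).
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    1.0 / ((1.0 - p) + p / s)
}

/// Limit of `amdahl_speedup` as the accelerated part becomes free.
fn amdahl_ceiling(p: f64) -> f64 {
    1.0 / (1.0 - p)
}
```

With p = 0.7 the ceiling is about 3.3x and with p = 0.8 it is 5x. A hypothetical 10x faster MSM would give roughly 2.7x and 3.6x respectively, before counting any conversion or transfer overhead.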

Implementation

ICICLE

ICICLE is a GPU-accelerated library by Ingonyama designed for ZK primitives like MSM and NTT. It supports CUDA and Metal backends selected at runtime, with a CPU fallback. However, the CPU fallback is slower than the standard arkworks prover with Zerokit's existing parallel feature flag enabled. This means the ICICLE path is only beneficial when a real GPU is available; otherwise, the standard arkworks prover is the better choice.

How the integration works

All code changes are in the icicle folder. It acts as a plug-and-play replacement for the standard proof generation path (same inputs, same outputs, identical proofs). The only difference is that the 5 MSM operations are dispatched to ICICLE instead of arkworks. Everything else, including witness calculation, QAP polynomial evaluation, and final proof assembly, remains unchanged on CPU.

Type conversion (icicle/convert.rs): Arkworks and ICICLE represent field elements and curve points differently in memory. This layer converts between them using little-endian byte serialization. The conversion runs on every MSM call, which makes it one of the known sources of overhead.

MSM wrappers (icicle/msm.rs): Two functions, one for G1 and one for G2, that convert the inputs to ICICLE format, run the MSM on the GPU, and convert the result back to arkworks types.

Proof construction (icicle/proof.rs): This module reimplements the Groth16 prover, replacing all 5 MSM calls with the ICICLE wrappers described above. It takes the same inputs and produces the same proof. Only the MSMs go through ICICLE; everything else remains on CPU.
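
The convert, dispatch, convert-back pattern described above can be sketched in a few lines. This is a toy model under loud assumptions, not the code in the icicle folder: the "group" is the integers modulo a small prime under addition rather than BN254, and `MsmBackend`, `ToyBackend`, `msm_via_backend`, and `msm_reference` are hypothetical names that do not come from Zerokit, arkworks, or ICICLE. It only illustrates that both conversions are paid on every MSM call and that the wrapped path must agree with the host-only path.

```rust
// Toy model only: the "group" is the integers modulo a small prime under
// addition, not BN254, and every name below is hypothetical.

/// Modulus of the stand-in group.
const Q: u64 = 1_000_000_007;

/// A backend that only accepts little-endian byte encodings, standing in
/// for a library with its own in-memory representation.
trait MsmBackend {
    fn msm(&self, scalars: &[[u8; 8]], points: &[[u8; 8]]) -> [u8; 8];
}

/// CPU stand-in for the accelerated backend: sum(s_i * P_i) mod Q.
struct ToyBackend;

impl MsmBackend for ToyBackend {
    fn msm(&self, scalars: &[[u8; 8]], points: &[[u8; 8]]) -> [u8; 8] {
        let q = Q as u128;
        let mut acc: u128 = 0;
        for (s, p) in scalars.iter().zip(points.iter()) {
            let s = u64::from_le_bytes(*s) as u128 % q;
            let p = u64::from_le_bytes(*p) as u128 % q;
            acc = (acc + s * p) % q;
        }
        (acc as u64).to_le_bytes()
    }
}

/// Wrapper with host-side types in and out. Both conversions run on every
/// call, which is where the per-MSM conversion overhead comes from.
fn msm_via_backend<B: MsmBackend>(backend: &B, scalars: &[u64], points: &[u64]) -> u64 {
    assert_eq!(scalars.len(), points.len(), "one scalar per point");
    let s_bytes: Vec<[u8; 8]> = scalars.iter().map(|s| s.to_le_bytes()).collect();
    let p_bytes: Vec<[u8; 8]> = points.iter().map(|p| p.to_le_bytes()).collect();
    u64::from_le_bytes(backend.msm(&s_bytes, &p_bytes))
}

/// Host-only reference path, used to check that both paths agree.
fn msm_reference(scalars: &[u64], points: &[u64]) -> u64 {
    let q = Q as u128;
    let sum = scalars
        .iter()
        .zip(points.iter())
        .fold(0u128, |acc, (s, p)| (acc + (*s as u128 % q) * (*p as u128 % q)) % q);
    sum as u64
}
```

Calling `msm_via_backend(&ToyBackend, &scalars, &points)` and `msm_reference(&scalars, &points)` on the same inputs should return the same value, which mirrors the "same inputs, same outputs" property the integration aims for.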

Benchmarks

Benchmarks were conducted on two machines:

Small machine: Intel Core i5-10400F, RTX 2060 Super
Big machine: 64-core server, no GPU

All benchmarks use criterion. The initial measurements come from a sequential proving benchmark; an async benchmark was added later to test throughput under parallel workloads.

Sequential benchmark: proof_benchmark.rs
Async benchmark: async_proof_benchmark.rs
Full benchmark reports and logs: Discord thread

Sequential proving (single core)

Machine	Backend	Avg proof time
Small	Standard	386.62 ms
Small	ICICLE CPU	193.06 ms
Small	ICICLE CPU (pinned to 1 core)	701.70 ms
Small	ICICLE CUDA (pinned to 1 core)	156.02 ms

The initial ICICLE CPU result of 193 ms appeared promising, roughly 2x faster than standard. However, pinning the process to a single core revealed that the speedup came entirely from ICICLE using additional CPU cores, not from any algorithmic improvement. On a single core, ICICLE CPU is ~1.8x slower than standard arkworks.

With the CUDA backend pinned to a single core, proof time dropped to 156 ms. This is about 2.5x faster than the standard single-core result and even slightly faster than the standard prover in parallel mode (161.79 ms).
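
The ratios quoted in this section can be reproduced from the table with a few divisions. The sketch below only restates the reported averages; the constant and function names are mine.

```rust
// Average proof times on the small machine, copied from the table above.
const STANDARD_MS: f64 = 386.62;
const ICICLE_CPU_MS: f64 = 193.06;
const ICICLE_CPU_PINNED_MS: f64 = 701.70;
const ICICLE_CUDA_PINNED_MS: f64 = 156.02;

/// How many times longer `slower_ms` takes than `faster_ms`.
fn ratio(slower_ms: f64, faster_ms: f64) -> f64 {
    slower_ms / faster_ms
}
```

`ratio(STANDARD_MS, ICICLE_CPU_MS)` is about 2.0 (the apparent unpinned speedup), `ratio(ICICLE_CPU_PINNED_MS, STANDARD_MS)` is about 1.8 (the pinned slowdown), and `ratio(STANDARD_MS, ICICLE_CUDA_PINNED_MS)` is about 2.5 (the CUDA gain over single-core standard).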

Sequential proving (parallel mode)

Machine	Backend	Avg proof time
Small	Standard (parallel)	161.79 ms
Small	ICICLE CPU (parallel)	248.90 ms

When Zerokit's normal parallel mode is enabled, the standard arkworks prover already uses all available cores via rayon. In this configuration, ICICLE CPU is slower because it does not benefit from rayon and competes for the same cores. The same result was observed on the big machine.

Async throughput (parallel workers)

Backend	Avg batch time	Throughput
Standard	1.55 s	15.47 proofs/s
ICICLE CUDA	1.68 s	14.28 proofs/s

Under a parallel workload with multiple workers, the GPU path performed worse than CPU-only. Although each individual GPU proof is faster, GPU proofs are processed sequentially because only one MSM can run on the GPU at a time. Meanwhile, the data transfer between CPU and GPU consumes bandwidth and degrades performance for the other CPU workers, lowering overall throughput.

Rayon worker optimizations used in the prover service only improved the standard path further, since those optimizations favor heavy CPU computation. The ICICLE path is dominated by GPU I/O and latency rather than raw computational throughput.
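
The two columns of the throughput table are tied together by throughput = batch size / batch time. The batch size is not stated in this report, so the sketch below infers it from the reported numbers; treat the resulting figure of about 24 proofs per batch as an inference, not a documented parameter. Function names are illustrative.

```rust
/// Batch size implied by a reported batch time and throughput.
fn implied_batch_size(batch_secs: f64, proofs_per_sec: f64) -> f64 {
    batch_secs * proofs_per_sec
}

/// Throughput of `other` relative to `baseline` (1.0 means equal).
fn relative_throughput(other: f64, baseline: f64) -> f64 {
    other / baseline
}
```

Both rows give an implied batch size that rounds to 24 (1.55 s x 15.47 proofs/s and 1.68 s x 14.28 proofs/s), and `relative_throughput(14.28, 15.47)` is about 0.92, so the CUDA path delivers roughly 8% lower throughput than the standard path in this test.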

Conclusion

Considering the cost-benefit tradeoff, the GPU optimization is not worthwhile. GPU workloads are harder to scale and significantly more expensive to rent, while adding more CPU cores or machines is cheaper and delivers better overall performance.

Future optimization efforts should focus on other avenues such as circuit-level improvements, witness generation, or alternative proving systems, rather than GPU offloading.
