Introduction
About a month ago I opened an experimental PR that uses ICICLE for hardware acceleration of proof generation in Zerokit. This document explains the technical details and benchmark results of that integration (branch: gpu-acceleration-using-icicle).
TL;DR: Benchmark results show that the ICICLE integration does not yield a significant speedup in proof generation time compared to the existing CPU-based implementation. This report documents those findings to prevent duplicated effort in this direction and to redirect focus toward other optimization avenues.
Overview
Currently, Zerokit uses Groth16 over the BN254 curve. Given a proving key and a witness, the prover computes three group elements (A, B, C) that together form the proof. Each element is assembled from a mix of scalar-point multiplications, blinding terms, and polynomial commitments, but the dominant cost in all three comes from Multi-Scalar Multiplication (MSM).
A single Groth16 proof requires 5 MSM operations, which account for roughly 70-80% of total proving time. The full cost breaks down as:
~70-80%: Multi-Scalar Multiplication across 5 operations
~15-25%: QAP polynomial evaluation via 4 FFTs, sequential on CPU
~5%: Witness calculation, proof assembly, and blinding
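The breakdown above also bounds what offloading only the MSMs can achieve. A minimal sketch using Amdahl's law follows; the MSM share comes from the breakdown, while the per-MSM speedup factors are hypothetical values chosen for illustration, and the model ignores conversion and transfer overhead.

```rust
/// Overall speedup when a fraction `p` of the runtime is accelerated by a
/// factor `s` (Amdahl's law): 1 / ((1 - p) + p / s).
fn amdahl_speedup(p: f64, s: f64) -> f64 {
    assert!((0.0..=1.0).contains(&p) && s > 0.0);
    1.0 / ((1.0 - p) + p / s)
}

fn main() {
    // MSM share of proving time, taken from the 70-80% range in the breakdown.
    for p in [0.70, 0.80] {
        // Hypothetical MSM speedups; INFINITY models a "free" MSM.
        for s in [2.0, 10.0, f64::INFINITY] {
            println!(
                "MSM share {:.0}%, MSM speedup {}x -> overall {:.2}x",
                p * 100.0,
                s,
                amdahl_speedup(p, s)
            );
        }
    }
}
```

Even with an infinitely fast MSM, the overall gain is capped at roughly 3.3x (70% share) to 5x (80% share), because the FFTs, witness calculation, and proof assembly stay on the CPU.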
Implementation
ICICLE
ICICLE is a GPU-accelerated library by Ingonyama designed for ZK primitives like MSM and NTT. It supports CUDA and Metal backends selected at runtime, with a CPU fallback. However, the CPU fallback is slower than the standard arkworks prover built with Zerokit's existing parallel feature flag. The ICICLE path is therefore only beneficial when a real GPU is available; otherwise, the standard arkworks prover is the better choice.
How the integration works
All code changes live in the icicle folder. It acts as a drop-in replacement for the standard proof generation path (same inputs, same outputs, identical proofs). The only difference is that the 5 MSM operations are dispatched to ICICLE instead of arkworks. Everything else, including witness calculation, QAP polynomial evaluation, and final proof assembly, remains unchanged on CPU.
Type conversion (icicle/convert.rs): Arkworks and ICICLE represent field elements and curve points differently in memory. This layer converts between them using little-endian byte serialization. The conversion runs on every MSM call, making it one of the known sources of overhead.
MSM wrappers (icicle/msm.rs): Two functions, one for G1 and one for G2, that convert inputs to ICICLE format, run the MSM through ICICLE, and convert the result back to arkworks types.
Proof construction (icicle/proof.rs): This module reimplements the Groth16 prover, replacing all 5 MSM calls with the ICICLE wrappers described above. It takes the same inputs and produces the same proof.
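The convert, compute, convert-back shape of the wrappers can be illustrated with a toy model. This is only a sketch under loud assumptions: it does not use the arkworks or ICICLE APIs, every name in it (HostElem, BackendElem, backend_msm, msm_wrapper) is hypothetical, and the "group" is integers modulo a small prime rather than BN254 G1/G2 points.

```rust
// Toy group: integers modulo a prime under addition, so "scalar * point" is
// modular multiplication. Real code operates on BN254 curve points.
const MODULUS: u64 = 2_147_483_647; // 2^31 - 1, keeps products inside u64

/// Host-side value (plays the role of an arkworks type). Hypothetical.
#[derive(Clone, Copy, Debug, PartialEq)]
struct HostElem(u64);

/// Backend-side value (plays the role of an ICICLE type): little-endian bytes.
#[derive(Clone, Copy, Debug, PartialEq)]
struct BackendElem([u8; 8]);

fn to_backend(x: HostElem) -> BackendElem {
    BackendElem(x.0.to_le_bytes())
}

fn from_backend(x: BackendElem) -> HostElem {
    HostElem(u64::from_le_bytes(x.0))
}

/// Stand-in for the accelerated MSM: sum_i scalar_i * point_i in the toy group.
fn backend_msm(scalars: &[BackendElem], points: &[BackendElem]) -> BackendElem {
    let acc = scalars.iter().zip(points).fold(0u64, |acc, (s, p)| {
        let s = u64::from_le_bytes(s.0) % MODULUS;
        let p = u64::from_le_bytes(p.0) % MODULUS;
        (acc + s * p % MODULUS) % MODULUS
    });
    BackendElem(acc.to_le_bytes())
}

/// Wrapper with the same shape as the integration: convert in, run, convert out.
fn msm_wrapper(scalars: &[HostElem], points: &[HostElem]) -> HostElem {
    assert_eq!(scalars.len(), points.len(), "MSM inputs must have equal length");
    // Conversion happens on every call; this is the per-call overhead noted above.
    let s: Vec<BackendElem> = scalars.iter().copied().map(to_backend).collect();
    let p: Vec<BackendElem> = points.iter().copied().map(to_backend).collect();
    from_backend(backend_msm(&s, &p))
}

fn main() {
    let scalars = [HostElem(3), HostElem(5), HostElem(7)];
    let points = [HostElem(11), HostElem(13), HostElem(17)];
    // 3*11 + 5*13 + 7*17 = 217
    println!("{:?}", msm_wrapper(&scalars, &points));
}
```

Because the caller only ever sees host-side types, the prover logic around the MSM calls does not need to change, which is what lets the real integration keep the same inputs and outputs.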
Benchmarks
Benchmarks were conducted on two machines:
Small machine: Intel Core i5-10400F, RTX 2060 Super
Big machine: 64-core server, no GPU
All measurements were taken with criterion using a sequential proving benchmark. An async benchmark was added later to test throughput under parallel workloads.
Sequential proving (single core)
The initial ICICLE CPU result of 193 ms appeared promising, roughly 2x faster than standard. However, pinning the process to a single core revealed that the speedup came entirely from ICICLE using additional CPU cores, not from any algorithmic improvement. On a single core, ICICLE CPU is ~1.8x slower than standard arkworks.
With the CUDA backend on a single core, proof time dropped to 156 ms. This is faster than the standard single-core result and even slightly faster than running in parallel mode.
Sequential proving (parallel mode)
| Machine | Backend | Avg proof time |
| --- | --- | --- |
| Small | Standard (parallel) | 161.79 ms |
| Small | ICICLE CPU (parallel) | 248.90 ms |
When Zerokit's normal parallel mode is enabled, the standard arkworks prover already uses all available cores via rayon. In this configuration, ICICLE CPU is slower because it does not benefit from rayon and competes for the same cores. The same result was observed on the big machine.
Async throughput (parallel workers)
| Backend | Avg batch time | Throughput |
| --- | --- | --- |
| Standard | 1.55 s | 15.47 proofs/s |
| ICICLE CUDA | 1.68 s | 14.28 proofs/s |
Under a parallel workload with multiple workers, the GPU path performed worse than CPU-only. Although each individual GPU proof is faster, GPU proofs are effectively processed sequentially because only one MSM can run on the GPU at a time. Meanwhile, data transfers between CPU and GPU consume bandwidth and slow down the other CPU workers, lowering overall throughput.
Rayon worker optimizations used in the prover service only improved the standard path further, since those optimizations favor heavy CPU computation. The ICICLE path is dominated by GPU I/O and response-time latency rather than raw computational throughput.
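As a sanity check on the async throughput figures, the reported numbers can be related with simple arithmetic. The sketch below is illustrative only: the batch size is not stated in this report, so the value it prints is inferred from throughput times batch time and should be treated as an assumption.

```rust
/// Proofs per batch implied by a throughput (proofs/s) and a batch time (s).
fn implied_batch_size(throughput: f64, batch_secs: f64) -> f64 {
    throughput * batch_secs
}

/// Relative change of `candidate` versus `baseline`, in percent.
fn relative_change_pct(baseline: f64, candidate: f64) -> f64 {
    (candidate - baseline) / baseline * 100.0
}

fn main() {
    // Figures copied from the async throughput results.
    let (std_secs, std_tp) = (1.55, 15.47);
    let (gpu_secs, gpu_tp) = (1.68, 14.28);

    println!(
        "implied batch size: standard {:.1}, ICICLE CUDA {:.1}",
        implied_batch_size(std_tp, std_secs),
        implied_batch_size(gpu_tp, gpu_secs)
    );
    println!(
        "throughput change with ICICLE CUDA: {:.1}%",
        relative_change_pct(std_tp, gpu_tp)
    );
}
```

Both rows imply the same batch size (about 24 proofs), so the two runs appear comparable, and the CUDA path delivers roughly 7.7% lower throughput than the standard path.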
Conclusion
Considering the cost-benefit tradeoff, the GPU optimization is not worthwhile. GPU workloads are harder to scale and significantly more expensive to rent, while adding more CPU cores or machines is cheaper and delivers better overall performance.
Future optimization efforts should focus on other avenues such as circuit-level improvements, witness generation, or alternative proving systems, rather than GPU offloading.