MSM optimisations: CycloneMSM #130
Conversation
Can you use something like this to get around the
@jonathanpwang It would work for curves in
ah, it is always pasta curves that's the problem... You could have a default implementation of
LGTM! Found 2 low-hanging fruits that speed things up another ~10% on my machine (i9-13900K, 24 cores).
Also it might be worth mentioning that if the same point appears multiple times in bases, there is a high probability of a panic in batch_add. Perhaps we could fall back to adding into the jacobian bucket when batch_add fails, or we could leave it to the user to make sure the bases don't contain the same point multiple times (though there would still be a chance of a panic).
```rust
}

fn contains(&self, buck_idx: usize) -> bool {
    self.set.iter().any(|sch| sch.buck_idx == buck_idx)
```
It seems `buck_idx == 0` would never be added because the default `buck_idx` of `SchedulePoint` is also 0.

I tried to set it to the max value to avoid this, like:

```rust
impl Default for SchedulePoint {
    fn default() -> Self {
        Self {
            base_idx: 0,
            buck_idx: usize::MAX,
            sign: false,
        }
    }
}
```

but it actually doesn't affect the performance.
I think it never hits `buck_idx == 0`, since in the main loop we check `if buck_idx != 0 { ...`. So do you still think I should make that change?
Hmm, but it seems we set it to `buck_idx - 1` after we've checked it's non-zero, here: https://github.com/privacy-scaling-explorations/halo2curves/pull/130/files#diff-ebe254da862cf489fe020d422527386871313f19c242195dfd55c4f1ac06b6e5R389
I think it's fine to keep it as is, since I didn't observe a performance difference.
@han0110 do you think we should keep the old implementation and prefix the new one as `independent_msm` or something like it?
han0110 left a comment:
> do you think we should keep old implementation and prefix new one as independent_msm or something like it?
Sounds like a good idea, just in case in certain environments the older one is better.
What do you mean by environment, x86 vs ARM?
```rust
    *t = acc * (buckets[*buck_idx].y() - bases[*base_idx].y);
} else {
    *t = acc * (buckets[*buck_idx].y() + bases[*base_idx].y);
}
```
Does this handle the case Q = -P or P = infinity?
No, it does not; it expects independent points. I agree that we should change the name.
```rust
    acc *= *z;
}

acc = acc.invert().unwrap();
```
If it doesn't, we might want to change the name to `batch_add_nonexceptional`; otherwise we will likely divide by 0 and propagate 0 everywhere.
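To make the failure mode concrete, here is a toy sketch (toy prime field, made-up values, none of it halo2curves code) of why a single exceptional pair poisons the whole batch: the shared accumulator is the product of all denominators, so one pair with equal x-coordinates (Q = ±P, or a point at infinity represented as a duplicate) zeroes the product, and the one inversion the batch relies on fails.

```rust
// Toy prime field, purely for illustration.
const P: u64 = 97;

fn sub_mod(a: u64, b: u64) -> u64 {
    (a + P - b) % P
}

fn main() {
    // x-coordinates of (bucket, base) pairs for three pending additions;
    // the second pair is exceptional (same x, i.e. Q = +-P).
    let pairs = [(3u64, 10u64), (42, 42), (7, 5)];

    // Accumulate the product of all denominators (x2 - x1), as a
    // batched-inversion scheme does before the single field inversion.
    let mut acc = 1u64;
    for (x1, x2) in pairs {
        acc = acc * sub_mod(x2, x1) % P;
    }

    // acc == 0: the shared inverse does not exist, so an unwrap on the
    // failed inversion would panic, taking down the entire batch.
    assert_eq!(acc, 0);
    println!("accumulated denominator product: {acc}");
}
```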
```rust
    acc += bases[coeff_idx];
```

```rust
pub fn best_multiexp<C: CurveAffine>(coeffs: &[C::Scalar], bases: &[C]) -> C::Curve {
    // TODO: consider adjusting it with emprical data?
    let batch_size = 64;
```
This is likely to be too small for large batches.

Assuming `c` with the number of buckets being 2^(c-1), I use the formula 4c² - 16c - 128:

- https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L260-L263
- For 2^10 = 1024 points, c = 9 (in Constantine, the formula for c gives a result lower than the logarithm, to account for memory bandwidth limits): this is a batch size of 52
- For 2^26 = 67M points, c = 16: batch size of 640
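For reference, the quoted heuristic can be written as a tiny function; the lower floor of 16 below is my own assumption for small `c` (where the quadratic goes negative), not part of the Constantine formula:

```rust
// Batch-size heuristic 4c^2 - 16c - 128 from the comment above.
// The floor of 16 for small window sizes is an assumption.
fn batch_size(c: u32) -> u32 {
    let c = c as i64;
    let size = 4 * c * c - 16 * c - 128;
    size.max(16) as u32
}

fn main() {
    assert_eq!(batch_size(9), 52); // ~2^10 points
    assert_eq!(batch_size(16), 640); // ~2^26 points
    println!("c=9 -> {}, c=16 -> {}", batch_size(9), batch_size(16));
}
```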
4c² - 16c - 128 makes this implementation much slower. I think it is because our parallelizing methods are different?
With k = 20:

- `batch_size = 4c² - 16c - 128` takes 618.459ms
- `batch_size = 64` takes 887.797ms
Are the numbers swapped? Because 618.459ms is 69.7% of 887.797ms, hence that compute is faster.
Yes, I copy-pasted them wrongly.
```rust
let bases_local: Vec<_> = bases.par_iter().map(Affine::from).collect();

// number of windows
let number_of_windows = C::Scalar::NUM_BITS as usize / c + 1;
```
Parallelism for a 256-bit scalar will be limited to 256 / 16 = 16 cores here. We need another layer of parallelism, which can easily be added by having schedulers be responsible for a bucket range.
This can be added as a refinement, see https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L243-L250

```nim
Scheduler*[NumNZBuckets, QueueLen: static int, EC, ECaff] = object
  points: ptr UncheckedArray[ECaff]
  buckets*: ptr Buckets[NumNZBuckets, EC, ECaff]
  start, stopEx: int32 # Bucket range
  numScheduled, numCollisions: int32
  collisionsMap: BigInt[NumNZBuckets] # We use a BigInt as a bitmap, when all you have is an axe ...
  queue: array[QueueLen, ScheduledPoint]
  collisions: array[QueueLen, ScheduledPoint]
```

Note the `start, stopEx: int32 # Bucket range` fields. You can then create as many schedulers as there are cores; because the number of buckets is 2^(c-1), so often in the thousands to millions (https://github.com/mratsim/constantine/blob/58d8d2c/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim#L115-L133) even with conservative c, this can benefit from systems with hundreds of cores.

Then we only need to change the `if buck_idx != 0 {` to `if scheduler.start < buck_idx < scheduler.stop {`.

This will make each thread read the full data, but these are linear reads so speed is OK.
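A minimal sketch of the bucket-range idea, with hypothetical names (`owns`, per-thread `start`/`stop`) rather than the PR's actual API: every thread scans the full stream of bucket indices linearly but only schedules the ones inside its own half-open range.

```rust
// Hypothetical guard: thread owns half-open bucket range [start, stop),
// and bucket 0 is always skipped (it is the "zero digit" bucket).
fn owns(start: usize, stop: usize, buck_idx: usize) -> bool {
    buck_idx != 0 && start <= buck_idx && buck_idx < stop
}

fn main() {
    let num_buckets = 1usize << (9 - 1); // 2^(c-1) buckets with c = 9
    let num_threads = 4;
    let chunk = (num_buckets + num_threads - 1) / num_threads;

    // Illustrative bucket indices produced by booth decoding.
    let buck_indices = [0usize, 3, 70, 130, 255];

    // Each thread reads the full data; only its own range is scheduled.
    for t in 0..num_threads {
        let (start, stop) = (t * chunk, ((t + 1) * chunk).min(num_buckets));
        let mine: Vec<usize> = buck_indices
            .iter()
            .copied()
            .filter(|&b| owns(start, stop, b))
            .collect();
        println!("thread {t}: buckets {start}..{stop} -> {mine:?}");
    }
}
```

In real code each thread would run this filter inside the main scheduling loop (replacing the plain `if buck_idx != 0` check), so no extra pass over the data is needed.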
@mratsim I tried to implement this strategy, but the results are not good and are close to the serial implementation. Probably I'm missing something. I didn't want to pollute this PR, so you can give some feedback over at kilic/cyclone-msm-expreriment#1
```rust
// jacobian buckets for already scheduled points
let mut j_bucks = vec![Bucket::<C>::None; 1 << (c - 1)];

// schedular for affine addition
```
In my benchmarks, affine is only worth it starting from c = 9 or ~1024 points.
```rust
for (base_idx, coeff) in coeffs.iter().enumerate() {
    let buck_idx = get_booth_index(w, c, coeff.as_ref());

    if buck_idx != 0 {
```
Change here to `if scheduler.start < buck_idx < scheduler.stop {`, have as many schedulers as there are threads, and partition the start/stop.
I mean, in certain environments we might not want the possibility for the msm to panic.
Co-authored-by: Han <tinghan0110@gmail.com>
postfix new one as `_independent_points`
Ready to merge?
Recently I've been thinking about optimizations for commitments to small values, and I just found a case where the old implementation is faster. My machine is an AMD Ryzen 5 3600 6-Core Processor. I ran the same benchmark but the scalars were only 1-bit values (this would be the case for committing selector columns, for example). This is the patch I applied:

```diff
--- a/src/msm.rs
+++ b/src/msm.rs
@@ -481,7 +481,7 @@ mod test {
     use ff::{Field, PrimeField};
     use group::{Curve, Group};
     use pasta_curves::arithmetic::CurveAffine;
-    use rand_core::OsRng;
+    use rand_core::{OsRng, RngCore};

     #[test]
     fn test_booth_encoding() {
@@ -537,8 +537,10 @@ mod test {
         C::Curve::batch_normalize(&points[..], &mut affine_points[..]);
         let points = affine_points;

+        let bits = 1;
+        let max_val = 2u64.pow(bits);
         let scalars = (0..1 << max_k)
-            .map(|_| C::Scalar::random(OsRng))
+            .map(|_| C::Scalar::from(OsRng.next_u64() % max_val))
             .collect::<Vec<_>>();

         for k in min_k..=max_k {
```

For reference, here is the benchmark on my machine with the original test:
* impl msm with batch addition

* bring back multiexp serial

* parallelize coeffs to repr

Co-authored-by: Han <tinghan0110@gmail.com>

* parallelize bases to affine

Co-authored-by: Han <tinghan0110@gmail.com>

* add missing dependency

* bring back old implementation, postfix new one as `_independent_points`

---------

Co-authored-by: Han <tinghan0110@gmail.com>
- Leverage cyclone msm privacy-ethereum/halo2curves#130
- Leverage improved FFT implementations
- Much improved parallelism for mv-lookup and permutation commitment calcs
- ASM in h2curves

Results: 30-80% reduction in proving time for benchmark circuits
Batch affine addition with the scheduler approach as in CycloneMSM is implemented. This PR should also cover some of the optimisation suggestions here.
Batch addition looks like the below, for example for 16 additions.
In affine coordinates we use a shared inversion for all add operations, so that batch affine addition becomes cheaper than jacobian addition. In the scheduler technique we update multiple buckets at once. Naturally, the same bucket shouldn't be selected twice within a `BATCH_SIZE` range, in order not to miss an update. So if we hit the same bucket index, we delay that particular update.

Other notes about the implementation:

* Coefficients are split into `number_of_threads` chunks. This approach also appears in gnark-crypto and ashWhiteHat-PR.
* Within a `batch_size` window, if a bucket is not already selected by a slice of scalar (booth index), we add the point into the affine scheduler. Otherwise the point goes to the corresponding jacobian bucket.
* We use the `Coordinates` API to keep the current msm API and make it also work for pasta_curves. It means bases go through an `Affine` struct, and for each base point a useless `is_on_curve` check runs; an `is_on_curve` check also runs in the aggregation phase for each affine bucket.

If one wants to play with the implementation in a place that is more experimental, I'd recommend https://github.com/kilic/cyclone-msm-expreriment
With benchmarks on M1, it seems like we achieve a 30-40% gain even with the coordinates API. We might also want to file an RFC for zkcrypto for cheaper access to the coordinates of `CurveAffine` without the `is_on_curve` check, in order to get ~20% more gain. Leaving the related issue and PR at the zkcrypto side.
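For readers unfamiliar with the shared-inversion trick mentioned above, here is a self-contained sketch of Montgomery batch inversion over a toy prime field (not the real halo2curves field types): n inverses for the price of a single field inversion plus O(n) multiplications, which is what makes batched affine addition cheaper than jacobian addition.

```rust
// Toy prime field, for illustration only.
const P: u64 = 101;

fn mul(a: u64, b: u64) -> u64 {
    a * b % P
}

// Single inversion via Fermat's little theorem: a^(p-2) mod p.
fn inv(a: u64) -> u64 {
    let (mut base, mut e, mut r) = (a % P, P - 2, 1u64);
    while e > 0 {
        if e & 1 == 1 {
            r = mul(r, base);
        }
        base = mul(base, base);
        e >>= 1;
    }
    r
}

// Invert every element using ONE field inversion total.
fn batch_invert(xs: &[u64]) -> Vec<u64> {
    // Forward pass: prefix[i] = product of xs[0..i].
    let mut prefix = Vec::with_capacity(xs.len());
    let mut acc = 1u64;
    for &x in xs {
        prefix.push(acc);
        acc = mul(acc, x);
    }
    // The single shared inversion of the full product.
    let mut acc_inv = inv(acc);
    // Backward pass: peel off one element at a time.
    let mut out = vec![0u64; xs.len()];
    for i in (0..xs.len()).rev() {
        out[i] = mul(acc_inv, prefix[i]);
        acc_inv = mul(acc_inv, xs[i]);
    }
    out
}

fn main() {
    let xs = [2u64, 3, 5, 7];
    let invs = batch_invert(&xs);
    for (&x, &xi) in xs.iter().zip(&invs) {
        assert_eq!(mul(x, xi), 1); // each element times its inverse is 1
    }
    println!("{invs:?}");
}
```

The real implementation applies the same forward/backward accumulator passes to the `x2 - x1` denominators of a batch of affine additions, which is why a single zero denominator (duplicate point) breaks the whole batch.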