Improve Mulmod perf #80

benaadams · 2025-12-30T13:33:00Z

Reorganized UInt256 implementation by splitting into multiple partial class files (core logic, operators, constructors, and conversions)
Modified the scalar multiplication implementation (MultiplyScalar) to use a more efficient algorithm
Added exception handling for division by zero in modular arithmetic operations
Improved performance of MulMod (still needs more work for the cases where 512-bit multiplies are done)

Method	Param	Old	New	Improvement
MultiplyMod_UInt256	2,1,2 bits	10.836 ns	3.559 ns	3.04x
MultiplyMod_UInt256	2,64,2 bits	21.067 ns	5.194 ns	4.06x
MultiplyMod_UInt256	2,128,2 bits	43.343 ns	4.245 ns	10.21x
MultiplyMod_UInt256	2,192,2 bits	45.316 ns	5.094 ns	8.90x
MultiplyMod_UInt256	2,256,2 bits	49.468 ns	4.512 ns	10.96x
MultiplyMod_UInt256	64,1,2 bits	11.527 ns	4.405 ns	2.62x
MultiplyMod_UInt256	64,64,2 bits	21.644 ns	5.346 ns	4.05x
MultiplyMod_UInt256	128,1,2 bits	39.575 ns	6.071 ns	6.52x
MultiplyMod_UInt256	128,64,2 bits	44.438 ns	4.602 ns	9.66x
MultiplyMod_UInt256	192,1,2 bits	43.550 ns	6.915 ns	6.30x
MultiplyMod_UInt256	192,64,2 bits	47.299 ns	5.752 ns	8.22x
MultiplyMod_UInt256	256,1,2 bits	43.148 ns	8.713 ns	4.95x
MultiplyMod_UInt256	256,64,2 bits	49.598 ns	5.332 ns	9.30x
MultiplyMod_UInt256	256,128,2 bits	53.207 ns	4.592 ns	11.59x
MultiplyMod_UInt256	256,192,2 bits	54.829 ns	5.543 ns	9.89x
MultiplyMod_UInt256	256,256,2 bits	58.823 ns	4.860 ns	12.10x
MultiplyMod_UInt256	2,1,64 bits	10.262 ns	2.444 ns	4.20x
MultiplyMod_UInt256	2,64,64 bits	20.997 ns	5.588 ns	3.76x
MultiplyMod_UInt256	2,128,64 bits	41.648 ns	9.767 ns	4.26x
MultiplyMod_UInt256	2,192,64 bits	46.161 ns	10.232 ns	4.51x
MultiplyMod_UInt256	2,256,64 bits	49.688 ns	10.083 ns	4.93x
MultiplyMod_UInt256	64,1,64 bits	10.679 ns	3.667 ns	2.91x
MultiplyMod_UInt256	64,64,64 bits	20.646 ns	6.916 ns	2.99x
MultiplyMod_UInt256	128,1,64 bits	36.752 ns	4.957 ns	7.41x
MultiplyMod_UInt256	128,64,64 bits	42.562 ns	9.604 ns	4.43x
MultiplyMod_UInt256	192,1,64 bits	40.934 ns	5.890 ns	6.95x
MultiplyMod_UInt256	192,64,64 bits	45.279 ns	9.750 ns	4.64x
MultiplyMod_UInt256	256,1,64 bits	43.285 ns	6.702 ns	6.46x
MultiplyMod_UInt256	256,64,64 bits	47.322 ns	10.014 ns	4.73x
MultiplyMod_UInt256	256,128,64 bits	49.380 ns	12.720 ns	3.88x
MultiplyMod_UInt256	256,192,64 bits	57.842 ns	12.516 ns	4.62x
MultiplyMod_UInt256	256,256,64 bits	61.100 ns	12.524 ns	4.88x
MultiplyMod_UInt256	2,1,128 bits	10.197 ns	2.482 ns	4.11x
MultiplyMod_UInt256	2,64,128 bits	9.663 ns	10.860 ns	0.89x
MultiplyMod_UInt256	2,128,128 bits	52.711 ns	24.210 ns	2.18x
MultiplyMod_UInt256	2,192,128 bits	58.223 ns	18.225 ns	3.19x
MultiplyMod_UInt256	2,256,128 bits	66.959 ns	24.787 ns	2.70x
MultiplyMod_UInt256	64,1,128 bits	10.449 ns	2.959 ns	3.53x
MultiplyMod_UInt256	64,64,128 bits	9.555 ns	11.452 ns	0.83x
MultiplyMod_UInt256	128,1,128 bits	41.848 ns	9.580 ns	4.37x
MultiplyMod_UInt256	128,64,128 bits	49.564 ns	13.409 ns	3.70x
MultiplyMod_UInt256	192,1,128 bits	51.241 ns	9.964 ns	5.14x
MultiplyMod_UInt256	192,64,128 bits	57.935 ns	25.689 ns	2.26x
MultiplyMod_UInt256	256,1,128 bits	56.772 ns	9.485 ns	5.98x
MultiplyMod_UInt256	256,64,128 bits	65.921 ns	22.711 ns	2.90x
MultiplyMod_UInt256	256,128,128 bits	63.975 ns	33.877 ns	1.89x
MultiplyMod_UInt256	256,192,128 bits	83.193 ns	64.623 ns	1.29x
MultiplyMod_UInt256	256,256,128 bits	88.687 ns	58.266 ns	1.52x
MultiplyMod_UInt256	2,1,192 bits	9.839 ns	2.526 ns	3.90x
MultiplyMod_UInt256	2,64,192 bits	9.893 ns	7.513 ns	1.32x
MultiplyMod_UInt256	2,128,192 bits	24.371 ns	22.502 ns	1.08x
MultiplyMod_UInt256	2,192,192 bits	59.319 ns	40.578 ns	1.46x
MultiplyMod_UInt256	2,256,192 bits	63.385 ns	46.876 ns	1.35x
MultiplyMod_UInt256	64,1,192 bits	10.241 ns	2.690 ns	3.81x
MultiplyMod_UInt256	64,64,192 bits	9.612 ns	7.527 ns	1.28x
MultiplyMod_UInt256	128,1,192 bits	24.059 ns	2.741 ns	8.78x
MultiplyMod_UInt256	128,64,192 bits	24.249 ns	22.895 ns	1.06x
MultiplyMod_UInt256	192,1,192 bits	46.350 ns	14.956 ns	3.10x
MultiplyMod_UInt256	192,64,192 bits	53.871 ns	37.197 ns	1.45x
MultiplyMod_UInt256	256,1,192 bits	54.677 ns	14.854 ns	3.68x
MultiplyMod_UInt256	256,64,192 bits	66.220 ns	47.914 ns	1.38x
MultiplyMod_UInt256	256,128,192 bits	67.148 ns	48.439 ns	1.39x
MultiplyMod_UInt256	256,192,192 bits	80.158 ns	64.521 ns	1.24x
MultiplyMod_UInt256	256,256,192 bits	91.011 ns	77.917 ns	1.17x
MultiplyMod_UInt256	2,1,256 bits	10.354 ns	2.479 ns	4.18x
MultiplyMod_UInt256	2,64,256 bits	9.882 ns	7.492 ns	1.32x
MultiplyMod_UInt256	2,128,256 bits	23.911 ns	24.109 ns	0.99x
MultiplyMod_UInt256	2,192,256 bits	25.065 ns	23.036 ns	1.09x
MultiplyMod_UInt256	2,256,256 bits	62.323 ns	51.123 ns	1.22x
MultiplyMod_UInt256	64,1,256 bits	9.920 ns	2.470 ns	4.02x
MultiplyMod_UInt256	64,64,256 bits	10.203 ns	8.293 ns	1.23x
MultiplyMod_UInt256	128,1,256 bits	23.537 ns	3.056 ns	7.70x
MultiplyMod_UInt256	128,64,256 bits	23.729 ns	23.212 ns	1.02x
MultiplyMod_UInt256	192,1,256 bits	23.024 ns	2.321 ns	9.92x
MultiplyMod_UInt256	192,64,256 bits	22.689 ns	21.476 ns	1.06x
MultiplyMod_UInt256	256,1,256 bits	47.235 ns	14.885 ns	3.17x
MultiplyMod_UInt256	256,64,256 bits	58.625 ns	39.194 ns	1.50x
MultiplyMod_UInt256	256,128,256 bits	63.913 ns	50.311 ns	1.27x
MultiplyMod_UInt256	256,192,256 bits	73.730 ns	51.345 ns	1.44x
MultiplyMod_UInt256	256,256,256 bits	84.343 ns	65.590 ns	1.29x

Copilot

Pull request overview

This pull request refactors the UInt256 implementation to improve the performance of modular arithmetic operations, particularly Mulmod. The changes include reorganizing code into separate partial class files for better maintainability and potentially optimizing the underlying implementations.

Key changes:

Reorganized UInt256 implementation by splitting into multiple partial class files (core logic, operators, constructors, and conversions)
Modified the scalar multiplication implementation (MultiplyScalar) to use a more efficient algorithm
Added exception handling for division by zero in modular arithmetic operations

Reviewed changes

Copilot reviewed 4 out of 6 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`UInt256.cs`	Removed large sections of code (moved to new partial files); updated multiplication and carry logic; added constant `Len`; disabled AVX-512 multiplication path
`UInt256.Operators.cs`	New file containing all operator overloads and type conversion operators previously in main file
`UInt256.Ctors.cs`	New file containing constructors and factory methods previously in main file
`UInt256.Conversions.cs`	New file containing conversion and parsing methods previously in main file
`UInt256Tests.cs`	Updated tests to check for `DivideByZeroException` and `ArgumentException` in modular arithmetic operations with zero modulus


LukaszRozmej · 2025-12-31T08:41:23Z

Fallback to old code for 64,64,128 bits?

benaadams · 2025-12-31T09:12:38Z

Added specialized path

Method	Param	Old	New	Improvement
MultiplyMod_UInt256	64,64,128 bits	9.555 ns	7.453 ns	1.28x


LukaszRozmej · 2025-12-31T09:02:42Z

src/Nethermind.Int256/UInt256.cs


-                qhat--;
-            }
+    public int CompareTo(object? obj) => obj is not UInt256 int256 ? throw new InvalidOperationException() : CompareTo(int256);


Can it be compared to other number types?


LukaszRozmej · 2025-12-31T09:10:05Z

src/Nethermind.Int256/UInt256.DivideMod.cs

+        // y != 0
+        // x > y
+
+        if (x.IsUint64)


this check is somewhat redundant with x.IsZero?

Not sure what you mean? It will already have returned if x.IsZero, so this is a different check.

So both checks are testing the same thing to some extent.
Could we check IsUint64, IsZero and IsOne only once and save a few instructions?

All three test the last three fields against 0 and differ only in the test on the first field.
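
The overlap described above can be sketched as a single classification. This is only an illustration under stated assumptions: `Limbs256`, `SmallKind` and `Classify` are hypothetical names that do not appear in the PR, and the limb layout (u0 least significant) is assumed from the `m.u0` .. `m.u3` accesses quoted in this review. Whether it actually beats the separate property checks would need measuring.

```csharp
using System;

// Hypothetical stand-in for UInt256: four 64-bit limbs, u0 least significant.
public readonly struct Limbs256
{
    public readonly ulong u0, u1, u2, u3;

    public Limbs256(ulong u0, ulong u1 = 0, ulong u2 = 0, ulong u3 = 0)
    {
        this.u0 = u0; this.u1 = u1; this.u2 = u2; this.u3 = u3;
    }
}

// Hypothetical result type: which of the "small value" cases applies.
public enum SmallKind { Wide, Zero, One, Uint64 }

public static class SmallValue
{
    public static SmallKind Classify(in Limbs256 v)
    {
        // Shared part of IsZero / IsOne / IsUint64: the upper three limbs are all zero.
        if ((v.u1 | v.u2 | v.u3) != 0) return SmallKind.Wide;

        // Only the lowest limb distinguishes the three small cases.
        return v.u0 switch
        {
            0UL => SmallKind.Zero,
            1UL => SmallKind.One,
            _ => SmallKind.Uint64,
        };
    }
}
```

A caller could then branch once on the returned kind instead of evaluating the three properties separately.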

LukaszRozmej · 2025-12-31T09:10:25Z

src/Nethermind.Int256/UInt256.DivideMod.cs

+        if (y.IsZero) ThrowDivideByZeroException();
+        if (x.IsZero || y.IsOne)


y.IsZero and y.IsOne are somewhat redundant?

x.IsZero or y.IsOne; different variables

same variables on different lines
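
For readers unfamiliar with the guard order in the snippet above, here is a minimal sketch on plain `ulong` values rather than `UInt256`. The class name `GuardedDivide` is hypothetical, and the early return of `x` for the trivial cases is an assumption about what follows the quoted lines; only the throw-helper call and the two conditions come from the diff.

```csharp
using System;
using System.Diagnostics.CodeAnalysis;

public static class GuardedDivide
{
    public static ulong Divide(ulong x, ulong y)
    {
        // Zero divisor first: nothing else is meaningful in that case.
        if (y == 0) ThrowDivideByZeroException();

        // Trivial cases: 0 / y == 0 and x / 1 == x, so the quotient is x in both.
        if (x == 0 || y == 1) return x;

        return x / y;
    }

    // Throw helper: keeps the throw out of line so the caller's hot path stays small.
    [DoesNotReturn]
    private static void ThrowDivideByZeroException() => throw new DivideByZeroException();
}
```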

LukaszRozmej · 2025-12-31T09:11:18Z

src/Nethermind.Int256/UInt256.DivideMod.cs

+        if (m.IsZero) ThrowDivideByZeroException();
+        if (m.IsOne)
+        {
+            // Any value mod 1 is mathematically 0.
+            res = default;
+            return;
+        }
+
+        // Compute 257-bit sum S = x + y as 5 limbs (s0..s3, s4=carry)
+        bool overflow = AddOverflow(in x, in y, out UInt256 sum);
+        ulong s4 = !overflow ? 0UL : 1UL;
+
+        if (m.IsUint64)


m.IsZero, m.IsOne and m.IsUint64 are somewhat redundant? They mostly check the same fields.

Using BitLen is slower

uint modBits = (uint)m.BitLen; uint xBits = (uint)x.BitLen; uint yBits = (uint)y.BitLen;
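
The "257-bit sum as 5 limbs" comment in the snippet above can be illustrated with a limb-wise add that returns the carry-out as the fifth limb. This is a sketch under assumptions, not the PR's `AddOverflow`: the class `Add257` and its span-based signature are hypothetical, chosen only to keep the example self-contained.

```csharp
using System;

public static class Add257
{
    // Adds two 256-bit values, each given as 4 limbs (index 0 least significant).
    // Writes the low 256 bits to sum and returns the carry-out (0 or 1), i.e. limb s4.
    public static ulong Add(ReadOnlySpan<ulong> x, ReadOnlySpan<ulong> y, Span<ulong> sum)
    {
        ulong carry = 0;
        for (int i = 0; i < 4; i++)
        {
            ulong t = x[i] + carry;           // may wrap only when x[i] is all ones and carry is 1
            ulong c1 = t < carry ? 1UL : 0UL;
            ulong s = t + y[i];               // may wrap independently of the step above
            ulong c2 = s < t ? 1UL : 0UL;
            sum[i] = s;
            carry = c1 | c2;                  // at most one of the two steps can wrap
        }
        return carry;
    }
}
```

With the carry kept as a separate limb, the reduction step can treat the sum as a 5-limb dividend, which matches the naming of `Remainder257By256Bits` in the diff.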

LukaszRozmej · 2025-12-31T09:33:01Z

src/Nethermind.Int256/UInt256.DivideMod.cs

+        else if (m.u3 != 0)
+        {
+            Remainder257By256Bits(in sum, in m, out res);
+        }
+        else if (m.u2 != 0)


again comparing m fields

Can't really find a better way that measurably shows up

LukaszRozmej · 2025-12-31T09:33:21Z

src/Nethermind.Int256/UInt256.DivideMod.cs

+        if (m.IsZero) ThrowDivideByZeroException();
+        if (m.IsOne || x.IsZero || y.IsZero)
+        {
+            res = default;
+            return;
+        }
+
+        // Trivial no-mul cases first.
+        if (y.IsOne) { Mod(in x, in m, out res); return; }
+        if (x.IsOne) { Mod(in y, in m, out res); return; }
+
+        // Modulus-size dispatch first - keeps all the tiny-mod magic.
+        if (m.IsUint64)
+        {
+            MulModBy64Bits(in x, in y, m.u0, out res);
+            return;
+        }
+
+        if ((m.u2 | m.u3) == 0)


same redundant compares?

Is > 128bit check?

Will measure leading zeros as single check 🤔
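
One way the "leading zeros as single check" idea could look is to derive a limb count once and switch on it. This is a hedged sketch only: `LimbDispatch`, `BitLength` and `LimbCount` are hypothetical names, `BitOperations.LeadingZeroCount` is the standard .NET API, and an earlier reply in this review reports that a `BitLen`-based approach measured slower, so this is not presented as the approach that was merged.

```csharp
using System;
using System.Numerics;

public static class LimbDispatch
{
    // Bit length of a 256-bit value given as four limbs (u0 least significant).
    // Returns 0 for zero, because LeadingZeroCount(0UL) is 64.
    public static int BitLength(ulong u0, ulong u1, ulong u2, ulong u3)
    {
        if (u3 != 0) return 256 - BitOperations.LeadingZeroCount(u3);
        if (u2 != 0) return 192 - BitOperations.LeadingZeroCount(u2);
        if (u1 != 0) return 128 - BitOperations.LeadingZeroCount(u1);
        return 64 - BitOperations.LeadingZeroCount(u0);
    }

    // Number of significant 64-bit limbs: 0 for zero, otherwise 1..4.
    public static int LimbCount(ulong u0, ulong u1, ulong u2, ulong u3)
        => (BitLength(u0, u1, u2, u3) + 63) >> 6;
}
```

A modulus-size dispatch could then be a single `switch` on `LimbCount`, at the cost of computing the count up front.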

LukaszRozmej · 2025-12-31T10:37:28Z

Anything more for?

2,64,128 bits
2,128,256 bits
128,64,256 bits	
192,64,256 bits


benaadams · 2025-12-31T12:52:19Z

Anything more for?

2,64,128 bits
2,128,256 bits
128,64,256 bits	
192,64,256 bits

Yes, but the gain wasn't great for the amount of additional code it needed. There are opportunities there; will revisit.

benaadams added 13 commits December 29, 2025 20:50

Faster mul mod

092b730

Split to multiple files

4df2cc5

Better SubMulTo2

5616f3c

Better Add2

63ea14a

Better Remainder256By128Bits

6b6e263

Faster Remainder256By128Bits

ec8b235

Refactor div and mod to different file

0cd62df

Tidy up

65e2e9f

Tidy up

be3f57b

Better asm

b05611f

Better asm

9eb24c8

Tidy up

e75f8b6

Add Division by zero

f2f94ce

Copilot AI review requested due to automatic review settings December 30, 2025 13:33

Copilot started reviewing on behalf of benaadams December 30, 2025 13:33

Copilot AI reviewed Dec 30, 2025

src/Nethermind.Int256/UInt256.cs (resolved)

src/Nethermind.Int256/UInt256.cs (resolved)

src/Nethermind.Int256/UInt256.Operators.cs (outdated, resolved)

src/Nethermind.Int256/UInt256.Conversions.cs (outdated, resolved)

benaadams and others added 4 commits December 30, 2025 13:52

fix

0dbcddf

Update src/Nethermind.Int256/UInt256.Operators.cs

a60c679

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Update src/Nethermind.Int256/UInt256.Conversions.cs

40711a1

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Merge branch 'main' into mulmod-perf

73bf322

benaadams requested review from LukaszRozmej and rubo December 30, 2025 14:28

Update doc comments for throwing divide by zero

94db1f3

Add specialised path for 64,64,128 bits

2ff5afc

LukaszRozmej approved these changes Dec 31, 2025


benaadams and others added 2 commits December 31, 2025 11:24

Apply suggestions from code review

aab477a

Co-authored-by: Lukasz Rozmej <lukasz.rozmej@gmail.com>

Feedback + fix

870b351

Remove LeadingZeroCount wrapper

aafcc7b

benaadams merged commit f272b7e into main Dec 31, 2025
11 checks passed

benaadams deleted the mulmod-perf branch December 31, 2025 12:57
