Lowram: Share buffers with non-overlapping lifetimes in verify#1007
Conversation
CBMC Results (ML-DSA-44)
Full Results (200 proofs)
|
CBMC Results (ML-DSA-65)
Full Results (200 proofs)
|
CBMC Results (ML-DSA-87)
Full Results (200 proofs)
|
b09a9aa to
86374eb
Compare
79b7670 to
f4f7b85
Compare
e090aec to
eacc983
Compare
eacc983 to
545201d
Compare
545201d to
3fc9d89
Compare
CBMC Results (ML-DSA-44, REDUCE-RAM)Full Results (200 proofs)
|
CBMC Results (ML-DSA-65, REDUCE-RAM)Full Results (200 proofs)
|
CBMC Results (ML-DSA-87, REDUCE-RAM)Full Results (200 proofs)
|
bb6f745 to
ea14d62
Compare
ea14d62 to
88e1324
Compare
There was a problem hiding this comment.
Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)
Details
| Benchmark suite | Current: 740c4bb | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
310360 cycles |
329877 cycles |
0.94 |
ML-DSA-44 sign |
1176137 cycles |
1224108 cycles |
0.96 |
ML-DSA-44 verify |
337367 cycles |
348071 cycles |
0.97 |
ML-DSA-65 keypair |
560473 cycles |
565199 cycles |
0.99 |
ML-DSA-65 sign |
1907791 cycles |
1930855 cycles |
0.99 |
ML-DSA-65 verify |
535703 cycles |
545481 cycles |
0.98 |
ML-DSA-87 keypair |
872107 cycles |
851474 cycles |
1.02 |
ML-DSA-87 sign |
2487117 cycles |
2383353 cycles |
1.04 |
ML-DSA-87 verify |
878306 cycles |
873058 cycles |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
oqs-bot
left a comment
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Graviton4 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 740c4bb | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-87 verify |
385265 cycles |
373790 cycles |
1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 740c4bb | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-87 sign |
2487117 cycles |
2383353 cycles |
1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
oqs-bot
left a comment
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Graviton3 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 740c4bb | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-87 verify |
429474 cycles |
402939 cycles |
1.07 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)
Details
| Benchmark suite | Current: 740c4bb | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
459483 cycles |
459447 cycles |
1.00 |
ML-DSA-44 sign |
2133292 cycles |
2132638 cycles |
1.00 |
ML-DSA-44 verify |
544400 cycles |
547432 cycles |
0.99 |
ML-DSA-65 keypair |
772643 cycles |
771917 cycles |
1.00 |
ML-DSA-65 sign |
3477028 cycles |
3470907 cycles |
1.00 |
ML-DSA-65 verify |
847003 cycles |
850045 cycles |
1.00 |
ML-DSA-87 keypair |
1248446 cycles |
1247649 cycles |
1.00 |
ML-DSA-87 sign |
4301115 cycles |
4322692 cycles |
1.00 |
ML-DSA-87 verify |
1358980 cycles |
1368274 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Mac Mini (M1, 2020) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 17733c0 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-65 verify |
82838 cycles |
79828 cycles |
1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)
Details
| Benchmark suite | Current: 17733c0 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
212266 cycles |
225432 cycles |
0.94 |
ML-DSA-44 sign |
599827 cycles |
631252 cycles |
0.95 |
ML-DSA-44 verify |
212113 cycles |
222128 cycles |
0.95 |
ML-DSA-65 keypair |
383924 cycles |
391237 cycles |
0.98 |
ML-DSA-65 sign |
1006349 cycles |
1008674 cycles |
1.00 |
ML-DSA-65 verify |
370840 cycles |
370666 cycles |
1.00 |
ML-DSA-87 keypair |
650184 cycles |
665032 cycles |
0.98 |
ML-DSA-87 sign |
1361289 cycles |
1398865 cycles |
0.97 |
ML-DSA-87 verify |
628257 cycles |
636893 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)
Details
| Benchmark suite | Current: 17733c0 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
309376 cycles |
329877 cycles |
0.94 |
ML-DSA-44 sign |
1190695 cycles |
1224108 cycles |
0.97 |
ML-DSA-44 verify |
330999 cycles |
348071 cycles |
0.95 |
ML-DSA-65 keypair |
571739 cycles |
565199 cycles |
1.01 |
ML-DSA-65 sign |
1937242 cycles |
1930855 cycles |
1.00 |
ML-DSA-65 verify |
558280 cycles |
545481 cycles |
1.02 |
ML-DSA-87 keypair |
876657 cycles |
851474 cycles |
1.03 |
ML-DSA-87 sign |
2435835 cycles |
2383353 cycles |
1.02 |
ML-DSA-87 verify |
906876 cycles |
873058 cycles |
1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: 17733c0 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-87 verify |
906876 cycles |
873058 cycles |
1.04 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)
Details
| Benchmark suite | Current: 17733c0 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
460079 cycles |
459447 cycles |
1.00 |
ML-DSA-44 sign |
2137097 cycles |
2132638 cycles |
1.00 |
ML-DSA-44 verify |
545716 cycles |
547432 cycles |
1.00 |
ML-DSA-65 keypair |
772851 cycles |
771917 cycles |
1.00 |
ML-DSA-65 sign |
3477329 cycles |
3470907 cycles |
1.00 |
ML-DSA-65 verify |
848550 cycles |
850045 cycles |
1.00 |
ML-DSA-87 keypair |
1248048 cycles |
1247649 cycles |
1.00 |
ML-DSA-87 sign |
4309673 cycles |
4322692 cycles |
1.00 |
ML-DSA-87 verify |
1357941 cycles |
1368274 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
17733c0 to
b50e8d6
Compare
oqs-bot
left a comment
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Intel Xeon 4th gen (c7i)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: b50e8d6 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-87 verify |
149579 cycles |
128257 cycles |
1.17 |
This comment was automatically generated by workflow using github-action-benchmark.
oqs-bot
left a comment
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Graviton4 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: b50e8d6 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-87 verify |
385420 cycles |
373790 cycles |
1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
oqs-bot
left a comment
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Graviton3 (no-opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: b50e8d6 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-87 verify |
429522 cycles |
402939 cycles |
1.07 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)
Details
| Benchmark suite | Current: b50e8d6 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
224513 cycles |
225432 cycles |
1.00 |
ML-DSA-44 sign |
653904 cycles |
631252 cycles |
1.04 |
ML-DSA-44 verify |
228026 cycles |
222128 cycles |
1.03 |
ML-DSA-65 keypair |
401530 cycles |
391237 cycles |
1.03 |
ML-DSA-65 sign |
1041844 cycles |
1008674 cycles |
1.03 |
ML-DSA-65 verify |
376898 cycles |
370666 cycles |
1.02 |
ML-DSA-87 keypair |
644118 cycles |
665032 cycles |
0.97 |
ML-DSA-87 sign |
1352613 cycles |
1398865 cycles |
0.97 |
ML-DSA-87 verify |
626459 cycles |
636893 cycles |
0.98 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Arm Cortex-A72 (Raspberry Pi 4) benchmarks (opt)'.
Benchmark result of this commit is worse than the previous benchmark result exceeding threshold 1.03.
| Benchmark suite | Current: b50e8d6 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 sign |
653904 cycles |
631252 cycles |
1.04 |
ML-DSA-65 sign |
1041844 cycles |
1008674 cycles |
1.03 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A72 (Raspberry Pi 4) benchmarks (no-opt)
Details
| Benchmark suite | Current: b50e8d6 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
300757 cycles |
329877 cycles |
0.91 |
ML-DSA-44 sign |
1154134 cycles |
1224108 cycles |
0.94 |
ML-DSA-44 verify |
328761 cycles |
348071 cycles |
0.94 |
ML-DSA-65 keypair |
565075 cycles |
565199 cycles |
1.00 |
ML-DSA-65 sign |
1922224 cycles |
1930855 cycles |
1.00 |
ML-DSA-65 verify |
532483 cycles |
545481 cycles |
0.98 |
ML-DSA-87 keypair |
851416 cycles |
851474 cycles |
1.00 |
ML-DSA-87 sign |
2424125 cycles |
2383353 cycles |
1.02 |
ML-DSA-87 verify |
883409 cycles |
873058 cycles |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
There was a problem hiding this comment.
Arm Cortex-A55 (Snapdragon 888) benchmarks (no-opt)
Details
| Benchmark suite | Current: b50e8d6 | Previous: 339782e | Ratio |
|---|---|---|---|
ML-DSA-44 keypair |
458880 cycles |
459447 cycles |
1.00 |
ML-DSA-44 sign |
2129039 cycles |
2132638 cycles |
1.00 |
ML-DSA-44 verify |
544470 cycles |
547432 cycles |
0.99 |
ML-DSA-65 keypair |
772136 cycles |
771917 cycles |
1.00 |
ML-DSA-65 sign |
3473186 cycles |
3470907 cycles |
1.00 |
ML-DSA-65 verify |
847275 cycles |
850045 cycles |
1.00 |
ML-DSA-87 keypair |
1246443 cycles |
1247649 cycles |
1.00 |
ML-DSA-87 sign |
4300514 cycles |
4322692 cycles |
0.99 |
ML-DSA-87 verify |
1359094 cycles |
1368274 cycles |
0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
b50e8d6 to
a94dd20
Compare
|
I tried digging into why there is a substantial performance regression for x86_64 in this PR, but I cannot quite figure it out. When I run this PR on a c7i.metal-24xl, I get about the same performance as in main - most of the time. Verify on x86 Sapphire Rapids seems to behave bimodally - most runs land in a fast band, ~6-12% land in a slow band that's 10-18% slower. As far as I can tell it's true on main as well as on this branch, at similar rates. It seems to correlate with where stack buffers happen to land mod 4096, which varies per run because of ASLR. If I align w1 to 4096 bytes, the performance regression goes away in CI, but I don't know how that can help us with fixing it. I tried disabling ASLR (see #1091). That indeed eliminates the variance for me and it definitely also impacts performance in CI, but it doesn't result in the performance regression in this PR to go away. Also, it does not allow to reproduce the performance regression locally. My initial hypothesis was that this is x86's 4 KiB store-buffer aliasing: when a load has the same low-12 address bits as a recently-issued store, x86_64 flags a false dependency and delays the load. That model fits the basic shape: stack addresses varies with ASLR while rodata low-12 is fixed within the binary, so you'd expect the alias to fire when the stack happens to land at a low-12 that overlaps a hot rodata access. My conclusion is that this PR is likely not introducing the regression. Still we should investigate why ASLR has such a massive impact on our performance - it really shouldn't. #1091's benchmark show a speed-up for our native implementations just by disabling ASLR suggesting some of our assembly routines suffer proof performance in the case of unlucky stack placement due to ASLR. @hanno-becker, can we go ahead and merge this PR and investigate this separately? |
|
@mkannwischer Thank you for the analysis. I think it gives sufficient confidence that this is a pre-existing issue, not something introduced in this PR. We should analyze it independently. Can you open an issue? |
hanno-becker
left a comment
There was a problem hiding this comment.
Approving under the assumption that we will investigate the performance degradation on x86 with some urgency.
Uh oh!
There was an error while loading. Please reload this page.