Add ppc64le backend (supports p8 and above architectures)#1648
Add ppc64le backend (supports p8 and above architectures)#1648dannytsen wants to merge 29 commits into
Conversation
|
@bhess @mkannwischer Please review. Thanks. |
Thanks @dannytsen for rebasing this PR. Could you get it through CI please? |
bhess
left a comment
There was a problem hiding this comment.
Thanks for the updates, @dannytsen.
Minor comments are inline.
supports p8 and above architectures.
I do not think this is correct as written. I see illegal instructions with qemu-ppc64le -cpu power8, while the same code runs with -cpu power9. In particular, lxvx and stxvx do not appear to be supported on p8.
Can you please double-check this and update the requirement to POWER9 if that is indeed the minimum supported target?
@mkannwischer I tested on p8 system. I have to double check that. Ah, I found the problem, my p8 supports ISA 2.07 which means the qemu supports older one. I'll fix that along with @bhess comments. Thanks. |
|
@mkannwischer @bhess Just submitted my changes and stated that the supporting architecture to be p9 and above or Power systems support ISA 2.07 and other fixes. Hope this solve all the issues. Thanks. |
|
@mkannwischer @bhess Submitted an updated code after re-run autogen. Please review. |
|
@mkannwischer @bhess Not sure what's the errors for 4 CI errors. This was happened after I ran the autogen. |
Thanks for the progress. You can ignore the two OpenTitan tests (those should not be run on a fork because the AWS auth fails) and the markdown lint (that is a spurious failure). |
OK. I'll check into the ppc one. Thanks. |
Thanks for the updates. I have a few more minor follow-ups inline. |
@bhess I did not see your follow-ups or I missed it. Can you post it again? Thanks. |
| #define V_26 2 | ||
| #define V_MKQ 3 | ||
|
|
||
| .machine "any" |
There was a problem hiding this comment.
I believe this directive will get removed after simpasm.
There was a problem hiding this comment.
does it still make sense to include it in the dev-folder then, e.g., for documentation purposes?
There was a problem hiding this comment.
@bhess Well, this code still supports p8 but with ISA 2.07. Some older p8s don't support ISA 2.07. Looks like qemu supports older p8 not with ISA 2.07. In general, we support p9 but also p8 with ISA 2.07.
There was a problem hiding this comment.
hm, is there any way to make this distinction with the .machine pseudo-op?
There was a problem hiding this comment.
That's possible. I'll think about it.
Should be visible now. |
If you rebase on top of main, the OpenTitan tests should be properly skipped (fixed in #1649) |
|
@mkannwischer @bhess I manually modified mlkem/mlkem_native_asm.S and mlkem/mlkem_native.c to add ppc64le include so to pass the monolithic_build_multilevel_native CI tests because autogen can't handle the ppc64le platform. So I am not sure how to fix the test unless autogen get fixed. |
I’ve opened a PR against your branch that updates autogen. |
@bhess Thanks. |
@bhess I merged the autogen and rerun the autogen. Hope this commit fix the problem. Thanks. |
|
@bhess I just added a capability check of PPC_FEATURE2_ARCH_2_07. It will fall back to the original implementation if the architecture doesn't support ISA 2.07 and above which include some p8 and p9 above. Please review. Thanks. |
|
@mkannwischer @bhess Any feedback on the recent commit? Thanks. |
|
@mkannwischer In qemu, power8e supports ISA 2.07 which is a variant of p8 with DD2. |
There was a problem hiding this comment.
Thanks for the update, @dannytsen.
The runtime capability check goes in a slightly different direction than what I had in mind. My suggestion was about whether the .machine pseudo-op can express something like "POWER8 with ISA 2.07 support". If there's no way to express this with .machine, I'd suggest just removing it from the assembly and documenting the ISA 2.07 requirement in the project docs / README instead (instead of machine "any").
For the dynamic getauxval approach: I wonder if runtime feature detection might be better left to downstream consumers of mlkem-native (e.g., liboqs) rather than adding it here. There's also the consideration that getauxval/<sys/auxv.h> is not portable AFAIK. Any thoughts @mkannwischer ?
@bhess ".machine any" is already a psudo-op. I can remove that in dev/ppc64le/src. The README is already had the update. For getauxval approach, it's to prevent the code run on a older Power systems that did not support proper instaructions. |
There was a problem hiding this comment.
Coming back to this after a closer look at runtime CPU checks in mlkem-native:
my remaining concern is the runtime capability check in the PR. mlkem-native already has a designed-in mechanism for exactly this case, and I think the PR should use it rather than doing its own in the native backend.
- Hook:
MLK_CONFIG_CUSTOM_CAPABILITY_FUNC+ the
mlk_sys_check_capability(mlk_sys_cap)callback declared in
mlkem/src/sys.h.
Backends call it; consumers override it for runtime dispatch. - Examples in other backends:
Concretely, this would mean adding MLK_SYS_CAP_PPC_ISA_2_07 to the enum,
replacing the four mlkem_ppc_check_cap() calls with
mlk_sys_check_capability(...), and shipping the auxv code as a new
test/configs/custom_native_capability_config_PPC_AUXV.h. Specifics are inline.
@mkannwischer: please correct me if that's not the intended pattern, and feel free to share any additional review.
| #include "../api.h" | ||
| #include "src/arith_native_ppc64le.h" | ||
|
|
||
| #include <sys/auxv.h> |
There was a problem hiding this comment.
Because this header is pulled in by everyone that uses the native backend (via src/native/meta.h), this include makes the whole native build glibc-dependent.
The existing AVX2 and NEON backends don't query the OS, they assume the ISA the build was configured for, and runtime dispatch is delegated to the consumer through MLK_CONFIG_CUSTOM_CAPABILITY_FUNC (see test/configs/custom_native_capability_config_CPUID_AVX2.h and ..._ID_AA64PFR1_EL1.h).
I'd advise to use the same pattern here: drop this include, and move the auxv-based detection into a new
test/configs/custom_native_capability_config_PPC_AUXV.h example.
There was a problem hiding this comment.
I believe this is auto generated.
There was a problem hiding this comment.
Ah yes, anyway I think this is moot and not a blocker if the checks are removed for now (see the other thread). I can handle the autogen in a followup.
| } | ||
|
|
||
| MLK_MUST_CHECK_RETURN_VALUE | ||
| static MLK_INLINE int mlk_ntt_native(int16_t data[MLKEM_N]) |
There was a problem hiding this comment.
Suggested replacement (matching with AVX2/NEON, should be applied to the other functions in this file as well):
/* in mlkem/src/sys.h: add MLK_SYS_CAP_PPC_ISA_2_07 to mlk_sys_cap */
MLK_MUST_CHECK_RETURN_VALUE
static MLK_INLINE int mlk_ntt_native(int16_t data[MLKEM_N])
{
if (!mlk_sys_check_capability(MLK_SYS_CAP_PPC_ISA_2_07))
return MLK_NATIVE_FUNC_FALLBACK;
mlk_ntt_ppc(data, mlk_ppc_qdata);
return MLK_NATIVE_FUNC_SUCCESS;
}The default mlk_sys_check_capability returns 1 -> the native path is always taken when the backend is selected, matching the build-time default used elsewhere. Consumers that want runtime dispatch supply their own; the auxv code from this PR can ship as a test/configs/custom_native_capability_config_PPC_AUXV.h example.
Once the capability check is centralised on mlk_sys_check_capability(MLK_SYS_CAP_PPC_ISA_2_07), all four wrappers stay structurally identical but no longer reference a private static (ppc_inited).
There was a problem hiding this comment.
@bhess I disagreed with you. The purpose of that I want it to check the capability is not just checking the ISA 2.07 but also to be able to added different architecture support in the future like p10 specific or p11 etc. So, I'll be able to re-assign function pointer to a different entry point if needed. So, mlk_ntt_ppc will be a function pointer to a different entry depending on the architecture. That's for future arch support. I can remove this check now. I just want to get the implementation merged quickly because we have been over this implementation for months if not 6 months.
There was a problem hiding this comment.
Ok sounds good, I don't think runtime checks are a blocker for this PR, so removing them seems fine and can be iterated later.
There was a problem hiding this comment.
I don't think we want a function pointer based dispatch; unless there's a good reason to do it otherwise, dispatch should follow the pattern of checking architecture flags and calling the correct function directly.
There was a problem hiding this comment.
I don't think we want a function pointer based dispatch; unless there's a good reason to do it otherwise, dispatch should follow the pattern of checking architecture flags and calling the correct function directly.
I agree. For this PR, I’m in favor of removing the checks. We can defer runtime capability checks to a later iteration if we need to support multiple architectures/ISA levels in one build.
bhess
left a comment
There was a problem hiding this comment.
Thanks for the updates @dannytsen. I re-reviewed this together with the earlier discussion in #1184. The removal of the runtime capability check resolves my previous comment, and the main readability points raised on the earlier PR also seem addressed for this iteration. I’m not the maintainer, but from my side this iteration now looks good and I’d support merging it.
Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
… twisted zetas. Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
|
@bhess @mkannwischer merged all @bhess 's new autogen, simpasm and Use Barrett for NTT twiddle products. |
mkannwischer
left a comment
There was a problem hiding this comment.
Thanks for the update! I left another round of comments.
Could you kindly also rebase the changes against main and make sure the diff is as you expect? Right now this PR contains a lot of unrelated changes making it hard to review.
I'm still testing the autogeneration and will get back on that shortly.
There was a problem hiding this comment.
Please remove the liboqs integration from this PR. @bhess will add it in a follow-up once it's confirmed to work.
| @@ -9,6 +9,7 @@ endif | |||
| SOURCES += $(wildcard mlkem/src/*.c) | |||
| ifeq ($(OPT),1) | |||
| SOURCES += $(wildcard mlkem/src/native/aarch64/src/*.[csS]) $(wildcard mlkem/src/native/x86_64/src/*.[csS]) $(wildcard mlkem/src/native/riscv64/src/*.[csS]) | |||
| SOURCES += $(wildcard mlkem/src/native/ppc64le/src/*.[csS]) | |||
There was a problem hiding this comment.
Indentation here is still off.
| #include "../../../common.h" | ||
|
|
||
| #if defined(MLK_ARITH_BACKEND_PPC64LE_DEFAULT) && \ | ||
| !defined(MLK_CONFIG_MULTILEVEL_NO_SHARED) |
There was a problem hiding this comment.
Add the POWER8_VECTOR guard here.
|
|
||
| #include "consts.h" | ||
|
|
||
| MLK_ALIGN const int16_t mlk_ppc_qdata[] = { |
There was a problem hiding this comment.
Please add MLK_INTERNAL_DATA_DEFINITION here (allowing to mark this as static - see #1658).
|
|
||
| #ifndef __ASSEMBLER__ | ||
| #define mlk_ppc_qdata MLK_NAMESPACE(ppc_qdata) | ||
| extern const int16_t mlk_ppc_qdata[]; |
There was a problem hiding this comment.
| extern const int16_t mlk_ppc_qdata[]; | |
| MLK_INTERNAL_DATA_DECLARATION const int16_t mlk_ppc_qdata[]; |
There was a problem hiding this comment.
Since we are autogenerating ppc asm now, the autogeneration CI needs to be adjusted.
Can you extend the cross-autogen devShell please:
devShells.cross-autogen = util.mkShell {
packages = builtins.attrValues { inherit (config.packages) linters toolchain_ppc64le; inherit (pkgs) gcc-arm-embedded; }
++ pkgs.lib.optionals pkgs.stdenv.hostPlatform.isx86_64 [ config.packages.toolchain_aarch64 ]
++ pkgs.lib.optionals pkgs.stdenv.hostPlatform.isAarch64 [ config.packages.toolchain_x86_64 ];
};
| * Yields the signed canonical representative of (b*z) mod q, | ||
| * bounded by q/2. |
There was a problem hiding this comment.
This is not true. Barrett multiplication doesn't yield results bounded by q/2.
@hanno-becker, could you possibly take a look at this and suggest how to phrase this?
| /* | ||
| * Copyright (c) The mlkem-native project authors | ||
| * SPDX-License-Identifier: Apache-2.0 OR ISC OR MIT | ||
| * | ||
| * Copyright IBM Corp. 2025, 2026 | ||
| * | ||
| * =================================================================================== | ||
| * Written by Danny Tsen <dtsen@us.ibm.com> | ||
| */ |
There was a problem hiding this comment.
| /* | |
| * Copyright (c) The mlkem-native project authors | |
| * SPDX-License-Identifier: Apache-2.0 OR ISC OR MIT | |
| * | |
| * Copyright IBM Corp. 2025, 2026 | |
| * | |
| * =================================================================================== | |
| * Written by Danny Tsen <dtsen@us.ibm.com> | |
| */ | |
| * Copyright(c) The mlkem-native project authors | |
| * Copyright (c) IBM Corp. 2025, 2026 | |
| * SPDX-License-Identifier: Apache-2.0 OR ISC OR MIT | |
| * | |
| * Written by Danny Tsen <dtsen@us.ibm.com> | |
| */ |
There was a problem hiding this comment.
My comment https://github.com/pq-code-package/mlkem-native/pull/1648/changes#r3187159006 regarding the CI wasn't addressed.
Could you please extend the cross ppc64le CI, so that is exercises the new backend.
Also to test the guard, we should also add a test for a platform not supporting the power 8 vector instructions.
This patch should do the job:
mkannwischer
left a comment
There was a problem hiding this comment.
When I run
nix develop .#cross-ppc64le
autogen
I get a diff in the following files:
- mlkem/src/native/ppc64le/src/intt_ppc_asm.S
- mlkem/src/native/ppc64le/src/ntt_ppc_asm.S
- mlkem/src/native/ppc64le/src/poly_tomont_ppc_asm.S
- mlkem/src/native/ppc64le/src/reduce_ppc_asm.S
Please commit the autogen'd files. Once the cross-autogen shell has been updated this should also surface in CI.
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Thanks for the review and for checking this @mkannwischer. The only item I have not updated yet is the Barrett-bounds wording in the comment, since I did not want to guess the preferred phrasing. |
Thanks for the quick turnaround and all the changes! That autogen issue sounds annoying. Was that both within nix?
Okay, got it! I'm planning to do one more pass on the documentation once everything works. |
Not initially. The earlier macOS run used the Homebrew toolchain directly, not the nix cross-ppc64le environment. Now I'm using a dockerized nix which seems the most convenient way to use it on macOS. |
Okay, then I am less surprised that it produced different asm. |
I could, but on macOS the nix setup was much slower, likely because fewer prebuilt packages available than in the dockerized Linux environment. Setting it up in docker was much faster. I may be missing a simpler way to do this on macOS though, since I am not very familiar with nix. |
Sounds plausible that |
|
I agree the cross shells are extremely slow to build (hours for me), but as @mkannwischer said, it's a one-off. |
ppc64le: address review comments from mkannwischer
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
I pushed an update merging main into this branch, so the diff should look clean now. |
Thanks! The diff looks much easier on the eyes now. At first glance the changes look great - thank you! Once CI works here, I'll update the other PR to run the complete CI and then do another review. I expect that all that's left is aligning documentation. Thanks for the patience! |
- ruff format scripts/autogen (formatting fix) - Add check-magic annotation for array size 2072 in consts.c and consts.h (7 groups of 8 base constants + 4 twiddle tables * 63 rows * 8 values) Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
|
Thanks @mkannwischer. I pushed an update that fixes the linting (linting now passes locally). |
Thanks! There are some issues in the POWER-7 tests: https://github.com/pq-code-package/mlkem-native/actions/runs/26212206823/job/77130946223?pr=1648. Could you take a look please? |
Yes just noticed it, will take a look. |
|
For the HOL-Light failures, a workaround would be to add |
…power8 (nix libc is compiled for power8 and otherwise causes illegal instructions. Avoids unused data parameter errors in the fallback code path. Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
Signed-off-by: Basil Hess <bhe@zurich.ibm.com>
The Illegal instruction issue doen't appear to come from the mlkem code itself, but from the nix toolchain that uses a glibc compiled for power8, so binaries fault when run under qemu-ppc64le -cpu power7. I also reproduced the fallback path locally. While doing that I hit an issue in the fallback-path (unused variable) fixed in commit 3442786.
Done in the latest commit. |
Thanks for investigating. That's annoying. I think this solution is acceptable for now and I'm glad that it at least revealed some problems we would have otherwise missed. I have updated #1677 so we can run the full CI and benchmarks there. |
Updated patch for ML_KEM for ppc64le supports p8 and above architectures.
Signed-off-by: Danny Tsen dtsen@us.ibm.com
The following tests were run on p10.
[09:28] danny@ltcden12-lp1 new_ppc64le_mlkem % ./scripts/tests func
INFO > Functional Test Compile (native no_opt): make func OPT=0 AUTO=1 -j40
INFO > Functional Test ML-KEM-512 (native no_opt): make run_func_512 -j40
INFO > Functional Test ML-KEM-768 (native no_opt): make run_func_768 -j40
INFO > Functional Test ML-KEM-1024 (native no_opt): make run_func_1024 -j40
INFO > Functional Test Compile (native opt): make func OPT=1 AUTO=1 -j40
INFO > Functional Test ML-KEM-512 (native opt): make run_func_512 -j40
INFO > Functional Test ML-KEM-768 (native opt): make run_func_768 -j40
INFO > Functional Test ML-KEM-1024 (native opt): make run_func_1024 -j40
All good!
[09:28] danny@ltcden12-lp1 new_ppc64le_mlkem % ./scripts/tests bench -c PERF
INFO > Benchmark Compile (native no_opt): make bench OPT=0 AUTO=1 CYCLES=PERF -j40
INFO > Benchmark ML-KEM-512 (native no_opt): make run_bench_512
INFO > Benchmark ML-KEM-512 (native no_opt): test/build/mlkem512/bin/bench_mlkem512
keypair cycles = 66982
encaps cycles = 78820
decaps cycles = 100923
keypair percentiles: 66438 66690 66791 66857 66920 66982 67043 67122 67218 67306 71905
encaps percentiles: 78322 78516 78618 78687 78752 78820 78878 78933 79012 79116 83825
decaps percentiles: 100427 100634 100733 100804 100869 100923 100985 101056 101131 101253 105852
INFO > Benchmark ML-KEM-768 (native no_opt): make run_bench_768
INFO > Benchmark ML-KEM-768 (native no_opt): test/build/mlkem768/bin/bench_mlkem768
keypair cycles = 111380
encaps cycles = 125891
decaps cycles = 154364
keypair percentiles: 110575 110914 111083 111192 111291 111380 111496 111617 111821 112414 116725
encaps percentiles: 125081 125403 125526 125655 125776 125891 125998 126122 126293 126771 131358
decaps percentiles: 153575 153870 154008 154131 154261 154364 154487 154630 154782 155313 159863
INFO > Benchmark ML-KEM-1024 (native no_opt): make run_bench_1024
INFO > Benchmark ML-KEM-1024 (native no_opt): test/build/mlkem1024/bin/bench_mlkem1024
keypair cycles = 166809
encaps cycles = 185315
decaps cycles = 220229
keypair percentiles: 165339 165995 166236 166435 166616 166809 167007 167200 167505 171058 175606
encaps percentiles: 183839 184563 184778 184951 185158 185315 185518 185744 186123 189637 192014
decaps percentiles: 218911 219430 219705 219841 220027 220229 220436 220673 221029 224484 226901
INFO > Benchmark Compile (native opt): make bench OPT=1 AUTO=1 CYCLES=PERF -j40
INFO > Benchmark ML-KEM-512 (native opt): make run_bench_512
INFO > Benchmark ML-KEM-512 (native opt): test/build/mlkem512/bin/bench_mlkem512
keypair cycles = 45750
encaps cycles = 50661
decaps cycles = 63561
keypair percentiles: 45248 45469 45546 45620 45690 45750 45806 45886 45954 46063 50703
encaps percentiles: 50192 50367 50468 50542 50600 50661 50710 50771 50858 50954 55652
decaps percentiles: 63091 63276 63381 63436 63497 63561 63623 63679 63743 63857 68437
INFO > Benchmark ML-KEM-768 (native opt): make run_bench_768
INFO > Benchmark ML-KEM-768 (native opt): test/build/mlkem768/bin/bench_mlkem768
keypair cycles = 79045
encaps cycles = 86455
decaps cycles = 103878
keypair percentiles: 78313 78578 78742 78847 78954 79045 79169 79285 79470 79978 84430
encaps percentiles: 85628 86009 86172 86272 86363 86455 86592 86711 86879 87292 92038
decaps percentiles: 103041 103399 103540 103676 103788 103878 103993 104104 104274 104736 109361
INFO > Benchmark ML-KEM-1024 (native opt): make run_bench_1024
INFO > Benchmark ML-KEM-1024 (native opt): test/build/mlkem1024/bin/bench_mlkem1024
keypair cycles = 124072
encaps cycles = 134500
decaps cycles = 157090
keypair percentiles: 122727 123259 123515 123720 123929 124072 124253 124527 125009 128466 133334
encaps percentiles: 133064 133681 133933 134129 134320 134500 134711 134933 135346 138753 141067
decaps percentiles: 155503 156261 156510 156694 156894 157090 157285 157605 158014 161592 166723
All good!