Added optimized ppc64le support functions for ML-KEM. #1184
Thank you, @dannytsen, this is an exciting contribution 🎉
I think as the first stage of review, the goal should be to get your changes through CI, and extend it so that the PPC64 backend is exercised (to this end: do you know if your assembly works with qemu-ppc64le, and what flags are needed?).
In a second phase, we can dive into the backend itself and hopefully convince ourselves that it is functionally correct and upholds the assumptions made by the frontend.
I left a few comments to kick things off, but additionally I can see that there are failures related to autogen and format, so a good starting point would be to resolve those. You should be able to run simpasm with a PPC cross compiler to get simplified assembly that you can check in to the main source tree.
|
@hanno-becker I believe the code will work on qemu-ppc64le, although I have not run it there. My test platforms are p9 and p10 systems. I will go through the comments and fix the issues. Thanks. |
|
@dannytsen Please see https://github.com/pq-code-package/mlkem-native/commits/ppc64le_backend for the changes to get the asm through the usual format/autogen/simpasm pipeline. Feel free to amend your commit(s). At least the base CI is happy with this: https://github.com/pq-code-package/mlkem-native/actions/runs/17640154327 NOTE: The resulting ASM in mlkem/* is currently unusable because the references to the .data section have been messed up during simpasm. As mentioned above, please see if you can follow the approach from the AArch64 backend: Define the NTT and invNTT twiddle tables in *.c and pass them to the ASM routines as arguments. The other constants can be generated in the code itself, as in https://github.com/pq-code-package/mlkem-native/blob/main/mlkem/src/native/aarch64/src/ntt.S#L79 for example. If it's inconvenient to do this, you can also go with a single large constant table including all constants you need, pass the pointer to that to each ASM function, and load from a suitable offset in the ASM. This is the approach used in the x86_64 backend, see dev/x86_64/src/consts.c. |
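The single-table option described above can be pictured with a small C sketch. This is only an illustration under stated assumptions: the names `ex_consts`, `EX_OFF_Q`, `EX_OFF_ZETAS` and `ex_pointwise_twiddle` are hypothetical, the twiddle values are placeholders (powers of 17 mod q), and the routine is a C stand-in for what would be an assembly function. It is not the layout used by dev/x86_64/src/consts.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout of one shared constant table (illustrative names). */
enum {
  EX_OFF_Q = 0,     /* offset of the modulus q */
  EX_OFF_ZETAS = 1, /* offset of the first twiddle factor */
  EX_NUM_ZETAS = 4  /* number of placeholder twiddles in this sketch */
};

/* Placeholder contents: q = 3329, then 17^1..17^4 mod q. */
static const int16_t ex_consts[EX_OFF_ZETAS + EX_NUM_ZETAS] = {
    3329, 17, 289, 1584, 296};

/* C stand-in for an assembly routine. Every constant is read through the
 * `consts` pointer at a fixed offset, so the routine itself would not need
 * a .data section. */
void ex_pointwise_twiddle(int16_t *r, size_t n, const int16_t *consts) {
  assert(r != NULL && consts != NULL);
  const int32_t q = consts[EX_OFF_Q];
  const int16_t *zetas = consts + EX_OFF_ZETAS;
  for (size_t i = 0; i < n; i++) {
    /* Multiply by a twiddle and reduce to the range [0, q). */
    int32_t v = ((int32_t)r[i] * zetas[i % EX_NUM_ZETAS]) % q;
    if (v < 0) v += q;
    r[i] = (int16_t)v;
  }
}
```

The caller owns the table and passes its address, for example `ex_pointwise_twiddle(coeffs, 4, ex_consts);`. An assembly version would receive the same pointer in an argument register and load from it with fixed displacements.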
|
@hanno-becker Thanks for the pointer. But I am not a Python programmer and can't really comprehend Python, so changing scripts will not be my first choice. I just want to get simpasm to work on my code. I can change my code to use a data array from a C file. But I need an example (a command-line example) of generating simplified assembly. So, where do you run simpasm from? From the scripts directory or the dev directory? And what options do I need to pass? Like simpasm -???? Just an example for x86 or Arm will be fine. I just want to know how to run it so I can fix my assembly code accordingly. Thanks. I have a t.S file with the .data section stripped. Here is the output for your reference, so you know what I am talking about. [07:06] danny@ltc-zz4-lp9 dev % ../scripts/simpasm -i ${PWD}/ppc64le/src/t.S Traceback (most recent call last): The above exception was the direct cause of the following exception: Traceback (most recent call last): |
|
@dannytsen I have basically done this for you in the branch (atop of your changes), so you won't need to fiddle with Python anymore. But you will need to change the ASM to pass in the constants as arguments, rather than having If you checkout the branch, enter the |
@hanno-becker Sure. I can do that. |
@hanno-becker BTW, there is no nix for ppc. |
|
@dannytsen You should work in an x86_64 or AArch64 Linux/Mac environment and use the PPC64Le cross compiler, which is already part of the environment established by |
@hanno-becker Ok. I'll check that. Thanks. |
|
@dannytsen What indicators/assurances have you obtained so far that the assembly is correct? Also, have you successfully run the code on QEMU, or real HW only? |
@hanno-becker The code was run successfully in liboqs and mlkem-native project on HW. The code was originally written for liboqs. |
|
@dannytsen Independent of the work of separating the twiddles from the assembly: I ran the code in a ppc64le emulator, but it fails as soon as I start to use the NTT or invNTT. Specifically, in a Linux/Mac environment, and using your current This gives: If I comment out It could just be some CPU configuration missing. Are you assuming a particular vector length, for example? I can also see the code failing when running under Bottomline: I'll help with the integration details, but you'd need to find out / demonstrate that/how the code works in an emulated QEMU environment so we can test it in CI -- can you do that please? |
@hanno-becker Which means that it doesn't work with p8, or some instructions are not supported in qemu. |
|
@hanno-becker Here is my output from p9. [00:01] danny@ltc-zz4-lp9 mlkem-native_dev % make test |
|
@dannytsen Thank you. As mentioned, can you please find out how to test the code using qemu? The
Can you find out which one it is? The PR documentation states that the ASM works P8 upwards. |
@hanno-becker It looks like the qemu cross compiler doesn't support the "xxpermdi" instruction. I'll check.
|
|
@dannytsen I don't know. You should be able to find out assembling a minimal example and using |
I'll check. |
|
@hanno-becker I don't have qemu on my system. But I installed nix on my Mac and run the following command under nix, These are the final build output. FUNC ML-KEM-1024: test/build/mlkem1024/bin/test_mlkem1024 I'll check about qemu. |
Please see #1184 (comment) again -- once in the |
|
How are you getting on @dannytsen? |
Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
@hanno-becker Updated. |
|
@dannytsen Apologies if I was not clear, but the ask was not to have prefix names |
@hanno-becker Updated.
@hanno-becker I'll take a look when I have some cycles. Thanks. |
|
@dannytsen @bhess Do you have an update? I'd like to push this over the line and iterate, but we need at least the changes requested in #1184 (review). |
Thanks! It would be great to get this wrapped up soon. @dannytsen, do you have a sense of when Hanno's comment might be addressed? Happy to help if needed, just let me know. |
|
@hanno-becker @bhess This may happen early next year, or sooner if I get some cycles. Thanks. |
|
@dannytsen @bhess I think the aliases and comments are the main issue for now, which should not be hard to resolve. So seeing @dannytsen is busy, @bhess if you have cycles to help advance this in the meantime, that would be great. |
1. Added detailed comments on NTT and INTT implementations. 2. Used C type symbols to improve readability. Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
|
Thanks a lot @dannytsen for your work, this is looking good! Could you extend your use of aliases to also cover parts like
.macro Load_4Rjp
lxvd2x 32+vdata_b1, 3, 10 /* V8: vector r'0 */
lxvd2x 32+vdata_b2, 3, 17 /* V12: vector for r'1 */
lxvd2x 32+vdata_b3, 3, 19 /* V16: vector for r'2 */
lxvd2x 32+vdata_b4, 3, 21 /* V20: vector for r'3 */
lxvd2x 32+vdata_a1, 3, 9 /* V21: vector r0 */
lxvd2x 32+vdata_a2, 3, 16 /* V22: vector r1 */
lxvd2x 32+vdata_a3, 3, 18 /* V23: vector r2 */
lxvd2x 32+vdata_a4, 3, 20 /* V24: vector r3 */
.endm
We should have as few hardcoded registers as possible. |
Replaced more register numbers with C-type variable names. Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
Signed-off-by: Danny Tsen <dtsen@us.ibm.com>
|
I see you’ve updated the PR, @dannytsen, is there any work still outstanding on your end? Also, @hanno-becker and @mkannwischer, do you have any feedback on the latest changes? I'd love to get this back on track and merged. |
|
The latest updates, made in response to the last request, have been in this PR for more than 3 months without any further feedback. Will re-evaluate. Closing. |
|
@dannytsen The relation between the C and the ASM is not at all obvious -- the multiplication instructions are doubling high multiplies with rounding, whereas the C code is a plain high multiplication. The corrections from two different rounding operations are used to emulate the correction needed for what's actually happening, namely a 32-bit addition where the low 16 bits can overflow into the upper 16 bits. If the compiler really turned this into the rounding Montgomery by itself, without this even being documented anywhere in the literature, this would be pretty amazing. Which compiler is that? |
|
@hanno-becker the instruction is documented in the ISA 2.07 and above. |
|
Yes, the instruction is documented, but it is entirely unclear that it would apply to the C code that's being assembled. The C code is a widening 16x16->32 bit multiply-and-add, while what's being used on the ASM here are two rounding+doubling high multiplications, with the rounding taking care of the carry-in from low to high bits in the 32-bit add. cc @bhess This is stunning. Can you explain this? |
|
I'm not certain about the origin of the code, but it appears correct to me. Per Power ISA, we have
The four steps are:
Here In B we can write We have Using the two steps above we can write
Finally, the saturation in |
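The claim that two rounding, doubling high multiplies can stand in for a plain widening multiply-add can be checked numerically. The sketch below is not taken from the PR's assembly or from the Power ISA text; it assumes a rounding, doubling high multiply defined as floor((x*y + 2^14) / 2^15), ignores saturation, and uses hypothetical `ex_` names throughout. It compares the plain Montgomery step (a*b + t*q) / 2^16, where t is the low half of a*b*(-q^{-1}), against the version built from two rounding multiplies followed by a halving.

```c
#include <assert.h>
#include <stdint.h>

#define EX_Q 3329        /* ML-KEM modulus */
#define EX_NEG_QINV 3327 /* -q^{-1} mod 2^16: 3329 * 3327 = 169 * 2^16 - 1 */

/* floor(x / 2^s), avoiding implementation-defined >> on negative values. */
static int32_t ex_floor_shift(int32_t x, int s) {
  int32_t d = (int32_t)1 << s;
  int32_t quo = x / d;
  if (x % d < 0) quo -= 1;
  return quo;
}

/* Assumed rounding, doubling high multiply (saturation not modelled). */
static int32_t ex_rdmulh(int16_t x, int16_t y) {
  return ex_floor_shift((int32_t)x * y + 0x4000, 15);
}

/* t = low 16 bits of a*b*(-q^{-1}), interpreted as a signed value. */
static int16_t ex_mont_t(int16_t a, int16_t b) {
  uint16_t lo = (uint16_t)(uint32_t)((int32_t)a * b);
  uint16_t t = (uint16_t)(lo * (uint32_t)EX_NEG_QINV);
  return (int16_t)(t >= 0x8000 ? (int32_t)t - 65536 : (int32_t)t);
}

/* Plain version: widening multiply-add, then an exact division by 2^16. */
int32_t ex_mont_plain(int16_t a, int16_t b) {
  int32_t sum = (int32_t)a * b + (int32_t)ex_mont_t(a, b) * EX_Q;
  assert(sum % 65536 == 0); /* low halves cancel by construction of t */
  return sum / 65536;
}

/* Emulated version: two rounding multiplies, then a flooring halving. */
int32_t ex_mont_rounding(int16_t a, int16_t b) {
  int32_t s = ex_rdmulh(a, b) + ex_rdmulh(ex_mont_t(a, b), EX_Q);
  return ex_floor_shift(s, 1);
}

/* Count disagreements over all 16-bit a and a sample of b with |b| < q. */
long ex_count_mismatches(void) {
  long bad = 0;
  for (int32_t b = -3328; b <= 3328; b += 17)
    for (int32_t a = -32768; a <= 32767; a++)
      if (ex_mont_plain((int16_t)a, (int16_t)b) !=
          ex_mont_rounding((int16_t)a, (int16_t)b))
        bad++;
  return bad;
}
```

Writing a*b + t*q = k * 2^16, the two rounded terms sum to either 2k or 2k + 1, the latter only when a*b is congruent to 2^14 mod 2^15, so the flooring halving returns k in both cases. Under these assumptions no operand needs to be odd.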
|
@bhess I agree the code is correct -- I am just stunned that the compiler does this reasoning and figures out that two rounding-multiply operations can be used to emulate a plain multiply-add. As I said, we did document the approach in the Neon NTT paper, but erroneously said one operand would need to be odd. So when I asked "Can you explain this?", I was wondering if you have any idea how/why the PPC compiler is able to do this. |
Yeah, this would be quite surprising if a compiler were able to do this. I asked around and indeed, the colleagues derived the method independently for the POWER implementation, without knowing your paper, so no compiler magic involved. |
Brilliant, thank you for finding that out. Well done to the team, then 👍 |

Added optimized ppc64le support functions for ML-KEM.
The supported native functions include:
And other interface functions and headers.
Signed-off-by: Danny Tsen <dtsen@us.ibm.com>