ML-DSA: import and enable aarch64 assembly backend from mldsa-native #3219

Draft
jakemas wants to merge 4 commits into aws:main from jakemas:mldsa-native-aarch64-backend

@jakemas jakemas commented May 5, 2026

Summary

  • Imports the AArch64 native arithmetic backend from mldsa-native into ML-DSA, providing Neon-accelerated assembly for ten formally-verified polynomial operations.
  • Only pure assembly (.S) files with completed HOL-Light functional correctness proofs are imported; NTT/INTT, rej_uniform* and polyz_unpack* are intentionally excluded because they do not (yet) have AArch64 proofs in upstream mldsa-native. The C reference implementation is used on those paths.
  • Follows the same integration pattern as #3195 (ML-DSA: import and enable x86_64 assembly backend from mldsa-native) and ML-KEM's AArch64 backend, using s2n-bignum macros for symbol visibility.

Stacked on top of #3195. Also requires aws/aws-lc-rs#1113 so the aws-lc-rs CC builder can discover the new aarch64 .S files (mirrors #1110 for the x86_64 backend).

The first two commits here are from #3195 (x86_64 backend). The last two commits are the new AArch64 work, plus a refresh of the x86_64 import against the same upstream revision. Reviewers should focus on the two commits starting at db008b7.

Benchmark

Measured on an AWS r8g.4xlarge (Neoverse-V2) with `bssl speed -timeout 3`, single run (ops/sec, higher is better):

┌─────────────────────┬──────────┬──────────┬─────────────┐
│ Operation           │  Before  │  After   │ Improvement │
├─────────────────────┼──────────┼──────────┼─────────────┤
│ MLDSA44 keygen      │  28,309  │  29,184  │   +3.1%     │
│ MLDSA44 signing     │   6,863  │   7,855  │  +14.5%     │
│ MLDSA44 verify      │  27,752  │  30,601  │  +10.3%     │
│ MLDSA65 keygen      │  14,597  │  14,980  │   +2.6%     │
│ MLDSA65 signing     │   4,378  │   5,047  │  +15.3%     │
│ MLDSA65 verify      │  17,528  │  19,103  │   +9.0%     │
│ MLDSA87 keygen      │  10,939  │  11,280  │   +3.1%     │
│ MLDSA87 signing     │   3,631  │   4,134  │  +13.9%     │
│ MLDSA87 verify      │  11,102  │  12,021  │   +8.3%     │
└─────────────────────┴──────────┴──────────┴─────────────┘
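The Improvement column is the relative change in throughput. A minimal sketch of that arithmetic follows; the function name `improvement_percent` is hypothetical and not part of `bssl speed` or this PR.

```c
#include <assert.h>

/* Relative throughput change in percent; positive means "After" is faster.
 * Hypothetical helper for illustration only. */
static double improvement_percent(double before_ops, double after_ops)
{
  assert(before_ops > 0.0); /* a zero baseline has no meaningful ratio */
  return (after_ops - before_ops) / before_ops * 100.0;
}
```

For example, `improvement_percent(4378, 5047)` corresponds to the MLDSA65 signing row. Since the figures come from a single run, small differences between rows should be read with some caution.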

Speedups are smaller than the x86_64 numbers in #3195 because NTT/INTT, which dominate keygen on the C side, are not replaced on AArch64 (no proofs yet upstream). Signing and verify still see meaningful wins from accelerated pointwise multiplication, decompose, use_hint, caddq and chknorm.
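As a rough illustration of what the pointwise routines compute (this is not the imported assembly), here is a plain C sketch of Montgomery multiplication modulo the ML-DSA prime q = 8380417. All `sketch_*` names are hypothetical, and accumulating in 64 bits before a single reduction is an assumption about one reasonable way to structure the accumulate variant, not a description of the upstream code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_Q 8380417     /* ML-DSA modulus q = 2^23 - 2^13 + 1 */
#define SKETCH_QINV 58728449 /* q^-1 mod 2^32 */

/* Montgomery reduction: returns r with r * 2^32 == a (mod q).
 * Assumes |a| < 2^31 * q so the result fits in 32 bits. */
static int32_t sketch_montgomery_reduce(int64_t a)
{
  /* Low 32 bits of a * q^-1; unsigned arithmetic wraps instead of overflowing. */
  int32_t t = (int32_t)((uint32_t)a * (uint32_t)SKETCH_QINV);
  /* a - t*q is an exact multiple of 2^32, so the division is exact. */
  return (int32_t)((a - (int64_t)t * SKETCH_Q) / ((int64_t)1 << 32));
}

/* c[i] = a[i] * b[i] * 2^-32 mod q, coefficient by coefficient. */
static void sketch_pointwise_montgomery(int32_t *c, const int32_t *a,
                                        const int32_t *b, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    c[i] = sketch_montgomery_reduce((int64_t)a[i] * b[i]);
  }
}

/* w[i] = (sum over j < l of u[j][i] * v[j][i]) * 2^-32 mod q.
 * u and v hold l polynomials of n coefficients each, stored back to back.
 * Assumes inputs are small enough that each sum stays below 2^31 * q. */
static void sketch_pointwise_acc_montgomery(int32_t *w, const int32_t *u,
                                            const int32_t *v, size_t l,
                                            size_t n)
{
  assert(l > 0);
  for (size_t i = 0; i < n; i++) {
    int64_t acc = 0;
    for (size_t j = 0; j < l; j++) {
      acc += (int64_t)u[j * n + i] * v[j * n + i];
    }
    w[i] = sketch_montgomery_reduce(acc);
  }
}
```

The l{4,5,7} suffixes in the imported file names correspond to the vector length `l` above for the three parameter sets.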

Changes

  • New files: 10 AArch64 assembly files (mldsa_poly_caddq_asm.S, mldsa_poly_chknorm_asm.S, mldsa_poly_decompose_{32,88}_asm.S, mldsa_poly_use_hint_{32,88}_asm.S, mldsa_pointwise_montgomery.S, mldsa_polyvecl_pointwise_acc_montgomery_l{4,5,7}.S), plus meta.h and arith_native_aarch64.h headers.
  • Modified: mldsa_native_backend.h — dispatches to aarch64/meta.h on OPENSSL_AARCH64.
  • Modified: CMakeLists.txt — adds AArch64 assembly sources to BCM build (mirrors the x86_64 block).
  • Modified: importer.sh — extended to import the AArch64 backend, restricted to the HOL-Light-proved routines, and refactored the s2n-bignum macro fixups into a shared helper used by both backends. Also excludes poly_caddq_avx2.S on x86_64 which upstream recently switched from a C intrinsic to pure assembly but without a proof.
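The dispatch change can be pictured with a small preprocessor sketch. This is not the contents of mldsa_native_backend.h: the `SKETCH_MLDSA_BACKEND` macro, the helper function and the C fallback branch are assumptions made for illustration, and the real header includes the per-architecture meta.h rather than defining a string.

```c
/* Hypothetical sketch of a platform dispatcher, keyed on the same
 * architecture macros the build already defines. */
#if defined(OPENSSL_AARCH64)
/* The real header would include the aarch64 meta.h here. */
#define SKETCH_MLDSA_BACKEND "aarch64"
#elif defined(OPENSSL_X86_64)
/* The real header would include the x86_64 meta.h here. */
#define SKETCH_MLDSA_BACKEND "x86_64"
#else
/* Assumed fallback: no native backend, use the C reference code. */
#define SKETCH_MLDSA_BACKEND "c-reference"
#endif

/* Returns the name of the backend selected at compile time. */
static const char *sketch_mldsa_backend_name(void)
{
  return SKETCH_MLDSA_BACKEND;
}
```

Keeping the choice in one header means the rest of the ML-DSA sources do not need per-architecture conditionals.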

Functions accelerated

All imported AArch64 functions have completed HOL-Light formal verification proofs:

  • poly_caddq — mldsa_poly_caddq.ml
  • poly_chknorm — mldsa_poly_chknorm.ml
  • poly_decompose (l=5,7 and l=4) — poly_decompose_{32,88}_aarch64_asm.ml
  • poly_use_hint (l=5,7 and l=4) — poly_use_hint_{32,88}_aarch64_asm.ml
  • Pointwise Montgomery multiplication — mldsa_pointwise.ml
  • Polyvec pointwise accumulate for L=4/5/7 — mldsa_pointwise_acc_l{4,5,7}.ml

See the mldsa-native HOL Light README for the authoritative list.
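For readers who want a feel for the two simplest routines in this list, below is a branch-free C sketch of their per-coefficient behaviour. It is a paraphrase under assumptions, not the verified assembly or the upstream C: the `sketch_*` names are hypothetical, and the 0/1 return convention of the norm check is an assumption that may differ from the upstream one.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_Q 8380417 /* ML-DSA modulus q */

/* Conditional add of q: maps a coefficient in (-q, q) to [0, q). */
static int32_t sketch_caddq(int32_t a)
{
  /* mask is all ones when a is negative, zero otherwise. */
  int32_t mask = -(int32_t)((uint32_t)a >> 31);
  return a + (mask & SKETCH_Q);
}

/* Returns 1 if any |a[i]| >= bound, 0 otherwise.
 * Assumes bound > 0 and no coefficient equals INT32_MIN. */
static int sketch_chknorm(const int32_t *a, size_t n, int32_t bound)
{
  uint32_t bad = 0;
  for (size_t i = 0; i < n; i++) {
    int32_t mask = -(int32_t)((uint32_t)a[i] >> 31);
    int32_t abs_a = (a[i] ^ mask) - mask; /* |a[i]| without a branch */
    /* The sign bit of (bound - 1 - abs_a) is set exactly when abs_a >= bound. */
    bad |= (uint32_t)(bound - 1 - abs_a) >> 31;
  }
  return (int)bad;
}
```

Both functions avoid data-dependent branches on coefficient values, which is the usual style for this kind of arithmetic in signature code.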

Call-outs

Testing

  • All 76 ML-DSA and PQDSA tests pass on r8g.4xlarge (KAT, Wycheproof, expanded key validation, context-string round-trips).

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license and the ISC license.

jakemas added 2 commits May 1, 2026 07:04
Add the build infrastructure and importer script changes needed to
enable the x86_64 native arithmetic backend from mldsa-native:

- CMakeLists.txt: add ML-DSA x86_64 assembly sources to BCM build
- mldsa_native_config.h: enable native backend with MLD_CONFIG_USE_NATIVE_BACKEND_ARITH
- mldsa_native_backend.h: platform dispatcher for x86_64
- importer.sh: extend to import x86_64 backend, process assembly with
  s2n-bignum macros, strip C-intrinsic operations, and rename files
  with mldsa_ prefix to avoid basename collisions with ML-KEM

Clean output of running the importer script:
  GITHUB_SHA=b61e84f0c73d4ed612ffcaea4282a9d682de3f46 ./importer.sh --force

This imports formally verified AVX2 assembly for:
- NTT (forward and inverse)
- NTT unpack (custom coefficient order)
- Pointwise Montgomery multiplication
- Polyvec pointwise accumulate for L=4/5/7
@jakemas jakemas requested a review from a team as a code owner May 5, 2026 21:02
@jakemas jakemas marked this pull request as draft May 5, 2026 21:03
jakemas added 2 commits May 5, 2026 21:11
Add the build infrastructure and importer-script changes needed to
enable the AArch64 native arithmetic backend from mldsa-native:

- importer.sh: copy the AArch64 `native/aarch64/` tree; keep only the
  assembly files that have completed HOL-Light functional correctness
  proofs (poly_caddq, poly_chknorm, poly_decompose_{32,88},
  poly_use_hint_{32,88}, pointwise_montgomery and
  polyvecl_pointwise_acc_montgomery_l{4,5,7}). Exclude NTT/INTT,
  rej_uniform* and polyz_unpack* on AArch64 because they are not yet
  formally verified. Strip their declarations and inline wrappers from
  meta.h / arith_native_aarch64.h. Refactor the x86_64 assembly
  post-processing into a shared fixup_asm_backend() helper that also
  handles the AArch64 header (_internal_s2n_bignum_arm.h) and the
  MLD_ASM_FN_SIZE directive used on that side. Also exclude
  poly_caddq_avx2.S on x86_64, which upstream recently converted from
  a C intrinsic into pure assembly but without a HOL-Light proof.

- mldsa_native_backend.h: dispatch to aarch64/meta.h when
  OPENSSL_AARCH64 is defined, falling through to x86_64 otherwise.

- CMakeLists.txt: glob mldsa/native/aarch64/src/*.S into BCM_ASM_SOURCES
  for aarch64 Unix builds (mirrors the x86_64 block).

No generated sources change in this commit; running
`./importer.sh --force` against mldsa-native produces the ML-DSA tree
imported in the follow-up commit.

Clean output of running the updated importer script:
  GITHUB_SHA=45ba4b3e87aba0e6681f256a3e5f90e01b0e3af1 ./importer.sh --force

This imports the formally verified AArch64 assembly for:
- poly_caddq
- poly_chknorm
- poly_decompose (l=5,7 and l=4)
- poly_use_hint  (l=5,7 and l=4)
- pointwise multiplication (Montgomery)
- polyvecl_pointwise_acc (Montgomery) for L=4, 5, 7

Also refreshes the x86_64 backend from the same upstream revision.
AArch64 NTT/INTT, rej_uniform* and polyz_unpack* are intentionally not
imported because they do not yet have HOL-Light proofs; the C reference
implementation is used on those paths.
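
For reference, the {32,88} suffixes on poly_decompose and poly_use_hint select the rounding range gamma2 = (q-1)/32 (l=5,7) or gamma2 = (q-1)/88 (l=4). The sketch below follows the Decompose and UseHint definitions in FIPS 204 for a single coefficient. It is a straightforward, non-constant-time paraphrase with hypothetical `sketch_*` names, not the imported assembly.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_Q 8380417 /* ML-DSA modulus q */

/* Split r in [0, q) into high bits r1 and low bits r0 such that
 * r == r1 * 2*gamma2 + r0 (mod q) with -gamma2 < r0 <= gamma2. */
static void sketch_decompose(int32_t r, int32_t gamma2, int32_t *r1,
                             int32_t *r0)
{
  const int32_t alpha = 2 * gamma2;
  assert(r >= 0 && r < SKETCH_Q);
  int32_t lo = r % alpha; /* in [0, alpha) */
  if (lo > gamma2) {
    lo -= alpha; /* centered remainder in (-gamma2, gamma2] */
  }
  if (r - lo == SKETCH_Q - 1) {
    /* Wrap-around corner case: the top bucket folds back onto zero. */
    *r1 = 0;
    *r0 = lo - 1;
  } else {
    *r1 = (r - lo) / alpha;
    *r0 = lo;
  }
}

/* Recover the corrected high bits of r given a one-bit hint. */
static int32_t sketch_use_hint(int32_t r, int32_t gamma2, int hint)
{
  const int32_t m = (SKETCH_Q - 1) / (2 * gamma2); /* 16 or 44 buckets */
  int32_t r1, r0;
  sketch_decompose(r, gamma2, &r1, &r0);
  if (!hint) {
    return r1;
  }
  /* Step to the neighbouring bucket, wrapping modulo m. */
  return r0 > 0 ? (r1 + 1) % m : (r1 + m - 1) % m;
}
```

With gamma2 = (q-1)/32 there are 16 buckets, and with gamma2 = (q-1)/88 there are 44, which is why each variant is imported as its own routine.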
@jakemas jakemas force-pushed the mldsa-native-aarch64-backend branch from bc1c592 to 33b45b8 Compare May 5, 2026 21:12
@github-actions github-actions Bot left a comment

clang-tidy made some suggestions

There were too many comments to post at once. Showing the first 10 out of 17. Check the log or trigger a new build to see more.

void mld_pack_sig_h_poly(uint8_t sig[MLDSA_CRYPTO_BYTES], const mld_poly *h,
unsigned int k, unsigned int n)
{
unsigned int j;

warning: variable 'j' is not initialized [cppcoreguidelines-init-variables]

Suggested change
unsigned int j;
unsigned int j = 0;

*/
mld_memset(sig, 0, MLDSA_POLYVECH_PACKEDBYTES);
* coming from each of the K polynomials in h. */
uint8_t *sig_h = sig + MLDSA_CTILDEBYTES + MLDSA_L * MLDSA_POLYZ_PACKEDBYTES;

warning: variable 'sig_h' is not initialized [cppcoreguidelines-init-variables]

Suggested change
uint8_t *sig_h = sig + MLDSA_CTILDEBYTES + MLDSA_L * MLDSA_POLYZ_PACKEDBYTES;
uint8_t *sig_h = NULL = sig + MLDSA_CTILDEBYTES + MLDSA_L * MLDSA_POLYZ_PACKEDBYTES;

{
unsigned int i, j;
unsigned int old_hint_count;
const uint8_t *packed_hints =

warning: variable 'packed_hints' is not initialized [cppcoreguidelines-init-variables]

Suggested change
const uint8_t *packed_hints =
const uint8_t *packed_hints = NULL =

unsigned int old_hint_count;
const uint8_t *packed_hints =
sig + MLDSA_CTILDEBYTES + MLDSA_L * MLDSA_POLYZ_PACKEDBYTES;
const unsigned int old_hint_count =

warning: variable 'old_hint_count' is not initialized [cppcoreguidelines-init-variables]

Suggested change
const unsigned int old_hint_count =
const unsigned int old_hint_count = 0 =

sig + MLDSA_CTILDEBYTES + MLDSA_L * MLDSA_POLYZ_PACKEDBYTES;
const unsigned int old_hint_count =
(i == 0) ? 0 : packed_hints[MLDSA_OMEGA + i - 1];
const unsigned int new_hint_count = packed_hints[MLDSA_OMEGA + i];

warning: variable 'new_hint_count' is not initialized [cppcoreguidelines-init-variables]

Suggested change
const unsigned int new_hint_count = packed_hints[MLDSA_OMEGA + i];
const unsigned int new_hint_count = 0 = packed_hints[MLDSA_OMEGA + i];

const unsigned int old_hint_count =
(i == 0) ? 0 : packed_hints[MLDSA_OMEGA + i - 1];
const unsigned int new_hint_count = packed_hints[MLDSA_OMEGA + i];
unsigned int j;

warning: variable 'j' is not initialized [cppcoreguidelines-init-variables]

Suggested change
unsigned int j;
unsigned int j = 0;

void mld_polyvec_matrix_expand_eager(mld_polymat_eager *mat,
const uint8_t rho[MLDSA_SEEDBYTES])
{
unsigned int i, j;

warning: variable 'i' is not initialized [cppcoreguidelines-init-variables]

Suggested change
unsigned int i, j;
unsigned int i = 0, j;

void mld_polyvec_matrix_expand_eager(mld_polymat_eager *mat,
const uint8_t rho[MLDSA_SEEDBYTES])
{
unsigned int i, j;

warning: variable 'j' is not initialized [cppcoreguidelines-init-variables]

Suggested change
unsigned int i, j;
unsigned int i, j = 0;

decreases(MLDSA_K * MLDSA_L - i)
)
{
uint8_t x = (uint8_t)(i / MLDSA_L);

warning: variable 'x' is not initialized [cppcoreguidelines-init-variables]

Suggested change
uint8_t x = (uint8_t)(i / MLDSA_L);
uint8_t x = 0 = (uint8_t)(i / MLDSA_L);

)
{
uint8_t x = (uint8_t)(i / MLDSA_L);
uint8_t y = (uint8_t)(i % MLDSA_L);

warning: variable 'y' is not initialized [cppcoreguidelines-init-variables]

Suggested change
uint8_t y = (uint8_t)(i % MLDSA_L);
uint8_t y = 0 = (uint8_t)(i % MLDSA_L);

@codecov-commenter

Codecov Report

❌ Patch coverage is 92.94118% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 78.12%. Comparing base (c0fe8a9) to head (33b45b8).
⚠️ Report is 13 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| crypto/fipsmodule/ml_dsa/mldsa/sign.c | 89.18% | 12 Missing ⚠️ |
| ...rypto/fipsmodule/ml_dsa/mldsa/native/x86_64/meta.h | 81.81% | 6 Missing ⚠️ |

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3219      +/-   ##
==========================================
+ Coverage   78.06%   78.12%   +0.06%     
==========================================
  Files         689      692       +3     
  Lines      122732   123031     +299     
  Branches    17083    17114      +31     
==========================================
+ Hits        95816    96124     +308     
+ Misses      26014    26003      -11     
- Partials      902      904       +2     
