ML-IAP LAMMPS integration fails to parallelize #566
Unanswered
apoletayev asked this question in Q&A
Replies: 1 comment, 1 reply
Hi, we need more information to be able to debug this.
Sorry to bother again. I'm likely missing something very silly here, but the LAMMPS integration, built as in (#556), fails to use multiple GPUs with the ML-IAP pair_style with OEQ. System: L40S GPUs, CUDA 12.6, cuDNN 9.5, LAMMPS built with Kokkos for this hardware. The models were trained with nequip 0.15.0 and deployed either as pair_nequip_allegro or for ML-IAP with OEQ. The models use TF32, but that should not matter?
Call to LAMMPS (one GPU node, multiple GPUs on it):

mpirun -np $SLURM_NTASKS $LAMMPS -sf kk -k on g $SLURM_NTASKS -pk kokkos newton on neigh half -in input.in

Also tried: pair_nequip, and running on one GPU (works as expected, no OOM); minimization / MD (no change to errors).

Potentially related unanswered issue from MACE: ACEsuit/mace#1171
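Not from the thread, but for context: when LAMMPS is built with Kokkos and launched with one MPI rank per GPU, Kokkos round-robins ranks onto the devices passed via `g N`, using the local rank within the node. A minimal sketch of that mapping is below; `NUM_GPUS` and the commented `mpirun` line are illustrative assumptions, not taken from the report.

```shell
#!/bin/sh
# Assumption for illustration: 2 GPUs per node, one MPI rank per GPU.
NUM_GPUS=2

# Kokkos-style device selection: local rank modulo number of visible GPUs.
# (Kokkos reads the local rank from MPI/Slurm environment variables.)
gpu_for_rank() {
  rank=$1
  echo $(( rank % NUM_GPUS ))
}

# Hypothetical launch mirroring the command above, with ranks == GPUs:
# mpirun -np "$NUM_GPUS" $LAMMPS -sf kk -k on g "$NUM_GPUS" \
#        -pk kokkos newton on neigh half -in input.in

gpu_for_rank 0   # rank 0 -> device 0
gpu_for_rank 1   # rank 1 -> device 1
```

If both ranks end up on the same device (e.g. the scheduler exposes only one GPU per rank via CUDA_VISIBLE_DEVICES), the run silently uses a single GPU, which is one common cause of this symptom.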
Error message (two GPUs):