test: Add MPI library test script by dgchinner · Pull Request #135 · linux-system-roles/hpc

dgchinner · 2026-05-21T05:16:55Z

This series:

- adds a new MPI library test script installed into /opt/hpc/azure/tests
- adds the source tree for the OSU (Ohio State University) MPI test infrastructure under /opt/hpc/azure/tests/osu-micro-benchmarks
- iterates over all the installed MPI environments, building the tests and running them on a single host
- tests both CPU-based MPI and CUDA/GPU-based MPI
- runs installation, point-to-point and collective tests
- when multiple GPUs are present in the system, also builds and runs GPU-based point-to-point and collective tests
- currently only runs on a single host

This series is built on top of the mpi-updates branch posted in PR #134 and so will require that PR to be merged first.

$ ./test-mpi-omb.sh -g
[2026-05-20 23:35:25] ==========================================================
[2026-05-20 23:35:25] MPI Implementation Test Suite (OSU Micro-Benchmarks)
[2026-05-20 23:35:25] ==========================================================

[2026-05-20 23:35:25] Found 5 MPI module(s): mpi/hpcx-2.24.1 mpi/hpcx-2.24.1-pmix-4.2.9 mpi/openmpi-5.0.8-cuda12-gpu mpi/openmpi-x86_64 mpi/mvapich-4.0

[2026-05-20 23:35:25] ========================================
[2026-05-20 23:35:25] Testing MPI module: mpi/hpcx-2.24.1
[2026-05-20 23:35:25] ========================================

[PASS] mpi/hpcx-2.24.1: mpicc available
[PASS] mpi/hpcx-2.24.1: mpiexec available
[2026-05-20 23:35:26] Building OMB for mpi/hpcx-2.24.1 in /opt/hpc/azure/tests/osu-micro-benchmarks (CUDA/NCCL enabled)
[PASS] mpi/hpcx-2.24.1: OMB build succeeded
[2026-05-20 23:35:59] Running startup tests...
[PASS] mpi/hpcx-2.24.1: osu_hello
[PASS] mpi/hpcx-2.24.1: osu_init
[2026-05-20 23:36:00] Running omb_pt2pt with 2 processes...
[PASS] mpi/hpcx-2.24.1: omb_pt2pt
[2026-05-20 23:36:04] Running omb_coll with 8 processes...
[PASS] mpi/hpcx-2.24.1: omb_coll
[2026-05-20 23:36:35] Skipping NCCL tests: requires at least 2 GPUs (1 found)

[2026-05-20 23:36:36] ========================================
[2026-05-20 23:36:36] Testing MPI module: mpi/hpcx-2.24.1-pmix-4.2.9
[2026-05-20 23:36:36] ========================================
.....

[2026-05-20 23:42:07] ==========================================================
[2026-05-20 23:42:07] Test Summary
[2026-05-20 23:42:07] ==========================================================
[2026-05-20 23:42:07]   Passed:  35
[2026-05-20 23:42:07] ==========================================================

The tests will not run if GPU testing is requested and there are no GPUs in the system:

$ ./test-mpi-omb.sh -g
ERROR: -g requires NVidia GPUs but none were detected
$

If an MPI module fails to load (e.g. because there are no GPUs in the system), the script handles the failure gracefully by skipping that module and continuing with the rest:

$ ./test-mpi-omb.sh
....
[2026-05-21 01:16:31] ========================================
[2026-05-21 01:16:31] Testing MPI module: mpi/mvapich-4.0
[2026-05-21 01:16:31] ========================================

Error: MVAPICH 4.0 was built with CUDA/UCX support and requires NVidia GPUs. This machine has no GPUs. Use a different MPI module (e.g. hpcx or openmpi). 
[2026-05-21 01:16:32] Skipping mpi/mvapich-4.0: module failed to load


[2026-05-21 01:16:32] ==========================================================
[2026-05-21 01:16:32] Test Summary
[2026-05-21 01:16:32] ==========================================================
[2026-05-21 01:16:32]   Passed:  28
[2026-05-21 01:16:32] ==========================================================
$

Issue Tracker Tickets (Jira or BZ if any): https://redhat.atlassian.net/browse/RHELHPC-118

Summary by CodeRabbit

New Features
- Added support for MVAPICH MPI implementation installation.
- Introduced mpifileutils and MPI test suite installation options.
- Added GPU-aware MPI module defaults with automatic fallback configuration when GPUs are unavailable.
- Introduced OSU Micro-Benchmarks test harness for validating MPI across installed modules.
Configuration Updates
- Restructured GPU support configuration for more granular control over MPI components.

Move all MPI-related tasks out of tasks/main.yml into a dedicated tasks/mpi.yml file for easier navigation and maintenance. This includes the precondition checks, OpenMPI/HPC-X/PMIx/GDRCopy build and install, etc. This is done in preparation for adding more MPI functionality.

The main.yml file now includes mpi.yml via include_tasks at the point where the MPI blocks previously appeared (after RDMA packages, before Docker).

Signed-off-by: Dave Chinner <dchinner@redhat.com>

mpifileutils provides MPI-based file utilities for parallel file operations including tools like dcp, drm, dsync, dfind, dwalk, dcmp, and dtar. The package is built from source using cmake with HPC-X MPI, matching the upstream azhpc-images build process.

The build uses the same temporary directory pattern as the OpenMPI build: download and extract to a tempdir, build in a separate tempdir, install to the __hpc_azure_resource_dir/mpifileutils directory, then clean up both temp directories.

A parameter check is added to ensure HPC-X MPI is available before attempting to build mpifileutils, since HPC-X provides the MPI compilers required for the cmake build.

The package is only installed in Azure test environments (tests_azure.yml). All other test playbooks explicitly disable it to avoid requiring HPC-X MPI.

Changes:
- Add __hpc_mpifileutils_info to vars/RedHat_9.yml (version 0.12)
- Add __hpc_mpifileutils_build_dependencies and __hpc_mpifileutils_install_dir to vars/main.yml
- Add hpc_install_mpifileutils default (true) to defaults/main.yml
- Add parameter validation check requiring hpc_build_openmpi_w_nvidia_gpu_support
- Add download, build, and install tasks using tempdir pattern
- Add mpifileutils build deps to the build dependency cleanup task
- Disable mpifileutils in tests_default, tests_skip_toolkit, and tests_include_vars_from_parent

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: add mpifileutils package to the HPC system role. You will find the version to install in the versions.json file in the azhpc-images repository, and the way it needs to be built in components/install_mpifileutils.sh. You will install it to the __hpc_azure_resource_dir directory and use the same temporary build area construct as used for building the openmpi code.

Refinements:
- Disable mpifileutils in all non-Azure test playbooks so only tests_azure.yml installs it

Signed-off-by: Dave Chinner <dchinner@redhat.com>
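The temporary directory pattern this commit describes (extract into one tempdir, build in a second, install, then remove both) can be sketched in shell. This is only an illustration under stated assumptions: the role implements the pattern as Ansible tasks, the function name `build_with_tempdirs` is hypothetical, and the "build" step here merely stages files where a real build would invoke cmake and make.

```shell
# Hypothetical helper: extract, "build", install, then clean up both temp dirs.
build_with_tempdirs() {
    local tarball="$1" install_dir="$2"
    local src_dir build_dir status=0

    src_dir="$(mktemp -d)" || return 1                        # extraction area
    build_dir="$(mktemp -d)" || { rm -rf "$src_dir"; return 1; }  # build area

    # Extract the source, stage it in the build dir (a stand-in for the real
    # cmake/make step), then install into the destination directory.
    tar -xf "$tarball" -C "$src_dir" --strip-components=1 &&
        cp -r "$src_dir"/. "$build_dir"/ &&
        mkdir -p "$install_dir" &&
        cp -r "$build_dir"/. "$install_dir"/ || status=1

    # Clean up both temp directories whether or not the steps succeeded.
    rm -rf "$src_dir" "$build_dir"
    return "$status"
}
```

Keeping the extraction and build areas separate means a failed build never leaves partial output in the install directory, and nothing persists in the temporary locations after the function returns.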

MVAPICH is a high-performance MPI implementation optimised for InfiniBand and other high-speed networks. Version 4.0 is built from source using the same temporary directory pattern as the OpenMPI build. The build uses ./configure with --enable-g=none --enable-fast=yes flags matching the upstream azhpc-images build process, and installs to /opt/mvapich-<version>.

When hpc_build_mpi_w_nvidia_gpu_support is enabled, the build additionally passes --with-ucx and --with-cuda to configure so that MVAPICH is built with GPU-aware MPI support using the same UCX and CUDA paths as OpenMPI.

An Lmod environment module is provided in lua format, consistent with the existing openmpi and hpcx modulefiles, allowing users to load MVAPICH via 'module load mpi/mvapich-4.0'. The module conflicts with other MPI modules so only one can be loaded at a time. When GPU support is enabled, the module also adds the UCX and CUDA library paths to LD_LIBRARY_PATH and PATH, matching the openmpi-cuda module.

Changes:
- Add __hpc_mvapich_info to vars/RedHat_9.yml (version 4.0)
- Add __hpc_mvapich_install_dir to vars/main.yml
- Add hpc_install_mvapich default (true) to defaults/main.yml
- Add download, build, install, and modulefile tasks to tasks/mpi.yml
- Add mvapich-ver.lua.j2 Lmod modulefile template
- Disable hpc_install_mvapich in tests_default, tests_skip_toolkit, and tests_include_vars_from_parent
- Rename hpc_build_openmpi_w_nvidia_gpu_support to hpc_build_mpi_w_nvidia_gpu_support as the flag now guards GPU support for both OpenMPI and MVAPICH builds
- Conditionally pass --with-ucx and --with-cuda to MVAPICH configure when GPU support is enabled

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: add MVAPICH MPI library to the HPC system role. Use version 4.0 as per the reference versions.json, and the build instructions can be derived from components/install_mpis.sh. Ignore the other MPI libraries in that reference file. Add the lmod environment modules using the lua script format needed to use the MVAPICH libraries similar to those installed by the system role for the openmpi library.

Refinements:
- configure with --with-device=ch4:ucx to use libucx as the network transport instead of the built in libfabrics code.
- Add --with-ucx and --with-cuda configure flags guarded by hpc_build_mpi_w_nvidia_gpu_support for GPU-aware MPI support.
- Rename hpc_build_openmpi_w_nvidia_gpu_support to hpc_build_mpi_w_nvidia_gpu_support since it now applies to multiple MPI library builds.
- Add UCX and CUDA library/bin paths to the MVAPICH Lmod module when GPU support is enabled, matching the openmpi-cuda module.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

Separate the lmod environment module file installation from the MPI library build tasks into standalone task blocks. This allows modulefile changes to be deployed by re-running the playbook without triggering a rebuild of the MPI libraries, which significantly speeds up the iterative development and testing of lmod configuration changes.

The OpenMPI-based module files (PMIx, HPC-X, HPC-X+PMIx, OpenMPI, and the no-GPU defaults helper) are grouped under a single block gated by hpc_build_mpi_w_nvidia_gpu_support. The MVAPICH module file has its own block gated by hpc_install_mvapich. Both blocks ensure the target directories exist before installing files. The template and copy modules are idempotent so these tasks are safe to run on every playbook invocation.

Changes:
- Remove PMIx modulefile install from the PMIx build block
- Remove MPI module directory creation and HPC-X/OpenMPI/no-GPU helper installs from the GPU MPI build block
- Remove MVAPICH module directory creation and modulefile install from the MVAPICH build block
- Add new "Install OpenMPI-based lmod environment module files" block
- Add new "Install MVAPICH lmod environment module file" block

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: having to rebuild the mpi libraries to install and test changes to the lmod configuration takes a long time. extract the lmod configuration file installation from each of the MPI library installs, and implement a single task that installs all of the individual lmod files. trigger the installation of the files if any of the MPI libraries is rebuilt, or if the /usr/share/modulefiles/mpi is missing. install the individual files according to the installation parameters for each of the MPI libraries that already exist.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

MPI libraries built with CUDA/GPU acceleration use UCX-based transports that cause warnings or failures on machines without GPUs. This adds runtime GPU detection to the lmod environment modules so that when no NVidia GPUs are present, the GPU transports are automatically disabled.

For OpenMPI-derived libraries (OpenMPI, HPC-X), a shared Jinja include fragment (openmpi-no-gpu-defaults.lua.j2) checks for /dev/nvidia0 and sets OMPI_MCA environment variables to exclude ucx, smcuda, ucc, cuda, and hcoll transports. The fragment is inlined into each module file at template rendering time via {% include %}.

For MVAPICH (when built with GPU support), the module refuses to load on machines without GPUs. MVAPICH hard-codes HPC-X UCX library paths into libmpi.so at build time so it cannot fall back to system UCX. The module issues an LmodError directing users to an alternative MPI module instead.

Changes:
- Add templates/openmpi-no-gpu-defaults.lua.j2 shared GPU detection fragment
- Add {% include %} to openmpi-ver-cuda12-gpu.lua.j2
- Add {% include %} to hpcx-ver.lua.j2
- Add {% include %} to hpcx-ver-pmix-ver.lua.j2
- Add LmodError to mvapich-ver.lua.j2 to refuse loading on non-GPU machines

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: the MPI libraries that are optimised for CUDA and GPU acceleration need different option sets to run on machines without GPUs. All the OpenMPI derived libraries require mpirun/mpiexec to have "--mca pml ^ucx --mca btl ^smcuda --mca osc ^ucx --mca coll ^ucc,cuda,hcoll" to turn off all the underlying UCX-based GPU accelerations. MVAPICH will require a different set of parameters as it passes environment and config variables in a different manner. These need to be set up in the lmod environment modules for each MPI library. If the system does not have any GPUs in it, they should set up the default mpirun/exec environment to use these "avoid using cuda/GPU transports" mechanisms automatically.

Refinements:
- Use Jinja {% include %} to inline GPU detection at deploy time
- MVAPICH refuses to load on non-GPU machines via LmodError because it hard-codes HPC-X UCX paths into libmpi.so at build time

Signed-off-by: Dave Chinner <dchinner@redhat.com>
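For readers more familiar with shell than Lmod, the no-GPU fallback for OpenMPI-derived libraries can be approximated as follows. This is a sketch under assumptions, not the contents of openmpi-no-gpu-defaults.lua.j2: it relies on Open MPI's convention of reading MCA parameters from `OMPI_MCA_<name>` environment variables and mirrors the `--mca` flags quoted in the prompt. The function name and the device-path argument are hypothetical, the latter added only so the check can be exercised on any machine.

```shell
# Hypothetical shell equivalent of the no-GPU defaults: if the NVIDIA device
# node is absent, export MCA settings that exclude the GPU/UCX components.
set_no_gpu_mpi_defaults() {
    local gpu_dev="${1:-/dev/nvidia0}"   # path argument is a test hook
    if [ ! -e "$gpu_dev" ]; then
        export OMPI_MCA_pml='^ucx'              # same as: --mca pml ^ucx
        export OMPI_MCA_btl='^smcuda'           # same as: --mca btl ^smcuda
        export OMPI_MCA_osc='^ucx'              # same as: --mca osc ^ucx
        export OMPI_MCA_coll='^ucc,cuda,hcoll'  # same as: --mca coll ^ucc,cuda,hcoll
    fi
}
```

Setting the values in the environment rather than on the mpirun command line is what lets a module file apply them transparently: any later mpirun or mpiexec invocation in that shell inherits them.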

Add the OSU Micro-Benchmarks (OMB) package to the system role as an MPI implementation validation test suite. The role downloads and extracts the OMB source into the azure tests directory, and installs a test script that discovers all installed MPI modules via Lmod, builds OMB against each one, and runs a set of single-host MPI tests covering startup, point-to-point, and collective operations.

If a module fails to load (e.g. mvapich on a non-GPU machine), the test script skips that module and continues testing the remaining modules rather than failing the entire test suite.

The test script is designed to fail fast on the first error, leaving the build artifacts in /tmp/omb-builds/ for debugging. Startup tests run with 1 process, point-to-point tests with 2, and collective tests with nproc.

Changes:
- Add __hpc_omb_info to vars/RedHat_9.yml with OMB 8.0b2 URL and checksum
- Add __hpc_azure_omb_dir to vars/main.yml for the OMB source location
- Add hpc_install_mpi_tests default variable (true)
- Add tasks to download, extract OMB and install the test script
- Add test-mpi-omb.sh.j2 template for MPI validation
- Disable mpi_tests in CI test configurations

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: start building a MPI implementation test suite. We will start with the OSU microbenchmark package, downloading it from https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-8.0b2.tar.gz, calculating the sha256sum and then adding it to the system role. The system role will unpack it into the azure tests directory, and from there we will write a test script that iterates all the installed mpi libraries (via module loading) to build and run a set of tests from the OMB suite. Initially the test script will focus on running the tests on a single host, running tests on a cluster via a scheduler is a future modification. The test script will also begin by focussing on the MPI tests in the suite, more expansive functional testing is a future modification.

Refinements:
- Use ml -t spider mpi/ for module discovery instead of filesystem scanning
- Remove Lmod init - rely on user shell environment already having modules loaded
- fail() exits immediately to leave a debuggable corpse
- Set np per test category: 1 for startup, 2 for pt2pt, nproc for collective
- Do not use --allow-run-as-root as tests should run as a regular user
- Skip modules that fail to load instead of aborting the test suite

Signed-off-by: Dave Chinner <dchinner@redhat.com>
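The two control-flow rules in this commit (a per-category process count, and skipping modules that fail to load) can be illustrated with a small shell sketch. It is not taken from test-mpi-omb.sh.j2: the function names are hypothetical, and the module loader is passed in as a command so that the loop can be exercised without Lmod installed.

```shell
# Process count per test category, following the commit message:
# 1 for startup, 2 for point-to-point, nproc for collectives.
np_for_category() {
    case "$1" in
        startup) echo 1 ;;
        pt2pt)   echo 2 ;;
        coll)    nproc ;;
        *)       echo "unknown category: $1" >&2; return 1 ;;
    esac
}

# Walk a list of modules, skipping any the loader command rejects.
# "$1" is a hypothetical loader (for example a wrapper around `module load`).
test_modules() {
    local loader="$1"; shift
    local mod tested=0
    for mod in "$@"; do
        if ! "$loader" "$mod"; then
            echo "Skipping $mod: module failed to load"
            continue
        fi
        echo "Testing MPI module: $mod"
        tested=$((tested + 1))
    done
    echo "Tested: $tested"
}
```

A load failure is handled with `continue` rather than an exit, which is what separates it from a test failure: the former moves on to the next module, while the latter (per the commit message) stops the run so the build artifacts can be inspected.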

Add a -g CLI flag to test-mpi-omb.sh that builds the OSU Micro-Benchmarks with CUDA GPU support enabled. The CUDA configure flags are only available when hpc_build_mpi_w_nvidia_gpu_support was set during deployment; if -g is passed but GPU support was not built, the script exits with an error.

Changes:
- Add ENABLE_CUDA variable and -g option to getopts parsing
- Conditionally pass --enable-cuda and --with-cuda to OMB configure
- Use Jinja2 conditional to gate CUDA paths on hpc_build_mpi_w_nvidia_gpu_support
- Error out if -g is used but MPI was not built with GPU support

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: add a CLI parameter to the MPI test script that builds the test code with CUDA and GPU functionality enabled.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
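A minimal sketch of how such a flag might be parsed and validated is shown below. `ENABLE_CUDA` and the `-g` option come from the commit text, and `-v` from the review summary; everything else is an assumption made for illustration, including the `VERBOSE` variable, the function names, and the use of `/dev/nvidia0` as the detection mechanism (the real script's GPU check is not shown in this PR page).

```shell
# Parse -g (build with CUDA) and -v (verbose) using the getopts builtin.
parse_args() {
    ENABLE_CUDA=0
    VERBOSE=0            # hypothetical variable name
    local opt OPTIND=1   # reset OPTIND so the function can be called repeatedly
    while getopts "gv" opt; do
        case "$opt" in
            g) ENABLE_CUDA=1 ;;
            v) VERBOSE=1 ;;
            *) echo "usage: $0 [-g] [-v]" >&2; return 2 ;;
        esac
    done
}

# Reject a GPU test request when no GPU device node is present.
# The device path argument is a hypothetical hook for testing.
check_gpu_request() {
    local gpu_dev="${1:-/dev/nvidia0}"
    if [ "$ENABLE_CUDA" -eq 1 ] && [ ! -e "$gpu_dev" ]; then
        echo "ERROR: -g requires NVidia GPUs but none were detected" >&2
        return 1
    fi
}
```

Validating the request before any module is loaded or built keeps the failure immediate, which matches the behaviour shown in the PR description where `./test-mpi-omb.sh -g` prints the error and returns straight to the prompt.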

When the -g flag is passed, extend the test suite to exercise NCCL functionality via the OMB xccl benchmarks. The NCCL tests run standalone pt2pt benchmarks (latency, bandwidth, bidirectional bandwidth) and collective benchmarks (allreduce, allgather, bcast, reduce, reduce_scatter, alltoall) which exercise the NCCL communication library directly.

The OMB configure is extended with --enable-ncclomb to build the NCCL benchmark binaries when CUDA support is enabled.

Includes a workaround for an upstream OMB 8.0b2 bug where the xccl Makefile.am files are missing omb_color.c from the UTILITIES list, causing link failures. The fix patches the Makefile.am files and runs autoreconf before configure.

The autotools packages (autoconf, automake, libtool) are moved from __hpc_openmpi_build_dependencies to a new __hpc_mpi_packages list so they persist after the build phase and are available for the autoreconf workaround at test time.

Changes:
- Add --enable-ncclomb to CUDA configure flags
- Add NCCL xccl pt2pt tests (latency, bw, bibw)
- Add NCCL xccl collective tests (allreduce, allgather, bcast, reduce, reduce_scatter, alltoall)
- Workaround OMB 8.0b2 xccl link failure by adding omb_color.c to UTILITIES in Makefile.am and running autoreconf before configure
- Move autoconf/automake/libtool from __hpc_openmpi_build_dependencies to __hpc_mpi_packages so they are not removed after building

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: extend the MPI test script to cover CUDA, GPU and NCCL related functionality provided by the OMB suite.

Refinements:
- Workaround upstream OMB 8.0b2 bug where xccl Makefile.am files are missing omb_color.c from the UTILITIES list.
- Move autotools packages to persistent __hpc_mpi_packages list.
- Remove MPI launcher GPU memory tests (-d cuda D D) as the launcher does not support per-benchmark GPU memory placement options.

Signed-off-by: Dave Chinner <dchinner@redhat.com>

coderabbitai · 2026-05-21T05:17:05Z

📝 Walkthrough

This PR refactors the HPC role's MPI installation system from a monolithic OpenMPI workflow into a modular multi-MPI architecture supporting OpenMPI, MVAPICH, and mpifileutils with optional GPU/HPC-X builds, complemented by Lmod environment modules and a comprehensive OSU Micro-Benchmarks testing harness.

Changes

MPI Installation and Testing System Refactor

| Layer / File(s) | Summary |
| --- | --- |
| Configuration defaults and internal variables: `defaults/main.yml`, `vars/main.yml`, `vars/RedHat_9.yml`, `tests/tests_*.yml` | Public defaults now expose granular MPI options (`hpc_build_mpi_w_nvidia_gpu_support`, `hpc_install_mvapich`, `hpc_install_mpifileutils`, `hpc_install_mpi_tests`) replacing the prior OpenMPI-only GPU flag. Internal variables define build dependencies, installation paths, and package metadata for MVAPICH, mpifileutils, and OSU Micro-Benchmarks. Test configurations override defaults appropriately. |
| MPI library build and installation pipeline: `tasks/main.yml`, `tasks/mpi.yml` | Removes inline OpenMPI workflow and delegates to `tasks/mpi.yml`. New pipeline validates GPU/NCCL prerequisites and mpifileutils constraints, installs base OpenMPI packages, orchestrates GPU/HPC-X builds including PMIx, GDRCopy, and HPC-X rebuild with CUDA support, conditionally builds mpifileutils against HPC-X, builds MVAPICH with GPU flags when enabled, and cleans temporary sources. |
| Lmod module environment configuration: `templates/openmpi-no-gpu-defaults.lua.j2`, `templates/hpcx-ver.lua.j2`, `templates/hpcx-ver-pmix-ver.lua.j2`, `templates/openmpi-ver-cuda12-gpu.lua.j2`, `templates/mvapich-ver.lua.j2` | Shared `openmpi-no-gpu-defaults` template detects NVIDIA GPU and disables GPU/UCX transports on GPU-less systems. HPC-X, PMIx, and CUDA OpenMPI variants include the default settings. MVAPICH module enforces GPU presence check when built with CUDA/UCX support and sets transport paths and `MPI_*` environment variables. |
| OSU Micro-Benchmarks testing harness: `tasks/mpi.yml`, `templates/test-mpi-omb.sh.j2` | Task downloads OMB sources and deploys templated test script. Script parses `-g` (GPU) and `-v` (verbose) flags, discovers MPI modules, and executes per-module testing: builds OMB, runs startup/point-to-point/collective benchmarks, conditionally runs NCCL benchmarks on GPU-enabled builds, and handles known OpenMPI/ORTE fork abort case. Reports overall test summary. |

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Format | ⚠️ Warning | PR description lacks template format with explicit section headers: Feature, Reason, Result, and Issue Tracker Tickets as required by .github/pull_request_template.md. | Reformat PR description with explicit section headers: Feature/Enhancement, Reason, Result, and optional Issue Tracker Tickets per the template. |
| Description check | ❓ Inconclusive | The PR description is comprehensive but does not follow the provided template structure with required sections (Enhancement, Reason, Result, Issue Tracker Tickets). | Reorganize the description to match the template: clearly separate Enhancement/what is added, Reason/why it's needed, and Result/what behavior is achieved. The Jira ticket is provided but format differs from template. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title follows the Conventional Commits format with 'test' type and a clear description of adding an MPI library test script, which aligns with the primary changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tasks/mpi.yml`:
- Around line 478-485: The copy task "Copy OMB source to tests directory" uses
the Ansible copy module without an explicit mode causing risky-file-permissions
lint failures; update that task to set a safe explicit mode (e.g., mode: '0755'
or another appropriate octal) on the copy invocation so ownership and file
permissions are deterministic; modify the task that currently contains copy:
src/dest/owner/group to include mode: '0XXX' and ensure the chosen mode fits the
files' needs (executable vs data) to resolve the lint warning.
- Around line 6-17: The fail task that prevents building MPI with GPU support
should check the correct NCCL variable and use OR logic: change the when
condition on the fail task (the "Fail if role builds MPI with GPU support
without CUDA toolkit" task) to require hpc_build_mpi_w_nvidia_gpu_support and
(not hpc_install_cuda_toolkit or not hpc_install_hpc_nvidia_nccl) instead of
using hpc_install_nvidia_nccl and an AND; also update the fail message lines to
mention "hpc_install_hpc_nvidia_nccl: true" so the guidance shows the correct
variable name.

📥 Commits

Reviewing files that changed from the base of the PR and between 3def6ea and a23cd13.

📒 Files selected for processing (14)

defaults/main.yml
tasks/main.yml
tasks/mpi.yml
templates/hpcx-ver-pmix-ver.lua.j2
templates/hpcx-ver.lua.j2
templates/mvapich-ver.lua.j2
templates/openmpi-no-gpu-defaults.lua.j2
templates/openmpi-ver-cuda12-gpu.lua.j2
templates/test-mpi-omb.sh.j2
tests/tests_default.yml
tests/tests_include_vars_from_parent.yml
tests/tests_skip_toolkit.yml
vars/RedHat_9.yml
vars/main.yml

coderabbitai · 2026-05-21T05:22:35Z

+- name: Fail if role builds MPI with GPU support without CUDA toolkit
+  fail:
+    msg:
+      - Building MPI libraries with Nvidia GPU support requires multiple
+      - packages to be installed.
+      - You must set the following variables true
+      - "hpc_install_cuda_toolkit: true"
+      - "hpc_install_nvidia_nccl: true"
+  when:
+    - hpc_build_mpi_w_nvidia_gpu_support
+    - not hpc_install_cuda_toolkit
+    - not hpc_install_nvidia_nccl


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify variable usage consistency in MPI tasks/defaults.
rg -n "hpc_install_(hpc_)?nvidia_nccl|hpc_build_mpi_w_nvidia_gpu_support" \
  defaults/main.yml tasks/main.yml tasks/mpi.yml tests/tests_*.yml

Repository: linux-system-roles/hpc

Length of output: 1235

Fix MPI GPU prerequisite guard: wrong NCCL variable and incorrect boolean logic.

The GPU-MPI precheck in tasks/mpi.yml uses hpc_install_nvidia_nccl but the role’s variable is hpc_install_hpc_nvidia_nccl (defined in defaults/main.yml and referenced in tests). As written, the guard’s when conditions are ANDed, so it only fails when both CUDA toolkit and NCCL are missing—rather than when either prerequisite is disabled.
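The difference between the two guards can be seen by enumerating the cases. The sketch below models them as shell predicates purely for illustration (the role's actual condition is an Ansible `when:` list, whose entries are ANDed); the function names are hypothetical, and the arguments stand for the GPU build flag, the CUDA toolkit flag, and the NCCL flag.

```shell
# Arguments: build_gpu cuda_toolkit nccl, each "true" or "false".
# Original guard: every condition in the list must hold.
guard_original() {
    [ "$1" = true ] && [ "$2" = false ] && [ "$3" = false ]
}

# Suggested guard: fail when GPU builds are requested and either
# prerequisite is disabled.
guard_suggested() {
    [ "$1" = true ] && { [ "$2" = false ] || [ "$3" = false ]; }
}

# Print a truth table for the GPU-build-enabled cases.
for cuda in true false; do
    for nccl in true false; do
        orig=no; new=no
        if guard_original true "$cuda" "$nccl"; then orig=yes; fi
        if guard_suggested true "$cuda" "$nccl"; then new=yes; fi
        echo "cuda=$cuda nccl=$nccl original_fails=$orig suggested_fails=$new"
    done
done
```

The two guards disagree only in the mixed rows, where exactly one prerequisite is disabled: the original lets the play continue, while the suggested form stops it.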

coderabbitai · 2026-05-21T05:22:35Z

+        - name: Copy OMB source to tests directory
+          copy:
+            src: "{{ __hpc_pkg_extracted.path }}/"
+            remote_src: true
+            dest: "{{ __hpc_azure_omb_dir }}"
+            owner: root
+            group: root
+


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Set explicit permissions when copying OMB sources.

The copy task leaves permissions implicit, which matches the lint failure (risky-file-permissions). Set mode explicitly.

💡 Proposed fix

         - name: Copy OMB source to tests directory
           copy:
             src: "{{ __hpc_pkg_extracted.path }}/"
             remote_src: true
             dest: "{{ __hpc_azure_omb_dir }}"
             owner: root
             group: root
+            mode: "u+rwX,go+rX"
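The capital `X` in the proposed mode is what lets one setting suit a mixed source tree: it grants execute permission only to directories and to files that already carry an execute bit. The effect can be checked with plain chmod, as in the sketch below. This assumes GNU coreutils for `stat -c`, uses hypothetical file names, and assumes Ansible's symbolic modes follow the same chmod convention.

```shell
# Build a small tree: a directory, a data file, and an executable script.
demo_dir="$(mktemp -d)"
mkdir "$demo_dir/src"
printf 'data\n' > "$demo_dir/src/README"
printf '#!/bin/sh\n' > "$demo_dir/src/configure"
chmod 600 "$demo_dir/src/README"
chmod 700 "$demo_dir/src/configure" "$demo_dir/src"

# Apply the symbolic mode from the proposed fix recursively.
chmod -R u+rwX,go+rX "$demo_dir/src"

# Show the resulting octal modes (stat -c is the GNU coreutils form).
stat -c '%a %n' "$demo_dir/src" "$demo_dir/src/README" "$demo_dir/src/configure"
```

The data file should end up world-readable without becoming executable, while the directory and the script gain read and execute for group and other, which is why a symbolic mode fits a source tree better than a single octal value such as 0755.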

richm · 2026-05-21T15:38:10Z

 hpc_azure_disable_predictable_net_names: true
 hpc_install_system_openmpi: true
-hpc_build_openmpi_w_nvidia_gpu_support: true
+hpc_build_mpi_w_nvidia_gpu_support: true


Note that this changes the public API and is considered a breaking change.

If this is really necessary, then:

- the README.md should mark the old variable as deprecated, and should say to use the new one
- we should have logic to use the old variable if set

Suggested change

-hpc_build_mpi_w_nvidia_gpu_support: true
+hpc_build_mpi_w_nvidia_gpu_support: "{{ hpc_build_openmpi_w_nvidia_gpu_support | d(true) }}"

Optional: add a task to tasks/main.yml to tell the user that hpc_build_openmpi_w_nvidia_gpu_support is deprecated if it is defined, and to use hpc_build_mpi_w_nvidia_gpu_support instead.

richm · 2026-05-21T15:41:36Z

+          + __hpc_mpifileutils_build_dependencies }}
+        state: present
+        use: "{{ (__hpc_server_is_ostree | d(false)) |
+          ternary('ansible.posix.rhel_rpm_ostree', omit) }}"


Does this need a register/until? Which package installation tasks need register/until?

dgchinner added 8 commits May 20, 2026 09:24

dgchinner requested review from richm and spetrosi as code owners May 21, 2026 05:16

coderabbitai Bot reviewed May 21, 2026

richm reviewed May 21, 2026

dgchinner wants to merge 8 commits into linux-system-roles:main from dgchinner:mpi-updates-test
