test: Add MPI library test script#135

Open
dgchinner wants to merge 8 commits into
linux-system-roles:mainfrom
dgchinner:mpi-updates-test

@dgchinner dgchinner commented May 21, 2026

This series:

  • adds a new MPI library test script installed into /opt/hpc/azure/tests
  • adds the source tree for the OSU (Ohio State University) MPI test infrastructure under /opt/hpc/azure/tests/osu-micro-benchmarks
  • iterates over all the installed MPI environments, building the tests and running them on a single host
  • tests both CPU-based and CUDA/GPU-based MPI
  • runs installation, point-to-point and collective tests
  • when multiple GPUs are present in the system, also builds and runs GPU-based point-to-point and collective tests
  • currently only runs on a single host

This series is built on top of the mpi-updates branch posted in PR #134 and so will require that PR to be merged first.

$ ./test-mpi-omb.sh -g
[2026-05-20 23:35:25] ==========================================================
[2026-05-20 23:35:25] MPI Implementation Test Suite (OSU Micro-Benchmarks)
[2026-05-20 23:35:25] ==========================================================

[2026-05-20 23:35:25] Found 5 MPI module(s): mpi/hpcx-2.24.1 mpi/hpcx-2.24.1-pmix-4.2.9 mpi/openmpi-5.0.8-cuda12-gpu mpi/openmpi-x86_64 mpi/mvapich-4.0

[2026-05-20 23:35:25] ========================================
[2026-05-20 23:35:25] Testing MPI module: mpi/hpcx-2.24.1
[2026-05-20 23:35:25] ========================================

[PASS] mpi/hpcx-2.24.1: mpicc available
[PASS] mpi/hpcx-2.24.1: mpiexec available
[2026-05-20 23:35:26] Building OMB for mpi/hpcx-2.24.1 in /opt/hpc/azure/tests/osu-micro-benchmarks (CUDA/NCCL enabled)
[PASS] mpi/hpcx-2.24.1: OMB build succeeded
[2026-05-20 23:35:59] Running startup tests...
[PASS] mpi/hpcx-2.24.1: osu_hello
[PASS] mpi/hpcx-2.24.1: osu_init
[2026-05-20 23:36:00] Running omb_pt2pt with 2 processes...
[PASS] mpi/hpcx-2.24.1: omb_pt2pt
[2026-05-20 23:36:04] Running omb_coll with 8 processes...
[PASS] mpi/hpcx-2.24.1: omb_coll
[2026-05-20 23:36:35] Skipping NCCL tests: requires at least 2 GPUs (1 found)

[2026-05-20 23:36:36] ========================================
[2026-05-20 23:36:36] Testing MPI module: mpi/hpcx-2.24.1-pmix-4.2.9
[2026-05-20 23:36:36] ========================================
.....

[2026-05-20 23:42:07] ==========================================================
[2026-05-20 23:42:07] Test Summary
[2026-05-20 23:42:07] ==========================================================
[2026-05-20 23:42:07]   Passed:  35
[2026-05-20 23:42:07] ==========================================================

The tests will not run if GPU testing is requested and there are no GPUs in the system:

$ ./test-mpi-omb.sh -g
ERROR: -g requires NVidia GPUs but none were detected
$
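
A minimal sketch of the kind of pre-flight check that produces the error above. It assumes GPUs are detected by counting /dev/nvidiaN device nodes (the Lmod modules in this series test for /dev/nvidia0; how test-mpi-omb.sh itself detects GPUs is not shown in this description). The function names and the dev_dir parameter are hypothetical.

```shell
#!/bin/sh
# Count NVIDIA device nodes (nvidia0, nvidia1, ...) under a device directory.
# dev_dir is a parameter only so the check can be exercised without real GPUs.
count_gpus() (
    dev_dir=${1:-/dev}
    count=0
    for node in "$dev_dir"/nvidia[0-9]*; do
        # An unmatched glob stays literal, so confirm the entry exists.
        [ -e "$node" ] && count=$((count + 1))
    done
    echo "$count"
)

# Return 0 if the requested mode can run, 1 (with an error message) otherwise.
# enable_cuda is 1 when -g was passed; gpus is the detected GPU count.
check_gpu_request() (
    enable_cuda=$1
    gpus=$2
    if [ "$enable_cuda" -eq 1 ] && [ "$gpus" -eq 0 ]; then
        echo "ERROR: -g requires NVidia GPUs but none were detected" >&2
        exit 1
    fi
    exit 0
)
```

A caller could run `check_gpu_request "$ENABLE_CUDA" "$(count_gpus)" || exit 1` before any module is loaded, so the run stops before anything is built.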

If an MPI module fails to load (e.g. because there are no GPUs in the system), the script handles the failure gracefully by skipping that module and continuing with the remaining ones:

$ ./test-mpi-omb.sh
....
[2026-05-21 01:16:31] ========================================
[2026-05-21 01:16:31] Testing MPI module: mpi/mvapich-4.0
[2026-05-21 01:16:31] ========================================

Error: MVAPICH 4.0 was built with CUDA/UCX support and requires NVidia GPUs. This machine has no GPUs. Use a different MPI module (e.g. hpcx or openmpi). 
[2026-05-21 01:16:32] Skipping mpi/mvapich-4.0: module failed to load


[2026-05-21 01:16:32] ==========================================================
[2026-05-21 01:16:32] Test Summary
[2026-05-21 01:16:32] ==========================================================
[2026-05-21 01:16:32]   Passed:  28
[2026-05-21 01:16:32] ==========================================================
$
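
The skip-on-load-failure behaviour can be sketched as a loop that treats a failed load as non-fatal while still stopping on a genuine test failure. This is an illustration only: load_cmd and test_cmd are hypothetical hooks standing in for "module load" and the script's OMB build and run steps.

```shell
#!/bin/sh
# Run a per-module test command for every module, skipping modules whose
# load command fails instead of aborting the whole run.
# Usage: run_all_modules <load_cmd> <test_cmd> <module>...
run_all_modules() {
    load_cmd=$1
    test_cmd=$2
    shift 2
    for mod in "$@"; do
        if ! "$load_cmd" "$mod"; then
            echo "Skipping $mod: module failed to load"
            continue
        fi
        # A real test failure still stops the run.
        "$test_cmd" "$mod" || return 1
    done
}
```

Keeping the load step separate from the test step is what lets a module such as mpi/mvapich-4.0 refuse to load on a GPU-less host without affecting the pass count for the other modules.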

Issue Tracker Tickets (Jira or BZ if any): https://redhat.atlassian.net/browse/RHELHPC-118

Summary by CodeRabbit

  • New Features

    • Added support for MVAPICH MPI implementation installation.
    • Introduced mpifileutils and MPI test suite installation options.
    • Added GPU-aware MPI module defaults with automatic fallback configuration when GPUs are unavailable.
    • Introduced OSU Micro-Benchmarks test harness for validating MPI across installed modules.
  • Configuration Updates

    • Restructured GPU support configuration for more granular control over MPI components.

dgchinner added 8 commits May 20, 2026 09:24
Move all MPI-related tasks out of tasks/main.yml into a dedicated
tasks/mpi.yml file for easier navigation and maintenance. This includes
the precondition checks, OpenMPI/HPC-X/PMIx/GDRCopy build and
install, etc. This is done in preparation for adding more MPI
functionality.

The main.yml file now includes mpi.yml via include_tasks at the point
where the MPI blocks previously appeared (after RDMA packages, before
Docker).

Signed-off-by: Dave Chinner <dchinner@redhat.com>
mpifileutils provides MPI-based file utilities for parallel file operations
including tools like dcp, drm, dsync, dfind, dwalk, dcmp, and dtar. The
package is built from source using cmake with HPC-X MPI, matching the
upstream azhpc-images build process.

The build uses the same temporary directory pattern as the OpenMPI build:
download and extract to a tempdir, build in a separate tempdir, install to
the __hpc_azure_resource_dir/mpifileutils directory, then clean up both
temp directories.

A parameter check is added to ensure HPC-X MPI is available before
attempting to build mpifileutils, since HPC-X provides the MPI compilers
required for the cmake build.

The package is only installed in Azure test environments (tests_azure.yml).
All other test playbooks explicitly disable it to avoid requiring HPC-X MPI.

Changes:
- Add __hpc_mpifileutils_info to vars/RedHat_9.yml (version 0.12)
- Add __hpc_mpifileutils_build_dependencies and __hpc_mpifileutils_install_dir to vars/main.yml
- Add hpc_install_mpifileutils default (true) to defaults/main.yml
- Add parameter validation check requiring hpc_build_openmpi_w_nvidia_gpu_support
- Add download, build, and install tasks using tempdir pattern
- Add mpifileutils build deps to the build dependency cleanup task
- Disable mpifileutils in tests_default, tests_skip_toolkit, and tests_include_vars_from_parent

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: add mpifileutils package to the HPC system role. You will find the version to install in the versions.json file in the azhpc-images repository, and the way it needs to be built in components/install_mpifileutils.sh. You will install it to the __hpc_azure_resource_dir directory and use the same temporary build area construct as used for building the openmpi code.

Refinements:
- Disable mpifileutils in all non-Azure test playbooks so only tests_azure.yml installs it

Signed-off-by: Dave Chinner <dchinner@redhat.com>
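
For reference, the temporary directory pattern described in the mpifileutils commit above (extract into one tempdir, build in another, clean up both) looks roughly like this in shell. The role implements it with Ansible tasks; this is only a shell analogue, and build_cmd is a hypothetical callback.

```shell
#!/bin/sh
# Run a build step with separate source and build temp directories, then
# remove both whether or not the build succeeds.
# build_cmd is called as: build_cmd <src_dir> <build_dir>
with_build_dirs() {
    build_cmd=$1
    src_dir=$(mktemp -d) || return 1
    build_dir=$(mktemp -d) || { rm -rf "$src_dir"; return 1; }
    status=0
    "$build_cmd" "$src_dir" "$build_dir" || status=$?
    # Clean up both temp directories before reporting the build status.
    rm -rf "$src_dir" "$build_dir"
    return "$status"
}
```

Installing to the final location happens inside the callback, so nothing of value is left in either temp directory by the time they are removed.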
MVAPICH is a high-performance MPI implementation optimised for InfiniBand
and other high-speed networks. Version 4.0 is built from source using the
same temporary directory pattern as the OpenMPI build.

The build uses ./configure with --enable-g=none --enable-fast=yes flags
matching the upstream azhpc-images build process, and installs to
/opt/mvapich-<version>.

When hpc_build_mpi_w_nvidia_gpu_support is enabled, the build additionally
passes --with-ucx and --with-cuda to configure so that MVAPICH is built
with GPU-aware MPI support using the same UCX and CUDA paths as OpenMPI.

An Lmod environment module is provided in lua format, consistent with the
existing openmpi and hpcx modulefiles, allowing users to load MVAPICH via
'module load mpi/mvapich-4.0'. The module conflicts with other MPI modules
so only one can be loaded at a time. When GPU support is enabled, the
module also adds the UCX and CUDA library paths to LD_LIBRARY_PATH and
PATH, matching the openmpi-cuda module.

Changes:
- Add __hpc_mvapich_info to vars/RedHat_9.yml (version 4.0)
- Add __hpc_mvapich_install_dir to vars/main.yml
- Add hpc_install_mvapich default (true) to defaults/main.yml
- Add download, build, install, and modulefile tasks to tasks/mpi.yml
- Add mvapich-ver.lua.j2 Lmod modulefile template
- Disable hpc_install_mvapich in tests_default, tests_skip_toolkit, and tests_include_vars_from_parent
- Rename hpc_build_openmpi_w_nvidia_gpu_support to hpc_build_mpi_w_nvidia_gpu_support
  as the flag now guards GPU support for both OpenMPI and MVAPICH builds
- Conditionally pass --with-ucx and --with-cuda to MVAPICH configure when
  GPU support is enabled

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: add MVAPICH MPI library to the HPC system role. Use version 4.0 as per the reference versions.json, and the build instructions can be derived from components/install_mpis.sh. Ignore the other MPI libraries in that reference file. Add the lmod environment modules using the lua script format needed to use the MVAPICH libraries, similar to those installed by the system role for the openmpi library.

Refinements:
- Configure with --with-device=ch4:ucx to use libucx as the
  network transport instead of the built-in libfabric code.
- Add --with-ucx and --with-cuda configure flags guarded by
  hpc_build_mpi_w_nvidia_gpu_support for GPU-aware MPI support.
- Rename hpc_build_openmpi_w_nvidia_gpu_support to
  hpc_build_mpi_w_nvidia_gpu_support since it now applies to
  multiple MPI library builds.
- Add UCX and CUDA library/bin paths to the MVAPICH Lmod module
  when GPU support is enabled, matching the openmpi-cuda module.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
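
A rough sketch of how the MVAPICH configure arguments from the commit above vary with the GPU flag. The flag names and the /opt/mvapich-<version> prefix come from the commit message; the UCX and CUDA paths passed in are placeholders, and the function is a hypothetical shell analogue of what the Ansible task assembles.

```shell
#!/bin/sh
# Print the configure arguments for an MVAPICH build.
# Usage: mvapich_configure_args <version> <gpu: 0|1> <ucx_dir> <cuda_dir>
mvapich_configure_args() (
    version=$1
    gpu=$2
    ucx_dir=$3
    cuda_dir=$4
    args="--prefix=/opt/mvapich-$version --enable-g=none --enable-fast=yes --with-device=ch4:ucx"
    # GPU-aware builds additionally point configure at UCX and CUDA.
    if [ "$gpu" -eq 1 ]; then
        args="$args --with-ucx=$ucx_dir --with-cuda=$cuda_dir"
    fi
    echo "$args"
)
```

The output is a plain string, so it would be passed to configure unquoted (`./configure $(mvapich_configure_args ...)`), which is only safe as long as the paths contain no whitespace.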
Separate the lmod environment module file installation from the MPI
library build tasks into standalone task blocks. This allows modulefile
changes to be deployed by re-running the playbook without triggering
a rebuild of the MPI libraries, which significantly speeds up the
iterative development and testing of lmod configuration changes.

The OpenMPI-based module files (PMIx, HPC-X, HPC-X+PMIx, OpenMPI,
and the no-GPU defaults helper) are grouped under a single block
gated by hpc_build_mpi_w_nvidia_gpu_support. The MVAPICH module file
has its own block gated by hpc_install_mvapich. Both blocks ensure
the target directories exist before installing files. The template
and copy modules are idempotent so these tasks are safe to run on
every playbook invocation.

Changes:
- Remove PMIx modulefile install from the PMIx build block
- Remove MPI module directory creation and HPC-X/OpenMPI/no-GPU helper
  installs from the GPU MPI build block
- Remove MVAPICH module directory creation and modulefile install from
  the MVAPICH build block
- Add new "Install OpenMPI-based lmod environment module files" block
- Add new "Install MVAPICH lmod environment module file" block

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: having to rebuild the mpi libraries to install and test changes to the lmod configuration takes a long time. extract the lmod configuration file installation from each of the MPI library installs, and implement a single task that installs all of the individual lmod files. trigger the installation of the files if any of the MPI libraries is rebuilt, or if the /usr/share/modulefiles/mpi is missing. install the individual files according to the installation parameters for each of the MPI libraries that already exist.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
MPI libraries built with CUDA/GPU acceleration use UCX-based transports
that cause warnings or failures on machines without GPUs. This adds
runtime GPU detection to the lmod environment modules so that when no
NVidia GPUs are present, the GPU transports are automatically disabled.

For OpenMPI-derived libraries (OpenMPI, HPC-X), a shared Jinja include
fragment (openmpi-no-gpu-defaults.lua.j2) checks for /dev/nvidia0 and
sets OMPI_MCA environment variables to exclude ucx, smcuda, ucc, cuda,
and hcoll transports. The fragment is inlined into each module file at
template rendering time via {% include %}.

For MVAPICH (when built with GPU support), the module refuses to load
on machines without GPUs. MVAPICH hard-codes HPC-X UCX library paths
into libmpi.so at build time so it cannot fall back to system UCX.
The module issues an LmodError directing users to an alternative MPI
module instead.

Changes:
- Add templates/openmpi-no-gpu-defaults.lua.j2 shared GPU detection fragment
- Add {% include %} to openmpi-ver-cuda12-gpu.lua.j2
- Add {% include %} to hpcx-ver.lua.j2
- Add {% include %} to hpcx-ver-pmix-ver.lua.j2
- Add LmodError to mvapich-ver.lua.j2 to refuse loading on non-GPU machines

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: the MPI libraries that are optimised for CUDA and GPU acceleration need different option sets to run on machines without GPUs. All the OpenMPI derived libraries require mpirun/mpiexec to have "--mca pml ^ucx --mca btl ^smcuda --mca osc ^ucx --mca coll ^ucc,cuda,hcoll" to turn off all the underlying UCX-based GPU accelerations. MVAPICH will require a different set of parameters as it passes environment and config variables in a different manner. These need to be set up in the lmod environment modules for each MPI library. If the system does not have any GPUs in it, they should set up the default mpirun/exec environment to use these "avoid using cuda/GPU transports" mechanisms automatically.

Refinements:
- Use Jinja {% include %} to inline GPU detection at deploy time
- MVAPICH refuses to load on non-GPU machines via LmodError because it
  hard-codes HPC-X UCX paths into libmpi.so at build time

Signed-off-by: Dave Chinner <dchinner@redhat.com>
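
The no-GPU defaults from the commit above are implemented in the Lua module files. A shell analogue, shown here only to make the mapping explicit, sets one OMPI_MCA_<name> variable per "--mca <name> <value>" pair from the prompt. The gpu_node parameter and the function name are hypothetical.

```shell
#!/bin/sh
# If the first NVIDIA device node is missing, exclude the UCX/CUDA-based
# Open MPI components through OMPI_MCA_* environment variables.
# gpu_node is a parameter only so the logic can be tested without a GPU.
set_openmpi_no_gpu_defaults() {
    gpu_node=${1:-/dev/nvidia0}
    if [ ! -e "$gpu_node" ]; then
        export OMPI_MCA_pml='^ucx'             # --mca pml ^ucx
        export OMPI_MCA_btl='^smcuda'          # --mca btl ^smcuda
        export OMPI_MCA_osc='^ucx'             # --mca osc ^ucx
        export OMPI_MCA_coll='^ucc,cuda,hcoll' # --mca coll ^ucc,cuda,hcoll
    fi
}
```

Because the values live in the environment, mpirun and mpiexec pick them up without any extra command-line options, which is what lets the same test script run unchanged on GPU and non-GPU hosts.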
Add the OSU Micro-Benchmarks (OMB) package to the system role as an MPI
implementation validation test suite. The role downloads and extracts the
OMB source into the azure tests directory, and installs a test script that
discovers all installed MPI modules via Lmod, builds OMB against each one,
and runs a set of single-host MPI tests covering startup, point-to-point,
and collective operations.

If a module fails to load (e.g. mvapich on a non-GPU machine), the test
script skips that module and continues testing the remaining modules
rather than failing the entire test suite.

The test script is designed to fail fast on the first error, leaving the
build artifacts in /tmp/omb-builds/ for debugging. Startup tests run with
1 process, point-to-point tests with 2, and collective tests with nproc.

Changes:
- Add __hpc_omb_info to vars/RedHat_9.yml with OMB 8.0b2 URL and checksum
- Add __hpc_azure_omb_dir to vars/main.yml for the OMB source location
- Add hpc_install_mpi_tests default variable (true)
- Add tasks to download, extract OMB and install the test script
- Add test-mpi-omb.sh.j2 template for MPI validation
- Disable mpi_tests in CI test configurations

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: start building a MPI implementation test suite. We will start with the OSU microbenchmark package, downloading it from https://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-8.0b2.tar.gz, calculating the sha256sum and then adding it to the system role. The system role will unpack it into the azure tests directory, and from there we will write a test script that iterates all the installed mpi libraries (via module loading) to build and run a set of tests from the OMB suite. Initially the test script will focus on running the tests on a single host, running tests on a cluster via a scheduler is a future modification. The test script will also begin by focussing on the MPI tests in the suite, more expansive functional testing is a future modification.

Refinements:
- Use ml -t spider mpi/ for module discovery instead of filesystem scanning
- Remove Lmod init - rely on user shell environment already having modules loaded
- fail() exits immediately to leave a debuggable corpse
- Set np per test category: 1 for startup, 2 for pt2pt, nproc for collective
- Do not use --allow-run-as-root as tests should run as a regular user
- Skip modules that fail to load instead of aborting the test suite

Signed-off-by: Dave Chinner <dchinner@redhat.com>
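
The process-count rule and the fail-fast behaviour from the commit above can be sketched as two small helpers. The category names and the getconf fallback are assumptions made for this illustration; only the 1 / 2 / nproc mapping and the "exit immediately" behaviour come from the commit message.

```shell
#!/bin/sh
# Map an OMB test category to the number of MPI processes to launch.
np_for_category() (
    case $1 in
        startup) echo 1 ;;
        pt2pt)   echo 2 ;;
        coll)    nproc 2>/dev/null || getconf _NPROCESSORS_ONLN ;;
        *)       echo "unknown test category: $1" >&2; exit 1 ;;
    esac
)

# Report a failure and stop immediately, leaving the build tree in place
# so the failing configuration can be inspected.
fail() {
    echo "[FAIL] $*" >&2
    exit 1
}
```

A test runner would then launch each category with something like `mpiexec -n "$(np_for_category pt2pt)" ...` and call fail on the first non-zero exit status.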
Add a -g CLI flag to test-mpi-omb.sh that builds the OSU Micro-Benchmarks
with CUDA GPU support enabled. The CUDA configure flags are only available
when hpc_build_mpi_w_nvidia_gpu_support was set during deployment; if -g
is passed but GPU support was not built, the script exits with an error.

Changes:
- Add ENABLE_CUDA variable and -g option to getopts parsing
- Conditionally pass --enable-cuda and --with-cuda to OMB configure
- Use Jinja2 conditional to gate CUDA paths on hpc_build_mpi_w_nvidia_gpu_support
- Error out if -g is used but MPI was not built with GPU support

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: add a CLI parameter to the MPI test script that builds the test code with CUDA and GPU functionality enabled.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
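
A minimal sketch of the getopts handling for the -g option described above. ENABLE_CUDA is the variable named in the commit; the -v option appears in the review walkthrough, while the VERBOSE name and the usage text are hypothetical.

```shell
#!/bin/sh
# Parse -g (build and test with CUDA/GPU support) and -v (verbose output).
parse_args() {
    ENABLE_CUDA=0
    VERBOSE=0
    OPTIND=1    # reset so the function can be called more than once
    while getopts "gv" opt; do
        case $opt in
            g) ENABLE_CUDA=1 ;;
            v) VERBOSE=1 ;;
            *) echo "usage: test-mpi-omb.sh [-g] [-v]" >&2; return 2 ;;
        esac
    done
}
```

The script would call `parse_args "$@"` early on and then refuse to continue if ENABLE_CUDA is 1 but the role was deployed without GPU support.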
When the -g flag is passed, extend the test suite to exercise NCCL
functionality via the OMB xccl benchmarks. The NCCL tests run standalone
pt2pt benchmarks (latency, bandwidth, bidirectional bandwidth) and
collective benchmarks (allreduce, allgather, bcast, reduce, reduce_scatter,
alltoall) which exercise the NCCL communication library directly.

The OMB configure is extended with --enable-ncclomb to build the NCCL
benchmark binaries when CUDA support is enabled.

Includes a workaround for an upstream OMB 8.0b2 bug where the xccl
Makefile.am files are missing omb_color.c from the UTILITIES list,
causing link failures. The fix patches the Makefile.am files and runs
autoreconf before configure.

The autotools packages (autoconf, automake, libtool) are moved from
__hpc_openmpi_build_dependencies to a new __hpc_mpi_packages list so
they persist after the build phase and are available for the autoreconf
workaround at test time.

Changes:
- Add --enable-ncclomb to CUDA configure flags
- Add NCCL xccl pt2pt tests (latency, bw, bibw)
- Add NCCL xccl collective tests (allreduce, allgather, bcast, reduce,
  reduce_scatter, alltoall)
- Workaround OMB 8.0b2 xccl link failure by adding omb_color.c to
  UTILITIES in Makefile.am and running autoreconf before configure
- Move autoconf/automake/libtool from __hpc_openmpi_build_dependencies
  to __hpc_mpi_packages so they are not removed after building

Created-by-AI: Claude Opus 4.6 (1M context)

Prompt: new modification: extend the MPI test script to cover CUDA, GPU and NCCL related functionality provided by the OMB suite.

Refinements:
- Workaround upstream OMB 8.0b2 bug where xccl Makefile.am files
  are missing omb_color.c from the UTILITIES list.
- Move autotools packages to persistent __hpc_mpi_packages list.
- Remove MPI launcher GPU memory tests (-d cuda D D) as the launcher
  does not support per-benchmark GPU memory placement options.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
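
The Makefile.am workaround described above could be made idempotent along the following lines. This is a sketch under stated assumptions, not the patch used in the test script: it assumes UTILITIES is assigned on a single line, and the path passed for omb_color.c is a placeholder.

```shell
#!/bin/sh
# Append a source file to a single-line "UTILITIES = ..." assignment in a
# Makefile.am, unless omb_color.c is already mentioned in the file.
# Usage: add_omb_color <Makefile.am> <path to omb_color.c>
add_omb_color() {
    makefile_am=$1
    color_src=$2
    # Already patched (or never affected): leave the file untouched.
    if grep -q 'omb_color\.c' "$makefile_am"; then
        return 0
    fi
    tmp=$(mktemp) || return 1
    # Write via a temp file, then copy back so the file mode is preserved.
    sed "s|^\(UTILITIES[[:space:]]*=.*\)\$|\1 $color_src|" "$makefile_am" > "$tmp" &&
        cat "$tmp" > "$makefile_am"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

After patching every affected Makefile.am, autoreconf has to be run before configure so the generated Makefile.in files pick up the change, which is why the autotools packages need to remain installed.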
@dgchinner dgchinner requested review from richm and spetrosi as code owners May 21, 2026 05:16

coderabbitai Bot commented May 21, 2026

📝 Walkthrough

This PR refactors the HPC role's MPI installation system from a monolithic OpenMPI workflow into a modular multi-MPI architecture supporting OpenMPI, MVAPICH, and mpifileutils with optional GPU/HPC-X builds, complemented by Lmod environment modules and a comprehensive OSU Micro-Benchmarks testing harness.

Changes

MPI Installation and Testing System Refactor

| Layer | File(s) | Summary |
| --- | --- | --- |
| Configuration defaults and internal variables | defaults/main.yml, vars/main.yml, vars/RedHat_9.yml, tests/tests_*.yml | Public defaults now expose granular MPI options (hpc_build_mpi_w_nvidia_gpu_support, hpc_install_mvapich, hpc_install_mpifileutils, hpc_install_mpi_tests) replacing the prior OpenMPI-only GPU flag. Internal variables define build dependencies, installation paths, and package metadata for MVAPICH, mpifileutils, and OSU Micro-Benchmarks. Test configurations override defaults appropriately. |
| MPI library build and installation pipeline | tasks/main.yml, tasks/mpi.yml | Removes inline OpenMPI workflow and delegates to tasks/mpi.yml. New pipeline validates GPU/NCCL prerequisites and mpifileutils constraints, installs base OpenMPI packages, orchestrates GPU/HPC-X builds including PMIx, GDRCopy, and HPC-X rebuild with CUDA support, conditionally builds mpifileutils against HPC-X, builds MVAPICH with GPU flags when enabled, and cleans temporary sources. |
| Lmod module environment configuration | templates/openmpi-no-gpu-defaults.lua.j2, templates/hpcx-ver.lua.j2, templates/hpcx-ver-pmix-ver.lua.j2, templates/openmpi-ver-cuda12-gpu.lua.j2, templates/mvapich-ver.lua.j2 | Shared openmpi-no-gpu-defaults template detects NVIDIA GPU and disables GPU/UCX transports on GPU-less systems. HPC-X, PMIx, and CUDA OpenMPI variants include the default settings. MVAPICH module enforces GPU presence check when built with CUDA/UCX support and sets transport paths and MPI_* environment variables. |
| OSU Micro-Benchmarks testing harness | tasks/mpi.yml, templates/test-mpi-omb.sh.j2 | Task downloads OMB sources and deploys templated test script. Script parses -g (GPU) and -v (verbose) flags, discovers MPI modules, and executes per-module testing: builds OMB, runs startup/point-to-point/collective benchmarks, conditionally runs NCCL benchmarks on GPU-enabled builds, and handles known OpenMPI/ORTE fork abort case. Reports overall test summary. |

🚥 Pre-merge checks | ✅ 4 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Description Format | ⚠️ Warning | PR description lacks template format with explicit section headers: Feature, Reason, Result, and Issue Tracker Tickets as required by .github/pull_request_template.md. | Reformat PR description with explicit section headers: Feature/Enhancement, Reason, Result, and optional Issue Tracker Tickets per the template. |
| Description check | ❓ Inconclusive | The PR description is comprehensive but does not follow the provided template structure with required sections (Enhancement, Reason, Result, Issue Tracker Tickets). | Reorganize the description to match the template: clearly separate Enhancement/what is added, Reason/why it's needed, and Result/what behavior is achieved. The Jira ticket is provided but format differs from template. |

✅ Passed checks (4 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The PR title follows the Conventional Commits format with 'test' type and a clear description of adding an MPI library test script, which aligns with the primary changes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tasks/mpi.yml`:
- Around line 478-485: The copy task "Copy OMB source to tests directory" uses
the Ansible copy module without an explicit mode causing risky-file-permissions
lint failures; update that task to set a safe explicit mode (e.g., mode: '0755'
or another appropriate octal) on the copy invocation so ownership and file
permissions are deterministic; modify the task that currently contains copy:
src/dest/owner/group to include mode: '0XXX' and ensure the chosen mode fits the
files' needs (executable vs data) to resolve the lint warning.
- Around line 6-17: The fail task that prevents building MPI with GPU support
should check the correct NCCL variable and use OR logic: change the when
condition on the fail task (the "Fail if role builds MPI with GPU support
without CUDA toolkit" task) to require hpc_build_mpi_w_nvidia_gpu_support and
(not hpc_install_cuda_toolkit or not hpc_install_hpc_nvidia_nccl) instead of
using hpc_install_nvidia_nccl and an AND; also update the fail message lines to
mention "hpc_install_hpc_nvidia_nccl: true" so the guidance shows the correct
variable name.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 0e961e12-6582-4f7c-a0be-a14efddf25c8

📥 Commits

Reviewing files that changed from the base of the PR and between 3def6ea and a23cd13.

📒 Files selected for processing (14)
  • defaults/main.yml
  • tasks/main.yml
  • tasks/mpi.yml
  • templates/hpcx-ver-pmix-ver.lua.j2
  • templates/hpcx-ver.lua.j2
  • templates/mvapich-ver.lua.j2
  • templates/openmpi-no-gpu-defaults.lua.j2
  • templates/openmpi-ver-cuda12-gpu.lua.j2
  • templates/test-mpi-omb.sh.j2
  • tests/tests_default.yml
  • tests/tests_include_vars_from_parent.yml
  • tests/tests_skip_toolkit.yml
  • vars/RedHat_9.yml
  • vars/main.yml

Comment thread tasks/mpi.yml
Comment on lines +6 to +17
- name: Fail if role builds MPI with GPU support without CUDA toolkit
  fail:
    msg:
      - Building MPI libraries with Nvidia GPU support requires multiple
      - packages to be installed.
      - You must set the following variables true
      - "hpc_install_cuda_toolkit: true"
      - "hpc_install_nvidia_nccl: true"
  when:
    - hpc_build_mpi_w_nvidia_gpu_support
    - not hpc_install_cuda_toolkit
    - not hpc_install_nvidia_nccl

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify variable usage consistency in MPI tasks/defaults.
rg -n "hpc_install_(hpc_)?nvidia_nccl|hpc_build_mpi_w_nvidia_gpu_support" \
  defaults/main.yml tasks/main.yml tasks/mpi.yml tests/tests_*.yml

Repository: linux-system-roles/hpc

Length of output: 1235


Fix MPI GPU prerequisite guard: wrong NCCL variable and incorrect boolean logic.

The GPU-MPI precheck in tasks/mpi.yml uses hpc_install_nvidia_nccl but the role’s variable is hpc_install_hpc_nvidia_nccl (defined in defaults/main.yml and referenced in tests). As written, the guard’s when conditions are ANDed, so it only fails when both CUDA toolkit and NCCL are missing—rather than when either prerequisite is disabled.
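
To make the logic difference concrete, here is a small shell model of the two guards, using 1 and 0 for the role's boolean variables. It is only an illustration of the truth table, not the Ansible change itself; the function names are hypothetical.

```shell
#!/bin/sh
# Arguments: <gpu_support> <cuda_toolkit> <nccl>, each 1 (true) or 0 (false).

# Guard as written: every "when" entry must hold, so the task fails only
# when BOTH prerequisites are disabled.
guard_as_written() {
    gpu=$1
    cuda=$2
    nccl=$3
    [ "$gpu" -eq 1 ] && [ "$cuda" -eq 0 ] && [ "$nccl" -eq 0 ]
}

# Intended guard: fail when EITHER prerequisite is disabled.
guard_intended() {
    gpu=$1
    cuda=$2
    nccl=$3
    [ "$gpu" -eq 1 ] && { [ "$cuda" -eq 0 ] || [ "$nccl" -eq 0 ]; }
}
```

With GPU support enabled, the CUDA toolkit installed and NCCL disabled (1 1 0), guard_as_written returns non-zero and the misconfiguration goes unnoticed, whereas guard_intended returns zero and the fail task would trigger.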

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tasks/mpi.yml` around lines 6 - 17, The fail task that prevents building MPI
with GPU support should check the correct NCCL variable and use OR logic: change
the when condition on the fail task (the "Fail if role builds MPI with GPU
support without CUDA toolkit" task) to require
hpc_build_mpi_w_nvidia_gpu_support and (not hpc_install_cuda_toolkit or not
hpc_install_hpc_nvidia_nccl) instead of using hpc_install_nvidia_nccl and an
AND; also update the fail message lines to mention "hpc_install_hpc_nvidia_nccl:
true" so the guidance shows the correct variable name.

Comment thread tasks/mpi.yml
Comment on lines +478 to +485
- name: Copy OMB source to tests directory
  copy:
    src: "{{ __hpc_pkg_extracted.path }}/"
    remote_src: true
    dest: "{{ __hpc_azure_omb_dir }}"
    owner: root
    group: root

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Set explicit permissions when copying OMB sources.

The copy task leaves permissions implicit, which matches the lint failure (risky-file-permissions). Set mode explicitly.

💡 Proposed fix
         - name: Copy OMB source to tests directory
           copy:
             src: "{{ __hpc_pkg_extracted.path }}/"
             remote_src: true
             dest: "{{ __hpc_azure_omb_dir }}"
             owner: root
             group: root
+            mode: "u+rwX,go+rX"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tasks/mpi.yml` around lines 478 - 485, The copy task "Copy OMB source to
tests directory" uses the Ansible copy module without an explicit mode causing
risky-file-permissions lint failures; update that task to set a safe explicit
mode (e.g., mode: '0755' or another appropriate octal) on the copy invocation so
ownership and file permissions are deterministic; modify the task that currently
contains copy: src/dest/owner/group to include mode: '0XXX' and ensure the
chosen mode fits the files' needs (executable vs data) to resolve the lint
warning.

Comment thread defaults/main.yml
  hpc_azure_disable_predictable_net_names: true
  hpc_install_system_openmpi: true
- hpc_build_openmpi_w_nvidia_gpu_support: true
+ hpc_build_mpi_w_nvidia_gpu_support: true

Note that this changes the public API and is considered a breaking change.

If this is really necessary, then

  • the README.md should mark the old variable as deprecated, and should say to use the new one
  • we should have logic to use the old variable if it is set

Suggested change
- hpc_build_mpi_w_nvidia_gpu_support: true
+ hpc_build_mpi_w_nvidia_gpu_support: "{{ hpc_build_openmpi_w_nvidia_gpu_support | d(true) }}"

Optional: add a task to tasks/main.yml that tells the user hpc_build_openmpi_w_nvidia_gpu_support is deprecated if it is defined, and to use hpc_build_mpi_w_nvidia_gpu_support instead.

Comment thread tasks/mpi.yml
+ __hpc_mpifileutils_build_dependencies }}
state: present
use: "{{ (__hpc_server_is_ostree | d(false)) |
ternary('ansible.posix.rhel_rpm_ostree', omit) }}"

Does this need a register/until? Which package installation tasks need register/until?
