Since the recent cross-repo build-configuration work (making ZEST consumable as a library so Paul's inversion code can depend on it), building ZEST against the slatergpu git-main dependency has become unreliable for multiple group members. The consistent workaround is to point ZEST at a local path SlaterGPU instead of git-main. The underlying errors observed so far are not identical, so this issue separates the shared symptom from the one fully-traced facet (an MPI configure failure) and lists candidate fixes + a diagnostic plan.
Status of confidence: the MPI facet (Zoe) is reproduced-as-a-trace and statically analyzed. The umbrella "why builds regressed" is a hypothesis (cross-repo config/lockfile desync) — not yet confirmed. This issue is intended partly as a reference for an upcoming discussion with Paul.
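Shared symptom (reliable across reporters): with slatergpu = { git = ..., branch = "main" } the build fails; with slatergpu = { path = "../SlaterGPU" } it succeeds.
Errors reported so far:
Zoe: find_package(MPI REQUIRED) configure failure (trace below).
Srijita: recalled a runtime libmkl_intel_lp64.so.1: cannot open shared object file error when running HF; it self-resolved after git pull (both repos) + pixi clean + rebuild, and git-main now works again.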
The differing errors (MPI configure vs MKL runtime) over a common symptom suggest stale/desynchronized resolved state across the two repos as the umbrella, with different downstream manifestations — consistent with Srijita's "clean pull + pixi clean + rebuild fixed it."
Confirmed facet: find_package(MPI REQUIRED) failure (Zoe)
Building slatergpu (standalone from '.' and as ZEST's git-source dep) fails during CMake configure:
-- The CXX compiler identification is NVHPC 25.5.0 ← compiler IS found
-- USE_ACC: ON
-- COMPILE_CINTW_4C: OFF
-- Could NOT find MPI_C (missing: MPI_C_WORKS)
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
CMake Error: Could NOT find MPI (missing: MPI_C_FOUND MPI_CXX_FOUND)
CMakeLists.txt:51 (find_package)
-- Configuring incomplete, errors occurred!
The NVHPC compilers themselves are detected fine; only MPI discovery fails.
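Because different reporters hit different errors, it may help to sort captured logs by the signatures quoted in this issue. The following is a minimal sketch: the function name and the labels are hypothetical, and the three patterns are only the messages recorded here, so anything else falls through to "unclassified".

```shell
#!/bin/sh
# Classify a captured build or run log read from stdin.
# The labels are hypothetical names for the failure modes described in this issue.
classify_failure() {
    log=$(cat)
    case "$log" in
        *"Could NOT find MPI"*)
            echo "mpi-configure" ;;
        *"libmkl_intel_lp64.so.1: cannot open shared object file"*)
            echo "mkl-runtime" ;;
        *"undefined reference to"*"expmat"*)
            echo "stale-cache-link" ;;
        *)
            echo "unclassified" ;;
    esac
}
```

Example use, assuming the log was saved to a file named build.log: classify_failure < build.log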
Evidence: MPI is currently unused by SlaterGPU
No #include <mpi.h> and zero MPI_* calls anywhere in the SlaterGPU source.
Built libSlaterGPU.a has 0 undefined MPI symbols.
No CMake target links MPI::MPI; nothing reads any MPI_* variable; sgpu.exe links SlaterGPU io cintw + Fortran/LAPACK/GPU libs only.
So the MPI requirement is vestigial, appearing in two places:
CMakeLists.txt:51 — find_package(MPI REQUIRED) (build-time; gated by a hard-coded USE_MPI=True).
SlaterGPUConfig.cmake.in:7 — find_dependency(MPI REQUIRED) (re-imposed on consumers, alongside OpenMP/BLAS/LAPACK, which are used).
(ZEST is the layer that actually uses MPI — MPI_Send/Recv/Barrier/COMM_WORLD/... — and finds its own MPI via its own find_package(MPI) and a declared openmpi = "5.*".)
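The "MPI is unused" evidence above can be re-checked on any checkout with a couple of standard tools. This is a rough sketch, not the exact commands used in the original investigation: the function names are hypothetical, the file-extension list is an assumption about the source layout, and the symbol pattern assumes GNU-style nm -u output.

```shell
#!/bin/sh
# Count source lines that include <mpi.h> or call an MPI_* function.
# $1: root of the source tree to scan (for example a SlaterGPU checkout).
count_mpi_source_refs() {
    find "$1" -type f \( -name '*.c' -o -name '*.cpp' -o -name '*.cu' \
        -o -name '*.h' -o -name '*.hpp' \) \
        -exec grep -E '#[[:space:]]*include[[:space:]]*<mpi\.h>|MPI_[A-Za-z_]+[[:space:]]*\(' {} + \
        | wc -l | tr -d ' '
}

# Count undefined MPI symbols in `nm -u` output read from stdin.
# Matches MPI_*, PMPI_* and OpenMPI's ompi_* symbols.
count_undefined_mpi_symbols() {
    grep -cE '(^|[[:space:]])U[[:space:]]+_?(P?MPI_|ompi_)' || true
}
```

Example use: count_mpi_source_refs ../SlaterGPU, and nm -u libSlaterGPU.a | count_undefined_mpi_symbols. Both printing 0 would match the evidence listed above. Note that the source scan also counts matches inside comments.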
Reproduction attempts (Perlmutter) — could NOT reproduce
A cold standalone SlaterGPU configure on Perlmutter succeeds: find_package(MPI) resolves to the NVHPC-bundled HPC-X OpenMPI:
-- Found MPI_C: <nvhpc>/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/lib/libmpi.so (found version "3.1")
So the failure is specific to environments where that bundled MPI is not visible in the build sandbox. This is a discovery asymmetry, not a reliability problem: the compilers come in via a hard-coded nvc++ on PATH plus flags, whereas MPI is the only dependency routed through find_package into a separate comm_libs/.../ompi tree that needs its own environment activation.
Notes from the Perlmutter session (caveats for whoever picks this up):
A stale cached slatergpu.conda masked the issue initially; even after pixi clean cache --build the package was reused. Forcing a truly cold configure required wiping the per-build CMake work dir.
Earlier ZEST link failures (undefined reference to expmat(...cusolverDnContext*...)) were a separate stale-cache artifact (a cached lib built with void* handle signatures vs current headers' cusolverDnContext*), resolved by a fresh build — not part of the MPI issue, but another symptom of cache staleness.
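The notes above say a truly cold configure required wiping the per-build CMake work dir, but they do not record where that directory lives. As a hedged sketch, the helper below only lists directories containing a CMakeCache.txt under a root you choose; the function name is hypothetical, and which root to scan (for instance the checkout's .pixi directory) is an assumption to verify locally.

```shell
#!/bin/sh
# List directories under $1 that hold a CMakeCache.txt, i.e. configured CMake build trees.
# Dry run only: prints candidates and deletes nothing.
list_cmake_work_dirs() {
    find "$1" -type f -name CMakeCache.txt -exec dirname {} \; | sort
}
```

Review the printed directories before removing any of them by hand; a listing step keeps the cleanup from touching unrelated build trees.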
Why it likely regressed recently (hypotheses, unconfirmed)
1. Cross-repo config / lockfile desync during the zest-as-library refactor (maintainer's leading theory): build configuration is being synchronized across repos and the pixi lockfile is meant to enforce it, so a sync gap would explain the intermittent, heterogeneous failures and why "clean pull + pixi clean + rebuild" fixes them.
2. A pixi-build-cmake backend update (pinned >=0.3.6,<=0.3.8, currently resolving 0.3.8) that tightened build-sandbox isolation, cutting off the host/NVHPC-MPI environment leak that previously satisfied the probe.
Ruled out: a change in SlaterGPU's pixi.toml deps. cmake=4.0.3.* (Sep 2025), libcint (Mar 2025), and the backend pin (Jan–Feb 2026) all predate the reports.
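One way to probe the lockfile-desync hypothesis is to compare what a failing and a working checkout actually resolved for slatergpu. The sketch below is deliberately format-agnostic: it does not parse pixi.lock, it only diffs the lines that mention a package name. The function names and file paths are hypothetical.

```shell
#!/bin/sh
# Print the lines of a lockfile that mention a package, trimmed, sorted and de-duplicated.
# $1: lockfile path, $2: package name (case-insensitive substring match).
lock_entries() {
    grep -i -- "$2" "$1" | sed 's/^[[:space:]]*//' | sort -u
}

# Compare the entries for one package across two lockfiles.
# $1: first lockfile, $2: second lockfile, $3: package name.
# Prints a unified diff; returns 0 if the entries match, 1 if they differ.
compare_lock_entries() {
    a=$(mktemp) && b=$(mktemp) || return 2
    lock_entries "$1" "$3" > "$a"
    lock_entries "$2" "$3" > "$b"
    status=0
    diff -u "$a" "$b" || status=$?
    rm -f "$a" "$b"
    return "$status"
}
```

Example use, with hypothetical paths: compare_lock_entries working/pixi.lock failing/pixi.lock slatergpu. A non-empty diff would support the desync hypothesis; an empty one would point elsewhere.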
Candidate fixes (neutral — pending Paul's input on MPI intent)
A. Remove / relax the unused MPI requirement (find_package(MPI) non-REQUIRED or dropped at CMakeLists.txt:51, and find_dependency(MPI REQUIRED) at SlaterGPUConfig.cmake.in:7). Lowest risk iff SlaterGPU is not intended to use MPI in the future. Needs Paul's confirmation of intent.
B. Make the NVHPC-bundled MPI reliably discoverable in the sandbox (e.g. set MPI_HOME/MPI_C_COMPILER to the NVHPC comm_libs path). Preserves NVHPC/CUDA-aware MPI pairing; adds machinery for a currently-unused dep.
C. Declare an MPI (e.g. openmpi) in SlaterGPU's deps so the build is hermetic. Verified on Perlmutter to move discovery into $PREFIX and build end-to-end, but it introduces a conda-vs-NVHPC MPI pairing question (CUDA-aware MPI), and would ideally be exposed via run-exports (#89, "Declare run-exports to consolidate downstream consumers' duplicate pixi.toml entries") rather than duplicated in every consumer's pixi.toml.
Design note for the Paul discussion: if SlaterGPU will do its own (CUDA-aware) MPI, prefer B (or C paired to NVHPC's CUDA-aware MPI) and handle propagation to consumers via the #89 run-exports. If not, A is cleanest.
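For option B, a possible starting point is to locate the NVHPC-bundled OpenMPI with the same comm_libs/*/hpcx/*/ompi layout seen on Perlmutter and turn it into FindMPI hints. This is a sketch under assumptions: the function names are hypothetical, the layout may differ on other NVHPC installs, and how these flags would be passed through the pixi build backend is not addressed here.

```shell
#!/bin/sh
# Print the first comm_libs/*/hpcx/*/ompi directory under $1 that ships bin/mpicc.
# $1: NVHPC version root (the directory that contains comm_libs).
find_nvhpc_ompi() {
    for dir in "$1"/comm_libs/*/hpcx/*/ompi; do
        if [ -x "$dir/bin/mpicc" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Turn that directory into FindMPI hints for a CMake configure line.
mpi_cmake_hints() {
    ompi=$(find_nvhpc_ompi "$1") || return 1
    echo "-DMPI_HOME=$ompi -DMPI_C_COMPILER=$ompi/bin/mpicc -DMPI_CXX_COMPILER=$ompi/bin/mpicxx"
}
```

MPI_HOME and MPI_<lang>_COMPILER are documented FindMPI inputs. If find_nvhpc_ompi returns non-zero on a failing machine, that alone would show the bundled MPI is absent from that install rather than merely hidden from the sandbox.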
Open questions
Does Paul intend SlaterGPU to use MPI directly (e.g. multi-GPU/multi-rank) in the future? (Determines A vs B/C.)
Is the umbrella truly cross-repo lockfile desync? What exactly did Srijita's pixi clean evict that fixed git-main?
Is Bishnu's Perlmutter failure the MPI facet or another (e.g. MKL/link)? His actual error was not captured.
Diagnostic plan for a failing environment (athena / Bishnu / a fresh student checkout)
Run where the failure is actually visible and capture:
CMakeFiles/CMakeError.log + CMakeOutput.log from the failing slatergpu build (the real MPI try-compile failure, not just "not found").
In the build sandbox: which nvc nvc++ mpicc mpicxx; echo $PATH; the values of $NVHPC and $OPAL_PREFIX; and whether <nvhpc>/.../comm_libs/*/hpcx/*/ompi/{bin,lib} exists at all on that install.
pixi --version and the resolved pixi-build-cmake version (tests hypothesis 2).
Temporarily comment out CMakeLists.txt:51; check whether find_package(OpenMP) / OpenACC also fail (whole NVHPC environment missing vs MPI-specific).
How NVHPC is activated (module load vs manual export) and whether that activation reaches the pixi build subprocess.
Compare the resolved pixi.lock (slatergpu rev + build deps) on a failing vs working checkout — tests hypothesis 1.
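The environment facts in the plan above could be gathered in one pass with something like the following. It is a sketch with hypothetical function names; it reports only what the current shell sees, so for the sandbox item it would need to run inside the pixi build subprocess, and how to inject it there is left open.

```shell
#!/bin/sh
# Report where a tool resolves on PATH, or flag it as missing.
report_tool() {
    if resolved=$(command -v "$1" 2>/dev/null); then
        echo "$1: $resolved"
    else
        echo "$1: NOT FOUND"
    fi
}

# Print the toolchain facts requested in the diagnostic plan.
capture_toolchain_env() {
    for tool in nvc nvc++ mpicc mpicxx cmake pixi; do
        report_tool "$tool"
    done
    echo "PATH=$PATH"
    echo "NVHPC=${NVHPC:-<unset>}"
    echo "OPAL_PREFIX=${OPAL_PREFIX:-<unset>}"
}
```

Example use: capture_toolchain_env > toolchain-report.txt on both a failing and a working machine, then diff the two reports.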
Related
ZEST: zest-as-library refactor (the cross-repo work this regression coincides with)
Investigation performed on Perlmutter with Claude Code; the MPI failure itself was not reproducible there (NVHPC-bundled MPI is visible). Traces above are from Zoe (athena) and the Perlmutter reproduction attempts.