Build reliability: slatergpu git-main builds fail (path-dep workaround); confirmed facet = find_package(MPI REQUIRED) #95

## Summary

Since the recent cross-repo build-configuration work (making ZEST consumable as a library so Paul's inversion code can depend on it), **building ZEST against the `slatergpu` git-`main` dependency has become unreliable** for multiple group members. The consistent workaround is to point ZEST at a **local-path** checkout of SlaterGPU instead of git-`main`. The *underlying* errors observed so far are **not identical**, so this issue separates the **shared symptom** from the **one fully-traced facet** (an MPI configure failure) and lists candidate fixes and a diagnostic plan.

> Status of confidence: the MPI facet (Zoe) is documented by a configure trace from athena and analyzed statically; it has not been reproduced independently. The umbrella explanation for why builds regressed (cross-repo config/lockfile desync) is a **hypothesis**, not yet confirmed. This issue is intended partly as a reference for an upcoming discussion with Paul.

## Shared symptom (reliable across reporters)

Building with `slatergpu = { git = ..., branch = "main" }` **fails**; building with `slatergpu = { path = "../SlaterGPU" }` **succeeds**.

| Reporter | Machine | Error actually seen | Resolution |
|---|---|---|---|
| Zoe | athena | `find_package(MPI REQUIRED)` configure failure (trace below) | unresolved (this investigation) |
| Bishnu | Perlmutter | not captured | switched to local SlaterGPU |
| Srijita | athena | recalled a runtime `libmkl_intel_lp64.so.1: cannot open shared object file` when running HF | self-resolved after `git pull` (both repos) + `pixi clean` + rebuild; now git-`main` works again |

The differing errors (MPI configure vs MKL runtime) over a common symptom suggest **stale/desynchronized resolved state across the two repos** as the umbrella, with different downstream manifestations — consistent with Srijita's "clean pull + `pixi clean` + rebuild fixed it."

## Confirmed facet: `find_package(MPI REQUIRED)` failure (Zoe)

Building `slatergpu` (standalone `from '.'` and as ZEST's git-source dep) dies during CMake configure:

```
-- The CXX compiler identification is NVHPC 25.5.0   ← compiler IS found
-- USE_ACC: ON
-- COMPILE_CINTW_4C: OFF
-- Could NOT find MPI_C (missing: MPI_C_WORKS)
-- Could NOT find MPI_CXX (missing: MPI_CXX_WORKS)
CMake Error: Could NOT find MPI (missing: MPI_C_FOUND MPI_CXX_FOUND)
  CMakeLists.txt:51 (find_package)
-- Configuring incomplete, errors occurred!
```

The NVHPC compilers themselves are detected fine; only MPI discovery fails.

### Evidence: MPI is currently unused by SlaterGPU

- No `#include <mpi.h>` and **zero `MPI_*` calls** anywhere in the SlaterGPU source.
- Built `libSlaterGPU.a` has **0 undefined MPI symbols**.
- **No** CMake target links `MPI::MPI`; **nothing** reads any `MPI_*` variable; `sgpu.exe` links `SlaterGPU io cintw` + Fortran/LAPACK/GPU libs only.

So the MPI requirement is **vestigial**, appearing in two places:
- `CMakeLists.txt:51` — `find_package(MPI REQUIRED)` (build-time; gated by a hard-coded `USE_MPI=True`).
- `SlaterGPUConfig.cmake.in:7` — `find_dependency(MPI REQUIRED)` (re-imposed on **consumers**, alongside `OpenMP/BLAS/LAPACK`, which *are* used).

(ZEST is the layer that actually uses MPI — `MPI_Send/Recv/Barrier/COMM_WORLD/...` — and finds its own MPI via its own `find_package(MPI)` and a declared `openmpi = "5.*"`.)

## Reproduction attempts (Perlmutter) — could NOT reproduce

A cold standalone SlaterGPU configure on Perlmutter **succeeds**: `find_package(MPI)` resolves to the **NVHPC-bundled HPC-X OpenMPI**:
```
-- Found MPI_C: <nvhpc>/Linux_x86_64/25.5/comm_libs/12.9/hpcx/hpcx-2.22.1/ompi/lib/libmpi.so (found version "3.1")
```
So the failure appears **specific to environments where that bundled MPI is not visible in the build sandbox**. This points to an asymmetry in how dependencies are discovered rather than to a flaky build: the compilers are picked up via a hard-coded `nvc++` on `PATH` plus flags, whereas MPI is the only dependency routed through `find_package`, which has to locate a separate `comm_libs/.../ompi` tree that needs its own environment activation.

Notes from the Perlmutter session (caveats for whoever picks this up):
- A stale cached `slatergpu` `.conda` masked the issue initially; even after `pixi clean cache --build` the package was reused. Forcing a truly cold configure required wiping the per-build CMake work dir.
- Earlier ZEST link failures (`undefined reference to expmat(...cusolverDnContext*...)`) were a **separate stale-cache artifact** (a cached lib built with `void*` handle signatures vs current headers' `cusolverDnContext*`), resolved by a fresh build — not part of the MPI issue, but another symptom of cache staleness.

## Why it likely regressed recently (hypotheses, unconfirmed)

1. **Cross-repo config / lockfile desync** during the zest-as-library refactor (maintainer's leading theory): build configuration is being synchronized across repos; the pixi lockfile is meant to enforce it, so a sync gap would explain intermittent, heterogeneous failures and why "clean pull + `pixi clean` + rebuild" fixes it.
2. **`pixi-build-cmake` backend update** (pinned `>=0.3.6,<=0.3.8`, currently resolving `0.3.8`) tightening build-sandbox isolation, cutting off the host/NVHPC-MPI env leak that previously satisfied the probe.
3. **Not** a change in SlaterGPU `pixi.toml` deps — `cmake=4.0.3.*` (Sep 2025), `libcint` (Mar 2025), backend pin (Jan–Feb 2026) all predate the reports.
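
Hypothesis 1 could be checked by comparing what each checkout's `pixi.lock` records for `slatergpu`. The following is a minimal sketch, assuming only that the lockfile is plain text in which the relevant entries mention the package name; `lock_mentions` and `compare_locks` are hypothetical helper names, and no particular lockfile schema is assumed.

```shell
#!/bin/sh
# Hypothetical helper: print the lines of a lockfile that mention a package,
# with leading indentation stripped and duplicates removed.
lock_mentions() {
    # $1 = lockfile path, $2 = package name (case-insensitive substring)
    grep -i -- "$2" "$1" | sed 's/^[[:space:]]*//' | sort -u
}

# Hypothetical helper: diff those mentions between two checkouts.
# Returns 0 when they are identical and 1 when they differ (diff's convention).
compare_locks() {
    # $1 = lockfile from a working checkout
    # $2 = lockfile from a failing checkout
    # $3 = package name to look for
    tmp_a=$(mktemp) || return 2
    tmp_b=$(mktemp) || { rm -f "$tmp_a"; return 2; }
    lock_mentions "$1" "$3" > "$tmp_a"
    lock_mentions "$2" "$3" > "$tmp_b"
    status=0
    diff -u "$tmp_a" "$tmp_b" || status=$?
    rm -f "$tmp_a" "$tmp_b"
    return "$status"
}
```

A call such as `compare_locks working/pixi.lock failing/pixi.lock slatergpu` (paths illustrative) prints a unified diff if the two checkouts resolved `slatergpu` differently. An empty diff would not rule out desync in cached build artifacts, only in the lockfile entries themselves.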

## Candidate fixes (neutral — pending Paul's input on MPI intent)

- **A. Remove / relax the unused MPI requirement**: make `find_package(MPI REQUIRED)` at `CMakeLists.txt:51` non-`REQUIRED` or drop it, and do the same for `find_dependency(MPI REQUIRED)` at `SlaterGPUConfig.cmake.in:7`. Lowest risk *iff* SlaterGPU is not intended to use MPI in the future. **Needs Paul's confirmation of intent.**
- **B. Make the NVHPC-bundled MPI reliably discoverable** in the sandbox (e.g. set `MPI_HOME`/`MPI_C_COMPILER` to the NVHPC `comm_libs` path). Preserves NVHPC/CUDA-aware MPI pairing; adds machinery for a currently-unused dep.
- **C. Declare an MPI (e.g. `openmpi`) in SlaterGPU's deps** so the build is hermetic. Verified on Perlmutter to move discovery into `$PREFIX` and build end-to-end, **but** introduces a conda-vs-NVHPC MPI pairing question (CUDA-aware MPI), and would ideally be exposed via **run-exports** (#89) rather than duplicated in every consumer's `pixi.toml`.
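
For fix B, the hints could be generated from a known `ompi` directory rather than typed by hand. This is a rough sketch, assuming the tree contains `bin/mpicc` and `bin/mpicxx` wrappers (not verified for every NVHPC install); `mpi_cmake_hints` is a hypothetical helper name. `MPI_HOME`, `MPI_C_COMPILER`, and `MPI_CXX_COMPILER` are variables read by CMake's `FindMPI` module. How these arguments would be passed through the pixi build backend is not covered here.

```shell
#!/bin/sh
# Hypothetical helper: print CMake cache arguments that point FindMPI at a
# given OpenMPI tree. Fails if the expected compiler wrappers are missing.
mpi_cmake_hints() {
    # $1 = an .../ompi directory (assumed to contain bin/mpicc and bin/mpicxx)
    ompi_dir=$1
    for wrapper in mpicc mpicxx; do
        if [ ! -x "$ompi_dir/bin/$wrapper" ]; then
            echo "missing wrapper: $ompi_dir/bin/$wrapper" >&2
            return 1
        fi
    done
    printf '%s\n' \
        "-DMPI_HOME=$ompi_dir" \
        "-DMPI_C_COMPILER=$ompi_dir/bin/mpicc" \
        "-DMPI_CXX_COMPILER=$ompi_dir/bin/mpicxx"
}
```

In a plain CMake configure this might be used as `cmake -S . -B build $(mpi_cmake_hints "$ompi_dir")`, which relies on word splitting and so breaks on paths containing spaces.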

Design note for the Paul discussion: if SlaterGPU *will* do its own (CUDA-aware) MPI, prefer B (or C paired with NVHPC's CUDA-aware MPI) and handle propagation to consumers through the run-exports work in #89. If not, A is cleanest.

## Open questions

- Does Paul intend SlaterGPU to use MPI directly (e.g. multi-GPU/multi-rank) in the future? (Determines A vs B/C.)
- Is the umbrella truly cross-repo lockfile desync? What exactly did Srijita's `pixi clean` evict that fixed git-`main`?
- Is Bishnu's Perlmutter failure the MPI facet or another (e.g. MKL/link)? **His actual error was not captured.**

## Diagnostic plan for a failing environment (athena / Bishnu / a fresh student checkout)

Run where the failure is actually visible and capture:
1. `CMakeFiles/CMakeError.log` + `CMakeOutput.log` from the failing slatergpu build (the real MPI try-compile failure, not just "not found").
2. In the build sandbox: `which nvc nvc++ mpicc mpicxx`; `echo $PATH`; `$NVHPC` / `$OPAL_PREFIX`; and whether `<nvhpc>/.../comm_libs/*/hpcx/*/ompi/{bin,lib}` **exists at all** on that install.
3. `pixi --version` and the resolved `pixi-build-cmake` version (tests hypothesis 2).
4. Temporarily comment out `CMakeLists.txt:51`; check whether `find_package(OpenMP)` / `OpenACC` also fail (this distinguishes a missing NVHPC environment from an MPI-specific problem).
5. How NVHPC is activated (module load vs manual export) and whether that activation reaches the pixi build subprocess.
6. Compare the resolved `pixi.lock` (slatergpu rev + build deps) on a failing vs working checkout — tests hypothesis 1.
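
Steps 2 and 3 can be collected in one pass with a small script. This is a sketch under stated assumptions: the function names are hypothetical, the NVHPC install root must be supplied by the caller because it is site-specific, and the script only reports what the shell it runs in can see, so it is only informative if invoked from inside the build sandbox.

```shell
#!/bin/sh
# Report where each named tool resolves on PATH, or that it is missing.
probe_tools() {
    for tool in "$@"; do
        if path=$(command -v "$tool" 2>/dev/null); then
            printf '%s: %s\n' "$tool" "$path"
        else
            printf '%s: NOT FOUND\n' "$tool"
        fi
    done
}

# List HPC-X OpenMPI trees under an NVHPC install root, if any exist.
find_bundled_ompi() {
    # $1 = NVHPC install root (site-specific; pass it explicitly)
    find "$1" -type d -path '*/comm_libs/*/hpcx/*/ompi' -prune -print 2>/dev/null
}

# Print the environment facts asked for in steps 2 and 3.
report_env() {
    # $1 = NVHPC install root (optional)
    probe_tools nvc nvc++ mpicc mpicxx
    printf 'PATH=%s\n' "$PATH"
    printf 'NVHPC=%s\n' "${NVHPC:-<unset>}"
    printf 'OPAL_PREFIX=%s\n' "${OPAL_PREFIX:-<unset>}"
    if [ -n "${1:-}" ]; then
        echo "bundled ompi trees under $1:"
        find_bundled_ompi "$1"
    fi
    if command -v pixi >/dev/null 2>&1; then
        pixi --version
    else
        echo 'pixi: NOT FOUND'
    fi
}
```

Running `report_env /path/to/nvhpc > env-report.txt` (path illustrative) on both a failing and a working machine gives two files that can be diffed directly. The resolved `pixi-build-cmake` version is not captured by this script and still has to be read from the build output or lockfile.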

## Related

- #89 (run-exports to propagate SlaterGPU deps to consumers — relevant to fix C / dep-sync)
- #93 (FindCUDA → CUDAToolkit in `Config.cmake.in` — same file as the MPI `find_dependency`)
- #46 / #49 (pixi rebuild/caching behavior — relevant to the stale-cache observations)
- ZEST: zest-as-library refactor (the cross-repo work this regression coincides with)

---
*Investigation performed on Perlmutter with Claude Code; the MPI failure itself was not reproducible there (NVHPC-bundled MPI is visible). Traces above are from Zoe (athena) and the Perlmutter reproduction attempts.*

