- Move the time units (ms, s) out of the column headers and onto each cell in the Small networks and Large networks tables. The numbers now carry their own unit, so a partial copy of the table no longer loses the units; column headers are reserved for precision modes.
- Rename the heading "Benchmarks" -> "Performance benchmarks" so the section is unambiguous in the table of contents.
- Rename the conda environment shipped in environment-cuda.yml from "sfa-cu132" to "sfa". The env name was an arbitrary label; matching the project name makes the install snippets read more naturally (`conda activate sfa` instead of `conda activate sfa-cu132`). PyPI package names (`sfa-cu128`, `sfa-cu132`, `sfa-cu133`) are unrelated to the conda env name and remain unchanged.
- Update README, INSTALL.md, and doc/install.md to use the new env name everywhere it appears.
Run 27055508425 surfaced one CPU and four CUDA failure modes when
manually triggering wheels.yml. This rewrites the workflow so the full
matrix is expected to pass:
CPU wheel - cibuildwheel rejected the pure-Python wheel
- Symptom: 'Build failed because a pure Python wheel was generated.'
- Fix: drop cibuildwheel for the CPU target and use `python -m build`
to produce one universal sfa-<ver>-py3-none-any.whl. The 3-OS x 4-py
matrix collapses to a single job; pure-Python wheels are not platform
or interpreter specific.
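Why one job is enough can be read off the wheel filename itself. The following is a minimal sketch, not part of the workflow: it splits a wheel filename into its python / ABI / platform tags and checks for the `none-any` pair that marks a pure-Python wheel. The helper names and the second filename in the usage note are hypothetical.

```python
def wheel_tags(filename: str) -> tuple[str, str, str]:
    """Return (python_tag, abi_tag, platform_tag) parsed from a wheel filename.

    Wheel filenames have the shape
    {name}-{version}[-{build}]-{python}-{abi}-{platform}.whl,
    so the last three dash-separated fields are always the tags.
    """
    if not filename.endswith(".whl"):
        raise ValueError(f"not a wheel filename: {filename!r}")
    parts = filename[: -len(".whl")].split("-")
    if len(parts) not in (5, 6):
        raise ValueError(f"unexpected wheel filename layout: {filename!r}")
    python_tag, abi_tag, platform_tag = parts[-3:]
    return python_tag, abi_tag, platform_tag


def is_pure_python_wheel(filename: str) -> bool:
    """True when the wheel is tied to neither an ABI nor a platform."""
    _, abi_tag, platform_tag = wheel_tags(filename)
    return abi_tag == "none" and platform_tag == "any"
```

With this, `is_pure_python_wheel("sfa-0.2.0-py3-none-any.whl")` is true, while a compiled wheel such as the hypothetical `sfa_cu128-0.2.0-cp311-cp311-manylinux_2_28_x86_64.whl` is not.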
CUDA wheels - Jimver/cuda-toolkit was too old for CUDA 13.x
- Symptom: 'Error: Version not available: 13.2.0 / 13.3.0' on every
cu132 / cu133 cell.
- Fix: bump Jimver/cuda-toolkit v0.2.21 -> v0.2.35 (2026-03-29 release;
default CUDA is 13.2 there).
CUDA wheels - sub-package names rejected by Ubuntu apt
- Symptom: 'Unable to locate package cuda-cublas-12-8 /
cuda-cublas_dev-12-8 / cuda-nvrtc_dev-12-8'.
- Cause: Jimver prefixes every `sub-packages` entry with `cuda-`, but
Ubuntu's CUDA apt repos ship cuBLAS and NVRTC as `libcublas-*` /
`libnvrtc-*`. They must live under the separate `non-cuda-sub-packages`
input, which is passed through verbatim.
- Fix:
sub-packages: '["nvcc", "cudart", "cudart-dev"]'
non-cuda-sub-packages: '["libcublas", "libcublas-dev",
"libnvrtc", "libnvrtc-dev"]'
CUDA wheels - cibuildwheel rejected CIBW_ENVIRONMENT
- Symptom: 'cibuildwheel: Malformed environment option ...'.
- Cause: cibuildwheel parses CIBW_ENVIRONMENT with bashlex. The
unquoted semicolons inside SFA_CUDA_ARCH=sm_70;sm_75;... are
interpreted as Bash statement terminators. Also, the original block
unconditionally exported Linux-only CUDA_PATH=/usr/local/cuda and
PATH=/usr/local/cuda/bin:$PATH, which broke the Windows runs since
Jimver on Windows installs CUDA under
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vXX.Y.
- Fix: quote every value that contains ';' or '<' (SFA_CUDA_ARCH,
SFA_CUDA_RUNTIME_REQUIRES), and split the workflow env into
CIBW_ENVIRONMENT_LINUX (with the bind-mount paths) and
CIBW_ENVIRONMENT_WINDOWS (which inherits CUDA_PATH from the Jimver
step automatically).
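The quoting rule can be sketched outside the workflow. This is an illustrative helper, not what wheels.yml does: it builds a `NAME=value` string with `shlex.quote`, which wraps any value containing shell-significant characters such as ';' or '<' in single quotes. The function name is hypothetical and the arch list is shortened.

```python
import shlex


def cibw_environment(variables: dict[str, str]) -> str:
    """Join NAME=value pairs into one shell-safe string.

    shlex.quote leaves plain values (paths, identifiers) untouched and
    single-quotes anything a shell parser would otherwise split or
    interpret, such as ';' or '<'.
    """
    return " ".join(
        f"{name}={shlex.quote(value)}" for name, value in variables.items()
    )


linux_env = cibw_environment(
    {
        "SFA_CUDA_ARCH": "sm_70;sm_75",   # shortened list, for illustration
        "CUDA_PATH": "/usr/local/cuda",
    }
)
print(linux_env)  # SFA_CUDA_ARCH='sm_70;sm_75' CUDA_PATH=/usr/local/cuda
```

Only the value with semicolons gets quoted; the path passes through unchanged, which keeps the resulting string readable in the workflow file.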
CUDA wheels - cuBLAS / NVRTC headers and libs were scattered
- Cause: the `network` method drops cuBLAS / NVRTC headers in
/usr/include and shared libs in /usr/lib/x86_64-linux-gnu/, while
setup.py only looks under $CUDA_HOME/{include,lib64}/. Inside the
manylinux container the additional host dirs are not mounted, so
the build would have failed at link time even after the previous
fixes.
- Fix: add a Linux-only staging step that copies cublas*.h, nvrtc.h,
libcublas*.so*, libnvrtc*.so* into /usr/local/cuda/{include,lib64}/
before cibuildwheel runs. A single `-v /usr/local/cuda:/usr/local/cuda:ro`
bind mount then exposes everything the build needs to the container.
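The staging step amounts to a pattern-based copy. Below is a rough Python equivalent, assuming the file patterns named above; the real step presumably runs as shell commands in the workflow, and the function name is hypothetical.

```python
import shutil
from pathlib import Path

# Patterns taken from the description of the staging step above.
HEADER_PATTERNS = ("cublas*.h", "nvrtc.h")
LIB_PATTERNS = ("libcublas*.so*", "libnvrtc*.so*")


def stage(src_dir: str, dest_dir: str, patterns: tuple[str, ...]) -> list[str]:
    """Copy every file in src_dir matching one of the patterns into dest_dir.

    Returns the names of the copied files. Note that shutil.copy2 follows
    symlinks and copies file contents; a shell step would more likely keep
    the .so symlink chain intact (for example with `cp -a`).
    """
    src, dest = Path(src_dir), Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in patterns:
        for path in sorted(src.glob(pattern)):
            if path.is_file():
                shutil.copy2(path, dest / path.name)
                copied.append(path.name)
    return copied
```

On the runner this would be called twice, once for `/usr/include` into `/usr/local/cuda/include` and once for `/usr/lib/x86_64-linux-gnu` into `/usr/local/cuda/lib64`.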
Publish job 'needs' updated to reference build_cpu_wheel (singular).
The `if: false` gate stays in place; PyPI upload is still off.
Second wheels-build dry run (27056380944) made progress (CPU + sdist
now pass) but the 6 CUDA cells still failed at the CUDA toolkit install:
- Linux: 'Unable to locate package libnvrtc-12-8 / libnvrtc-dev-12-8'.
Only cuBLAS lives in Ubuntu's `lib*` package family on the NVIDIA
apt repo. NVRTC ships with the standard `cuda-` prefix
(cuda-nvrtc-12-8, cuda-nvrtc-dev-12-8), so it belongs back in
Jimver's `sub-packages` input, not in `non-cuda-sub-packages`.
- Windows: the NVIDIA Windows installer rejected `cudart-dev_12.8`
(exit code 3772776473). Windows uses unprefixed names with
underscores (cublas_dev, nvrtc_dev, ...) and does not split a
separate cudart-dev sub-package - the headers ship inside cudart.
Fix:
- Move the toolkit install to two OS-conditional steps, each with the
sub-package naming convention that matches its target installer.
- Linux: sub-packages now ["nvcc", "cudart", "cudart-dev", "nvrtc",
"nvrtc-dev"] (all cuda- prefixed) and non-cuda-sub-packages reduced
to just ["libcublas", "libcublas-dev"].
- Windows: sub-packages ["nvcc", "cudart", "cublas", "cublas_dev",
"nvrtc", "nvrtc_dev"] - the working configuration before the
cudart-dev typo was introduced.
- Drop NVRTC from the Linux staging step. With NVRTC pulled in via the
cuda- prefix it lands in $CUDA_HOME/{include,lib64} directly; only
cuBLAS (still installed as a lib* package) needs to be moved out of
/usr/include and /usr/lib/x86_64-linux-gnu/ so the bind mount of
/usr/local/cuda sees everything.
CPU wheel and sdist are unchanged.
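The two naming conventions can be summarised in a small sketch. The name mangling below is inferred only from the error messages quoted above (`cuda-nvrtc-12-8`, `libnvrtc-12-8`, `cudart-dev_12.8`); it is not Jimver/cuda-toolkit's actual code, and both function names are hypothetical.

```python
def linux_apt_packages(
    version: str, sub_packages: list[str], non_cuda_sub_packages: list[str]
) -> list[str]:
    """Expand sub-package lists into apt package names for one CUDA version.

    `sub_packages` entries get the `cuda-` prefix; `non_cuda_sub_packages`
    entries are passed through as-is. Both get the `-MAJOR-MINOR` suffix.
    """
    major, minor = version.split(".")[:2]
    suffix = f"{major}-{minor}"
    prefixed = [f"cuda-{name}-{suffix}" for name in sub_packages]
    verbatim = [f"{name}-{suffix}" for name in non_cuda_sub_packages]
    return prefixed + verbatim


def windows_installer_components(version: str, sub_packages: list[str]) -> list[str]:
    """Expand sub-package names into `name_MAJOR.MINOR` installer components."""
    major, minor = version.split(".")[:2]
    return [f"{name}_{major}.{minor}" for name in sub_packages]
```

For example, `linux_apt_packages("12.8.0", ["nvrtc", "nvrtc-dev"], ["libcublas"])` yields `cuda-nvrtc-12-8`, `cuda-nvrtc-dev-12-8`, and `libcublas-12-8`, which is why NVRTC has to sit in `sub-packages` and cuBLAS in `non-cuda-sub-packages`.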
…l, drop cu133

Third dry run (27056467626) made it past CUDA install on Linux for cu128 / cu132 (good) but surfaced three new blocking issues. Fixing each so the matrix can come up green:

Issue 1 - cu133 cells fail at install with 'Version not available: 13.3.0'
- Jimver/cuda-toolkit v0.2.35 does not have CUDA 13.3 in its version table yet. Drop the sfa-cu133 row from the matrix until a newer Jimver release supports it; re-add at that point.
- Update docs to match: README, INSTALL.md, doc/install.md, the SFA_PACKAGE_NAME example, the conda-env note about which CUDA majors CI tests, and the `pip install` snippet now reference cu132 instead of cu133.

Issue 2 - Linux Build wheels fails inside auditwheel
- Error: 'Cannot repair wheel, because required library "libcudart.so.12" could not be located'.
- auditwheel was trying to vendor the NVIDIA runtime shared libs into the wheel. We don't want that - the wheel declares pinned PyPI dependencies on nvidia-cublas-cuXX / nvidia-cuda-runtime-cuXX / nvidia-cuda-nvrtc-cuXX through SFA_CUDA_RUNTIME_REQUIRES, so the libs arrive via pip at install time.
- Fix: override CIBW_REPAIR_WHEEL_COMMAND_LINUX to pass --exclude for libcudart, libcublas, libcublasLt, libnvrtc, and libnvrtc-builtins in both soname.12 and soname.13 forms (covers cu128 and cu132).

Issue 3 - Windows Build wheels fails with 'nvcc fatal : Cannot find compiler cl.exe in PATH'
- cibuildwheel spawns the build in a subprocess that does not inherit the Developer Command Prompt environment, so cl.exe is not visible to nvcc even though MSVC is installed on the runner.
- Fix: insert ilammy/msvc-dev-cmd@v1 step (Windows only) after the Jimver toolkit step; it exports VCINSTALLDIR / PATH and friends so any subsequent process can find cl.exe.

CPU wheel, sdist, CUDA install on Linux (cu128/cu132), and CUDA install on Windows (cu128/cu132) are unchanged.
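For the auditwheel fix in Issue 2, the repair command is long enough that generating it helps to see its shape. This is a sketch under assumptions: the `lib.so.12` / `lib.so.13` spelling follows the description above, the exact sonames in the real workflow may differ, and the function name is hypothetical. `{dest_dir}` and `{wheel}` are the placeholders cibuildwheel substitutes in repair commands.

```python
# Libraries that should arrive via pip dependencies instead of being vendored.
EXCLUDED_LIBS = (
    "libcudart",
    "libcublas",
    "libcublasLt",
    "libnvrtc",
    "libnvrtc-builtins",
)
SONAME_MAJORS = (12, 13)  # cu128 and cu132 respectively


def repair_wheel_command(libs=EXCLUDED_LIBS, majors=SONAME_MAJORS) -> str:
    """Build an `auditwheel repair` command with one --exclude per lib/soname."""
    excludes = [f"--exclude {lib}.so.{major}" for lib in libs for major in majors]
    # {dest_dir} and {wheel} are left literal for cibuildwheel to fill in.
    return " ".join(["auditwheel repair -w {dest_dir} {wheel}", *excludes])


print(repair_wheel_command())
```

The resulting string carries ten `--exclude` flags (five libraries times two soname majors) and would be the value of CIBW_REPAIR_WHEEL_COMMAND_LINUX.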
Fourth dry run (27056657014) graduated cu128-windows to a full pass
(11 minutes including the test phase). Three other CUDA cells failed
with two new error classes:
Issue 1 - 'nvcc fatal: Unsupported gpu architecture compute_70' on
both cu132-ubuntu and cu132-windows
- CUDA 13 nvcc no longer accepts -gencode for sm_70. The deprecation
warning had been visible since CUDA 12 ('Support for offline
compilation for architectures prior to sm_75 will be removed in a
future release'); CUDA 13 is that release.
- Fix: remove sm_70 from the cu132 archs list. cu128 keeps sm_70
because CUDA 12.8 still supports it. Volta users (V100, Titan V,
Quadro GV100, etc.) install sfa-cu128.
Issue 2 - cu128-ubuntu test phase failed compiling scipy from source
- The sfa wheel itself built and was repaired (auditwheel exclude
rules worked). The failure was when cibuildwheel ran the test
command in a fresh venv: cp310 installed scipy 1.15.3 from a
manylinux2014 wheel and succeeded, but cp311 picked up scipy 1.16+
which has dropped manylinux2014 wheels. pip then fell back to an
sdist build, the manylinux2014 container had no OpenBLAS, and
meson aborted with 'Dependency OpenBLAS not found'.
- Fix: set CIBW_MANYLINUX_X86_64_IMAGE to manylinux_2_28. That is the
base image scipy 1.16+ targets and matches what the rest of the
scientific-Python wheel matrix has converged to. The bind-mount of
/usr/local/cuda continues to work the same way.
After these two changes the expected matrix outcome is:
- CPU universal wheel : pass (already passing)
- sdist : pass (already passing)
- cuda-sfa-cu128-ubuntu : should pass (manylinux_2_28 scipy)
- cuda-sfa-cu128-windows : pass (already passing)
- cuda-sfa-cu132-ubuntu : should pass (no sm_70, manylinux_2_28)
- cuda-sfa-cu132-windows : should pass (no sm_70)
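The per-wheel arch lists from Issue 1 can be derived instead of hand-maintained. The sketch below is illustrative only: the floor table encodes what the text above states (sm_70 is kept for cu128, sm_75 is the minimum nvcc accepts for cu132), while the full arch list and all names are hypothetical.

```python
# Lowest architecture to compile for, keyed by CUDA major version.
# 12 -> sm_70 is this project's lowest target; 13 -> sm_75 follows from
# the nvcc error quoted in Issue 1.
MIN_SM_BY_CUDA_MAJOR = {12: 70, 13: 75}

# Hypothetical full list, for illustration.
ALL_ARCHS = ["sm_70", "sm_75", "sm_80", "sm_86", "sm_89", "sm_90", "sm_100", "sm_120"]


def archs_for(cuda_major: int, archs: list[str] = ALL_ARCHS) -> list[str]:
    """Keep only the architectures at or above the floor for this CUDA major."""
    floor = MIN_SM_BY_CUDA_MAJOR[cuda_major]
    return [arch for arch in archs if int(arch.removeprefix("sm_")) >= floor]


def as_env_value(archs: list[str]) -> str:
    """Join archs in the semicolon-separated form SFA_CUDA_ARCH uses."""
    return ";".join(archs)
```

Here `archs_for(13)` starts at `sm_75`, while `archs_for(12)` still includes `sm_70`, matching the cu132 / cu128 split described above.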
Fifth dry run (27057065902) graduated 5 of 6 cells; cu132-windows was the lone holdout, dying with

  CUDA\v13.2\include\cuda_runtime.h(82): fatal error C1083: Cannot open include file: 'crt/host_config.h': No such file or directory

CUDA 13's Windows installer split the runtime developer headers (including crt/host_config.h, which cuda_runtime.h pulls in to wire nvcc up to the host MSVC) into a separate `cudart_dev` sub-package. The CUDA 12.8 Windows installer kept those headers inside `cudart`, so the prior config covered cu128-windows by accident but had no chance against cu132.

Fix: add cudart_dev to the Windows sub-packages list. The package also exists on CUDA 12.8 (where it is a thin no-op overlay), so the same list works for both wheels.

Linux is unaffected: the Linux toolkit install already lists "cudart-dev" alongside "cudart", and those headers landed in /usr/local/cuda/include/crt/ as expected.

Expected outcome of the next run: all 6 cells green.
Sixth dry run (27057337822) regressed both Windows cells: adding `cudart_dev` to the Windows sub-packages list (which I thought would just be a no-op on CUDA 12.8) instead broke the install on BOTH cu128 and cu132 because that sub-package name does not exist on Windows for either CUDA version. Linux jobs were unaffected.

Root cause of the underlying problem: NVIDIA's Windows installer reorganised which sub-package carries crt/host_config.h between CUDA 12 and CUDA 13, and the new owner is not consistently called `cudart_dev`. Different secondary sources name it differently and none of those names work for both 12.8 and 13.2.

Fix: skip the sub-package guessing game entirely on Windows by switching to method: 'local'. Jimver then downloads the full NVIDIA installer .exe and runs it silently, which lays down the complete include tree (crt/host_config.h included) regardless of how NVIDIA re-tags individual chunks in future point releases.

Cost: one extra ~3 GB download per Windows cell. cu128-windows had been completing in 11 minutes on the network method, so the local method should land it somewhere around 14-16 minutes - well within the GitHub Actions job budget.

Linux keeps the network method + curated sub-packages + cuBLAS staging step. That combination was verified green in the previous run (cu128-ubuntu 4m55s, cu132-ubuntu 4m57s).
Bump version 0.2.0.dev0 -> 0.2.0 in pyproject.toml and sfa/__init__.py, and enable the publish-to-pypi job in wheels.yml (the prior `if: false` guard is replaced by a tag-ref check) so that pushing a v0.2.0 tag from main triggers wheel build + PyPI upload. The wheels.yml matrix itself is unchanged; the same six cells (CPU universal, sdist, sfa-cu128 / sfa-cu132 on ubuntu and windows) just verified green in dry run 27057416779 will run again on the tag push, this time producing 0.2.0 artifacts and shipping them to PyPI through the configured trusted-publisher relationships for sfa, sfa-cu128, and sfa-cu132.
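The exact condition used in wheels.yml is not shown here, so the following is only a sketch of what a tag-ref check could look like, with a version cross-check added for illustration. The regular expression, the function names, and the idea of comparing against the package version are all assumptions.

```python
import re
from typing import Optional

# Matches refs such as refs/tags/v0.2.0 and captures the version part.
TAG_REF = re.compile(r"^refs/tags/v(?P<version>\d+\.\d+\.\d+)$")


def release_version(ref: str) -> Optional[str]:
    """Return the version encoded in a release tag ref, or None for other refs."""
    match = TAG_REF.match(ref)
    return match.group("version") if match else None


def should_publish(ref: str, package_version: str) -> bool:
    """Publish only when the ref is a release tag matching the package version."""
    tagged = release_version(ref)
    return tagged is not None and tagged == package_version
```

Under these assumptions, `should_publish("refs/tags/v0.2.0", "0.2.0")` is true, while a branch push or a tag that disagrees with the version in pyproject.toml is rejected.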
Summary
First feature release after 0.1.0. Bumps version 0.2.0.dev0 -> 0.2.0
and turns the PyPI publish job on; tag push of v0.2.0 from main will
trigger the release upload after merge.
What this branch carries over from 0.2.0.dev0 (highlights):
- sfa._cuda._native) for compute_influence and SignalPropagation.propagate_iterative. AOT-compiled SASS for SM 7.0 - SM 12.0 plus PTX fallback. Distributed as sfa-cu128 (CUDA 12.8) and sfa-cu132 (CUDA 13.2) wheels in addition to the pure-Python sfa package.
- compute_influence (scipy.linalg.solve), with optional _blas_ctypes MKL / OpenBLAS direct call and threadpoolctl-based thread limiting.
- device= and dtype= kwargs throughout the public API, plus the use_tf32 toggle for the Tensor Core path.
- benchmarks/, with the small-network table (vs v0.1.0 in fp64) and large-network table (GPU only, multiple precisions) reproduced in the README.
- tests/verification.py portable post-install check (renamed from the previous tests/smoke.py).
- tests.yml covers Ubuntu / Windows / macOS-14 across Python 3.10-3.13; wheels.yml produces a universal CPU wheel plus per-OS / per-CUDA / per-Python CUDA wheels and sdist. The full wheels.yml matrix was dry-run green at run 27057416779 (6/6 cells, ~21 min wallclock).
- / FP16 / TF32, expanded INSTALL.md with conda and conda-free build paths, hardware/experimental-setup tables for Performance benchmarks.
Test plan
- v0.2.0 from main
- sfa, sfa-cu128, sfa-cu132 at 0.2.0
- `pip install sfa-cu132; python -c "import sfa; print(sfa.__version__)"`
Required before tag push
PyPI trusted-publisher relationships must be configured at pypi.org
for all three release projects, otherwise the publish step will
fail authentication:
- sfa
- sfa-cu128
- sfa-cu132
For each: Owner dwgoon, Repository sfa, Workflow filename wheels.yml, Environment empty.