
Release 0.2.0 #6

Merged
dwgoon merged 8 commits into main from 0.2.0 on Jun 6, 2026

@dwgoon dwgoon commented Jun 6, 2026

Summary

First feature release after 0.1.0. Bumps version 0.2.0.dev0 -> 0.2.0
and turns the PyPI publish job on; tag push of v0.2.0 from main will
trigger the release upload after merge.

What this branch carries over from 0.2.0.dev0 (highlights):

  • Native CUDA backend (sfa._cuda._native) for compute_influence
    and SignalPropagation.propagate_iterative. AOT-compiled SASS for
    SM 7.0 - SM 12.0 plus PTX fallback. Distributed as sfa-cu128
    (CUDA 12.8) and sfa-cu132 (CUDA 13.2) wheels in addition to the
    pure-Python sfa package.
  • CPU LAPACK closed-form fast path for compute_influence (scipy.linalg.solve),
    with optional _blas_ctypes MKL / OpenBLAS direct call and
    threadpoolctl-based thread limiting.
  • device= and dtype= kwargs throughout the public API, plus the
    use_tf32 toggle for the Tensor Core path.
  • Benchmark suite under benchmarks/, with the small-network table
    (vs v0.1.0 in fp64) and large-network table (GPU only, multiple
    precisions) reproduced in the README.
  • tests/verification.py portable post-install check (renamed from
    the previous tests/smoke.py).
  • Multi-OS CI: tests.yml covers Ubuntu / Windows / macOS-14 across
    Python 3.10-3.13; wheels.yml produces a universal CPU wheel
    plus per-OS / per-CUDA / per-Python CUDA wheels and sdist. The
    full wheels.yml matrix was dry-run green at run 27057416779
    (6/6 cells, ~21 min wallclock).
  • Docs: rewritten Quick start, NVIDIA capitalization for FP64 / FP32
    / FP16 / TF32, expanded INSTALL.md with conda and conda-free build
    paths, hardware/experimental-setup tables for Performance
    benchmarks.

Test plan

  • CI (tests.yml) green on 0.2.0 branch before merge
  • After merge to main, tag v0.2.0 from main
  • Verify wheels.yml on the tag push: 6 build cells + sdist + publish
  • Verify pypi.org pages for sfa, sfa-cu128, sfa-cu132 at 0.2.0
  • Smoke install in a clean venv: pip install sfa-cu132; python -c "import sfa; print(sfa.__version__)"
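The smoke-install item above could also be scripted, so all three distributions are checked in one pass. The following is a minimal sketch using only the standard library. The helper names `installed_version` and `check_release` are hypothetical and not part of the repository.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def check_release(dist_names: Iterable[str], expected: str) -> Dict[str, Optional[bool]]:
    """Map each distribution to True (matches), False (mismatch), or None (not installed)."""
    report = {}
    for name in dist_names:
        found = installed_version(name)
        report[name] = None if found is None else (found == expected)
    return report


if __name__ == "__main__":
    # Distribution names are taken from the PR text; only one is normally installed per venv.
    for name, status in check_release(["sfa", "sfa-cu128", "sfa-cu132"], "0.2.0").items():
        print(f"{name}: {status}")
```

Note that `importlib.metadata.version` takes the distribution name (for example `sfa-cu132`), not the import name `sfa`.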

Required before tag push

PyPI trusted-publisher relationships must be configured at pypi.org
for all three release projects, otherwise the publish step will
fail authentication:

  • sfa
  • sfa-cu128
  • sfa-cu132

For each: Owner dwgoon, Repository sfa, Workflow filename
wheels.yml, Environment empty.

dwgoon added 8 commits June 6, 2026 15:53
- Move the time units (ms, s) out of the column headers and onto each
  cell in the Small networks and Large networks tables. The numbers
  now carry their own unit, so a partial copy of the table no longer
  loses the units; column headers are reserved for precision modes.
- Rename the heading "Benchmarks" -> "Performance benchmarks" so the
  section is unambiguous in the table of contents.
- Rename the conda environment shipped in environment-cuda.yml from
  "sfa-cu132" to "sfa". The env name was an arbitrary label; matching
  the project name makes the install snippets read more naturally
  (`conda activate sfa` instead of `conda activate sfa-cu132`). PyPI
  package names (`sfa-cu128`, `sfa-cu132`, `sfa-cu133`) are unrelated
  to the conda env name and remain unchanged.
- Update README, INSTALL.md, and doc/install.md to use the new env
  name everywhere it appears.

Run 27055508425 surfaced one CPU and four CUDA failure modes when
manually triggering wheels.yml. This rewrites the workflow so the full
matrix is expected to pass:

CPU wheel - cibuildwheel rejected the pure-Python wheel
- Symptom: 'Build failed because a pure Python wheel was generated.'
- Fix: drop cibuildwheel for the CPU target and use `python -m build`
  to produce one universal sfa-<ver>-py3-none-any.whl. The 3-OS x 4-py
  matrix collapses to a single job; pure-Python wheels are not platform
  or interpreter specific.

CUDA wheels - Jimver/cuda-toolkit was too old for CUDA 13.x
- Symptom: 'Error: Version not available: 13.2.0 / 13.3.0' on every
  cu132 / cu133 cell.
- Fix: bump Jimver/cuda-toolkit v0.2.21 -> v0.2.35 (2026-03-29 release;
  default CUDA is 13.2 there).

CUDA wheels - sub-package names rejected by Ubuntu apt
- Symptom: 'Unable to locate package cuda-cublas-12-8 /
  cuda-cublas_dev-12-8 / cuda-nvrtc_dev-12-8'.
- Cause: Jimver prefixes every `sub-packages` entry with `cuda-`, but
  Ubuntu's CUDA apt repos ship cuBLAS and NVRTC as `libcublas-*` /
  `libnvrtc-*`. They must live under the separate `non-cuda-sub-packages`
  input, which is passed through verbatim.
- Fix:
    sub-packages: '["nvcc", "cudart", "cudart-dev"]'
    non-cuda-sub-packages: '["libcublas", "libcublas-dev",
                            "libnvrtc", "libnvrtc-dev"]'

CUDA wheels - cibuildwheel rejected CIBW_ENVIRONMENT
- Symptom: 'cibuildwheel: Malformed environment option ...'.
- Cause: cibuildwheel parses CIBW_ENVIRONMENT with bashlex. The
  unquoted semicolons inside SFA_CUDA_ARCH=sm_70;sm_75;... are
  interpreted as Bash statement terminators. Also, the original block
  unconditionally exported Linux-only CUDA_PATH=/usr/local/cuda and
  PATH=/usr/local/cuda/bin:$PATH, which broke the Windows runs since
  Jimver on Windows installs CUDA under
  C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\vXX.Y.
- Fix: quote every value that contains ';' or '<' (SFA_CUDA_ARCH,
  SFA_CUDA_RUNTIME_REQUIRES), and split the workflow env into
  CIBW_ENVIRONMENT_LINUX (with the bind-mount paths) and
  CIBW_ENVIRONMENT_WINDOWS (which inherits CUDA_PATH from the Jimver
  step automatically).
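The quoting fix can be illustrated with a small helper that builds a space-separated `KEY=value` string and shell-quotes any value containing characters such as `;` or `<`. This is a sketch under assumptions: the function name `format_cibw_environment` is hypothetical, the arch list is an illustrative subset, and the actual workflow writes the value by hand in YAML rather than generating it.

```python
import shlex


def format_cibw_environment(env: dict) -> str:
    """Join KEY=value pairs, quoting values that contain shell-special characters."""
    return " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())


linux_env = format_cibw_environment({
    "CUDA_PATH": "/usr/local/cuda",     # plain path: left unquoted
    "SFA_CUDA_ARCH": "sm_75;sm_80",     # illustrative subset; ';' forces quoting
})
print(linux_env)

# Round-trip check: a POSIX-style split recovers each assignment intact.
print(shlex.split(linux_env))
```

Without the quotes, a Bash-style parser would read the text after the first `;` as a separate statement, which is the failure described above.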

CUDA wheels - cuBLAS / NVRTC headers and libs were scattered
- Cause: the `network` method drops cuBLAS / NVRTC headers in
  /usr/include and shared libs in /usr/lib/x86_64-linux-gnu/, while
  setup.py only looks under $CUDA_HOME/{include,lib64}/. Inside the
  manylinux container the additional host dirs are not mounted, so
  the build would have failed at link time even after the previous
  fixes.
- Fix: add a Linux-only staging step that copies cublas*.h, nvrtc.h,
  libcublas*.so*, libnvrtc*.so* into /usr/local/cuda/{include,lib64}/
  before cibuildwheel runs. A single `-v /usr/local/cuda:/usr/local/cuda:ro`
  bind mount then exposes everything the build needs to the container.
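The staging step amounts to copying files that match a few glob patterns from the host directories into the CUDA tree. Below is a rough Python sketch of that idea. The workflow itself presumably uses shell commands, and the helper name `stage_files` is hypothetical.

```python
import glob
import os
import shutil


def stage_files(source_dirs, patterns, dest_dir):
    """Copy files matching any pattern from source_dirs into dest_dir.

    Returns the base names of the copied files, in copy order.
    Symlinks are followed, so the destination receives regular files.
    """
    os.makedirs(dest_dir, exist_ok=True)
    staged = []
    for source_dir in source_dirs:
        for pattern in patterns:
            for path in sorted(glob.glob(os.path.join(source_dir, pattern))):
                if os.path.isfile(path):
                    shutil.copy2(path, dest_dir)
                    staged.append(os.path.basename(path))
    return staged


if __name__ == "__main__":
    # Paths and patterns mirror the commit message; this needs write access to the CUDA tree.
    stage_files(["/usr/include"], ["cublas*.h", "nvrtc.h"], "/usr/local/cuda/include")
    stage_files(["/usr/lib/x86_64-linux-gnu"], ["libcublas*.so*", "libnvrtc*.so*"],
                "/usr/local/cuda/lib64")
```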

Publish job 'needs' updated to reference build_cpu_wheel (singular).
The `if: false` gate stays in place; PyPI upload is still off.

Second wheels-build dry run (27056380944) made progress (CPU + sdist
now pass) but the 6 CUDA cells still failed at the CUDA toolkit install:

- Linux: 'Unable to locate package libnvrtc-12-8 / libnvrtc-dev-12-8'.
  Only cuBLAS lives in Ubuntu's `lib*` package family on the NVIDIA
  apt repo. NVRTC ships with the standard `cuda-` prefix
  (cuda-nvrtc-12-8, cuda-nvrtc-dev-12-8), so it belongs back in
  Jimver's `sub-packages` input, not in `non-cuda-sub-packages`.
- Windows: the NVIDIA Windows installer rejected `cudart-dev_12.8`
  (exit code 3772776473). Windows uses unprefixed names with
  underscores (cublas_dev, nvrtc_dev, ...) and does not split a
  separate cudart-dev sub-package - the headers ship inside cudart.

Fix:

- Move the toolkit install to two OS-conditional steps, each with the
  sub-package naming convention that matches its target installer.
- Linux: sub-packages now ["nvcc", "cudart", "cudart-dev", "nvrtc",
  "nvrtc-dev"] (all cuda- prefixed) and non-cuda-sub-packages reduced
  to just ["libcublas", "libcublas-dev"].
- Windows: sub-packages ["nvcc", "cudart", "cublas", "cublas_dev",
  "nvrtc", "nvrtc_dev"] - the working configuration before the
  cudart-dev typo was introduced.
- Drop NVRTC from the Linux staging step. With NVRTC pulled in via the
  cuda- prefix it lands in $CUDA_HOME/{include,lib64} directly; only
  cuBLAS (still installed as a lib* package) needs to be moved out of
  /usr/include and /usr/lib/x86_64-linux-gnu/ so the bind mount of
  /usr/local/cuda sees everything.

CPU wheel and sdist are unchanged.
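The two naming conventions can be summarised in code. This sketch reconstructs the final package names purely from the names quoted in the error messages above (`cuda-cublas-12-8`, `libnvrtc-12-8`, `cudart-dev_12.8`). It is not Jimver/cuda-toolkit's actual implementation, and both function names are hypothetical.

```python
def linux_apt_packages(sub_packages, non_cuda_sub_packages, cuda_version):
    """Build apt package names: 'cuda-' prefix for sub-packages, verbatim otherwise."""
    suffix = "-".join(cuda_version.split(".")[:2])        # "12.8" -> "12-8"
    prefixed = [f"cuda-{name}-{suffix}" for name in sub_packages]
    verbatim = [f"{name}-{suffix}" for name in non_cuda_sub_packages]
    return prefixed + verbatim


def windows_installer_components(sub_packages, cuda_version):
    """Build Windows installer component names: unprefixed, '_<major>.<minor>' suffix."""
    major_minor = ".".join(cuda_version.split(".")[:2])   # "12.8.0" -> "12.8"
    return [f"{name}_{major_minor}" for name in sub_packages]


print(linux_apt_packages(["nvcc", "cudart", "cudart-dev", "nvrtc", "nvrtc-dev"],
                         ["libcublas", "libcublas-dev"], "12.8"))
print(windows_installer_components(["nvcc", "cudart", "cublas", "cublas_dev",
                                    "nvrtc", "nvrtc_dev"], "12.8"))
```

Writing the mapping out this way makes the earlier failures easy to see: `libnvrtc` in the verbatim list yields `libnvrtc-12-8`, which the apt repo does not carry.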
…l, drop cu133

Third dry run (27056467626) made it past CUDA install on Linux for
cu128 / cu132 (good) but surfaced three new blocking issues. Fixing
each so the matrix can come up green:

Issue 1 - cu133 cells fail at install with 'Version not available: 13.3.0'
- Jimver/cuda-toolkit v0.2.35 does not have CUDA 13.3 in its version
  table yet. Drop the sfa-cu133 row from the matrix until a newer
  Jimver release supports it; re-add at that point. Update docs to
  match: README, INSTALL.md, doc/install.md, the SFA_PACKAGE_NAME
  example, the conda-env note about which CUDA majors CI tests, and
  the `pip install` snippet now reference cu132 instead of cu133.

Issue 2 - Linux Build wheels fails inside auditwheel
- Error: 'Cannot repair wheel, because required library "libcudart.so.12"
  could not be located'.
- auditwheel was trying to vendor the NVIDIA runtime shared libs into
  the wheel. We don't want that - the wheel declares pinned PyPI
  dependencies on nvidia-cublas-cuXX / nvidia-cuda-runtime-cuXX /
  nvidia-cuda-nvrtc-cuXX through SFA_CUDA_RUNTIME_REQUIRES, so the
  libs arrive via pip at install time.
- Fix: override CIBW_REPAIR_WHEEL_COMMAND_LINUX to pass --exclude for
  libcudart, libcublas, libcublasLt, libnvrtc, and libnvrtc-builtins
  in both soname.12 and soname.13 forms (covers cu128 and cu132).
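The repair command described here could be generated rather than typed out. The sketch below assumes the sonames follow the `<lib>.so.<major>` pattern stated in the fix. The exact strings in wheels.yml are not shown in this PR, and the function name `repair_command` is hypothetical. `{dest_dir}` and `{wheel}` are cibuildwheel's own placeholders and are left unexpanded.

```python
LIBRARIES = ("libcudart", "libcublas", "libcublasLt", "libnvrtc", "libnvrtc-builtins")
CUDA_MAJORS = (12, 13)  # covers cu128 and cu132


def repair_command(libraries=LIBRARIES, majors=CUDA_MAJORS):
    """Build an auditwheel repair command that excludes the NVIDIA runtime libraries."""
    excludes = [f"--exclude {lib}.so.{major}" for lib in libraries for major in majors]
    # Plain (non-f) string: cibuildwheel substitutes {dest_dir} and {wheel} itself.
    return " ".join(["auditwheel repair -w {dest_dir} {wheel}", *excludes])


print(repair_command())
```

The resulting string is what would be assigned to CIBW_REPAIR_WHEEL_COMMAND_LINUX. The excluded libraries are then expected to arrive through the pinned nvidia-* dependencies at install time.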

Issue 3 - Windows Build wheels fails with 'nvcc fatal : Cannot find
compiler cl.exe in PATH'
- cibuildwheel spawns the build in a subprocess that does not inherit
  the Developer Command Prompt environment, so cl.exe is not visible
  to nvcc even though MSVC is installed on the runner.
- Fix: insert ilammy/msvc-dev-cmd@v1 step (Windows only) after the
  Jimver toolkit step; it exports VCINSTALLDIR / PATH and friends so
  any subsequent process can find cl.exe.

CPU wheel, sdist, CUDA install on Linux (cu128/cu132), and CUDA install
on Windows (cu128/cu132) are unchanged.

Fourth dry run (27056657014) graduated cu128-windows to a full pass
(11 minutes including the test phase). Three other CUDA cells failed
with two new error classes:

Issue 1 - 'nvcc fatal: Unsupported gpu architecture compute_70' on
both cu132-ubuntu and cu132-windows
- CUDA 13 nvcc no longer accepts -gencode for sm_70. The deprecation
  warning had been visible since CUDA 12 ('Support for offline
  compilation for architectures prior to sm_75 will be removed in a
  future release'); CUDA 13 is that release.
- Fix: remove sm_70 from the cu132 archs list. cu128 keeps sm_70
  because CUDA 12.8 still supports it. Volta users (V100, Titan V,
  Quadro GV100, etc.) install sfa-cu128.
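The per-wheel arch selection can be sketched as a filter on the CUDA major version. The full arch list below is an illustrative guess within the SM 7.0 to SM 12.0 range named in the PR summary. The real SFA_CUDA_ARCH values are elided in the commit message, and the function name `archs_for` is hypothetical.

```python
# Illustrative list only; the repository's actual arch list is not shown in this PR.
ALL_ARCHS = ("sm_70", "sm_75", "sm_80", "sm_86", "sm_89", "sm_90", "sm_100", "sm_120")


def archs_for(cuda_version):
    """Return a ';'-joined arch list, dropping sm_70 for CUDA 13 and later."""
    major = int(cuda_version.split(".")[0])
    minimum_sm = 75 if major >= 13 else 70
    kept = [arch for arch in ALL_ARCHS if int(arch.split("_")[1]) >= minimum_sm]
    return ";".join(kept)


print(archs_for("12.8"))  # keeps sm_70
print(archs_for("13.2"))  # starts at sm_75
```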

Issue 2 - cu128-ubuntu test phase failed compiling scipy from source
- The sfa wheel itself built and was repaired (auditwheel exclude
  rules worked). The failure was when cibuildwheel ran the test
  command in a fresh venv: cp310 installed scipy 1.15.3 from a
  manylinux2014 wheel and succeeded, but cp311 picked up scipy 1.16+
  which has dropped manylinux2014 wheels. pip then fell back to an
  sdist build, the manylinux2014 container had no OpenBLAS, and
  meson aborted with 'Dependency OpenBLAS not found'.
- Fix: set CIBW_MANYLINUX_X86_64_IMAGE to manylinux_2_28. That is the
  base image scipy 1.16+ targets and matches what the rest of the
  scientific-Python wheel matrix has converged to. The bind-mount of
  /usr/local/cuda continues to work the same way.

After these two changes the expected matrix outcome is:
- CPU universal wheel        : pass (already passing)
- sdist                       : pass (already passing)
- cuda-sfa-cu128-ubuntu       : should pass (manylinux_2_28 scipy)
- cuda-sfa-cu128-windows      : pass (already passing)
- cuda-sfa-cu132-ubuntu       : should pass (no sm_70, manylinux_2_28)
- cuda-sfa-cu132-windows      : should pass (no sm_70)

Fifth dry run (27057065902) graduated 5 of 6 cells; cu132-windows
was the lone holdout, dying with

  CUDA\v13.2\include\cuda_runtime.h(82): fatal error C1083:
  Cannot open include file: 'crt/host_config.h':
  No such file or directory

CUDA 13's Windows installer split the runtime developer headers
(including crt/host_config.h, which cuda_runtime.h pulls in to wire
nvcc up to the host MSVC) into a separate `cudart_dev` sub-package.
The CUDA 12.8 Windows installer kept those headers inside `cudart`,
so the prior config covered cu128-windows by accident but had no
chance against cu132.

Fix: add cudart_dev to the Windows sub-packages list. The package
also exists on CUDA 12.8 (where it is a thin no-op overlay), so the
same list works for both wheels.

Linux is unaffected: the Linux toolkit install already lists
"cudart-dev" alongside "cudart", and those headers landed in
/usr/local/cuda/include/crt/ as expected.

Expected outcome of the next run: all 6 cells green.

Sixth dry run (27057337822) regressed both Windows cells: adding
`cudart_dev` to the Windows sub-packages list (which I thought would
just be a no-op on CUDA 12.8) instead broke the install on BOTH
cu128 and cu132 because that sub-package name does not exist on
Windows for either CUDA version. Linux jobs were unaffected.

Root cause of the underlying problem: NVIDIA's Windows installer
reorganised which sub-package carries crt/host_config.h between
CUDA 12 and CUDA 13, and the new owner is not consistently called
`cudart_dev`. Different secondary sources name it differently and
none of those names work for both 12.8 and 13.2.

Fix: skip the sub-package guessing game entirely on Windows by
switching to method: 'local'. Jimver then downloads the full NVIDIA
installer .exe and runs it silently, which lays down the complete
include tree (crt/host_config.h included) regardless of how NVIDIA
re-tags individual chunks in future point releases.

Cost: one extra ~3 GB download per Windows cell. cu128-windows had
been completing in 11 minutes on the network method, so the local
method should land it somewhere around 14-16 minutes - well within
the GitHub Actions job budget.

Linux keeps the network method + curated sub-packages + cuBLAS
staging step. That combination was verified green in the previous
run (cu128-ubuntu 4m55s, cu132-ubuntu 4m57s).

Bump version 0.2.0.dev0 -> 0.2.0 in pyproject.toml and sfa/__init__.py,
and enable the publish-to-pypi job in wheels.yml (the prior `if: false`
guard is replaced by a tag-ref check) so that pushing a v0.2.0 tag from
main triggers wheel build + PyPI upload.

The wheels.yml matrix itself is unchanged; the same six cells (CPU
universal, sdist, sfa-cu128 / sfa-cu132 on ubuntu and windows) just
verified green in dry run 27057416779 will run again on the tag push,
this time producing 0.2.0 artifacts and shipping them to PyPI through
the configured trusted-publisher relationships for sfa, sfa-cu128,
and sfa-cu132.
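The tag-ref gate on the publish job can be modelled as a simple predicate. The exact workflow condition is not shown in this PR, so the sketch below is an assumption: it requires a `refs/tags/v` ref and additionally checks that the tag matches the package version, which the real workflow may not do. The function name `should_publish` is hypothetical.

```python
def should_publish(git_ref, package_version):
    """Return True only for a tag ref of the form refs/tags/v<package_version>."""
    prefix = "refs/tags/v"
    return git_ref.startswith(prefix) and git_ref[len(prefix):] == package_version


print(should_publish("refs/tags/v0.2.0", "0.2.0"))   # tag push from main
print(should_publish("refs/heads/main", "0.2.0"))    # ordinary branch push
```

A version check of this kind would catch a tag pushed before the 0.2.0.dev0 -> 0.2.0 bump has landed.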
@dwgoon dwgoon merged commit 1528f39 into main Jun 6, 2026
8 checks passed