dwgoon · Jun 6, 2026
diff --git a/.github/workflows/wheels.yml b/.github/workflows/wheels.yml
diff --git a/INSTALL.md b/INSTALL.md
@@ -10,7 +10,6 @@ NVIDIA driver new enough for that CUDA version.
 | `sfa`        | none    | -                  | Linux, macOS, Windows  |
 | `sfa-cu128`  | 12.8.x  | 570 (Linux / Win)  | Linux, Windows         |
 | `sfa-cu132`  | 13.2.x  | 580                | Linux, Windows         |
-| `sfa-cu133`  | 13.3.x  | 580                | Linux, Windows         |
 
 All CUDA wheels share the same AOT-compiled SASS matrix (SM 7.0
 through SM 12.0: Volta, Turing, Ampere, Ada, Hopper, Blackwell), plus
@@ -37,14 +36,13 @@ that is the maximum CUDA version your driver supports.
 
 | Package     | CUDA bundled | Minimum NVIDIA driver | When to pick                                              |
 |-------------|--------------|------------------------|-----------------------------------------------------------|
-| `sfa-cu133` | 13.3.x       | 580                    | Newest hardware / drivers; default for fresh installs.    |
-| `sfa-cu132` | 13.2.x       | 580                    | Matches the `sfa-cu132` conda env used for development.   |
+| `sfa-cu132` | 13.2.x       | 580                    | Newest CUDA stack; matches `environment-cuda.yml`.        |
 | `sfa-cu128` | 12.8.x       | 570                    | Older driver (CUDA 12 line); broadest backwards compat.   |
 
 Example (install the newest one):
 
 ```bash
-pip install sfa-cu133
+pip install sfa-cu132
 ```
 
 Requires Python 3.10+. macOS is not supported because Apple ended
@@ -81,15 +79,15 @@ the host compiler, and `conda` will not install it for you.
 git clone https://github.com/dwgoon/sfa.git && cd sfa
 
 conda env create -f environment-cuda.yml
-conda activate sfa-cu132
+conda activate sfa
 pip install -e .                 # builds the CUDA extension via the env's nvcc
 
 # CPU-only variant (skip CUDA even if nvcc is on PATH):
 SFA_BUILD_CUDA=0 pip install -e .
 ```
 
-This is also how the project maintainers build on Windows: the
-`sfa-cu132` env provides `nvcc` and cuBLAS, while system MSVC handles
+This is also how the project maintainers build on Windows: the `sfa`
+env provides `nvcc` and cuBLAS, while system MSVC handles
 `bindings.cpp`. The resulting extension is e.g.
 `sfa/_cuda/_native.cp312-win_amd64.pyd`.
 
@@ -98,8 +96,8 @@ is what the maintainers test against. The same workflow works for any
 CUDA major / minor that has a `cuda-toolkit` build on the `nvidia`
 channel: edit the two `cuda-version` / `cuda-toolkit` pins in lockstep
 (see [What `environment-cuda.yml` provides](#what-environment-cudayml-provides)
-below) and rename the env on the first line of the file. CUDA 12.8 and
-13.3 environments have been tested in CI.
+below) and rename the env on the first line of the file. CUDA 12.8
+and 13.2 environments have been tested in CI.
 
 ### Option B: conda-free build (system CUDA + system C++ compiler)
 
@@ -180,7 +178,7 @@ and falls through to a CPU-only build (printing
 ### What `environment-cuda.yml` provides
 
 The shipped conda environment file creates a self-contained build
-environment named `sfa-cu132` that does **not** require any
+environment named `sfa` that does **not** require any
 system-wide CUDA install. Everything the build needs - the CUDA
 compiler, the CUDA runtime, cuBLAS headers and import libs, plus the
 Python build and runtime dependencies - is pulled in from the
@@ -199,14 +197,14 @@ Concretely, the file pins:
 
 The `cuda-toolkit` meta-package pulls in `nvcc`, `cudart`, `nvrtc`,
 `cccl`, `cupti`, the profiler API, and the rest of the CUDA dev
-toolchain. After `conda activate sfa-cu132`, `nvcc` is on `PATH` and
+toolchain. After `conda activate sfa`, `nvcc` is on `PATH` and
 `setup.py`'s CUDA-extension build picks it up automatically.
 
 Notes for adjusting the file:
 
 - To target a different CUDA major version, change the two `nvidia::`
   pins (`cuda-version` and `cuda-toolkit`) in lockstep. The env name
-  on the first line (`sfa-cu132`) is just a label; rename it freely.
+  on the first line (`sfa`) is just a label; rename it freely.
 - A host C++ compiler is still required (MSVC on Windows, GCC on
   Linux). The toolchain itself is not bundled by `cuda-toolkit`;
   conda will not install it for you.
@@ -220,7 +218,7 @@ Notes for adjusting the file:
 |----------------------|------------------------------------------------------------------------|
 | `SFA_BUILD_CUDA`     | `0` to force a pure-Python install. Default: build if `nvcc` is found. |
 | `SFA_CUDA_ARCH`      | Semicolon-separated SM list, e.g. `sm_89` (dev) or `sm_70;sm_80;sm_89`. Default: the full wheel-wide AOT matrix. |
-| `SFA_PACKAGE_NAME`   | Override the PyPI name (used by CI to produce e.g. `sfa-cu132` or `sfa-cu133` from the same source tree). |
+| `SFA_PACKAGE_NAME`   | Override the PyPI name (used by CI to produce e.g. `sfa-cu128` or `sfa-cu132` from the same source tree). |
 
 ## Verify the install
 

diff --git a/README.md b/README.md
@@ -43,7 +43,6 @@ set of CUDA optimized `sfa-cuXYZ` versions:
 | `sfa`         | none   | -                   | Linux, macOS, Windows  |
 | `sfa-cu128`   | 12.8.x | 570 (Linux / Win)   | Linux, Windows         |
 | `sfa-cu132`   | 13.2.x | 580                 | Linux, Windows         |
-| `sfa-cu133`   | 13.3.x | 580                 | Linux, Windows         |
 
 Each CUDA wheel ships ahead-of-time compiled SASS for NVIDIA SM 7.0
 through SM 12.0 (Volta, Turing, Ampere, Ada, Hopper, Blackwell) plus a
@@ -67,7 +66,7 @@ supports.
 Example (install the newest one):
 
 ```bash
-pip install sfa-cu133
+pip install sfa-cu132
 ```
 
 > [!IMPORTANT]
@@ -87,7 +86,7 @@ self-contained env):
 ```bash
 git clone https://github.com/dwgoon/sfa.git && cd sfa
 conda env create -f environment-cuda.yml
-conda activate sfa-cu132
+conda activate sfa
 pip install -e .
 ```
 
@@ -280,7 +279,7 @@ S_gpu = compute_influence(
 )
 ```
 
-## Benchmarks
+## Performance benchmarks
 
 ### Hardware setup
 
@@ -329,24 +328,24 @@ S_gpu = compute_influence(
 
 ### Small networks
 
-| # Nodes | # Edges  | CPU iter (FP64) ms | CPU LAPACK (FP64) ms | CUDA (FP64) ms        |
-|---------|----------|--------------------|----------------------|-----------------------|
-|    32   | 992      | 0.1 ± 0.0          | 0.2 ± 0.0 (0.4x)     | 1.3 ± 0.2 (0.06x)     |
-|    64   |  ~4.0 K  | 0.2 ± 0.0          | 0.2 ± 0.0 (0.8x)     | 1.4 ± 0.1 (0.13x)     |
-|   128   | ~16.3 K  | 2.5 ± 0.0          | 0.4 ± 0.0 (**7.2x**) | 1.9 ± 0.1 (1.3x)      |
-|   256   | ~65.3 K  | 6.9 ± 0.2          | 2.4 ± 0.1 (**2.8x**) | 3.1 ± 0.8 (2.2x)      |
-|   512   |  ~262 K  | 38.8 ± 1.7         | 190 ± 46 (0.2x)      | 6.4 ± 0.2 (**6.0x**)  |
-|  1024   | ~1.05 M  | 180 ± 8            | 486 ± 89 (0.4x)      | 47 ± 10 (**3.8x**)    |
-|  2048   | ~4.19 M  | 2140 ± 320         | 3880 ± 2990 (0.6x)   | 245 ± 2 (**8.7x**)    |
-|  4096   | ~16.8 M  | 12520 ± 2380       | 5690 ± 1390 (2.2x)   | 4320 ± 580 (**2.9x**) |
+| # Nodes | # Edges  | CPU iter (FP64)    | CPU LAPACK (FP64)         | CUDA (FP64)                 |
+|---------|----------|--------------------|---------------------------|-----------------------------|
+|    32   | 992      | 0.1 ± 0.0 ms       | 0.2 ± 0.0 ms (0.4x)       | 1.3 ± 0.2 ms (0.06x)        |
+|    64   |  ~4.0 K  | 0.2 ± 0.0 ms       | 0.2 ± 0.0 ms (0.8x)       | 1.4 ± 0.1 ms (0.13x)        |
+|   128   | ~16.3 K  | 2.5 ± 0.0 ms       | 0.4 ± 0.0 ms (**7.2x**)   | 1.9 ± 0.1 ms (1.3x)         |
+|   256   | ~65.3 K  | 6.9 ± 0.2 ms       | 2.4 ± 0.1 ms (**2.8x**)   | 3.1 ± 0.8 ms (2.2x)         |
+|   512   |  ~262 K  | 38.8 ± 1.7 ms      | 190 ± 46 ms (0.2x)        | 6.4 ± 0.2 ms (**6.0x**)     |
+|  1024   | ~1.05 M  | 180 ± 8 ms         | 486 ± 89 ms (0.4x)        | 47 ± 10 ms (**3.8x**)       |
+|  2048   | ~4.19 M  | 2140 ± 320 ms      | 3880 ± 2990 ms (0.6x)     | 245 ± 2 ms (**8.7x**)       |
+|  4096   | ~16.8 M  | 12520 ± 2380 ms    | 5690 ± 1390 ms (2.2x)     | 4320 ± 580 ms (**2.9x**)    |
 
 ### Large networks
 
-| # Nodes | # Edges | CPU LAPACK (FP64) s | CUDA TF32 (FP32) s   | CUDA FP32 (no TF32) s | CUDA FP16 s              |
-|---------|---------|---------------------|----------------------|-----------------------|--------------------------|
-|  5000   |  ~25 M  |  5.10 ± 2.24             | 0.366 ± 0.027 (14x)  | 0.356 ± 0.034 (14x)   | 0.349 ± 0.037 (**15x**)  |
-| 10000   | ~100 M  | 17.60 ± 0.57             | 1.55 ± 0.05 (11x)    | 4.07 ± 0.06 (4.3x)    | 1.13 ± 0.16 (**16x**)    |
-| 20000   | ~400 M  | 70.88 ± 0.79             | 9.13 ± 0.10 (7.8x)   | 16.30 ± 0.28 (4.3x)   | 4.28 ± 0.02 (**17x**)    |
+| # Nodes | # Edges | CPU LAPACK (FP64) | CUDA TF32 (FP32)         | CUDA FP32 (no TF32)        | CUDA FP16                  |
+|---------|---------|-------------------|--------------------------|----------------------------|----------------------------|
+|  5000   |  ~25 M  |  5.10 ± 2.24 s    | 0.366 ± 0.027 s (14x)    | 0.356 ± 0.034 s (14x)      | 0.349 ± 0.037 s (**15x**)  |
+| 10000   | ~100 M  | 17.60 ± 0.57 s    | 1.55 ± 0.05 s (11x)      | 4.07 ± 0.06 s (4.3x)       | 1.13 ± 0.16 s (**16x**)    |
+| 20000   | ~400 M  | 70.88 ± 0.79 s    | 9.13 ± 0.10 s (7.8x)     | 16.30 ± 0.28 s (4.3x)      | 4.28 ± 0.02 s (**17x**)    |
 
 - CPU paths show noticeably higher variance than GPU paths (CPU
   LAPACK FP64 stddev reaches ~25-77% of the mean at small `N`),

diff --git a/doc/install.md b/doc/install.md
@@ -9,8 +9,7 @@ one** into a given environment.
 |---------------|--------|---------------------|-----------------------------|
 | `sfa`         | none   | -                   | Linux, macOS, Windows       |
 | `sfa-cu128`   | 12.8.x | 570 (Linux / Win)   | Linux, Windows              |
-| `sfa-cu132`   | 13.2.x | 580                 | Linux, Windows              |
-| `sfa-cu133`   | 13.3.x | 580                 | Linux, Windows (newest)     |
+| `sfa-cu132`   | 13.2.x | 580                 | Linux, Windows (newest)     |
 
 ## Requirements
 
@@ -33,10 +32,9 @@ Run `nvidia-smi` and look at the "CUDA Version" column. That is the
 number:
 
 ```text
-nvidia-smi -> "CUDA Version: 13.3"  -> any of sfa-cu128 / cu132 / cu133
-nvidia-smi -> "CUDA Version: 13.0"  -> sfa-cu128
-nvidia-smi -> "CUDA Version: 12.8"  -> sfa-cu128
-nvidia-smi -> "CUDA Version: 12.6"  -> upgrade your driver or use `sfa` (CPU)
+nvidia-smi -> "CUDA Version: 13.2" or higher  -> sfa-cu132 or sfa-cu128
+nvidia-smi -> "CUDA Version: 12.8" - 13.1     -> sfa-cu128
+nvidia-smi -> "CUDA Version: 12.6"             -> upgrade your driver or use `sfa` (CPU)
 ```
 
 When in doubt, start with `sfa-cu128` for the widest driver coverage
@@ -102,7 +100,7 @@ and the runtime Python deps:
 
 ```bash
 conda env create -f environment-cuda.yml
-conda activate sfa-cu132
+conda activate sfa
 pip install -e .
 ```
 

diff --git a/environment-cuda.yml b/environment-cuda.yml
@@ -1,4 +1,4 @@
-name: sfa-cu132
+name: sfa
 channels:
   - nvidia
   - conda-forge

diff --git a/pyproject.toml b/pyproject.toml
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
 # wheels under the SFA_PACKAGE_NAME env var.
 [project]
 name = "sfa"
-version = "0.2.0.dev0"
+version = "0.2.0"
 description = "Signal flow analysis"
 readme = "README.md"
 license = { text = "MIT" }

diff --git a/sfa/__init__.py b/sfa/__init__.py
@@ -1,4 +1,4 @@
-__version__ = "0.2.0.dev0"
+__version__ = "0.2.0"
 
 from .base import *
 from .containers import AlgorithmSet