Releases: nv-legate/cupynumeric
v26.01.00
This is a beta release of cuPyNumeric.
Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA 12 and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA 12/13 and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/26.01/.
Highlights
Added functionality
- Implement cupynumeric.pad.
- Implement cupynumeric.linalg.pinv (single-CPU/GPU only).
- Implement from_dlpack for exporting cuPyNumeric ndarrays through the DLPack interface.
- Detect when an object being used to initialize a cuPyNumeric ndarray implements the DLPack interface, and use it if possible.
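A minimal sketch of the three additions listed above. It assumes that cuPyNumeric mirrors the NumPy signatures for pad, linalg.pinv and from_dlpack; the NumPy fallback import is only there so the snippet also runs on a machine without cuPyNumeric.

```python
try:
    import cupynumeric as np  # assumed to be a drop-in replacement for NumPy here
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# Pad a 2x2 array with one ring of zeros (the default "constant" mode).
a = np.arange(4.0).reshape(2, 2)
padded = np.pad(a, 1)

# Moore-Penrose pseudo-inverse of a tall, full-rank matrix.
m = np.array([[1.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
m_pinv = np.linalg.pinv(m)

# Pass an array through the DLPack interface.
b = np.from_dlpack(a)

print(padded.shape)
print(m_pinv @ m)
```

For a full-column-rank matrix, the product of the pseudo-inverse with the matrix itself is the identity, which gives a quick sanity check.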
Bugfixes
- Ensure unimplemented stub functions always return cuPyNumeric ndarrays.
Known issues
- We are aware of hangs when using cuSolverMp-based APIs on 4+ Perlmutter nodes. This appears to be a cluster-specific issue, which we are investigating.
- We are aware of performance regressions with cupynumeric.einsum on Blackwell GPUs, starting with cuBLAS 13.2. These are under investigation.
Full Changelog: v25.11.00...v26.01.00
v25.11.00
This is a beta release of cuPyNumeric.
Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA 12 and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA 12/13 and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.11/.
Highlights
Support matrix changes
- Start distributing conda packages for CUDA 13.
- Port to cuSolverMp 0.7 (now the new required minimum).
- Validate cuPyNumeric on DGX Spark.
Note that currently the pip wheels do not include CUDA 13 support, nor cuSolverMp support (linear solve / matrix decomposition APIs are constrained to single-GPU execution when using the wheels).
Added functionality
- cupynumeric.histogram2d and cupynumeric.histogramdd
- cupynumeric.lexsort
- cupynumeric.isin
- Multi-GPU & multi-node implementation of QR factorization, based on cuSolverMp
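The new array functions can be exercised as in the sketch below. It assumes that cuPyNumeric follows the NumPy signatures for histogram2d, lexsort and isin; the sample data is made up for illustration.

```python
try:
    import cupynumeric as np  # assumed to be a drop-in replacement for NumPy here
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# 2-D histogram: two bins per axis over the unit square.
x = np.array([0.1, 0.4, 0.6, 0.9])
y = np.array([0.2, 0.3, 0.7, 0.8])
counts, xedges, yedges = np.histogram2d(x, y, bins=2, range=[[0, 1], [0, 1]])

# lexsort: the LAST key in the tuple is the primary sort key.
primary = np.array([1, 0, 1, 0])
secondary = np.array([3, 2, 1, 4])
order = np.lexsort((secondary, primary))

# isin: element-wise membership test.
mask = np.isin(np.array([1, 2, 3, 4]), np.array([2, 4]))

print(counts)
print(order)
print(mask)
```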
Performance improvements
- Accelerate axis-wise reductions on GPUs by combining multiple kernel invocations into one.
- Parallelize specialized implementation for cupynumeric.take, and use it in more cases, including cupynumeric.take_along_axis.
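For reference, the two indexing functions mentioned above are used as follows. This is a sketch that assumes NumPy-compatible signatures; it illustrates the calls only, not the internal specialized code path.

```python
try:
    import cupynumeric as np  # assumed to be a drop-in replacement for NumPy here
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

a = np.array([[3, 1, 2], [9, 7, 8]])

# take: pick columns 0 and 2 from every row.
first_and_last = np.take(a, np.array([0, 2]), axis=1)

# take_along_axis: apply a per-row index array, here the sorting permutation.
order = np.argsort(a, axis=1)
sorted_rows = np.take_along_axis(a, order, axis=1)

print(first_and_last)
print(sorted_rows)
```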
UX improvements
- I/O functions (e.g. hdf5 to_file) and memory offloading (e.g. offload_to) functions from Legate now accept cuPyNumeric ndarrays directly.
Known issues
- We are aware of hangs when using cuSolverMp-based APIs on 4+ Perlmutter nodes. This appears to be a cluster-specific issue, which we are investigating.
- We are aware of hangs when using UCX 1.19 with the CUDA 13 conda packages. These are typically accompanied by an error message like this:
  ib_md.c:287 UCX ERROR ibv_reg_mr(address=(nil), length=134217728, access=0xf) failed: Bad address
  ucp_mm.c:76 UCX ERROR failed to register address (nil) (cuda) length 134217728 on md[6]=mlx5_0: Input/output error (md supports: host|cuda)
  We are investigating a proper fix. For the time being, setting UCX_MEMTYPE_CACHE=no in the environment appears to resolve the hang, at the cost of potentially decreasing UCX performance.
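One way to apply the UCX_MEMTYPE_CACHE=no workaround is to set the variable in the environment of the launched process. The sketch below is illustrative only: the helper name with_ucx_workaround is hypothetical, and the child command is a stand-in that echoes the variable; a real run would launch the cuPyNumeric program instead.

```python
import os
import subprocess
import sys


def with_ucx_workaround(env=None):
    """Return a copy of `env` (default: the current environment) with the
    UCX memory-type cache disabled. Hypothetical helper, for illustration."""
    patched = dict(os.environ if env is None else env)
    patched["UCX_MEMTYPE_CACHE"] = "no"
    return patched


# Stand-in child process that just prints the variable it received.
child = [sys.executable, "-c", "import os; print(os.environ['UCX_MEMTYPE_CACHE'])"]
result = subprocess.run(
    child, env=with_ucx_workaround(), capture_output=True, text=True, check=True
)
print(result.stdout.strip())
```

Exporting the variable in the shell before launching has the same effect; the helper only makes the setting explicit when jobs are started from Python.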
Full Changelog: v25.10.00...v25.11.00
v25.10.00
This is a beta release of cuPyNumeric.
Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.10/.
Highlights
Added functionality
- Implement cupynumeric.in1d.
- Add DLPack import/export support to cuPyNumeric ndarrays.
- Allow batched input for cupynumeric.linalg.solve.
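A batched solve stacks several independent systems along the leading dimension. The sketch below assumes that cupynumeric.linalg.solve follows NumPy's batched convention, with coefficient matrices of shape (batch, n, n) and right-hand sides of shape (batch, n, 1).

```python
try:
    import cupynumeric as np  # assumed to be a drop-in replacement for NumPy here
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# Two independent 2x2 systems, stacked along axis 0.
a = np.array([[[2.0, 0.0], [0.0, 4.0]],
              [[1.0, 1.0], [0.0, 1.0]]])
# One right-hand-side column per system, shape (2, 2, 1).
b = np.array([[[2.0], [4.0]],
              [[3.0], [1.0]]])

x = np.linalg.solve(a, b)

print(x.shape)
print(x[..., 0])
```

Keeping the right-hand side explicitly three-dimensional avoids any ambiguity about whether a 2-D array is a batch of vectors or a single matrix.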
Performance improvements
- Optimized implementation for the special axis= case of cupynumeric.take.
- Improve heuristics for choosing between batched and unbatched matrix multiplication.
- Improved implementation of cupynumeric.nonzero that uses no additional scratch space.
- Identify special cases of advanced indexing that can be executed faster using cupynumeric.einsum.
Documentation / profiling
- Add a tutorial on using Legate Tasks to extend cuPyNumeric.
- Add a user warning when an operation (e.g. printing to the console) causes a sharded array to be gathered onto a single memory.
- Add sub-boxes to the Legate profiler, showing how long the Python interpreter spends inside cuPyNumeric API calls.
Breaking changes
- Move nightly conda packages to a dedicated channel, -c legate-nightly.
Known issues
- We are aware of hangs occurring under certain platforms and UCC configurations, when using cuSolverMp-backed multi-GPU operations (Cholesky factorization and linear solve). We expect these to be fixed by the 25.11 release, which updates to cuSolverMp 0.7.
Full Changelog: v25.08.00...v25.10.00
v25.08.00
This is a beta release of cuPyNumeric.
Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.08/.
New features
Added functionality
- Multi-node multi-GPU capable SVD, specialized for tall-skinny matrices
- cupynumeric.cross
- cupynumeric.insert
- cupynumeric.logspace
- cupynumeric.real_if_close
- cupynumeric.roots
- cupynumeric.ravel_multi_index
- cupynumeric.copyto
- cupynumeric.diagflat
- cupynumeric.delete
- cupynumeric.nan_to_num
- Support multi-axis reductions
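A few of the functions listed above, together with a multi-axis reduction, can be tried out as follows. This is a sketch that assumes NumPy-compatible signatures in cuPyNumeric.

```python
try:
    import cupynumeric as np  # assumed to be a drop-in replacement for NumPy here
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# Cross product of the x and y unit vectors gives the z unit vector.
c = np.cross(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))

# Insert the value 3 before index 2.
ins = np.insert(np.array([1, 2, 4]), 2, 3)

# Three points spaced evenly on a log scale between 10**0 and 10**2.
grid = np.logspace(0, 2, num=3)

# Convert the 2-D index (1, 2) into a flat index for a 3x4 array.
flat = np.ravel_multi_index((np.array([1]), np.array([2])), (3, 4))

# Multi-axis reduction: sum over the first and last axes at once.
x = np.arange(24).reshape(2, 3, 4)
s = x.sum(axis=(0, 2))

print(c, ins, grid, flat, s)
```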
Performance Improvements
- Improve robustness & speed of cupynumeric.sort, by combining allocations where possible, and adding synchronization barriers around NCCL collectives.
- Remove some extraneous blocking that was only necessary to match the behavior of NumPy 1.x.
- Improve performance of NumPy fallback, in particular removing extraneous array copies, and adding special cases for quick fallback to functions such as cupynumeric.concatenate.
Miscellaneous
- Unify all environment variables that control cuPyNumeric's NumPy fallback heuristics into a single one, CUPYNUMERIC_MAX_EAGER_VOLUME.
- Allow any available BLAS implementation to be used in a source build.
Full Changelog: v25.07.00...v25.08.00
v25.07.00
This is a beta release of cuPyNumeric.
Pip wheels are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, for Linux (x86-64 and ARM64, with CUDA and multi-node support) and macOS (for ARM64). Conda packages are available at https://anaconda.org/legate/cupynumeric, for Linux (x86-64 and ARM64, with CUDA and multi-node support). GASNet-based (rather than UCX-based) conda packages are under the gex label. Windows is currently supported through WSL.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.07/.
New features
Added functionality
- Multi-node multi-GPU capable cupynumeric.linalg.solve and cupynumeric.linalg.cholesky, backed by cuSolverMp.
- Single-GPU cupynumeric.linalg.eigh/eigvalsh, backed by cuSolver.
- cupynumeric.round
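The linear algebra entry points above are called the same way regardless of how many GPUs back them. The sketch below assumes NumPy-compatible signatures and uses a small symmetric positive-definite matrix chosen for illustration.

```python
try:
    import cupynumeric as np  # assumed to be a drop-in replacement for NumPy here
except ImportError:
    import numpy as np  # fallback so the sketch runs without cuPyNumeric

# Small symmetric positive-definite matrix.
a = np.array([[4.0, 2.0], [2.0, 3.0]])

# Cholesky factor (lower triangular), so that lower @ lower.T == a.
lower = np.linalg.cholesky(a)

# Eigenvalues and eigenvectors of a symmetric matrix.
w, v = np.linalg.eigh(a)

# Round to one decimal place.
r = np.round(np.array([1.26, 2.74]), 1)

print(lower)
print(w)
print(r)
```

Checking that lower @ lower.T reproduces the input, and that a @ v equals v scaled column-wise by w, is a cheap way to validate results after switching backends.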
Support matrix changes
- macOS wheels are now available on PyPI.
- Add support for Blackwell CUDA architecture and MNNVL.
- Drop support for Python 3.10 and add support for Python 3.13.
- Remove NumPy 1.X restriction from packages (now compatible with NumPy 2.X).
Tuning
- Add an optional "doctor" mode that detects some common anti-patterns causing bad performance. Enable it with CUPYNUMERIC_DOCTOR=1; see https://docs.nvidia.com/cupynumeric/25.07/api/settings.html#doctor for more information.
Documentation
- A basic cuPyNumeric tutorial is available, see https://docs.nvidia.com/cupynumeric/25.07/user/tutorial.html.
- Start publishing nightly doc builds to https://nv-legate.github.io/cupynumeric.
Full Changelog: v25.03.02...v25.07.00
Known issues
- Multi-node runs can occasionally segfault at exit. This issue is under investigation. Preliminary investigation suggests that the issue depends on the ordering between cuPyNumeric and OpenBLAS teardown. There is no impact to the correctness of the computation and subsequent GPU usage.
- If the user explicitly forces multi-GPU execution of a sorting operation on very small arrays (about as many elements as the number of GPUs) this can result in CUDA errors. In normal conditions cuPyNumeric would not be GPU-accelerating operations of this size. A fix for this issue is in development and will be made available in an upcoming nightly build.
v25.03.02
This is a beta release of cuPyNumeric.
Linux x86 and ARM builds for Python 3.10 - 3.12 are available on PyPI at https://pypi.org/project/nvidia-cupynumeric/, and as conda packages at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.03/.
New features
PIP install support
With this release, Linux x86 and ARM builds of cuPyNumeric are available for Python 3.10 - 3.12 as Python wheels on PyPI in addition to conda.
- cuPyNumeric can be installed with:
  pip install nvidia-cupynumeric
  See https://docs.nvidia.com/cupynumeric/25.03/installation.html#installing-pypi-packages for further instructions.
- These wheels support multi-node execution through UCX.
  See https://docs.nvidia.com/legate/25.03/networking-wheels.html for more details.
v25.03.00
This is a beta release of cuPyNumeric.
Linux x86 and ARM conda packages are available for this release at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.03/.
New features
Licensing
- With this release the Legate framework, on which cuPyNumeric is based, becomes open-source, under the Apache-2.0 license. This makes the entire cuPyNumeric stack (anything above the CUDA library level) open-source.
Added functionality
- Matrix exponential: cupynumeric.linalg.expm
- Batched eigendecomposition: cupynumeric.linalg.eigvals & cupynumeric.linalg.eig
Performance improvements
- No longer doing unnecessary streaming when running matrix multiplication on a single processor/GPU.
UX improvements
- Add the legate.core.ProfileRange Python context manager, to annotate sub-spans within a larger task span on the profiler visualization.
- Add the local_task_array helper function, that can be used in Python tasks to create a view over a Store/Array argument, using a NumPy or CuPy array as appropriate based on the type of memory where the data is located.
Documentation improvements
- Add a user guide chapter on accelerating multi-GPU HDF5 workloads.
Known issues
- We are aware of possible performance regressions when using UCX 1.18. We are temporarily restricting our packages to UCX <= 1.17 while we investigate this.
v25.01.00
This is a beta release of cuPyNumeric.
Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/25.01/.
New features
Added functionality
- Add the method parameter to cupynumeric.convolve.
- Increase the maximum array dimension from 4 to 6.
- Experimental support for NumPy 2.0 (not reflected in the package constraints yet).
Memory management enhancements
- Updates to take advantage of the deferred-eager pool unification in Legate. This change has the potential to increase the effective available memory capacity by up to 100% for many use cases. It also removes the need for the user to adjust the --eager-alloc-percentage.
- Add the offload_to() API, that allows a user to offload an array to a particular memory kind, such that any copies in other memories are discarded. This can be useful e.g. to evict an array from GPU memory onto system memory, freeing up space for subsequent GPU tasks.
I/O improvements
- Use cuFile to accelerate HDF5 reads on the GPU.
- Add support for reading "binary" HDF5 datasets (in particular useful for reading boolean-type datasets).
UX Improvements
- Consider NUMA node topology when allocating CPU cores and memory during automatic machine configuration.
- Add environment variable LEGATE_LIMIT_STDOUT, to only print out the output from one of the copies of the top-level program in a multi-process execution.
- Remove an extraneous warning about __buffer__ being unimplemented.
Deprecations
- Drop support for the Maxwell GPU architecture. cuPyNumeric now requires at least Pascal (sm_60).
v24.11.02
This is a patch release of cuPyNumeric.
Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.
Packaging Changes
- Update for Legate v24.11.01
v24.11.01
This is a patch release of cuPyNumeric.
Linux x86 and ARM conda packages are available at https://anaconda.org/legate/cupynumeric.
Documentation for this release can be found at https://docs.nvidia.com/cupynumeric/24.11/.
Bug Fixes
- Explicit fallback to __array__() on __buffer__