
[BUG] cupynumeric.linalg.solve fails (raises errors) when the matrix size is >= 2048 on multiple GPUs #1253

@fixJ

Description

Software versions

System info:
Python : 3.12.12 | packaged by Anaconda, Inc. | (main, Oct 21 2025, 20:16:04) [GCC 11.2.0]
Platform : Linux-5.14.0-570.39.1.el9_6.x86_64-x86_64-with-glibc2.34
GPU driver : 580.82.09
GPU devices :
GPU 0 : NVIDIA GeForce RTX 2080 Ti
GPU 1 : NVIDIA GeForce RTX 2080 Ti
GPU 2 : NVIDIA GeForce RTX 2080 Ti
GPU 3 : NVIDIA GeForce RTX 2080 Ti
GPU 4 : NVIDIA GeForce RTX 2080 Ti
GPU 5 : NVIDIA GeForce RTX 2080 Ti

Package versions:
legion : legion-25.12.0-22-g4679528b4 (commit: 4679528b4cd3f71a9cebc642e39a1c0f074c717a)
legate : 26.01.00
cupynumeric : 26.01.00
numpy : 1.26.4
scipy : 1.16.3
numba : (failed to detect)

Legate build configuration:
build_type : Release
use_openmp : True
use_cuda : True
networks : ucx
conduit :
configure_options : --LEGATE_ARCH=arch-conda;--with-python;--with-cc=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-cc;--with-cxx=/tmp/conda-croot/legate/_build_env/bin/x86_64-conda-linux-gnu-c++;--build-march=haswell;--cmake-generator=Ninja;--with-openmp;--with-cuda;--build-type=release;--with-ucx

Package details:
cuda-version : cuda-version-13.1-h2ff5cdb_3 (conda-forge)
legate : legate-26.01.00-cuda13_py312_ucx_gpu_g3ccb63960_0 (legate)
cupynumeric : cupynumeric-26.01.00-cuda13_py312_gpu_gae1c7878_0 (legate)

Jupyter notebook / Jupyter Lab version

No response

Expected behavior

The minimal reproducer code:
import cupynumeric as cp

if __name__ == '__main__':
    n = 2048
    A = cp.random.rand(n, n)
    b = cp.random.rand(n, n)
    cp.linalg.solve(A, b)

The operation is expected to complete successfully on any number of GPUs, as it already does on a single GPU.
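For reference, a successful run can be sanity-checked through the relative residual of the solve. The sketch below uses plain NumPy as a stand-in (cupynumeric mirrors the NumPy API, so `np` could be swapped for `cupynumeric`); the RNG seeding is an addition for reproducibility, not part of the original reproducer:

```python
import numpy as np  # stand-in for cupynumeric, which mirrors this API

n = 2048
rng = np.random.default_rng(0)
A = rng.random((n, n))
b = rng.random((n, n))

x = np.linalg.solve(A, b)

# Relative residual should be tiny for a well-conditioned random system.
residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
print(f"relative residual: {residual:.2e}")
```

A near-zero residual on 1 GPU versus a crash on 2+ GPUs for the same inputs would confirm the failure is in the multi-GPU path rather than the data.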

Observed behavior

  1. Errors when running:

     legate --gpus 2 test.py

     [error1.txt](https://github.com/user-attachments/files/25037341/error1.txt) (also posted in a comment)
  2. Analysis with the compute-sanitizer tool:

     [error2.txt](https://github.com/user-attachments/files/25037457/error2.txt) (also posted in a comment)
  3. However, with a matrix size < 2048 (e.g. 2047), the code runs successfully regardless of the number of GPUs.
  4. A summary table of the tests, which shows the issue clearly:

| matrix size | GPU count | result |
| ----------- | --------- | ------ |
| < 2048      | 1         | ok     |
| >= 2048     | 1         | ok     |
| < 2048      | >= 2      | ok     |
| >= 2048     | >= 2      | failed |
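The sweep behind this table can be scripted. The sketch below is a dry run that only prints the `legate` commands to execute, and it assumes `test.py` is adapted to read the matrix size `n` from `sys.argv[1]` (the original reproducer hard-codes `n`):

```python
# Dry-run sketch: print the legate command for each cell of the table above.
# Assumption: test.py is modified to take the matrix size as its first argument.

def make_cmd(gpus: int, size: int) -> list[str]:
    """Build the legate invocation for one (GPU count, matrix size) cell."""
    return ["legate", "--gpus", str(gpus), "test.py", str(size)]

for gpus in (1, 2):
    for size in (2047, 2048):  # straddle the observed threshold
        print(" ".join(make_cmd(gpus, size)))
```

Running the printed commands (e.g. via `subprocess.run`) and recording exit codes reproduces the pass/fail pattern in the table.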

Example code or instructions

The minimal reproducer code:

import cupynumeric as cp

if __name__ == '__main__':
    n = 2048
    A = cp.random.rand(n, n)
    b = cp.random.rand(n, n)
    cp.linalg.solve(A, b)

Stack traceback or browser console output

Additional information. Output of `nvidia-smi topo -m`:
        GPU0    GPU1    GPU2    GPU3    GPU4    GPU5    CPU Affinity    NUMA Affinity   GPU NUMA ID
GPU0     X      PIX     PHB     PHB     SYS     SYS     0-7,16-23       0               N/A
GPU1    PIX      X      PHB     PHB     SYS     SYS     0-7,16-23       0               N/A
GPU2    PHB     PHB      X      PIX     SYS     SYS     0-7,16-23       0               N/A
GPU3    PHB     PHB     PIX      X      SYS     SYS     0-7,16-23       0               N/A
GPU4    SYS     SYS     SYS     SYS      X      PIX     8-15,24-31      1               N/A
GPU5    SYS     SYS     SYS     SYS     PIX      X      8-15,24-31      1               N/A
Output of the official NVIDIA simpleP2P sample from https://github.com/NVIDIA/cuda-samples/blob/v12.9/Samples/0_Introduction/simpleP2P/simpleP2P.cu:

[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 6

Checking GPU(s) for support of peer to peer memory access...
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU0) -> NVIDIA GeForce RTX 2080 Ti (GPU1) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU0) -> NVIDIA GeForce RTX 2080 Ti (GPU2) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU0) -> NVIDIA GeForce RTX 2080 Ti (GPU3) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU0) -> NVIDIA GeForce RTX 2080 Ti (GPU4) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU0) -> NVIDIA GeForce RTX 2080 Ti (GPU5) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU1) -> NVIDIA GeForce RTX 2080 Ti (GPU0) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU1) -> NVIDIA GeForce RTX 2080 Ti (GPU2) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU1) -> NVIDIA GeForce RTX 2080 Ti (GPU3) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU1) -> NVIDIA GeForce RTX 2080 Ti (GPU4) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU1) -> NVIDIA GeForce RTX 2080 Ti (GPU5) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU2) -> NVIDIA GeForce RTX 2080 Ti (GPU0) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU2) -> NVIDIA GeForce RTX 2080 Ti (GPU1) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU2) -> NVIDIA GeForce RTX 2080 Ti (GPU3) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU2) -> NVIDIA GeForce RTX 2080 Ti (GPU4) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU2) -> NVIDIA GeForce RTX 2080 Ti (GPU5) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU3) -> NVIDIA GeForce RTX 2080 Ti (GPU0) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU3) -> NVIDIA GeForce RTX 2080 Ti (GPU1) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU3) -> NVIDIA GeForce RTX 2080 Ti (GPU2) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU3) -> NVIDIA GeForce RTX 2080 Ti (GPU4) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU3) -> NVIDIA GeForce RTX 2080 Ti (GPU5) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU4) -> NVIDIA GeForce RTX 2080 Ti (GPU0) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU4) -> NVIDIA GeForce RTX 2080 Ti (GPU1) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU4) -> NVIDIA GeForce RTX 2080 Ti (GPU2) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU4) -> NVIDIA GeForce RTX 2080 Ti (GPU3) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU4) -> NVIDIA GeForce RTX 2080 Ti (GPU5) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU5) -> NVIDIA GeForce RTX 2080 Ti (GPU0) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU5) -> NVIDIA GeForce RTX 2080 Ti (GPU1) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU5) -> NVIDIA GeForce RTX 2080 Ti (GPU2) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU5) -> NVIDIA GeForce RTX 2080 Ti (GPU3) : No
> Peer access from NVIDIA GeForce RTX 2080 Ti (GPU5) -> NVIDIA GeForce RTX 2080 Ti (GPU4) : No
Two or more GPUs with Peer-to-Peer access capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.
Output of `conda list | grep -E "nccl|cu"`:

cuda-cccl_linux-64             13.1.115         ha770c72_0                         conda-forge
cuda-cudart                    13.1.80          hecca717_0                         conda-forge
cuda-cudart-dev_linux-64       13.1.80          h376f20c_0                         conda-forge
cuda-cudart-static_linux-64    13.1.80          h376f20c_0                         conda-forge
cuda-cudart_linux-64           13.1.80          h376f20c_0                         conda-forge
cuda-nvrtc                     13.1.115         hecca717_0                         conda-forge
cuda-nvtx                      13.1.115         hecca717_0                         conda-forge
cuda-version                   13.1             h2ff5cdb_3                         conda-forge
cupy                           13.6.0           py312h045ee1a_2                    conda-forge
cupy-core                      13.6.0           py312h1a70bb2_2                    conda-forge
cupynumeric                    26.01.00         cuda13_py312_gpu_gae1c7878_0       legate
cutensor                       2.3.1.0          h15eaa2f_1                         conda-forge
icu                            73.1             h6a678d5_0
legate                         26.01.00         cuda13_py312_ucx_gpu_g3ccb63960_0  legate
libcublas                      13.2.1.1         h676940d_0                         conda-forge
libcufft                       12.1.0.78        hecca717_0                         conda-forge
libcufile                      1.16.1.26        hd07211c_0                         conda-forge
libcups                        2.4.15           hbe4054b_0
libcurand                      10.4.1.81        h676940d_0                         conda-forge
libcurl                        8.18.0           h4e3cde8_0                         conda-forge
libcusolver                    12.0.9.81        h676940d_0                         conda-forge
libcusolvermp0                 0.7.2.888        h7bcfba5_3                         conda-forge
libcusparse                    12.7.3.1         hecca717_0                         conda-forge
nccl                           2.28.9.1         hd557bf5_1                         conda-forge
ncurses                        6.5              h7934f7d_0
xcb-util-cursor                0.1.5            h5eee18b_0

Could you give me some advice on how to resolve this issue? Please let me know if you need any further information. Thanks a lot.
