Description
Hi,
I'm trying to run the `operator_mpi_example.py` example. It fails at the point where the `WorkStream`'s communicator is set, with a `DISTRIBUTED_FAILURE (27)` error:
```
Rank 0: ===== device info ======
Rank 0: GPU-local-id: 0
Rank 0: GPU-name: NVIDIA A100-SXM4-40GB
Rank 0: GPU-clock: 1410000
Rank 0: GPU-memoryClock: 1215000
Rank 0: GPU-nSM: 108
Rank 0: GPU-major: 8
Rank 0: GPU-minor: 0
Rank 0: ========================
Rank 1: ===== device info ======
Rank 1: GPU-local-id: 1
Rank 1: GPU-name: NVIDIA A100-SXM4-40GB
Rank 1: GPU-clock: 1410000
Rank 1: GPU-memoryClock: 1215000
Rank 1: GPU-nSM: 108
Rank 1: GPU-major: 8
Rank 1: GPU-minor: 0
Rank 1: ========================
Rank 0: Created WorkStream (execution context) on current device.
Rank 1: Created WorkStream (execution context) on current device.
Traceback (most recent call last):
Traceback (most recent call last):
File "/project/home/pr_1sk/operator_mpi_example.py", line 50, in
ctx.set_communicator(comm=MPI.COMM_WORLD.Dup(), provider="MPI")
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/work_stream.py", line 195, in set_communicator
self._handle.set_communicator(comm, provider)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/_internal/library_handle.py", line 108, in set_communicator
cudm.reset_distributed_configuration(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self._validated_ptr, _comm_provider_map[provider], _comm_ptr, _size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "cuquantum/bindings/cudensitymat.pyx", line 180, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
File "cuquantum/bindings/cudensitymat.pyx", line 193, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
check_status(status)
File "cuquantum/bindings/cudensitymat.pyx", line 135, in cuquantum.bindings.cudensitymat.check_status
raise cuDensityMatError(status)
cuquantum.bindings.cudensitymat.cuDensityMatError: DISTRIBUTED_FAILURE (27):
File "/project/home/pr_1sk/operator_mpi_example.py", line 50, in
ctx.set_communicator(comm=MPI.COMM_WORLD.Dup(), provider="MPI")
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/work_stream.py", line 195, in set_communicator
self._handle.set_communicator(comm, provider)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/_internal/library_handle.py", line 108, in set_communicator
cudm.reset_distributed_configuration(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self._validated_ptr, _comm_provider_map[provider], _comm_ptr, _size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "cuquantum/bindings/cudensitymat.pyx", line 180, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
File "cuquantum/bindings/cudensitymat.pyx", line 193, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
check_status(status)
File "cuquantum/bindings/cudensitymat.pyx", line 135, in cuquantum.bindings.cudensitymat.check_status
raise cuDensityMatError(status)
cuquantum.bindings.cudensitymat.cuDensityMatError: DISTRIBUTED_FAILURE (27):
```
I installed the cuquantum package with conda, together with OpenMPI. I tried to look up the meaning of `DISTRIBUTED_FAILURE (27)`, but found basically nothing.
Any suggestions on where to start?
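In case it's relevant, here is the check I ran on my environment before launching. My understanding (possibly wrong, please correct me) is that cuDensityMat loads its MPI communicator plugin from a path given in the `CUDENSITYMAT_COMM_LIB` environment variable, and that an unset variable or a plugin built against a different MPI than mpi4py uses could cause a failure like this:

```python
import os

# If I understand the cuQuantum docs correctly, cuDensityMat resolves its
# MPI communicator plugin (a shared library wrapping the MPI implementation)
# through this environment variable. If it is unset, or points at a library
# built against a different MPI than the one mpi4py was compiled with,
# set_communicator could plausibly fail with a distributed error.
plugin = os.environ.get("CUDENSITYMAT_COMM_LIB")
plugin_exists = plugin is not None and os.path.isfile(plugin)

print("CUDENSITYMAT_COMM_LIB:", plugin)
print("plugin file exists:", plugin_exists)
```

On my machine this prints the variable as unset, which may or may not be the cause here.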