
operator_mpi_example.py results in DISTRIBUTED_FAILURE (27) #211

@SzbKrisztian

Description

Hi,

I'm trying to run the operator_mpi_example.py example. It fails at the point where the WorkStream's communicator is set, with a DISTRIBUTED_FAILURE (27) error.

```
Rank 0: ===== device info ======
Rank 0: GPU-local-id: 0
Rank 0: GPU-name: NVIDIA A100-SXM4-40GB
Rank 0: GPU-clock: 1410000
Rank 0: GPU-memoryClock: 1215000
Rank 0: GPU-nSM: 108
Rank 0: GPU-major: 8
Rank 0: GPU-minor: 0
Rank 0: ========================
Rank 1: ===== device info ======
Rank 1: GPU-local-id: 1
Rank 1: GPU-name: NVIDIA A100-SXM4-40GB
Rank 1: GPU-clock: 1410000
Rank 1: GPU-memoryClock: 1215000
Rank 1: GPU-nSM: 108
Rank 1: GPU-major: 8
Rank 1: GPU-minor: 0
Rank 1: ========================
Rank 0: Created WorkStream (execution context) on current device.
Rank 1: Created WorkStream (execution context) on current device.
Traceback (most recent call last):
Traceback (most recent call last):
File "/project/home/pr_1sk/operator_mpi_example.py", line 50, in
ctx.set_communicator(comm=MPI.COMM_WORLD.Dup(), provider="MPI")
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/work_stream.py", line 195, in set_communicator
self._handle.set_communicator(comm, provider)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/_internal/library_handle.py", line 108, in set_communicator
cudm.reset_distributed_configuration(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self._validated_ptr, _comm_provider_map[provider], _comm_ptr, _size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "cuquantum/bindings/cudensitymat.pyx", line 180, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
File "cuquantum/bindings/cudensitymat.pyx", line 193, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
check_status(status)
File "cuquantum/bindings/cudensitymat.pyx", line 135, in cuquantum.bindings.cudensitymat.check_status
raise cuDensityMatError(status)
cuquantum.bindings.cudensitymat.cuDensityMatError: DISTRIBUTED_FAILURE (27):
File "/project/home/pr_1sk/operator_mpi_example.py", line 50, in
ctx.set_communicator(comm=MPI.COMM_WORLD.Dup(), provider="MPI")
~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/work_stream.py", line 195, in set_communicator
self._handle.set_communicator(comm, provider)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
File "/home/pr_1sk/.local/lib/python3.13/site-packages/cuquantum/densitymat/_internal/library_handle.py", line 108, in set_communicator
cudm.reset_distributed_configuration(
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
self._validated_ptr, _comm_provider_map[provider], _comm_ptr, _size
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "cuquantum/bindings/cudensitymat.pyx", line 180, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
File "cuquantum/bindings/cudensitymat.pyx", line 193, in cuquantum.bindings.cudensitymat.reset_distributed_configuration
check_status(status)
File "cuquantum/bindings/cudensitymat.pyx", line 135, in cuquantum.bindings.cudensitymat.check_status
raise cuDensityMatError(status)
cuquantum.bindings.cudensitymat.cuDensityMatError: DISTRIBUTED_FAILURE (27):

```
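The traceback points at line 50 of the example, which is the `set_communicator` call. For reference, the pattern being executed is roughly this (a minimal sketch reconstructed from the traceback and the log output; the device-selection code and the `WorkStream` keyword argument are my assumptions, not a verbatim copy of the example):

```python
# Minimal sketch of the failing pattern (assumed from the traceback and the
# example's log output; the stock example's device selection may differ).
from mpi4py import MPI
import cupy as cp
from cuquantum.densitymat import WorkStream

rank = MPI.COMM_WORLD.Get_rank()
num_devices = cp.cuda.runtime.getDeviceCount()
dev = cp.cuda.Device(rank % num_devices)  # one GPU per rank
dev.use()

ctx = WorkStream(device_id=dev.id)  # "Created WorkStream ..." succeeds
# This is the call (line 50) that raises DISTRIBUTED_FAILURE (27):
ctx.set_communicator(comm=MPI.COMM_WORLD.Dup(), provider="MPI")
```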

I've installed the cuquantum package and openmpi with conda. I tried to look up the meaning of DISTRIBUTED_FAILURE (27), but found basically nothing.
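From what I can tell from the cuQuantum docs (so treat the variable and library names below as assumptions to verify against your installed version), cuDensityMat loads its MPI communication plug-in through the `CUDENSITYMAT_COMM_LIB` environment variable, which has to point at a `libcudensitymat_distributed_interface_mpi.so` built against the same MPI that mpi4py uses. A quick pre-flight check along these lines:

```python
# Pre-flight check, run under mpirun before creating the WorkStream.
# Hedged: the env-var name CUDENSITYMAT_COMM_LIB and the .so name are taken
# from the cuQuantum docs; verify them against your installed version.
import ctypes
import os

path = os.environ.get("CUDENSITYMAT_COMM_LIB")
print("CUDENSITYMAT_COMM_LIB =", path)
if not path or not os.path.isfile(path):
    raise RuntimeError(
        "CUDENSITYMAT_COMM_LIB must point at "
        "libcudensitymat_distributed_interface_mpi.so"
    )

# dlopen-ing the plug-in directly surfaces the underlying linker error
# (e.g. a libmpi.so from a different MPI flavor) that the bare
# DISTRIBUTED_FAILURE (27) status code hides.
ctypes.CDLL(path, mode=ctypes.RTLD_GLOBAL)
print("MPI interface library loaded OK")
```

If that loads cleanly on every rank, the mismatch is presumably elsewhere (for example, mpi4py built against a different MPI than the one `mpirun` launches).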

Any suggestions where to start?
