In UCX-Py, we used to add the `UCX_RNDV_SCHEME=get_zcopy` override due to potential performance issues; this was recently removed in #836 to allow bounce buffers to function.

In some cases, using `UCX_RNDV_SCHEME=get_zcopy` (which may still be selected by the new `UCX_RNDV_SCHEME=auto` default) causes UCX-Py to hang non-deterministically. This seems to be reproducible only on workstations with at least two GPUs connected via NVLink; I'm unable to reproduce it on a DGX-1, for example.

The MRE involves three files: `listener.py`, which acts as the server; `sender.py`, which reproduces the hang mentioned above; and `sender2.py`, which does NOT reproduce the issue. The only difference between `sender2.py` and the hang-reproducing `sender.py` is how the message to be sent is created: passed into the Python async function (hangs) versus created directly within the Python async function (does NOT hang). The files to reproduce are below.
`listener.py`:

```python
import asyncio
import ucp
import cupy as cp


async def run(device):
    # Set CUDA device and create context before initializing UCX
    cp.cuda.runtime.setDevice(device)
    cp.cuda.runtime.free(0)

    # Initialize UCX before any CUDA memory allocation
    ucp.init()

    async def receiver(endpoint):
        print(" CLIENT: receiving sources...", end="", flush=True)
        await endpoint.recv_obj(allocator=lambda n: cp.empty(n, dtype="uint8"))
        print("CLIENT done receiving sources", flush=True)
        await endpoint.close()
        listener.close()

    print("CREATING listener...", flush=True)
    listener = ucp.create_listener(receiver, 9092)
    print("DONE creating listener...", flush=True)

    while not listener.closed():
        await asyncio.sleep(0.05)


if __name__ == "__main__":
    r = asyncio.run(run(1))
```
`sender.py`:

```python
import asyncio
import ucp
import cupy as cp


async def send(msg):
    ep = await ucp.create_endpoint("localhost", 9092)
    print(f" SERVER: sending {type(msg)}, len: {len(msg)}, dtype: {msg.dtype}", end="", flush=True)
    await ep.send_obj(msg)
    print(" SERVER: done", flush=True)
    await ep.close()


if __name__ == "__main__":
    # Create CUDA context before initializing UCX
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()
    asyncio.run(send(cp.ones(222221, dtype="int32")))
    print("DONE SENDING", flush=True)
```
`sender2.py`:

```python
import asyncio
import ucp
import cupy as cp


async def send():
    ep = await ucp.create_endpoint("localhost", 9092)
    msg = cp.ones(222221, dtype="int32")
    print(f" SERVER: sending {type(msg)}, len: {len(msg)}, dtype: {msg.dtype}", end="", flush=True)
    await ep.send_obj(msg)
    print(" SERVER: done", flush=True)
    await ep.close()


if __name__ == "__main__":
    # Create CUDA context before initializing UCX
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()
    asyncio.run(send())
    print("DONE SENDING", flush=True)
```
Since the hang does not reproduce deterministically, it may be necessary to run listener/client in a loop, e.g.:
```shell
# Listener loop
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_TCP_CM_REUSEADDR=y python listener.py; done

# Client loop
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc python sender.py; done
```
By replacing `sender.py` in the client loop with `sender2.py`, all 100 iterations should complete without any hangs, whereas with `sender.py` the hang normally occurs within the first 10 iterations. Running with `UCX_RNDV_SCHEME=put_zcopy` instead should not reproduce the hang either.
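For convenience, the same overrides can also be set from Python before `ucp.init()` runs, since UCX reads `UCX_*` variables from the process environment at initialization time (a sketch; setting them on the shell command line as above is equivalent):

```python
import os

# Must happen before ucp.init() so UCX picks the values up at config time.
os.environ["UCX_RNDV_SCHEME"] = "get_zcopy"
os.environ["UCX_TLS"] = "tcp,cuda_copy,cuda_ipc"

print(os.environ["UCX_RNDV_SCHEME"])
```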
Given that whether the error occurs depends on how we pass the message via the async Python interface, I suspect it may have something to do with how the event loop is executing, but I do not have any solid evidence for that at the moment.
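One observable difference between the two senders is when the allocation happens relative to the event loop: in `sender.py` the `cp.ones(...)` argument expression is evaluated before `asyncio.run()` starts the loop (and before the endpoint is created), while in `sender2.py` it runs inside the already-running coroutine. A minimal plain-Python sketch of that ordering, with a hypothetical `make_msg` standing in for the CuPy allocation:

```python
import asyncio

order = []


def make_msg():
    # Stand-in for cp.ones(...): argument expressions are evaluated
    # eagerly, before asyncio.run() starts the event loop.
    order.append("allocated")
    return [1] * 4


async def send(msg):
    # By the time the coroutine body runs, the loop is already up.
    order.append("inside event loop")


# sender.py pattern: the message is built before the loop exists.
asyncio.run(send(make_msg()))
print(order)
```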
cc @rlratzel