This repository was archived by the owner on Sep 18, 2025. It is now read-only.

Hangs with get_zcopy and cuda_ipc on dual-GPU workstations #888

@pentschev

Description

In UCX-Py, we used to force the UCX_RNDV_SCHEME=get_zcopy override to work around potential performance issues; that override was recently removed in #836 to allow bounce buffers to function.

In some cases, using UCX_RNDV_SCHEME=get_zcopy (which may still be selected under the new UCX_RNDV_SCHEME=auto default) causes UCX-Py to hang non-deterministically. This seems to be reproducible only on workstations with at least two GPUs connected via NVLink; I'm unable to reproduce it on a DGX-1, for example.

The MRE involves three files: listener.py, which acts as the server; sender.py, which reproduces the hang described above; and sender2.py, which does NOT reproduce the issue. The only difference between sender2.py and the hang-reproducing sender.py is how the message to be sent is created: passed into the Python async function (hangs) versus created directly within the async function (does NOT hang). The files to reproduce are below.

listener.py
import asyncio

import ucp
import cupy as cp

async def run(device):
    # Set CUDA device and create context before initializing UCX
    cp.cuda.runtime.setDevice(device)
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()

    async def receiver(endpoint):
        print("  SERVER: receiving sources...", end="", flush=True)
        await endpoint.recv_obj(allocator=lambda n: cp.empty(n, dtype="uint8"))
        print("SERVER: done receiving sources", flush=True)

        await endpoint.close()
        listener.close()

    print("CREATING listener...", flush=True)
    listener = ucp.create_listener(receiver, 9092)
    print("DONE creating listener...", flush=True)

    while not listener.closed():
        await asyncio.sleep(0.05)


if __name__ == "__main__":
    asyncio.run(run(1))
sender.py
import asyncio

import ucp
import cupy as cp


async def send(msg):
    ep = await ucp.create_endpoint("localhost", 9092)

    print(f"   CLIENT: sending {type(msg)}, len: {len(msg)}, dtype: {msg.dtype}", end="", flush=True)
    await ep.send_obj(msg)
    print("    CLIENT: done", flush=True)
    await ep.close()


if __name__ == '__main__':
    # Create CUDA context before initializing UCX
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()

    asyncio.run(send(cp.ones(222221, dtype='int32')))
    print("DONE SENDING", flush=True)
sender2.py
import asyncio

import ucp
import cupy as cp


async def send():
    ep = await ucp.create_endpoint("localhost", 9092)

    msg = cp.ones(222221, dtype='int32')
    print(f"   CLIENT: sending {type(msg)}, len: {len(msg)}, dtype: {msg.dtype}", end="", flush=True)
    await ep.send_obj(msg)
    print("    CLIENT: done", flush=True)
    await ep.close()


if __name__ == '__main__':
    # Create CUDA context before initializing UCX
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()

    asyncio.run(send())
    print("DONE SENDING", flush=True)

Since the hang does not reproduce deterministically, it may be necessary to run the listener and sender in loops, e.g.:

# Listener loop
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_TCP_CM_REUSEADDR=y python listener.py; done

# Client loop
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc python sender.py; done

By replacing sender.py in the client loop with sender2.py, all 100 iterations should complete without any hangs, whereas the hang with sender.py typically occurs within 10 iterations. Running with UCX_RNDV_SCHEME=put_zcopy instead should not reproduce the hang either.

Given that the hang depends on how the message is passed through the async Python interface, I suspect it may have something to do with how the event loop is executing, but I do not have any solid evidence for that at the moment.
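To make the suspected difference concrete, here is a minimal sketch (no UCX or CUDA involved; `allocate` and `send` are illustrative stand-ins, not the real APIs) of the only thing that differs between the two senders: in the sender.py pattern the buffer is allocated before asyncio.run() starts the event loop, while in the sender2.py pattern it is allocated inside the already-running loop.

```python
import asyncio

# Illustrative only: record when the buffer is allocated relative to the
# event loop. "allocate" stands in for cp.ones(...) in the MRE, and
# "send" stands in for ep.send_obj(msg).
order = []

def allocate(tag):
    order.append(tag)
    return bytearray(16)  # stand-in for a GPU buffer

async def send(buf):
    # runs inside the event loop in both patterns
    order.append("send-inside-loop")

# sender.py pattern: buffer allocated BEFORE asyncio.run() starts the loop
asyncio.run(send(allocate("alloc-before-loop")))

# sender2.py pattern: buffer allocated AFTER the loop has started
async def make_and_send():
    await send(allocate("alloc-inside-loop"))

asyncio.run(make_and_send())

print(order)
```

In the MRE the CUDA allocation ordering relative to loop startup is the only moving part, which is why the event-loop hypothesis seems plausible.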

cc @rlratzel
