In UCX-Py, we used to add the `UCX_RNDV_SCHEME=get_zcopy` override due to potential performance issues; this was recently removed in #836 to allow bounce buffers to function.

In some cases, using `UCX_RNDV_SCHEME=get_zcopy` (which may still be selected by the new `UCX_RNDV_SCHEME=auto` default) causes UCX-Py to hang non-deterministically. This seems to be reproducible only on workstations with at least two GPUs connected via NVLink; I'm unable to reproduce it on a DGX-1, for example.

The MRE involves three files: `listener.py`, which acts as the server; `sender.py`, which reproduces the hang mentioned above; and `sender2.py`, which does NOT reproduce the issue. The only difference between `sender2.py` and the hang-reproducing `sender.py` is how the message to be sent is created: passed into the Python async function (hangs) versus created directly within the Python async function (does NOT hang). The files to reproduce are below.
`listener.py`:

```python
import asyncio
import ucp
import cupy as cp


async def run(device):
    # Set CUDA device and create context before initializing UCX
    cp.cuda.runtime.setDevice(device)
    cp.cuda.runtime.free(0)

    # Initialize UCX before any CUDA memory allocation
    ucp.init()

    async def receiver(endpoint):
        print(" CLIENT: receiving sources...", end="", flush=True)
        await endpoint.recv_obj(allocator=lambda n: cp.empty(n, dtype="uint8"))
        print("CLIENT done receiving sources", flush=True)
        await endpoint.close()
        listener.close()

    print("CREATING listener...", flush=True)
    listener = ucp.create_listener(receiver, 9092)
    print("DONE creating listener...", flush=True)

    while not listener.closed():
        await asyncio.sleep(0.05)


if __name__ == "__main__":
    r = asyncio.run(run(1))
```
`sender.py`:

```python
import asyncio
import ucp
import cupy as cp


async def send(msg):
    ep = await ucp.create_endpoint("localhost", 9092)
    print(f" SERVER: sending {type(msg)}, len: {len(msg)}, dtype: {msg.dtype}", end="", flush=True)
    await ep.send_obj(msg)
    print(" SERVER: done", flush=True)
    await ep.close()


if __name__ == "__main__":
    # Create CUDA context before initializing UCX
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()
    asyncio.run(send(cp.ones(222221, dtype="int32")))
    print("DONE SENDING", flush=True)
```
`sender2.py`:

```python
import asyncio
import ucp
import cupy as cp


async def send():
    ep = await ucp.create_endpoint("localhost", 9092)
    msg = cp.ones(222221, dtype="int32")
    print(f" SERVER: sending {type(msg)}, len: {len(msg)}, dtype: {msg.dtype}", end="", flush=True)
    await ep.send_obj(msg)
    print(" SERVER: done", flush=True)
    await ep.close()


if __name__ == "__main__":
    # Create CUDA context before initializing UCX
    cp.cuda.runtime.free(0)
    # Initialize UCX before any CUDA memory allocation
    ucp.init()
    asyncio.run(send())
    print("DONE SENDING", flush=True)
```
Since the hang does not reproduce deterministically, it may be necessary to run listener/client in a loop, e.g.:
```shell
# Listener loop
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc UCX_TCP_CM_REUSEADDR=y python listener.py; done

# Client loop
for i in {0..100}; do echo $i; UCX_RNDV_SCHEME=get_zcopy UCX_TLS=tcp,cuda_copy,cuda_ipc python sender.py; done
```
By replacing `sender.py` in the client loop with `sender2.py`, all 100 iterations should complete without any hangs, whereas with `sender.py` the hang normally occurs within the first 10 iterations. Running with `UCX_RNDV_SCHEME=put_zcopy` instead should not reproduce the hang either.
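For convenience, the same overrides can also be set from Python before `ucp.init()` runs, since UCX reads `UCX_*` variables from the process environment at initialization time (a sketch; setting them on the shell command line as above is equivalent):

```python
import os

# Must happen before ucp.init() so UCX picks the values up at config time.
os.environ["UCX_RNDV_SCHEME"] = "get_zcopy"
os.environ["UCX_TLS"] = "tcp,cuda_copy,cuda_ipc"

print(os.environ["UCX_RNDV_SCHEME"])
```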
Given that whether the error occurs depends on how we pass the message via the async Python interface, I suspect it may have something to do with how the event loop is executing, but I do not have any solid evidence for that at the moment.
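One observable difference between the two senders is when the allocation happens relative to the event loop: in `sender.py` the `cp.ones(...)` argument expression is evaluated before `asyncio.run()` starts the loop (and before the endpoint is created), while in `sender2.py` it runs inside the already-running coroutine. A minimal plain-Python sketch of that ordering, with a hypothetical `make_msg` standing in for the CuPy allocation:

```python
import asyncio

order = []


def make_msg():
    # Stand-in for cp.ones(...): argument expressions are evaluated
    # eagerly, before asyncio.run() starts the event loop.
    order.append("allocated")
    return [1] * 4


async def send(msg):
    # By the time the coroutine body runs, the loop is already up.
    order.append("inside event loop")


# sender.py pattern: the message is built before the loop exists.
asyncio.run(send(make_msg()))
print(order)
```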
cc @rlratzel