Skip to content

Fix ApplicationContext reference cycle and unhandled write exceptions#634

Open
pentschev wants to merge 22 commits into
rapidsai:mainfrom
pentschev:more-distributed-ucxx-fixes
Open

Fix ApplicationContext reference cycle and unhandled write exceptions#634
pentschev wants to merge 22 commits into
rapidsai:mainfrom
pentschev:more-distributed-ucxx-fixes

Conversation

@pentschev
Copy link
Copy Markdown
Member

@pentschev pentschev commented Apr 17, 2026

UCXListener._cb_data["cb_args"] held a strong reference to ApplicationContext creating a reference cycle that prevented proper cleanup and caused intermittent ucxx.reset() failures in the ucxx_loop fixture. The fix passes weakref.ref(self) in cb_args so the cycle never holds ApplicationContext alive, while Listener retains a direct strong reference (released by close()) to preserve the detection contract in reset(). _listener_handler_coroutine now dereferences the weakref immediately and clears ctx in finally to avoid holding ApplicationContext in cancelled coroutine frames. UCXXListener.stop() now calls ucxx_server.close() before dropping the reference for deterministic cleanup.

Additionally, write() now catches BaseException instead of ucxx.exceptions.UCXError so that CancelledError and ValueError mid-write also trigger abort() and return CommClosedError (consistent with #630), two f-strings in ActiveClients error messages had their missing f prefix added, and gc.collect() is called before ucxx.reset() in the test fixture as a safety net.

@pentschev pentschev self-assigned this Apr 17, 2026
@pentschev pentschev requested a review from a team as a code owner April 17, 2026 16:36
@pentschev pentschev added bug Something isn't working non-breaking Introduces a non-breaking change ucxx distributed-ucxx labels Apr 17, 2026
@pentschev pentschev changed the title Fix possible ApplicationContext and possible unhandled exception Fix ApplicationContext reference cycle and unhandled write exceptions Apr 23, 2026
In pytest-asyncio 1.3.0 (asyncio_mode=auto) + Python 3.14, the
framework replaces item.obj with a sync wrapper before pytest_pyfunc_call
fires, so inspect.iscoroutinefunction() returns False and asyncio.wait_for
was never applied, and tests could hang indefinitely.

Fix by adding a pytest_runtest_call hook that wraps the original async
function with asyncio.wait_for before pytest-asyncio's MonkeyPatch runs.
Also add an optional timeout parameter to wait_listener_client_handlers
as a defensive guard against infinite waits in handler polling loops.
These tests were failing CI with the new 60s default timeout. GPU
transfers of 16 MB buffers with multi_size up to 8 can legitimately
take over 60 seconds on CI hardware.
When asyncio.wait_for fires during wait_listener_client_handlers, the
CancelledError propagated immediately, letting the Listener be GC'd
while a handler coroutine still held an in-flight CUDA send/recv. The
UCX progress thread's cuMemcpyAsync then raced with the GC finalizer's
close_blocking() call, causing a segfault in ucp_mem_type_pack.

The fix catches CancelledError in wait_listener_client_handlers and
defers it until active_clients reaches 0, keeping the Listener alive
long enough for all handlers to complete. Calls task.uncancel() on
Python 3.11+ to prevent immediate re-cancellation on the next await.
Comment thread python/ucxx/ucxx/_lib_async/listener.py Outdated
# Dereference the weakref immediately so the UCXListener's cb_args tuple
# does not keep ApplicationContext alive through this coroutine's frame.
ctx = ctx_weakref()
del ctx_weakref
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to decref ctx_weakref. By definition it can't be keeping anything alive.

Also, I don't understand the comment.

Copy link
Copy Markdown
Member Author

@pentschev pentschev Apr 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was totally AI slop that I missed when committed, it's removed now. I'm still struggling with this PR, it seems I have in fact resolved the original issue (distributed-ucxx failing tests) as I haven't seen it pop up anymore for a while here, but simultaneously the fix seems to have broken the UCXX Python tests, so still trying to fix that and thus so many changes that may seem somewhat random still.

del ctx_weakref was a no-op: weakrefs cannot keep the referent alive
by definition, so there was nothing to release. The accompanying
comment was also wrong for the same reason.
Without an explicit close, the client Endpoint is finalized during asyncio
event loop teardown. Its _finalizer calls close_blocking(), which calls
ucp_worker_progress() from the Python thread concurrently with the
WorkerProgressThread causing a race on cuMemcpyAsync that segfaults.

Closing the client inside the test coroutine (while the event loop is still
running) performs the UCX close handshake while the progress thread is
properly synchronized, so the finalizer is a no-op at teardown time.
This also lets the echo server's ep.close() complete the handshake
immediately rather than blocking for up to 10 s waiting for the peer.
@pentschev pentschev requested a review from a team as a code owner April 28, 2026 18:57
@pentschev pentschev force-pushed the more-distributed-ucxx-fixes branch from 1ceda88 to 2bd2e1d Compare May 12, 2026 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working distributed-ucxx non-breaking Introduces a non-breaking change ucxx

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants