Fix ApplicationContext reference cycle and unhandled write exceptions #634
Open
pentschev wants to merge 22 commits into
wence-
approved these changes
Apr 23, 2026
With pytest-asyncio 1.3.0 (asyncio_mode=auto) on Python 3.14, the framework replaces item.obj with a sync wrapper before pytest_pyfunc_call fires, so inspect.iscoroutinefunction() returned False, asyncio.wait_for was never applied, and tests could hang indefinitely. Fix by adding a pytest_runtest_call hook that wraps the original async function with asyncio.wait_for before pytest-asyncio's MonkeyPatch runs. Also add an optional timeout parameter to wait_listener_client_handlers as a defensive guard against infinite waits in handler polling loops.
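The wrapping step described above can be sketched in isolation. This is a minimal illustration, not the hook from this PR: with_timeout, hangs_forever and finishes are hypothetical names, and the pytest hook wiring (pytest_runtest_call, item.obj) is deliberately left out.

```python
import asyncio
import functools
import inspect


def with_timeout(func, timeout):
    """Wrap an async function so each call is bounded by asyncio.wait_for.

    Hypothetical helper for illustration; sync callables pass through as-is.
    """
    if not inspect.iscoroutinefunction(func):
        return func

    @functools.wraps(func)
    async def wrapper(*args, **kwargs):
        # wait_for cancels the inner coroutine and raises TimeoutError
        # once `timeout` seconds have elapsed.
        return await asyncio.wait_for(func(*args, **kwargs), timeout=timeout)

    return wrapper


async def hangs_forever():
    # Stand-in for a test that would otherwise never return.
    await asyncio.Event().wait()


async def finishes():
    return "ok"
```

The wrapper has to be applied to the original coroutine function: once something has replaced it with a sync wrapper, the inspect.iscoroutinefunction() check no longer matches and the timeout is silently skipped.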
These tests were failing CI with the new 60s default timeout. GPU transfers of 16 MB buffers with multi_size up to 8 can legitimately take over 60 seconds on CI hardware.
When asyncio.wait_for timed out during wait_listener_client_handlers, the CancelledError propagated immediately, letting the Listener be GC'd while a handler coroutine still held an in-flight CUDA send/recv. The UCX progress thread's cuMemcpyAsync then raced with the GC finalizer's close_blocking() call, causing a segfault in ucp_mem_type_pack. The fix catches CancelledError in wait_listener_client_handlers and defers it until active_clients reaches 0, keeping the Listener alive long enough for all handlers to complete. It also calls task.uncancel() on Python 3.11+ to prevent immediate re-cancellation on the next await.
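A rough sketch of the deferral pattern follows, assuming a simple polling loop. wait_until_idle and get_active are hypothetical names, not the actual wait_listener_client_handlers signature, and the demo uses a plain task.cancel() rather than asyncio.wait_for.

```python
import asyncio


async def wait_until_idle(get_active, poll=0.01):
    """Poll until get_active() returns 0, deferring any cancellation.

    Hypothetical sketch: a CancelledError received while handlers are
    still active is remembered and re-raised only once they have finished.
    """
    deferred = None
    while get_active() > 0:
        try:
            await asyncio.sleep(poll)
        except asyncio.CancelledError as exc:
            deferred = exc
            task = asyncio.current_task()
            # Task.uncancel() exists on Python 3.11+; it decrements the
            # task's count of pending cancellation requests.
            if task is not None and hasattr(task, "uncancel"):
                task.uncancel()
    if deferred is not None:
        raise deferred


async def demo():
    state = {"active": 1}
    order = []

    async def handler():
        await asyncio.sleep(0.05)  # stand-in for an in-flight transfer
        order.append("handler done")
        state["active"] = 0

    async def waiter():
        try:
            await wait_until_idle(lambda: state["active"])
        except asyncio.CancelledError:
            order.append("waiter cancelled")
            raise

    h = asyncio.create_task(handler())
    w = asyncio.create_task(waiter())
    await asyncio.sleep(0.01)
    w.cancel()  # arrives while the handler is still running
    await asyncio.gather(h, w, return_exceptions=True)
    return order, w.cancelled()
```

The trade-off is that a cancelled waiter can now outlive its timeout by as long as the slowest handler takes, which is presumably why the optional timeout parameter was added as a separate guard.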
wence-
reviewed
Apr 28, 2026
# Dereference the weakref immediately so the UCXListener's cb_args tuple
# does not keep ApplicationContext alive through this coroutine's frame.
ctx = ctx_weakref()
del ctx_weakref
Contributor
Why do you need to decref ctx_weakref? By definition it can't be keeping anything alive.
Also, I don't understand the comment.
Member
Author
This was totally AI slop that I missed when committing; it's removed now. I'm still struggling with this PR. It seems I have in fact resolved the original issue (distributed-ucxx failing tests), as I haven't seen it pop up here for a while, but at the same time the fix seems to have broken the UCXX Python tests. I'm still trying to fix that, hence so many changes that may still seem somewhat random.
del ctx_weakref was a no-op: weakrefs cannot keep the referent alive by definition, so there was nothing to release. The accompanying comment was also wrong for the same reason.
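The point about weak references can be checked with a few lines of standard-library Python. Context and the cb_args tuple here are hypothetical stand-ins, not the UCXX classes.

```python
import gc
import weakref


class Context:
    """Hypothetical stand-in for an application context object."""


def referent_survives_only_with_strong_ref():
    ctx = Context()
    ref = weakref.ref(ctx)
    cb_args = (ref,)  # callback arguments hold only the weak reference
    alive_before = ref() is ctx
    del ctx  # drop the last strong reference
    gc.collect()
    # The weakref is still stored in cb_args, yet the referent is gone.
    alive_after = cb_args[0]() is not None
    return alive_before, alive_after
```

Holding or deleting the weakref object itself makes no difference to the referent's lifetime; only the strong reference obtained by calling it (ctx = ctx_weakref()) can keep the object alive.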
Without an explicit close, the client Endpoint is finalized during asyncio event loop teardown. Its _finalizer calls close_blocking(), which calls ucp_worker_progress() from the Python thread concurrently with the WorkerProgressThread, causing a race on cuMemcpyAsync that segfaults. Closing the client inside the test coroutine (while the event loop is still running) performs the UCX close handshake while the progress thread is properly synchronized, so the finalizer is a no-op at teardown time. This also lets the echo server's ep.close() complete the handshake immediately rather than blocking for up to 10 s waiting for the peer.
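The shape of that change, closing inside the coroutine so the finalizer has nothing left to do, might look like the sketch below. FakeEndpoint and run_client are hypothetical and only model the ordering; they do not use the UCXX Endpoint API.

```python
import asyncio


class FakeEndpoint:
    """Hypothetical stand-in for a client endpoint; not the UCXX API."""

    def __init__(self):
        self.closed = False
        self.closed_in_running_loop = False

    async def close(self):
        # get_running_loop() raises RuntimeError when no loop is running,
        # so reaching the next line proves the close happened in-loop.
        asyncio.get_running_loop()
        self.closed_in_running_loop = True
        self.closed = True

    def finalize(self):
        """Model of a finalizer fallback; returns True if it had work to do."""
        if self.closed:
            return False
        self.closed = True
        return True


async def run_client(ep):
    try:
        await asyncio.sleep(0)  # placeholder for the test's send/recv
    finally:
        # Close explicitly while the event loop is still running.
        await ep.close()
```

Using try/finally means the endpoint is closed even when the test body raises, so teardown never has to fall back on the finalizer path.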
added 2 commits
May 12, 2026 08:09
1ceda88 to
2bd2e1d
UCXListener._cb_data["cb_args"] held a strong reference to ApplicationContext, creating a reference cycle that prevented proper cleanup and caused intermittent ucxx.reset() failures in the ucxx_loop fixture. The fix passes weakref.ref(self) in cb_args so the cycle never holds ApplicationContext alive, while Listener retains a direct strong reference (released by close()) to preserve the detection contract in reset(). _listener_handler_coroutine now dereferences the weakref immediately and clears ctx in finally to avoid holding ApplicationContext in cancelled coroutine frames. UCXXListener.stop() now calls ucxx_server.close() before dropping the reference for deterministic cleanup.

Additionally, write() now catches BaseException instead of ucxx.exceptions.UCXError so that CancelledError and ValueError mid-write also trigger abort() and return CommClosedError (consistent with #630), two f-strings in ActiveClients error messages had their missing f prefix added, and gc.collect() is called before ucxx.reset() in the test fixture as a safety net.
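For illustration, the broadened exception handling in write() could be sketched as follows. FakeComm, its send callable and the local CommClosedError class are hypothetical stand-ins, and raising CommClosedError (rather than returning it) is an assumption about how the error is surfaced.

```python
import asyncio


class CommClosedError(IOError):
    """Hypothetical stand-in for the comm-closed error type."""


class FakeComm:
    """Hypothetical comm object; `send` is any awaitable-returning callable."""

    def __init__(self, send):
        self._send = send
        self.aborted = False

    def abort(self):
        self.aborted = True

    async def write(self, msg):
        try:
            await self._send(msg)
        except BaseException as exc:
            # BaseException also covers asyncio.CancelledError, which has not
            # been a subclass of Exception since Python 3.8.
            self.abort()
            raise CommClosedError("write failed; connection aborted") from exc


async def failing_send(msg):
    raise ValueError("bad frame")


async def cancelled_send(msg):
    raise asyncio.CancelledError()


async def try_write(send):
    comm = FakeComm(send)
    try:
        await comm.write(b"payload")
    except CommClosedError as exc:
        return comm.aborted, type(exc.__cause__).__name__
    return comm.aborted, None
```

Catching BaseException is broad by design here: a write interrupted halfway leaves the connection in an unknown state, so every failure path aborts it. The cost is that cancellation is reported to the caller as a closed comm rather than as a CancelledError.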