Add support to query UCP debug info from requests #437
Conversation
ucp_request_query is safe at any point in a UCP request's lifetime between obtaining a UCS_PTR_IS_PTR handle and calling ucp_request_free. The previous code freed inside Request::callback (progress thread, outside any lock) while queryRequestAttributes ran on the submit thread under _mutex, causing a race in threaded progress modes that could free the request out from under the query.

Move ucp_request_free into Request::setStatus, where it now executes inside the lock that the submit thread already holds during publish + query. Introduce a small Request::publishRequest helper so every submit site (Tag, Stream, Mem, Am send and rendezvous-recv, Flush, EpClose) performs the "store _request, query attributes" pair atomically under the same _mutex.

With both sides serialized on the same recursive mutex the race is gone with no atomics, no new locking in the callback path beyond what setStatus already takes, and no behavioral change to the disabled (default) configuration.
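A minimal sketch of the synchronization pattern described above, using the member names from this PR (control flow is simplified and member types such as `_status` are assumed; this is not the literal implementation):

```cpp
// Submit thread: store the UCP handle and query its attributes atomically
// under _mutex, via the new publishRequest() helper.
void Request::publishRequest(void* request)
{
  std::lock_guard<std::recursive_mutex> lock(_mutex);
  _request = request;
  queryRequestAttributes();  // ucp_request_query() while the handle is still valid
}

// setStatus() now also frees the UCP handle, under the same recursive _mutex,
// so the free can never race the query above.
void Request::setStatus(ucs_status_t status)
{
  std::lock_guard<std::recursive_mutex> lock(_mutex);
  _status = status;
  if (UCS_PTR_IS_PTR(_request)) {
    ucp_request_free(_request);
    _request = nullptr;
  }
}
```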
/ok to test ed307b3
Attributes _requestAttr{};        ///< Request attributes queried when request is posted
bool _isRequestAttrValid{false};  ///< Whether the request attributes are valid
A default-constructed attribute has UCS_MEMORY_TYPE_UNKNOWN, but a valid one must have a known memory type; can we use that to check whether the request attributes are valid rather than keeping a separate flag?
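A minimal sketch of what that could look like, assuming the Attributes struct stores the queried memory type in a `memoryType` member that defaults to `UCS_MEMORY_TYPE_UNKNOWN` (the member name is hypothetical, not taken from this diff):

```cpp
// Hypothetical: rely on the queried memory type instead of a separate flag.
Attributes _requestAttr{};  // default-constructed => UCS_MEMORY_TYPE_UNKNOWN

bool isRequestAttrValid() const
{
  // A successfully queried request always reports a concrete memory type,
  // so "unknown" doubles as "never queried / not valid".
  return _requestAttr.memoryType != UCS_MEMORY_TYPE_UNKNOWN;
}
```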
 * Every submit site (all `request` methods from child classes and the AM
 * rendezvous-receive path) calls this after obtaining the request handle from the
 * corresponding `ucp_*_nbx` function.
Trying to list the submission sites in a parenthetical like this is a recipe for the documentation going out of date.
// Cancel inflight requests and submit FORCE close ATOMICALLY in a
// single pre-callback, with no ucp_worker_progress() between them.
//
// Why cancel here at all (UCX FORCE close already cancels endpoint
// operations):
// tag_recv requests are worker-scoped (ucp_tag_recv_nbx(worker, ...)),
// not endpoint-scoped, so ucp_ep_close_nbx(FORCE) leaves them pending.
// Without ucp_request_cancel() here, an `await ep.close()` running
// alongside an outstanding `await ep.recv()` would hang forever.
// See test_shutdown.py::test_{server,client}_shutdown.
//
// Why atomic with FORCE close (not as a separate pre-callback):
// When cancelAll and FORCE close were separate pre-callbacks (the
// old cancelInflightRequestsBlocking path), a full ucp_worker_progress()
// ran between them. That intermediate progress could leave UCT-level
// TCP pending entries half-dispatched (mid-cuMemcpyAsync staging of
// a CUDA send); the next progress after FORCE close then crashed
// dispatching them on a freed staging buffer (uct_cuda_copy_ep_get_short
// -> cuMemcpyAsync -> SIGSEGV). Running them in a single pre-callback
// matches the safe single-threaded ordering proven by the regression
// test in cpp/tests/endpoint_close_force_tcp_cuda_race.cpp.
if (!worker->registerGenericPre(
      [this, &status, &param]() { status = ucp_ep_close_nbx(_handle, &param); }, period))
      [this, &status, &param]() {
        _inflightRequests->cancelAll();
        status = ucp_ep_close_nbx(_handle, &param);
        // Invalidate _handle synchronously immediately, to prevent
        // time window where _handle points to freed UCP memory, usually
        // observed in `populateDelayedSubmission()`.
        _originalHandle = _handle;
        _handle = nullptr;
      },
      period))
  continue;
submitted = true;
This doesn't seem to be anything to do with request attributes.
Same here, accidentally applied patch to the wrong branch, reverted.
// When cancelAll and FORCE close were separate pre-callbacks (the
// old cancelInflightRequestsBlocking path), a full ucp_worker_progress()
// ran between them. That intermediate progress could leave UCT-level
// TCP pending entries half-dispatched (mid-cuMemcpyAsync staging of
// a CUDA send); the next progress after FORCE close then crashed
// dispatching them on a freed staging buffer (uct_cuda_copy_ep_get_short
// -> cuMemcpyAsync -> SIGSEGV). Running them in a single pre-callback
// matches the safe single-threaded ordering proven by the regression
// test in cpp/tests/endpoint_close_force_tcp_cuda_race.cpp.
Please rework this comment so that it makes sense independently of any historical route to the code we have now.
Something like:
Cancellation and forced closing of requests must happen without UCX progress so that handles are not left in a half-initialised state. For example, if cancellation and close are separate, progress could result in the requests being in the middle of a cuMemcpyAsync when the close runs.
?
Ugh, I'm really sorry about this. Those changes were for a completely different branch, I was developing and testing this and the other branch on a remote machine and scp-ing patches, and I accidentally applied this to the current branch. I've reverted this now.
if (UCS_PTR_IS_PTR(_request)) {
  auto queryStatus = ucp_request_query(_request, &result);
  if (queryStatus == UCS_OK && result.debug_string != nullptr) {
    _requestAttr.debugString = std::string(result.debug_string);
So we allocated space for the debug string, then we copy it. Why not resize the debug_string based on its size and std::move() it?
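Something along those lines could look like the sketch below; the `UCP_REQUEST_ATTR_FIELD_*` masks and the `debug_string_size` member follow the `ucp_request_attr_t` documentation and are assumptions here, not code from this PR (requires `<cstring>` for `std::strlen`):

```cpp
// Sketch: let UCP write straight into a std::string we own, then move it,
// instead of copying out of a separate buffer afterwards.
std::string debug(256, '\0');  // caller-provided buffer for the debug string

ucp_request_attr_t result{};
result.field_mask = UCP_REQUEST_ATTR_FIELD_INFO_STRING | UCP_REQUEST_ATTR_FIELD_INFO_STRING_SIZE;
result.debug_string      = debug.data();
result.debug_string_size = debug.size();

if (UCS_PTR_IS_PTR(_request) && ucp_request_query(_request, &result) == UCS_OK) {
  debug.resize(std::strlen(debug.c_str()));     // trim to the NUL-terminated length
  _requestAttr.debugString = std::move(debug);  // hand over the buffer, no copy
}
```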
std::lock_guard<std::recursive_mutex> lock(_mutex);

if (_isRequestAttrValid) return;
if (!_worker->isRequestAttributesEnabled()) return;
nit: Return early before trying to grab the lock if we're not enabled.
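A sketch of the suggested reordering, keeping the rest of the method as in the diff:

```cpp
void Request::queryRequestAttributes()
{
  // Cheap worker-level check first: the disabled (default) path never
  // touches the lock.
  if (!_worker->isRequestAttributesEnabled()) return;

  std::lock_guard<std::recursive_mutex> lock(_mutex);
  if (_isRequestAttrValid) return;

  // ... ucp_request_query() and attribute storage as before ...
}
```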
const std::string& Request::getOwnerString() const { return _ownerString; }

void Request::queryRequestAttributes()
This method is only called from publishRequest. I think we could just inline the implementation there.
if (!_worker->isRequestAttributesEnabled())
  throw ucxx::UnsupportedError(
    "Request attributes querying is disabled on the owning worker; build the worker "
    "with `ucxx::experimental::WorkerBuilder::requestAttributes(true)` to enable it");
Again, move this before grabbing the lock.
run_cpp_tests() {
  CMD_LINE="python ${TIMEOUT_TOOL_PATH} $((10*60)) ${GTESTS_PATH}/UCXX_TEST"
  CMD_LINE="python ${TIMEOUT_TOOL_PATH} $((20*60)) ${GTESTS_PATH}/UCXX_TEST"
Some of the slow runners are already running close to the current 10-minute timeout (a sample from last night ran from 07:19:36 until 07:28:13, for a total of 08:37 minutes); with the addition of new tests they occasionally do time out.
Add an opt-in mechanism for querying UCP request attributes (memory type, debug string) on every `ucxx::Request`, enabled via a new `ucxx::experimental::WorkerBuilder::requestAttributes(true)` option. When enabled, every `ucxx::Request` submit site funnels through a small `Request::publishRequest()` helper that stores the UCP handle and queries `ucp_request_query` under the existing `_mutex`. `ucp_request_free` moves from `Request::callback` into `Request::setStatus`, making the query and the free mutually exclusive without any new atomics or callback-side locking. The mechanism is wired through every request type and exposed to users via `Request::queryAttributes()`, which throws `ucxx::UnsupportedError` when the feature is disabled on the owning worker and `ucxx::NoElemError` when UCX took an inline path that produced no UCP handle to query (e.g., an eager UCX transfer).

The Tag, AM, and MemoryGet tests assert strictly above the rendezvous threshold, where UCX deterministically allocates a queryable request on every transport. Stream and MemoryPut use lenient assertions (a substring check on success; a throw is acceptable) because stream has no rendezvous protocol and small RMA puts are fire-and-forget, both of which have transport-dependent inline-completion behavior that no fixed size threshold can portably predict.

The toggle is worker-scoped: enabling it queries attributes on every request created from that worker, which has a potentially non-negligible per-request cost. Fine-grained per-request opt-in (so callers can attribute-query only the requests they care about) is not implemented here; it requires a builder-pattern constructor at the request level, which doesn't exist yet, and is deferred to a follow-up. For now, users who need attributes accept the worker-wide cost, and users who don't simply leave the default.
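For illustration, a hedged usage sketch of the API described above. Only `requestAttributes(true)`, `queryAttributes()`, and the two exception types come from this PR; the builder construction, `build()` call, `tagSend()` signature, and the assumption that `queryAttributes()` returns the Attributes struct with a `debugString` member are illustrative guesses:

```cpp
// Sketch only: exact builder entry point and send API may differ.
auto worker = ucxx::experimental::WorkerBuilder(context)
                .requestAttributes(true)  // opt in to per-request attribute queries
                .build();

auto request = endpoint->tagSend(buffer, size, ucxx::Tag{0});
try {
  auto attrs = request->queryAttributes();
  std::cout << attrs.debugString << std::endl;
} catch (const ucxx::UnsupportedError&) {
  // Worker was built without requestAttributes(true).
} catch (const ucxx::NoElemError&) {
  // UCX completed the operation inline (e.g. an eager transfer), so there
  // was no UCP request handle to query.
}
```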