Add support to query UCP debug info from requests #437
Conversation
ucp_request_query is safe at any point in a UCP request's lifetime between obtaining a UCS_PTR_IS_PTR handle and calling ucp_request_free. The previous code freed inside Request::callback (progress thread, outside any lock) while queryRequestAttributes ran on the submit thread under _mutex, causing a race in threaded progress modes that could free the request out from under the query.

Move ucp_request_free into Request::setStatus, where it now executes inside the lock that the submit thread already holds during publish + query. Introduce a small Request::publishRequest helper so every submit site (Tag, Stream, Mem, Am send and rendezvous-recv, Flush, EpClose) performs the "store _request, query attributes" pair atomically under the same _mutex.

With both sides serialized on the same recursive mutex the race is gone with no atomics, no new locking in the callback path beyond what setStatus already takes, and no behavioral change to the disabled (default) configuration.
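A minimal sketch of the synchronization pattern described above, using the member names from this PR (control flow is simplified and member types such as `_status` are assumed; this is not the literal implementation):

```cpp
// Submit thread: store the UCP handle and query its attributes atomically
// under _mutex, via the new publishRequest() helper.
void Request::publishRequest(void* request)
{
  std::lock_guard<std::recursive_mutex> lock(_mutex);
  _request = request;
  queryRequestAttributes();  // ucp_request_query() while the handle is still valid
}

// setStatus() now also frees the UCP handle, under the same recursive _mutex,
// so the free can never race the query above.
void Request::setStatus(ucs_status_t status)
{
  std::lock_guard<std::recursive_mutex> lock(_mutex);
  _status = status;
  if (UCS_PTR_IS_PTR(_request)) {
    ucp_request_free(_request);
    _request = nullptr;
  }
}
```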
/ok to test ed307b3
Attributes _requestAttr{};        ///< Request attributes queried when request is posted
bool _isRequestAttrValid{false};  ///< Whether the request attributes are valid
A default-constructed attribute has UCS_MEMORY_TYPE_UNKNOWN, but a valid one must have a known memory type; can we use that to check whether the request attributes are valid rather than keeping a separate flag?
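A minimal sketch of what that could look like, assuming the Attributes struct stores the queried memory type in a `memoryType` member that defaults to `UCS_MEMORY_TYPE_UNKNOWN` (the member name is hypothetical, not taken from this diff):

```cpp
// Hypothetical: rely on the queried memory type instead of a separate flag.
Attributes _requestAttr{};  // default-constructed => UCS_MEMORY_TYPE_UNKNOWN

bool isRequestAttrValid() const
{
  // A successfully queried request always reports a concrete memory type,
  // so "unknown" doubles as "never queried / not valid".
  return _requestAttr.memoryType != UCS_MEMORY_TYPE_UNKNOWN;
}
```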
 * Every submit site (all `request` methods from child classes and the AM
 * rendezvous-receive path) calls this after obtaining the request handle from the
 * corresponding `ucp_*_nbx` function.
Trying to list the submission sites in a parenthetical like this is a recipe for the documentation going out of date.
// Cancel inflight requests and submit FORCE close ATOMICALLY in a
// single pre-callback, with no ucp_worker_progress() between them.
//
// Why cancel here at all (UCX FORCE close already cancels endpoint
// operations):
// tag_recv requests are worker-scoped (ucp_tag_recv_nbx(worker, ...)),
// not endpoint-scoped, so ucp_ep_close_nbx(FORCE) leaves them pending.
// Without ucp_request_cancel() here, an `await ep.close()` running
// alongside an outstanding `await ep.recv()` would hang forever.
// See test_shutdown.py::test_{server,client}_shutdown.
//
// Why atomic with FORCE close (not as a separate pre-callback):
// When cancelAll and FORCE close were separate pre-callbacks (the
// old cancelInflightRequestsBlocking path), a full ucp_worker_progress()
// ran between them. That intermediate progress could leave UCT-level
// TCP pending entries half-dispatched (mid-cuMemcpyAsync staging of
// a CUDA send); the next progress after FORCE close then crashed
// dispatching them on a freed staging buffer (uct_cuda_copy_ep_get_short
// -> cuMemcpyAsync -> SIGSEGV). Running them in a single pre-callback
// matches the safe single-threaded ordering proven by the regression
// test in cpp/tests/endpoint_close_force_tcp_cuda_race.cpp.
if (!worker->registerGenericPre(
      [this, &status, &param]() { status = ucp_ep_close_nbx(_handle, &param); }, period))
      [this, &status, &param]() {
        _inflightRequests->cancelAll();
        status = ucp_ep_close_nbx(_handle, &param);
        // Invalidate _handle synchronously immediately, to prevent
        // time window where _handle points to freed UCP memory, usually
        // observed in `populateDelayedSubmission()`.
        _originalHandle = _handle;
        _handle = nullptr;
      },
      period))
  continue;
submitted = true;
This doesn't seem to be anything to do with request attributes.
Same here, accidentally applied patch to the wrong branch, reverted.
// When cancelAll and FORCE close were separate pre-callbacks (the
// old cancelInflightRequestsBlocking path), a full ucp_worker_progress()
// ran between them. That intermediate progress could leave UCT-level
// TCP pending entries half-dispatched (mid-cuMemcpyAsync staging of
// a CUDA send); the next progress after FORCE close then crashed
// dispatching them on a freed staging buffer (uct_cuda_copy_ep_get_short
// -> cuMemcpyAsync -> SIGSEGV). Running them in a single pre-callback
// matches the safe single-threaded ordering proven by the regression
// test in cpp/tests/endpoint_close_force_tcp_cuda_race.cpp.
Please rework this comment so that it makes sense independently of any historical route to the code we have now.
Something like:
Cancellation and forced closing of requests must happen without UCX progress so that handles are not left in a half-initialised state. For example, if cancellation and close are separate, progress could result in the requests being in the middle of a cuMemcpyAsync when the close runs.
?
Ugh, I'm really sorry about this. Those changes were for a completely different branch, I was developing and testing this and the other branch on a remote machine and scp-ing patches, and I accidentally applied this to the current branch. I've reverted this now.
if (UCS_PTR_IS_PTR(_request)) {
  auto queryStatus = ucp_request_query(_request, &result);
  if (queryStatus == UCS_OK && result.debug_string != nullptr) {
    _requestAttr.debugString = std::string(result.debug_string);
So we allocated space for the debug string, then we copy it. Why not resize the debug_string based on its size and std::move() it?
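Something along those lines could look like the sketch below; the `UCP_REQUEST_ATTR_FIELD_*` masks and the `debug_string_size` member follow the `ucp_request_attr_t` documentation and are assumptions here, not code from this PR (requires `<cstring>` for `std::strlen`):

```cpp
// Sketch: let UCP write straight into a std::string we own, then move it,
// instead of copying out of a separate buffer afterwards.
std::string debug(256, '\0');  // caller-provided buffer for the debug string

ucp_request_attr_t result{};
result.field_mask = UCP_REQUEST_ATTR_FIELD_INFO_STRING | UCP_REQUEST_ATTR_FIELD_INFO_STRING_SIZE;
result.debug_string      = debug.data();
result.debug_string_size = debug.size();

if (UCS_PTR_IS_PTR(_request) && ucp_request_query(_request, &result) == UCS_OK) {
  debug.resize(std::strlen(debug.c_str()));     // trim to the NUL-terminated length
  _requestAttr.debugString = std::move(debug);  // hand over the buffer, no copy
}
```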
std::lock_guard<std::recursive_mutex> lock(_mutex);

if (_isRequestAttrValid) return;
if (!_worker->isRequestAttributesEnabled()) return;
nit: Return early before trying to grab the lock if we're not enabled.
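A sketch of the suggested reordering, keeping the rest of the method as in the diff:

```cpp
void Request::queryRequestAttributes()
{
  // Cheap worker-level check first: the disabled (default) path never
  // touches the lock.
  if (!_worker->isRequestAttributesEnabled()) return;

  std::lock_guard<std::recursive_mutex> lock(_mutex);
  if (_isRequestAttrValid) return;

  // ... ucp_request_query() and attribute storage as before ...
}
```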
const std::string& Request::getOwnerString() const { return _ownerString; }

void Request::queryRequestAttributes()
This method is only called from publishRequest. I think we could just inline the implementation there.
if (!_worker->isRequestAttributesEnabled())
  throw ucxx::UnsupportedError(
    "Request attributes querying is disabled on the owning worker; build the worker "
    "with `ucxx::experimental::WorkerBuilder::requestAttributes(true)` to enable it");
Again, move this before grabbing the lock.
run_cpp_tests() {
  CMD_LINE="python ${TIMEOUT_TOOL_PATH} $((10*60)) ${GTESTS_PATH}/UCXX_TEST"
  CMD_LINE="python ${TIMEOUT_TOOL_PATH} $((20*60)) ${GTESTS_PATH}/UCXX_TEST"
Some of the slow runners are already running close to the current 10-minute timeout (a sample from last night ran from 07:19:36 until 07:28:13, for a total of 08:37 minutes); with the addition of new tests they occasionally do time out.
Add an opt-in mechanism for querying UCP request attributes (memory type, debug string) on every `ucxx::Request`, enabled via a new `ucxx::experimental::WorkerBuilder::requestAttributes(true)` option. When enabled, every `ucxx::Request` submit site funnels through a small `Request::publishRequest()` helper that stores the UCP handle and queries `ucp_request_query` under the existing `_mutex`. `ucp_request_free` moves from `Request::callback` into `Request::setStatus`, making the query and the free mutually exclusive without any new atomics or callback-side locking. The mechanism is wired through every request type and exposed to users via `Request::queryAttributes()`, which throws `ucxx::UnsupportedError` when the feature is disabled on the owning worker and `ucxx::NoElemError` when UCX took an inline path that produced no UCP handle to query (e.g., an eager UCX transfer).

The Tag, AM, and MemoryGet tests assert strictly above the rendezvous threshold, where UCX deterministically allocates a queryable request on every transport. Stream and MemoryPut use lenient assertions (a substring check on success; a throw is acceptable) because stream has no rendezvous protocol and small RMA puts are fire-and-forget, both of which have transport-dependent inline-completion behavior that no fixed size threshold can portably predict.

The toggle is worker-scoped: enabling it queries attributes on every request created from that worker, which has a potentially non-negligible per-request cost. Fine-grained per-request opt-in (so callers can attribute-query only the requests they care about) is not implemented here; it requires a builder-pattern constructor at the request level, which doesn't exist yet, and is deferred to a follow-up. For now, users who need attributes accept the worker-wide cost, and users who don't simply leave the default.
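For illustration, a hedged usage sketch of the API described above. Only `requestAttributes(true)`, `queryAttributes()`, and the two exception types come from this PR; the builder construction, `build()` call, `tagSend()` signature, and the assumption that `queryAttributes()` returns the Attributes struct with a `debugString` member are illustrative guesses:

```cpp
// Sketch only: exact builder entry point and send API may differ.
auto worker = ucxx::experimental::WorkerBuilder(context)
                .requestAttributes(true)  // opt in to per-request attribute queries
                .build();

auto request = endpoint->tagSend(buffer, size, ucxx::Tag{0});
try {
  auto attrs = request->queryAttributes();
  std::cout << attrs.debugString << std::endl;
} catch (const ucxx::UnsupportedError&) {
  // Worker was built without requestAttributes(true).
} catch (const ucxx::NoElemError&) {
  // UCX completed the operation inline (e.g. an eager transfer), so there
  // was no UCP request handle to query.
}
```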