Add support to query UCP debug info from requests #437

Open
pentschev wants to merge 53 commits into rapidsai:main from pentschev:request-stats

Conversation

@pentschev
Member

@pentschev pentschev commented Jun 6, 2025

Add an opt-in mechanism for querying UCP request attributes (memory type, debug string) on every ucxx::Request. Enabled via a new ucxx::experimental::WorkerBuilder::requestAttributes(true) option. When enabled, every ucxx::Request submit site funnels through a small Request::publishRequest() helper that stores the UCP handle and queries ucp_request_query under the existing _mutex. ucp_request_free moves from Request::callback into Request::setStatus, making the query and the free mutually exclusive without any new atomics or callback-side locking. Wired through every request type and exposed to users via Request::queryAttributes(), which throws ucxx::UnsupportedError when the feature is disabled on the owning worker and ucxx::NoElemError when UCX took an inline path that produced no UCP handle to query (e.g., an eager UCX transfer).
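The three outcomes described above (attributes returned, feature disabled, no handle to query) can be sketched with a self-contained toy model. Everything below is a hypothetical stand-in: `ToyWorker`, `ToyRequest`, and `describe` are illustration-only names, the two exception types merely mimic `ucxx::UnsupportedError` and `ucxx::NoElemError`, and the real `Request::queryAttributes()` may return a richer type than a string.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-ins for ucxx::UnsupportedError and ucxx::NoElemError.
struct UnsupportedError : std::runtime_error {
  using std::runtime_error::runtime_error;
};
struct NoElemError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Toy worker: carries only the worker-scoped opt-in flag.
struct ToyWorker {
  bool requestAttributesEnabled{false};
};

// Toy request: records "attributes" at construction when the worker opted in.
class ToyRequest {
 public:
  // hasUcpHandle == false models an inline completion with no handle to query.
  ToyRequest(std::shared_ptr<ToyWorker> worker, bool hasUcpHandle) : _worker(std::move(worker))
  {
    assert(_worker != nullptr);
    if (_worker->requestAttributesEnabled && hasUcpHandle) {
      _debugString = "toy debug string";
      _queried     = true;
    }
  }

  std::string queryAttributes() const
  {
    if (!_worker->requestAttributesEnabled)
      throw UnsupportedError("request attributes are disabled on the owning worker");
    if (!_queried) throw NoElemError("no UCP handle was available to query");
    return _debugString;
  }

 private:
  std::shared_ptr<ToyWorker> _worker;
  std::string _debugString;
  bool _queried{false};
};

// Caller-side handling of the three possible outcomes.
std::string describe(const ToyRequest& request)
{
  try {
    return "attributes: " + request.queryAttributes();
  } catch (const UnsupportedError&) {
    return "disabled on worker";
  } catch (const NoElemError&) {
    return "no handle to query";
  }
}
```

In this model a caller that cannot know in advance whether UCX completed inline treats `NoElemError` as an expected outcome rather than a failure.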

The Tag, AM, and MemoryGet tests use strict assertions and run above the rendezvous threshold, where UCX deterministically allocates a queryable request on every transport. The Stream and MemoryPut tests use lenient assertions (a substring check on success, and a throw is acceptable) because stream has no rendezvous protocol and small RMA puts are fire-and-forget. Both have transport-dependent inline-completion behavior that no fixed size threshold can portably predict.
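The difference between the strict and lenient assertion styles can be sketched as two small helpers. This is only an illustration under assumptions: `strictCheck`, `lenientCheck`, and the local `NoElemError` are hypothetical names, and the actual tests in this PR may be structured differently.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for ucxx::NoElemError.
struct NoElemError : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Strict: the query must succeed and its result must contain the expected text.
template <typename Query>
bool strictCheck(Query query, const std::string& expected)
{
  assert(!expected.empty());  // an empty substring would match anything
  try {
    return query().find(expected) != std::string::npos;
  } catch (const NoElemError&) {
    return false;  // above the rendezvous threshold a handle is expected
  }
}

// Lenient: a successful query must contain the expected text, but a
// NoElemError (inline completion, nothing to query) also passes.
template <typename Query>
bool lenientCheck(Query query, const std::string& expected)
{
  assert(!expected.empty());
  try {
    return query().find(expected) != std::string::npos;
  } catch (const NoElemError&) {
    return true;
  }
}
```

Note that the lenient form still rejects a successful query that returns the wrong content; it only tolerates the absence of a handle.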

The toggle is worker-scoped: enabling it queries attributes on every request created from that worker, which has a potentially non-negligible per-request cost. Fine-grained per-request opt-in (so callers can attribute-query only the requests they care about) is not implemented here. It requires a builder-pattern constructor at the request level, which doesn't exist yet, and is deferred to a follow-up. For now, users who need attributes accept the worker-wide cost, and users who don't can opt out by leaving the default.

@pentschev pentschev self-assigned this Jun 6, 2025
@pentschev pentschev added the "feature request" and "DO NOT MERGE" labels Jun 6, 2025
@copy-pr-bot

copy-pr-bot Bot commented Jun 6, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@copy-pr-bot

copy-pr-bot Bot commented Nov 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@pentschev pentschev changed the base branch from branch-0.45 to main November 12, 2025 16:38
ucp_request_query is safe at any point in a UCP request's lifetime
between obtaining a UCS_PTR_IS_PTR handle and calling ucp_request_free.
The previous code freed inside Request::callback (progress thread,
outside any lock) while queryRequestAttributes ran on the submit thread
under _mutex, causing a race in threaded progress modes that could free
the request out from under the query.

Move ucp_request_free into Request::setStatus, where it now executes
inside the lock that the submit thread already holds during publish +
query. Introduce a small Request::publishRequest helper so every submit
site (Tag, Stream, Mem, Am send and rendezvous-recv, Flush, EpClose)
performs the "store _request, query attributes" pair atomically under
the same _mutex. With both sides serialized on the same recursive mutex
the race is gone with no atomics, no new locking in the callback path
beyond what setStatus already takes, and no behavioral change to the
disabled (default) configuration.
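The locking arrangement in this commit message can be sketched with a toy model in which the query and the free both run under one recursive mutex. `ToyHandle` and `ToyRequest` are hypothetical stand-ins, not the ucxx classes: the string copy models `ucp_request_query` and the `freed` flag models `ucp_request_free`.

```cpp
#include <cassert>
#include <mutex>
#include <string>

// Toy stand-in for a UCP request handle; `freed` models ucp_request_free having run.
struct ToyHandle {
  std::string debug{"toy debug string"};
  bool freed{false};
};

class ToyRequest {
 public:
  // Submit side: store the handle and query it in one critical section.
  void publishRequest(ToyHandle* handle)
  {
    std::lock_guard<std::recursive_mutex> lock(_mutex);
    assert(handle != nullptr && !handle->freed);  // the free cannot have run yet
    _handle = handle;
    _debug  = _handle->debug;  // models ucp_request_query
  }

  // Completion side: the free happens under the same mutex as the query.
  void setStatus(int status)
  {
    std::lock_guard<std::recursive_mutex> lock(_mutex);
    _status = status;
    if (_handle != nullptr) {
      _handle->freed = true;  // models ucp_request_free
      _handle        = nullptr;
    }
  }

  // Progress-thread callback: it no longer frees anything, it only forwards.
  void callback(int status) { setStatus(status); }

  std::string debug()
  {
    std::lock_guard<std::recursive_mutex> lock(_mutex);
    return _debug;
  }

  int status()
  {
    std::lock_guard<std::recursive_mutex> lock(_mutex);
    return _status;
  }

 private:
  std::recursive_mutex _mutex;
  ToyHandle* _handle{nullptr};
  std::string _debug;
  int _status{-1};
};
```

Because both paths take the same mutex, a callback arriving from another thread either completes before the publish starts or waits until the query has finished; it can never free the handle halfway through the query.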
@pentschev
Member Author

/ok to test

@pentschev
Member Author

/ok to test

@pentschev
Member Author

/ok to test

@pentschev
Member Author

/ok to test ed307b3

@pentschev pentschev added the "non-breaking" label and removed the "DO NOT MERGE" label May 12, 2026
@pentschev pentschev marked this pull request as ready for review May 12, 2026 16:34
@pentschev pentschev requested a review from a team as a code owner May 12, 2026 16:34
Comment thread cpp/include/ucxx/request.h Outdated
Comment on lines +66 to +67
Attributes _requestAttr{}; ///< Request attributes queried when request is posted
bool _isRequestAttrValid{false}; ///< Whether the request attributes are valid
Contributor

A default-constructed attribute has UCS_MEMORY_TYPE_UNKNOWN. But a valid one must have a known memory type, can we use that to check if the request attribute is valid rather than having a separate flag?

Member Author

Done in 6cd44e5 .
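The suggestion in this thread, using the unknown memory type as the validity sentinel instead of a separate flag, can be sketched as follows. This is a minimal illustration, not the contents of the referenced commit: the `MemoryType` enum is a hypothetical mirror of the UCS memory types, and `fromSuccessfulQuery` is an illustration-only helper that assumes a successful query always reports a known type.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical mirror of the UCS memory-type enum; only the sentinel matters here.
enum class MemoryType { Host, Cuda, Unknown };

struct Attributes {
  MemoryType memoryType{MemoryType::Unknown};  // default doubles as "not queried"
  std::string debugString{};

  // Valid exactly when a query filled in a known memory type.
  bool isValid() const { return memoryType != MemoryType::Unknown; }
};

// Builds attributes from a successful query; the assumption under which the
// sentinel works is that such a query never reports an unknown type.
Attributes fromSuccessfulQuery(MemoryType type, std::string debug)
{
  assert(type != MemoryType::Unknown);
  Attributes attr;
  attr.memoryType  = type;
  attr.debugString = std::move(debug);
  return attr;
}
```

The trade-off is one fewer member to keep in sync, at the cost of relying on the invariant stated in the assertion.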

Comment thread cpp/include/ucxx/request.h Outdated
Comment on lines +299 to +301
* Every submit site (all `request` methods from child classes and the AM
* rendezvous-receive path) calls this after obtaining the request handle from the
* corresponding `ucp_*_nbx` function.
Contributor

Trying to list the submission sites like this parenthetical is a recipe for documentation going out of date.

Member Author

Good point, cleaned up in 3716e5d .

Comment thread cpp/include/ucxx/worker.h Outdated
Comment thread cpp/src/endpoint.cpp Outdated
Comment on lines 291 to 324
// Cancel inflight requests and submit FORCE close ATOMICALLY in a
// single pre-callback, with no ucp_worker_progress() between them.
//
// Why cancel here at all (UCX FORCE close already cancels endpoint
// operations):
// tag_recv requests are worker-scoped (ucp_tag_recv_nbx(worker, ...)),
// not endpoint-scoped, so ucp_ep_close_nbx(FORCE) leaves them pending.
// Without ucp_request_cancel() here, an `await ep.close()` running
// alongside an outstanding `await ep.recv()` would hang forever.
// See test_shutdown.py::test_{server,client}_shutdown.
//
// Why atomic with FORCE close (not as a separate pre-callback):
// When cancelAll and FORCE close were separate pre-callbacks (the
// old cancelInflightRequestsBlocking path), a full ucp_worker_progress()
// ran between them. That intermediate progress could leave UCT-level
// TCP pending entries half-dispatched (mid-cuMemcpyAsync staging of
// a CUDA send); the next progress after FORCE close then crashed
// dispatching them on a freed staging buffer (uct_cuda_copy_ep_get_short
// -> cuMemcpyAsync -> SIGSEGV). Running them in a single pre-callback
// matches the safe single-threaded ordering proven by the regression
// test in cpp/tests/endpoint_close_force_tcp_cuda_race.cpp.
if (!worker->registerGenericPre(
[this, &status, &param]() { status = ucp_ep_close_nbx(_handle, &param); }, period))
[this, &status, &param]() {
_inflightRequests->cancelAll();
status = ucp_ep_close_nbx(_handle, &param);
// Invalidate `_handle` synchronously and immediately, to prevent a
// time window where `_handle` points to freed UCP memory, usually
// observed in `populateDelayedSubmission()`.
_originalHandle = _handle;
_handle = nullptr;
},
period))
continue;
submitted = true;
Contributor

This doesn't seem to be anything to do with request attributes.

Member Author

Same here, accidentally applied patch to the wrong branch, reverted.

Comment thread cpp/src/endpoint.cpp Outdated
Comment on lines +303 to +311
// When cancelAll and FORCE close were separate pre-callbacks (the
// old cancelInflightRequestsBlocking path), a full ucp_worker_progress()
// ran between them. That intermediate progress could leave UCT-level
// TCP pending entries half-dispatched (mid-cuMemcpyAsync staging of
// a CUDA send); the next progress after FORCE close then crashed
// dispatching them on a freed staging buffer (uct_cuda_copy_ep_get_short
// -> cuMemcpyAsync -> SIGSEGV). Running them in a single pre-callback
// matches the safe single-threaded ordering proven by the regression
// test in cpp/tests/endpoint_close_force_tcp_cuda_race.cpp.
Contributor

Please rework this comment so that it makes sense independently of any historical route to the code we have now.

Something like:

Cancellation and forced closing of requests must happen without UCX progress so that handles are not left in a half-initialised state. For example, if cancellation and close are separate, progress could result in the requests being in the middle of a cuMemcpyAsync when the close runs.

?

Member Author

Ugh, I'm really sorry about this. Those changes were for a completely different branch, I was developing and testing this and the other branch on a remote machine and scp-ing patches, and I accidentally applied this to the current branch. I've reverted this now.

Comment thread cpp/src/request.cpp Outdated
Comment thread cpp/src/request.cpp Outdated
if (UCS_PTR_IS_PTR(_request)) {
auto queryStatus = ucp_request_query(_request, &result);
if (queryStatus == UCS_OK && result.debug_string != nullptr) {
_requestAttr.debugString = std::string(result.debug_string);
Contributor

So we allocated space for the debug string, then we copy it. Why not resize the debug_string based on its size and std::move() it?

Member Author

Done in a1a5c17 .
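The idea raised in this thread, letting the query write straight into a `std::string` and then trimming and moving it, can be sketched as below. This is not the contents of the referenced commit. `toyQuery` is a hypothetical stand-in for a C API that fills a caller-provided buffer with a NUL-terminated string, which is how the sketch assumes the `debug_string` field of `ucp_request_query` behaves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical C-style query: writes a NUL-terminated string into `buffer`,
// truncating to `size` bytes including the terminator.
void toyQuery(char* buffer, std::size_t size)
{
  std::snprintf(buffer, size, "%s", "tag send, 1024 bytes");
}

// Let the query write into the string's own storage, then trim it in place.
std::string queryDebugString(std::size_t capacity = 64)
{
  assert(capacity > 0);
  std::string buffer(capacity, '\0');
  toyQuery(&buffer[0], buffer.size());
  buffer.resize(std::strlen(buffer.c_str()));  // keep only the bytes actually written
  return buffer;  // moved or elided into the caller, no second copy
}
```

Assigning the result to a member, for example `attr.debugString = queryDebugString();`, then uses move assignment, so the characters are written once and never copied.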

Comment thread cpp/src/request.cpp Outdated
std::lock_guard<std::recursive_mutex> lock(_mutex);

if (_isRequestAttrValid) return;
if (!_worker->isRequestAttributesEnabled()) return;
Contributor

nit: Return early before trying to grab the lock if we're not enabled.

Member Author

Done in 3e92f4b .
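The reordering requested in this thread can be sketched as follows, with a counter added purely to make the effect observable. `ToyRequest` and its members are hypothetical, and the sketch assumes the enabled flag is fixed at construction, which is what makes reading it outside the lock safe.

```cpp
#include <cassert>
#include <mutex>
#include <string>

class ToyRequest {
 public:
  explicit ToyRequest(bool attributesEnabled) : _attributesEnabled(attributesEnabled) {}

  void publish(const char* debug)
  {
    if (!_attributesEnabled) return;  // checked first: the disabled path never locks
    std::lock_guard<std::recursive_mutex> lock(_mutex);
    ++_lockAcquisitions;
    assert(debug != nullptr);
    _debug = debug;
  }

  int lockAcquisitions() const { return _lockAcquisitions; }
  const std::string& debug() const { return _debug; }

 private:
  const bool _attributesEnabled;  // immutable after construction, safe to read unlocked
  std::recursive_mutex _mutex;
  std::string _debug;
  int _lockAcquisitions{0};
};
```

With the check hoisted, requests on a worker that left the feature disabled pay only for a boolean test.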

Comment thread cpp/src/request.cpp Outdated

const std::string& Request::getOwnerString() const { return _ownerString; }

void Request::queryRequestAttributes()
Contributor

This method is only called from publishRequest. I think we could just inline the implementation there.

Member Author

Done in a37d5e0 .

Comment thread cpp/src/request.cpp
Comment on lines +296 to +299
if (!_worker->isRequestAttributesEnabled())
throw ucxx::UnsupportedError(
"Request attributes querying is disabled on the owning worker; build the worker "
"with `ucxx::experimental::WorkerBuilder::requestAttributes(true)` to enable it");
Contributor

Again, move this before grabbing the lock.

Member Author

Done in 3e92f4b .

Member Author

@pentschev pentschev left a comment

Thanks @wence- for the review. I think I have addressed all your comments, please take another look when you have a chance.

@pentschev pentschev requested a review from a team as a code owner May 13, 2026 14:05
@pentschev pentschev requested a review from jameslamb May 13, 2026 14:05
Comment thread ci/run_cpp.sh

run_cpp_tests() {
- CMD_LINE="python ${TIMEOUT_TOOL_PATH} $((10*60)) ${GTESTS_PATH}/UCXX_TEST"
+ CMD_LINE="python ${TIMEOUT_TOOL_PATH} $((20*60)) ${GTESTS_PATH}/UCXX_TEST"
Member Author

@pentschev pentschev May 13, 2026

Some of the slow runners are already running close to the current 10-minute timeout (a sample from last night ran from 07:19:36 until 07:28:13, for a total of 8 minutes 37 seconds). With the addition of new tests, they occasionally time out.

@pentschev pentschev requested a review from wence- May 13, 2026 14:20