Active Message APIs: support scatter-gather I/O and user-defined header data (Part 1: C++ APIs) #594

Merged
rapids-bot[bot] merged 19 commits into rapidsai:main from grlee77:grelee/ucxx-iov-updates-cpp
Mar 3, 2026
@grlee77
Contributor

@grlee77 grlee77 commented Feb 18, 2026

Background

UCX supports scatter-gather I/O through its IOV (I/O Vector) datatype, which allows sending data from multiple non-contiguous memory buffers in a single operation. This avoids the need to copy disjoint buffers into a single contiguous allocation before sending, which is important for workloads that produce multi-segment messages (e.g., a serialized header followed by one or more tensor payloads).

Until now, UCXX's Active Message (AM) send API only supported contiguous buffers. Callers needing to send data from multiple buffers had to either concatenate into a staging buffer or issue multiple separate sends. This PR adds first-class IOV support and a structured parameter object that exposes additional UCX knobs without breaking existing callers.

The concrete motivation is a Holoscan SDK use case. An existing internal GXF UCX extension used by Holoscan uses the UCX C APIs to send and receive a C++ data structure that is essentially a std::unordered_map of named tensors (i.e., the C++ equivalent of Python's dict[str, ndarray]). We would like to use similar functionality elsewhere via the UCXX C++ APIs.

The approach already verified to work with UCX C APIs (ucp_am_send_nbx) is:

  1. serialize tensor info like shape, dtype, strides into the AM header (always host data)
  2. provide a vector of data pointers to the AM iov API (all tensors must be on the same device)

The equivalent behavior is not currently possible with UCXX due to the following limitations:

  1. flags passed to AM send calls were hard-coded with no way to modify them
  2. no way to specify additional user data in the AM header
  3. no way to use UCX's scatter-gather I/O through the I/O Vector (IOV) datatype

For Holoscan purposes we do not currently need the Python APIs at all, but I have a follow-up PR that adds them for completeness and to follow the UCXX convention of exposing functionality to Python.

IOV support for active messages

New types (cpp/include/ucxx/typedefs.h):

  • AmSendMemoryTypePolicy enum — controls receiver-side allocation behavior when no allocator is registered for the sender's memory type. FallbackToHost (default) silently falls back to host memory; ErrorOnUnsupported fails the receive with UCS_ERR_UNSUPPORTED.
  • AmSendParams struct — groups send flags, UCX datatype, memory type hint, memory type policy, and optional receiver callback info into a single parameter object.
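
The two policies can be pictured as a receiver-side decision. The following is a minimal sketch, not the UCXX implementation: only the enumerator names come from this PR, while `MemoryType`, `chooseAllocation`, and the set of registered allocators are hypothetical names used for illustration. It also assumes that host memory can always be allocated.

```cpp
#include <cassert>
#include <optional>
#include <set>

// Enumerator names are taken from the PR description; everything else is hypothetical.
enum class AmSendMemoryTypePolicy { FallbackToHost, ErrorOnUnsupported };

// Hypothetical stand-in for UCX's ucs_memory_type_t.
enum class MemoryType { Host, Cuda };

// Returns the memory type the receiver would allocate, or std::nullopt when the
// receive has to fail (the case the PR reports as UCS_ERR_UNSUPPORTED).
std::optional<MemoryType> chooseAllocation(MemoryType senderType,
                                           AmSendMemoryTypePolicy policy,
                                           const std::set<MemoryType>& registeredAllocators)
{
  // Assumption of this sketch: host memory needs no registered allocator.
  if (senderType == MemoryType::Host || registeredAllocators.count(senderType) > 0) {
    return senderType;
  }
  if (policy == AmSendMemoryTypePolicy::FallbackToHost) {
    return MemoryType::Host;  // silent fallback, the default policy
  }
  return std::nullopt;  // strict policy: no allocator, so the receive fails
}
```

With no CUDA allocator registered, a CUDA send under FallbackToHost yields a host allocation, while the same send under ErrorOnUnsupported yields no allocation at all.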

New Endpoint::amSend overloads (cpp/include/ucxx/endpoint.h, cpp/src/endpoint.cpp):

  • amSend(buffer, length, AmSendParams) — contiguous send with explicit policy controls.
  • amSend(std::vector<ucp_dt_iov_t>, AmSendParams) — scatter-gather IOV send. UCX receives the segments as a single logical message; the receiver sees the reassembled contiguous data.
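
To illustrate what "a single logical message" means for the receiver, here is a small host-only sketch that concatenates the segments in order. It does not call UCX: `IovSegment` is a hypothetical stand-in with the same `buffer` and `length` fields as `ucp_dt_iov_t`, and `gather` is an illustrative helper, not a UCXX function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in with the same two fields as ucp_dt_iov_t.
struct IovSegment {
  void* buffer;
  std::size_t length;
};

// Models the receiver's view of an IOV send: all segments concatenated in order.
std::vector<unsigned char> gather(const std::vector<IovSegment>& iov)
{
  std::size_t total = 0;
  for (const auto& seg : iov) { total += seg.length; }

  std::vector<unsigned char> out(total);
  std::size_t offset = 0;
  for (const auto& seg : iov) {
    if (seg.length == 0) { continue; }  // nothing to copy; the buffer may be null
    assert(seg.buffer != nullptr);      // non-empty segments need a valid buffer
    std::memcpy(out.data() + offset, seg.buffer, seg.length);
    offset += seg.length;
  }
  return out;
}
```

In the real send path no such staging copy is made by the caller; the sketch only shows the byte layout the receiver ends up with.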

The original amSend(buffer, length, memoryType, ...) overload is preserved unchanged.

Request data layer (cpp/include/ucxx/request_data.h, cpp/src/request_data.cpp):

  • data::AmSend gains an IOV constructor that stores the segment vector, validates entries (non-empty list, non-null buffers for non-zero lengths, correct datatype), and sets _count to the number of IOV entries.
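
The entry checks listed above could look roughly like the sketch below. It is an illustration under assumptions, not the code in request_data.cpp: `IovSegment` and `validateIov` are hypothetical names, the exception type is a guess, and the datatype check is omitted because it depends on UCX definitions.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in with the same two fields as ucp_dt_iov_t.
struct IovSegment {
  void* buffer;
  std::size_t length;
};

// Rejects an empty list and any segment that claims bytes but has no buffer.
// Zero-length segments with a null buffer are accepted.
void validateIov(const std::vector<IovSegment>& iov)
{
  if (iov.empty()) { throw std::invalid_argument("IOV list must not be empty"); }
  for (const auto& seg : iov) {
    if (seg.buffer == nullptr && seg.length > 0) {
      throw std::invalid_argument("IOV segment has a null buffer but a non-zero length");
    }
  }
}
```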

AM send request path (cpp/src/request_am.cpp):

  • UCP_OP_ATTR_FIELD_DATATYPE is now included in op_attr_mask so UCX respects the datatype field.
  • For IOV sends, the IOV descriptor array pointer is passed to ucp_am_send_nbx instead of a raw buffer pointer. The lambda captures data::AmSend by const reference to ensure the descriptors remain valid for the duration of the async operation.
  • AmHeader serialization appends the AmSendMemoryTypePolicy byte. Deserialization reads it when present and defaults to FallbackToHost for backward compatibility with older headers.

User-defined AM header

UCX Active Messages support a separate header parameter in ucp_am_send_nbx that travels independently of the body payload. The header is always host memory regardless of the body's memory type, making it ideal for metadata (tensor names, shapes, dtypes, etc.) alongside device-memory payloads.

Previously, UCXX used the AM header internally for its own AmHeader struct (memoryType, memoryTypePolicy, receiverCallbackInfo) and provided no way for users to attach their own header data.

C++ changes:

  • AmSendParams gains a std::string userHeader field (opaque arbitrary bytes, not necessarily text).
  • data::AmSend and data::AmReceive carry the user header through the send and receive paths.
  • AmHeader serialization appends the user header size and data after the existing fields. Deserialization uses a bounds check so older senders that don't include a user header deserialize with an empty string.
  • Request::getRecvHeader() virtual method (returns empty string by default) with RequestAm::getRecvHeader() override that returns the user header from the received message.
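
The append-then-bounds-check idea behind the header format can be sketched as follows. The layout here is hypothetical and simplified (one legacy byte, one policy byte, a native-endian 64-bit size, then the user bytes); the actual AmHeader fields and encoding are not reproduced, and `Header`, `serialize`, and `deserialize` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical, simplified header: one legacy field followed by the appended ones.
struct Header {
  std::uint8_t memoryType{0};
  std::uint8_t policy{0};    // 0 stands for FallbackToHost in this sketch
  std::string userHeader{};  // opaque bytes, possibly containing '\0'
};

std::string serialize(const Header& h)
{
  std::string out;
  out.push_back(static_cast<char>(h.memoryType));
  out.push_back(static_cast<char>(h.policy));
  const std::uint64_t size = h.userHeader.size();
  out.append(reinterpret_cast<const char*>(&size), sizeof(size));  // native byte order
  out.append(h.userHeader);
  return out;
}

// Every read is guarded by a bounds check, so a shorter header from an older
// sender leaves the remaining fields at their defaults.
Header deserialize(const std::string& in)
{
  Header h;
  std::size_t offset = 0;
  if (in.size() < offset + 1) { return h; }
  h.memoryType = static_cast<std::uint8_t>(in[offset++]);
  if (in.size() < offset + 1) { return h; }  // no policy byte: keep FallbackToHost
  h.policy = static_cast<std::uint8_t>(in[offset++]);
  std::uint64_t size = 0;
  if (in.size() < offset + sizeof(size)) { return h; }  // no user header present
  std::memcpy(&size, in.data() + offset, sizeof(size));
  offset += sizeof(size);
  if (in.size() - offset < size) { return h; }  // truncated payload: leave it empty
  h.userHeader.assign(in.data() + offset, static_cast<std::size_t>(size));
  return h;
}
```

A one-byte input, standing in for a header written before the new fields existed, deserializes with the default policy and an empty user header.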

Size limits: The user header is serialized into the AM header parameter of ucp_am_send_nbx, which is subject to transport-level size limits (e.g., ~8 KiB default for TCP via UCX_TCP_TX_SEG_SIZE). Exceeding the limit causes a fatal UCX error. Keep user headers small (< 4 KiB recommended) or increase segment size environment variables.
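
Because exceeding the limit is fatal, a caller may want to reject oversized headers before sending. The guard below is a sketch under assumptions: `checkUserHeaderSize` is a hypothetical helper, and the 4096-byte default simply mirrors the recommendation above rather than a value queried from UCX.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// 4 KiB mirrors the recommendation above; it is not a limit queried from UCX.
constexpr std::size_t kRecommendedMaxUserHeader = 4096;

// Throws before the send is attempted if the header is not below the limit.
void checkUserHeaderSize(const std::string& userHeader,
                         std::size_t limit = kRecommendedMaxUserHeader)
{
  if (userHeader.size() >= limit) {
    throw std::length_error("user header too large: " + std::to_string(userHeader.size()) +
                            " bytes");
  }
}
```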

Backward compatibility

  • The existing amSend(buffer, length, memoryType, ...) signature is unchanged and continues to work as before.
  • AmSendParams defaults (flags = UCP_AM_SEND_FLAG_REPLY, datatype = contig(1), memoryType = HOST, policy = FallbackToHost) match the prior implicit behavior.
  • The serialized AM header is backward-compatible: the policy byte is appended at the end and older receivers that don't read it will default to FallbackToHost.
  • User header bytes are appended after the policy byte. Older receivers that don't read past the policy byte will silently ignore them. Older senders that don't include a user header will deserialize with an empty string (bounds check in deserialization).
  • AmSendParams::userHeader defaults to an empty string. Existing callers that don't set it are unaffected.
  • getRecvHeader() returns an empty string for non-AM requests and for AM receives from senders that didn't set a user header.

Usage Examples

Python API examples below are for context, but are not included in this PR. They are to be provided in a follow-up PR.

Contiguous buffer (new parameterized form)

auto params       = ucxx::AmSendParams{};
params.memoryType = UCS_MEMORY_TYPE_HOST;
auto req = endpoint->amSend(buffer, length, params);

IOV send

std::vector<ucp_dt_iov_t> iov(2);
iov[0].buffer = payload1_ptr;
iov[0].length = payload1_len;
iov[1].buffer = payload2_ptr;
iov[1].length = payload2_len;

auto params       = ucxx::AmSendParams{};
params.datatype   = UCP_DATATYPE_IOV;
params.memoryType = UCS_MEMORY_TYPE_HOST;
auto req = endpoint->amSend(iov, params);

# Python equivalent (follow-up PR):
await ep.am_send_iov([payload1, payload2])
# With an optional user-defined header for metadata:
await ep.am_send_iov([payload1, payload2], user_header=b'{"parts":2}')

Strict memory policy

auto params             = ucxx::AmSendParams{};
params.memoryType       = UCS_MEMORY_TYPE_CUDA;
params.memoryTypePolicy = ucxx::AmSendMemoryTypePolicy::ErrorOnUnsupported;
auto req = endpoint->amSend(buffer, length, params);

# Python equivalent (follow-up PR):
from ucxx._lib.libucxx import PythonAmSendMemoryTypePolicy

await ep.am_send(
    buffer,
    memory_type_policy=PythonAmSendMemoryTypePolicy.ErrorOnUnsupported,
)

User-defined header

auto params        = ucxx::AmSendParams{};
params.memoryType  = UCS_MEMORY_TYPE_HOST;
params.userHeader  = "{\"dtype\":\"float32\",\"shape\":[4,256,256]}";
auto req = endpoint->amSend(buffer, length, params);

# Python equivalent (follow-up PR):
await ep.am_send(buffer, user_header=b'{"dtype":"float32","shape":[4,256,256]}')

On the receive side, access the header after the request completes:

auto req = endpoint->amRecv();
// ... wait for completion ...
std::string header = req->getRecvHeader();  // empty if sender didn't set one
auto buf = req->getRecvBuffer();

# Python equivalent (follow-up PR):
buf, header = await ep.am_recv_with_header()  # header is b"" if not sent

add test for strict memory policy unsupported-path
The std::visit lambda was taking data::AmSend amSend by value, creating a temporary copy. For IOV sends, sendBuffer pointed to amSend._iov.data(), i.e. the copy's vector storage. When the lambda returned, the copy was destroyed, but ucp_am_send_nbx is async and still needed the IOV descriptors, causing a use-after-free. Changed to const data::AmSend& amSend so it references the original data in _requestData, which lives as long as the RequestAm object.
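
The difference between the two lambda signatures can be reproduced without UCX. The sketch below uses hypothetical stand-in types (`AmSendData`, `OtherData`, `RequestData`), not the UCXX ones, and only compares addresses while both objects are alive, so it never touches freed memory.

```cpp
#include <cassert>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the request-data alternatives; not the UCXX types.
struct AmSendData {
  std::vector<int> iov;  // plays the role of the IOV descriptor array
};
struct OtherData {};
using RequestData = std::variant<OtherData, AmSendData>;

// By value: `alt` is a copy, so its vector storage is distinct from the original's
// and is destroyed as soon as the lambda returns.
bool byValueSeesOriginal(const RequestData& data, const int* original)
{
  return std::visit(
    [original](auto alt) {
      if constexpr (std::is_same_v<decltype(alt), AmSendData>) {
        return alt.iov.data() == original;
      } else {
        return false;
      }
    },
    data);
}

// By const reference: `alt` is the object stored in the variant, so a pointer
// taken from it stays valid for as long as the variant itself lives.
bool byReferenceSeesOriginal(const RequestData& data, const int* original)
{
  return std::visit(
    [original](const auto& alt) {
      if constexpr (std::is_same_v<std::decay_t<decltype(alt)>, AmSendData>) {
        return alt.iov.data() == original;
      } else {
        return false;
      }
    },
    data);
}
```

For a variant holding a non-empty `AmSendData`, the by-value version reports a different address than the stored vector's, which is the pointer that would dangle once handed to an asynchronous call.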

Add UCP_OP_ATTR_FIELD_DATATYPE so UCX doesn't ignore the .datatype field
this is needed to include host-side information on tensor shape, strides, dtype etc when using I/O Vector (iov) APIs
@grlee77 grlee77 requested a review from a team as a code owner February 18, 2026 16:14
@copy-pr-bot

copy-pr-bot Bot commented Feb 18, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Comment thread cpp/include/ucxx/typedefs.h Outdated
AmSendMemoryTypePolicy::FallbackToHost}; ///< Receiver allocation policy.
std::optional<AmReceiverCallbackInfo> receiverCallbackInfo{
std::nullopt}; ///< Optional receiver callback metadata.
std::string userHeader{}; ///< Opaque user-defined header (arbitrary bytes, not necessarily text).
Contributor

question: Should we store std::vector<std::byte> rather than std::string here? WDYT?

Contributor Author

Agree that std::vector<std::byte> is semantically clearer for C++. The reason for this choice was consistency with the existing AmHeaderSerialized which has type std::string. I think the reason for std::string is likely that Cython provides <string><->bytes casting so in the Cython code this currently allows simply

params.userHeader = <string>user_header

where user_header is a Python bytes object (as in libucxx.pyx in #595)

If we change it then I think that Cython code becomes something like

  cdef vector[byte] header_vec
  header_vec.resize(len(user_header))
  memcpy(header_vec.data(), <const char*>user_header, len(user_header))
  params.userHeader = header_vec

Member

I'm fine with us breaking existing implementation or even API for clearer semantics, even if handling on the Cython side is a bit more brittle. With that said, I prefer that we treat C++ as first class citizen even if that means exposing code to Cython becomes sort of a second class citizen, so I would also prefer std::vector<std::byte> here.

Contributor

Ah yeah, ok, this is fine.

Contributor Author

Okay, let me try changing it and if it isn't too hard on the Python/Cython side we can make the change.

Comment thread cpp/src/request_data.cpp Outdated
AmSend::AmSend(const std::vector<ucp_dt_iov_t>& iov, const AmSendParams& params)
: _buffer(nullptr),
_length(0),
_iov(iov),
Contributor

question: I think this copies iov since _iov is not a reference?

Therefore, perhaps it would make it clearer to the caller if this ctor took the parameter by value std::vector<ucp_dt_iov_t> iov (rather than a reference) and initialised by _iov(std::move(iov)) here.

That way I'm not worried that I need to keep the reference alive for the lifetime of the returned shared_ptr Request.

WDYT?

Contributor Author

Thanks, I agree that this would be better

Contributor Author

updated in 1f6604d
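
For reference, the pass-by-value-then-move pattern discussed in this thread can be sketched as below. `IovSegment` and `AmSendSketch` are hypothetical stand-ins, not the UCXX classes. Note that the pattern only settles ownership of the descriptor list; the payload buffers the descriptors point to are not copied, so the caller still has to keep them alive until the send completes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in with the same two fields as ucp_dt_iov_t.
struct IovSegment {
  void* buffer;
  std::size_t length;
};

// Taking the vector by value makes the ownership transfer visible in the signature:
// lvalue arguments are copied, rvalue arguments are moved, and either way the
// object owns its descriptors afterwards.
class AmSendSketch {
 public:
  explicit AmSendSketch(std::vector<IovSegment> iov) : _iov(std::move(iov)) {}

  const std::vector<IovSegment>& iov() const { return _iov; }

 private:
  std::vector<IovSegment> _iov;
};
```

A caller that no longer needs its descriptor list can pass `std::move(iov)` to avoid the copy; a caller that passes an lvalue can modify or destroy its own list afterwards without affecting the stored one.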

@wence-
Contributor

wence- commented Feb 23, 2026

/ok to test

@copy-pr-bot

copy-pr-bot Bot commented Feb 23, 2026

/ok to test

@wence-, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

@wence-
Contributor

wence- commented Feb 24, 2026

/ok to test e50506e

Member

@pentschev pentschev left a comment

This is looking good, thanks @grlee77. Providing a user header was something we already had planned and partially implemented in #479, so this is a very good addition (we can refactor that PR to use your implementation instead of what was in there). I left a minor improvement to an exception, and suggest we switch to std::vector<std::byte> as Lawrence initially pointed out. I know there are also good reasons not to do it, so I am fine with keeping std::string as well; let me know what you think.

@pentschev pentschev added the feature request, non-breaking, and libucxx labels Feb 25, 2026
@pentschev
Member

/ok to test 5e00989

@pentschev
Member

Looks like there are some style issues, but it seems everything should be fixable with pre-commit run -a.

@grlee77
Contributor Author

grlee77 commented Feb 26, 2026

Thanks for reviewing. I updated userHeader to use std::vector<std::byte> and installed pre-commit to address the linting issues.
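
With userHeader stored as std::vector<std::byte>, a text header such as the JSON strings in the usage examples has to be converted to bytes first. A minimal sketch of such a conversion follows; `toBytes` is a hypothetical helper and not part of UCXX.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string_view>
#include <vector>

// Copies the characters of `text` into a byte vector. No terminator is appended,
// and embedded '\0' characters are preserved when the view's length includes them.
std::vector<std::byte> toBytes(std::string_view text)
{
  std::vector<std::byte> bytes(text.size());
  if (!text.empty()) { std::memcpy(bytes.data(), text.data(), text.size()); }
  return bytes;
}
```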

@pentschev
Member

/ok to test 5377c03

@pentschev
Member

/ok to test 479af66

Member

@pentschev pentschev left a comment

Thanks @grlee77, this looks good to me. There was still a minor style issue, I believe from a commit made before you had pre-commit installed, so I fixed it and pushed to your branch so we could run tests; hope you don't mind. I'll leave this open for another day or two before merging in case we have more comments.

@pentschev
Member

/merge

@rapids-bot rapids-bot Bot merged commit fb0803e into rapidsai:main Mar 3, 2026
90 checks passed
Labels

feature request, libucxx, non-breaking

3 participants