Active Message APIs: support scatter-gather I/O and user-defined header data (Part 1: C++ APIs)#594
add test for strict memory policy unsupported-path
…structors are created
The std::visit lambda was taking data::AmSend amSend by value, creating a temporary copy. For IOV sends, sendBuffer pointed to amSend._iov.data() — the copy's vector storage. When the lambda returned, the copy was destroyed, but ucp_am_send_nbx is async and still needed the IOV descriptors, causing a use-after-free. Changed to const data::AmSend& amSend so it references the original data in _requestData, which lives as long as the RequestAm object. Add UCP_OP_ATTR_FIELD_DATATYPE so UCX doesn't ignore the .datatype field
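The copy-versus-reference distinction in the commit message above can be reproduced without UCX. The following is a minimal sketch under stated assumptions: ContigSend, IovSend and RequestData are hypothetical stand-ins for the request-data alternatives, and a vector of int stands in for the IOV descriptor vector. It only shows that a by-value visitor sees a different vector storage than the one owned by the variant, which is the storage an asynchronous send would need to outlive the lambda.

```cpp
#include <cassert>
#include <type_traits>
#include <variant>
#include <vector>

// Hypothetical stand-ins for the request-data alternatives; not the UCXX types.
struct ContigSend {
  const void* buffer{nullptr};
};
struct IovSend {
  std::vector<int> iov;  // stands in for std::vector<ucp_dt_iov_t>
};
using RequestData = std::variant<ContigSend, IovSend>;

// Visitor takes its argument BY VALUE: `arg` is a copy, so a pointer into its
// vector would dangle as soon as the lambda returns.
inline bool byValueSeesOriginalStorage(const RequestData& data) {
  return std::visit(
    [&data](auto arg) -> bool {
      if constexpr (std::is_same_v<decltype(arg), IovSend>) {
        return arg.iov.data() == std::get<IovSend>(data).iov.data();
      } else {
        return true;  // the contiguous alternative owns no storage
      }
    },
    data);
}

// Visitor takes its argument BY CONST REFERENCE: `arg` is the object stored
// in the variant, so its vector lives as long as `data` does.
inline bool byReferenceSeesOriginalStorage(const RequestData& data) {
  return std::visit(
    [&data](const auto& arg) -> bool {
      if constexpr (std::is_same_v<std::decay_t<decltype(arg)>, IovSend>) {
        return arg.iov.data() == std::get<IovSend>(data).iov.data();
      } else {
        return true;
      }
    },
    data);
}
```

With a non-empty IOV vector, the by-value visitor reports different storage and the by-reference visitor reports the same storage.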
this is needed to include host-side information on tensor shape, strides, dtype etc when using I/O Vector (iov) APIs
AmSendMemoryTypePolicy::FallbackToHost};  ///< Receiver allocation policy.
std::optional<AmReceiverCallbackInfo> receiverCallbackInfo{
  std::nullopt};  ///< Optional receiver callback metadata.
std::string userHeader{};  ///< Opaque user-defined header (arbitrary bytes, not necessarily text).
question: Should we store std::vector<std::byte> rather than std::string here? WDYT?
Agree that std::vector<std::byte> is semantically clearer for C++. The reason for this choice was consistency with the existing AmHeaderSerialized, which has type std::string. I think the reason for std::string is likely that Cython provides <string><->bytes casting, so in the Cython code this currently allows simply
params.userHeader = <string>user_header
where user_header is a Python bytes object (as in libucxx.pyx in #595).
If we change it then I think that Cython code becomes something like
cdef vector[byte] header_vec
header_vec.resize(len(user_header))
memcpy(header_vec.data(), <const char*>user_header, len(user_header))
params.userHeader = header_vec
I'm fine with us breaking the existing implementation or even the API for clearer semantics, even if handling on the Cython side is a bit more brittle. With that said, I prefer that we treat C++ as a first-class citizen even if that means exposing code to Cython becomes sort of a second-class citizen, so I would also prefer std::vector<std::byte> here.
Okay, let me try changing it and if it isn't too hard on the Python/Cython side we can make the change.
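For reference, converting between the two representations discussed in this thread is a small amount of standard C++17. The sketch below is illustrative only: toBytes and toString are hypothetical helper names, not part of UCXX, and nothing here reflects which type the PR ultimately settled on.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: copy the raw bytes of a string-like header into a
// byte vector. Embedded NUL bytes are preserved because the size is explicit.
inline std::vector<std::byte> toBytes(std::string_view text) {
  std::vector<std::byte> out(text.size());
  if (!text.empty()) std::memcpy(out.data(), text.data(), text.size());
  return out;
}

// Hypothetical helper: the reverse direction, for code that still expects
// a std::string holding opaque bytes.
inline std::string toString(const std::vector<std::byte>& bytes) {
  if (bytes.empty()) return {};
  return std::string(reinterpret_cast<const char*>(bytes.data()), bytes.size());
}
```

Either type can carry arbitrary bytes; the difference is what the signature communicates to the caller.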
AmSend::AmSend(const std::vector<ucp_dt_iov_t>& iov, const AmSendParams& params)
  : _buffer(nullptr),
    _length(0),
    _iov(iov),
question: I think this copies iov since _iov is not a reference?
Therefore, perhaps it would make it clearer to the caller if this ctor took the parameter by value std::vector<ucp_dt_iov_t> iov (rather than a reference) and initialised by _iov(std::move(iov)) here.
That way I'm not worried that I need to keep the reference alive for the lifetime of the returned shared_ptr Request.
WDYT?
Thanks, I agree that this would be better
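The by-value-then-move pattern suggested above can be sketched as follows. This is a minimal illustration, assuming a stand-in IovEntry struct with the same two fields as UCX's ucp_dt_iov_t (buffer and length); IovSendData and makeSendData are hypothetical names, not the UCXX classes.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Stand-in with the same two fields as UCX's ucp_dt_iov_t.
struct IovEntry {
  void* buffer;
  std::size_t length;
};

// Hypothetical request-data class: the constructor takes the vector by value
// and moves it into the member, so ownership of the descriptors is explicit.
class IovSendData {
 public:
  explicit IovSendData(std::vector<IovEntry> iov) : _iov(std::move(iov)) {}
  const std::vector<IovEntry>& iov() const { return _iov; }

 private:
  std::vector<IovEntry> _iov;
};

// The caller's local vector can go out of scope after construction.
inline IovSendData makeSendData(void* a, std::size_t lenA, void* b, std::size_t lenB) {
  std::vector<IovEntry> segments{{a, lenA}, {b, lenB}};
  return IovSendData(std::move(segments));
}
```

Note that this only settles the lifetime of the descriptor vector. The buffers the descriptors point at are not copied, so they still have to stay valid until the send completes.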
/ok to test
@wence-, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test e50506e
pentschev left a comment
This is looking good, thanks @grlee77. Providing a user header was something we already had planned and partially implemented in #479, so this is a very good addition (we can refactor that PR to use your implementation instead of what was in there). I left a minor improvement to an exception, and suggest we switch to std::vector<std::byte> as Lawrence initially pointed out. I know there are also good reasons not to do it, so I am fine with keeping std::string as well; let me know what you think.
/ok to test 5e00989
Looks like there are some style issues, but it seems everything should be fixable by pre-commit with |
Co-authored-by: Peter Andreas Entschev <peter@entschev.com>
Thanks for reviewing. I updated |
/ok to test 5377c03
/ok to test 479af66
pentschev left a comment
Thanks @grlee77, this looks good to me. There was still a minor style issue, I believe from a commit made before you had pre-commit enabled, so I fixed it and pushed to your branch so we could run tests; hope you don't mind. I'll leave this open for another day or two before merging in case we have more comments.
/merge
Background
UCX supports scatter-gather I/O through its IOV (I/O Vector) datatype, which allows sending data from multiple non-contiguous memory buffers in a single operation. This avoids the need to copy disjoint buffers into a single contiguous allocation before sending, which is important for workloads that produce multi-segment messages (e.g., a serialized header followed by one or more tensor payloads).
Until now, UCXX's Active Message (AM) send API only supported contiguous buffers. Callers needing to send data from multiple buffers had to either concatenate into a staging buffer or issue multiple separate sends. This PR adds first-class IOV support and a structured parameter object that exposes additional UCX knobs without breaking existing callers.
The concrete motivation is a Holoscan SDK use case. An existing internal GXF UCX extension used by Holoscan uses UCX C APIs to send/receive a C++ data structure that is basically a std::unordered_map of named tensors (i.e. the C++ equivalent of Python dict[str, ndarray]). We would like to use similar functionality elsewhere via UCXX C++ APIs.
The approach already verified to work with UCX C APIs (ucp_am_send_nbx) is:
The equivalent behavior is not currently possible with UCXX due to the following limitations:
For Holoscan purposes we do not currently need the Python APIs at all, but I have a follow-up PR that adds them for completeness and to follow the UCXX convention of exposing functionality to Python.
IOV support for active messages
New types (cpp/include/ucxx/typedefs.h):
- AmSendMemoryTypePolicy enum: controls receiver-side allocation behavior when no allocator is registered for the sender's memory type. FallbackToHost (default) silently falls back to host memory; ErrorOnUnsupported fails the receive with UCS_ERR_UNSUPPORTED.
- AmSendParams struct: groups send flags, UCX datatype, memory type hint, memory type policy, and optional receiver callback info into a single parameter object.
New Endpoint::amSend overloads (cpp/include/ucxx/endpoint.h, cpp/src/endpoint.cpp):
- amSend(buffer, length, AmSendParams): contiguous send with explicit policy controls.
- amSend(std::vector<ucp_dt_iov_t>, AmSendParams): scatter-gather IOV send. UCX receives the segments as a single logical message; the receiver sees the reassembled contiguous data.
The original amSend(buffer, length, memoryType, ...) overload is preserved unchanged.
Request data layer (cpp/include/ucxx/request_data.h, cpp/src/request_data.cpp):
- data::AmSend gains an IOV constructor that stores the segment vector, validates entries (non-empty list, non-null buffers for non-zero lengths, correct datatype), and sets _count to the number of IOV entries.
AM send request path (cpp/src/request_am.cpp):
- UCP_OP_ATTR_FIELD_DATATYPE is now included in op_attr_mask so UCX respects the datatype field.
- For IOV sends, the IOV descriptor array is passed to ucp_am_send_nbx instead of a raw buffer pointer. The std::visit lambda takes data::AmSend by const reference to ensure the descriptors remain valid for the duration of the async operation.
- AmHeader serialization appends the AmSendMemoryTypePolicy byte. Deserialization reads it when present and defaults to FallbackToHost for backward compatibility with older headers.
User-defined AM header
UCX Active Messages support a separate header parameter in ucp_am_send_nbx that travels independently of the body payload. The header is always host memory regardless of the body's memory type, making it ideal for metadata (tensor names, shapes, dtypes, etc.) alongside device-memory payloads.
Previously, UCXX used the AM header internally for its own AmHeader struct (memoryType, memoryTypePolicy, receiverCallbackInfo) and provided no way for users to attach their own header data.
C++ changes:
- AmSendParams gains a std::string userHeader field (opaque arbitrary bytes, not necessarily text).
- data::AmSend and data::AmReceive carry the user header through the send and receive paths.
- AmHeader serialization appends the user header size and data after the existing fields. Deserialization uses a bounds check so older senders that don't include a user header deserialize with an empty string.
- Request::getRecvHeader() virtual method (returns empty string by default) with RequestAm::getRecvHeader() override that returns the user header from the received message.
Size limits: The user header is serialized into the AM header parameter of ucp_am_send_nbx, which is subject to transport-level size limits (e.g., ~8 KiB default for TCP via UCX_TCP_TX_SEG_SIZE). Exceeding the limit causes a fatal UCX error. Keep user headers small (< 4 KiB recommended) or increase segment size environment variables.
Backward compatibility
- The original amSend(buffer, length, memoryType, ...) signature is unchanged and continues to work as before.
- AmSendParams defaults (flags = UCP_AM_SEND_FLAG_REPLY, datatype = contig(1), memoryType = HOST, policy = FallbackToHost) match the prior implicit behavior.
- Older headers that lack the policy byte deserialize as FallbackToHost.
- AmSendParams::userHeader defaults to an empty string. Existing callers that don't set it are unaffected.
- getRecvHeader() returns an empty string for non-AM requests and for AM receives from senders that didn't set a user header.
Usage Examples
Python API examples below are for context, but are not included in this PR. They are to be provided in a follow-up PR.
Contiguous buffer (new parameterized form)
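A minimal sketch of the call shape, written against stand-in types so that it compiles without the UCXX headers. Everything here is an assumption made for illustration: SendParams only mirrors some of the fields listed for AmSendParams, the flags value is a placeholder rather than the UCX constant, and amSendSketch records its arguments instead of sending anything. The real overload is Endpoint::amSend(buffer, length, AmSendParams).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-ins mirroring fields described in this PR; the real
// definitions live in cpp/include/ucxx/typedefs.h and may differ.
enum class MemoryType { Host, Cuda };
enum class MemoryTypePolicy { FallbackToHost, ErrorOnUnsupported };

struct SendParams {
  std::uint32_t flags{1};  // placeholder for UCP_AM_SEND_FLAG_REPLY
  MemoryType memoryType{MemoryType::Host};
  MemoryTypePolicy memoryTypePolicy{MemoryTypePolicy::FallbackToHost};
  std::string userHeader{};
};

// What a send would be asked to do; used here instead of real I/O.
struct SendRecord {
  const void* buffer;
  std::size_t length;
  SendParams params;
};

// Stand-in for the parameterized contiguous overload: records the call.
inline SendRecord amSendSketch(const void* buffer, std::size_t length, const SendParams& params) {
  return SendRecord{buffer, length, params};
}

// Caller side: start from the defaults and override only what differs.
inline SendRecord sendDeviceBuffer(const void* devicePtr, std::size_t nbytes) {
  SendParams params;
  params.memoryType = MemoryType::Cuda;
  return amSendSketch(devicePtr, nbytes, params);
}
```

The point of the parameter object is that callers name only the fields they change; the remaining fields keep defaults that match the prior implicit behavior.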
IOV send
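A minimal sketch of building an IOV list and of what the receiver is described as seeing, again without UCX. IovEntry is a stand-in for ucp_dt_iov_t, validateIov mirrors two of the checks listed for the IOV constructor (the exception type is an assumption), and gather only models the reassembled contiguous message; in the real path the segment vector is handed to Endpoint::amSend and UCX performs the gather.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Stand-in with the same two fields as UCX's ucp_dt_iov_t.
struct IovEntry {
  void* buffer;
  std::size_t length;
};

// Checks modelled on the validation described for the IOV constructor:
// the list must be non-empty, and a non-zero length needs a non-null buffer.
inline void validateIov(const std::vector<IovEntry>& iov) {
  if (iov.empty()) throw std::invalid_argument("IOV list must not be empty");
  for (const auto& entry : iov) {
    if (entry.buffer == nullptr && entry.length > 0)
      throw std::invalid_argument("IOV entry has a null buffer with non-zero length");
  }
}

// Models the receiver's view: all segments concatenated in order.
inline std::vector<char> gather(const std::vector<IovEntry>& iov) {
  validateIov(iov);
  std::size_t total = 0;
  for (const auto& entry : iov) total += entry.length;
  std::vector<char> out(total);
  std::size_t offset = 0;
  for (const auto& entry : iov) {
    if (entry.length > 0) std::memcpy(out.data() + offset, entry.buffer, entry.length);
    offset += entry.length;
  }
  return out;
}
```

On the sending side no such copy is made; avoiding that staging buffer is the motivation for IOV support.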
Strict memory policy
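The receiver-side decision described for AmSendMemoryTypePolicy can be sketched as a small pure function. This is an illustration under assumptions, not the UCXX implementation: the enums are stand-ins, host allocation is assumed to be always available, and an empty optional stands in for failing the receive with UCS_ERR_UNSUPPORTED.

```cpp
#include <cassert>
#include <optional>
#include <set>

// Hypothetical stand-ins for the UCXX/UCX enums.
enum class MemoryType { Host, Cuda };
enum class MemoryTypePolicy { FallbackToHost, ErrorOnUnsupported };

// Returns the memory type the receiver would allocate in, or std::nullopt
// where the real code is described as failing with UCS_ERR_UNSUPPORTED.
inline std::optional<MemoryType> chooseReceiveMemory(
  MemoryType senderType,
  MemoryTypePolicy policy,
  const std::set<MemoryType>& registeredAllocators)
{
  // Assumption: host memory never needs a registered allocator.
  if (senderType == MemoryType::Host || registeredAllocators.count(senderType) > 0)
    return senderType;
  if (policy == MemoryTypePolicy::FallbackToHost) return MemoryType::Host;
  return std::nullopt;  // strict policy: refuse rather than silently fall back
}
```

The strict policy trades availability for predictability: a sender that requires device memory on the receiver learns about a missing allocator as an error instead of getting a host buffer.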
User-defined header
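Because the user header is an opaque byte string, the application decides its layout. The sketch below shows one possible way to pack tensor metadata into a std::string suitable for a field like AmSendParams::userHeader. The TensorMeta record, the length-prefixed layout, and the helper names are all assumptions made for illustration; values are written in host byte order, so both ends are assumed to share endianness.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical metadata record; the field choice is illustrative only.
struct TensorMeta {
  std::string name;
  std::vector<std::int64_t> shape;
};

// Upper bound recommended in the PR description for user headers.
constexpr std::size_t kRecommendedMaxHeaderBytes = 4096;

template <typename T>
void appendRaw(std::string& out, const T& value) {
  out.append(reinterpret_cast<const char*>(&value), sizeof(T));
}

template <typename T>
T readRaw(const std::string& in, std::size_t& offset) {
  if (offset > in.size() || in.size() - offset < sizeof(T))
    throw std::out_of_range("user header truncated");
  T value;
  std::memcpy(&value, in.data() + offset, sizeof(T));
  offset += sizeof(T);
  return value;
}

// Layout: [u32 name length][name bytes][u32 rank][rank x i64 dims].
inline std::string packMeta(const TensorMeta& meta) {
  std::string out;
  appendRaw(out, static_cast<std::uint32_t>(meta.name.size()));
  out += meta.name;
  appendRaw(out, static_cast<std::uint32_t>(meta.shape.size()));
  for (std::int64_t dim : meta.shape) appendRaw(out, dim);
  return out;
}

inline TensorMeta unpackMeta(const std::string& in) {
  std::size_t offset = 0;
  TensorMeta meta;
  const auto nameLen = readRaw<std::uint32_t>(in, offset);
  if (in.size() - offset < nameLen) throw std::out_of_range("user header truncated");
  meta.name.assign(in, offset, nameLen);
  offset += nameLen;
  const auto rank = readRaw<std::uint32_t>(in, offset);
  for (std::uint32_t i = 0; i < rank; ++i) meta.shape.push_back(readRaw<std::int64_t>(in, offset));
  return meta;
}
```

A sender would assign the result of packMeta to the user header field before calling amSend, after checking its size against the recommended limit.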
On the receive side, access the header after the request completes:
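In UCXX the receiver would call getRecvHeader() on the completed request. What that call can return depends on the bounds-checked deserialization described in the C++ changes, which the following sketch models in isolation. The wire layout here is a simplified assumption (one memory-type byte, one policy byte, a 64-bit size, then the user header bytes) and omits fields such as receiverCallbackInfo; encode and decode are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <string>

// Simplified, hypothetical view of a decoded AM header.
struct DecodedHeader {
  std::uint8_t memoryType{0};
  std::uint8_t policy{0};    // 0 stands in for FallbackToHost
  std::string userHeader{};  // empty when the sender attached none
};

inline std::string encode(std::uint8_t memoryType, std::uint8_t policy, const std::string& userHeader) {
  std::string out;
  out.push_back(static_cast<char>(memoryType));
  out.push_back(static_cast<char>(policy));
  const std::uint64_t size = userHeader.size();
  out.append(reinterpret_cast<const char*>(&size), sizeof(size));
  out += userHeader;
  return out;
}

// Each optional trailing field is read only if enough bytes remain, so a
// shorter header from an older sender decodes to the defaults.
inline DecodedHeader decode(const std::string& wire) {
  DecodedHeader header;
  std::size_t offset = 0;
  if (wire.empty()) throw std::invalid_argument("empty header");
  header.memoryType = static_cast<std::uint8_t>(wire[offset++]);
  if (wire.size() - offset < 1) return header;  // no policy byte
  header.policy = static_cast<std::uint8_t>(wire[offset++]);
  std::uint64_t size = 0;
  if (wire.size() - offset < sizeof(size)) return header;  // no user header
  std::memcpy(&size, wire.data() + offset, sizeof(size));
  offset += sizeof(size);
  if (wire.size() - offset < size) throw std::invalid_argument("truncated user header");
  header.userHeader.assign(wire, offset, static_cast<std::size_t>(size));
  return header;
}
```

This is why a receiver should treat an empty header as a normal outcome rather than an error: it is what non-AM requests and older senders produce.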