Is your feature request related to a problem? Please describe.
It would be useful to have a pack function that merges multiple device_buffers into a single device_buffer. This helps when reading from one large device_buffer is more performant, even though the data consists of many smaller segments that must first be merged. Example use cases include sending data with UCX and spilling data from device to host.
Similarly, it would be useful to have an unpack function that splits a device_buffer into multiple device_buffers. This helps when writing into one large device_buffer is more performant, even though the data consists of many smaller segments that may need to be freed at different times. Example use cases include receiving data with UCX and unspilling data from host to device.
Describe the solution you'd like
For pack, it would be nice if it simply took several device_buffers in vector form and returned a single one. Additionally, it would be nice if pack could recognize when device_buffers are contiguous in memory and avoid a copy, though admittedly this last part is tricky (maybe less so if unpack is used regularly?). If we allow pack to change the order (to benefit from contiguous memory, for example), we may want additional information about where the data segments live in the larger device_buffer.
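To make the request concrete, here is a minimal host-side sketch of what pack could look like. It is not RMM code: `buffer` is a `std::vector<std::byte>` standing in for a device_buffer, and `packed_result` and this `pack` signature are hypothetical names chosen for illustration. The sketch always allocates and copies; it returns per-segment offsets so that a smarter version could reorder segments without losing track of them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Host-side stand-in for a device_buffer (assumption, for illustration only).
using buffer = std::vector<std::byte>;

// Hypothetical result type: the merged buffer plus the byte offset at which
// each input segment starts, listed in input order.
struct packed_result {
  buffer data;
  std::vector<std::size_t> offsets;
};

// Hypothetical pack: one allocation, then one copy per segment.
packed_result pack(std::vector<buffer> const& segments) {
  std::size_t total = 0;
  for (auto const& s : segments) total += s.size();

  packed_result out{buffer(total), {}};
  out.offsets.reserve(segments.size());

  std::size_t pos = 0;
  for (auto const& s : segments) {
    out.offsets.push_back(pos);
    if (!s.empty()) std::memcpy(out.data.data() + pos, s.data(), s.size());
    pos += s.size();
  }
  assert(pos == total);  // every byte of the merged buffer was written
  return out;
}
```

On the device, the copies would presumably be device-to-device copies on a stream rather than `std::memcpy`, and avoiding them for already-contiguous inputs would need help from the allocator.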
For unpack, it would be nice if it took a single device_buffer and size_ts in vector form describing how to split it, and returned a vector of multiple device_buffers. Additionally, it would be nice if unpack did not perform any copies. Hopefully that is straightforward, but there may be things I'm not understanding.
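A copy-free unpack could be sketched as follows, again on the host and with hypothetical names (`buffer`, `buffer_view`, and this `unpack` signature are assumptions, not RMM API). Each piece points into the original allocation and shares ownership of it, so no bytes are moved.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Host-side stand-in for a device_buffer (assumption, for illustration only).
using buffer = std::vector<std::byte>;

// Hypothetical non-copying piece: a pointer into the parent allocation plus a
// shared handle that keeps the parent alive while any piece exists.
struct buffer_view {
  std::shared_ptr<buffer> parent;
  std::byte* data;
  std::size_t size;
};

// Hypothetical unpack: takes ownership of the large buffer and carves it into
// consecutive pieces of the requested sizes without copying.
std::vector<buffer_view> unpack(buffer&& packed, std::vector<std::size_t> const& sizes) {
  auto parent = std::make_shared<buffer>(std::move(packed));

  std::vector<buffer_view> pieces;
  pieces.reserve(sizes.size());

  std::size_t pos = 0;
  for (auto n : sizes) {
    assert(pos + n <= parent->size());  // requested sizes must fit in the buffer
    pieces.push_back({parent, parent->data() + pos, n});
    pos += n;
  }
  return pieces;
}
```

Note the limitation of this design: the parent allocation is released only when the last piece goes away. Freeing each piece's memory independently, as the request describes, would likely need cooperation from the memory resource.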
Describe alternatives you've considered
One might consider using variadics in C++ for the arguments. While nice at the C++ level, this seems tricky to use from the Cython and Python levels, hence the suggestion to just use vector.
pack itself could be implemented by a user simply allocating a larger buffer and copying the data over. It would be nice to avoid the extra allocation when possible, though (which may require knowledge that RMM has about the allocations).
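As a rough illustration of the check that would let pack skip the allocation, the sketch below tests whether segments sit back to back in memory. `segment` and `is_contiguous` are hypothetical names, and the check is deliberately naive: adjacent addresses alone do not prove that the segments can be merged into one owning buffer, which is where the allocator's own bookkeeping would be needed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical description of one segment: start address and length in bytes.
struct segment {
  std::byte const* data;
  std::size_t size;
};

// Returns true when each segment starts exactly where the previous one ends.
// An empty or single-element list is trivially contiguous.
bool is_contiguous(std::vector<segment> const& segments) {
  for (std::size_t i = 1; i < segments.size(); ++i) {
    if (segments[i - 1].data + segments[i - 1].size != segments[i].data) return false;
  }
  return true;
}
```

This check is order-sensitive. A version that tolerates reordering would sort the segments by address first, which is one reason pack might want to report where each segment ended up.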
Additional context
Having unpack in particular would be helpful for aggregated receives. A natural extension of this would be to have pack for aggregated sends. All in all, this should allow transmitting a larger amount of data at once with UCX, thus benefiting from the use case it is better tuned for. PR ( dask/distributed#3453 ) provides a WIP implementation of aggregated receives for context.
Also, having pack would be useful when spilling several device_buffers from device to host, as it would allow us to pack them into one device_buffer before transferring ( rapidsai/dask-cuda#250 ). Having unpack would help us break up the allocation whenever the object is unspilled.
This need has also come up in downstream contexts ( #3793 ). Maybe they would benefit from an upstream solution as well?