feat: add multiple_blocks_allocation RAII handle for fixed_size_memory_resource#2368
feat: add multiple_blocks_allocation RAII handle for fixed_size_memory_resource#2368nirandaperera wants to merge 6 commits into
Conversation
Signed-off-by: niranda perera <niranda.perera@gmail.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a public RAII handle Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
cpp/tests/mr/fixed_size_mr_test.cpp (1)
58-115: Add a multithreaded test for the new allocation/destruction path.This only exercises single-threaded accounting. The feature change is in the mutex/event path of
make_asyncand destruction, so a test that allocates and destroys handles from multiple CPU threads would give much better coverage of the new synchronization logic. Based on learnings "Test concurrent allocations and deallocations from multiple threads; verify thread safety of pool implementations."🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/mr/fixed_size_mr_test.cpp` around lines 58 - 115, The test only exercises single-threaded allocation/deallocation of fixed_size_memory_resource and does not cover the new mutex/event synchronization in rmm::mr::multiple_blocks_allocation::make_async and its destruction path; add a multithreaded variant of TEST_P(FixedSizeMRTest, AllocateBlocksAsyncUpstreamCountedDeallocateDoesNotReturnToUpstream) that spawns several CPU threads which concurrently call multiple_blocks_allocation::make_async on the same fixed_size_mr instance (using the same stream_pool or distinct streams) and then clear/destroy the handles concurrently, asserting the same invariants (handle->size(), handle->capacity(), counting.get_bytes_counter().value behavior) both before and after fixed_size_mr destruction to exercise and verify the mutex/event synchronization path in make_async and destructor.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/mr/fixed_size_memory_resource.cpp`:
- Around line 73-85: After acquiring the blocks with self.get_block(...) and
before returning the new multiple_blocks_allocation, ensure that construction
failures do not leak the acquired blocks: wrap the construction of the
unique_ptr<multiple_blocks_allocation> (the new multiple_blocks_allocation(size,
std::move(blocks), stream, std::move(mr)) call) in a try/catch that on
catch(...) calls self.deallocate_blocks_async_unsafe(std::move(blocks), stream)
and rethrows; alternatively create the object via a factory that returns a smart
pointer or use a scoped guard that deallocates blocks on exception so
deallocate_blocks_async_unsafe is always invoked if the
multiple_blocks_allocation constructor throws.
- Around line 39-55: Reject per-thread default streams when creating an
asynchronous multiple_blocks_allocation: in the factory function make_async (the
caller that constructs multiple_blocks_allocation and passes cuda_stream_view
stream), validate stream.is_per_thread_default() and fail fast (throw or return
an error) instead of accepting PTDS; multiple_blocks_allocation stores stream_
and uses it in its destructor (deallocate_blocks_async_unsafe), so accepting
PTDS would cause the stored handle to resolve on a different thread and race
with GPU work—update make_async to check stream.is_per_thread_default() and
document the rejection, or alternatively require the allocation be destroyed on
the creating thread if PTDS is allowed (preferred: reject PTDS in make_async).
---
Nitpick comments:
In `@cpp/tests/mr/fixed_size_mr_test.cpp`:
- Around line 58-115: The test only exercises single-threaded
allocation/deallocation of fixed_size_memory_resource and does not cover the new
mutex/event synchronization in rmm::mr::multiple_blocks_allocation::make_async
and its destruction path; add a multithreaded variant of TEST_P(FixedSizeMRTest,
AllocateBlocksAsyncUpstreamCountedDeallocateDoesNotReturnToUpstream) that spawns
several CPU threads which concurrently call
multiple_blocks_allocation::make_async on the same fixed_size_mr instance (using
the same stream_pool or distinct streams) and then clear/destroy the handles
concurrently, asserting the same invariants (handle->size(), handle->capacity(),
counting.get_bytes_counter().value behavior) both before and after fixed_size_mr
destruction to exercise and verify the mutex/event synchronization path in
make_async and destructor.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 65115bfb-9479-479a-b6c0-2aa3b96ca229
📒 Files selected for processing (7)
cpp/include/rmm/mr/detail/fixed_size_memory_resource_impl.hppcpp/include/rmm/mr/detail/stream_ordered_memory_resource.hppcpp/include/rmm/mr/fixed_size_memory_resource.hppcpp/src/mr/detail/fixed_size_memory_resource_impl.cppcpp/src/mr/fixed_size_memory_resource.cppcpp/tests/CMakeLists.txtcpp/tests/mr/fixed_size_mr_test.cpp
Signed-off-by: niranda perera <niranda.perera@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/rmm/mr/fixed_size_memory_resource.hpp`:
- Around line 91-204: Update the class-level doc for multiple_blocks_allocation
to reference the actual factory make_async (replace the stale
allocate_blocks_async mention) and add a brief thread-safety/lifetime note
clarifying that multiple_blocks_allocation is non-copyable/non-movable, holds a
refcounted fixed_size_memory_resource (mr_) so the pool outlives the handle, and
that allocation/deallocation are stream-ordered on the cuda_stream_view passed
to make_async (including that destructor records the deallocation on that
stream). Ensure references to symbols multiple_blocks_allocation, make_async,
mr_, and the destructor are included so the API contract matches the
implementation.
In `@cpp/src/mr/fixed_size_memory_resource.cpp`:
- Around line 51-57: The destructor
multiple_blocks_allocation::~multiple_blocks_allocation currently calls
mr_->deallocate_blocks_async_unsafe(...) which can surface CUDA errors from a
destructor; make the cleanup no-throw by wrapping the deallocation call in the
project's no-throw CUDA error macro/utility (e.g., RMM_CUDA_TRY_NOEXCEPT) or an
equivalent try/catch that logs errors without throwing. Ensure you still check
blocks_.empty(), acquire the mutex via std::lock_guard as before, and move
blocks_ into the deallocation call, but perform that call inside the no-throw
wrapper so any CUDA error is swallowed/logged rather than allowed to propagate
out of the destructor.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4e775602-9c2b-4234-8235-8629762e0fa4
📒 Files selected for processing (3)
cpp/include/rmm/mr/fixed_size_memory_resource.hppcpp/src/mr/fixed_size_memory_resource.cppcpp/tests/mr/fixed_size_mr_test.cpp
| * before the failure are returned to the pool on `stream` (same ordering as normal | ||
| * deallocation). | ||
| */ | ||
| [[nodiscard]] static std::unique_ptr<multiple_blocks_allocation> make_async( |
There was a problem hiding this comment.
Why do we need a factory instead of a normal constructor?
There was a problem hiding this comment.
This was previously inside the fixed sized mr class. Then I thought a factory method was the best. When I pulled it out, I left it as is. I felt its more idiomatic. We can throw and verify args (I should remove the RMM_EXPECTS statements in the ctr) cleanly. But I am fine either way. WDYT @bdice ?
| * deallocation). | ||
| */ | ||
| [[nodiscard]] static std::unique_ptr<multiple_blocks_allocation> make_async( | ||
| fixed_size_memory_resource mr, std::size_t size, cuda_stream_view stream); |
There was a problem hiding this comment.
All new APIs should accept cuda::stream_ref rather than rmm::cuda_stream_view. I am planning to deprecate rmm::cuda_stream_view at some point. Until then the two types are implicitly convertible and thus interchangeable.
Signed-off-by: niranda perera <niranda.perera@gmail.com>
Signed-off-by: niranda perera <niranda.perera@gmail.com>
There was a problem hiding this comment.
♻️ Duplicate comments (2)
cpp/src/mr/fixed_size_memory_resource.cpp (2)
108-121:⚠️ Potential issue | 🟠 MajorKeep handle construction inside the rollback scope.
If
new multiple_blocks_allocation(...)throws, control skips the currentcatchand the acquired blocks never get returned to the pool. The rollback needs to cover object construction as well, not justget_block().🛠️ Proposed fix
std::vector<std::byte*> blocks; blocks.reserve(num_blocks); try { for (std::size_t i = 0; i < num_blocks; ++i) { blocks.push_back( static_cast<std::byte*>(self.get_block(self.get_block_size(), stream_event).pointer())); } - } catch (...) { - RMM_CUDA_TRY(self.deallocate_blocks_async_unsafe(std::move(blocks), stream)); - throw; - } - - return std::unique_ptr<multiple_blocks_allocation>( - new multiple_blocks_allocation(size, std::move(blocks), stream, std::move(mr))); + return std::unique_ptr<multiple_blocks_allocation>( + new multiple_blocks_allocation(size, std::move(blocks), stream, std::move(mr))); + } catch (...) { + RMM_CUDA_TRY(self.deallocate_blocks_async_unsafe(std::move(blocks), stream)); + throw; + }#!/bin/bash # Verify that the handle construction still sits outside the rollback scope. sed -n '103,122p' cpp/src/mr/fixed_size_memory_resource.cppBased on learnings: "Detect and fix GPU memory leaks: ensure all device allocations (cudaMalloc, cudaMallocManaged, cudaHostAlloc) are properly deallocated, including error paths."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mr/fixed_size_memory_resource.cpp` around lines 108 - 121, The construction of the multiple_blocks_allocation handle can throw and currently lives outside the rollback scope, so if its constructor throws the acquired blocks in `blocks` won't be returned; move the creation of the unique_ptr/new multiple_blocks_allocation(...) inside the same try that obtains blocks (or add a nested try/catch around the `new multiple_blocks_allocation(...)` that calls `self.deallocate_blocks_async_unsafe(std::move(blocks), stream)` and rethrows) so that any exception during `multiple_blocks_allocation` construction triggers the same rollback path; reference `get_block`, `blocks`, `deallocate_blocks_async_unsafe`, and `multiple_blocks_allocation` when making the change.
49-56:⚠️ Potential issue | 🔴 CriticalUse
std::exchange()to ensure moved-from state is safely emptied.After move-construction or move-assignment,
std::vectoris left in a valid but unspecified state—not guaranteed empty. This creates a double-release risk: if the moved-from object later callsclear()or its destructor, GPU blocks can be returned twice. Additionally,deallocate_blocks_async_unsafe()is called with a moved-from vector in unspecified state.Also, in
make_async()(lines 115-121), thenew multiple_blocks_allocation(...)call is outside thetry/catchblock. If that allocation throws, the acquired blocks are not deallocated, leaking GPU memory. Wrap the object construction in the rollback scope.🔧 Proposed fix
void multiple_blocks_allocation::clear() { - if (!blocks_.empty()) { + auto blocks = std::exchange(blocks_, {}); + if (!blocks.empty()) { std::lock_guard<std::mutex> lock(mr_->get_mutex()); - RMM_CUDA_TRY(mr_->deallocate_blocks_async_unsafe(std::move(blocks_), stream_)); + RMM_CUDA_TRY(mr_->deallocate_blocks_async_unsafe(std::move(blocks), stream_)); } size_ = 0; } multiple_blocks_allocation::multiple_blocks_allocation(multiple_blocks_allocation&& other) noexcept - : blocks_(std::move(other.blocks_)), + : blocks_(std::exchange(other.blocks_, {})), size_(other.size_), stream_(other.stream_), mr_(std::move(other.mr_)) { other.size_ = 0; } multiple_blocks_allocation& multiple_blocks_allocation::operator=( multiple_blocks_allocation&& other) { if (this != &other) { clear(); - blocks_ = std::move(other.blocks_); + blocks_ = std::exchange(other.blocks_, {}); size_ = other.size_; stream_ = other.stream_; mr_ = std::move(other.mr_); other.size_ = 0; } return *this; } try { for (std::size_t i = 0; i < num_blocks; ++i) { blocks.push_back(...); } - } catch (...) { + return std::unique_ptr<multiple_blocks_allocation>( + new multiple_blocks_allocation(size, std::move(blocks), stream, std::move(mr))); + } catch (...) { RMM_CUDA_TRY(self.deallocate_blocks_async_unsafe(std::move(blocks), stream)); throw; } - - return std::unique_ptr<multiple_blocks_allocation>( - new multiple_blocks_allocation(size, std::move(blocks), stream, std::move(mr)));🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mr/fixed_size_memory_resource.cpp` around lines 49 - 56, The move constructor multiple_blocks_allocation(multiple_blocks_allocation&&) leaves moved-from members in unspecified state which risks double-release; use std::exchange to move blocks_ and mr_ and set size_ and stream_ to safe defaults (e.g., std::exchange(other.size_, 0), std::exchange(other.stream_, nullptr), std::exchange(other.blocks_, {}), std::exchange(other.mr_, nullptr)) so the moved-from object is explicitly empty. Also, in make_async() wrap the new multiple_blocks_allocation(...) construction inside the existing try/catch/rollback scope (or add a try around it) so that if operator new throws you call deallocate_blocks_async_unsafe() to release acquired GPU blocks before rethrowing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@cpp/src/mr/fixed_size_memory_resource.cpp`:
- Around line 108-121: The construction of the multiple_blocks_allocation handle
can throw and currently lives outside the rollback scope, so if its constructor
throws the acquired blocks in `blocks` won't be returned; move the creation of
the unique_ptr/new multiple_blocks_allocation(...) inside the same try that
obtains blocks (or add a nested try/catch around the `new
multiple_blocks_allocation(...)` that calls
`self.deallocate_blocks_async_unsafe(std::move(blocks), stream)` and rethrows)
so that any exception during `multiple_blocks_allocation` construction triggers
the same rollback path; reference `get_block`, `blocks`,
`deallocate_blocks_async_unsafe`, and `multiple_blocks_allocation` when making
the change.
- Around line 49-56: The move constructor
multiple_blocks_allocation(multiple_blocks_allocation&&) leaves moved-from
members in unspecified state which risks double-release; use std::exchange to
move blocks_ and mr_ and set size_ and stream_ to safe defaults (e.g.,
std::exchange(other.size_, 0), std::exchange(other.stream_, nullptr),
std::exchange(other.blocks_, {}), std::exchange(other.mr_, nullptr)) so the
moved-from object is explicitly empty. Also, in make_async() wrap the new
multiple_blocks_allocation(...) construction inside the existing
try/catch/rollback scope (or add a try around it) so that if operator new throws
you call deallocate_blocks_async_unsafe() to release acquired GPU blocks before
rethrowing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: a6e7e9e1-54d3-4ffd-b1b9-0ae84ba5448e
📒 Files selected for processing (4)
cpp/include/rmm/mr/detail/fixed_size_memory_resource_impl.hppcpp/include/rmm/mr/fixed_size_memory_resource.hppcpp/src/mr/detail/fixed_size_memory_resource_impl.cppcpp/src/mr/fixed_size_memory_resource.cpp
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/include/rmm/mr/fixed_size_memory_resource.hpp
- cpp/src/mr/detail/fixed_size_memory_resource_impl.cpp
Signed-off-by: niranda perera <niranda.perera@gmail.com>
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
cpp/src/mr/fixed_size_memory_resource.cpp (1)
111-122:⚠️ Potential issue | 🟠 MajorExtend the rollback path to cover handle construction.
Line 121 is still outside the rollback
try. Ifnew multiple_blocks_allocation(...)throws after the blocks were acquired, those blocks never get returned to the pool and pool capacity leaks on the failure path.🛠️ Proposed fix
std::vector<std::byte*> blocks; blocks.reserve(num_blocks); try { for (std::size_t i = 0; i < num_blocks; ++i) { blocks.push_back( static_cast<std::byte*>(self.get_block(self.get_block_size(), stream_event).pointer())); } - } catch (...) { - RMM_CUDA_TRY(self.deallocate_blocks_async_unsafe(std::move(blocks), stream)); - throw; - } - - return std::unique_ptr<multiple_blocks_allocation>( - new multiple_blocks_allocation(size, std::move(blocks), stream, std::move(mr))); + return std::unique_ptr<multiple_blocks_allocation>( + new multiple_blocks_allocation(size, std::move(blocks), stream, std::move(mr))); + } catch (...) { + RMM_CUDA_TRY(self.deallocate_blocks_async_unsafe(std::move(blocks), stream)); + throw; + }As per coding guidelines, "Detect and fix GPU memory leaks: ensure all device allocations (cudaMalloc, cudaMallocManaged, cudaHostAlloc) are properly deallocated, including error paths."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/mr/fixed_size_memory_resource.cpp` around lines 111 - 122, The allocation of blocks is protected by a try/catch but the subsequent construction of multiple_blocks_allocation(new multiple_blocks_allocation(...)) sits outside that try, so if the constructor throws the acquired blocks are never returned; move the creation of the multiple_blocks_allocation object inside the same try (or wrap the constructor call in the same try) and, in the catch, call self.deallocate_blocks_async_unsafe(std::move(blocks), stream) (using RMM_CUDA_TRY as currently done) to ensure all blocks acquired via get_block/get_block_size are returned on any failure; reference the blocks vector, multiple_blocks_allocation constructor, get_block/get_block_size, deallocate_blocks_async_unsafe, stream_event, stream and mr when making the change.cpp/include/rmm/mr/fixed_size_memory_resource.hpp (1)
91-99:⚠️ Potential issue | 🟡 MinorDocument the thread-safety contract explicitly.
The public docs explain lifetime, but they still do not say whether distinct handles may be cleared/destroyed concurrently or whether a single handle requires external synchronization. Please make that contract explicit in the class/destructor docs.
As per coding guidelines, "Document thread-safety guarantees in memory resource class documentation (Doxygen); specify which operations are thread-safe."
Also applies to: 126-132
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/rmm/mr/fixed_size_memory_resource.hpp` around lines 91 - 99, Update the Doxygen for the RAII handle and the fixed_size_memory_resource to state the thread-safety contract: specify whether distinct handle instances (the RAII allocation handle) may be destroyed/cleared concurrently without external synchronization and whether operations on a single handle (including its destructor and any clear/release method) are safe only with external synchronization; also document that the underlying fixed_size_memory_resource permits concurrent allocations/frees across different streams if applicable or requires external synchronization. Reference the RAII handle class/destructor and fixed_size_memory_resource types in the comments so readers can find the guarantees quickly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/tests/mr/fixed_size_mr_test.cpp`:
- Around line 141-151: The deallocation loop serializes teardown because
handles.pop_back() destroys the unique_ptr while holding handles_mutex; change
the loop inside the async task so it pops/moves the unique_ptr out while holding
handles_mutex (e.g., move handles.back() into a local variable and pop_back),
then release the lock and let the unique_ptr go out of scope (destroy) outside
the critical section; update the lambda used by dealloc_futs to use
handles_mutex, move the handle to a local variable before unlocking, and perform
destruction (letting multiple_blocks_allocation teardown run) outside the mutex
to allow true concurrent deallocation.
---
Duplicate comments:
In `@cpp/include/rmm/mr/fixed_size_memory_resource.hpp`:
- Around line 91-99: Update the Doxygen for the RAII handle and the
fixed_size_memory_resource to state the thread-safety contract: specify whether
distinct handle instances (the RAII allocation handle) may be destroyed/cleared
concurrently without external synchronization and whether operations on a single
handle (including its destructor and any clear/release method) are safe only
with external synchronization; also document that the underlying
fixed_size_memory_resource permits concurrent allocations/frees across different
streams if applicable or requires external synchronization. Reference the RAII
handle class/destructor and fixed_size_memory_resource types in the comments so
readers can find the guarantees quickly.
In `@cpp/src/mr/fixed_size_memory_resource.cpp`:
- Around line 111-122: The allocation of blocks is protected by a try/catch but
the subsequent construction of multiple_blocks_allocation(new
multiple_blocks_allocation(...)) sits outside that try, so if the constructor
throws the acquired blocks are never returned; move the creation of the
multiple_blocks_allocation object inside the same try (or wrap the constructor
call in the same try) and, in the catch, call
self.deallocate_blocks_async_unsafe(std::move(blocks), stream) (using
RMM_CUDA_TRY as currently done) to ensure all blocks acquired via
get_block/get_block_size are returned on any failure; reference the blocks
vector, multiple_blocks_allocation constructor, get_block/get_block_size,
deallocate_blocks_async_unsafe, stream_event, stream and mr when making the
change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 359d9832-10c9-4ee9-950d-787128b95b9c
📒 Files selected for processing (3)
cpp/include/rmm/mr/fixed_size_memory_resource.hppcpp/src/mr/fixed_size_memory_resource.cppcpp/tests/mr/fixed_size_mr_test.cpp
Summary
multiple_blocks_allocation, an RAII class that allocates device memory spanning one or more fixed-size blocks from afixed_size_memory_resourcein a single stream-ordered call.multiple_blocks_allocation::make_async(mr, size, stream)as the sole public factory, which acquires the required number of blocks under a single mutex lock and records a CUDA event for correct stream ordering.deallocate_blocks_async_unsafepath (caller holds the mutex) — avoiding a second lock round-trip.get_event/get_blocktoprotectedinstream_ordered_memory_resourceso thatfixed_size_memory_resource_implcan grant friend access tomultiple_blocks_allocation.cpp/src/mr/fixed_size_memory_resource.cppto keep the public header lean.fixed_size_mr_test.cppwhich exercises allocation sizes of 0, sub-block, exact-block, and multi-block acrosscuda_memory_resourceandcuda_async_memory_resourceupstreams, usingstatistics_resource_adaptorto verify byte counts.Test plan
FIXED_SIZE_MR_TEST(all parameter combinations pass).fixed_size_mrtests still pass.Made with Cursor