Fix pool_memory_resource crash during process exit#2367
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds process-exit detection and a registration hook; updates several memory-resource and CUDA cleanup paths to skip CUDA runtime calls during process shutdown and registers the hook at per-device static map initialization. Also adds the new source file to the CMake build. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/rmm/process_is_exiting.hpp`:
- Around line 36-37: The documentation for rmm::process_is_exiting() incorrectly
states it performs a "single relaxed atomic load" while the implementation uses
std::memory_order_acquire; update the docstring to reflect the actual memory
ordering (acquire) so the contract matches the code, e.g., state that it
performs a single atomic load with acquire semantics and never calls into CUDA,
leaving rmm::process_is_exiting() as safe to call from destructors.
In `@cpp/src/runtime_shutdown.cpp`:
- Around line 36-38: When registering the atexit handler inside the
registered.compare_exchange_strong block, check the return value of std::atexit
and handle failures: call std::atexit(...) and store its result; if it returns
non-zero (failure) then immediately set exiting_flag().store(true,
std::memory_order_release) (and optionally reset registered to false or log the
error) so the shutdown guard still becomes effective even if registration
failed. Reference the symbols registered, compare_exchange_strong, std::atexit,
and exiting_flag to locate where to add the check and fallback.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 127c17cd-0492-489d-85cc-08507e0ffab1
📒 Files selected for processing (8)
cpp/CMakeLists.txtcpp/include/rmm/detail/runtime_shutdown.hppcpp/include/rmm/mr/detail/stream_ordered_memory_resource.hppcpp/include/rmm/mr/per_device_resource.hppcpp/include/rmm/process_is_exiting.hppcpp/src/mr/detail/pool_memory_resource_impl.cppcpp/src/runtime_shutdown.cppdocs/cpp/memory_resources/memory_resources.md
cdca11f to
d522416
Compare
| @@ -1,5 +1,50 @@ | |||
| # Memory Resources | |||
|
|
|||
| ## Authoring custom memory resources | |||
There was a problem hiding this comment.
We will want to move this to the user guide (#2087).
pool_memory_resource's destructor unconditionally calls cudaEventSynchronize and cudaEventDestroy from stream_ordered_memory_resource::release(). When an owning pool_memory_resource is held in the static per-device resource map returned by detail::get_ref_map(), the destructor runs at exit() time after the CUDA primary context has been destroyed. At that point CUDA calls dereference released context state inside libcuda and crash, producing intermittent SIGSEGV at process exit. Introduce rmm::process_is_exiting(), a public API that resource destructors can consult to detect this condition. It is backed by an atexit callback registered the first time the per-device resource map is constructed; atexit/__cxa_atexit LIFO ordering ensures the callback runs before the map's destructor during teardown. stream_ordered_memory_resource::release() and pool_memory_resource_impl::release() both skip their CUDA calls when process_is_exiting() returns true, allowing associated state to leak; the OS reclaims it when the process exits. Document the memory-resource author contract in the new public header and in docs/cpp/memory_resources/memory_resources.md: all memory resources must be safe to destroy at shutdown, either by not calling CUDA APIs from destructors or by consulting rmm::process_is_exiting() and skipping CUDA calls when it returns true.
d522416 to
8660942
Compare
In response to PR rapidsai#2367 review: the requirement that resources be safe to destroy at process shutdown only applies to resources that are going to be held in the per-device resource map, so the contract belongs on the setter functions. Add a @note to each of set_per_device_resource, set_per_device_resource_ref, set_current_device_resource, and set_current_device_resource_ref stating the requirement and pointing at rmm::process_is_exiting(). Remove the duplicate narrative from docs/cpp/memory_resources/memory_resources.md; the page now only renders the doxygen memory_resources group, which includes the updated setters. Also align the rmm::process_is_exiting() docstring wording with the implementation's acquire memory ordering (per coderabbit review comment).
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/include/rmm/mr/per_device_resource.hpp`:
- Around line 191-196: The doc comment around the per-device resource "_ref"
setters (references to new_resource_ref and the per-device resource map) is
misleading about lifetime; update the Doxygen text in per_device_resource.hpp
(including the related block around lines 289-294) to explicitly state the map
only stores a non-owning reference wrapper and does not extend or manage the
referenced object's lifetime, and that the shutdown-safety requirement applies
when the referenced resource instance itself may outlive CUDA teardown; keep the
required guidance that resource destructors must avoid calling CUDA APIs or must
check rmm::process_is_exiting() before making CUDA calls.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 6d825c47-6c27-407f-b15a-f923c7be9404
📒 Files selected for processing (2)
cpp/include/rmm/mr/per_device_resource.hppcpp/include/rmm/process_is_exiting.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
- cpp/include/rmm/process_is_exiting.hpp
896cdcc to
68ebf54
Compare
Two binaries: - PROCESS_IS_EXITING_TEST (gtest): verifies process_is_exiting() returns false during normal execution, is idempotent, and that register_process_exit_hook() can be called repeatedly without effect. - PROCESS_IS_EXITING_SHUTDOWN_TEST (standalone, no gtest): exercises the C++ std::atexit sequencing guarantee the fix depends on. main() registers a C atexit checker first and then calls register_process_exit_hook(), so at termination the flag-setter runs before the checker. The checker calls _Exit(1) if the flag is not set; CTest verifies exit status 0. Neither test touches CUDA.
| // The map's destructor may run owning resources' destructors during process exit, when calling | ||
| // into the CUDA runtime is undefined behavior. Register the process-exit hook here, right after | ||
| // the map is constructed: the C++ standard guarantees that an atexit callback registered after | ||
| // a static object's construction runs before that object's destructor at termination. | ||
| rmm::detail::register_process_exit_hook(); |
There was a problem hiding this comment.
If we didn't want to use an atexit hook we could instead have an object that modifies a static variable. That might be simpler?
namespace {
struct ShutdownFlag {
~ShutdownFlag() { flag.store(true, std::memory_order_release); }
static std::atomic<bool> flag;
}
bool ShutdownFlag::flag = false;
}
namespace detail {
bool is_shutting_down { return ShutdownFlag::flag.load(std::memory_order_acquire); }
}
// and now here;
{
static std::map<...> device_id_to_resource;
static shutdownobj = ShutdownFlag{};
~ShutdownFlag() will run before ~device_id_to_resource() as required.
There was a problem hiding this comment.
I guess that approach doesn't allow you to detect shutdown in cuda_stream dtors. However, arguably I think that is not necessary. cuda_streams are never kept alive by static lifetime objects in RMM. If we migrate to CCCL cuda::stream they're not going to do this dance we're doing here
| /** | ||
| * @brief Register the atexit callback that flips the flag observed by `rmm::process_is_exiting()`. | ||
| * | ||
| * Idempotent: safe to call from multiple places; only the first call registers the callback. |
There was a problem hiding this comment.
This statement is not true. I am trying to decide how much we should care about actually fixing the problem there.
A single atexit hook is not sufficient here. Consider the following code:
struct X {
~X() { cudaFree(0); }
};
void f() {
static X x;
register_process_exit_hook();
}
void g() {
static X y;
register_process_exit_hook();
}
int main() {
f();
g();
}When g runs the atexit handler has already been created, so it will not reregister. That means that the atexit handler will be sequenced before the destruction of y, so the destruction of y is unsafe.
@wence- asked about the current mixed public/private state of the function below. If we actually intend to support the above case, we need this to be a publicly exposed function in a header, and rather than storing a single value it needs to store one lock per handler. For example you could push a new atomic into a vector each time and return that index so that process_is_exiting could instead be atexit_handler_has_run(int index). That raises another more subtle problem, which is that the standard limits the number of atexit handlers to 32.
We need to decide the scope of this functionality. If it's only use case is for rmm's own per-device map, then we should hide the function. OTOH if we anticipate users storing memory resources in other objects with static scope Then we need to make sure that they have access as well. In that case, we have to worry about the total number of atexit handlers that are registered, although I don't think we have to support over 32.
There was a problem hiding this comment.
We need to decide the scope of this functionality. If it's only use case is for rmm's own per-device map, then we should hide the function. OTOH if we anticipate users storing memory resources in other objects with static scope Then we need to make sure that they have access as well. In that case, we have to worry about the total number of atexit handlers that are registered, although I don't think we have to support over 32.
My take: we document it is UB to store RMM objects with static (or thread-local) scope.
We do all this dance privately just for the per-device map until we can kill it with fire.
There was a problem hiding this comment.
That raises another more subtle problem, which is that the standard limits the number of atexit handlers to 32.
Emphasis added:
The implementation is guaranteed to support the registration of at least 32 functions. The exact limit is implementation-defined.
https://en.cppreference.com/cpp/utility/program/atexit
To verify,
long limit = sysconf(_SC_ATEXIT_MAX);gives me 2147483647. I don't expect this cap to affect us.
There was a problem hiding this comment.
Yes, I realize that most implementations set a much higher value. I wouldn't want to rely on that in the case where we we're providing a new atexit handler for every other object. In any case, as long as we're scoping this for internal use only with the per-device map I'm fine leaving it as-is.
cf5c648 to
520d6f9
Compare
wence-
left a comment
There was a problem hiding this comment.
I thought we wouldn't want process_is_exiting in the public API but I realise we need it so that downstream authors of mrs can DTRT.
vyasr
left a comment
There was a problem hiding this comment.
Some improvements can be made to the comments but the implementation looks fine now.
|
/merge |
Description
Closes #2366.
Fixes the intermittent SIGSEGV at process exit when a stateful memory resource owned by RMM's internal per-device resource map is destroyed after the CUDA primary context has already been torn down.
Root cause
pool_memory_resource's destructor callsstream_ordered_memory_resource::release(), which callscudaEventSynchronize/cudaEventDestroyon tracked events. When the pool is stored by value in the staticstd::mapreturned bydetail::get_ref_map(), its destructor runs during process termination. At that point CUDA runtime/driver calls are unsafe: in Release builds cudart can forward into libcuda, where the driver dereferences destroyed context state and SIGSEGVs; in Debug builds cudart may return shutdown-related errors such ascudaErrorContextIsDestroyed.The investigation in #2366 also showed that
cudaEventQueryis not a safe probe here: it can crash the same way ascudaEventSynchronizeonce context state is gone.Fix
Add an internal process-exit hook for RMM's per-device resource map and a public query,
rmm::process_is_exiting(), for RMM memory resource destructors installed in that map.The C++ standard (post-C++11, cppreference
std::exit) guarantees:get_ref_map()constructs the static map and then immediately callsrmm::detail::register_process_exit_hook(), which registers a singlestd::atexitcallback. By the sequencing rule above, the callback runs before the map destructor and flips the flag observed byrmm::process_is_exiting().This intentionally does not support arbitrary user-created static or thread-local containers of RMM objects. The supported case is RMM's internal per-device resource map. The setter docstrings now explicitly say that any resource installed in the map whose destructor may call CUDA APIs must consult
rmm::process_is_exiting()and skip those calls when it returnstrue.What changed
cpp/include/rmm/process_is_exiting.hpp: new public query API and documentation for shutdown-sensitive destructors.cpp/include/rmm/detail/runtime_shutdown.hppandcpp/src/runtime_shutdown.cpp: internal hook registration and flag implementation.cpp/include/rmm/mr/per_device_resource.hpp: registers the exit hook immediately after the static map is constructed, and documents the destructor requirements on the fourset_*_device_resource*setters.cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp: skips event cleanup during process exit.cpp/src/mr/detail/pool_memory_resource_impl.cpp: skips upstream block deallocation during process exit.cpp/src/mr/detail/fixed_size_memory_resource_impl.cpp: skips upstream block deallocation during process exit.cpp/src/mr/detail/cuda_async_memory_resource_impl.cpp: skipscudaMemPoolDestroyduring process exit.cpp/tests/process_is_exiting_tests.cpp: basic public API tests.cpp/tests/process_is_exiting_shutdown_test.cpp: standalone process-lifecycle test proving that the flag is set before a checker registered beforeget_ref_map()runs at process termination.Verification
Reproducer from #2366:
./cpp/build/latest/benchmarks/STRINGS_NVBENCH --profile --benchmark 0 --devices 0from rapidsai/cudf.Local tests run during development:
PROCESS_IS_EXITING_TESTPROCESS_IS_EXITING_PTDS_TESTPROCESS_IS_EXITING_SHUTDOWN_TESTPOOL_MR_TEST,POOL_MR_PTDS_TEST,FIXED_SIZE_MR_REF_TEST,CUDA_ASYNC_MR_SHARED_CUDART_TEST,CUDA_ASYNC_MR_STATIC_CUDART_TEST, andERROR_MACROS_TESTChecklist