Fix pool_memory_resource crash during process exit by bdice · Pull Request #2367 · rapidsai/rmm

bdice · 2026-04-23T02:59:15Z

Description

Closes #2366.

Fixes the intermittent SIGSEGV at process exit when a stateful memory resource owned by RMM's internal per-device resource map is destroyed after the CUDA primary context has already been torn down.

Root cause

pool_memory_resource's destructor calls stream_ordered_memory_resource::release(), which calls cudaEventSynchronize / cudaEventDestroy on tracked events. When the pool is stored by value in the static std::map returned by detail::get_ref_map(), its destructor runs during process termination. At that point CUDA runtime/driver calls are unsafe: in Release builds cudart can forward into libcuda, where the driver dereferences destroyed context state and SIGSEGVs; in Debug builds cudart may return shutdown-related errors such as cudaErrorContextIsDestroyed.

The investigation in #2366 also showed that cudaEventQuery is not a safe probe here: it can crash the same way as cudaEventSynchronize once context state is gone.

Fix

Add an internal process-exit hook for RMM's per-device resource map and a public query, rmm::process_is_exiting(), for RMM memory resource destructors installed in that map.

The C++ standard (post-C++11, cppreference std::exit) guarantees:

If the completion of the initialization of a static object A was sequenced-before the call to std::atexit for some function F, the call to F during termination is sequenced-before the start of the destruction of A.

get_ref_map() constructs the static map and then immediately calls rmm::detail::register_process_exit_hook(), which registers a single std::atexit callback. By the sequencing rule above, the callback runs before the map destructor and flips the flag observed by rmm::process_is_exiting().

This intentionally does not support arbitrary user-created static or thread-local containers of RMM objects. The supported case is RMM's internal per-device resource map. The setter docstrings now explicitly say that any resource installed in the map whose destructor may call CUDA APIs must consult rmm::process_is_exiting() and skip those calls when it returns true.

What changed

cpp/include/rmm/process_is_exiting.hpp: new public query API and documentation for shutdown-sensitive destructors.
cpp/include/rmm/detail/runtime_shutdown.hpp and cpp/src/runtime_shutdown.cpp: internal hook registration and flag implementation.
cpp/include/rmm/mr/per_device_resource.hpp: registers the exit hook immediately after the static map is constructed, and documents the destructor requirements on the four set_*_device_resource* setters.
cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp: skips event cleanup during process exit.
cpp/src/mr/detail/pool_memory_resource_impl.cpp: skips upstream block deallocation during process exit.
cpp/src/mr/detail/fixed_size_memory_resource_impl.cpp: skips upstream block deallocation during process exit.
cpp/src/mr/detail/cuda_async_memory_resource_impl.cpp: skips cudaMemPoolDestroy during process exit.
cpp/tests/process_is_exiting_tests.cpp: basic public API tests.
cpp/tests/process_is_exiting_shutdown_test.cpp: standalone process-lifecycle test proving that the flag is set before a checker registered before get_ref_map() runs at process termination.

Verification

Reproducer from #2366: ./cpp/build/latest/benchmarks/STRINGS_NVBENCH --profile --benchmark 0 --devices 0 from rapidsai/cudf.

Before: ~60–80% segfault rate at process exit.
After (Release): 0 failures in 50 consecutive runs.
After (Debug): 0 failures in 80 consecutive runs (30 + 50).

Local tests run during development:

PROCESS_IS_EXITING_TEST
PROCESS_IS_EXITING_PTDS_TEST
PROCESS_IS_EXITING_SHUTDOWN_TEST
relevant MR smoke tests including POOL_MR_TEST, POOL_MR_PTDS_TEST, FIXED_SIZE_MR_REF_TEST, CUDA_ASYNC_MR_SHARED_CUDART_TEST, CUDA_ASYNC_MR_STATIC_CUDART_TEST, and ERROR_MACROS_TEST

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-04-23T02:59:18Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-04-23T03:04:10Z

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

@coderabbitai resume to resume automatic reviews.
@coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

▶️ Resume reviews
🔍 Trigger review

📝 Walkthrough

Walkthrough

Adds process-exit detection and a registration hook; updates several memory-resource and CUDA cleanup paths to skip CUDA runtime calls during process shutdown and registers the hook at per-device static map initialization. Also adds the new source file to the CMake build.

Changes

Cohort / File(s)	Summary
Build `cpp/CMakeLists.txt`	Adds `src/runtime_shutdown.cpp` to the `rmm` library sources.
Process-exit infra `cpp/include/rmm/process_is_exiting.hpp`, `cpp/include/rmm/detail/runtime_shutdown.hpp`, `cpp/src/runtime_shutdown.cpp`	Adds `rmm::process_is_exiting()` and `detail::register_process_exit_hook()` implementations using an atomic flag and `std::atexit`.
Per-device resource handling `cpp/include/rmm/mr/per_device_resource.hpp`	Registers the process-exit hook when the static per-device map is initialized; docs updated to require shutdown-safe resource destructors.
Stream-ordered resource cleanup `cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp`	`release()` skips CUDA event synchronization/destruction when process is exiting; still clears internal containers.
Pool / fixed-size / async MR changes `cpp/src/mr/detail/pool_memory_resource_impl.cpp`, `cpp/src/mr/detail/fixed_size_memory_resource_impl.cpp`, `cpp/src/mr/detail/cuda_async_memory_resource_impl.cpp`	Upstream deallocation and CUDA pool destruction are skipped during process exit (guarded by `process_is_exiting()`); added includes for the new header.
CUDA stream cleanup `cpp/src/cuda_stream.cpp`	Destructor avoids calling `cudaStreamDestroy` when `process_is_exiting()` indicates shutdown.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug, non-breaking

Suggested reviewers

rongou
harrism

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'Fix pool_memory_resource crash during process exit' clearly and specifically summarizes the main objective of the PR—fixing a crash in pool_memory_resource during process shutdown.
Linked Issues check	✅ Passed	The PR fully addresses the requirements from issue `#2366`: introduces rmm::process_is_exiting() for safe shutdown detection, guards all CUDA API calls in destructors, documents the memory-resource author contract, and verifies the fix eliminates the crash.
Out of Scope Changes check	✅ Passed	All changes are directly in-scope: new shutdown infrastructure (process_is_exiting.hpp, runtime_shutdown files), guarded CUDA calls in destructors across memory resources, documentation updates, and CMakeLists.txt integration—all necessary to fix the process-exit crash.
Description check	✅ Passed	The description comprehensively addresses the changeset, explaining both the root cause (SIGSEGV at process exit) and the implemented fix (process-exit hook and public query API).

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/rmm/process_is_exiting.hpp`:
- Around line 36-37: The documentation for rmm::process_is_exiting() incorrectly
states it performs a "single relaxed atomic load" while the implementation uses
std::memory_order_acquire; update the docstring to reflect the actual memory
ordering (acquire) so the contract matches the code, e.g., state that it
performs a single atomic load with acquire semantics and never calls into CUDA,
leaving rmm::process_is_exiting() as safe to call from destructors.

In `@cpp/src/runtime_shutdown.cpp`:
- Around line 36-38: When registering the atexit handler inside the
registered.compare_exchange_strong block, check the return value of std::atexit
and handle failures: call std::atexit(...) and store its result; if it returns
non-zero (failure) then immediately set exiting_flag().store(true,
std::memory_order_release) (and optionally reset registered to false or log the
error) so the shutdown guard still becomes effective even if registration
failed. Reference the symbols registered, compare_exchange_strong, std::atexit,
and exiting_flag to locate where to add the check and fallback.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 127c17cd-0492-489d-85cc-08507e0ffab1

📥 Commits

Reviewing files that changed from the base of the PR and between 386f76d and cdca11f.

📒 Files selected for processing (8)

cpp/CMakeLists.txt
cpp/include/rmm/detail/runtime_shutdown.hpp
cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp
cpp/include/rmm/mr/per_device_resource.hpp
cpp/include/rmm/process_is_exiting.hpp
cpp/src/mr/detail/pool_memory_resource_impl.cpp
cpp/src/runtime_shutdown.cpp
docs/cpp/memory_resources/memory_resources.md

bdice · 2026-04-23T03:18:58Z

@@ -1,5 +1,50 @@
 # Memory Resources

+## Authoring custom memory resources


We will want to move this to the user guide (#2087).

pool_memory_resource's destructor unconditionally calls cudaEventSynchronize and cudaEventDestroy from stream_ordered_memory_resource::release(). When an owning pool_memory_resource is held in the static per-device resource map returned by detail::get_ref_map(), the destructor runs at exit() time after the CUDA primary context has been destroyed. At that point CUDA calls dereference released context state inside libcuda and crash, producing intermittent SIGSEGV at process exit. Introduce rmm::process_is_exiting(), a public API that resource destructors can consult to detect this condition. It is backed by an atexit callback registered the first time the per-device resource map is constructed; atexit/__cxa_atexit LIFO ordering ensures the callback runs before the map's destructor during teardown. stream_ordered_memory_resource::release() and pool_memory_resource_impl::release() both skip their CUDA calls when process_is_exiting() returns true, allowing associated state to leak; the OS reclaims it when the process exits. Document the memory-resource author contract in the new public header and in docs/cpp/memory_resources/memory_resources.md: all memory resources must be safe to destroy at shutdown, either by not calling CUDA APIs from destructors or by consulting rmm::process_is_exiting() and skipping CUDA calls when it returns true.

@note

In response to PR rapidsai#2367 review: the requirement that resources be safe to destroy at process shutdown only applies to resources that are going to be held in the per-device resource map, so the contract belongs on the setter functions. Add a @note to each of set_per_device_resource, set_per_device_resource_ref, set_current_device_resource, and set_current_device_resource_ref stating the requirement and pointing at rmm::process_is_exiting(). Remove the duplicate narrative from docs/cpp/memory_resources/memory_resources.md; the page now only renders the doxygen memory_resources group, which includes the updated setters. Also align the rmm::process_is_exiting() docstring wording with the implementation's acquire memory ordering (per coderabbit review comment).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/rmm/mr/per_device_resource.hpp`:
- Around line 191-196: The doc comment around the per-device resource "_ref"
setters (references to new_resource_ref and the per-device resource map) is
misleading about lifetime; update the Doxygen text in per_device_resource.hpp
(including the related block around lines 289-294) to explicitly state the map
only stores a non-owning reference wrapper and does not extend or manage the
referenced object's lifetime, and that the shutdown-safety requirement applies
when the referenced resource instance itself may outlive CUDA teardown; keep the
required guidance that resource destructors must avoid calling CUDA APIs or must
check rmm::process_is_exiting() before making CUDA calls.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6d825c47-6c27-407f-b15a-f923c7be9404

📥 Commits

Reviewing files that changed from the base of the PR and between 8660942 and 03d4805.

📒 Files selected for processing (2)

cpp/include/rmm/mr/per_device_resource.hpp
cpp/include/rmm/process_is_exiting.hpp

🚧 Files skipped from review as they are similar to previous changes (1)

cpp/include/rmm/process_is_exiting.hpp

Two binaries: - PROCESS_IS_EXITING_TEST (gtest): verifies process_is_exiting() returns false during normal execution, is idempotent, and that register_process_exit_hook() can be called repeatedly without effect. - PROCESS_IS_EXITING_SHUTDOWN_TEST (standalone, no gtest): exercises the C++ std::atexit sequencing guarantee the fix depends on. main() registers a C atexit checker first and then calls register_process_exit_hook(), so at termination the flag-setter runs before the checker. The checker calls _Exit(1) if the flag is not set; CTest verifies exit status 0. Neither test touches CUDA.

wence- · 2026-04-23T08:33:36Z

+  // The map's destructor may run owning resources' destructors during process exit, when calling
+  // into the CUDA runtime is undefined behavior. Register the process-exit hook here, right after
+  // the map is constructed: the C++ standard guarantees that an atexit callback registered after
+  // a static object's construction runs before that object's destructor at termination.
+  rmm::detail::register_process_exit_hook();


If we didn't want to use an atexit hook we could instead have an object that modifies a static variable. That might be simpler?

namespace { struct ShutdownFlag { ~ShutdownFlag() { flag.store(true, std::memory_order_release); } static std::atomic<bool> flag; } bool ShutdownFlag::flag = false; } namespace detail { bool is_shutting_down { return ShutdownFlag::flag.load(std::memory_order_acquire); } } // and now here; { static std::map<...> device_id_to_resource; static shutdownobj = ShutdownFlag{};

~ShutdownFlag() will run before ~device_id_to_resource() as required.

I guess that approach doesn't allow you to detect shutdown in cuda_stream dtors. However, arguably I think that is not necessary. cuda_streams are never kept alive by static lifetime objects in RMM. If we migrate to CCCL cuda::stream they're not going to do this dance we're doing here

Streams were fixed in 561ce68.

vyasr · 2026-04-23T17:20:36Z

+/**
+ * @brief Register the atexit callback that flips the flag observed by `rmm::process_is_exiting()`.
+ *
+ * Idempotent: safe to call from multiple places; only the first call registers the callback.


This statement is not true. I am trying to decide how much we should care about actually fixing the problem there.

A single atexit hook is not sufficient here. Consider the following code:

struct X { ~X() { cudaFree(0); } }; void f() { static X x; register_process_exit_hook(); } void g() { static X y; register_process_exit_hook(); } int main() { f(); g(); }

When g runs the atexit handler has already been created, so it will not reregister. That means that the atexit handler will be sequenced before the destruction of y, so the destruction of y is unsafe.

@wence- asked about the current mixed public/private state of the function below. If we actually intend to support the above case, we need this to be a publicly exposed function in a header, and rather than storing a single value it needs to store one lock per handler. For example you could push a new atomic into a vector each time and return that index so that process_is_exiting could instead be atexit_handler_has_run(int index). That raises another more subtle problem, which is that the standard limits the number of atexit handlers to 32.

We need to decide the scope of this functionality. If it's only use case is for rmm's own per-device map, then we should hide the function. OTOH if we anticipate users storing memory resources in other objects with static scope Then we need to make sure that they have access as well. In that case, we have to worry about the total number of atexit handlers that are registered, although I don't think we have to support over 32.

We need to decide the scope of this functionality. If it's only use case is for rmm's own per-device map, then we should hide the function. OTOH if we anticipate users storing memory resources in other objects with static scope Then we need to make sure that they have access as well. In that case, we have to worry about the total number of atexit handlers that are registered, although I don't think we have to support over 32.

My take: we document it is UB to store RMM objects with static (or thread-local) scope.

We do all this dance privately just for the per-device map until we can kill it with fire.

I noted the UB in 4a814f3.

That raises another more subtle problem, which is that the standard limits the number of atexit handlers to 32.

Emphasis added:

The implementation is guaranteed to support the registration of at least 32 functions. The exact limit is implementation-defined.

https://en.cppreference.com/cpp/utility/program/atexit

To verify,

long limit = sysconf(_SC_ATEXIT_MAX);

gives me 2147483647. I don't expect this cap to affect us.

Yes, I realize that most implementations set a much higher value. I wouldn't want to rely on that in the case where we we're providing a new atexit handler for every other object. In any case, as long as we're scoping this for internal use only with the per-device map I'm fine leaving it as-is.

wence-

I thought we wouldn't want process_is_exiting in the public API but I realise we need it so that downstream authors of mrs can DTRT.

vyasr

Some improvements can be made to the comments but the implementation looks fine now.

bdice · 2026-04-30T13:38:17Z

/merge

github-project-automation Bot added this to RMM Project Board Apr 23, 2026

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread cpp/include/rmm/process_is_exiting.hpp Outdated

Comment thread cpp/src/runtime_shutdown.cpp Outdated

bdice force-pushed the fix-pool-mr-shutdown-crash branch from cdca11f to d522416 Compare April 23, 2026 03:05

bdice commented Apr 23, 2026

View reviewed changes

bdice force-pushed the fix-pool-mr-shutdown-crash branch from d522416 to 8660942 Compare April 23, 2026 03:26

coderabbitai Bot reviewed Apr 23, 2026

View reviewed changes

Comment thread cpp/include/rmm/mr/per_device_resource.hpp Outdated

bdice self-assigned this Apr 23, 2026

bdice added bug Something isn't working non-breaking Non-breaking change labels Apr 23, 2026

bdice moved this to In Progress in RMM Project Board Apr 23, 2026

bdice marked this pull request as ready for review April 23, 2026 04:18

bdice requested review from a team as code owners April 23, 2026 04:18

bdice requested review from harrism and vyasr April 23, 2026 04:18

Match rmm doxygen style

68ebf54

bdice force-pushed the fix-pool-mr-shutdown-crash branch from 896cdcc to 68ebf54 Compare April 23, 2026 04:18

wence- requested changes Apr 23, 2026

View reviewed changes

github-project-automation Bot moved this from In Progress to Review in RMM Project Board Apr 23, 2026

vyasr requested changes Apr 23, 2026

View reviewed changes

vyasr mentioned this pull request Apr 23, 2026

Debug CI issues following CCCL migration rapidsai/cudf#22244

Closed

3 tasks

bdice added 4 commits April 24, 2026 11:15

Narrow shutdown guards and simplify cleanup paths

561ce68

Simplify runtime shutdown registration

573dd78

Narrow shutdown docs and strengthen tests

6ecc14b

Document shutdown support is limited to the per-device map

4a814f3

Abort if atexit registration fails

520d6f9

bdice force-pushed the fix-pool-mr-shutdown-crash branch from cf5c648 to 520d6f9 Compare April 27, 2026 21:49

bdice requested review from vyasr and wence- April 27, 2026 22:03

wence- approved these changes Apr 28, 2026

View reviewed changes

vyasr approved these changes Apr 29, 2026

View reviewed changes

Comment thread cpp/tests/process_is_exiting_shutdown_test.cpp Outdated

Comment thread cpp/tests/process_is_exiting_shutdown_test.cpp Outdated

Comment thread cpp/tests/process_is_exiting_shutdown_test.cpp Outdated

Address reviews

963e54a

rapids-bot Bot merged commit 47910f9 into rapidsai:main Apr 30, 2026
85 checks passed

github-project-automation Bot moved this from Review to Done in RMM Project Board Apr 30, 2026

		@@ -1,5 +1,50 @@
		# Memory Resources

		## Authoring custom memory resources

Conversation

bdice commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Root cause

Fix

What changed

Verification

Checklist

Uh oh!

copy-pr-bot Bot commented Apr 23, 2026

Uh oh!

coderabbitai Bot commented Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Reviews paused

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

bdice Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

wence- Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wence- Apr 23, 2026

Choose a reason for hiding this comment

Uh oh!

bdice Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

vyasr Apr 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

wence- Apr 24, 2026

Choose a reason for hiding this comment

Uh oh!

bdice Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

bdice Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

vyasr Apr 29, 2026

Choose a reason for hiding this comment

Uh oh!

wence- left a comment

Choose a reason for hiding this comment

Uh oh!

vyasr left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

bdice commented Apr 23, 2026 •

edited

Loading

coderabbitai Bot commented Apr 23, 2026 •

edited

Loading

wence- Apr 23, 2026 •

edited

Loading

bdice Apr 27, 2026 •

edited

Loading

vyasr Apr 23, 2026 •

edited

Loading

bdice Apr 27, 2026 •

edited

Loading

bdice Apr 27, 2026 •

edited

Loading