Skip to content

Fix pool_memory_resource crash during process exit#2367

Merged
rapids-bot[bot] merged 10 commits into
rapidsai:mainfrom
bdice:fix-pool-mr-shutdown-crash
Apr 30, 2026
Merged

Fix pool_memory_resource crash during process exit#2367
rapids-bot[bot] merged 10 commits into
rapidsai:mainfrom
bdice:fix-pool-mr-shutdown-crash

Conversation

@bdice
Copy link
Copy Markdown
Collaborator

@bdice bdice commented Apr 23, 2026

Description

Closes #2366.

Fixes the intermittent SIGSEGV at process exit when a stateful memory resource owned by RMM's internal per-device resource map is destroyed after the CUDA primary context has already been torn down.

Root cause

pool_memory_resource's destructor calls stream_ordered_memory_resource::release(), which calls cudaEventSynchronize / cudaEventDestroy on tracked events. When the pool is stored by value in the static std::map returned by detail::get_ref_map(), its destructor runs during process termination. At that point CUDA runtime/driver calls are unsafe: in Release builds cudart can forward into libcuda, where the driver dereferences destroyed context state and SIGSEGVs; in Debug builds cudart may return shutdown-related errors such as cudaErrorContextIsDestroyed.

The investigation in #2366 also showed that cudaEventQuery is not a safe probe here: it can crash the same way as cudaEventSynchronize once context state is gone.

Fix

Add an internal process-exit hook for RMM's per-device resource map and a public query, rmm::process_is_exiting(), for RMM memory resource destructors installed in that map.

The C++ standard (post-C++11, cppreference std::exit) guarantees:

If the completion of the initialization of a static object A was sequenced-before the call to std::atexit for some function F, the call to F during termination is sequenced-before the start of the destruction of A.

get_ref_map() constructs the static map and then immediately calls rmm::detail::register_process_exit_hook(), which registers a single std::atexit callback. By the sequencing rule above, the callback runs before the map destructor and flips the flag observed by rmm::process_is_exiting().

This intentionally does not support arbitrary user-created static or thread-local containers of RMM objects. The supported case is RMM's internal per-device resource map. The setter docstrings now explicitly say that any resource installed in the map whose destructor may call CUDA APIs must consult rmm::process_is_exiting() and skip those calls when it returns true.

What changed

  • cpp/include/rmm/process_is_exiting.hpp: new public query API and documentation for shutdown-sensitive destructors.
  • cpp/include/rmm/detail/runtime_shutdown.hpp and cpp/src/runtime_shutdown.cpp: internal hook registration and flag implementation.
  • cpp/include/rmm/mr/per_device_resource.hpp: registers the exit hook immediately after the static map is constructed, and documents the destructor requirements on the four set_*_device_resource* setters.
  • cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp: skips event cleanup during process exit.
  • cpp/src/mr/detail/pool_memory_resource_impl.cpp: skips upstream block deallocation during process exit.
  • cpp/src/mr/detail/fixed_size_memory_resource_impl.cpp: skips upstream block deallocation during process exit.
  • cpp/src/mr/detail/cuda_async_memory_resource_impl.cpp: skips cudaMemPoolDestroy during process exit.
  • cpp/tests/process_is_exiting_tests.cpp: basic public API tests.
  • cpp/tests/process_is_exiting_shutdown_test.cpp: standalone process-lifecycle test proving that the flag is set before a checker registered before get_ref_map() runs at process termination.

Verification

Reproducer from #2366: ./cpp/build/latest/benchmarks/STRINGS_NVBENCH --profile --benchmark 0 --devices 0 from rapidsai/cudf.

  • Before: ~60–80% segfault rate at process exit.
  • After (Release): 0 failures in 50 consecutive runs.
  • After (Debug): 0 failures in 80 consecutive runs (30 + 50).

Local tests run during development:

  • PROCESS_IS_EXITING_TEST
  • PROCESS_IS_EXITING_PTDS_TEST
  • PROCESS_IS_EXITING_SHUTDOWN_TEST
  • relevant MR smoke tests including POOL_MR_TEST, POOL_MR_PTDS_TEST, FIXED_SIZE_MR_REF_TEST, CUDA_ASYNC_MR_SHARED_CUDART_TEST, CUDA_ASYNC_MR_STATIC_CUDART_TEST, and ERROR_MACROS_TEST

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 23, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Apr 23, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds process-exit detection and a registration hook; updates several memory-resource and CUDA cleanup paths to skip CUDA runtime calls during process shutdown and registers the hook at per-device static map initialization. Also adds the new source file to the CMake build.

Changes

Cohort / File(s) Summary
Build
cpp/CMakeLists.txt
Adds src/runtime_shutdown.cpp to the rmm library sources.
Process-exit infra
cpp/include/rmm/process_is_exiting.hpp, cpp/include/rmm/detail/runtime_shutdown.hpp, cpp/src/runtime_shutdown.cpp
Adds rmm::process_is_exiting() and detail::register_process_exit_hook() implementations using an atomic flag and std::atexit.
Per-device resource handling
cpp/include/rmm/mr/per_device_resource.hpp
Registers the process-exit hook when the static per-device map is initialized; docs updated to require shutdown-safe resource destructors.
Stream-ordered resource cleanup
cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp
release() skips CUDA event synchronization/destruction when process is exiting; still clears internal containers.
Pool / fixed-size / async MR changes
cpp/src/mr/detail/pool_memory_resource_impl.cpp, cpp/src/mr/detail/fixed_size_memory_resource_impl.cpp, cpp/src/mr/detail/cuda_async_memory_resource_impl.cpp
Upstream deallocation and CUDA pool destruction are skipped during process exit (guarded by process_is_exiting()); added includes for the new header.
CUDA stream cleanup
cpp/src/cuda_stream.cpp
Destructor avoids calling cudaStreamDestroy when process_is_exiting() indicates shutdown.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug, non-breaking

Suggested reviewers

  • rongou
  • harrism
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 30.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title 'Fix pool_memory_resource crash during process exit' clearly and specifically summarizes the main objective of the PR—fixing a crash in pool_memory_resource during process shutdown.
Linked Issues check ✅ Passed The PR fully addresses the requirements from issue #2366: introduces rmm::process_is_exiting() for safe shutdown detection, guards all CUDA API calls in destructors, documents the memory-resource author contract, and verifies the fix eliminates the crash.
Out of Scope Changes check ✅ Passed All changes are directly in-scope: new shutdown infrastructure (process_is_exiting.hpp, runtime_shutdown files), guarded CUDA calls in destructors across memory resources, documentation updates, and CMakeLists.txt integration—all necessary to fix the process-exit crash.
Description check ✅ Passed The description comprehensively addresses the changeset, explaining both the root cause (SIGSEGV at process exit) and the implemented fix (process-exit hook and public query API).

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/rmm/process_is_exiting.hpp`:
- Around line 36-37: The documentation for rmm::process_is_exiting() incorrectly
states it performs a "single relaxed atomic load" while the implementation uses
std::memory_order_acquire; update the docstring to reflect the actual memory
ordering (acquire) so the contract matches the code, e.g., state that it
performs a single atomic load with acquire semantics and never calls into CUDA,
leaving rmm::process_is_exiting() as safe to call from destructors.

In `@cpp/src/runtime_shutdown.cpp`:
- Around line 36-38: When registering the atexit handler inside the
registered.compare_exchange_strong block, check the return value of std::atexit
and handle failures: call std::atexit(...) and store its result; if it returns
non-zero (failure) then immediately set exiting_flag().store(true,
std::memory_order_release) (and optionally reset registered to false or log the
error) so the shutdown guard still becomes effective even if registration
failed. Reference the symbols registered, compare_exchange_strong, std::atexit,
and exiting_flag to locate where to add the check and fallback.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 127c17cd-0492-489d-85cc-08507e0ffab1

📥 Commits

Reviewing files that changed from the base of the PR and between 386f76d and cdca11f.

📒 Files selected for processing (8)
  • cpp/CMakeLists.txt
  • cpp/include/rmm/detail/runtime_shutdown.hpp
  • cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp
  • cpp/include/rmm/mr/per_device_resource.hpp
  • cpp/include/rmm/process_is_exiting.hpp
  • cpp/src/mr/detail/pool_memory_resource_impl.cpp
  • cpp/src/runtime_shutdown.cpp
  • docs/cpp/memory_resources/memory_resources.md

Comment thread cpp/include/rmm/process_is_exiting.hpp Outdated
Comment thread cpp/src/runtime_shutdown.cpp Outdated
@bdice bdice force-pushed the fix-pool-mr-shutdown-crash branch from cdca11f to d522416 Compare April 23, 2026 03:05
@@ -1,5 +1,50 @@
# Memory Resources

## Authoring custom memory resources
Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We will want to move this to the user guide (#2087).

pool_memory_resource's destructor unconditionally calls cudaEventSynchronize
and cudaEventDestroy from stream_ordered_memory_resource::release(). When
an owning pool_memory_resource is held in the static per-device resource
map returned by detail::get_ref_map(), the destructor runs at exit() time
after the CUDA primary context has been destroyed. At that point CUDA
calls dereference released context state inside libcuda and crash,
producing intermittent SIGSEGV at process exit.

Introduce rmm::process_is_exiting(), a public API that resource destructors
can consult to detect this condition. It is backed by an atexit callback
registered the first time the per-device resource map is constructed;
atexit/__cxa_atexit LIFO ordering ensures the callback runs before the
map's destructor during teardown.

stream_ordered_memory_resource::release() and
pool_memory_resource_impl::release() both skip their CUDA calls when
process_is_exiting() returns true, allowing associated state to leak; the
OS reclaims it when the process exits.

Document the memory-resource author contract in the new public header and
in docs/cpp/memory_resources/memory_resources.md: all memory resources
must be safe to destroy at shutdown, either by not calling CUDA APIs from
destructors or by consulting rmm::process_is_exiting() and skipping CUDA
calls when it returns true.
@bdice bdice force-pushed the fix-pool-mr-shutdown-crash branch from d522416 to 8660942 Compare April 23, 2026 03:26
In response to PR rapidsai#2367 review: the requirement that resources be safe to
destroy at process shutdown only applies to resources that are going to be
held in the per-device resource map, so the contract belongs on the setter
functions. Add a @note to each of set_per_device_resource,
set_per_device_resource_ref, set_current_device_resource, and
set_current_device_resource_ref stating the requirement and pointing at
rmm::process_is_exiting(). Remove the duplicate narrative from
docs/cpp/memory_resources/memory_resources.md; the page now only renders
the doxygen memory_resources group, which includes the updated setters.

Also align the rmm::process_is_exiting() docstring wording with the
implementation's acquire memory ordering (per coderabbit review comment).
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/include/rmm/mr/per_device_resource.hpp`:
- Around line 191-196: The doc comment around the per-device resource "_ref"
setters (references to new_resource_ref and the per-device resource map) is
misleading about lifetime; update the Doxygen text in per_device_resource.hpp
(including the related block around lines 289-294) to explicitly state the map
only stores a non-owning reference wrapper and does not extend or manage the
referenced object's lifetime, and that the shutdown-safety requirement applies
when the referenced resource instance itself may outlive CUDA teardown; keep the
required guidance that resource destructors must avoid calling CUDA APIs or must
check rmm::process_is_exiting() before making CUDA calls.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 6d825c47-6c27-407f-b15a-f923c7be9404

📥 Commits

Reviewing files that changed from the base of the PR and between 8660942 and 03d4805.

📒 Files selected for processing (2)
  • cpp/include/rmm/mr/per_device_resource.hpp
  • cpp/include/rmm/process_is_exiting.hpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • cpp/include/rmm/process_is_exiting.hpp

Comment thread cpp/include/rmm/mr/per_device_resource.hpp Outdated
@bdice bdice self-assigned this Apr 23, 2026
@bdice bdice added bug Something isn't working non-breaking Non-breaking change labels Apr 23, 2026
@bdice bdice moved this to In Progress in RMM Project Board Apr 23, 2026
@bdice bdice marked this pull request as ready for review April 23, 2026 04:18
@bdice bdice requested review from a team as code owners April 23, 2026 04:18
@bdice bdice requested review from harrism and vyasr April 23, 2026 04:18
@bdice bdice force-pushed the fix-pool-mr-shutdown-crash branch from 896cdcc to 68ebf54 Compare April 23, 2026 04:18
Two binaries:

- PROCESS_IS_EXITING_TEST (gtest): verifies process_is_exiting() returns
  false during normal execution, is idempotent, and that
  register_process_exit_hook() can be called repeatedly without effect.
- PROCESS_IS_EXITING_SHUTDOWN_TEST (standalone, no gtest): exercises the
  C++ std::atexit sequencing guarantee the fix depends on. main() registers
  a C atexit checker first and then calls register_process_exit_hook(), so
  at termination the flag-setter runs before the checker. The checker calls
  _Exit(1) if the flag is not set; CTest verifies exit status 0.

Neither test touches CUDA.
Comment thread cpp/include/rmm/mr/detail/stream_ordered_memory_resource.hpp
Comment thread cpp/src/runtime_shutdown.cpp Outdated
Comment thread cpp/include/rmm/detail/runtime_shutdown.hpp
Comment thread cpp/src/runtime_shutdown.cpp
Comment on lines +80 to +84
// The map's destructor may run owning resources' destructors during process exit, when calling
// into the CUDA runtime is undefined behavior. Register the process-exit hook here, right after
// the map is constructed: the C++ standard guarantees that an atexit callback registered after
// a static object's construction runs before that object's destructor at termination.
rmm::detail::register_process_exit_hook();
Copy link
Copy Markdown
Contributor

@wence- wence- Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we didn't want to use an atexit hook we could instead have an object that modifies a static variable. That might be simpler?

namespace {
    struct ShutdownFlag {
        ~ShutdownFlag() { flag.store(true, std::memory_order_release); }
        static std::atomic<bool> flag;
    }
    bool ShutdownFlag::flag = false;
}
namespace detail {
    bool is_shutting_down { return ShutdownFlag::flag.load(std::memory_order_acquire); }
}
// and now here;
{
static std::map<...> device_id_to_resource;
static shutdownobj = ShutdownFlag{};

~ShutdownFlag() will run before ~device_id_to_resource() as required.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess that approach doesn't allow you to detect shutdown in cuda_stream dtors. However, arguably I think that is not necessary. cuda_streams are never kept alive by static lifetime objects in RMM. If we migrate to CCCL cuda::stream they're not going to do this dance we're doing here

Copy link
Copy Markdown
Collaborator Author

@bdice bdice Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Streams were fixed in 561ce68.

Comment thread cpp/include/rmm/process_is_exiting.hpp Outdated
Comment thread cpp/src/mr/detail/cuda_async_memory_resource_impl.cpp Outdated
Comment thread cpp/src/mr/detail/fixed_size_memory_resource_impl.cpp Outdated
Comment thread cpp/src/mr/detail/pool_memory_resource_impl.cpp Outdated
@github-project-automation github-project-automation Bot moved this from In Progress to Review in RMM Project Board Apr 23, 2026
/**
* @brief Register the atexit callback that flips the flag observed by `rmm::process_is_exiting()`.
*
* Idempotent: safe to call from multiple places; only the first call registers the callback.
Copy link
Copy Markdown
Contributor

@vyasr vyasr Apr 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This statement is not true. I am trying to decide how much we should care about actually fixing the problem there.

A single atexit hook is not sufficient here. Consider the following code:

struct X {
  ~X() { cudaFree(0); }
};

void f() {
  static X x;
  register_process_exit_hook();
}

void g() {
  static X y;
  register_process_exit_hook();
}

int main() {
  f();
  g();
}

When g runs the atexit handler has already been created, so it will not reregister. That means that the atexit handler will be sequenced before the destruction of y, so the destruction of y is unsafe.

@wence- asked about the current mixed public/private state of the function below. If we actually intend to support the above case, we need this to be a publicly exposed function in a header, and rather than storing a single value it needs to store one lock per handler. For example you could push a new atomic into a vector each time and return that index so that process_is_exiting could instead be atexit_handler_has_run(int index). That raises another more subtle problem, which is that the standard limits the number of atexit handlers to 32.

We need to decide the scope of this functionality. If it's only use case is for rmm's own per-device map, then we should hide the function. OTOH if we anticipate users storing memory resources in other objects with static scope Then we need to make sure that they have access as well. In that case, we have to worry about the total number of atexit handlers that are registered, although I don't think we have to support over 32.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to decide the scope of this functionality. If it's only use case is for rmm's own per-device map, then we should hide the function. OTOH if we anticipate users storing memory resources in other objects with static scope Then we need to make sure that they have access as well. In that case, we have to worry about the total number of atexit handlers that are registered, although I don't think we have to support over 32.

My take: we document it is UB to store RMM objects with static (or thread-local) scope.

We do all this dance privately just for the per-device map until we can kill it with fire.

Copy link
Copy Markdown
Collaborator Author

@bdice bdice Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I noted the UB in 4a814f3.

Copy link
Copy Markdown
Collaborator Author

@bdice bdice Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That raises another more subtle problem, which is that the standard limits the number of atexit handlers to 32.

Emphasis added:

The implementation is guaranteed to support the registration of at least 32 functions. The exact limit is implementation-defined.

https://en.cppreference.com/cpp/utility/program/atexit

To verify,

long limit = sysconf(_SC_ATEXIT_MAX);

gives me 2147483647. I don't expect this cap to affect us.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I realize that most implementations set a much higher value. I wouldn't want to rely on that in the case where we we're providing a new atexit handler for every other object. In any case, as long as we're scoping this for internal use only with the per-device map I'm fine leaving it as-is.

@bdice bdice force-pushed the fix-pool-mr-shutdown-crash branch from cf5c648 to 520d6f9 Compare April 27, 2026 21:49
@bdice bdice requested review from vyasr and wence- April 27, 2026 22:03
Copy link
Copy Markdown
Contributor

@wence- wence- left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought we wouldn't want process_is_exiting in the public API but I realise we need it so that downstream authors of mrs can DTRT.

Copy link
Copy Markdown
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some improvements can be made to the comments but the implementation looks fine now.

Comment thread cpp/tests/process_is_exiting_shutdown_test.cpp Outdated
Comment thread cpp/tests/process_is_exiting_shutdown_test.cpp Outdated
Comment thread cpp/tests/process_is_exiting_shutdown_test.cpp Outdated
@bdice
Copy link
Copy Markdown
Collaborator Author

bdice commented Apr 30, 2026

/merge

@rapids-bot rapids-bot Bot merged commit 47910f9 into rapidsai:main Apr 30, 2026
85 checks passed
@github-project-automation github-project-automation Bot moved this from Review to Done in RMM Project Board Apr 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working non-breaking Non-breaking change

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Intermittent SIGSEGV at process exit when pool_memory_resource is held in the static per-device resource map

3 participants