Fix sticky CUDA error in `prefetch_resource_adaptor` when upstream returns non-managed memory by viclafargue · Pull Request #2267 · rapidsai/rmm

viclafargue · 2026-03-03T13:48:57Z

Closes #7834.

The prefetch_resource_adaptor treats prefetching of non-managed memory as a no-op, but fails to clear the sticky error from the CUDA runtime. This causes the CUDA runtime to be invalidated when using a PrefetchResourceAdaptor with a CudaMemoryResource as upstream.
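
A minimal sketch of the failure mode described above, written against a mocked runtime so it compiles without CUDA. Everything in `mock_runtime` is a hypothetical stand-in: `mem_prefetch_async`, `peek_at_last_error`, and `get_last_error` only imitate the documented behaviour of `cudaMemPrefetchAsync`, `cudaPeekAtLastError`, and `cudaGetLastError` (the last one returns the recorded error and resets it to success). The two `prefetch_*` helpers are illustrative and are not the adaptor's actual code.

```cpp
#include <cassert>

// Mock error codes; the numeric values are arbitrary and not CUDA's.
enum mock_error { mock_success = 0, mock_invalid_value = 1 };

namespace mock_runtime {
// Stand-in for the runtime's per-thread "last error" slot.
static mock_error last_error = mock_success;

// Imitates cudaMemPrefetchAsync: fails for non-managed memory and
// records the failure in the last-error slot.
inline mock_error mem_prefetch_async(bool is_managed)
{
  mock_error const result = is_managed ? mock_success : mock_invalid_value;
  if (result != mock_success) { last_error = result; }
  return result;
}

// Imitates cudaPeekAtLastError: reads the slot without resetting it.
inline mock_error peek_at_last_error() { return last_error; }

// Imitates cudaGetLastError: returns the slot and resets it to success.
inline mock_error get_last_error()
{
  mock_error const e = last_error;
  last_error         = mock_success;
  return e;
}
}  // namespace mock_runtime

// Behaviour described in the PR text: the invalid-value result is treated
// as a no-op, but the recorded error is left pending.
inline void prefetch_without_clearing(bool is_managed)
{
  mock_error const result = mock_runtime::mem_prefetch_async(is_managed);
  if (result == mock_invalid_value) { return; }  // no-op, error still recorded
}

// Behaviour after the proposed change: the recorded error is cleared.
inline void prefetch_with_clearing(bool is_managed)
{
  mock_error const result = mock_runtime::mem_prefetch_async(is_managed);
  if (result == mock_invalid_value) { (void)mock_runtime::get_last_error(); }
}
```

In this model, calling `prefetch_without_clearing(false)` leaves `peek_at_last_error()` reporting the invalid-value code, so a later, unrelated error check would trip over it; `prefetch_with_clearing(false)` leaves the slot at success.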

coderabbitai · 2026-03-03T13:52:42Z

📝 Walkthrough

Summary by CodeRabbit

Bug Fixes
- Refined CUDA memory prefetch error handling to gracefully manage specific error conditions without triggering unnecessary failure states, improving system stability and reducing false error reports in certain edge scenarios.

Walkthrough

Modified CUDA prefetch error handling in cpp/src/prefetch.cpp to distinguish between cudaErrorInvalidValue and other errors. When cudaMemPrefetchAsync returns cudaErrorInvalidValue, the sticky error is cleared via cudaGetLastError without propagating as failure. Other non-success errors continue through standard error checking.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CUDA prefetch error handling `cpp/src/prefetch.cpp` | Modified error result handling: `cudaErrorInvalidValue` now clears the last CUDA error without failure propagation; other errors routed through `RMM_CUDA_TRY`. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the specific issue being fixed: the sticky CUDA error in prefetch_resource_adaptor when non-managed memory is encountered, which matches the core change in the code. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the problem with sticky CUDA errors in prefetch_resource_adaptor and how it impacts downstream usage with CudaMemoryResource. |

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/prefetch.cpp`:
- Around line 32-35: When handling the cudaErrorInvalidValue path, don't ignore
the return of cudaGetLastError(); call cudaGetLastError() into a variable (e.g.,
last_err), check it against cudaSuccess (or expected values) and if it is not
cudaSuccess report/propagate it (use RMM_CUDA_TRY or equivalent) instead of
discarding; update the branch that currently calls cudaGetLastError() to
validate its return and handle unexpected error states rather than silently
dropping them (references: result, cudaErrorInvalidValue, cudaGetLastError,
RMM_CUDA_TRY).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b9f23d and c76600c.

📒 Files selected for processing (1)

cpp/src/prefetch.cpp

coderabbitai · 2026-03-03T13:52:45Z

+  if (result == cudaErrorInvalidValue) {
+    (void)cudaGetLastError();
+  } else if (result != cudaSuccess) {
+    RMM_CUDA_TRY(result);


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and find prefetch.cpp
find . -name "prefetch.cpp" -type f

Repository: rapidsai/rmm

Length of output: 78

🏁 Script executed:

# Then examine the file around the specified lines
if [ -f "cpp/src/prefetch.cpp" ]; then
  echo "=== File content around lines 25-40 ==="
  sed -n '25,40p' cpp/src/prefetch.cpp | cat -n
  echo ""
  echo "=== Full function context ==="
  head -50 cpp/src/prefetch.cpp | cat -n
fi

Repository: rapidsai/rmm

Length of output: 2344

Check the return value from cudaGetLastError() before discarding it.

Line 33 calls cudaGetLastError() to clear the sticky error state but ignores its return value. Per the coding guidelines, all CUDA API calls must have their errors checked. If the cleared error is not one of the expected values, it indicates an inconsistent error state and should be reported.

Suggested patch

   if (result == cudaErrorInvalidValue) {
-    (void)cudaGetLastError();
+    auto const cleared_error = cudaGetLastError();
+    if (cleared_error != cudaSuccess && cleared_error != cudaErrorInvalidValue) {
+      RMM_CUDA_TRY(cleared_error);
+    }
   } else if (result != cudaSuccess) {
     RMM_CUDA_TRY(result);
   }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

if (result == cudaErrorInvalidValue) {
  (void)cudaGetLastError();
} else if (result != cudaSuccess) {
  RMM_CUDA_TRY(result);

if (result == cudaErrorInvalidValue) {
  auto const cleared_error = cudaGetLastError();
  if (cleared_error != cudaSuccess && cleared_error != cudaErrorInvalidValue) {
    RMM_CUDA_TRY(cleared_error);
  }
} else if (result != cudaSuccess) {
  RMM_CUDA_TRY(result);
}
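
The branching in the suggestion above can be exercised without a GPU by modelling it as a pure function. This is a sketch under stated assumptions, not the code in `cpp/src/prefetch.cpp`: `mock_error`, `mock_cuda_try`, and `handle_prefetch_result` are hypothetical names, `mock_cuda_try` merely imitates `RMM_CUDA_TRY` by throwing on any non-success code, and the `pending_error` parameter stands in for whatever `cudaGetLastError()` would return when the error is cleared.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Mock error codes; the numeric values are arbitrary and not CUDA's.
enum mock_error { mock_success = 0, mock_invalid_value = 1, mock_other_error = 2 };

// Hypothetical stand-in for RMM_CUDA_TRY: throws on any non-success code.
inline void mock_cuda_try(mock_error e)
{
  if (e != mock_success) {
    throw std::runtime_error("mock CUDA error " + std::to_string(static_cast<int>(e)));
  }
}

// prefetch_result: what the prefetch call returned.
// pending_error:   what clearing the last error would return.
inline void handle_prefetch_result(mock_error prefetch_result, mock_error pending_error)
{
  if (prefetch_result == mock_invalid_value) {
    // Expected for non-managed memory: clear, but still validate what was cleared.
    mock_error const cleared_error = pending_error;
    if (cleared_error != mock_success && cleared_error != mock_invalid_value) {
      mock_cuda_try(cleared_error);  // unexpected state is reported, not dropped
    }
  } else if (prefetch_result != mock_success) {
    mock_cuda_try(prefetch_result);  // any other failure propagates as before
  }
}

// Small helper for experimenting: reports whether the handler threw.
inline bool handler_throws(mock_error prefetch_result, mock_error pending_error)
{
  try {
    handle_prefetch_result(prefetch_result, pending_error);
  } catch (std::runtime_error const&) {
    return true;
  }
  return false;
}
```

Under this model the only behavioural difference from the `(void)cudaGetLastError();` version is the case where the prefetch returns the invalid-value code while a different error is pending: the checked variant throws, the unchecked one stays silent.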

bdice · 2026-03-03T14:30:50Z

Please see #1648 and the discussion in #1649. I would like a slightly different approach to this, and just haven’t had time to finish that implementation.

jcrist · 2026-03-05T19:07:13Z

I think we're seeing two different things in rapidsai/cuml#7842 (or at least I don't see how they're related).

IIUC the one tackled here is that PrefetchResourceAdaptor(CudaMemoryResource()) shouldn't work (this wasn't clear to me from the docs). I only tried that case to see the cartesian product of memory resources while debugging, we don't actually need that case to work.

Perhaps in the short term a docs fix (or maybe a best-effort user error check when constructing a PrefetchResourceAdaptor on the Python side, walking up the upstream_mr chain to find a ManagedMemoryResource and erroring if one isn't found)?
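
The chain-walking check floated above might look roughly like the following. The comment proposes it for the Python side; this sketch is written in C++ only to stay consistent with the rest of the thread. `resource_kind`, `resource_node`, `chain_has_managed`, and `make_prefetch_adaptor` are all hypothetical names invented for illustration and do not correspond to RMM types. The check is best effort by nature: it only works for resources that expose their upstream.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical tags for the kinds of resources that can appear in a chain.
enum class resource_kind { cuda, managed, pool, prefetch_adaptor };

// Hypothetical node: a resource plus a pointer to its upstream
// (nullptr for a base resource with no upstream).
struct resource_node {
  resource_kind kind;
  resource_node const* upstream;
};

// Walk the upstream chain and report whether any resource is managed.
inline bool chain_has_managed(resource_node const* node)
{
  for (; node != nullptr; node = node->upstream) {
    if (node->kind == resource_kind::managed) { return true; }
  }
  return false;
}

// Construction-time check: refuse to wrap a chain with no managed resource.
inline resource_node make_prefetch_adaptor(resource_node const* upstream)
{
  if (!chain_has_managed(upstream)) {
    throw std::invalid_argument(
      "prefetch adaptor requires a managed memory resource somewhere upstream");
  }
  return resource_node{resource_kind::prefetch_adaptor, upstream};
}
```

With this shape, wrapping a pool whose upstream is managed succeeds, while wrapping a plain CUDA resource fails at construction instead of surfacing later as a CUDA error.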

Fix sticky CUDA error in prefetch_resource_adaptor when upstream returns non-managed memory

c76600c

github-project-automation Bot added this to RMM Project Board Mar 3, 2026

coderabbitai Bot reviewed Mar 3, 2026

viclafargue mentioned this pull request Mar 5, 2026

[BUG] SVR.fit issues when prefetching enabled rapidsai/cuml#7842

Closed

viclafargue wants to merge 1 commit into rapidsai:main from viclafargue:fix-sticky-cuda-error-prefetch

3 participants
