
Fix sticky CUDA error in prefetch_resource_adaptor when upstream returns non-managed memory #2267

Open
viclafargue wants to merge 1 commit into rapidsai:main from viclafargue:fix-sticky-cuda-error-prefetch

@viclafargue
Contributor

@viclafargue viclafargue commented Mar 3, 2026

Closes #7834.

The prefetch_resource_adaptor treats prefetching of non-managed memory as a no-op, but it fails to clear the sticky error that the rejected prefetch leaves in the CUDA runtime. The leftover error then surfaces in later CUDA error checks when a PrefetchResourceAdaptor is used with a CudaMemoryResource as upstream.
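To make the failure mode concrete, here is a minimal sketch that models the runtime's last-error slot with a mock. Everything in it (`mock_runtime`, `mock_error`, `prefetch_noop_on_unmanaged`) is a hypothetical stand-in written for illustration, not CUDA or RMM API; it only assumes that a rejected prefetch records an error which stays recorded until something reads and resets it.

```cpp
#include <cassert>

// Hypothetical stand-ins for CUDA error codes (not the real cudaError_t).
enum class mock_error { success, invalid_value };

// Hypothetical model of the runtime's "last error" slot.
struct mock_runtime {
  mock_error last_error = mock_error::success;

  // Mimics a prefetch call that rejects non-managed pointers and records
  // the failure in the last-error slot.
  mock_error prefetch_async(bool pointer_is_managed) {
    if (pointer_is_managed) { return mock_error::success; }
    last_error = mock_error::invalid_value;
    return last_error;
  }

  // Mimics cudaGetLastError(): returns the recorded error and resets the slot.
  mock_error get_last_error() {
    mock_error const recorded = last_error;
    last_error = mock_error::success;
    return recorded;
  }

  // Mimics cudaPeekAtLastError(): returns the recorded error without resetting.
  mock_error peek_at_last_error() const { return last_error; }
};

// Treats a prefetch of non-managed memory as a no-op.
// With clear_error == false the error is ignored but left recorded
// (the behaviour described above); with clear_error == true it is reset.
inline void prefetch_noop_on_unmanaged(mock_runtime& rt,
                                       bool pointer_is_managed,
                                       bool clear_error) {
  mock_error const result = rt.prefetch_async(pointer_is_managed);
  if (result == mock_error::invalid_value && clear_error) {
    (void)rt.get_last_error();
  }
}
```

In this model, calling `prefetch_noop_on_unmanaged(rt, false, false)` leaves `rt.peek_at_last_error()` reporting `invalid_value` to whichever caller checks next, while passing `true` for `clear_error` leaves the slot at `success`.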

@coderabbitai

coderabbitai Bot commented Mar 3, 2026

📝 Walkthrough

Summary by CodeRabbit

  • Bug Fixes
    • Refined CUDA memory prefetch error handling to gracefully manage specific error conditions without triggering unnecessary failure states, improving system stability and reducing false error reports in certain edge scenarios.

Walkthrough

Modified CUDA prefetch error handling in cpp/src/prefetch.cpp to distinguish between cudaErrorInvalidValue and other errors. When cudaMemPrefetchAsync returns cudaErrorInvalidValue, the sticky error is cleared via cudaGetLastError without propagating as failure. Other non-success errors continue through standard error checking.

Changes

| Cohort / File(s) | Summary |
|---|---|
| CUDA prefetch error handling (cpp/src/prefetch.cpp) | Modified error result handling: cudaErrorInvalidValue now clears the last CUDA error without failure propagation; other errors routed through RMM_CUDA_TRY. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly identifies the specific issue being fixed: the sticky CUDA error in prefetch_resource_adaptor when non-managed memory is encountered, which matches the core change in the code. |
| Description check | ✅ Passed | The description is directly related to the changeset, explaining the problem with sticky CUDA errors in prefetch_resource_adaptor and how it impacts downstream usage with CudaMemoryResource. |



@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/prefetch.cpp`:
- Around line 32-35: When handling the cudaErrorInvalidValue path, don't ignore
the return of cudaGetLastError(); call cudaGetLastError() into a variable (e.g.,
last_err), check it against cudaSuccess (or expected values) and if it is not
cudaSuccess report/propagate it (use RMM_CUDA_TRY or equivalent) instead of
discarding; update the branch that currently calls cudaGetLastError() to
validate its return and handle unexpected error states rather than silently
dropping them (references: result, cudaErrorInvalidValue, cudaGetLastError,
RMM_CUDA_TRY).

ℹ️ Review info

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9b9f23d and c76600c.

📒 Files selected for processing (1)
  • cpp/src/prefetch.cpp

Comment thread cpp/src/prefetch.cpp
Comment on lines +32 to +35
  if (result == cudaErrorInvalidValue) {
    (void)cudaGetLastError();
  } else if (result != cudaSuccess) {
    RMM_CUDA_TRY(result);

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and find prefetch.cpp
find . -name "prefetch.cpp" -type f

Repository: rapidsai/rmm

Length of output: 78


🏁 Script executed:

# Then examine the file around the specified lines
if [ -f "cpp/src/prefetch.cpp" ]; then
  echo "=== File content around lines 25-40 ==="
  sed -n '25,40p' cpp/src/prefetch.cpp | cat -n
  echo ""
  echo "=== Full function context ==="
  head -50 cpp/src/prefetch.cpp | cat -n
fi

Repository: rapidsai/rmm

Length of output: 2344


Check the return value from cudaGetLastError() before discarding it.

Line 33 calls cudaGetLastError() to clear the sticky error state but ignores its return value. Per the coding guidelines, all CUDA API calls must have their errors checked. If the cleared error is not one of the expected values, it indicates an inconsistent error state and should be reported.

Suggested patch
   if (result == cudaErrorInvalidValue) {
-    (void)cudaGetLastError();
+    auto const cleared_error = cudaGetLastError();
+    if (cleared_error != cudaSuccess && cleared_error != cudaErrorInvalidValue) {
+      RMM_CUDA_TRY(cleared_error);
+    }
   } else if (result != cudaSuccess) {
     RMM_CUDA_TRY(result);
   }
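The branching in the suggested patch can be exercised without a GPU by substituting mocks. The sketch below is an illustration under assumptions: `mock_error`, `mock_try` (standing in for RMM_CUDA_TRY) and `handle_prefetch_result` are hypothetical names, and the `pending` argument stands in for whatever cudaGetLastError() would return at that point.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins for CUDA error codes (not the real cudaError_t).
enum class mock_error { success, invalid_value, illegal_address };

// Hypothetical stand-in for RMM_CUDA_TRY: throws on anything but success.
inline void mock_try(mock_error e) {
  if (e != mock_error::success) { throw std::runtime_error("mock CUDA error"); }
}

// `result`  : value returned by the prefetch call.
// `pending` : value the last-error slot holds, i.e. what the call standing
//             in for cudaGetLastError() would hand back.
inline void handle_prefetch_result(mock_error result, mock_error pending) {
  if (result == mock_error::invalid_value) {
    // Read (and, in the real runtime, reset) the recorded error.
    mock_error const cleared_error = pending;
    // Only the expected values are swallowed; anything else is propagated.
    if (cleared_error != mock_error::success &&
        cleared_error != mock_error::invalid_value) {
      mock_try(cleared_error);
    }
  } else if (result != mock_error::success) {
    mock_try(result);
  }
}
```

With these mocks, `handle_prefetch_result(mock_error::invalid_value, mock_error::invalid_value)` returns quietly, whereas an unexpected pending error such as `illegal_address` throws instead of being dropped.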

@bdice
Collaborator

bdice commented Mar 3, 2026

Please see #1648 and the discussion in #1649. I would like a slightly different approach to this, and just haven’t had time to finish that implementation.

@jcrist
Member

jcrist commented Mar 5, 2026

I think we're seeing two different things in rapidsai/cuml#7842 (or at least I don't see how they're related).

IIUC the one tackled here is that PrefetchResourceAdaptor(CudaMemoryResource()) shouldn't work (this wasn't clear to me from the docs). I only tried that case to see the cartesian product of memory resources while debugging, we don't actually need that case to work.

Perhaps in the short term a docs fix (or maybe a best-effort user-error check when constructing a PrefetchResourceAdaptor on the Python side, walking up the upstream_mr chain to find a ManagedMemoryResource and erroring if one isn't found)?
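The chain walk suggested here could look roughly like the following. This is only a language-neutral sketch written in C++ to match the rest of the PR: `resource_node`, `chain_has_managed` and `check_prefetch_upstream` are hypothetical names, and real RMM resources are not assumed to expose an `is_managed` flag or a raw upstream pointer like this.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical node in a memory-resource chain (illustration only).
struct resource_node {
  bool is_managed;                // true for a managed-memory resource
  resource_node const* upstream;  // nullptr at the end of the chain
};

// Returns true if any resource in the chain starting at `mr` is managed.
inline bool chain_has_managed(resource_node const* mr) {
  for (auto const* cur = mr; cur != nullptr; cur = cur->upstream) {
    if (cur->is_managed) { return true; }
  }
  return false;
}

// Best-effort construction-time check: reject an upstream chain that
// contains no managed-memory resource.
inline void check_prefetch_upstream(resource_node const* upstream) {
  if (!chain_has_managed(upstream)) {
    throw std::invalid_argument(
      "prefetch adaptor requires a managed-memory resource upstream");
  }
}
```

A chain such as a pool over a managed resource would pass the check, while a lone non-managed resource would be rejected at construction time rather than failing later during prefetch.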



Development

Successfully merging this pull request may close these issues.

[BUG] SVR.fit issues when prefetching enabled

3 participants