Fix sticky CUDA error in prefetch_resource_adaptor when upstream returns non-managed memory#2267
Conversation
…rns non-managed memory
📝 WalkthroughSummary by CodeRabbit
WalkthroughModified CUDA prefetch error handling in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Tip Try Coding Plans. Let us write the prompt for your AI agent so you can ship faster (with fewer bugs). Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/prefetch.cpp`:
- Around line 32-35: When handling the cudaErrorInvalidValue path, don't ignore
the return of cudaGetLastError(); call cudaGetLastError() into a variable (e.g.,
last_err), check it against cudaSuccess (or expected values) and if it is not
cudaSuccess report/propagate it (use RMM_CUDA_TRY or equivalent) instead of
discarding; update the branch that currently calls cudaGetLastError() to
validate its return and handle unexpected error states rather than silently
dropping them (references: result, cudaErrorInvalidValue, cudaGetLastError,
RMM_CUDA_TRY).
| if (result == cudaErrorInvalidValue) { | ||
| (void)cudaGetLastError(); | ||
| } else if (result != cudaSuccess) { | ||
| RMM_CUDA_TRY(result); |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's look at the file structure and find prefetch.cpp
find . -name "prefetch.cpp" -type fRepository: rapidsai/rmm
Length of output: 78
🏁 Script executed:
# Then examine the file around the specified lines
if [ -f "cpp/src/prefetch.cpp" ]; then
echo "=== File content around lines 25-40 ==="
sed -n '25,40p' cpp/src/prefetch.cpp | cat -n
echo ""
echo "=== Full function context ==="
head -50 cpp/src/prefetch.cpp | cat -n
fiRepository: rapidsai/rmm
Length of output: 2344
Check the return value from cudaGetLastError() before discarding it.
Line 33 calls cudaGetLastError() to clear the sticky error state but ignores its return value. Per the coding guidelines, all CUDA API calls must have their errors checked. If the cleared error is not one of the expected values, it indicates an inconsistent error state and should be reported.
Suggested patch
if (result == cudaErrorInvalidValue) {
- (void)cudaGetLastError();
+ auto const cleared_error = cudaGetLastError();
+ if (cleared_error != cudaSuccess && cleared_error != cudaErrorInvalidValue) {
+ RMM_CUDA_TRY(cleared_error);
+ }
} else if (result != cudaSuccess) {
RMM_CUDA_TRY(result);
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (result == cudaErrorInvalidValue) { | |
| (void)cudaGetLastError(); | |
| } else if (result != cudaSuccess) { | |
| RMM_CUDA_TRY(result); | |
| if (result == cudaErrorInvalidValue) { | |
| auto const cleared_error = cudaGetLastError(); | |
| if (cleared_error != cudaSuccess && cleared_error != cudaErrorInvalidValue) { | |
| RMM_CUDA_TRY(cleared_error); | |
| } | |
| } else if (result != cudaSuccess) { | |
| RMM_CUDA_TRY(result); | |
| } |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/prefetch.cpp` around lines 32 - 35, When handling the
cudaErrorInvalidValue path, don't ignore the return of cudaGetLastError(); call
cudaGetLastError() into a variable (e.g., last_err), check it against
cudaSuccess (or expected values) and if it is not cudaSuccess report/propagate
it (use RMM_CUDA_TRY or equivalent) instead of discarding; update the branch
that currently calls cudaGetLastError() to validate its return and handle
unexpected error states rather than silently dropping them (references: result,
cudaErrorInvalidValue, cudaGetLastError, RMM_CUDA_TRY).
|
I think we're seeing two different things in rapidsai/cuml#7842 (or at least I don't see how they're related). IIUC the one tackled here is that Perhaps in the short term a docs fix (or maybe a best effort user error check when constructing a |
Closes #7834.
The
prefetch_resource_adaptortreats prefetching of non-managed memory as a no-op, but fails to clear the sticky error from the CUDA runtime. This causes the CUDA runtime to be invalidated when using aPrefetchResourceAdaptorwith aCudaMemoryResourceas upstream.