[REVIEW] Generalize and improve cagra::optimize#1830
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/neighbors/detail/cagra/graph_core.cuh (1)
1668-1676:⚠️ Potential issue | 🟠 Major | ⚡ Quick winShort-circuit empty graphs before the batching helpers.
When
graph_size == 0, bothprune_graph_gpu()andmerge_graph_gpu()computebatch_size = 0and then evaluate(graph_size + batch_size - 1) / batch_size, which divides by zero. A fast return here avoids crashing on valid empty inputs.Suggested fix
RAFT_EXPECTS(knn_graph.extent(0) == new_graph.extent(0), "Each input array is expected to have the same number of rows"); RAFT_EXPECTS(new_graph.extent(1) <= knn_graph.extent(1), "output graph cannot have more columns than input graph"); // const uint64_t input_graph_degree = knn_graph.extent(1); const uint64_t knn_graph_degree = knn_graph.extent(1); const uint64_t output_graph_degree = new_graph.extent(1); const uint64_t graph_size = new_graph.extent(0); + + if (graph_size == 0) { return; }As per coding guidelines: "Input validation must check for negative or invalid dimensions, null pointers, and invalid parameter combinations before GPU operations."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/neighbors/detail/cagra/graph_core.cuh` around lines 1668 - 1676, Detect empty input graphs early and return before calling the batching helpers to avoid divide-by-zero: check if graph_size (computed from new_graph.extent(0)) is zero and short-circuit out of the routine before computing batch_size or calling prune_graph_gpu()/merge_graph_gpu(). Place this validation immediately after computing graph_size (near the existing knn_graph_degree/output_graph_degree assignments) so you skip any GPU batching logic when graph_size == 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/neighbors/detail/cagra/utils.hpp`:
- Around line 580-584: In prefetch_next() the stream sync is skipped because the
condition uses input_view_.extent(0) == 0; remove that guard so that inside the
if constexpr (!kPassthrough) block you always wait for the paired kernel to
finish by calling raft::resource::sync_stream(res_). Update prefetch_next() to
unconditionally call raft::resource::sync_stream(res_) (while preserving the
kPassthrough compile-time branch) so the kernel that produced the previous slot
(referenced via batch_id_ and the two-slot reuse logic) has completed before any
D2H/H2D or slot recycling occurs.
- Around line 470-476: Destructor ~batched_device_view() currently syncs res_
but returns early if batch_id_ < 0, which can leave in-flight H2D work on
copy_stream_; before the early return, also synchronize the copy_stream_
(copy_stream_) so any queued batch 0 from the constructor (e.g., when
copy_device was non-empty) completes prior to destroying device_mem_; update
~batched_device_view() to call the appropriate stream sync on copy_stream_
(using the same raft/stream sync utility as used for res_) immediately before
the if (batch_id_ < 0) return, leaving the rest of the destructor logic
unchanged.
---
Outside diff comments:
In `@cpp/src/neighbors/detail/cagra/graph_core.cuh`:
- Around line 1668-1676: Detect empty input graphs early and return before
calling the batching helpers to avoid divide-by-zero: check if graph_size
(computed from new_graph.extent(0)) is zero and short-circuit out of the routine
before computing batch_size or calling prune_graph_gpu()/merge_graph_gpu().
Place this validation immediately after computing graph_size (near the existing
knn_graph_degree/output_graph_degree assignments) so you skip any GPU batching
logic when graph_size == 0.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 9d4344fb-3c73-4bb2-a64c-667afe4b206b
📒 Files selected for processing (5)
cpp/src/neighbors/detail/cagra/cagra_build.cuhcpp/src/neighbors/detail/cagra/graph_core.cuhcpp/src/neighbors/detail/cagra/utils.hppcpp/tests/CMakeLists.txtcpp/tests/neighbors/ann_cagra/test_batched_device_view.cu
🚧 Files skipped from review as they are similar to previous changes (2)
- cpp/tests/CMakeLists.txt
- cpp/src/neighbors/detail/cagra/cagra_build.cuh
achirkin
left a comment
There was a problem hiding this comment.
Hi Malte, thanks for the updates! Here's another batch of comments.
I see you updated optimize to decide on host/device execution based on template parameters as per our offline discussion. Do you plan to update the call site (cagra_build.cuh) to change the default choices for input/output graphs?
| } | ||
| } | ||
| __syncthreads(); | ||
| __syncwarp(); |
There was a problem hiding this comment.
Is the warp-level sync in place of the previous block-level sync really sufficient here?
There was a problem hiding this comment.
It should be -- every warp uses its own chunk of shared memory.
There was a problem hiding this comment.
Oh, so you could in theory even implement this using exclusively warp shuffle operations? Would it make sense for the perf?
| for (auto chunk_end = static_cast<int64_t>(num); chunk_end >= 1; chunk_end -= 32) { | ||
| const int64_t chunk_start_lo = chunk_end - 31; | ||
| const int64_t chunk_start = (chunk_start_lo > 1) ? chunk_start_lo : 1; | ||
| const int64_t k = chunk_start + static_cast<int64_t>(lane_id); | ||
| T val{}; | ||
| const bool active = (k <= chunk_end); | ||
| if (active) { val = array[k - 1]; } | ||
| __syncwarp(); | ||
| if (active) { array[k] = val; } | ||
| __syncwarp(); | ||
| } | ||
| } |
There was a problem hiding this comment.
Perhaps a nitpick, but I feel like this could be faster/better expressed with a single array[k] = raft::shfl_up(array[k], 1) in place of two syncwarps.
| // reverse graph creation will always use the GPU | ||
| // using default workspace resource for random access | ||
| // otherwise will be managed memory which is slow upon first access | ||
| auto d_rev_graph = raft::make_device_mdarray<IdxT>(res, raft::make_extents<int64_t>(0, 0)); | ||
| try { | ||
| d_rev_graph = raft::make_device_mdarray<IdxT>( | ||
| res, raft::make_extents<int64_t>(graph_size, output_graph_degree)); |
There was a problem hiding this comment.
You're NOT using the default workspace resource here. Is this intentional? Please either update the comment (you're using the default/current device resource) or the code. Since the d_rev_graph is O(n_rows*graph_degree) size, it only use the worksace_resource for very small problem sizes.
There was a problem hiding this comment.
I changed it to use the default resource and fall back to large resource if allocation is failing.
|
@achirkin , thanks for the additional review pass. I pushed a new version with a larger refactor covering the merge of the two iterators and also covered your other comments. |
achirkin
left a comment
There was a problem hiding this comment.
Hi @mfoerste4 , thanks for addressing my comments, LGTM! A small nitpick below
| num_oor); | ||
| // These host-side checks are expensive (O(N*D^2)) and only used as debug | ||
| // diagnostics, so only run them when debug logging is active at runtime. | ||
| if (raft::default_logger().should_log(rapids_logger::level_enum::debug)) { |
|
@achirkin thanks for the review! |
KyleFromNVIDIA
left a comment
There was a problem hiding this comment.
Approved trivial CMake changes
|
/merge |
In preparation for large scale graph creation this PR adds several changes to cagra:optimize by:
Due to the batching in all substeps the memory footprint could even be decreased while significantly improving computation time.
The optimize API now supports all variations of memory locations for knn_graph and cagra_graph.
Internally, the data will be buffered in device memory for best performance. Directly accessing managed/pinned/HMM memory from the device showed severe performance degradation upon the first access (x86/H200 with HMM):
New kernels are based on experiments by @bpark-nvidia
CC @tfeher , @irina-resh-nvda