
[REVIEW] Generalize and improve cagra::optimize#1830

Merged
rapids-bot[bot] merged 64 commits into rapidsai:main from mfoerste4:cagra_optimize
May 13, 2026

Conversation

@mfoerste4
Contributor

@mfoerste4 mfoerste4 commented Feb 20, 2026

In preparation for large-scale graph creation, this PR changes cagra::optimize by:

  • adding full device path for pruning and merging / discarding host fallback code
  • fusing two pruning steps into one batch-able kernel, reducing memory requirements
  • batched reverse graph creation
  • batched merging

Because every substep is now batched, the memory footprint decreases while computation time improves significantly.

The optimize API now supports all variations of memory locations for knn_graph and cagra_graph.
Internally, the data will be buffered in device memory for best performance. Directly accessing managed/pinned/HMM memory from the device showed severe performance degradation upon the first access (x86/H200 with HMM):

=== Benchmarks (256 MiB, 10 iterations) ===

  [malloc] 1. Copy to device:     133.783 ms total (10 iters) -> 18.69 GB/s
  [malloc] 2. Kernel direct read: 4648.468 ms total (10 iters) -> 0.54 GB/s
  [malloc] 3. Kernel subsequent read: 15.164 ms total (10 iters) -> 164.87 GB/s
  [cudaMalloc] 1. Copy to device:     1.294 ms total (10 iters) -> 1932.35 GB/s
  [cudaMalloc] 2. Kernel direct read: 14.945 ms total (10 iters) -> 167.28 GB/s
  [cudaMalloc] 3. Kernel subsequent read: 14.963 ms total (10 iters) -> 167.07 GB/s
  [cudaMallocHost] 1. Copy to device:     95.002 ms total (10 iters) -> 26.32 GB/s
  [cudaMallocHost] 2. Kernel direct read: 290.486 ms total (10 iters) -> 8.61 GB/s
  [cudaMallocHost] 3. Kernel subsequent read: 290.789 ms total (10 iters) -> 8.60 GB/s
  [cudaMallocManaged] 1. Copy to device:     136.737 ms total (10 iters) -> 18.28 GB/s
  [cudaMallocManaged] 2. Kernel direct read: 766.153 ms total (10 iters) -> 3.26 GB/s
  [cudaMallocManaged] 3. Kernel subsequent read: 15.002 ms total (10 iters) -> 166.65 GB/s

New kernels are based on experiments by @bpark-nvidia

CC @tfeher , @irina-resh-nvda

@copy-pr-bot

copy-pr-bot Bot commented Feb 20, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@aamijar aamijar added the non-breaking (Introduces a non-breaking change) and feature request (New feature or request) labels Feb 24, 2026
@mfoerste4 mfoerste4 marked this pull request as ready for review March 2, 2026 23:35
@mfoerste4 mfoerste4 requested a review from a team as a code owner March 2, 2026 23:35
@mfoerste4 mfoerste4 requested a review from a team as a code owner March 10, 2026 22:50
@mfoerste4 mfoerste4 requested a review from achirkin May 1, 2026 02:33

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/neighbors/detail/cagra/graph_core.cuh (1)

1668-1676: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Short-circuit empty graphs before the batching helpers.

When graph_size == 0, both prune_graph_gpu() and merge_graph_gpu() compute batch_size = 0 and then evaluate (graph_size + batch_size - 1) / batch_size, which divides by zero. A fast return here avoids crashing on valid empty inputs.

Suggested fix
   RAFT_EXPECTS(knn_graph.extent(0) == new_graph.extent(0),
                "Each input array is expected to have the same number of rows");
   RAFT_EXPECTS(new_graph.extent(1) <= knn_graph.extent(1),
                "output graph cannot have more columns than input graph");
   // const uint64_t input_graph_degree  = knn_graph.extent(1);
   const uint64_t knn_graph_degree    = knn_graph.extent(1);
   const uint64_t output_graph_degree = new_graph.extent(1);
   const uint64_t graph_size          = new_graph.extent(0);
+
+  if (graph_size == 0) { return; }

As per coding guidelines: "Input validation must check for negative or invalid dimensions, null pointers, and invalid parameter combinations before GPU operations."
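
To illustrate the failure mode, here is a minimal host-side sketch of a ceiling-division batch count with the empty case guarded. It assumes the batch size is clamped to the graph size, which is one way `batch_size` could end up as 0; the function and parameter names are hypothetical and do not come from `graph_core.cuh`.

```cpp
#include <cstdint>

// Number of batches needed to cover `graph_size` rows.
// Returns 0 for an empty graph instead of dividing by a zero batch size.
// Hypothetical helper; the clamping rule below is an assumption.
std::uint64_t num_batches(std::uint64_t graph_size, std::uint64_t max_batch_size)
{
  if (graph_size == 0 || max_batch_size == 0) { return 0; }
  // Clamp the batch size so a small graph is processed in a single batch.
  const std::uint64_t batch_size = graph_size < max_batch_size ? graph_size : max_batch_size;
  // Ceiling division: safe here because batch_size >= 1.
  return (graph_size + batch_size - 1) / batch_size;
}
```

Returning early in the caller, as the suggested fix does, has the same effect and also skips any allocation or kernel launch for the empty input.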

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/graph_core.cuh` around lines 1668 - 1676,
Detect empty input graphs early and return before calling the batching helpers
to avoid divide-by-zero: check if graph_size (computed from new_graph.extent(0))
is zero and short-circuit out of the routine before computing batch_size or
calling prune_graph_gpu()/merge_graph_gpu(). Place this validation immediately
after computing graph_size (near the existing
knn_graph_degree/output_graph_degree assignments) so you skip any GPU batching
logic when graph_size == 0.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/neighbors/detail/cagra/utils.hpp`:
- Around line 580-584: In prefetch_next() the stream sync is skipped because the
condition uses input_view_.extent(0) == 0; remove that guard so that inside the
if constexpr (!kPassthrough) block you always wait for the paired kernel to
finish by calling raft::resource::sync_stream(res_). Update prefetch_next() to
unconditionally call raft::resource::sync_stream(res_) (while preserving the
kPassthrough compile-time branch) so the kernel that produced the previous slot
(referenced via batch_id_ and the two-slot reuse logic) has completed before any
D2H/H2D or slot recycling occurs.
- Around line 470-476: Destructor ~batched_device_view() currently syncs res_
but returns early if batch_id_ < 0, which can leave in-flight H2D work on
copy_stream_; before the early return, also synchronize the copy_stream_
(copy_stream_) so any queued batch 0 from the constructor (e.g., when
copy_device was non-empty) completes prior to destroying device_mem_; update
~batched_device_view() to call the appropriate stream sync on copy_stream_
(using the same raft/stream sync utility as used for res_) immediately before
the if (batch_id_ < 0) return, leaving the rest of the destructor logic
unchanged.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9d4344fb-3c73-4bb2-a64c-667afe4b206b

📥 Commits

Reviewing files that changed from the base of the PR and between 3437700 and aa250a7.

📒 Files selected for processing (5)
  • cpp/src/neighbors/detail/cagra/cagra_build.cuh
  • cpp/src/neighbors/detail/cagra/graph_core.cuh
  • cpp/src/neighbors/detail/cagra/utils.hpp
  • cpp/tests/CMakeLists.txt
  • cpp/tests/neighbors/ann_cagra/test_batched_device_view.cu
🚧 Files skipped from review as they are similar to previous changes (2)
  • cpp/tests/CMakeLists.txt
  • cpp/src/neighbors/detail/cagra/cagra_build.cuh

Comment thread cpp/src/neighbors/detail/cagra/utils.hpp Outdated
Comment thread cpp/src/neighbors/detail/cagra/utils.hpp Outdated
Contributor

@achirkin achirkin left a comment


Hi Malte, thanks for the updates! Here's another batch of comments.
I see you updated optimize to decide on host/device execution based on template parameters as per our offline discussion. Do you plan to update the call site (cagra_build.cuh) to change the default choices for input/output graphs?

Comment thread cpp/src/neighbors/detail/cagra/cagra_build.cuh
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh
     }
   }
-  __syncthreads();
+  __syncwarp();
Contributor


Is the warp-level sync in place of the previous block-level sync really sufficient here?

Contributor Author


It should be -- every warp uses its own chunk of shared memory.

Contributor


Oh, so you could in theory even implement this using exclusively warp shuffle operations? Would it make sense for the perf?

Comment on lines +351 to +362
  for (auto chunk_end = static_cast<int64_t>(num); chunk_end >= 1; chunk_end -= 32) {
    const int64_t chunk_start_lo = chunk_end - 31;
    const int64_t chunk_start = (chunk_start_lo > 1) ? chunk_start_lo : 1;
    const int64_t k = chunk_start + static_cast<int64_t>(lane_id);
    T val{};
    const bool active = (k <= chunk_end);
    if (active) { val = array[k - 1]; }
    __syncwarp();
    if (active) { array[k] = val; }
    __syncwarp();
  }
}
Contributor


Perhaps a nitpick, but I feel like this could be faster/better expressed with a single array[k] = raft::shfl_up(array[k], 1) in place of two syncwarps.

Contributor Author


Good idea.
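
For readers following this thread, the quoted loop shifts `array[0..num-1]` one slot to the right, working backwards in chunks of 32 so that each chunk is fully read before any of it is overwritten. The host-side emulation below is only a sketch of that behavior: the 32 lanes are modeled by inner loops, the gap between the two loops stands in for the first `__syncwarp()`, and the assumption that `array` has room for `num + 1` elements is mine.

```cpp
#include <cstdint>
#include <vector>

// Host-side emulation of the chunked warp shift: afterwards array[k] holds the
// old array[k - 1] for k in [1, num]; array[0] is left unchanged.
// Assumes array.size() >= num + 1. Illustrative only, not the kernel code.
template <typename T>
void shift_right_by_one(std::vector<T>& array, std::int64_t num)
{
  constexpr std::int64_t kWarpSize = 32;
  for (std::int64_t chunk_end = num; chunk_end >= 1; chunk_end -= kWarpSize) {
    const std::int64_t chunk_start_lo = chunk_end - (kWarpSize - 1);
    const std::int64_t chunk_start    = (chunk_start_lo > 1) ? chunk_start_lo : 1;
    T vals[kWarpSize] = {};
    // Phase 1: every active lane reads its left neighbor.
    for (std::int64_t lane = 0; lane < kWarpSize; ++lane) {
      const std::int64_t k = chunk_start + lane;
      if (k <= chunk_end) { vals[lane] = array[k - 1]; }
    }
    // Phase 2 (after the warp sync on the device): every active lane writes.
    for (std::int64_t lane = 0; lane < kWarpSize; ++lane) {
      const std::int64_t k = chunk_start + lane;
      if (k <= chunk_end) { array[k] = vals[lane]; }
    }
  }
}
```

Processing chunks from the end matters: a chunk's lowest lane reads the last element of the next-lower chunk before that chunk is shifted.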

Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
Comment on lines +1710 to +1716
  // reverse graph creation will always use the GPU
  // using default workspace resource for random access
  // otherwise will be managed memory which is slow upon first access
  auto d_rev_graph = raft::make_device_mdarray<IdxT>(res, raft::make_extents<int64_t>(0, 0));
  try {
    d_rev_graph = raft::make_device_mdarray<IdxT>(
      res, raft::make_extents<int64_t>(graph_size, output_graph_degree));
Contributor


You're NOT using the default workspace resource here. Is this intentional? Please either update the comment (you're using the default/current device resource) or the code. Since d_rev_graph is O(n_rows * graph_degree) in size, it should only use the workspace_resource for very small problem sizes.

Contributor Author


I changed it to use the default resource and fall back to the large resource if the allocation fails.
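
The try-then-fall-back pattern described here can be sketched in plain C++ as below. This is a host-side analogy only: the two allocator callables are hypothetical stand-ins for the default and large resources, and the assumption that a failed allocation surfaces as `std::bad_alloc` is mine, not something this thread confirms for the RAFT resources.

```cpp
#include <cstddef>
#include <functional>
#include <new>
#include <vector>

using buffer       = std::vector<unsigned char>;
using allocator_fn = std::function<buffer(std::size_t)>;

// Try the preferred allocator first; if it reports out-of-memory,
// retry once with the fallback allocator. Hypothetical helper.
buffer allocate_with_fallback(std::size_t n_bytes,
                              const allocator_fn& primary,
                              const allocator_fn& fallback)
{
  try {
    return primary(n_bytes);
  } catch (const std::bad_alloc&) {
    // Preferred pool is exhausted: use the slower but larger pool instead.
    return fallback(n_bytes);
  }
}
```

Only the out-of-memory case is caught, so unrelated errors from the first attempt still propagate to the caller.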

@mfoerste4
Contributor Author

@achirkin , thanks for the additional review pass. I pushed a new version with a larger refactor covering the merge of the two iterators and also covered your other comments.

@mfoerste4 mfoerste4 requested a review from achirkin May 8, 2026 11:38
Contributor

@achirkin achirkin left a comment


Hi @mfoerste4 , thanks for addressing my comments, LGTM! A small nitpick below

num_oor);
// These host-side checks are expensive (O(N*D^2)) and only used as debug
// diagnostics, so only run them when debug logging is active at runtime.
if (raft::default_logger().should_log(rapids_logger::level_enum::debug)) {
Contributor


nice!
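
The pattern praised here, running an expensive diagnostic only when its output would actually be logged, can be sketched independently of the RAFT logger. The `logger_sketch` type and `log_level` enum below are hypothetical stand-ins, not the `rapids_logger` API.

```cpp
#include <functional>

// Hypothetical log levels, ordered from most to least verbose.
enum class log_level { trace = 0, debug = 1, info = 2, warn = 3 };

// Hypothetical logger: a message is emitted if its level is at or above the threshold.
struct logger_sketch {
  log_level threshold;
  bool should_log(log_level lvl) const
  {
    return static_cast<int>(lvl) >= static_cast<int>(threshold);
  }
};

// Runs `expensive_check` only when debug output is enabled at runtime.
// Returns true if the check was executed.
bool run_debug_diagnostics(const logger_sketch& logger,
                           const std::function<void()>& expensive_check)
{
  if (!logger.should_log(log_level::debug)) { return false; }
  expensive_check();
  return true;
}
```

With the threshold at `info`, the check is skipped entirely, so a cost such as the O(N*D^2) host-side pass mentioned in the quoted comment is only paid in debug runs.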

Comment thread cpp/src/neighbors/detail/cagra/graph_core.cuh Outdated
@mfoerste4
Contributor Author

@achirkin thanks for the review!

Member

@KyleFromNVIDIA KyleFromNVIDIA left a comment


Approved trivial CMake changes

@achirkin
Contributor

/merge

@rapids-bot rapids-bot Bot merged commit 0a85b6b into rapidsai:main May 13, 2026
86 checks passed

Labels

feature request: New feature or request
non-breaking: Introduces a non-breaking change

Projects

Status: Done


7 participants