[REVIEW] Generalize and improve cagra::optimize by mfoerste4 · Pull Request #1830 · rapidsai/cuvs

mfoerste4 · 2026-02-20T12:28:20Z

In preparation for large-scale graph creation, this PR makes several changes to cagra::optimize:

adding full device path for pruning and merging / discarding host fallback code
fusing two pruning steps into one batch-able kernel, reducing memory requirements
batched reverse graph creation
batched merging

Because every substep is now batched, the memory footprint could even be reduced while computation time improves significantly.

The optimize API now supports all variations of memory locations for knn_graph and cagra_graph.
Internally, the data will be buffered in device memory for best performance. Directly accessing managed/pinned/HMM memory from the device showed severe performance degradation upon the first access (x86/H200 with HMM):

=== Benchmarks (256 MiB, 10 iterations) ===

  [malloc] 1. Copy to device:     133.783 ms total (10 iters) -> 18.69 GB/s
  [malloc] 2. Kernel direct read: 4648.468 ms total (10 iters) -> 0.54 GB/s
  [malloc] 3. Kernel subsequent read: 15.164 ms total (10 iters) -> 164.87 GB/s
  [cudaMalloc] 1. Copy to device:     1.294 ms total (10 iters) -> 1932.35 GB/s
  [cudaMalloc] 2. Kernel direct read: 14.945 ms total (10 iters) -> 167.28 GB/s
  [cudaMalloc] 3. Kernel subsequent read: 14.963 ms total (10 iters) -> 167.07 GB/s
  [cudaMallocHost] 1. Copy to device:     95.002 ms total (10 iters) -> 26.32 GB/s
  [cudaMallocHost] 2. Kernel direct read: 290.486 ms total (10 iters) -> 8.61 GB/s
  [cudaMallocHost] 3. Kernel subsequent read: 290.789 ms total (10 iters) -> 8.60 GB/s
  [cudaMallocManaged] 1. Copy to device:     136.737 ms total (10 iters) -> 18.28 GB/s
  [cudaMallocManaged] 2. Kernel direct read: 766.153 ms total (10 iters) -> 3.26 GB/s
  [cudaMallocManaged] 3. Kernel subsequent read: 15.002 ms total (10 iters) -> 166.65 GB/s
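For readers checking the numbers above: the reported throughput figures appear consistent with binary units, i.e. 10 passes over a 256 MiB buffer (2.5 GiB in total) divided by the total time. The helper below is a minimal sketch of that arithmetic, not the benchmark's actual code; the function name `throughput_gib_per_s` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: throughput in GiB/s for `iters` passes over a buffer of
// `bytes_per_iter` bytes that took `total_ms` milliseconds in total.
inline double throughput_gib_per_s(std::size_t bytes_per_iter, int iters, double total_ms)
{
  assert(iters > 0 && total_ms > 0.0);
  constexpr double kGiB  = 1024.0 * 1024.0 * 1024.0;
  const double total_gib = static_cast<double>(bytes_per_iter) * iters / kGiB;
  return total_gib / (total_ms / 1000.0);  // convert ms to s
}
```

With a 256 MiB buffer and 10 iterations, 133.783 ms gives roughly 18.69 and 4648.468 ms gives roughly 0.54, matching the first two `[malloc]` rows.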

New kernels are based on experiments by @bpark-nvidia

CC @tfeher , @irina-resh-nvda

copy-pr-bot · 2026-02-20T12:28:24Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

cpp/src/neighbors/detail/cagra/graph_core.cuh (1)
1668-1676: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Short-circuit empty graphs before the batching helpers.

When graph_size == 0, both prune_graph_gpu() and merge_graph_gpu() compute batch_size = 0 and then evaluate (graph_size + batch_size - 1) / batch_size, which divides by zero. A fast return here avoids crashing on valid empty inputs.
Suggested fix
   RAFT_EXPECTS(knn_graph.extent(0) == new_graph.extent(0),
                "Each input array is expected to have the same number of rows");
   RAFT_EXPECTS(new_graph.extent(1) <= knn_graph.extent(1),
                "output graph cannot have more columns than input graph");
   // const uint64_t input_graph_degree  = knn_graph.extent(1);
   const uint64_t knn_graph_degree    = knn_graph.extent(1);
   const uint64_t output_graph_degree = new_graph.extent(1);
   const uint64_t graph_size          = new_graph.extent(0);
+
+  if (graph_size == 0) { return; }
As per coding guidelines: "Input validation must check for negative or invalid dimensions, null pointers, and invalid parameter combinations before GPU operations."
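The divide-by-zero described above can be illustrated with a small host-side sketch. It assumes the batch size is clamped to the graph size, which is what would make `batch_size` zero for an empty graph; the function name `num_batches` and the `max_batch_size` parameter are hypothetical and not taken from graph_core.cuh.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: number of batches needed to cover `graph_size` rows.
// Assumption: the batch size is clamped to the graph size, so an empty graph
// would yield batch_size == 0 and a division by zero without the early return.
inline std::uint64_t num_batches(std::uint64_t graph_size, std::uint64_t max_batch_size)
{
  assert(max_batch_size > 0);
  if (graph_size == 0) { return 0; }  // short-circuit: nothing to process
  const std::uint64_t batch_size = graph_size < max_batch_size ? graph_size : max_batch_size;
  return (graph_size + batch_size - 1) / batch_size;  // ceiling division
}
```

For example, 2049 rows with a maximum batch of 1024 need three batches, and an empty graph needs none.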
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/neighbors/detail/cagra/graph_core.cuh` around lines 1668 - 1676,
Detect empty input graphs early and return before calling the batching helpers
to avoid divide-by-zero: check if graph_size (computed from new_graph.extent(0))
is zero and short-circuit out of the routine before computing batch_size or
calling prune_graph_gpu()/merge_graph_gpu(). Place this validation immediately
after computing graph_size (near the existing
knn_graph_degree/output_graph_degree assignments) so you skip any GPU batching
logic when graph_size == 0.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/neighbors/detail/cagra/utils.hpp`:
- Around line 580-584: In prefetch_next() the stream sync is skipped because the
condition uses input_view_.extent(0) == 0; remove that guard so that inside the
if constexpr (!kPassthrough) block you always wait for the paired kernel to
finish by calling raft::resource::sync_stream(res_). Update prefetch_next() to
unconditionally call raft::resource::sync_stream(res_) (while preserving the
kPassthrough compile-time branch) so the kernel that produced the previous slot
(referenced via batch_id_ and the two-slot reuse logic) has completed before any
D2H/H2D or slot recycling occurs.
- Around line 470-476: Destructor ~batched_device_view() currently syncs res_
but returns early if batch_id_ < 0, which can leave in-flight H2D work on
copy_stream_; before the early return, also synchronize the copy_stream_
(copy_stream_) so any queued batch 0 from the constructor (e.g., when
copy_device was non-empty) completes prior to destroying device_mem_; update
~batched_device_view() to call the appropriate stream sync on copy_stream_
(using the same raft/stream sync utility as used for res_) immediately before
the if (batch_id_ < 0) return, leaving the rest of the destructor logic
unchanged.

📥 Commits

Reviewing files that changed from the base of the PR and between 3437700 and aa250a7.

📒 Files selected for processing (5)

cpp/src/neighbors/detail/cagra/cagra_build.cuh
cpp/src/neighbors/detail/cagra/graph_core.cuh
cpp/src/neighbors/detail/cagra/utils.hpp
cpp/tests/CMakeLists.txt
cpp/tests/neighbors/ann_cagra/test_batched_device_view.cu

🚧 Files skipped from review as they are similar to previous changes (2)

cpp/tests/CMakeLists.txt
cpp/src/neighbors/detail/cagra/cagra_build.cuh

achirkin

Hi Malte, thanks for the updates! Here's another batch of comments.
I see you updated optimize to decide on host/device execution based on template parameters as per our offline discussion. Do you plan to update the call site (cagra_build.cuh) to change the default choices for input/output graphs?

achirkin · 2026-05-06T11:59:28Z

    }
  }
-  __syncthreads();
+  __syncwarp();


Is the warp-level sync in place of the previous block-level sync really sufficient here?

It should be -- every warp uses its own chunk of shared memory.

Oh, so you could in theory even implement this using exclusively warp shuffle operations? Would it make sense for the perf?

achirkin · 2026-05-06T12:06:37Z

+  for (auto chunk_end = static_cast<int64_t>(num); chunk_end >= 1; chunk_end -= 32) {
+    const int64_t chunk_start_lo = chunk_end - 31;
+    const int64_t chunk_start    = (chunk_start_lo > 1) ? chunk_start_lo : 1;
+    const int64_t k              = chunk_start + static_cast<int64_t>(lane_id);
+    T val{};
+    const bool active = (k <= chunk_end);
+    if (active) { val = array[k - 1]; }
+    __syncwarp();
+    if (active) { array[k] = val; }
+    __syncwarp();
+  }
+}


Perhaps a nitpick, but I feel like this could be faster/better expressed with a single array[k] = raft::shfl_up(array[k], 1) in place of two syncwarps.
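To make the hunk above easier to follow, here is a host-side emulation of the chunked shift, assuming a warp of 32 lanes. It is only a sketch of the logic in the diff: the two inner loops stand in for the read and write phases separated by `__syncwarp()`, and the function name `shift_right_by_one_chunked` is hypothetical. It does not model the `shfl_up` variant suggested in the comment.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host emulation (not device code) of the warp-cooperative shift:
// afterwards array[k] holds the old array[k - 1] for k in [1, num].
template <typename T>
void shift_right_by_one_chunked(std::vector<T>& array, std::int64_t num)
{
  constexpr int kWarpSize = 32;  // assumed warp width
  assert(num >= 0 && array.size() >= static_cast<std::size_t>(num) + 1);
  // Walk chunks from the end so no chunk overwrites data a later chunk still needs.
  for (std::int64_t chunk_end = num; chunk_end >= 1; chunk_end -= kWarpSize) {
    const std::int64_t chunk_start_lo = chunk_end - (kWarpSize - 1);
    const std::int64_t chunk_start    = (chunk_start_lo > 1) ? chunk_start_lo : 1;
    std::array<T, kWarpSize> val{};
    std::array<bool, kWarpSize> active{};
    // Phase 1: every lane reads its source element (before the first __syncwarp).
    for (int lane = 0; lane < kWarpSize; ++lane) {
      const std::int64_t k = chunk_start + lane;
      active[lane]         = (k <= chunk_end);
      if (active[lane]) { val[lane] = array[k - 1]; }
    }
    // Phase 2: every lane writes its destination element (after the first __syncwarp).
    for (int lane = 0; lane < kWarpSize; ++lane) {
      const std::int64_t k = chunk_start + lane;
      if (active[lane]) { array[k] = val[lane]; }
    }
  }
}
```

Separating the read and write phases is what the two syncs guarantee on the device: within a chunk, lane `i` writes the slot that lane `i + 1` reads.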

achirkin · 2026-05-06T13:13:58Z

+  // reverse graph creation will always use the GPU
+  // using default workspace resource for random access
+  // otherwise will be managed memory which is slow upon first access
+  auto d_rev_graph = raft::make_device_mdarray<IdxT>(res, raft::make_extents<int64_t>(0, 0));
+  try {
+    d_rev_graph = raft::make_device_mdarray<IdxT>(
+      res, raft::make_extents<int64_t>(graph_size, output_graph_degree));


You're NOT using the default workspace resource here. Is this intentional? Please either update the comment (you're using the default/current device resource) or the code. Since d_rev_graph is O(n_rows*graph_degree) in size, it would only fit in the workspace_resource for very small problem sizes.

I changed it to use the default resource and fall back to the large resource if the allocation fails.
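The fallback strategy described in this reply can be sketched generically as "try the preferred allocator, and on failure retry with the large one". The sketch below is host-only and does not use the RAFT or RMM APIs: `buffer`, `allocator_fn`, and `allocate_with_fallback` are hypothetical names, and signalling failure via `std::bad_alloc` is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <new>
#include <vector>

using buffer       = std::vector<unsigned char>;           // stand-in for a device array
using allocator_fn = std::function<buffer(std::size_t)>;   // hypothetical allocator handle

// Allocate n_rows * degree elements of elem_size bytes from `preferred`;
// if that throws std::bad_alloc (assumed failure signal), retry with `fallback`.
inline buffer allocate_with_fallback(const allocator_fn& preferred,
                                     const allocator_fn& fallback,
                                     std::size_t n_rows,
                                     std::size_t degree,
                                     std::size_t elem_size)
{
  assert(preferred && fallback);
  const std::size_t n_bytes = n_rows * degree * elem_size;
  try {
    return preferred(n_bytes);
  } catch (const std::bad_alloc&) {
    return fallback(n_bytes);
  }
}
```

The size computation mirrors the O(n_rows*graph_degree) footprint mentioned in the review comment; small graphs stay on the preferred resource and only large ones pay for the fallback.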

mfoerste4 · 2026-05-08T11:38:31Z

@achirkin , thanks for the additional review pass. I pushed a new version with a larger refactor covering the merge of the two iterators and also covered your other comments.

achirkin

Hi @mfoerste4 , thanks for addressing my comments, LGTM! A small nitpick below

achirkin · 2026-05-12T11:40:06Z

-                 num_oor);
+  // These host-side checks are expensive (O(N*D^2)) and only used as debug
+  // diagnostics, so only run them when debug logging is active at runtime.
+  if (raft::default_logger().should_log(rapids_logger::level_enum::debug)) {


mfoerste4 · 2026-05-12T21:30:25Z

@achirkin thanks for the review!

KyleFromNVIDIA

Approved trivial CMake changes

achirkin · 2026-05-13T13:39:41Z

/merge

mfoerste4 and others added 6 commits February 16, 2026 18:52

prune kernel smem

609b0f3

reduce copies within reverse graph compute

a320e0e

optimize() draft move more compute to GPU

6d1a618

Merge branch 'rapidsai:main' into cagra_optimize

77ab079

Merge branch 'rapidsai:main' into cagra_optimize

008e0fb

some fixes, cleanup

822faea

github-project-automation Bot added this to Unstructured Data Processing Feb 20, 2026

aamijar added the non-breaking (Introduces a non-breaking change) and feature request (New feature or request) labels Feb 24, 2026

aamijar assigned mfoerste4 Feb 24, 2026

mfoerste4 and others added 7 commits February 24, 2026 20:17

Merge branch 'main' into cagra_optimize

8ed1497

some fixes

9b1f741

extract prune into separate function

ecf3b1d

extract optimize components

972d278

enable both host/device inout graphs for optimize

5e9ebc5

resolve conflicts

8f24d9d

smaller fixes

40977e2

mfoerste4 marked this pull request as ready for review March 2, 2026 23:35

mfoerste4 requested a review from a team as a code owner March 2, 2026 23:35

mfoerste4 added 9 commits March 3, 2026 12:41

bugfix

14e9f3e

fuse and simplify pruning, remove CPU path

416558d

cleanup merge, remove CPU path

d8d8bd8

batch reverse creation

00c4204

add prefetch view to handle managed & host

9e63a7c

fix batched iterator

a38ad52

implement fallback / simplify strategy

89b0d1c

add logging / remove stats compute

d0e3dae

add test, persist stream pool, cleanup

ec45fd2

mfoerste4 requested a review from a team as a code owner March 10, 2026 22:50

mfoerste4 and others added 3 commits May 1, 2026 01:12

refactor remove all device pointer arithmetic from batch_device_view,…

f04022c

… utilize submdspan for passthrough

simplify batched_view to 2 buffers 1 copy stream

cf86064

Merge branch 'main' into cagra_optimize

aa250a7

mfoerste4 requested a review from achirkin May 1, 2026 02:33

coderabbitai Bot reviewed May 1, 2026

Comment thread cpp/src/neighbors/detail/cagra/utils.hpp Outdated

Comment thread cpp/src/neighbors/detail/cagra/utils.hpp Outdated

mfoerste4 and others added 2 commits May 4, 2026 08:56

stream-sync fix typo

821eae6

Merge branch 'main' into cagra_optimize

cd7be32

achirkin requested changes May 6, 2026

mfoerste4 and others added 6 commits May 7, 2026 23:03

Merge branch 'main' into cagra_optimize

22b32cb

merge into batch_load_iterator

6211ff3

more review suggestions

cc8d892

Merge branch 'cagra_optimize' of github.com:mfoerste4/cuvs into cagra…

fad99af

…_optimize

fix merge conflict within kmeans

0b6ea72

more review suggestions

3b70439

mfoerste4 requested a review from achirkin May 8, 2026 11:38

mfoerste4 and others added 2 commits May 8, 2026 13:32

fix async writeback without initialization

533d19b

Merge branch 'main' into cagra_optimize

d0a4cfd

achirkin approved these changes May 12, 2026

mfoerste4 and others added 3 commits May 12, 2026 21:26

more suggestions

28372a3

Merge branch 'cagra_optimize' of github.com:mfoerste4/cuvs into cagra…

7322903

…_optimize

Merge branch 'main' into cagra_optimize

bfc0520

Merge branch 'main' into cagra_optimize

5369eb6

KyleFromNVIDIA approved these changes May 13, 2026

rapids-bot Bot merged commit 0a85b6b into rapidsai:main May 13, 2026
86 checks passed

github-project-automation Bot moved this to Done in Unstructured Data Processing May 13, 2026

This was referenced May 22, 2026

Auto select CAGRA build algorithm for hnsw::build #1719

Open

Fix workspace usage #2135

Merged

7 participants