Merged
9 changes: 9 additions & 0 deletions vignettes/efficiency.Rmd
@@ -61,6 +61,15 @@ A few things to keep in mind when moving to GPU:
- **Copying data between R and the GPU is slow.** Each transfer has a fixed overhead plus a per-byte cost. Once you've moved an array onto the GPU, keep it there; avoid `as_array()` (which copies the value back into R) inside loops you want to run fast (see [Asynchronous execution](#asynchronous-execution) below).
- **Each call to the GPU has more overhead than a CPU call.** Small operations on small arrays can actually be *slower* on GPU than on CPU because the per-call overhead dominates the actual work.

### Writing GPU-friendly code

GPUs get their speed from running thousands of operations in parallel, so they reward code that does a lot of work per call on large arrays.
A few rules of thumb:

- **Avoid R-level loops over array elements or rows.** Each iteration launches a separate kernel with non-trivial overhead, and the kernels run one after another instead of in parallel. Express the work as a single vectorized operation on the whole array whenever possible (e.g. `x * y` or `nv_reduce_sum(x)` instead of a `for` loop).
- **Prefer batched operations.** Operations like `nv_matmul()` accept a full batch of matrices at once. Stacking many small inputs into one larger array and processing them in a single call is typically much faster than calling the same operation many times. This mirrors plain R, where vectorized code beats a `for` loop -- except that on a GPU the win comes from running the batch entries in parallel rather than from avoiding interpreter overhead.
- **Sometimes the algorithm itself has to be rewritten.** Inherently sequential formulations -- e.g. a running update that depends on the previous step -- can leave most of the GPU idle. Reformulating the computation to expose parallelism can be dramatically faster, even when the parallel version does more total work.
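
As a rough illustration of the last point, the sketch below computes a running sum two ways in plain base R (not with this package's API, so it runs on the CPU and only shows the *shape* of the two formulations). The function names `cumsum_sequential()` and `cumsum_doubling()` are hypothetical and exist only for this example. The second version uses the doubling scheme often called a Hillis-Steele scan: it performs roughly `n * log2(n)` additions instead of `n - 1`, but needs only about `log2(n)` whole-array steps, each of which could in principle be a single large GPU call.

```r
# Sequential formulation: step i depends on step i - 1,
# so the n - 1 additions must happen one after another.
cumsum_sequential <- function(x) {
  out <- x
  for (i in seq_along(x)[-1]) {
    out[i] <- out[i - 1] + x[i]
  }
  out
}

# Doubling (Hillis-Steele style) formulation: each pass is one
# whole-array operation, and only ceiling(log2(n)) passes are needed.
# It does more additions in total than the loop above.
cumsum_doubling <- function(x) {
  n <- length(x)
  out <- x
  shift <- 1
  while (shift < n) {
    idx <- (shift + 1):n
    # The right-hand side is evaluated in full before assignment,
    # so every element reads values from the previous pass.
    out[idx] <- out[idx] + out[idx - shift]
    shift <- shift * 2
  }
  out
}

x <- c(3, 1, 4, 1, 5, 9, 2, 6)
stopifnot(isTRUE(all.equal(cumsum_sequential(x), cumsum(x))))
stopifnot(isTRUE(all.equal(cumsum_doubling(x), cumsum(x))))
```

Whether the doubling version is actually faster on a given device depends on the array size and the per-call overhead discussed above, so it is worth benchmarking both before committing to a rewrite.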

## Data types

The data type of an `AnvlArray` (`f32`, `f64`, `i32`, `i64`, ...) affects both how much memory it uses and how fast operations on it run.