Introduction to Parallel Programming - Part I
Course from Udacity by NVIDIA; GitHub repo here.
1 The GPU Programming Model
The GPU favors throughput over latency and is optimized for it.
Typical GPU program
- CPU allocates storage on the GPU (cudaMalloc)
- CPU copies input data to the GPU (cudaMemcpy)
- CPU launches kernel(s) on the GPU to process the data
- CPU copies results back from the GPU (cudaMemcpy)
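A minimal sketch of this flow (hypothetical square kernel and array size; error checking omitted):

#include <cuda_runtime.h>

// hypothetical kernel: squares each element in place
__global__ void square(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_data[i] = d_data[i] * d_data[i];
}

int main()
{
    const int N = 1024;
    float h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = float(i);

    float *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(float));      // 1. allocate storage on the GPU
    cudaMemcpy(d_data, h_data, N * sizeof(float),
               cudaMemcpyHostToDevice);                   // 2. copy input data to the GPU
    square<<<(N + 255) / 256, 256>>>(d_data, N);          // 3. launch kernel(s) on the GPU
    cudaMemcpy(h_data, d_data, N * sizeof(float),
               cudaMemcpyDeviceToHost);                   // 4. copy result back from the GPU
    cudaFree(d_data);
    return 0;
}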
💡 Kernels look like serial programs. Write your program as if it will run on one thread; the GPU will run that program on many threads.
GPU is good for
- launching a large number of threads efficiently
- running a large number of threads in parallel
kernel<<<gridSize, blockSize, shmem>>>(...)
kernel<<<dim3(bx, by, bz), dim3(tx, ty, tz)>>>(...)

The GPU efficiently runs many blocks, and each block has a maximum number of threads.
(Images from Wikipedia thread-block.)
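A small sketch of how a thread locates its data in a 2D launch (hypothetical kernel; the launch configuration below is an assumption):

__global__ void kernel(float *out, int width, int height)
{
    // global 2D coordinates of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = 1.0f;    // flatten 2D coordinates to a 1D index
}

// launch: 16x16 threads per block, enough blocks to cover width x height
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// kernel<<<grid, block>>>(d_out, width, height);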
2 GPU Hardware & Parallel Communication Model
Abstraction & Communication Patterns
- Map (1-to-1): tasks read from and write to specific data elements (memory); map(element, function)
- Gather (n-to-1): tasks compute where to read data
- Scatter (1-to-n): tasks compute where to write data (see the gather/scatter sketch below)
- Stencil (n-to-1): tasks read from a fixed neighborhood in an array (data re-use)
- Transpose (1-to-1): tasks re-order data elements in memory
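A minimal sketch contrasting gather and scatter (hypothetical kernels; idx holds the computed locations):

// Gather: each output element computes where to READ from
__global__ void gather(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[idx[i]];
}

// Scatter: each input element computes where to WRITE to
// (unlike gather reads, scatter writes may collide)
__global__ void scatter(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[idx[i]] = in[i];
}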
Hardware & Model
The GPU is responsible for allocating blocks to SMs
- a thread block contains many threads
- an SM may run more than one block
- all threads in one block may cooperate to solve a subproblem, but cannot communicate with threads in other blocks (even on the same SM)
CUDA makes few guarantees about when and where thread blocks will run
- Pros: flexibility -> efficiency; scalability
- Cons: no assumptions about the block <-> SM mapping; no communication between blocks
CUDA guarantees:
- all threads in a block run on the same SM at the same time
- all blocks in a kernel finish before any blocks from the next kernel run
Synchronization
- Barrier: __syncthreads() - wait until all threads arrive, then proceed
- Atomic ops: only certain operations & data types; still no ordering guarantee
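A small single-block sketch of why the barrier matters (hypothetical kernel): shifting an array left by one needs a __syncthreads() between the read and the write, otherwise a thread may overwrite an element its neighbor has not read yet.

__global__ void shift_left(int *array, int n)   // assumes n <= blockDim.x, single block
{
    int i = threadIdx.x;
    int tmp = 0;
    if (i < n - 1) tmp = array[i + 1];   // every thread reads its right neighbor
    __syncthreads();                     // barrier: all reads finish before any write
    if (i < n - 1) array[i] = tmp;       // now it is safe to overwrite
}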
Strategies
Maximize arithmetic intensity: math / memory
- maximize compute ops per thread
- minimize time spent on memory
  - move frequently-accessed data to fast memory: local > shared >> global >> host
  - coalesce global memory accesses (coalesced >> strided)
Avoid thread divergence (branches & loops)
💡 How to use shared memory (static & dynamic)
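A minimal sketch of both flavors, assuming a block size of 256 and one tile per block (kernel names are placeholders):

// Static shared memory: size fixed at compile time
__global__ void static_shmem(const float *in, float *out)
{
    __shared__ float tile[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];                       // coalesced load from global memory
    __syncthreads();
    out[i] = tile[blockDim.x - 1 - threadIdx.x];     // reuse neighbors' data from fast memory
}

// Dynamic shared memory: size passed as the third launch parameter
__global__ void dynamic_shmem(const float *in, float *out)
{
    extern __shared__ float tile[];                  // size chosen at launch time
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();
    out[i] = tile[blockDim.x - 1 - threadIdx.x];
}

// launch: static_shmem<<<grid, 256>>>(d_in, d_out);
//         dynamic_shmem<<<grid, 256, 256 * sizeof(float)>>>(d_in, d_out);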
UPDATE: Cooperative Groups in CUDA 9 features
- define groups of threads explicitly at sub-block and multiblock granularities
- enable new patterns such as producer-consumer parallelism, opportunistic parallelism, and global synchronization across the entire Grid
- provide an abstraction for scalable code across different GPU architectures
#include <cooperative_groups.h>
using namespace cooperative_groups;

__global__ void cooperative_kernel(...)
{
    // obtain default "current thread block" group
    thread_group my_block = this_thread_block();

    // subdivide into 32-thread, tiled subgroups
    // Tiled subgroups evenly partition a parent group into
    // adjacent sets of threads - in this case each one warp in size
    thread_group my_tile = tiled_partition(my_block, 32);

    // This operation will be performed by only the
    // first 32-thread tile of each block
    if (my_block.thread_rank() < 32) {
        …
        my_tile.sync();
    }
}

3 & 4 Fundamental GPU Algorithms
Step complexity vs. work complexity (total amount of operations)
Reduce
Input:
- set of elements
- reduction operator
  - binary
  - associative: (a op b) op c = a op (b op c)
Serial reduce (n - 1 steps):

a   b
 \ /
  +   c
   \ /
    +   d
     \ /
      +

Tree (parallel) reduce (log n steps):

a   b   c   d
 \ /     \ /
  +       +
   \     /
    \   /
     \ /
      +
The tree gives log n step complexity. What if we only have p processors but n > p inputs? Brent's theorem: each processor first reduces about n/p elements serially, then the tree finishes the rest, for roughly O(n/p + log n) steps in total.
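A sketch of a tree reduce within one block using shared memory (hypothetical kernel; assumes blockDim.x is a power of two, each block writes one partial sum and a second launch or the host finishes the job):

__global__ void block_reduce(const float *in, float *partial, int n)
{
    extern __shared__ float sdata[];                 // blockDim.x floats, set at launch
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = (i < n) ? in[i] : 0.0f;             // load, pad with the identity
    __syncthreads();

    // tree reduction: log(blockDim.x) steps
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) partial[blockIdx.x] = sdata[0];    // one result per block
}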
Scan
Input:
- set of elements
- binary associative operator
- identity element I s.t. I op a = a
Two types of scan:
- Exclusive scan: output includes all elements before, but not including, the current element
- Inclusive scan: output includes all elements before and including the current element
| Algorithm | Description | Work | Step | Notes |
|---|---|---|---|---|
| Serial scan | - | O(n) | n | |
| Hillis & Steele scan | Starting with step 0, on step i, apply op to yourself and your 2^i left neighbor (if no such neighbor, copy yourself) | O(n log n) | log n | step efficient (more processors than work) |
| Blelloch scan | reduce -> downsweep (paper, wiki) | O(n) | 2 log n | work efficient (more work than processors) |
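A single-block sketch of the Hillis & Steele inclusive scan (hypothetical kernel; double buffering in shared memory avoids read/write races within a step):

__global__ void hillis_steele_scan(const int *in, int *out, int n)
{
    extern __shared__ int temp[];                    // 2 * n ints (double buffer)
    int tid = threadIdx.x;
    int pout = 0, pin = 1;

    if (tid < n) temp[pout * n + tid] = in[tid];     // inclusive scan: load input as-is
    __syncthreads();

    for (int offset = 1; offset < n; offset <<= 1) {
        pout = 1 - pout;                             // swap the double buffers each step
        pin = 1 - pout;
        if (tid < n) {
            if (tid >= offset)                       // op yourself to your 2^i left neighbor
                temp[pout * n + tid] = temp[pin * n + tid] + temp[pin * n + tid - offset];
            else                                     // no such neighbor: copy yourself
                temp[pout * n + tid] = temp[pin * n + tid];
        }
        __syncthreads();
    }
    if (tid < n) out[tid] = temp[pout * n + tid];
}

// launch (single block): hillis_steele_scan<<<1, n, 2 * n * sizeof(int)>>>(d_in, d_out, n);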
Sparse matrix / dense vector multiplication (SpMV) & segmented scan
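A minimal sketch of SpMV in CSR format, one thread per row (array names row_ptr, col_idx, vals are assumptions). With very uneven row lengths this load-imbalances, which is where the segmented-scan formulation helps.

__global__ void spmv_csr(const int *row_ptr, const int *col_idx,
                         const float *vals, const float *x,
                         float *y, int num_rows)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < num_rows) {
        float sum = 0.0f;
        // dot product of this sparse row with the dense vector x
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            sum += vals[j] * x[col_idx[j]];
        y[row] = sum;
    }
}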
Histogram
- Accumulate using atomics (see the sketch after this list)
- Per-thread local histograms, then reduce
- Sort, then reduce by key
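A sketch of the first strategy (hypothetical kernel and binning function): every thread computes its bin and increments it with atomicAdd so concurrent updates are not lost.

__global__ void histogram_atomic(const int *in, int *bins, int n, int num_bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int bin = in[i] % num_bins;      // assumed binning function
        atomicAdd(&bins[bin], 1);        // serializes conflicting updates on the same bin
    }
}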
Compact
Compact is most useful when we compact away a large number of elements and the computation on each surviving element is expensive. The steps (sketched below):
- Predicate
- Scan-in array A (0 for false and 1 for true)
- Exclusive-sum-scan on array A -> scatter addresses (dense)
- Scatter using addresses
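A serial host-side sketch of the three array passes (hypothetical predicate: keep even values); each loop corresponds to one parallel stage above.

void compact(const int *in, int *out, int n)
{
    int *pred = new int[n], *addr = new int[n];

    // 1. predicate array: 1 if the element survives, 0 otherwise
    for (int i = 0; i < n; ++i) pred[i] = (in[i] % 2 == 0);

    // 2. exclusive sum scan of the predicates -> dense scatter addresses
    addr[0] = 0;
    for (int i = 1; i < n; ++i) addr[i] = addr[i - 1] + pred[i - 1];

    // 3. scatter the surviving elements to their addresses
    for (int i = 0; i < n; ++i)
        if (pred[i]) out[addr[i]] = in[i];

    delete[] pred; delete[] addr;
}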
Allocate
Using scan (good strategy)
- Input: allocation request per input element
- Output: location in array to write your thread's output
Sort
- Odd-even sort: O(n) steps & O(n^2) work (sketch below)
- (Parallel) merge sort
  - tons of tasks (each task small): task per thread
  - bunches of tasks (each task medium): task per block
  - one task (big): split the task across SMs
- Sorting network (e.g., bitonic sorter, image from Wikipedia)
- Radix sort: O(kn) where k is the number of bits in the representation (quite brute-force)
- Quick sort

Oblivious: behavior is independent of some aspects of the problem (e.g., the actual input values); sorting networks are oblivious.
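A single-block sketch of odd-even (transposition) sort, assuming one thread per pair (blockDim.x >= n / 2; kernel name is a placeholder):

__global__ void odd_even_sort(int *data, int n)
{
    int tid = threadIdx.x;
    for (int phase = 0; phase < n; ++phase) {
        // even phases compare pairs (0,1), (2,3), ...; odd phases (1,2), (3,4), ...
        int i = 2 * tid + (phase & 1);
        if (i + 1 < n && data[i] > data[i + 1]) {
            int tmp = data[i];
            data[i] = data[i + 1];
            data[i + 1] = tmp;
        }
        __syncthreads();                 // all swaps in this phase finish before the next
    }
}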
