diff --git a/README.md b/README.md index 0e38ddb1..ec61675b 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,274 @@ -CUDA Stream Compaction -====================== +# **Project 2 - CUDA Stream Compaction** +## **University of Pennsylvania, CIS 5650: GPU Programming and Architecture** -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** +* Jefferson Koumba Moussadji Lu + * [LinkedIn](https://www.linkedin.com/in/-jeff-koumba-0b356721b/) +* Tested on: Personal Laptop, Windows 11 Home, Intel(R) Core(TM) i9-14900HX @ 2.22GHz @ 24 Cores @ 32GB RAM, Nvidia GeForce RTX 4090 @ 16 GB @ SM 8.9 -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +

+ +

Stream Compaction

+

-### (TODO: Your README) +## Overview -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +Stream compaction is a parallel algorithm for filtering out elements from an array that do not meet a given condition, while preserving the order of the remaining elements. In other words, given an input array and a predicate (e.g. "is non-zero"), stream compaction produces a new output array containing only the elements that satisfy the predicate, in their original order. This operation is important in GPU programs to reduce unnecessary processing and memory transfer of unwanted data. It is widely used in applications like particle filtering, path tracing (to compact out terminated rays), collision detection, and sparse matrix operations. +By compacting data on the GPU, we minimize bandwidth usage when copying results back to the CPU and avoid serial processing of irrelevant elements. +A prefix-sum scan is the backbone of efficient stream compaction. A scan computes an array of prefix sums from an input array. For example, an inclusive prefix sum with addition on input ```[1 ,3, 5, 9]``` would produce ```[1, 4, 9, 18]``` (each output is the sum of all inputs up to that index). + +In this project, we focus on the exclusive scan, where the output is shifted right by one, and the first element is the identity (0 for sum). The scan is fundamental because it can quickly produce indices for writing compacted output in parallel. + +### Summary of features: + +We implemented multiple versions of scan and stream compaction on both CPU and GPU and compared their performance. The features include: + +- CPU Scan and CPU Stream Compaction (both a direct without-scan method and a with-scan method using prefix sum). + +- Naive Parallel Scan (GPU) - a work-inefficient but simple CUDA implementation. + +- Work-Efficient Parallel Scan (GPU) - an optimized scan using the Blelloch algorithm (up-sweep/down-sweep), and GPU stream compaction based on this scan. + +- Thrust Library Scan (GPU) - using CUDA Thrust’s exclusive scan for comparison. + +- Shared Memory & Radix sort: An optimized scan with shared memory for better performance, and a Parallel Radix Sort (stable, scan-based) that supports sorting signed 32-bit integers using our scan primitives. + +## Features Implemented + +### CPU Scan & Stream Compaction (CPU) + +We implemented a straightforward CPU exclusive scan and two versions of CPU stream compaction to verify correctness and provide baseline performance. The CPU scan simply iterates through the array accumulating a running sum, which runs in O(N) time. +For compaction, we developed: +- ```compactWithoutScan``` (CPU): a simple loop that checks each element and writes it to the output array if it meets the condition (non-zero, in our case). This approach filters values in one pass. + +- ```compactWithScan``` (CPU): a two-pass method that first builds a boolean mask array (1 for elements to keep, 0 for discard), then performs a prefix-sum scan on this mask to compute output indices, and finally scatters the input values to their computed positions. This method uses the scan operation to achieve compaction. + +Both CPU compaction methods produce identical results, but the direct ```compactWithoutScan``` is faster on the CPU since it does minimal work. Using a full scan on the CPU adds overhead that isn’t worthwhile in serial contexts. In our tests, the simple loop compaction outpaced the scan-based approach on CPU (as expected), reflecting that the scan method’s benefit lies in parallelization rather than efficiency on a single core. These CPU implementations served as a correctness reference for the GPU versions. + +### Naive Parallel Scan (GPU) + +

+ +

This image is an implementation of an inclusive scan. To convert this to an exclusive scan, we can simply shift the inclusive scan results one position to the right

+ +The naive GPU scan is an exclusive prefix sum implemented using a parallel version of the classic Hillis & Steele algorithm (also known as the Kogge-Stone scan). This approach is called “naive” because it is not work-efficient: it performs O(N log N) operations for an array of length N. + +The algorithm runs in $\lceil \log_2 N \rceil$ iterations; in the d-th iteration, each thread reads two values separated by $2^{d-1}$ and adds them, writing the partial sum. By iteration’s end, the array is transformed into an inclusive scan. We convert it to an exclusive scan by shifting the results right and inserting the identity (0) at the start. + +Our implementation launches enough threads to cover the array and uses GPU global memory for all reads/writes at each step. This simple approach keeps all threads busy each iteration (no branching based on index) and is relatively easy to implement. However, it does significantly more work than necessary since each element’s value is involved in many repeated additions across iterations. The naive scan also issues many global memory transactions due to repeatedly reading and writing the array on each iteration. We took care to handle non-power-of-two (NPOT) array sizes by checking bounds before accessing out-of-range indices on each iteration. The result is a correct exclusive scan that’s easy to verify but not optimal in terms of GPU work efficiency. + +### Work-Efficient Scan & Compaction (GPU) + +For a faster parallel scan, we implemented the work-efficient Blelloch scan algorithm. This approach performs only O(N) operations by using a balanced binary tree method in two phases: an up-sweep (reduce) and a down-sweep. In the up-sweep, the kernel recursively reduces the array: pairs of elements are summed to form partial sums at increasing stride distances (1, 2, 4, ...), until the last element contains the total sum. In the down-sweep phase, we first set the last element to 0 (this ensures an exclusive scan result), then we propagate the partial sums downward: each step, each thread uses the value it computed in up-sweep to set a prefix sum for two children elements. After downsweep completes, the array has been transformed into an exclusive prefix sum of the input values. + +

+ +

Up Sweep phase

+ +

+ +

Down Sweep Phase

+ +Our Blelloch scan is implemented with multiple kernel launches: each sweep level (for both up and down phases) is a CUDA kernel invocation that performs the operation at a given stride. This was simpler to implement and debug. We again handle NPOT sizes by ignoring threads/work on indices beyond the array length. The result is an accurate exclusive scan that does minimal total work compared to the naive version. However, we observed that work-efficient does not always mean faster in practice. The overhead of launching multiple kernels and the fact that many threads do nothing in certain steps (causing warp divergence) can hurt performance. For example, at each up-sweep level $d$, half of the threads become idle (those at indices not divisible by $2^d$), leading to many inactive threads that still occupy GPU resources. Despite these practical considerations, the work-efficient scan should scale better for very large N, since it has asymptotically less work than the naive scan. + +Using our GPU scan, we implemented GPU stream compaction in a work-efficient manner. The compaction consists of three steps on the device: + +1) Map: Generate a boolean mask array where each entry is 1 if the corresponding input element should be kept (e.g. is non-zero), or 0 if it should be removed. + +2) Scan: Run an exclusive scan on this mask array to produce an array of indices (or offsets). Each index tells us the target position of the corresponding element in the compacted output. The last value of the scan (plus the last mask value) also gives the total count of surviving elements. + +3) Scatter: Launch a kernel that, for each index i where the mask is 1, reads the input value and writes it to the output at index ```maskScan[i]```. Threads for which the mask is 0 do no write. This effectively packs all desired elements to the front of the output array without gaps. + +All three steps are O(N) on the GPU and involve mostly memory-bound operations (aside from the additions in the scan). Our implementation preserves the order of input elements. We verified that the GPU compaction output matches exactly with the CPU output and that the count of surviving elements is correct for both power-of-two and non-power-of-two array sizes. The GPU compaction uses the parallel scan to efficiently compute scatter locations, which is much faster than any serial method once N is large. + +### Thrust Scan (GPU) + +To compare our custom implementations against an industry-grade optimized solution, we also incorporated the Thrust library’s scan. Thrust is NVIDIA’s parallel algorithms library, and we used ```thrust::exclusive_scan``` on device arrays for benchmarking. Thrust’s implementation is highly optimized. It efficiently handles arbitrary sizes and is tuned for the GPU architecture. For example, Thrust’s scan uses shared memory and other optimizations internally. We used the Thrust scan results to validate our own scans and to serve as a performance baseline (“best case” scenario) for prefix sum on our hardware. + +### Optimized Shared-Memory Scan (GPU) + +We further optimized the work-efficient scan by using CUDA shared memory to reduce global memory traffic. In the original Blelloch scan, each partial sum operation reads and writes to the global array. Our optimized version instead loads data into each thread block’s shared memory and performs the up-sweep and down-sweep steps locally in fast on-chip memory. Each block computes a scan for its portion of the array. We then handle the inter-block sums by performing a scan of each block’s total (this uses a separate small array of block sums). Finally, we distribute the offsets to each block and do a second pass to add the block offset to each element within the block. This multi-kernel approach ensures most of the heavy-lifting is done with low-latency memory. The result is a significant speedup: by reducing expensive global memory accesses and using memory coalescing, the shared-memory scan approached the performance of Thrust’s highly optimized scan in our tests. The functionality and results of this optimized scan remained the same as our original work-efficient scan, but it highlights how careful memory layout and access patterns can drastically improve performance. + +### Parallel Radix Sort (GPU) + +For an additional challenge, we implemented a parallel radix sort on the GPU using our scan as a subroutine. The radix sort is stable and supports sorting of signed 32-bit integers (both positive and negative). Our approach is a least-significant-bit (LSB) first radix sort (also known as an LSD radix sort) that processes one bit at a time using a scatter operation based on prefix sums. For each bit position from 0 to 31, we: + +1) Compute bit mask: Generate an array of 0/1 flags indicating whether each number’s current bit is 0 or 1. For sorting in ascending order, we treat 0-bit as “keep” (these will go to the front) and 1-bit as “remove” in a sense. Alternatively, one can partition by counting 0s vs 1s. + +2) Scan the mask: We perform an exclusive prefix sum on the inverted mask (e.g. treat 1s as 0s and 0s as 1s) to get the destination indices for elements with the current bit = 0. This gives us the positions where the 0-bit elements should go in the output for this pass. + +3) Scatter pass: We scatter elements to a temporary output array: elements with 0 in the current bit are placed at the beginning (in the order determined by the scanned indices), and elements with 1 in that bit are placed after all the 0s. Because we use the scan results, this scatter is stable. It preserves the relative order of elements with the same bit value across passes, which is crucial for correctness in radix sort. + +4) Swap arrays: The output of this pass becomes the input for the next bit pass (we alternate buffers). + +We repeat the above for all 32 bits of the integer. After the final bit, the array is sorted in ascending order. To handle negative numbers correctly, we adopted a common trick: interpret the 32-bit integers in two’s complement form such that all negative values appear “smaller” than positives. In practice, this can be achieved by offsetting values (e.g., adding $2^{31}$ to each number to make them all non-negative, sorting, then subtracting it back) or by modifying the final pass to ensure the sign bit is handled appropriately. Our implementation ensures that all negative numbers end up at the front of the sorted array (still in ascending order among themselves), followed by all positive numbers. The sorting is stable by construction, which means if two elements have equal value, their original order is preserved (important if this sort were to be extended for more complex keys). The complexity is O(32 * N) = O(N), since we perform a fixed 32 passes of scan/scatter regardless of input size. Each pass is parallel and largely memory-bandwidth bound. While 32 passes is a lot of work, this method can outperform a comparison-based sort for large N because it has better parallel scalability. We compared our GPU radix sort against ```std::sort``` on the CPU to evaluate performance. + +To call Radix Sort: +- Look into the file efficient.cu +``` +// Example call (device pointers): +// n = number of elements; d_in, d_out are int* on device memory +StreamCompaction::Efficient::Radix::sort(n, d_out, d_in); +// After return, d_out holds the stable, ascending result.s +``` + +## Performance Analysis + +

+ +

Execution time (log-scale) vs. input size for various scan (prefix-sum) implementations and baselines, using a fixed block size of 128 threads

+ +| Array Size | CPU Scan | Naive Scan | WE Scan | Thrust Scan | CPU Compact (w/o scan) | CPU Compact (with scan) | WE Compact | Shared Scan | Radix Sort | +| ---------: | --------: | ---------: | -------: | ----------: | ---------------------: | ----------------------: | ---------: | ----------: | ---------: | +| 2^8 | 0.000450 | 0.135120 | 0.159744 | 0.126464 | 0.000500 | 0.004400 | 0.508832 | 0.104448 | 9.245700 | +| 2^12 | 0.005700 | 0.160800 | 0.194560 | 0.044192 | 0.006100 | 0.024900 | 0.387104 | 0.144384 | 5.800960 | +| 2^16 | 0.096250 | 0.194736 | 0.257072 | 0.042288 | 0.104350 | 0.285600 | 0.502672 | 0.194624 | 6.959100 | +| 2^20 | 1.492150 | 0.549088 | 0.354976 | 0.300816 | 1.614800 | 4.622800 | 5.415200 | 0.241312 | 11.718700 | +| 2^24 | 23.081900 | 8.591905 | 2.714130 | 1.075105 | 30.581250 | 85.811400 | 17.569300 | 0.828192 | 107.652000 | + +We can notice that GPU-based scans (Naive, Work-Efficient, Shared-memory, and Thrust) scale linearly and run much faster than the CPU scan, especially as the array size grows. + +

+ +

Execution time of stream compaction algorithms vs. CUDA block size (threads per block) for a fixed 1,048,576-element array

+ +| Block Size | CPU Scan | Naive Scan | WE Scan | Thrust Scan | CPU Compact (w/o scan) | CPU Compact (with scan) | WE Compact | Shared Scan | Radix Sort | +| ---------: | -------: | ---------: | -------: | ----------: | ---------------------: | ----------------------: | ---------: | ----------: | ---------: | +| 64 | 1.687300 | 0.485184 | 0.427136 | 0.332160 | 1.858200 | 4.837900 | 1.614050 | 0.494176 | 10.397700 | +| 128 | 1.492150 | 0.549088 | 0.354976 | 0.300816 | 1.614800 | 4.622800 | 5.415200 | 0.241312 | 11.718700 | +| 256 | 1.600250 | 0.380368 | 0.333600 | 0.337344 | 1.717300 | 4.884100 | 1.571085 | 0.217696 | 10.311700 | +| 512 | 1.462850 | 0.406608 | 0.375488 | 0.311328 | 1.638950 | 4.688700 | 1.589280 | 0.463008 | 19.420200 | +| 1024 | 1.481200 | 0.412144 | 0.675872 | 0.484688 | 1.739750 | 4.773300 | 1.562370 | 0.264000 | 10.693600 | + +We notice the GPU Work-Efficient compaction is shown in green, and its performance improves dramatically with block size up to ~256 threads. CPU baselines (horizontal lines) are much slower (higher runtime) and unaffected by block size. Lower runtime is better (log-log scale). + +### Scan Performance + + +#### GPU vs CPU Speedup +All GPU scan implementations significantly outperform the CPU sequential scan for large inputs. By the largest tested size (~16 million elements), the fastest GPU scan finishes in only a few milliseconds, whereas the CPU takes on the order of 10^1-10^2 milliseconds. This is an over 10× speedup (one to two orders of magnitude). In fact, even at more moderate sizes (e.g. around 1 million elements), the GPU already provides a ~10× overall sorting/scanning speedup over the CPU. The crossover point at which the GPU pulls ahead is on the order of only tens of thousands of elements. Beyond that, GPU parallelism dominates. + +#### Naive vs. Work-Efficient Algorithms +The Naive parallel scan algorithm performs asymptotically more work (O(n log n) operations) than the Work-Efficient scan (O(n) operations). This extra factor of log n has a noticeable impact on performance at large n: in our data, the work-efficient scan begins to outperform the naive version markedly as n grows. By the 16M input case, the work-efficient GPU scan runs roughly 2-3× faster than the naive scan (which does redundant work) because it avoids the substantial overhead of those extra log n steps. In short, minimizing algorithmic work pays off, as predicted by complexity analysis. + +#### Impact of Shared Memory & Thrust + +Using shared memory for intra-block scanning boosts performance over a naive global-memory scan. Our custom Shared Memory scan shows better throughput than the purely naive approach, thanks to fewer global memory accesses. The fully optimized Work-Efficient scan (based on the Blelloch upsweep/downsweep) is even faster, nearly matching the performance of NVIDIA’s highly-optimized ```thrust::inclusive_scan```. In fact, the Thrust library scan and our work-efficient implementation achieve similar timings at large sizes indicating that our implementation is close to state-of-the-art efficiency. + +By leveraging parallelism, the GPU scan implementations run orders of magnitude faster than a single-threaded CPU scan for substantial array sizes. The work-efficient algorithm, which achieves O(n) work complexity, overtakes the work-inefficient naive scan for large n, illustrating the importance of algorithmic efficiency on GPUs. Optimizations like shared memory usage further improve performance. In practice, all GPU variants handle millions of elements with ease (on the order of milliseconds), whereas the CPU time grows into tens or hundreds of milliseconds. The results demonstrate that GPUs are extremely well-suited to prefix-sum computations, achieving high throughput as problem size increases. + +#### Effect of CUDA Block Size + +We found that scan performance is relatively insensitive to block size in a broad range, with a slight sweet spot around medium block sizes (256-512 threads). Very small blocks (e.g. 128 threads) underutilize the GPU, leading to a bit more overhead per element, while extremely large blocks (1024 threads) can reduce occupancy or increase serialization, also slightly hurting throughput. For example, in tests at 1 million elements, using 256 threads/block yielded near-best performance, whereas 128 threads/block was a bit slower (~10-20% slower for some scan kernels) and 1024 threads showed a minor slowdown as well. Overall, as long as the block size is reasonably large, the scan kernels maintain high throughput; the performance stays close to linear scaling with input size (see figure above) regardless of block configuration. + +### Compaction Performance + +#### GPU vs CPU Compaction + +The work-efficient GPU compaction algorithm vastly outperforms the CPU-based approaches. For a ~1 million element array, the GPU compaction completes in only around 0.3-0.5 ms (at optimal block size), whereas the CPU implementations take on the order of 2-5 ms for the same task, roughly an order of magnitude slower. At the maximum tested size (~16 million elements, not shown in the block-size figure but in the size-scaling results), this gap widens further: the GPU finishes in only a few milliseconds, while the CPU requires tens of milliseconds (20× or more slower for 16M). The parallel speedup here is similar to what we observed for scanning, since GPU compaction essentially leverages a parallel scan internally to mark positions. + +#### Single-pass vs Two-pass CPU + +Between the two CPU implementations, compaction without scan (single-pass filtering) was slightly faster than compaction with scan. This makes sense since the two-pass method performs an extra full scan plus an additional memory pass to scatter elements, which in a single-threaded context just adds overhead without any parallel gain. Our results confirmed only a small difference (both CPU methods are slow), but the version avoiding the extra scan has an edge, especially as n grows. In other words, doing everything in one pass (branching on each element and writing out survivors) is more cache-friendly and incurs fewer total operations on the CPU than doing a separate prefix sum then copy. + +#### GPU Work-Efficient Compact + +Our GPU compaction uses an efficient parallel scan to identify non-zero (or valid) elements and then scatters them in parallel. Despite doing multiple steps (scan + scatter), it still runs orders of magnitude faster than the CPU because each step is massively parallel. The work-efficient GPU approach has linear work complexity overall, similar to the GPU scan. As a result, its runtime scales linearly with n and remains only slightly above the pure scan time. For instance, at 1M elements the work-efficient compaction only takes marginally longer than a scan (owing to the additional write step). This shows that the overhead of compaction beyond scanning is minimal on GPU, whereas on CPU that overhead cannot be hidden. + +#### Block Size Tuning + +The performance of the GPU compaction kernel does depend on block size more noticeably than the scans. As shown in the figure, at a very small block size (128 threads) the work-efficient compaction was about 10× slower for 1M elements compared to using 256 threads. With only 128-thread blocks, the kernel likely launched many more blocks and incurred extra coordination overhead (and possibly did not fully occupy the GPU’s SMs). Increasing to 256 threads/block improved utilization dramatically, yielding the best performance in our tests. Beyond 256, the performance remained high: 512 and 1024 threads per block gave similar results, with maybe a slight dip at 512 in our data (possibly due to some memory or occupancy trade-off). In summary, ≥256 threads per block is optimal for the compaction kernel on our GPU. In contrast, the CPU times (horizontal lines in the figure) are flat since they do not involve GPU block sizes at all, underlining that their performance is consistently poor relative to the GPU, regardless of configuration. + +The work-efficient GPU stream compaction exhibits excellent performance, completing in time comparable to a single scan and massively outperforming the CPU versions. For large arrays, the GPU achieves an order-of-magnitude (or more) speedup over the CPU for compaction. The single-pass CPU method is marginally faster than the two-pass (with scan) method, but both are far slower than the GPU. We also see that proper choice of block size is important for GPU compaction. Too small blocks can hurt performance, but with a reasonable block size (256+ threads), the GPU algorithm runs at full efficiency. These results highlight that parallel prefix-sum-based compaction on GPU not only is correct for arbitrary input sizes, but also delivers a huge performance advantage, leveraging throughput that the CPU cannot match on this problem size. + +### Radix Sort vs ```std::sort``` + +#### Crossover and Scaling + +We implemented an integer Radix Sort on the GPU (stable, least-significant-digit approach) and compared it to the C++ ```std::sort``` on the CPU. As expected, there is a crossover point beyond which the GPU outperforms the CPU. For small arrays (e.g. a few thousand elements), the CPU’s ```std::sort``` can be as fast or faster, because the GPU incurs overhead (kernel launch and, if included, data transfer) that dominates at tiny problem sizes. However, once the input size grows into the tens of thousands, the GPU gains the advantage. In our tests, around ~10k–100k elements the GPU sort begins to pull ahead. By 1 million elements, the GPU’s Radix Sort was roughly an order of magnitude faster than single-threaded ```std::sort```. This matches general observations in literature: for example, using Thrust on a GPU yields about a 10× speedup at 1M elements, and the gap tends to increase with larger n. + +#### Throughput at Large n + +The benefit of GPU sorting becomes even more pronounced at very large input sizes. The Radix Sort algorithm has linear time complexity O(n * k) (where k is a small factor related to number of digit passes), and it maps extremely well to parallel hardware. In contrast, ```std::sort``` (introspective comparison sort) has O(n log n) complexity and runs on a single thread by default. As n grows, the cost per element for ```std::sort``` increases (log n factor), whereas for Radix Sort the cost per element can actually decrease slightly due to caching efficiencies in linear passes. Consequently, the speedup of GPU Radix over CPU sort grows with problem size. NVIDIA reports that using Thrust (which employs a GPU Radix Sort under the hood) can deliver anywhere from 5× up to 100× faster sorting performance than a typical STL sort on CPU, depending on the input size and data type. Indeed, for extremely large arrays, the GPU can be two orders of magnitude faster: one user reported sorting 2^27 (~134 million) 32-bit values on a GPU ~100× faster than a 4.5 GHz CPU doing ```std::sort```. Our own tests didn’t reach that size, but at the upper end we did test (several million elements), the GPU Radix Sort consistently maintained a wide lead over ```std::sort``` (likely tens of times faster at 16M elements, extrapolating our results). + +#### Parallelism vs Optimization + +It’s worth noting that ```std::sort``` is highly optimized for general-purpose CPU usage, but in our comparison it was executed single-threaded (the default). Multi-threaded sorting on a CPU (e.g. using OpenMP or TBB parallel sort) would improve CPU performance and raise the crossover point, but leveraging more CPU cores only narrows the gap, rather than eliminating it. The GPU is specifically strong in sorting huge volumes of data in parallel due to its thousands of cores and high memory bandwidth. In other words, sorting is a data-parallel task where the GPU’s architecture shines. Our Radix Sort takes advantage of this by distributing work across many threads and using efficient memory access patterns (e.g. scatter/gather with coalescing). Meanwhile, the CPU’s comparison sort must perform more work (log n comparisons per element on average) and is limited by sequential memory access and branch mispredictions. The outcome is clear in our measurements: for large arrays, the GPU Radix Sort runs in only a few milliseconds, whereas ```std::sort``` takes tens or hundreds of milliseconds. + +GPU-based Radix Sort outperforms the C++ ```std::sort``` by a large margin for non-trivial input sizes. While ```std::sort``` is very fast for small datasets and benefits from years of optimization, it cannot keep up with the sheer parallel throughput of the GPU on large data sets. Once past a certain input size (on the order of 10^4-10^5 elements), the GPU gains a performance lead, and by millions of elements the difference is dramatic (an order of magnitude or more in favor of the GPU). These findings reinforce the advantage of linear-time, parallel sorting methods on modern GPUs: they can handle massive datasets significantly faster than the default CPU sorting routine, illustrating the power of GPU acceleration for high-volume data processing. The Radix Sort’s performance scaling is such that the larger the input, the more it outpaces the CPU. A trend that is consistent with theoretical expectations and other reports in the field. The result is a compelling win for GPU computing in the context of large-scale sorting. + +## Sample Output from Final Tests + +This is the final output running the test cases with: +- Array Power-Of-Two Size = $ 2^{24} $ +- Array Non-Power-Of-Two Size = $ 2^{24} - 3 $ +- Block Size = 128 + + +``` +**************** +** SCAN TESTS ** +**************** + [ 24 31 8 5 35 2 12 35 24 2 26 42 21 ... 12 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 24.4372ms (std::chrono Measured) + [ 0 24 55 63 68 103 105 117 152 176 178 204 246 ... 411033596 411033608 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 24.253ms (std::chrono Measured) + [ 0 24 55 63 68 103 105 117 152 176 178 204 246 ... 411033547 411033572 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 8.45494ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 8.3873ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 3.18029ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 2.86925ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 1.13389ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.760352ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 0 1 3 2 0 1 2 0 0 0 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 27.7898ms (std::chrono Measured) + [ 2 3 1 3 2 1 2 1 2 3 1 1 1 ... 3 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 27.2077ms (std::chrono Measured) + [ 2 3 1 3 2 1 2 1 2 3 1 1 1 ... 1 1 ] + passed +==== cpu compact with scan ==== + elapsed time: 113.672ms (std::chrono Measured) + [ 2 3 1 3 2 1 2 1 2 3 1 1 1 ... 3 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 15.8828ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 15.6678ms (CUDA Measured) + passed + +***************************** +** EXTRA CREDIT TESTS ** +***************************** + [ 24 -19 -42 -45 -15 -48 -38 35 24 2 26 42 -29 ... 12 36 ] +==== radix sort ==== + elapsed time: 92.1836ms (CUDA Measured) + passed + [ 27 30 4 0 49 28 24 8 36 4 28 36 9 ... 39 14 ] + elapsed time: 42.7475ms (std::chrono Measured) +==== shared scan ==== + elapsed time: 0.852576ms (CUDA Measured) + passed +``` diff --git a/img/Stream_compaction.png b/img/Stream_compaction.png new file mode 100644 index 00000000..3ed5cca8 Binary files /dev/null and b/img/Stream_compaction.png differ diff --git a/img/UpSweep.jpg b/img/UpSweep.jpg new file mode 100644 index 00000000..22c4da0f Binary files /dev/null and b/img/UpSweep.jpg differ diff --git a/img/impact_by_array_size.png b/img/impact_by_array_size.png new file mode 100644 index 00000000..afd2dd84 Binary files /dev/null and b/img/impact_by_array_size.png differ diff --git a/img/impact_by_block_size.png b/img/impact_by_block_size.png new file mode 100644 index 00000000..208d63ae Binary files /dev/null and b/img/impact_by_block_size.png differ diff --git a/src/main.cpp b/src/main.cpp index 3d5c8820..790b16c7 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,9 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +//const int SIZE = 1 << 8; // feel free to change the size of array + +const int SIZE = 1 << 24; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -147,6 +149,53 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + ////////////////////////////////////////////////////////////// + + // ### Extra Credit Feature Tests ### + + printf("\n"); + printf("*****************************\n"); + printf("** EXTRA CREDIT TESTS **\n"); + printf("*****************************\n"); + + // Radix sort test + srand(time(nullptr)); // Seed random number generator + for (int i = 0; i < SIZE; ++i) { + a[i] = rand() % 100 - 50; // random int including negatives:contentReference[oaicite:24]{index=24} + } + printArray(SIZE, a, true); // Print input array (abridged) for reference + for (int i = 0; i < SIZE; ++i) { + b[i] = a[i]; // Copy input to b for CPU sorting + } + std::sort(b, b + SIZE); // CPU sort (std::sort) for reference output + zeroArray(SIZE, c); // Zero out output array for GPU + printDesc("radix sort"); // ==== radix sort ====:contentReference[oaicite:25]{index=25} + StreamCompaction::Radix::sort(SIZE, c, a); // GPU radix sort on input array + printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), + "(CUDA Measured)"); // GPU execution time (ms) + //printArray(SIZE, c, true); // (Optional) Debug: print sorted output array + printCmpResult(SIZE, b, c); // Compare GPU result (c) vs CPU result (b):contentReference[oaicite:26]{index=26} + + // Shared scan test + genArray(SIZE, a, 50); // Generate new random array of length SIZE:contentReference[oaicite:27]{index=27} + printArray(SIZE, a, true); // Print input array (abridged) + zeroArray(SIZE, b); + StreamCompaction::CPU::scan(SIZE, b, a); // CPU exclusive scan on input array + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), + "(std::chrono Measured)"); // CPU scan time (ms):contentReference[oaicite:28]{index=28} + //printArray(SIZE, b, true); // (Optional) Debug: print CPU scan output + zeroArray(SIZE, c); + printDesc("shared scan"); // ==== shared scan ====:contentReference[oaicite:29]{index=29} + StreamCompaction::Shared::scan(SIZE, c, a); // GPU shared-memory exclusive scan + printElapsedTime(StreamCompaction::Shared::timer().getGpuElapsedTimeForPreviousOperation(), + "(CUDA Measured)"); // GPU execution time (ms) + //printArray(SIZE, c, true); // (Optional) Debug: print GPU scan output + printCmpResult(SIZE, b, c); // Compare GPU result (c) vs CPU result (b) + + // (End of extra credit tests) + + ////////////////////////////////////////////////////////// + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index 19511caa..9ff58cf0 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -4,7 +4,7 @@ set(headers "naive.h" "efficient.h" "thrust.h" - ) + ) set(sources "common.cu" diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d630..0ce27a17 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,11 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + + int idx = threadIdx.x + (blockIdx.x * blockDim.x); // map from threadIdx/blockIdx to element idx + if (idx >= n) return; // if idx is out of bounds, return + bools[idx] = (idata[idx] != 0) ? 1 : 0; // map to 1 if idata[idx] is non-zero, else map to 0 + } /** @@ -33,6 +38,14 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + + int idx = threadIdx.x + (blockIdx.x * blockDim.x); // map from threadIdx/blockIdx to element idx + if (idx >= n) return; // if idx is out of bounds, return + + // if bools[idx] is 1, copy idata[idx] to odata[indices[idx]] + if (bools[idx] == 1) { + odata[indices[idx]] = idata[idx]; // scatter + } } } diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa115..67caeed5 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -20,7 +20,23 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO - timer().endCpuTimer(); + + // exclusive scan + if (n <= 0) { + + // Empty input, nothing to do + timer().endCpuTimer(); + return; + } + + odata[0] = 0; // first element is always 0 for exclusive scan + + // compute the rest of the elements + for (int i = 1; i < n; ++i) { + odata[i] = odata[i - 1] + idata[i - 1]; // exclusive scan + } + + timer().endCpuTimer(); } /** @@ -31,10 +47,26 @@ namespace StreamCompaction { int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + + int count = 0; // number of non-zero elements + + // iterate through input array + for (int i = 0; i < n; ++i) { + + if (idata[i] != 0) { // keep only non-zero elements + odata[count] = idata[i]; // write to output array + count++; // increment count + } + } + timer().endCpuTimer(); - return -1; + + return count; // return number of non-zero elements + //return -1; } + + /** * CPU stream compaction using scan and scatter, like the parallel version. * @@ -43,8 +75,46 @@ namespace StreamCompaction { int compactWithScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + + // Handle edge case of empty input + if (n <= 0) { + timer().endCpuTimer(); + return 0; + } + + // Map to boolean array (1 for non-zero, 0 for zero) + int* bools = new int[n]; + for (int i = 0; i < n; ++i) { + bools[i] = (idata[i] != 0) ? 1 : 0; + } + + // Exclusive prefix sum (scan) on the boolean array + int* indices = new int[n]; // to hold the scanned indices + indices[0] = 0; // first element is always 0 for exclusive scan + + // Compute the rest of the elements + for (int i = 1; i < n; ++i) { + indices[i] = indices[i - 1] + bools[i - 1]; // exclusive scan + } + + // Compute total count of non-zero elements + int count = indices[n - 1] + bools[n - 1]; + + // Scatter - Write all non-zero elements to odata at computed indices + for (int i = 0; i < n; ++i) { + + // If the element is non-zero, write it to the output array at the scanned index + if (bools[i] == 1) { + odata[indices[i]] = idata[i]; // scatter + } + } + + delete[] bools; // free temporary boolean array + delete[] indices; // free temporary indices array + timer().endCpuTimer(); - return -1; + return count; // return number of non-zero elements + //return -1; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346ee..23703d4b 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -12,13 +12,92 @@ namespace StreamCompaction { return timer; } + // __global__ kernels and device code here + + // Up-Sweep (reduce) kernel + __global__ void kernUpSweep(int n, int step, int* data) { + int index = (threadIdx.x + blockIdx.x * blockDim.x) * step + (step - 1); // get the index of the last element in the current segment + + // make sure we don't go out of bounds + if (index < n) { + data[index] += data[index - step / 2]; // add the value of the left child to the parent + } + } + + // Down-Sweep kernel + __global__ void kernDownSweep(int n, int step, int* data) { + + int index = (threadIdx.x + blockIdx.x * blockDim.x) * step + (step - 1); // get the index of the last element in the current segment + + // make sure we don't go out of bounds + if (index < n) { + int left = index - step / 2; // get the index of the left child + int t = data[left]; // store the left child's value + data[left] = data[index]; // set the left child's value to the parent's value + data[index] += t; // set the parent's value to the sum of the left child's value and the parent's value + } + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + // TODO + // Edge case: if n <= 0, return + if (n <= 0) { + return; + } + + // Allocate space rounded up to next power of two + int pow2 = 1 << ilog2ceil(n); // next power of 2 + int* dev_data = nullptr; // device array + cudaMalloc(&dev_data, pow2 * sizeof(int)); // allocate device memory + checkCUDAError("cudaMalloc dev_data for efficient scan"); // check for errors + + // Copy input data to device and pad the rest with 0s + cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice); // copy input data to device + + // Pad the rest with 0s + if (pow2 > n) { + cudaMemset(dev_data + n, 0, (pow2 - n) * sizeof(int)); // set the rest to 0 + } + checkCUDAError("cudaMemcpy H2D and padding for efficient scan"); // check for errors + timer().startGpuTimer(); // TODO + + const int blockSize = 128; //128 // number of threads per block + + // Up-sweep (reduce) phase: build sum in place + for (int step = 2; step <= pow2; step *= 2) { + int threads = pow2 / step; // number of add operations at this level + int fullBlocks = (threads + blockSize - 1) / blockSize; // number of blocks needed + kernUpSweep << > > (pow2, step, dev_data); // launch kernel + checkCUDAError("kernUpSweep kernel"); // check for errors + } + + // Set the last element (total sum) to 0 for exclusive scan + cudaMemset(dev_data + (pow2 - 1), 0, sizeof(int)); // set the last element to 0 + checkCUDAError("cudaMemset root (exclusive scan)"); // check for errors + + // Down-sweep phase: distribute the prefix sums + for (int step = pow2; step >= 2; step /= 2) { + + // launch kernel + int threads = pow2 / step; // number of add operations at this level + int fullBlocks = (threads + blockSize - 1) / blockSize; // number of blocks needed + kernDownSweep << > > (pow2, step, dev_data); // launch kernel + checkCUDAError("kernDownSweep kernel"); // check for errors + } + timer().endGpuTimer(); + + // Read back the scanned result (first n elements) + cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost); // copy result back to host + checkCUDAError("cudaMemcpy D2H for efficient scan result"); // check for errors + + cudaFree(dev_data); // free device memory } /** @@ -33,8 +112,420 @@ namespace StreamCompaction { int compact(int n, int *odata, const int *idata) { timer().startGpuTimer(); // TODO + + // Edge case: if n <= 0, return 0 + if (n <= 0) { + timer().endGpuTimer(); + return 0; + } + + // Device memory allocation + int* dev_idata = nullptr; // device input array + int* dev_bools = nullptr; // device boolean array + int* dev_indices = nullptr; // device indices array + int* dev_odata = nullptr; // device output array + + cudaMalloc(&dev_idata, n * sizeof(int)); // allocate device memory for input + cudaMalloc(&dev_bools, n * sizeof(int)); // allocate device memory for boolean array + cudaMalloc(&dev_odata, n * sizeof(int)); // allocate device memory for output + checkCUDAError("cudaMalloc failed for compaction arrays"); + + // We will allocate dev_indices with padding for scan + int pow2 = 1 << ilog2ceil(n); // next power of 2 + cudaMalloc(&dev_indices, pow2 * sizeof(int)); // allocate device memory for indices with padding + checkCUDAError("cudaMalloc failed for indices array"); // check for errors + + // Copy input to device + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); // copy input data to device + checkCUDAError("cudaMemcpy H2D for compaction input"); // check for errors + + // Map input to booleans (1 = keep, 0 = discard) + int blockSize = 128; // number of threads per block + int fullBlocks = (n + blockSize - 1) / blockSize; // number of blocks needed + StreamCompaction::Common::kernMapToBoolean <<>> (n, dev_bools, dev_idata); // launch kernel + checkCUDAError("kernMapToBoolean kernel"); // check for errors + + // Scan on dev_bools -> dev_indices (inclusive of padding) + // Copy bools to indices array (and pad remaining space with 0) + cudaMemcpy(dev_indices, dev_bools, n * sizeof(int), cudaMemcpyDeviceToDevice); // copy bools to indices + + if (pow2 > n) { + cudaMemset(dev_indices + n, 0, (pow2 - n) * sizeof(int)); // pad remaining space with 0 + } + checkCUDAError("copy + pad boolean array for scan"); + + // Up-sweep phase on indices array + for (int step = 2; step <= pow2; step *= 2) { + + int threads = pow2 / step; // number of add operations at this level + fullBlocks = (threads + blockSize - 1) / blockSize; // number of blocks needed + kernUpSweep <<>> (pow2, step, dev_indices); // launch kernel + checkCUDAError("kernUpSweep (compaction) kernel"); + } + + // Set last element to 0 (prepare for exclusive scan) + cudaMemset(dev_indices + (pow2 - 1), 0, sizeof(int)); // set last element to 0 + checkCUDAError("cudaMemset root for compaction scan"); + + // Down-sweep phase + for (int step = pow2; step >= 2; step /= 2) { + + int threads = pow2 / step; // number of add operations at this level + fullBlocks = (threads + blockSize - 1) / blockSize; // number of blocks needed + kernDownSweep <<>> (pow2, step, dev_indices); // launch kernel + checkCUDAError("kernDownSweep (compaction) kernel"); + } + + // Scatter non-zero elements to output array using computed indices + fullBlocks = (n + blockSize - 1) / blockSize; // number of blocks needed + StreamCompaction::Common::kernScatter <<>> ( + n, dev_odata, dev_idata, dev_bools, dev_indices); // launch kernel + checkCUDAError("kernScatter kernel"); + + timer().endGpuTimer(); - return -1; + + // Copy compacted data back to host + cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy D2H for compaction output"); + + // Compute and return count of non-zero elements + int count = 0; + int lastBool, lastIndex; + cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy D2H for compaction count"); + if (n > 0) { + count = lastIndex + lastBool; + } + + cudaFree(dev_idata); + cudaFree(dev_bools); + cudaFree(dev_indices); + cudaFree(dev_odata); + return count; } } + + /////////////////////////////////////////////////////////////////////////////// + //Extra Credit + + + // Radix Sort + namespace Radix { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() { + static PerformanceTimer timer; + return timer; + } + + // Kernel to flip the sign bit of each integer (convert to "unsigned" key by XOR with 0x80000000) + __global__ void kernFlipBits(int n, int* data) { + int i = threadIdx.x + blockIdx.x * blockDim.x; // global index + if (i < n) { + data[i] ^= 0x80000000; // Flip MSB to handle signed values + } + } + + // Kernel to map each element's specific bit to a boolean (0 or 1). + __global__ void kernMapToBit(int n, int bit, int* bools, const int* idata) { + int i = threadIdx.x + blockIdx.x * blockDim.x; // global index + if (i < n) { + int mask = 1 << bit; + // If the bit is set, mark 1; otherwise 0 + bools[i] = (idata[i] & mask) ? 1 : 0; + } + } + + // Kernel to scatter elements into output array based on computed indices for radix sort. + __global__ void kernScatterRadix(int n, const int* idata, int* odata, + const int* bools, const int* indices, int totalFalse) { + int i = threadIdx.x + blockIdx.x * blockDim.x; + if (i < n) { + if (bools[i] == 0) { + // Element has 0 in current bit, write to index = i - number of 1s before i + int pos = i - indices[i]; + odata[pos] = idata[i]; + } + else { + // Element has 1 in current bit, write to index = totalFalse + (number of 1s before i) + int pos = totalFalse + indices[i]; + odata[pos] = idata[i]; + } + } + } + + /** + * Performs radix sort on idata, storing the sorted result into odata. + * Handles signed integers by using a bit flip (XOR 0x80000000) so that negative numbers are sorted correctly. + */ + void sort(int n, int* odata, const int* idata) { + if (n <= 0) { + return; + } + // Device memory allocation + int* dev_in = nullptr; // input array + int* dev_out = nullptr; // output array + int* dev_bools = nullptr; // boolean array for current bit + int* dev_indices = nullptr; // scanned indices array + + cudaMalloc(&dev_in, n * sizeof(int)); // allocate device memory for input + cudaMalloc(&dev_out, n * sizeof(int)); // allocate device memory for output + cudaMalloc(&dev_bools, n * sizeof(int)); // allocate device memory for boolean array + + int pow2Length = 1 << ilog2ceil(n); // next power of 2 for scan + cudaMalloc(&dev_indices, pow2Length * sizeof(int)); // allocate device memory for indices with padding + checkCUDAError("cudaMalloc failed for radix sort arrays"); + + // Copy input data to device + cudaMemcpy(dev_in, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy H2D for radix sort input"); + + // Transform all numbers by flipping MSB (to handle signed ints) + const int blockSize = 128; + int fullBlocks = (n + blockSize - 1) / blockSize; // number of blocks needed + kernFlipBits <<>> (n, dev_in); // flip sign bits + checkCUDAError("kernFlipBits kernel"); + + timer().startGpuTimer(); + // Loop over each bit (0 to 31) + for (int bit = 0; bit < 32; ++bit) { + + // Map each element to a boolean (is bit == 1?) + fullBlocks = (n + blockSize - 1) / blockSize; + kernMapToBit <<> > (n, bit, dev_bools, dev_in); // map to booleans + checkCUDAError("kernMapToBit kernel"); + + // Copy boolean array to dev_indices (pad to next power of two length) + cudaMemcpy(dev_indices, dev_bools, n * sizeof(int), cudaMemcpyDeviceToDevice); // copy bools to indices + if (pow2Length > n) { + cudaMemset(dev_indices + n, 0, (pow2Length - n) * sizeof(int)); + } + checkCUDAError("cudaMemcpy + cudaMemset for boolean array"); + + // Inclusive scan (prefix sum) on dev_indices (booleans array) using Blelloch scan (upsweep and downsweep) + // Upsweep phase + for (int stride = 2; stride <= pow2Length; stride *= 2) { + int threads = pow2Length / stride; + int blocks = (threads + blockSize - 1) / blockSize; + StreamCompaction::Efficient::kernUpSweep <<>> (pow2Length, stride, dev_indices); + checkCUDAError("kernUpSweep (radix) kernel"); + } + // Set last element to 0 (prepare for exclusive scan) + cudaMemset(dev_indices + (pow2Length - 1), 0, sizeof(int)); + checkCUDAError("cudaMemset root for radix scan"); + // Downsweep phase + for (int stride = pow2Length; stride >= 2; stride /= 2) { + int threads = pow2Length / stride; + int blocks = (threads + blockSize - 1) / blockSize; + StreamCompaction::Efficient::kernDownSweep << > > (pow2Length, stride, dev_indices); + checkCUDAError("kernDownSweep (radix) kernel"); + } + + // Calculate total number of 0s (false) for this bit to determine scatter offsets + int lastBool, lastIndex; + cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy D2H for radix scan totals"); + int totalOnes = lastIndex + lastBool; + int totalZeros = n - totalOnes; + + // Scatter: stable partition into dev_out based on bit + fullBlocks = (n + blockSize - 1) / blockSize; + kernScatterRadix <<>> (n, dev_in, dev_out, dev_bools, dev_indices, totalZeros); + checkCUDAError("kernScatterRadix kernel"); + + // Swap dev_in and dev_out for next iteration + int* temp = dev_in; + dev_in = dev_out; + dev_out = temp; + } + timer().endGpuTimer(); + + // After 32 iterations, dev_in now holds the fully sorted values (with MSB flipped). + // Flip bits back to restore original sign bits + fullBlocks = (n + blockSize - 1) / blockSize; + kernFlipBits <<>> (n, dev_in); + checkCUDAError("kernFlipBits kernel (restore)"); + + // Copy sorted data back to host + cudaMemcpy(odata, dev_in, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy D2H for radix sort output"); + + // Free allocated memory + cudaFree(dev_in); + cudaFree(dev_out); + cudaFree(dev_bools); + cudaFree(dev_indices); + } + } + + /////////////////////////////////////////////////////////////////////////////// + namespace Shared { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() { + static PerformanceTimer timer; + return timer; + } + + // Macro for conflict-free indexing offset (to avoid shared memory bank conflicts) + #define LOG_NUM_BANKS 5 // 32 banks + #define CONFLICT_FREE_OFFSET(index) ((index) >> LOG_NUM_BANKS) + + // Kernel to perform scan (exclusive prefix sum) on each block using shared memory (Blelloch algorithm). + // Also computes the sum of each block's elements and stores it in blockSums[blockIdx.x]. + __global__ void kernScanBlock(int n, const int* idata, int* odata, int* blockSums) { + extern __shared__ int temp[]; // shared memory array for scanning (with padding for bank conflicts) + int index = threadIdx.x; + int globalIndex = blockIdx.x * blockDim.x + index; + int blockSize = blockDim.x; + // Load data into shared memory (with conflict-free padding) + if (globalIndex < n) { + temp[index + CONFLICT_FREE_OFFSET(index)] = idata[globalIndex]; + } + else { + temp[index + CONFLICT_FREE_OFFSET(index)] = 0; + } + __syncthreads(); + + int offset = 1; + int paddedLength = blockSize; // we treat each block segment length as blockSize (pad with zeros if needed) + // Up-sweep (reduce) phase + for (offset = 1; offset < paddedLength; offset *= 2) { + int idx = (index + 1) * offset * 2 - 1; + if (idx < paddedLength) { + int ai = idx - offset; + int bi = idx; + // Add value from ai to bi with conflict-free indexing + int offAi = ai + CONFLICT_FREE_OFFSET(ai); + int offBi = bi + CONFLICT_FREE_OFFSET(bi); + temp[offBi] += temp[offAi]; + } + __syncthreads(); + } + + // Save total sum of this block and clear the last element for down-sweep + if (index == 0) { + int lastIdx = paddedLength - 1; + int offLast = lastIdx + CONFLICT_FREE_OFFSET(lastIdx); + blockSums[blockIdx.x] = temp[offLast]; // total sum of block + temp[offLast] = 0; // set root to 0 for exclusive scan + } + __syncthreads(); + + // Down-sweep phase + for (offset = blockSize; offset >= 1; offset /= 2) { + int idx = (index + 1) * offset * 2 - 1; + if (idx < blockSize) { + int ai = idx - offset; + int bi = idx; + int offAi = ai + CONFLICT_FREE_OFFSET(ai); + int offBi = bi + CONFLICT_FREE_OFFSET(bi); + int t = temp[offAi]; + temp[offAi] = temp[offBi]; + temp[offBi] += t; + } + __syncthreads(); + } + + // Write the exclusive scan results for this block to global memory + if (globalIndex < n) { + odata[globalIndex] = temp[index + CONFLICT_FREE_OFFSET(index)]; + } + } + + // Kernel to add block offsets to each element of the scanned blocks. + __global__ void kernAddBlockOffsets(int n, const int* blockOffsets, int* odata) { + int i = threadIdx.x + blockIdx.x * blockDim.x; + if (i < n) { + // Determine which block this index belongs to + int blockId = i / blockDim.x; + if (blockId > 0) { + // Add the sum of all preceding blocks to this element + odata[i] += blockOffsets[blockId]; + } + } + } + + /** + * Performs exclusive prefix-sum (scan) on idata, storing the result into odata. + * Uses a work-efficient approach with shared memory for per-block scans and a second pass to add block sums. + */ + void scan(int n, int* odata, const int* idata) { + if (n <= 0) { + return; + } + // Allocate device memory for input and output + int* dev_in = nullptr; + int* dev_out = nullptr; + int* dev_blockSums = nullptr; + int* dev_blockScan = nullptr; + cudaMalloc(&dev_in, n * sizeof(int)); + cudaMalloc(&dev_out, n * sizeof(int)); + checkCUDAError("cudaMalloc dev_in/dev_out for shared scan"); + // Determine number of blocks needed + const int blockSize = 128; + int numBlocks = (n + blockSize - 1) / blockSize; + cudaMalloc(&dev_blockSums, numBlocks * sizeof(int)); + // Allocate array for scanned block sums (pad to next power of two for scanning) + int pow2Blocks = 1 << ilog2ceil(numBlocks); + cudaMalloc(&dev_blockScan, pow2Blocks * sizeof(int)); + checkCUDAError("cudaMalloc blockSums/blockScan for shared scan"); + + // Copy input to device + cudaMemcpy(dev_in, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy H2D for shared scan input"); + + timer().startGpuTimer(); + // Scan each block's data into dev_out, and write block sums + // Use dynamic shared memory: each block gets (blockSize + blockSize/32) * sizeof(int) bytes + size_t sharedMemSize = (blockSize + (blockSize >> 5)) * sizeof(int); + kernScanBlock <<>> (n, dev_in, dev_out, dev_blockSums); + checkCUDAError("kernScanBlock kernel"); + + // Scan the array of block sums to get the offset for each block + // Copy block sums to padded array for scanning + cudaMemcpy(dev_blockScan, dev_blockSums, numBlocks * sizeof(int), cudaMemcpyDeviceToDevice); + if (pow2Blocks > numBlocks) { + cudaMemset(dev_blockScan + numBlocks, 0, (pow2Blocks - numBlocks) * sizeof(int)); + } + checkCUDAError("cudaMemcpy + cudaMemset for block sums"); + // Upsweep on block sums + for (int stride = 2; stride <= pow2Blocks; stride *= 2) { + int threads = pow2Blocks / stride; + int blocks = (threads + blockSize - 1) / blockSize; + StreamCompaction::Efficient::kernUpSweep <<>> (pow2Blocks, stride, dev_blockScan); + checkCUDAError("kernUpSweep (blockSums) kernel"); + } + // Set last element to 0 for exclusive scan + cudaMemset(dev_blockScan + (pow2Blocks - 1), 0, sizeof(int)); + checkCUDAError("cudaMemset root for block sums scan"); + // Downsweep on block sums + for (int stride = pow2Blocks; stride >= 2; stride /= 2) { + int threads = pow2Blocks / stride; + int blocks = (threads + blockSize - 1) / blockSize; + StreamCompaction::Efficient::kernDownSweep <<>> (pow2Blocks, stride, dev_blockScan); + checkCUDAError("kernDownSweep (blockSums) kernel"); + } + + // Add block offsets to every element in dev_out to get the final scanned output + int fullBlocks = (n + blockSize - 1) / blockSize; + kernAddBlockOffsets <<>> (n, dev_blockScan, dev_out); + checkCUDAError("kernAddBlockOffsets kernel"); + timer().endGpuTimer(); + + // Copy the result back to host + cudaMemcpy(odata, dev_out, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy D2H for shared scan output"); + + // Free device memory + cudaFree(dev_in); + cudaFree(dev_out); + cudaFree(dev_blockSums); + cudaFree(dev_blockScan); + } + } + + + } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4fe..1178b4ea 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -7,7 +7,16 @@ namespace StreamCompaction { StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); - int compact(int n, int *odata, const int *idata); } + // Extra credit + + namespace Radix { + StreamCompaction::Common::PerformanceTimer& timer(); //noexcept + void sort(int n, int* odata, const int* idata); // Stable sort from least significant bit to most significant bit + } + namespace Shared { + StreamCompaction::Common::PerformanceTimer& timer(); //noexcept + void scan(int n, int* odata, const int* idata); // Shared memory scan kernel + } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 43088769..ab792d10 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -13,13 +13,77 @@ namespace StreamCompaction { } // TODO: __global__ + // Performs one step of the scan, with the given offset + __global__ void kernScanStep(int n, int offset, const int* idata, int* odata) { + + int k = threadIdx.x + blockIdx.x * blockDim.x; // map from threadIdx to array index + + if (k >= n) return; // check array bounds + + if (k >= offset) { + // Add the element from index [k - offset] (previous partial sum) + odata[k] = idata[k] + idata[k - offset]; + } + else { + // Copy the element as is (no element to add from behind) + odata[k] = idata[k]; + } + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + //TODO + // Allocate device memory + if (n <= 0) { + return; // no elements to scan + } + + int* dev_in = nullptr; // input array + int* dev_out = nullptr; // output array + cudaMalloc(&dev_in, n * sizeof(int)); // allocate device memory + cudaMalloc(&dev_out, n * sizeof(int)); // allocate device memory + checkCUDAError("cudaMalloc failed for scan buffers"); // check for errors + + // Exclusive scan initial state: dev_in[0] = 0, dev_in[1..n-1] = idata[0..n-2] + cudaMemset(dev_in, 0, sizeof(int)); // set first element to 0 + if (n > 1) { + cudaMemcpy(dev_in + 1, idata, (n - 1) * sizeof(int), cudaMemcpyHostToDevice); // copy rest of elements + } + checkCUDAError("cudaMemcpy H2D for input data"); // check for errors + + timer().startGpuTimer(); // TODO + + // Iterative up-sweep (Hillis-Steele): offset = 1, 2, 4, ... < n + const int blockSize = 128; //128 // number of threads per block + + // Ping-pong buffers: in each iteration, read from dev_in, write to dev_out, then swap + for (int offset = 1; offset < n; offset *= 2) { + + // Launch kernel + int fullBlocks = (n + blockSize - 1) / blockSize; + kernScanStep << > > (n, offset, dev_in, dev_out); // launch kernel + checkCUDAError("kernScanStep kernel"); // check for errors + + cudaDeviceSynchronize(); // wait for kernel to finish, for timing purposes only + // Swap the ping-pong buffers for next iteration + int* temp = dev_in; + dev_in = dev_out; + dev_out = temp; + } + timer().endGpuTimer(); + + // At this point, dev_in holds the scanned output (because of final swap) + cudaMemcpy(odata, dev_in, n * sizeof(int), cudaMemcpyDeviceToHost); // copy result back to host + checkCUDAError("cudaMemcpy D2H for scan result"); // check for errors + + cudaFree(dev_in); // free device memory + cudaFree(dev_out); // free device memory } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e7..7ab62036 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,34 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + // TODO use `thrust::exclusive_scan` + if (n <= 0) { + return; + } + + // Allocate device_vectors and copy input data + thrust::device_vector dv_in(n); // device_vector manages memory and automatically cleans up + thrust::device_vector dv_out(n); // device_vector manages memory and automatically cleans up + cudaMemcpy(thrust::raw_pointer_cast(dv_in.data()), idata, + n * sizeof(int), cudaMemcpyHostToDevice); // copy input array to device + checkCUDAError("cudaMemcpy H2D for thrust scan input"); + + timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + // Use Thrust library's exclusive_scan on the device vector + thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + timer().endGpuTimer(); + + // Copy result back to host + cudaMemcpy(odata, thrust::raw_pointer_cast(dv_out.data()), + n * sizeof(int), cudaMemcpyDeviceToHost); // copy output array to host + checkCUDAError("cudaMemcpy D2H for thrust scan output"); } } }