diff --git a/README.md b/README.md index 0e38ddb1..ec61675b 100644 --- a/README.md +++ b/README.md @@ -1,14 +1,274 @@ -CUDA Stream Compaction -====================== +# **Project 2 - CUDA Stream Compaction** +## **University of Pennsylvania, CIS 5650: GPU Programming and Architecture** -**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** +* Jefferson Koumba Moussadji Lu + * [LinkedIn](https://www.linkedin.com/in/-jeff-koumba-0b356721b/) +* Tested on: Personal Laptop, Windows 11 Home, Intel(R) Core(TM) i9-14900HX @ 2.22GHz @ 24 Cores @ 32GB RAM, Nvidia GeForce RTX 4090 @ 16 GB @ SM 8.9 -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +
+
+
Stream Compaction
+ -### (TODO: Your README) +## Overview -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +Stream compaction is a parallel algorithm for filtering out elements from an array that do not meet a given condition, while preserving the order of the remaining elements. In other words, given an input array and a predicate (e.g. "is non-zero"), stream compaction produces a new output array containing only the elements that satisfy the predicate, in their original order. This operation is important in GPU programs to reduce unnecessary processing and memory transfer of unwanted data. It is widely used in applications like particle filtering, path tracing (to compact out terminated rays), collision detection, and sparse matrix operations. +By compacting data on the GPU, we minimize bandwidth usage when copying results back to the CPU and avoid serial processing of irrelevant elements. +A prefix-sum scan is the backbone of efficient stream compaction. A scan computes an array of prefix sums from an input array. For example, an inclusive prefix sum with addition on input ```[1 ,3, 5, 9]``` would produce ```[1, 4, 9, 18]``` (each output is the sum of all inputs up to that index). + +In this project, we focus on the exclusive scan, where the output is shifted right by one, and the first element is the identity (0 for sum). The scan is fundamental because it can quickly produce indices for writing compacted output in parallel. + +### Summary of features: + +We implemented multiple versions of scan and stream compaction on both CPU and GPU and compared their performance. The features include: + +- CPU Scan and CPU Stream Compaction (both a direct without-scan method and a with-scan method using prefix sum). + +- Naive Parallel Scan (GPU) - a work-inefficient but simple CUDA implementation. + +- Work-Efficient Parallel Scan (GPU) - an optimized scan using the Blelloch algorithm (up-sweep/down-sweep), and GPU stream compaction based on this scan. + +- Thrust Library Scan (GPU) - using CUDA Thrust’s exclusive scan for comparison. + +- Shared Memory & Radix sort: An optimized scan with shared memory for better performance, and a Parallel Radix Sort (stable, scan-based) that supports sorting signed 32-bit integers using our scan primitives. + +## Features Implemented + +### CPU Scan & Stream Compaction (CPU) + +We implemented a straightforward CPU exclusive scan and two versions of CPU stream compaction to verify correctness and provide baseline performance. The CPU scan simply iterates through the array accumulating a running sum, which runs in O(N) time. +For compaction, we developed: +- ```compactWithoutScan``` (CPU): a simple loop that checks each element and writes it to the output array if it meets the condition (non-zero, in our case). This approach filters values in one pass. + +- ```compactWithScan``` (CPU): a two-pass method that first builds a boolean mask array (1 for elements to keep, 0 for discard), then performs a prefix-sum scan on this mask to compute output indices, and finally scatters the input values to their computed positions. This method uses the scan operation to achieve compaction. + +Both CPU compaction methods produce identical results, but the direct ```compactWithoutScan``` is faster on the CPU since it does minimal work. Using a full scan on the CPU adds overhead that isn’t worthwhile in serial contexts. In our tests, the simple loop compaction outpaced the scan-based approach on CPU (as expected), reflecting that the scan method’s benefit lies in parallelization rather than efficiency on a single core. These CPU implementations served as a correctness reference for the GPU versions. + +### Naive Parallel Scan (GPU) + +
+
+
This image is an implementation of an inclusive scan. To convert this to an exclusive scan, we can simply shift the inclusive scan results one position to the right
+ +The naive GPU scan is an exclusive prefix sum implemented using a parallel version of the classic Hillis & Steele algorithm (also known as the Kogge-Stone scan). This approach is called “naive” because it is not work-efficient: it performs O(N log N) operations for an array of length N. + +The algorithm runs in $\lceil \log_2 N \rceil$ iterations; in the d-th iteration, each thread reads two values separated by $2^{d-1}$ and adds them, writing the partial sum. By iteration’s end, the array is transformed into an inclusive scan. We convert it to an exclusive scan by shifting the results right and inserting the identity (0) at the start. + +Our implementation launches enough threads to cover the array and uses GPU global memory for all reads/writes at each step. This simple approach keeps all threads busy each iteration (no branching based on index) and is relatively easy to implement. However, it does significantly more work than necessary since each element’s value is involved in many repeated additions across iterations. The naive scan also issues many global memory transactions due to repeatedly reading and writing the array on each iteration. We took care to handle non-power-of-two (NPOT) array sizes by checking bounds before accessing out-of-range indices on each iteration. The result is a correct exclusive scan that’s easy to verify but not optimal in terms of GPU work efficiency. + +### Work-Efficient Scan & Compaction (GPU) + +For a faster parallel scan, we implemented the work-efficient Blelloch scan algorithm. This approach performs only O(N) operations by using a balanced binary tree method in two phases: an up-sweep (reduce) and a down-sweep. In the up-sweep, the kernel recursively reduces the array: pairs of elements are summed to form partial sums at increasing stride distances (1, 2, 4, ...), until the last element contains the total sum. In the down-sweep phase, we first set the last element to 0 (this ensures an exclusive scan result), then we propagate the partial sums downward: each step, each thread uses the value it computed in up-sweep to set a prefix sum for two children elements. After downsweep completes, the array has been transformed into an exclusive prefix sum of the input values. + +
+
+
Up Sweep phase
+ +
+
+
Down Sweep Phase
+ +Our Blelloch scan is implemented with multiple kernel launches: each sweep level (for both up and down phases) is a CUDA kernel invocation that performs the operation at a given stride. This was simpler to implement and debug. We again handle NPOT sizes by ignoring threads/work on indices beyond the array length. The result is an accurate exclusive scan that does minimal total work compared to the naive version. However, we observed that work-efficient does not always mean faster in practice. The overhead of launching multiple kernels and the fact that many threads do nothing in certain steps (causing warp divergence) can hurt performance. For example, at each up-sweep level $d$, half of the threads become idle (those at indices not divisible by $2^d$), leading to many inactive threads that still occupy GPU resources. Despite these practical considerations, the work-efficient scan should scale better for very large N, since it has asymptotically less work than the naive scan. + +Using our GPU scan, we implemented GPU stream compaction in a work-efficient manner. The compaction consists of three steps on the device: + +1) Map: Generate a boolean mask array where each entry is 1 if the corresponding input element should be kept (e.g. is non-zero), or 0 if it should be removed. + +2) Scan: Run an exclusive scan on this mask array to produce an array of indices (or offsets). Each index tells us the target position of the corresponding element in the compacted output. The last value of the scan (plus the last mask value) also gives the total count of surviving elements. + +3) Scatter: Launch a kernel that, for each index i where the mask is 1, reads the input value and writes it to the output at index ```maskScan[i]```. Threads for which the mask is 0 do no write. This effectively packs all desired elements to the front of the output array without gaps. + +All three steps are O(N) on the GPU and involve mostly memory-bound operations (aside from the additions in the scan). Our implementation preserves the order of input elements. We verified that the GPU compaction output matches exactly with the CPU output and that the count of surviving elements is correct for both power-of-two and non-power-of-two array sizes. The GPU compaction uses the parallel scan to efficiently compute scatter locations, which is much faster than any serial method once N is large. + +### Thrust Scan (GPU) + +To compare our custom implementations against an industry-grade optimized solution, we also incorporated the Thrust library’s scan. Thrust is NVIDIA’s parallel algorithms library, and we used ```thrust::exclusive_scan``` on device arrays for benchmarking. Thrust’s implementation is highly optimized. It efficiently handles arbitrary sizes and is tuned for the GPU architecture. For example, Thrust’s scan uses shared memory and other optimizations internally. We used the Thrust scan results to validate our own scans and to serve as a performance baseline (“best case” scenario) for prefix sum on our hardware. + +### Optimized Shared-Memory Scan (GPU) + +We further optimized the work-efficient scan by using CUDA shared memory to reduce global memory traffic. In the original Blelloch scan, each partial sum operation reads and writes to the global array. Our optimized version instead loads data into each thread block’s shared memory and performs the up-sweep and down-sweep steps locally in fast on-chip memory. Each block computes a scan for its portion of the array. We then handle the inter-block sums by performing a scan of each block’s total (this uses a separate small array of block sums). Finally, we distribute the offsets to each block and do a second pass to add the block offset to each element within the block. This multi-kernel approach ensures most of the heavy-lifting is done with low-latency memory. The result is a significant speedup: by reducing expensive global memory accesses and using memory coalescing, the shared-memory scan approached the performance of Thrust’s highly optimized scan in our tests. The functionality and results of this optimized scan remained the same as our original work-efficient scan, but it highlights how careful memory layout and access patterns can drastically improve performance. + +### Parallel Radix Sort (GPU) + +For an additional challenge, we implemented a parallel radix sort on the GPU using our scan as a subroutine. The radix sort is stable and supports sorting of signed 32-bit integers (both positive and negative). Our approach is a least-significant-bit (LSB) first radix sort (also known as an LSD radix sort) that processes one bit at a time using a scatter operation based on prefix sums. For each bit position from 0 to 31, we: + +1) Compute bit mask: Generate an array of 0/1 flags indicating whether each number’s current bit is 0 or 1. For sorting in ascending order, we treat 0-bit as “keep” (these will go to the front) and 1-bit as “remove” in a sense. Alternatively, one can partition by counting 0s vs 1s. + +2) Scan the mask: We perform an exclusive prefix sum on the inverted mask (e.g. treat 1s as 0s and 0s as 1s) to get the destination indices for elements with the current bit = 0. This gives us the positions where the 0-bit elements should go in the output for this pass. + +3) Scatter pass: We scatter elements to a temporary output array: elements with 0 in the current bit are placed at the beginning (in the order determined by the scanned indices), and elements with 1 in that bit are placed after all the 0s. Because we use the scan results, this scatter is stable. It preserves the relative order of elements with the same bit value across passes, which is crucial for correctness in radix sort. + +4) Swap arrays: The output of this pass becomes the input for the next bit pass (we alternate buffers). + +We repeat the above for all 32 bits of the integer. After the final bit, the array is sorted in ascending order. To handle negative numbers correctly, we adopted a common trick: interpret the 32-bit integers in two’s complement form such that all negative values appear “smaller” than positives. In practice, this can be achieved by offsetting values (e.g., adding $2^{31}$ to each number to make them all non-negative, sorting, then subtracting it back) or by modifying the final pass to ensure the sign bit is handled appropriately. Our implementation ensures that all negative numbers end up at the front of the sorted array (still in ascending order among themselves), followed by all positive numbers. The sorting is stable by construction, which means if two elements have equal value, their original order is preserved (important if this sort were to be extended for more complex keys). The complexity is O(32 * N) = O(N), since we perform a fixed 32 passes of scan/scatter regardless of input size. Each pass is parallel and largely memory-bandwidth bound. While 32 passes is a lot of work, this method can outperform a comparison-based sort for large N because it has better parallel scalability. We compared our GPU radix sort against ```std::sort``` on the CPU to evaluate performance. + +To call Radix Sort: +- Look into the file efficient.cu +``` +// Example call (device pointers): +// n = number of elements; d_in, d_out are int* on device memory +StreamCompaction::Efficient::Radix::sort(n, d_out, d_in); +// After return, d_out holds the stable, ascending result.s +``` + +## Performance Analysis + +
+
+
Execution time (log-scale) vs. input size for various scan (prefix-sum) implementations and baselines, using a fixed block size of 128 threads
+ +| Array Size | CPU Scan | Naive Scan | WE Scan | Thrust Scan | CPU Compact (w/o scan) | CPU Compact (with scan) | WE Compact | Shared Scan | Radix Sort | +| ---------: | --------: | ---------: | -------: | ----------: | ---------------------: | ----------------------: | ---------: | ----------: | ---------: | +| 2^8 | 0.000450 | 0.135120 | 0.159744 | 0.126464 | 0.000500 | 0.004400 | 0.508832 | 0.104448 | 9.245700 | +| 2^12 | 0.005700 | 0.160800 | 0.194560 | 0.044192 | 0.006100 | 0.024900 | 0.387104 | 0.144384 | 5.800960 | +| 2^16 | 0.096250 | 0.194736 | 0.257072 | 0.042288 | 0.104350 | 0.285600 | 0.502672 | 0.194624 | 6.959100 | +| 2^20 | 1.492150 | 0.549088 | 0.354976 | 0.300816 | 1.614800 | 4.622800 | 5.415200 | 0.241312 | 11.718700 | +| 2^24 | 23.081900 | 8.591905 | 2.714130 | 1.075105 | 30.581250 | 85.811400 | 17.569300 | 0.828192 | 107.652000 | + +We can notice that GPU-based scans (Naive, Work-Efficient, Shared-memory, and Thrust) scale linearly and run much faster than the CPU scan, especially as the array size grows. + +
+
+
Execution time of stream compaction algorithms vs. CUDA block size (threads per block) for a fixed 1,048,576-element array
+ +| Block Size | CPU Scan | Naive Scan | WE Scan | Thrust Scan | CPU Compact (w/o scan) | CPU Compact (with scan) | WE Compact | Shared Scan | Radix Sort | +| ---------: | -------: | ---------: | -------: | ----------: | ---------------------: | ----------------------: | ---------: | ----------: | ---------: | +| 64 | 1.687300 | 0.485184 | 0.427136 | 0.332160 | 1.858200 | 4.837900 | 1.614050 | 0.494176 | 10.397700 | +| 128 | 1.492150 | 0.549088 | 0.354976 | 0.300816 | 1.614800 | 4.622800 | 5.415200 | 0.241312 | 11.718700 | +| 256 | 1.600250 | 0.380368 | 0.333600 | 0.337344 | 1.717300 | 4.884100 | 1.571085 | 0.217696 | 10.311700 | +| 512 | 1.462850 | 0.406608 | 0.375488 | 0.311328 | 1.638950 | 4.688700 | 1.589280 | 0.463008 | 19.420200 | +| 1024 | 1.481200 | 0.412144 | 0.675872 | 0.484688 | 1.739750 | 4.773300 | 1.562370 | 0.264000 | 10.693600 | + +We notice the GPU Work-Efficient compaction is shown in green, and its performance improves dramatically with block size up to ~256 threads. CPU baselines (horizontal lines) are much slower (higher runtime) and unaffected by block size. Lower runtime is better (log-log scale). + +### Scan Performance + + +#### GPU vs CPU Speedup +All GPU scan implementations significantly outperform the CPU sequential scan for large inputs. By the largest tested size (~16 million elements), the fastest GPU scan finishes in only a few milliseconds, whereas the CPU takes on the order of 10^1-10^2 milliseconds. This is an over 10× speedup (one to two orders of magnitude). In fact, even at more moderate sizes (e.g. around 1 million elements), the GPU already provides a ~10× overall sorting/scanning speedup over the CPU. The crossover point at which the GPU pulls ahead is on the order of only tens of thousands of elements. Beyond that, GPU parallelism dominates. + +#### Naive vs. Work-Efficient Algorithms +The Naive parallel scan algorithm performs asymptotically more work (O(n log n) operations) than the Work-Efficient scan (O(n) operations). This extra factor of log n has a noticeable impact on performance at large n: in our data, the work-efficient scan begins to outperform the naive version markedly as n grows. By the 16M input case, the work-efficient GPU scan runs roughly 2-3× faster than the naive scan (which does redundant work) because it avoids the substantial overhead of those extra log n steps. In short, minimizing algorithmic work pays off, as predicted by complexity analysis. + +#### Impact of Shared Memory & Thrust + +Using shared memory for intra-block scanning boosts performance over a naive global-memory scan. Our custom Shared Memory scan shows better throughput than the purely naive approach, thanks to fewer global memory accesses. The fully optimized Work-Efficient scan (based on the Blelloch upsweep/downsweep) is even faster, nearly matching the performance of NVIDIA’s highly-optimized ```thrust::inclusive_scan```. In fact, the Thrust library scan and our work-efficient implementation achieve similar timings at large sizes indicating that our implementation is close to state-of-the-art efficiency. + +By leveraging parallelism, the GPU scan implementations run orders of magnitude faster than a single-threaded CPU scan for substantial array sizes. The work-efficient algorithm, which achieves O(n) work complexity, overtakes the work-inefficient naive scan for large n, illustrating the importance of algorithmic efficiency on GPUs. Optimizations like shared memory usage further improve performance. In practice, all GPU variants handle millions of elements with ease (on the order of milliseconds), whereas the CPU time grows into tens or hundreds of milliseconds. The results demonstrate that GPUs are extremely well-suited to prefix-sum computations, achieving high throughput as problem size increases. + +#### Effect of CUDA Block Size + +We found that scan performance is relatively insensitive to block size in a broad range, with a slight sweet spot around medium block sizes (256-512 threads). Very small blocks (e.g. 128 threads) underutilize the GPU, leading to a bit more overhead per element, while extremely large blocks (1024 threads) can reduce occupancy or increase serialization, also slightly hurting throughput. For example, in tests at 1 million elements, using 256 threads/block yielded near-best performance, whereas 128 threads/block was a bit slower (~10-20% slower for some scan kernels) and 1024 threads showed a minor slowdown as well. Overall, as long as the block size is reasonably large, the scan kernels maintain high throughput; the performance stays close to linear scaling with input size (see figure above) regardless of block configuration. + +### Compaction Performance + +#### GPU vs CPU Compaction + +The work-efficient GPU compaction algorithm vastly outperforms the CPU-based approaches. For a ~1 million element array, the GPU compaction completes in only around 0.3-0.5 ms (at optimal block size), whereas the CPU implementations take on the order of 2-5 ms for the same task, roughly an order of magnitude slower. At the maximum tested size (~16 million elements, not shown in the block-size figure but in the size-scaling results), this gap widens further: the GPU finishes in only a few milliseconds, while the CPU requires tens of milliseconds (20× or more slower for 16M). The parallel speedup here is similar to what we observed for scanning, since GPU compaction essentially leverages a parallel scan internally to mark positions. + +#### Single-pass vs Two-pass CPU + +Between the two CPU implementations, compaction without scan (single-pass filtering) was slightly faster than compaction with scan. This makes sense since the two-pass method performs an extra full scan plus an additional memory pass to scatter elements, which in a single-threaded context just adds overhead without any parallel gain. Our results confirmed only a small difference (both CPU methods are slow), but the version avoiding the extra scan has an edge, especially as n grows. In other words, doing everything in one pass (branching on each element and writing out survivors) is more cache-friendly and incurs fewer total operations on the CPU than doing a separate prefix sum then copy. + +#### GPU Work-Efficient Compact + +Our GPU compaction uses an efficient parallel scan to identify non-zero (or valid) elements and then scatters them in parallel. Despite doing multiple steps (scan + scatter), it still runs orders of magnitude faster than the CPU because each step is massively parallel. The work-efficient GPU approach has linear work complexity overall, similar to the GPU scan. As a result, its runtime scales linearly with n and remains only slightly above the pure scan time. For instance, at 1M elements the work-efficient compaction only takes marginally longer than a scan (owing to the additional write step). This shows that the overhead of compaction beyond scanning is minimal on GPU, whereas on CPU that overhead cannot be hidden. + +#### Block Size Tuning + +The performance of the GPU compaction kernel does depend on block size more noticeably than the scans. As shown in the figure, at a very small block size (128 threads) the work-efficient compaction was about 10× slower for 1M elements compared to using 256 threads. With only 128-thread blocks, the kernel likely launched many more blocks and incurred extra coordination overhead (and possibly did not fully occupy the GPU’s SMs). Increasing to 256 threads/block improved utilization dramatically, yielding the best performance in our tests. Beyond 256, the performance remained high: 512 and 1024 threads per block gave similar results, with maybe a slight dip at 512 in our data (possibly due to some memory or occupancy trade-off). In summary, ≥256 threads per block is optimal for the compaction kernel on our GPU. In contrast, the CPU times (horizontal lines in the figure) are flat since they do not involve GPU block sizes at all, underlining that their performance is consistently poor relative to the GPU, regardless of configuration. + +The work-efficient GPU stream compaction exhibits excellent performance, completing in time comparable to a single scan and massively outperforming the CPU versions. For large arrays, the GPU achieves an order-of-magnitude (or more) speedup over the CPU for compaction. The single-pass CPU method is marginally faster than the two-pass (with scan) method, but both are far slower than the GPU. We also see that proper choice of block size is important for GPU compaction. Too small blocks can hurt performance, but with a reasonable block size (256+ threads), the GPU algorithm runs at full efficiency. These results highlight that parallel prefix-sum-based compaction on GPU not only is correct for arbitrary input sizes, but also delivers a huge performance advantage, leveraging throughput that the CPU cannot match on this problem size. + +### Radix Sort vs ```std::sort``` + +#### Crossover and Scaling + +We implemented an integer Radix Sort on the GPU (stable, least-significant-digit approach) and compared it to the C++ ```std::sort``` on the CPU. As expected, there is a crossover point beyond which the GPU outperforms the CPU. For small arrays (e.g. a few thousand elements), the CPU’s ```std::sort``` can be as fast or faster, because the GPU incurs overhead (kernel launch and, if included, data transfer) that dominates at tiny problem sizes. However, once the input size grows into the tens of thousands, the GPU gains the advantage. In our tests, around ~10k–100k elements the GPU sort begins to pull ahead. By 1 million elements, the GPU’s Radix Sort was roughly an order of magnitude faster than single-threaded ```std::sort```. This matches general observations in literature: for example, using Thrust on a GPU yields about a 10× speedup at 1M elements, and the gap tends to increase with larger n. + +#### Throughput at Large n + +The benefit of GPU sorting becomes even more pronounced at very large input sizes. The Radix Sort algorithm has linear time complexity O(n * k) (where k is a small factor related to number of digit passes), and it maps extremely well to parallel hardware. In contrast, ```std::sort``` (introspective comparison sort) has O(n log n) complexity and runs on a single thread by default. As n grows, the cost per element for ```std::sort``` increases (log n factor), whereas for Radix Sort the cost per element can actually decrease slightly due to caching efficiencies in linear passes. Consequently, the speedup of GPU Radix over CPU sort grows with problem size. NVIDIA reports that using Thrust (which employs a GPU Radix Sort under the hood) can deliver anywhere from 5× up to 100× faster sorting performance than a typical STL sort on CPU, depending on the input size and data type. Indeed, for extremely large arrays, the GPU can be two orders of magnitude faster: one user reported sorting 2^27 (~134 million) 32-bit values on a GPU ~100× faster than a 4.5 GHz CPU doing ```std::sort```. Our own tests didn’t reach that size, but at the upper end we did test (several million elements), the GPU Radix Sort consistently maintained a wide lead over ```std::sort``` (likely tens of times faster at 16M elements, extrapolating our results). + +#### Parallelism vs Optimization + +It’s worth noting that ```std::sort``` is highly optimized for general-purpose CPU usage, but in our comparison it was executed single-threaded (the default). Multi-threaded sorting on a CPU (e.g. using OpenMP or TBB parallel sort) would improve CPU performance and raise the crossover point, but leveraging more CPU cores only narrows the gap, rather than eliminating it. The GPU is specifically strong in sorting huge volumes of data in parallel due to its thousands of cores and high memory bandwidth. In other words, sorting is a data-parallel task where the GPU’s architecture shines. Our Radix Sort takes advantage of this by distributing work across many threads and using efficient memory access patterns (e.g. scatter/gather with coalescing). Meanwhile, the CPU’s comparison sort must perform more work (log n comparisons per element on average) and is limited by sequential memory access and branch mispredictions. The outcome is clear in our measurements: for large arrays, the GPU Radix Sort runs in only a few milliseconds, whereas ```std::sort``` takes tens or hundreds of milliseconds. + +GPU-based Radix Sort outperforms the C++ ```std::sort``` by a large margin for non-trivial input sizes. While ```std::sort``` is very fast for small datasets and benefits from years of optimization, it cannot keep up with the sheer parallel throughput of the GPU on large data sets. Once past a certain input size (on the order of 10^4-10^5 elements), the GPU gains a performance lead, and by millions of elements the difference is dramatic (an order of magnitude or more in favor of the GPU). These findings reinforce the advantage of linear-time, parallel sorting methods on modern GPUs: they can handle massive datasets significantly faster than the default CPU sorting routine, illustrating the power of GPU acceleration for high-volume data processing. The Radix Sort’s performance scaling is such that the larger the input, the more it outpaces the CPU. A trend that is consistent with theoretical expectations and other reports in the field. The result is a compelling win for GPU computing in the context of large-scale sorting. + +## Sample Output from Final Tests + +This is the final output running the test cases with: +- Array Power-Of-Two Size = $ 2^{24} $ +- Array Non-Power-Of-Two Size = $ 2^{24} - 3 $ +- Block Size = 128 + + +``` +**************** +** SCAN TESTS ** +**************** + [ 24 31 8 5 35 2 12 35 24 2 26 42 21 ... 12 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 24.4372ms (std::chrono Measured) + [ 0 24 55 63 68 103 105 117 152 176 178 204 246 ... 411033596 411033608 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 24.253ms (std::chrono Measured) + [ 0 24 55 63 68 103 105 117 152 176 178 204 246 ... 411033547 411033572 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 8.45494ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 8.3873ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 3.18029ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 2.86925ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 1.13389ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.760352ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 3 0 1 3 2 0 1 2 0 0 0 1 ... 2 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 27.7898ms (std::chrono Measured) + [ 2 3 1 3 2 1 2 1 2 3 1 1 1 ... 3 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 27.2077ms (std::chrono Measured) + [ 2 3 1 3 2 1 2 1 2 3 1 1 1 ... 1 1 ] + passed +==== cpu compact with scan ==== + elapsed time: 113.672ms (std::chrono Measured) + [ 2 3 1 3 2 1 2 1 2 3 1 1 1 ... 3 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 15.8828ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 15.6678ms (CUDA Measured) + passed + +***************************** +** EXTRA CREDIT TESTS ** +***************************** + [ 24 -19 -42 -45 -15 -48 -38 35 24 2 26 42 -29 ... 12 36 ] +==== radix sort ==== + elapsed time: 92.1836ms (CUDA Measured) + passed + [ 27 30 4 0 49 28 24 8 36 4 28 36 9 ... 39 14 ] + elapsed time: 42.7475ms (std::chrono Measured) +==== shared scan ==== + elapsed time: 0.852576ms (CUDA Measured) + passed +``` diff --git a/img/Stream_compaction.png b/img/Stream_compaction.png new file mode 100644 index 00000000..3ed5cca8 Binary files /dev/null and b/img/Stream_compaction.png differ diff --git a/img/UpSweep.jpg b/img/UpSweep.jpg new file mode 100644 index 00000000..22c4da0f Binary files /dev/null and b/img/UpSweep.jpg differ diff --git a/img/impact_by_array_size.png b/img/impact_by_array_size.png new file mode 100644 index 00000000..afd2dd84 Binary files /dev/null and b/img/impact_by_array_size.png differ diff --git a/img/impact_by_block_size.png b/img/impact_by_block_size.png new file mode 100644 index 00000000..208d63ae Binary files /dev/null and b/img/impact_by_block_size.png differ diff --git a/src/main.cpp b/src/main.cpp index 3d5c8820..790b16c7 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,9 @@ #include