Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
188 changes: 182 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,188 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Marcus Hedlund
* [LinkedIn](https://www.linkedin.com/in/marcushedlund/)
* Tested on: Windows 11, Intel Core Ultra 9 185H @ 2.5 GHz 16GB, NVIDIA GeForce RTX 4070 Laptop GPU 8GB (Personal Computer)

### (TODO: Your README)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
# Overview

In this project I implemented multiple versions of the exclusive scan (prefix sum) algorithm and used them to build stream compaction and radix sort. The implementations range from a baseline CPU version to a work efficient GPU version using shared memory and memory bank conflict avoidance. I also compared my implementations against Thrust's implementation of scan for performance benchmarking.

First, a brief description of each algorithm:
* Scan: Calculate all prefix sums of an array
* Stream Compaction: Given an array, create a new array with all elements that fit a certain criteria
* Radix Sort: Efficiently sort an integer array by processing individual bits

|![scan](img/Scan.png)|![stream compaction](img/StreamCompaction.png)|![radix sort](img/RadixSort.png)|
|:--:|:--:|:--:|

<p align="center"><i>Scan, Stream Compaction, and Radix Sort adapted from CIS 5650 slides</i></p>

# Implementations

### CPU Baselines

The CPU versions of all three algorithms are straightforward:
* Scan: We use a for loop that iterates through the array while maintaining a running sum
* Stream Compaction: We use a two-pointer approach while iterating through the array keeping track of both our current element and where the next desired element should go.
* Sort: We simply use std::sort as a reference to check my GPU implementation of radix sort

### Naive GPU Scan
The naive scan works through the prefix sum problem in a series of kernel launches. In each pass, every thread updates its element by adding the value of an element at a certain offset earlier in the array. By updating the array in place and doubling the offset each pass, by the end of `ilog2(n)` total passes each element will hold the sum of all of the previous values.
|![GPU Naive Scan](img/figure9.png)|
|:--:|
|GPU Naive Scan from CIS 5650 slides|

### Work Efficient GPU Scan
To improve on the naive approach, work efficient scan organizes the array as a balanced binary tree and computes the prefix sums in two phases. In the upsweep phase, threads do a parallel reduction that combines elements to create partial sums going up the tree until the root holds the total sum. The downsweep phase then traverses down the tree, piecing the sums together to compute the correct prefix sum values at each leaf.
![Work Efficient Upsweep](img/figure10.png)|![Work Efficient Downsweep](img/figure11.png)|
|:--:|:--:|
|Work Efficient Upsweep from CIS 5650 slides|Work Efficient Downsweep from CIS 5650 slides|


### Shared Memory GPU Scan
The shared memory GPU scan is based on the work efficient implementation but leverages shared memory over global memory to improve lookup speed. Each thread loads two elements from global memory into shared memory so that each block can process up to 2048 elements effectively. For arrays larger than 2048 elements, we first divide the original array into tiles of length 2048 that are then processed in parallel with the total sums of each tile being stored into a block sums array. This block sums array is then itself scanned using the shared memory implementations and the results are added back to each tile to compute the final prefix sums.

<p align="center">
<img src="img/figure17.png" alt="Shared Memory Scan" width="500"/>
<br>
<i>Shared Memory Scan from GPU Gems 39.2.4</i>
</p>


### Thrust Scan
I also made a simple wrapper around Thrust's scan function to compare my implementations against.

### GPU Stream Compaction
The GPU implementation of stream compaction directly leverages scan. First, the input array is mapped to a boolean array based on which elements should be kept. Next, the boolean array is scanned to produce an array of target indices. Finally, a scatter step writes each desired element into its correct position in the compacted array by using both the boolean and scan outputs.

![GPU Stream Compaction](img/figure12.png)|
|:--:|
|GPU Stream Compaction from CIS 5650 slides|

### GPU Radix Sort
Lastly, I implemented radix sort on the GPU leveraging both scan and scatter. Radix sort works by processing the binary representation of each integer starting from the least significant bit until the most significant. At each iteration, the algorithm partitions the array so that all elements with a zero in the current bit appear before any with a one, while preserving the relative order of elements in each of the partitions.
![CPU Radix Sort](img/figure13.png)|
|:--:|
|CPU Radix Sort from CIS 5650 slides|

On the GPU each iteration is performed by first mapping each element to a boolean array based on the current bit. Then, in several steps including a scan pass we calculate the new positions to scatter each element to. Lastly, we scatter the elements based on the boolean and new position arrays. Repeating this process for each bit gives us the fully sorted array.
|![GPU Radix Sort 1](img/figure16.png)|![GPU Radix Sort 2](img/figure15.png)|
|:--:|:--:|
|GPU Radix Sort scan step from CIS 5650 slides| GPU Radix Sort scatter step from CIS 5650 slides|

# Performance Analysis

### Data Collection

For performance measurement I used `std::chrono` to record CPU timings and CUDA events for GPU timings. All GPU measurements excluded initial and final memory operations such as copying data to and from the host and device for fair comparison. Additionally each data point was collected over ten trials and then averaged, and the complete data can be found in the [data](https://github.com/mhedlund7/Project2-Stream-Compaction/tree/main/data) folder.

### Varying Block Count

![Time vs Block Size](img/figure1.png)

For all the implementations varying the block size did not have a large effect on performance. In general, a block size of 256 threads seemed to work pretty well across all implementations, though the work efficient scan showed slightly better performance at 128 threads and the naive scan was slightly better at 256 threads. Overall though the block size did not seem like a major factor towards increasing the speed of the implementations.

### Varying Array Size

|![Time vs Array Size 1](img/figure2.png)|![Time vs Array Size 2](img/figure3.png)|
|:--:|:--:|
|Various Implementations: Time vs Array Size|Shared Memory vs Global Memory Comparison|

When varying array size though the results were much clearer. The CPU and naive GPU scans performed the worst for large inputs, with the naive GPU scan surprisingly even falling behind the CPU likely because of how much global memory usage it needs each iteration. The work efficient GPU scan greatly improved over both the previous implementations and using shared memory optimized it even further with the largest gains at nearly a ten times speedup compared to the normal work efficient version. Thrust's scan however consistently outperformed all of my results although my shared memory implementation came quite close on smaller array sizes. By profiling with Nsight Compute I was able to look more into these differences:

### Performance Comparisons

|![Naive NSIGHT Compute Output](img/figure5.png)|![Efficient NSIGHT Compute Output](img/figure6.png)|
|:--:|:--:|
|Naive NSIGHT Compute Output|Work Efficient NSIGHT Compute Output|
|![Shared Memory NSIGHT Compute Output](img/figure7.png)|![Thrust NSIGHT Compute Output](img/figure8.png)||
|Shared Memory NSIGHT Compute Output|Thrust NSIGHT Compute Output|

In Nsight Compute each iteration of the naive implementation shows a memory throughput utilization of around 80% while the work efficient version also reaches around 80% early in the upsweep and late in the downsweep, but utilizes less memory when it launches fewer threads higher up in the tree. Even so it is clear that their performances were both primarily memory-bound. By contrast, the shared memory scan reduced memory throughput to around 25% during the tiled scan step, allowing it to have higher computation and greatly improve performance, but it was still restricted at around 84% when adding all of the block sums back to their respective tiles because it isn't leveraging shared memory during that step. Thrust's implementation also had lower memory throughput at around 65% and was able to better balance both its memory access and actual computation.

### Radix Sort Array Size
![Time vs Array Size 3](img/figure4.png)
Lastly, for Radix Sort we can see a clear linear relationship as the size of the array grows, as expected (appears exponential here because of the log-scaled graph). I configured my radix sort to do 16 passes in order to support integers up to 1 million, but this number could be decreased to gain speed, or increased in order to handle larger integers if needed.

### Sample Program Output
I added additional tests for radix sort comparing against the result of std::sort and also tested my shared memory scan implementation. Radix sort can be called through `StreamCompaction::Radix::radix(int n, int *odata, const int *idata)`.
```
****************
** SCAN TESTS **
****************
[ 18 12 23 7 17 14 27 46 43 43 48 42 0 ... 24 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0062ms (std::chrono Measured)
[ 0 18 30 53 60 77 91 118 164 207 250 298 340 ... 100095 100119 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0042ms (std::chrono Measured)
[ 0 18 30 53 60 77 91 118 164 207 250 298 340 ... 100026 100053 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0.3072ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0.34816ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.841728ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.56262ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 1.9159ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.063488ms (CUDA Measured)
passed
==== shared mem work-efficient scan, power-of-two ====
elapsed time: 0.176128ms (CUDA Measured)
[ 0 18 30 53 60 77 91 118 164 207 250 298 340 ... 100095 100119 ]
passed
==== shared mem work-efficient scan, non-power-of-two ====
elapsed time: 0.294912ms (CUDA Measured)
[ 0 18 30 53 60 77 91 118 164 207 250 298 340 ... 100026 100053 ]
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 0 1 3 0 0 1 3 3 3 1 2 2 3 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.014ms (std::chrono Measured)
[ 1 3 1 3 3 3 1 2 2 3 3 2 1 ... 1 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0136ms (std::chrono Measured)
[ 1 3 1 3 3 3 1 2 2 3 3 2 1 ... 2 1 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0253ms (std::chrono Measured)
[ 1 3 1 3 3 3 1 2 2 3 3 2 1 ... 1 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.90112ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.359424ms (CUDA Measured)
passed

*****************************
******** RADIX TESTS ********
*****************************
==== cpu sort, power-of-two ====
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
==== radix, power-of-two ====
elapsed time: 10.2533ms (CUDA Measured)
passed
==== cpu sort, non-power-of-two ====
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
==== radix, non-power-of-two ====
elapsed time: 9.64915ms (CUDA Measured)
passed

```
46 changes: 46 additions & 0 deletions data/array_size_vs_time_data.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
Array Size,CPUCompactWithScan,CPUCompactWithoutuScan,CPUScan,EfficientBankOptimizedSharedMemScan,EfficientCompact,EfficientScan,EfficientSharedMemScan,NaiveScan,RadixSort,ThrustScan
256,0.00312,0.00018,0.0001,0.0139264,0.1664,0.330854,0.0183296,0.042496,0.0299008,0.224973
1024,0.0041,0.00078,0.00024,0.0106496,0.179814,0.420966,0.0118784,0.0838656,0.0769024,0.0295936
8192,0.03728,0.00944,0.00291,0.0196608,0.234394,0.443187,0.022528,0.0986112,0.108851,0.0304128
32768,0.13038,0.05933,0.00854,0.0489184,0.262406,0.532586,0.0482336,0.177744,0.213395,0.0769472
131072,0.49113,0.21795,0.03016,0.0634368,0.362787,0.607254,0.0605696,0.155574,0.168806,0.0520704
1048576,4.091855,2.256526667,0.3241733333,0.111488,0.5056456667,1.014022667,0.137469,0.4068615,0.4617236667,0.146208
4194304,16.8136,8.59004,2.07581,0.353306,1.32138,1.75386,0.422819,1.706,2.1252,0.203578
16777216,68.0669,34.5437,8.79466,0.0025472,8.11009,7.4386,0.0025632,17.7327,15.6761,0.677619
67108864,439.386,151.334,33.1646,0.0027392,34.6541,25.291,0.0029728,79.3013,93.8617,2.72899
,,,,,,,,,,
,,,,,,,,,,
Array Size,CPUScan,NaiveScan,EfficientScan,ThrustScan,,,,,,
256,0.0001,0.042496,0.330854,0.224973,,,,,,
1024,0.00024,0.0838656,0.420966,0.0295936,,,,,,
8192,0.00291,0.0986112,0.443187,0.0304128,,,,,,
32768,0.00854,0.177744,0.532586,0.0769472,,,,,,
131072,0.03016,0.155574,0.607254,0.0520704,,,,,,
1048576,0.3241733333,0.4068615,1.014022667,0.146208,,,,,,
4194304,2.07581,1.706,1.75386,0.203578,,,,,,
16777216,8.79466,17.7327,7.4386,0.677619,,,,,,
67108864,33.1646,79.3013,25.291,2.72899,,,,,,
,,,,,,,,,,
,,,,,,,,,,
Array Size,EfficientScan,EfficientBankOptimizedSharedMemScan,,,,,,,,
256,0.330854,0.0139264,,,,,,,,
1024,0.420966,0.0106496,,,,,,,,
8192,0.443187,0.0196608,,,,,,,,
32768,0.532586,0.0489184,,,,,,,,
131072,0.607254,0.0634368,,,,,,,,
1048576,1.014022667,0.111488,,,,,,,,
,,,,,,,,,,
,,,,,,,,,,
,,,,,,,,,,
,,,,,,,,,,
,,,,,,,,,,
Array Size,RadixSort,,,,,,,,,
256,0.0299008,,,,,,,,,
1024,0.0769024,,,,,,,,,
8192,0.108851,,,,,,,,,
32768,0.213395,,,,,,,,,
131072,0.168806,,,,,,,,,
1048576,0.4617236667,,,,,,,,,
4194304,2.1252,,,,,,,,,
16777216,15.6761,,,,,,,,,
67108864,93.8617,,,,,,,,,
6 changes: 6 additions & 0 deletions data/block_size_vs_time_data.csv
Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
Block Size,EfficientCompact,EfficientScan,NaiveScan,RadixSort
64,0.481744,0.951005,0.547853,0.638733
128,0.490736,0.943344,0.384598,0.373843
256,0.48271,0.978797,0.344522,0.35759
512,0.559203,1.09746,0.340048,0.595139
1024,0.533606,1.0962,0.499978,0.465843
Loading