118 changes: 112 additions & 6 deletions README.md
@@ -3,12 +3,118 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Zhenzhong Tang
* [LinkedIn](https://www.linkedin.com/in/zhenzhong-anthony-tang-82334a210), [Instagram](https://instagram.com/toytag12), [personal website](https://toytag.net/)
* Tested on: Windows 11 Pro 22H2, AMD EPYC 7V12 64-Core Processor (4 vCPU cores) @ 2.44GHz 28GiB, Tesla T4 16GiB (Azure)

## Implementations

We have five different implementations to compare:
- CPU Scan: iterates through the array sequentially, adding the previous input element to a running prefix sum.
- GPU Naive Scan: in each of the log n passes, every thread adds the element at a given offset to its own, with the offset doubling each pass.
- GPU Work Efficient Scan from Lecture: up-sweep (parallel reduction) followed by down-sweep.
- GPU Work Efficient Scan from GPU Gems 3 (a sketch of the core idea follows this list):
  - Rearranged threads to prevent **warp partitioning**.
  - Implemented **shared memory** acceleration.
  - Optimized shared memory access to avoid **bank conflicts**.
  - Recursive calls to handle arbitrary array sizes without rounding up to the next power of 2.
- Thrust Scan: NVIDIA's optimized library implementation.
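
Below is a minimal sketch of the shared-memory trick described above, in the style of GPU Gems 3 chapter 39. The kernel name `kernScanBlock` and the macro names are illustrative rather than the exact identifiers in this repo, and the down-sweep and multi-block bookkeeping are omitted:

```
// Shared memory has 32 banks; pad every 32nd element so the strided
// accesses of the sweep hit distinct banks instead of conflicting.
#define LOG_NUM_BANKS 5
#define CONFLICT_FREE_OFFSET(i) ((i) >> LOG_NUM_BANKS)

__global__ void kernScanBlock(int *data) {
    extern __shared__ int temp[];  // 2 * blockDim.x elements plus padding
    int tid = threadIdx.x;

    // Each thread loads two elements into padded shared memory
    // (per-block global offset omitted; this sketch scans one block).
    int ai = tid, bi = tid + blockDim.x;
    temp[ai + CONFLICT_FREE_OFFSET(ai)] = data[ai];
    temp[bi + CONFLICT_FREE_OFFSET(bi)] = data[bi];

    // Up-sweep: build partial sums in place. The `tid < active` test
    // keeps working threads packed into contiguous warps, which avoids
    // warp partitioning.
    for (int stride = 1, active = blockDim.x; active > 0;
         stride <<= 1, active >>= 1) {
        __syncthreads();
        if (tid < active) {
            int a = stride * (2 * tid + 1) - 1;
            int b = stride * (2 * tid + 2) - 1;
            temp[b + CONFLICT_FREE_OFFSET(b)] += temp[a + CONFLICT_FREE_OFFSET(a)];
        }
    }
    // ... down-sweep and write-back omitted for brevity.
}
```

Because of the padding, the caller has to allocate slightly more than `2 * blockDim.x` ints of dynamic shared memory per block.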

### Sample Output
Tested with `int[2^29]` and a CUDA block size of 128.

```
****************
** SCAN TESTS **
****************
[ 39 29 12 48 27 43 42 11 9 8 5 1 5 ... 11 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1370.71ms (std::chrono Measured)
[ 0 39 68 80 128 155 198 240 251 260 268 273 274 ... 263761477 263761488 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1369.71ms (std::chrono Measured)
[ 0 39 68 80 128 155 198 240 251 260 268 273 274 ... 263761350 263761388 ]
passed
==== naive scan, power-of-two ====
elapsed time: 508.908ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 506.161ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 48.7929ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 48.7834ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 17.1458ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 16.7649ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 0 0 0 1 3 1 3 3 1 3 0 1 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 1345.11ms (std::chrono Measured)
[ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 2 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 1300.07ms (std::chrono Measured)
[ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 1 2 ]
passed
==== cpu compact with scan ====
elapsed time: 2031.72ms (std::chrono Measured)
[ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 2 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 183.325ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 180.469ms (CUDA Measured)
passed
```


## Performance Analysis
Tests were done with a CUDA block size of 128 unless otherwise specified, and we test with non-power-of-two array sizes to represent more general use cases.

### Array Size

First, though not shown in the graph, with up to around a few thousand elements the CPU sequential scan is actually faster than the GPU implementations, because the overhead of copying data to the GPU and back is large compared to the actual computation time. Once the array is large enough, all the GPU implementations are faster than the CPU sequential scan.

![](img/Runtime%20Comparison%20for%20CPU%20and%20GPU%20Scan.svg)

Now let's look at the graph. With over 1 million elements, the CPU sequential scan starts to show its inefficiency. The naive GPU implementation is actually a bit faster than the work-efficient version from lecture. The optimized work-efficient scan based on GPU Gems 3 is much faster than any of the previous methods, and Thrust, NVIDIA's optimized implementation, is the fastest of all.

![](img/Runtime%20Comparison%20for%20CPU%20and%20GPU%20Scan%20Bar.svg)

In conclusion, using the Thrust implementation as a baseline, the work-efficient scan from lecture is around 10x to 20x slower, while the optimized work-efficient scan from Gems 3 is only around 1.5x to 3x slower than Thrust, a big leap.

### Block Size

The influence of block size on performance is generally tied to the hardware and usually does not have a significant effect on the outcome.

In our case, however, especially for the Gems 3 work-efficient scan, block size has a relatively large impact on performance. The optimum is around 128: the runtime there is about half that at block size 16 and about two thirds that at block size 1024. The reason is that the Gems 3 scan runs a full up-sweep and down-sweep over one block's worth of data per kernel call. If the block size is too small, the kernel must recurse many more times to cover the array; if it is too large, many threads sit idle near the top of the up-sweep and the bottom of the down-sweep, waiting for the rest of the block to finish, which wastes resources.
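
As a back-of-the-envelope check (the helper below is hypothetical, not code from this repo): each launch scans `2 * blockSize` elements per block and then recursively scans the per-block sums, so the number of kernel levels shrinks as the block size grows:

```
#include <cstdio>

// Count how many recursive kernel levels are needed to scan n elements
// when each block handles 2 * blockSize elements.
int recursionDepth(long long n, int blockSize) {
    int levels = 0;
    long long elemsPerBlock = 2LL * blockSize;
    while (n > 1) {
        n = (n + elemsPerBlock - 1) / elemsPerBlock;  // per-block sums form the next level
        levels++;
    }
    return levels;
}

int main() {
    long long n = 1LL << 28;
    int sizes[] = {16, 128, 1024};
    for (int b : sizes)
        printf("block size %4d -> %d kernel levels\n", b, recursionDepth(n, b));
    // prints 6, 4, and 3 levels for n = 2^28
}
```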

![](img/Runtime%20Comparison%20for%20GPU%20Scan%20with%20different%20CUDA%20Block%20Size.svg)

All in all, the three GPU scan implementations follow the same trend, showing that a block size around 128 is the optimal point.

### Why is My GPU Approach So Slow?

This is true for relatively small arrays and for the implementations from the lecture slides. Copying to and from the GPU, synchronizing threads, and other overheads are large compared to the actual computation time. With only a few thousand elements, the CPU can even keep most of the data in cache and finish the computation faster.

However, when the array size grows well beyond 1 million elements, the CPU starts to fall short: all GPU implementations are faster than the CPU sequential scan. Thrust absolutely dominates the game, and our optimized work-efficient scan from Gems 3 is only around 1.5x to 3x slower than Thrust, which is not bad.

So, fortunately, we did not run into this issue. :D

### Trace of the Thrust Scan

![](img/trace-thrust.png)
![](img/trace-gems3.png)

We see that the Thrust kernel call is very short, while our kernel takes much longer. Thrust uses static shared memory and more shared memory in total, as well as more registers per thread. All of these factors contribute to the performance difference.
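
To reproduce this comparison, one option is to query the CUDA runtime for per-kernel resource usage, as in the sketch below (`kernScanBlock` is a stand-in for whichever kernel you want to inspect); compiling with `nvcc -Xptxas -v` prints similar numbers at build time:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernScanBlock(int *data) { /* scan body elided */ }

int main() {
    // cudaFuncGetAttributes reports the compiled kernel's resource usage.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernScanBlock);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}
```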
1 change: 1 addition & 0 deletions img/Runtime Comparison for CPU and GPU Scan Bar.svg
1 change: 1 addition & 0 deletions img/Runtime Comparison for CPU and GPU Scan.svg
1 change: 1 addition & 0 deletions img/Runtime Comparison for GPU Scan with different CUDA Block Size.svg
Binary file added img/trace-gems3.png
Binary file added img/trace-thrust.png
3 changes: 1 addition & 2 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -71,7 +71,6 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
10 changes: 10 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,10 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// flag each element: 1 if it survives compaction (nonzero), 0 otherwise
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) return;

bools[idx] = idata[idx] == 0 ? 0 : 1;
}

/**
@@ -33,6 +37,12 @@
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// scatter: write each surviving element to its scanned output index
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) return;

if (bools[idx] == 1) {
odata[indices[idx]] = idata[idx];
}
}

}
34 changes: 30 additions & 4 deletions stream_compaction/cpu.cu
@@ -5,6 +5,7 @@

namespace StreamCompaction {
namespace CPU {
bool disableScanTimer = false;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
Expand All @@ -18,9 +19,13 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
if (!disableScanTimer) timer().startCpuTimer();
// exclusive scan: odata[i] holds the sum of idata[0..i-1]
odata[0] = 0;
for (int i = 1; i < n; i++) {
    odata[i] = odata[i - 1] + idata[i - 1];
}
if (!disableScanTimer) timer().endCpuTimer();
}

/**
Expand All @@ -29,10 +34,16 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
int count = 0;
timer().startCpuTimer();
// copy every nonzero element forward, counting as we go
for (int i = 0; i < n; i++) {
    if (idata[i] != 0) {
        odata[count++] = idata[i];
    }
}
timer().endCpuTimer();
return count;
}

/**
@@ -41,10 +52,25 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
int* tmp = new int[n];
disableScanTimer = true;
timer().startCpuTimer();
// map to booleans, exclusive-scan them, then scatter
for (int i = 0; i < n; i++) {
    tmp[i] = idata[i] == 0 ? 0 : 1;
}
scan(n, odata, tmp);
// compute the count before scattering: if every element survived,
// the scatter below would overwrite odata[n - 1]
int count = odata[n - 1] + tmp[n - 1];
for (int i = 0; i < n; i++) {
    if (tmp[i] != 0) {
        // odata[i] <= i for an exclusive scan, so this write never
        // clobbers a scan value that is still to be read
        odata[odata[i]] = idata[i];
    }
}
timer().endCpuTimer();
disableScanTimer = false;
delete[] tmp;
return count;
}
}
}