102 changes: 96 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,12 +3,102 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Xiaoxiao Zou
* [LinkedIn](https://www.linkedin.com/in/xiaoxiao-zou-23482a1b9/)
* Tested on: Windows 11, AMD Ryzen 9 7940HS @ 4.00 GHz, RTX 4060 Laptop

### (TODO: Your README)
### Implementations
I implemented the basic CPU scan & compact, GPU naive scan, GPU work-efficient scan & compact, and Thrust scan. Beyond those, I also implemented two optimized variants: the GPU work-efficient scan upgrade, and the GPU work-efficient scan upgrade with shared memory.

The four basic implementations follow the algorithms from the course slides.

<b>GPU work-efficient scan:</b> benchmarked with all modulo and multiply operations converted to bitwise operations (this gives a fair amount of speedup).

<b>GPU work-efficient scan upgrade:</b> I calculate the actual number of blocks needed for each round of the up-sweep and down-sweep, reducing the number of blocks (and therefore total threads) launched each time. This gives up to a <b>5x speedup</b>.

<b>GPU work-efficient scan upgrade with shared memory:</b> I use shared memory to perform a block-wise scan within each block, then scan the array of per-block increments, and finally add each increment back to its block. For the scan over the increments array I made a design choice: I use the work-efficient scan upgrade method rather than recursing with the shared-memory version. Applying the shared-memory scan to the increments would require recursive passes over successively smaller increments arrays (I tried this by appending each new increments array to the old one), but the extra additions from the new increments back into the old arrays actually slowed things down. Using the plain work-efficient scan upgrade for this step turned out to be good enough. Overall this gives up to a <b>16x speedup</b>.

Block-size limitation: with the shared-memory method, my block size is limited to 64 (starting at 128, I believe there is a memory conflict within each block, which results in errors). For the other methods, block size does not influence performance much from 32 upward (as in Project 1, a block size that is too small hurts performance).

### Performance Analysis
![](img/p1.png)
![](img/p2.png)
![](img/p3.png)

The first thing I noticed is that my CPU is far stronger than I expected. Only at an array size of 2^24 does it begin to slow down, and past 2^28 it is no longer capable of this much arithmetic.

On the GPU side, performance generally starts to degrade at 2^20; Thrust only slows down at 2^28. (I suspect 2^28 is the practical bottleneck: at 2^29, 50 ms implies 20 fps, and that counts only the scan itself, not the memory operations.) My basic work-efficient method is not efficient at all, but the upgraded version is a solid optimization over the naive scan, and the shared-memory upgrade improves further on it, especially at 2^28.

Some potential optimizations: comparing against Thrust, I noticed my shared-memory method underutilizes threads. The upgrade method can simply avoid launching unneeded threads, but in the shared-memory version, although I launch only blockSize/2 threads per block, during the sweeps only a fraction of those threads are active at any time. I don't know a smarter way to use the idle ones (perhaps each thread could perform multiple additions at once, e.g. handling two or three levels of the down-sweep together). Another possible optimization: the buffer swap is not strictly necessary if the indices for each step are computed carefully.

#### Output for arraysize=2^26
```

****************
** SCAN TESTS **
****************
[ 2 10 43 45 10 38 5 10 13 25 24 17 9 ... 33 0 ]
==== cpu scan, power-of-two ====
elapsed time: 30.0245ms (std::chrono Measured)
[ 0 2 12 55 100 110 148 153 163 176 201 225 242 ... 1643506275 1643506308 ]
==== cpu scan, non-power-of-two ====
elapsed time: 41.5261ms (std::chrono Measured)
[ 0 2 12 55 100 110 148 153 163 176 201 225 242 ... 1643506220 1643506227 ]
passed
==== naive scan, power-of-two ====
elapsed time: 85.8092ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 82.2282ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 135.767ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 130.076ms (CUDA Measured)
passed
==== work-efficient scan upgrade, power-of-two ====
elapsed time: 31.5261ms (CUDA Measured)
passed
==== work-efficient scan upgrade, non-power-of-two ====
elapsed time: 31.4493ms (CUDA Measured)
passed
==== work-efficient scan upgrade with SM, power-of-two ====
elapsed time: 11.6919ms (CUDA Measured)
passed
==== work-efficient scan upgrade with SM, non-power-of-two ====
elapsed time: 12.0757ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 5.33914ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 5.62893ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 0 3 2 2 3 3 3 0 3 1 1 0 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 139.038ms (std::chrono Measured)
[ 3 3 2 2 3 3 3 3 1 1 2 1 2 ... 1 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 129.736ms (std::chrono Measured)
[ 3 3 2 2 3 3 3 3 1 1 2 1 2 ... 3 1 ]
passed
==== cpu compact with scan ====
elapsed time: 318.162ms (std::chrono Measured)
[ 3 3 2 2 3 3 3 3 1 1 2 1 2 ... 1 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 42.9237ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 43.177ms (CUDA Measured)
passed
Press any key to continue . . .
```

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

Binary file added img/p1.png
Binary file added img/p2.png
Binary file added img/p3.png
32 changes: 30 additions & 2 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 29; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -69,13 +69,41 @@ int main(int argc, char* argv[]) {

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
StreamCompaction::Efficient::oldscan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::oldscan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan upgrade, power-of-two");
StreamCompaction::Efficient::scanupgrade(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan upgrade, non-power-of-two");
StreamCompaction::Efficient::scanupgrade(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan upgrade with SM, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan upgrade with SM, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
17 changes: 17 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,16 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// Map each element to a boolean: 1 if it should be kept (nonzero), 0 otherwise.
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
bools[index] = (idata[index] != 0) ? 1 : 0;
}

/**
@@ -33,6 +43,13 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// Scatter: write each kept element (bools[index] == 1) to its scanned output index.
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}
if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}
1 change: 1 addition & 0 deletions stream_compaction/common.h
@@ -13,6 +13,7 @@
#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)

#define blockSize 64
/**
* Check for CUDA errors; print and exit if there was a problem.
*/
41 changes: 39 additions & 2 deletions stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// Exclusive prefix sum: odata[i] = sum of idata[0 .. i-1].
odata[0] = 0;
for (int i = 0; i < n-1; i++) {
odata[i+1] = idata[i]+odata[i];
}
timer().endCpuTimer();
}

@@ -31,8 +35,14 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// Copy only nonzero elements forward; j counts how many survive.
int j = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[j++] = idata[i];
}
}
timer().endCpuTimer();
return -1;
return j;
}

/**
@@ -43,8 +53,35 @@
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// Compact via map -> scan -> scatter, mirroring the GPU pipeline.

int* temp=(int*)malloc(n * sizeof(int));
int* temp2 = (int*)malloc(n * sizeof(int));

//mapping
for (int i = 0; i < n; i++) {
if (idata[i] == 0) {
temp[i] = 0;
}
else {
temp[i] = 1;
}
}
//scan
temp2[0] = 0;
for (int i = 0; i < n - 1; i++) {
temp2[i + 1] = temp[i] + temp2[i];
}
//scatter
for (int i = 0; i < n; i++) {
if (temp[i]==1) {
odata[temp2[i]] = idata[i];
}
}
int cnt = temp2[n - 1];
free(temp);
free(temp2);
timer().endCpuTimer();
return -1;
return cnt;
}
}
}