104 changes: 98 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,12 +3,104 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Zhiyu Lei
* [LinkedIn](https://www.linkedin.com/in/zhiyu-lei/), [Github](https://github.com/Zhiyu-Lei)
* Tested on: Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (CETS Virtual Lab)

### (TODO: Your README)
### Project Description
* CPU Scan & Stream Compaction & Quick Sort ([stream_compaction/cpu.cu](stream_compaction/cpu.cu))
* Naive GPU Scan Algorithm ([stream_compaction/naive.cu](stream_compaction/naive.cu))
* Work-Efficient GPU Scan & Stream Compaction ([stream_compaction/efficient.cu](stream_compaction/efficient.cu))
* Using Thrust's Implementation ([stream_compaction/thrust.cu](stream_compaction/thrust.cu))
* Radix Sort ([stream_compaction/radix_sort.cu](stream_compaction/radix_sort.cu))

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Performance Analysis
#### Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.
The following table compares run times (in milliseconds) across block sizes for each implementation, measured by scanning an array of $2^{20}$ elements. Block size does not affect performance dramatically, but 128 appears to be optimal for all three implementations.

block size|naive scan|work-efficient scan|thrust scan
:---:|:---:|:---:|:---:
64|1.6761|3.0861|0.1686
128|1.5749|1.9997|0.1480
256|1.8605|2.1077|0.1639
512|1.6586|2.5638|0.1679

#### Compare all of these GPU Scan implementations to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis).
![](img/README/time-size.png)
With smaller arrays, the CPU scan is faster than the GPU scans; with larger arrays, the GPU scans, especially Thrust's implementation, pull ahead, and the work-efficient scan also overtakes the naive scan. In theory the GPU scan algorithms finish in $O(\log n)$ parallel steps, but the plot does not show any sublinear trend in run time.

#### Write a brief explanation of the phenomena you see here.
Since I implemented both the naive and work-efficient scan algorithms using global memory, the main performance bottleneck is memory I/O: accessing global memory is much more costly than accessing shared memory. As for Thrust's implementation, the Nsight timeline shows full occupancy, so it keeps the GPU's compute resources as busy as possible.

#### Test Program Output
The array size is $2^{20}$ and the values are in the range $[0, 1000)$. Radix sort tests are added.
```
****************
** SCAN TESTS **
****************
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 674 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1.7442ms (std::chrono Measured)
[ 0 559 1456 1787 2027 2938 3712 3973 4332 4803 5726 6181 7151 ... 521313475 521314149 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1.7567ms (std::chrono Measured)
[ 0 559 1456 1787 2027 2938 3712 3973 4332 4803 5726 6181 7151 ... 521311914 521312911 ]
passed
==== naive scan, power-of-two ====
elapsed time: 1.56285ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 1.55731ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.99274ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.99523ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.187808ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.166112ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 674 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.3507ms (std::chrono Measured)
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 356 674 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.4523ms (std::chrono Measured)
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 997 208 ]
passed
==== cpu compact with scan ====
elapsed time: 3.6566ms (std::chrono Measured)
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 356 674 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.19942ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 4.09008ms (CUDA Measured)
passed

**********************
** RADIX SORT TESTS **
**********************
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 674 0 ]
==== cpu sort, power-of-two ====
elapsed time: 50.9862ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 999 999 ]
==== radix sort, power-of-two ====
elapsed time: 74.2602ms (CUDA Measured)
passed
==== cpu sort, non-power-of-two ====
elapsed time: 53.0439ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 999 999 ]
==== radix sort, non-power-of-two ====
elapsed time: 71.4663ms (CUDA Measured)
passed
```
Binary file added img/README/time-size.png
44 changes: 41 additions & 3 deletions src/main.cpp
@@ -11,9 +11,10 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix_sort.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -27,7 +28,7 @@ int main(int argc, char* argv[]) {
printf("** SCAN TESTS **\n");
printf("****************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 1000); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

@@ -102,7 +103,7 @@ int main(int argc, char* argv[]) {

// Compaction tests

genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 1000); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

@@ -147,6 +148,43 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("**********************\n");
printf("** RADIX SORT TESTS **\n");
printf("**********************\n");

// Radix sort tests

genArray(SIZE - 1, a, 1000); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("cpu sort, power-of-two");
StreamCompaction::CPU::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("radix sort, power-of-two");
StreamCompaction::RadixSort::sort(SIZE, c, a);
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, b);
printDesc("cpu sort, non-power-of-two");
StreamCompaction::CPU::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);

zeroArray(SIZE, c);
printDesc("radix sort, non-power-of-two");
StreamCompaction::RadixSort::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix_sort.h"
)

set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix_sort.cu"
)

list(SORT headers)
10 changes: 8 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,10 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index < n) {
bools[index] = idata[index] ? 1 : 0;
}
}

/**
@@ -32,7 +35,10 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index < n && bools[index]) {
odata[indices[index]] = idata[index];
}
}

}
1 change: 1 addition & 0 deletions stream_compaction/common.h
@@ -12,6 +12,7 @@

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
#define blockSize 128

/**
* Check for CUDA errors; print and exit if there was a problem.
45 changes: 40 additions & 5 deletions stream_compaction/cpu.cu
@@ -19,7 +19,10 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -30,9 +33,14 @@
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int idx = 0;
for (int i = 0; i < n; i++) {
if (idata[i]) {
odata[idx++] = idata[i];
}
}
timer().endCpuTimer();
return -1;
return idx;
}

/**
@@ -42,9 +50,36 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int* bools = new int[n];
for (int i = 0; i < n; i++) {
bools[i] = idata[i] ? 1 : 0;
}
int* indices = new int[n];
indices[0] = 0;
for (int i = 1; i < n; i++) {
indices[i] = indices[i - 1] + bools[i - 1];
}
for (int i = 0; i < n; i++) {
if (bools[i]) {
odata[indices[i]] = idata[i];
}
}
int count = bools[n - 1] + indices[n - 1];
delete[] bools;
delete[] indices;
timer().endCpuTimer();
return count;
}

int compare(const void *a, const void *b) {
return (*(int*)a - *(int*)b);
}

void sort(int n, int *odata, const int *idata) {
timer().startCpuTimer();
memcpy(odata, idata, n * sizeof(int));
qsort(odata, n, sizeof(int), compare);
timer().endCpuTimer();
}
}
}
2 changes: 2 additions & 0 deletions stream_compaction/cpu.h
Original file line number Diff line number Diff line change
Expand Up @@ -11,5 +11,7 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata);

int compactWithScan(int n, int *odata, const int *idata);

void sort(int n, int *odata, const int *idata);
}
}