104 changes: 98 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
@@ -3,12 +3,104 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Zhiyu Lei
* [LinkedIn](https://www.linkedin.com/in/zhiyu-lei/), [Github](https://github.com/Zhiyu-Lei)
* Tested on: Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (CETS Virtual Lab)

### (TODO: Your README)
### Project Description
* CPU Scan & Stream Compaction & Quick Sort ([stream_compaction/cpu.cu](stream_compaction/cpu.cu))
* Naive GPU Scan Algorithm ([stream_compaction/naive.cu](stream_compaction/naive.cu))
* Work-Efficient GPU Scan & Stream Compaction ([stream_compaction/efficient.cu](stream_compaction/efficient.cu))
* Using Thrust's Implementation ([stream_compaction/thrust.cu](stream_compaction/thrust.cu))
* Radix Sort ([stream_compaction/radix_sort.cu](stream_compaction/radix_sort.cu))

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Performance Analysis
#### Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.
The following table compares run times (in milliseconds) across block sizes for each implementation, measured by scanning an array of $2^{20}$ elements. Block size does not affect performance dramatically, but 128 appears to be optimal for all three implementations.

block size|naive scan|work-efficient scan|thrust scan
:---:|:---:|:---:|:---:
64|1.6761|3.0861|0.1686
128|1.5749|1.9997|0.1480
256|1.8605|2.1077|0.1639
512|1.6586|2.5638|0.1679

#### Compare all of these GPU Scan implementations to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis).
![](img/README/time-size.png)
With smaller arrays, the CPU scan is faster than the GPU scans; with larger arrays, the GPU scans, especially Thrust's implementation, pull ahead, and the work-efficient scan also overtakes the naive scan. In theory the GPU scan algorithms finish in $O(\log n)$ parallel steps, but the plot does not show any sublinear trend in run time.

#### Write a brief explanation of the phenomena you see here.
Since I implemented both the naive and work-efficient scan algorithms using global memory, the main performance bottleneck is memory I/O: accessing global memory is much more costly than accessing shared memory. As for Thrust's implementation, the Nsight timeline shows full occupancy, so it keeps the GPU's compute resources as busy as possible.

#### Test Program Output
The array size is $2^{20}$ and the values are in the range $[0, 1000)$. Radix sort tests are added.
```
****************
** SCAN TESTS **
****************
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 674 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1.7442ms (std::chrono Measured)
[ 0 559 1456 1787 2027 2938 3712 3973 4332 4803 5726 6181 7151 ... 521313475 521314149 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1.7567ms (std::chrono Measured)
[ 0 559 1456 1787 2027 2938 3712 3973 4332 4803 5726 6181 7151 ... 521311914 521312911 ]
passed
==== naive scan, power-of-two ====
elapsed time: 1.56285ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 1.55731ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 1.99274ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 1.99523ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.187808ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.166112ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 674 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.3507ms (std::chrono Measured)
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 356 674 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.4523ms (std::chrono Measured)
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 997 208 ]
passed
==== cpu compact with scan ====
elapsed time: 3.6566ms (std::chrono Measured)
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 356 674 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 2.19942ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 4.09008ms (CUDA Measured)
passed

**********************
** RADIX SORT TESTS **
**********************
[ 559 897 331 240 911 774 261 359 471 923 455 970 436 ... 674 0 ]
==== cpu sort, power-of-two ====
elapsed time: 50.9862ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 999 999 ]
==== radix sort, power-of-two ====
elapsed time: 74.2602ms (CUDA Measured)
passed
==== cpu sort, non-power-of-two ====
elapsed time: 53.0439ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 999 999 ]
==== radix sort, non-power-of-two ====
elapsed time: 71.4663ms (CUDA Measured)
passed
```
Binary file added img/README/time-size.png
44 changes: 41 additions & 3 deletions src/main.cpp
@@ -11,9 +11,10 @@
#include <stream_compaction/naive.h>
#include <stream_compaction/efficient.h>
#include <stream_compaction/thrust.h>
#include <stream_compaction/radix_sort.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int SIZE = 1 << 20; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -27,7 +28,7 @@ int main(int argc, char* argv[]) {
printf("** SCAN TESTS **\n");
printf("****************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 1000); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

@@ -102,7 +103,7 @@ int main(int argc, char* argv[]) {

// Compaction tests

genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 1000); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

@@ -147,6 +148,43 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

printf("\n");
printf("**********************\n");
printf("** RADIX SORT TESTS **\n");
printf("**********************\n");

// Radix sort tests

genArray(SIZE - 1, a, 1000); // Leave a 0 at the end to test that edge case
a[SIZE - 1] = 0;
printArray(SIZE, a, true);

zeroArray(SIZE, b);
printDesc("cpu sort, power-of-two");
StreamCompaction::CPU::sort(SIZE, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, c);
printDesc("radix sort, power-of-two");
StreamCompaction::RadixSort::sort(SIZE, c, a);
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, b);
printDesc("cpu sort, non-power-of-two");
StreamCompaction::CPU::sort(NPOT, b, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);

zeroArray(SIZE, c);
printDesc("radix sort, non-power-of-two");
StreamCompaction::RadixSort::sort(NPOT, c, a);
printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
2 changes: 2 additions & 0 deletions stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
"radix_sort.h"
)

set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
"radix_sort.cu"
)

list(SORT headers)
10 changes: 8 additions & 2 deletions stream_compaction/common.cu
@@ -23,7 +23,10 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index < n) {
bools[index] = idata[index] ? 1 : 0;
}
}

/**
@@ -32,7 +35,10 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int index = (blockIdx.x * blockDim.x) + threadIdx.x;
if (index < n && bools[index]) {
odata[indices[index]] = idata[index];
}
}

}
1 change: 1 addition & 0 deletions stream_compaction/common.h
@@ -12,6 +12,7 @@

#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
#define blockSize 128

/**
* Check for CUDA errors; print and exit if there was a problem.
45 changes: 40 additions & 5 deletions stream_compaction/cpu.cu
@@ -19,7 +19,10 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];
}
timer().endCpuTimer();
}

@@ -30,9 +33,14 @@
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int idx = 0;
for (int i = 0; i < n; i++) {
if (idata[i]) {
odata[idx++] = idata[i];
}
}
timer().endCpuTimer();
return -1;
return idx;
}

/**
@@ -42,9 +50,36 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int* bools = new int[n];
for (int i = 0; i < n; i++) {
bools[i] = idata[i] ? 1 : 0;
}
int* indices = new int[n];
indices[0] = 0;
for (int i = 1; i < n; i++) {
indices[i] = indices[i - 1] + bools[i - 1];
}
for (int i = 0; i < n; i++) {
if (bools[i]) {
odata[indices[i]] = idata[i];
}
}
int count = bools[n - 1] + indices[n - 1];
delete[] bools;
delete[] indices;
timer().endCpuTimer();
return count;
}

int compare(const void *a, const void *b) {
return (*(int*)a - *(int*)b);
}

void sort(int n, int *odata, const int *idata) {
timer().startCpuTimer();
memcpy(odata, idata, n * sizeof(int));
qsort(odata, n, sizeof(int), compare);
timer().endCpuTimer();
}
}
}
2 changes: 2 additions & 0 deletions stream_compaction/cpu.h
Original file line number Diff line number Diff line change
Expand Up @@ -11,5 +11,7 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata);

int compactWithScan(int n, int *odata, const int *idata);

void sort(int n, int *odata, const int *idata);
}
}