118 changes: 112 additions & 6 deletions README.md
@@ -3,12 +3,118 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* Zhenzhong Tang
* [LinkedIn](https://www.linkedin.com/in/zhenzhong-anthony-tang-82334a210), [Instagram](https://instagram.com/toytag12), [personal website](https://toytag.net/)
* Tested on: Windows 11 Pro 22H2, AMD EPYC 7V12 64-Core Processor (4 vCPU cores) @ 2.44GHz 28GiB, Tesla T4 16GiB (Azure)

## Implementations

We have five different implementations to compare:
- CPU Scan: iterates through the array sequentially, adding the previous input element to a running prefix sum.
- GPU Naive Scan: in each of the log n passes, every thread adds the element at a given offset to its own, with the offset doubling each pass.
- GPU Work Efficient Scan from Lecture: up-sweep (parallel reduction) followed by down-sweep.
- GPU Work Efficient Scan from GPU Gems 3 (a sketch of the core idea follows this list):
  - Rearranged threads to prevent **warp partitioning**.
  - Implemented **shared memory** acceleration.
  - Optimized shared memory access to avoid **bank conflicts**.
  - Recursive calls to handle arbitrary array sizes without rounding up to the next power of 2.
- Thrust Scan: NVIDIA's optimized library implementation.
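
Below is a minimal sketch of the shared-memory trick described above, in the style of GPU Gems 3 chapter 39. The kernel name `kernScanBlock` and the macro names are illustrative rather than the exact identifiers in this repo, and the down-sweep and multi-block bookkeeping are omitted:

```
// Shared memory has 32 banks; pad every 32nd element so the strided
// accesses of the sweep hit distinct banks instead of conflicting.
#define LOG_NUM_BANKS 5
#define CONFLICT_FREE_OFFSET(i) ((i) >> LOG_NUM_BANKS)

__global__ void kernScanBlock(int *data) {
    extern __shared__ int temp[];  // 2 * blockDim.x elements plus padding
    int tid = threadIdx.x;

    // Each thread loads two elements into padded shared memory
    // (per-block global offset omitted; this sketch scans one block).
    int ai = tid, bi = tid + blockDim.x;
    temp[ai + CONFLICT_FREE_OFFSET(ai)] = data[ai];
    temp[bi + CONFLICT_FREE_OFFSET(bi)] = data[bi];

    // Up-sweep: build partial sums in place. The `tid < active` test
    // keeps working threads packed into contiguous warps, which avoids
    // warp partitioning.
    for (int stride = 1, active = blockDim.x; active > 0;
         stride <<= 1, active >>= 1) {
        __syncthreads();
        if (tid < active) {
            int a = stride * (2 * tid + 1) - 1;
            int b = stride * (2 * tid + 2) - 1;
            temp[b + CONFLICT_FREE_OFFSET(b)] += temp[a + CONFLICT_FREE_OFFSET(a)];
        }
    }
    // ... down-sweep and write-back omitted for brevity.
}
```

Because of the padding, the caller has to allocate slightly more than `2 * blockDim.x` ints of dynamic shared memory per block.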

### Sample Output
Tested with `int[2^29]` and a CUDA block size of 128.

```
****************
** SCAN TESTS **
****************
[ 39 29 12 48 27 43 42 11 9 8 5 1 5 ... 11 0 ]
==== cpu scan, power-of-two ====
elapsed time: 1370.71ms (std::chrono Measured)
[ 0 39 68 80 128 155 198 240 251 260 268 273 274 ... 263761477 263761488 ]
==== cpu scan, non-power-of-two ====
elapsed time: 1369.71ms (std::chrono Measured)
[ 0 39 68 80 128 155 198 240 251 260 268 273 274 ... 263761350 263761388 ]
passed
==== naive scan, power-of-two ====
elapsed time: 508.908ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 506.161ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 48.7929ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 48.7834ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 17.1458ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 16.7649ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 3 0 0 0 1 3 1 3 3 1 3 0 1 ... 0 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 1345.11ms (std::chrono Measured)
[ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 2 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 1300.07ms (std::chrono Measured)
[ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 1 2 ]
passed
==== cpu compact with scan ====
elapsed time: 2031.72ms (std::chrono Measured)
[ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 2 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 183.325ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 180.469ms (CUDA Measured)
passed
```


## Performance Analysis
Tests were done with a CUDA block size of 128 unless otherwise specified, and we test with non-power-of-two array sizes to represent more general use cases.

### Array Size

First, though not shown in the graph, with up to around a few thousand elements the CPU sequential scan is actually faster than the GPU implementations, because the overhead of copying data to the GPU and back is large compared to the actual computation time. Once the array is large enough, all the GPU implementations are faster than the CPU sequential scan.

![](img/Runtime%20Comparison%20for%20CPU%20and%20GPU%20Scan.svg)

Now let's look at the graph. With over 1 million elements, the CPU sequential scan starts to show its inefficiency. The naive GPU implementation is actually a bit faster than the work-efficient version from lecture. The optimized work-efficient scan based on GPU Gems 3 is much faster than any of the previous methods, and Thrust, NVIDIA's optimized implementation, is the fastest of all.

![](img/Runtime%20Comparison%20for%20CPU%20and%20GPU%20Scan%20Bar.svg)

In conclusion, using the Thrust implementation as a baseline, the work-efficient scan from lecture is around 10x to 20x slower, while the optimized work-efficient scan from Gems 3 is only around 1.5x to 3x slower than Thrust, a big leap.

### Block Size

The influence of block size on performance is generally tied to the hardware and usually does not have a significant effect on the outcome.

In our case, however, especially for the Gems 3 work-efficient scan, block size has a relatively large impact on performance. The optimum is around 128: the runtime there is about half that at block size 16 and about two thirds that at block size 1024. The reason is that the Gems 3 scan runs a full up-sweep and down-sweep over one block's worth of data per kernel call. If the block size is too small, the kernel must recurse many more times to cover the array; if it is too large, many threads sit idle near the top of the up-sweep and the bottom of the down-sweep, waiting for the rest of the block to finish, which wastes resources.
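
As a back-of-the-envelope check (the helper below is hypothetical, not code from this repo): each launch scans `2 * blockSize` elements per block and then recursively scans the per-block sums, so the number of kernel levels shrinks as the block size grows:

```
#include <cstdio>

// Count how many recursive kernel levels are needed to scan n elements
// when each block handles 2 * blockSize elements.
int recursionDepth(long long n, int blockSize) {
    int levels = 0;
    long long elemsPerBlock = 2LL * blockSize;
    while (n > 1) {
        n = (n + elemsPerBlock - 1) / elemsPerBlock;  // per-block sums form the next level
        levels++;
    }
    return levels;
}

int main() {
    long long n = 1LL << 28;
    int sizes[] = {16, 128, 1024};
    for (int b : sizes)
        printf("block size %4d -> %d kernel levels\n", b, recursionDepth(n, b));
    // prints 6, 4, and 3 levels for n = 2^28
}
```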

![](img/Runtime%20Comparison%20for%20GPU%20Scan%20with%20different%20CUDA%20Block%20Size.svg)

All in all, the three GPU scan implementations follow the same trend, showing that a block size around 128 is the optimal point.

### Why is My GPU Approach So Slow?

This is true for relatively small arrays and for the implementations from the lecture slides. Copying to and from the GPU, synchronizing threads, and other overheads are large compared to the actual computation time. With only a few thousand elements, the CPU can even keep most of the data in cache and finish the computation faster.

However, when the array size grows well beyond 1 million elements, the CPU starts to fall short: all GPU implementations are faster than the CPU sequential scan. Thrust absolutely dominates the game, and our optimized work-efficient scan from Gems 3 is only around 1.5x to 3x slower than Thrust, which is not bad.

So, fortunately, we did not run into this issue. :D

### Trace of the Thrust Scan

![](img/trace-thrust.png)
![](img/trace-gems3.png)

We see that the Thrust kernel call is very short, while our kernel takes much longer. Thrust uses static shared memory and more shared memory in total, as well as more registers per thread. All of these factors contribute to the performance difference.
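
To reproduce this comparison, one option is to query the CUDA runtime for per-kernel resource usage, as in the sketch below (`kernScanBlock` is a stand-in for whichever kernel you want to inspect); compiling with `nvcc -Xptxas -v` prints similar numbers at build time:

```
#include <cstdio>
#include <cuda_runtime.h>

__global__ void kernScanBlock(int *data) { /* scan body elided */ }

int main() {
    // cudaFuncGetAttributes reports the compiled kernel's resource usage.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, kernScanBlock);
    printf("registers per thread: %d\n", attr.numRegs);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block: %d\n", attr.maxThreadsPerBlock);
}
```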
1 change: 1 addition & 0 deletions img/Runtime Comparison for CPU and GPU Scan Bar.svg
1 change: 1 addition & 0 deletions img/Runtime Comparison for CPU and GPU Scan.svg
1 change: 1 addition & 0 deletions img/Runtime Comparison for GPU Scan with different CUDA Block Size.svg
Binary file added img/trace-gems3.png
Binary file added img/trace-thrust.png
3 changes: 1 addition & 2 deletions src/main.cpp
@@ -13,7 +13,7 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -71,7 +71,6 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
10 changes: 10 additions & 0 deletions stream_compaction/common.cu
@@ -24,6 +24,10 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// flag each element: 1 if it survives compaction (nonzero), 0 otherwise
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) return;

bools[idx] = idata[idx] == 0 ? 0 : 1;
}

/**
@@ -33,6 +37,12 @@
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// scatter: write each surviving element to its scanned output index
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) return;

if (bools[idx] == 1) {
odata[indices[idx]] = idata[idx];
}
}

}
34 changes: 30 additions & 4 deletions stream_compaction/cpu.cu
@@ -5,6 +5,7 @@

namespace StreamCompaction {
namespace CPU {
bool disableScanTimer = false;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
Expand All @@ -18,9 +19,13 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
if (!disableScanTimer) timer().startCpuTimer();
// exclusive scan: odata[i] holds the sum of idata[0..i-1]
odata[0] = 0;
for (int i = 1; i < n; i++) {
    odata[i] = odata[i - 1] + idata[i - 1];
}
if (!disableScanTimer) timer().endCpuTimer();
}

/**
Expand All @@ -29,10 +34,16 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
int count = 0;
timer().startCpuTimer();
// copy every nonzero element forward, counting as we go
for (int i = 0; i < n; i++) {
    if (idata[i] != 0) {
        odata[count++] = idata[i];
    }
}
timer().endCpuTimer();
return count;
}

/**
@@ -41,10 +52,25 @@
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
int* tmp = new int[n];
disableScanTimer = true;
timer().startCpuTimer();
// map to booleans, exclusive-scan them, then scatter
for (int i = 0; i < n; i++) {
    tmp[i] = idata[i] == 0 ? 0 : 1;
}
scan(n, odata, tmp);
// compute the count before scattering: if every element survived,
// the scatter below would overwrite odata[n - 1]
int count = odata[n - 1] + tmp[n - 1];
for (int i = 0; i < n; i++) {
    if (tmp[i] != 0) {
        // odata[i] <= i for an exclusive scan, so this write never
        // clobbers a scan value that is still to be read
        odata[odata[i]] = idata[i];
    }
}
timer().endCpuTimer();
disableScanTimer = false;
delete[] tmp;
return count;
}
}
}