CIS565-Fall-2023 · Alan-Qiao · Sep 16, 2023 · Sep 18, 2023 · Sep 19, 2023 · Sep 20, 2023
diff --git a/README.md b/README.md
@@ -3,12 +3,149 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Alan Qiao
+* Tested on: Windows 11 22H2, Intel Xeon W-2145 @ 3.70GHz 64GB, RTX 3070 8GB (Dorm Workstation)
 
-### (TODO: Your README)
+## Introduction
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project explores the performance improvement of using GPU to parallelize functions that work on large array inputs. Specifically, this project explores two parallelizable reduction functions: prefix-sum (scan) and stream compaction.
 
+### Prefix Sum (Scan)
+
+Given an array of integers $A$ of length $N$, its inclusive prefix-sum array $P_I$ is defined as 
+$$P_I[i] = \sum_{k=0}^{i}{A[k]}, 0\leq i \leq N$$
+This is also known as an inclusive scan.  
+Similarly, we also define the exclusive prefix-sum array $P_E$ as
+$$P_E[i] = \sum_{k=0}^{i-1}{A[k]}, 0\leq i \leq N$$
+This is also known as an exclusive scan.
+
+In this project, four different implementations of exclusive scan are explored for performance comparison.
+1. **CPU**: Sequential CPU scan that adds the $i^{th}$ value to the $(i-1)^{th}$ prefix sum
+2. **GPU Naive**: Naive parallel scan that runs pairwise additions with widening offsets
+3. **GPU Work-Efficient**: Better pairwise scan that reduces unnecessary additions by spliting the process into a parallel reduction phase `kernUpSweep()` and a down-sweep phase `kernDownSweep()`
+4. **GPU Thrust API**: Uses the `exclusive_scan()` function from Thrust library
+
+### Stream Compaction
+Stream compaction is the process of filtering an array down to only the elements that satisfy a certain condition. Conceptually, this is essentially a filter function. In this specific project, the compaction condition is filtering for non-zero values, which has the effect of making the array more compact by removing unnecessary data. This is especially useful in graphics applications for reducing memory transfer by eliminating elements from buffers that are no longer in use.
+
+Four different implementations of stream compaction are explored for performance comparison.
+1. **CPU without scan**: A basic but efficient CPU sequential implementation
+2. **CPU with scan**: CPU implementation that follows the parallelizable procedure used in GPU implementations. (Not efficient for sequential execution.)
+3. **GPU with Work-Efficient Scan**: GPU implementation that creates a boolean array of whether the element satisfies the condition, then exclusively scans the array to get index of kept elements in the output array, and finally scatters the index array to extract the output elements.
+4. **GPU Thrust API**: Uses the `remove_if()` function from Thrust API.
+
+## Block Size
+The impact of block size is usually relatively miniscule for reasonable block sizes, but the reasonable range changes depending on the hardware. The following diagram illustrates the runtime of the different functions for a range of block sizes between 128 - 512 for the RTX 3070 8GB.
+
+![Figure 1](img/blocksize_comp.png)
+*Figure 1: Impact of kernel block size on GPU scan runtime*
+
+The difference is relatively insignificant, but all subsequent tests were conducted using a **block size of 256** as this seemed to offer best overall performance accross the different functions.
+
+## Sample Output
+This is a sample output for the testing program with block size of 256 and input size of $2^{28}$.
+```
+****************
+** SCAN TESTS **
+****************
+    [  15  15  29  14  43   8   9  44   2   6  42   7  14 ...   2   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 497.847ms    (std::chrono Measured)
+    [   0  15  30  59  73 116 124 133 177 179 185 227 234 ... -2015579320 -2015579318 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 485.974ms    (std::chrono Measured)
+    [   0  15  30  59  73 116 124 133 177 179 185 227 234 ... -2015579346 -2015579341 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 177.569ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 177.386ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 58.0995ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 59.1894ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 6.5176ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 7.06138ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   2   1   2   2   3   0   0   3   1   1   1   3 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 912.455ms    (std::chrono Measured)
+    [   2   2   1   2   2   3   3   1   1   1   3   1   1 ...   2   2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 689.073ms    (std::chrono Measured)
+    [   2   2   1   2   2   3   3   1   1   1   3   1   1 ...   3   3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 2022.49ms    (std::chrono Measured)
+    [   2   2   1   2   2   3   3   1   1   1   3   1   1 ...   2   2 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 79.7784ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 79.9567ms    (CUDA Measured)
+    passed
+==== thrust compact, power-of-two ====
+   elapsed time: 7.84336ms    (CUDA Measured)
+    passed
+==== thrust compact, non-power-of-two ====
+   elapsed time: 7.02477ms    (CUDA Measured)
+    passed
+```
+
+## Performance Analysis
+Since many of these GPU kernel functions only work on array sizes that are powers of two, each implementation is tested with two types of inputs. Arrays with sizes that are powers of two, and arrays with non-power-of-two sizes (NPOT).  
+Furthermore, to eliminate the impact of variable and expensive memory allocation and release costs from the analysis. All measurements are taken with all reasonable memory allocations completed before measurement and releases after measurement.
+
+### Scan with Power of 2 Array Sizes
+![](img/scan_performance.png)
+*Figure 2: Performance comparison of scan implementations given power of 2 array sizes. Both axis are displayed on $\log_2$ scale for visibility.*
+
+Some key observations:  
+* CPU implementation scales linearly with array size.
+* At smaller sizes ($N \lessapprox 2^{18}$), the kernel launching cost outweights the benefit of parallelization, resulting in superior CPU runtime.
+* Both Naive and work efficient level-out at large inputs to approximately the same linear scaling as the CPU implementation, but Naive is about $2\times$ faster than CPU, and Work-efficient is about $2\times$ faster than Naive.
+* Thrust implementation suffers a similar kernel start cost, but runtime scales slower than the other implementations, resulting in a significant performance improvement at large inputs.
+
+### Scan with Non Power of 2 (NPOT) Array Sizes
+To simulate this, the array sizes for the different tests are computed as $2^d - 3$, where $d$ is the power of 2 being tested.
+![](img/scan_performance_npot.png)
+*Figure 2: Performance comparison of scan implementations given NPOT array sizes. Both axis are displayed on $\log_2$ scale for visibility.*
+
+Some key observations:  
+* CPU implementation is agnostic to whether input size is a power of 2.
+* Naive and Work-efficient implementations scales similarly compared to power of 2 input sizes, but is overall a little bit slower with the additional overhead of morphing the input to a power of 2.
+* Thrust implementation seems to actually perform slightly better with NPOT input size, in fact a nearly constant runtime is observed up until the inflection point at ($N \approx 2^{18}$)
+
+### Why is Thrust Implementation much faster
+Without inspecting the source code for the Thrust implementation, the cause of the performance difference cannot be accurately deduced, but some insight can be gained from examining the cudaAPI profile.
+
+![](img/thrust_scan.png)
+*Figure 3: NVIDIA SYSTEMS profiling of Thrust::exclusive_scan()*
+
+Perhaps the biggest difference between the implementations is that Thrust likely loads the entire array into shared memory (`DeviceScanInitKernel()`) and makes all calculations (`DeviceScankernel()`) in shared memory. This is suggested by the use of `cudaStreamSynchronize()` to sync threads between iterations. This would be a major improvement as shared memory access is much faster than global memory access that is used exclusively in the Naive and Work-efficient implementation.
+
+### Stream Compaction
+The next figure groups all Stream Compaction tests, both power of 2 input size and NPOT input size into one graph for a most explicit side-by-side comparison. The same block size and range of input sizes are tested.
+
+![](img/compaction_performance.png)
+*Figure 4: Performance comparison of stream compaction implementations given NPOT array sizes. Both axis are displayed on $\log_2$ scale for visibility.*
+
+Some key observations:
+* CPU implementations are again agnostic to whether input size is a power of 2 and generally scales linearly
+* CPU with scan performes $2,3\times$ slower than without scan. This is expected as there are two extra iteration passes compared to the without scan implementation.
+* Kernel invocation costs are significant until $(N \approx 2^{22})$, where the parallelization benefit finally outweights the cost of kernel launching. The larger kernel cost is expected because two additional kernel functions are used compared to exclusive scan.
+* The difference between NPOT and power of 2 input sizes become insignificant relative to the overall function runtime at large enough inputs.
+* The Thrust implementation is much faster than the Work-Efficient implementation, at approximately $8\times$. This makes it faster than CPU implementation by $N \approx 2^{14}$.
diff --git a/img/blocksize_comp.png b/img/blocksize_comp.png
diff --git a/img/compaction_performance.png b/img/compaction_performance.png
diff --git a/img/scan_performance.png b/img/scan_performance.png
diff --git a/img/scan_performance_npot.png b/img/scan_performance_npot.png
diff --git a/img/thrust_compaction.png b/img/thrust_compaction.png
diff --git a/img/thrust_scan.png b/img/thrust_scan.png
diff --git a/src/main.cpp b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 28; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -54,11 +54,11 @@ int main(int argc, char* argv[]) {
     //printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
-    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
-    onesArray(SIZE, c);
-    printDesc("1s array for finding bugs");
-    StreamCompaction::Naive::scan(SIZE, c, a);
-    printArray(SIZE, c, true); */
+    //// For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
+    //onesArray(SIZE, c);
+    //printDesc("1s array for finding bugs");
+    //StreamCompaction::Naive::scan(SIZE, c, a);
+    //printArray(SIZE, c, true); 
 
     zeroArray(SIZE, c);
     printDesc("naive scan, non-power-of-two");
@@ -147,6 +147,20 @@ int main(int argc, char* argv[]) {
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
+    zeroArray(SIZE, c);
+    printDesc("thrust compact, power-of-two");
+    count = StreamCompaction::Thrust::compact(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(count, c, true);
+    printCmpLenResult(count, expectedCount, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("thrust compact, non-power-of-two");
+    count = StreamCompaction::Thrust::compact(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(count, c, true);
+    printCmpLenResult(count, expectedNPOT, b, c);
+
     system("pause"); // stop Win32 console from closing on exit
     delete[] a;
     delete[] b;

diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
@@ -24,6 +24,11 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            size_t index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            bools[index] = idata[index] == 0 ? 0 : 1;
         }
 
         /**
@@ -33,6 +38,13 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+            size_t index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            if (bools[index] == 1) {
+                odata[indices[index]] = idata[index];
+            }
         }
 
     }

diff --git a/stream_compaction/common.h b/stream_compaction/common.h
@@ -10,6 +10,10 @@
 #include <chrono>
 #include <stdexcept>
 
+#define BLOCK_SIZE 256
+#define GLOBAL_MEM 1
+#define SHARED_MEM 0
+
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
 

diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            odata[0] = 0;
+            for (int i = 1; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
             timer().endCpuTimer();
         }
 
@@ -31,8 +35,14 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int k = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[k++] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return k;
         }
 
         /**
@@ -41,10 +51,27 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+            int* nonzero = new int[n];
+            int* outIndex = new int[n];
             timer().startCpuTimer();
             // TODO
+            for (int i = 0; i < n; i++) {
+                nonzero[i] = idata[i] != 0;
+            }
+            outIndex[0] = 0;
+            for (int i = 1; i < n; i++) {
+                outIndex[i] = outIndex[i - 1] + nonzero[i - 1];
+            }
+            int cnt = outIndex[n - 1];
+            for (int i = 0; i < n; i++) {
+                if (nonzero[i]) {
+                    odata[outIndex[i]] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            delete[] nonzero;
+            delete[] outIndex;
+            return cnt;
         }
     }
 }