diff --git a/README.md b/README.md
index 0e38ddb..fb7f33c 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,149 @@ CUDA Stream Compaction
 
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
 
-* (TODO) YOUR NAME HERE
-  * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Alan Qiao
+* Tested on: Windows 11 22H2, Intel Xeon W-2145 @ 3.70GHz 64GB, RTX 3070 8GB (Dorm Workstation)
 
-### (TODO: Your README)
+## Introduction
 
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project explores the performance improvement of using GPU to parallelize functions that work on large array inputs. Specifically, this project explores two parallelizable reduction functions: prefix-sum (scan) and stream compaction.
 
+### Prefix Sum (Scan)
+
+Given an array of integers $A$ of length $N$, its inclusive prefix-sum array $P_I$ is defined as 
+$$P_I[i] = \sum_{k=0}^{i}{A[k]}, 0\leq i \leq N$$
+This is also known as an inclusive scan.  
+Similarly, we also define the exclusive prefix-sum array $P_E$ as
+$$P_E[i] = \sum_{k=0}^{i-1}{A[k]}, 0\leq i \leq N$$
+This is also known as an exclusive scan.
+
+In this project, four different implementations of exclusive scan are explored for performance comparison.
+1. **CPU**: Sequential CPU scan that adds the $i^{th}$ value to the $(i-1)^{th}$ prefix sum
+2. **GPU Naive**: Naive parallel scan that runs pairwise additions with widening offsets
+3. **GPU Work-Efficient**: Better pairwise scan that reduces unnecessary additions by spliting the process into a parallel reduction phase `kernUpSweep()` and a down-sweep phase `kernDownSweep()`
+4. **GPU Thrust API**: Uses the `exclusive_scan()` function from Thrust library
+
+### Stream Compaction
+Stream compaction is the process of filtering an array down to only the elements that satisfy a certain condition. Conceptually, this is essentially a filter function. In this specific project, the compaction condition is filtering for non-zero values, which has the effect of making the array more compact by removing unnecessary data. This is especially useful in graphics applications for reducing memory transfer by eliminating elements from buffers that are no longer in use.
+
+Four different implementations of stream compaction are explored for performance comparison.
+1. **CPU without scan**: A basic but efficient CPU sequential implementation
+2. **CPU with scan**: CPU implementation that follows the parallelizable procedure used in GPU implementations. (Not efficient for sequential execution.)
+3. **GPU with Work-Efficient Scan**: GPU implementation that creates a boolean array of whether the element satisfies the condition, then exclusively scans the array to get index of kept elements in the output array, and finally scatters the index array to extract the output elements.
+4. **GPU Thrust API**: Uses the `remove_if()` function from Thrust API.
+
+## Block Size
+The impact of block size is usually relatively miniscule for reasonable block sizes, but the reasonable range changes depending on the hardware. The following diagram illustrates the runtime of the different functions for a range of block sizes between 128 - 512 for the RTX 3070 8GB.
+
+![Figure 1](img/blocksize_comp.png)
+*Figure 1: Impact of kernel block size on GPU scan runtime*
+
+The difference is relatively insignificant, but all subsequent tests were conducted using a **block size of 256** as this seemed to offer best overall performance accross the different functions.
+
+## Sample Output
+This is a sample output for the testing program with block size of 256 and input size of $2^{28}$.
+```
+****************
+** SCAN TESTS **
+****************
+    [  15  15  29  14  43   8   9  44   2   6  42   7  14 ...   2   0 ]
+==== cpu scan, power-of-two ====
+   elapsed time: 497.847ms    (std::chrono Measured)
+    [   0  15  30  59  73 116 124 133 177 179 185 227 234 ... -2015579320 -2015579318 ]
+==== cpu scan, non-power-of-two ====
+   elapsed time: 485.974ms    (std::chrono Measured)
+    [   0  15  30  59  73 116 124 133 177 179 185 227 234 ... -2015579346 -2015579341 ]
+    passed
+==== naive scan, power-of-two ====
+   elapsed time: 177.569ms    (CUDA Measured)
+    passed
+==== naive scan, non-power-of-two ====
+   elapsed time: 177.386ms    (CUDA Measured)
+    passed
+==== work-efficient scan, power-of-two ====
+   elapsed time: 58.0995ms    (CUDA Measured)
+    passed
+==== work-efficient scan, non-power-of-two ====
+   elapsed time: 59.1894ms    (CUDA Measured)
+    passed
+==== thrust scan, power-of-two ====
+   elapsed time: 6.5176ms    (CUDA Measured)
+    passed
+==== thrust scan, non-power-of-two ====
+   elapsed time: 7.06138ms    (CUDA Measured)
+    passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+    [   2   2   1   2   2   3   0   0   3   1   1   1   3 ...   2   0 ]
+==== cpu compact without scan, power-of-two ====
+   elapsed time: 912.455ms    (std::chrono Measured)
+    [   2   2   1   2   2   3   3   1   1   1   3   1   1 ...   2   2 ]
+    passed
+==== cpu compact without scan, non-power-of-two ====
+   elapsed time: 689.073ms    (std::chrono Measured)
+    [   2   2   1   2   2   3   3   1   1   1   3   1   1 ...   3   3 ]
+    passed
+==== cpu compact with scan ====
+   elapsed time: 2022.49ms    (std::chrono Measured)
+    [   2   2   1   2   2   3   3   1   1   1   3   1   1 ...   2   2 ]
+    passed
+==== work-efficient compact, power-of-two ====
+   elapsed time: 79.7784ms    (CUDA Measured)
+    passed
+==== work-efficient compact, non-power-of-two ====
+   elapsed time: 79.9567ms    (CUDA Measured)
+    passed
+==== thrust compact, power-of-two ====
+   elapsed time: 7.84336ms    (CUDA Measured)
+    passed
+==== thrust compact, non-power-of-two ====
+   elapsed time: 7.02477ms    (CUDA Measured)
+    passed
+```
+
+## Performance Analysis
+Since many of these GPU kernel functions only work on array sizes that are powers of two, each implementation is tested with two types of inputs. Arrays with sizes that are powers of two, and arrays with non-power-of-two sizes (NPOT).  
+Furthermore, to eliminate the impact of variable and expensive memory allocation and release costs from the analysis. All measurements are taken with all reasonable memory allocations completed before measurement and releases after measurement.
+
+### Scan with Power of 2 Array Sizes
+![](img/scan_performance.png)
+*Figure 2: Performance comparison of scan implementations given power of 2 array sizes. Both axis are displayed on $\log_2$ scale for visibility.*
+
+Some key observations:  
+* CPU implementation scales linearly with array size.
+* At smaller sizes ($N \lessapprox 2^{18}$), the kernel launching cost outweights the benefit of parallelization, resulting in superior CPU runtime.
+* Both Naive and work efficient level-out at large inputs to approximately the same linear scaling as the CPU implementation, but Naive is about $2\times$ faster than CPU, and Work-efficient is about $2\times$ faster than Naive.
+* Thrust implementation suffers a similar kernel start cost, but runtime scales slower than the other implementations, resulting in a significant performance improvement at large inputs.
+
+### Scan with Non Power of 2 (NPOT) Array Sizes
+To simulate this, the array sizes for the different tests are computed as $2^d - 3$, where $d$ is the power of 2 being tested.
+![](img/scan_performance_npot.png)
+*Figure 2: Performance comparison of scan implementations given NPOT array sizes. Both axis are displayed on $\log_2$ scale for visibility.*
+
+Some key observations:  
+* CPU implementation is agnostic to whether input size is a power of 2.
+* Naive and Work-efficient implementations scales similarly compared to power of 2 input sizes, but is overall a little bit slower with the additional overhead of morphing the input to a power of 2.
+* Thrust implementation seems to actually perform slightly better with NPOT input size, in fact a nearly constant runtime is observed up until the inflection point at ($N \approx 2^{18}$)
+
+### Why is Thrust Implementation much faster
+Without inspecting the source code for the Thrust implementation, the cause of the performance difference cannot be accurately deduced, but some insight can be gained from examining the cudaAPI profile.
+
+![](img/thrust_scan.png)
+*Figure 3: NVIDIA SYSTEMS profiling of Thrust::exclusive_scan()*
+
+Perhaps the biggest difference between the implementations is that Thrust likely loads the entire array into shared memory (`DeviceScanInitKernel()`) and makes all calculations (`DeviceScankernel()`) in shared memory. This is suggested by the use of `cudaStreamSynchronize()` to sync threads between iterations. This would be a major improvement as shared memory access is much faster than global memory access that is used exclusively in the Naive and Work-efficient implementation.
+
+### Stream Compaction
+The next figure groups all Stream Compaction tests, both power of 2 input size and NPOT input size into one graph for a most explicit side-by-side comparison. The same block size and range of input sizes are tested.
+
+![](img/compaction_performance.png)
+*Figure 4: Performance comparison of stream compaction implementations given NPOT array sizes. Both axis are displayed on $\log_2$ scale for visibility.*
+
+Some key observations:
+* CPU implementations are again agnostic to whether input size is a power of 2 and generally scales linearly
+* CPU with scan performes $2,3\times$ slower than without scan. This is expected as there are two extra iteration passes compared to the without scan implementation.
+* Kernel invocation costs are significant until $(N \approx 2^{22})$, where the parallelization benefit finally outweights the cost of kernel launching. The larger kernel cost is expected because two additional kernel functions are used compared to exclusive scan.
+* The difference between NPOT and power of 2 input sizes become insignificant relative to the overall function runtime at large enough inputs.
+* The Thrust implementation is much faster than the Work-Efficient implementation, at approximately $8\times$. This makes it faster than CPU implementation by $N \approx 2^{14}$.
diff --git a/img/blocksize_comp.png b/img/blocksize_comp.png
new file mode 100644
index 0000000..8d69d83
Binary files /dev/null and b/img/blocksize_comp.png differ
diff --git a/img/compaction_performance.png b/img/compaction_performance.png
new file mode 100644
index 0000000..e86d5b9
Binary files /dev/null and b/img/compaction_performance.png differ
diff --git a/img/scan_performance.png b/img/scan_performance.png
new file mode 100644
index 0000000..5fd703c
Binary files /dev/null and b/img/scan_performance.png differ
diff --git a/img/scan_performance_npot.png b/img/scan_performance_npot.png
new file mode 100644
index 0000000..645e099
Binary files /dev/null and b/img/scan_performance_npot.png differ
diff --git a/img/thrust_compaction.png b/img/thrust_compaction.png
new file mode 100644
index 0000000..56ead48
Binary files /dev/null and b/img/thrust_compaction.png differ
diff --git a/img/thrust_scan.png b/img/thrust_scan.png
new file mode 100644
index 0000000..436c47f
Binary files /dev/null and b/img/thrust_scan.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..081f079 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
 #include <stream_compaction/thrust.h>
 #include "testing_helpers.hpp"
 
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 28; // feel free to change the size of array
 const int NPOT = SIZE - 3; // Non-Power-Of-Two
 int *a = new int[SIZE];
 int *b = new int[SIZE];
@@ -54,11 +54,11 @@ int main(int argc, char* argv[]) {
     //printArray(SIZE, c, true);
     printCmpResult(SIZE, b, c);
 
-    /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
-    onesArray(SIZE, c);
-    printDesc("1s array for finding bugs");
-    StreamCompaction::Naive::scan(SIZE, c, a);
-    printArray(SIZE, c, true); */
+    //// For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
+    //onesArray(SIZE, c);
+    //printDesc("1s array for finding bugs");
+    //StreamCompaction::Naive::scan(SIZE, c, a);
+    //printArray(SIZE, c, true); 
 
     zeroArray(SIZE, c);
     printDesc("naive scan, non-power-of-two");
@@ -147,6 +147,20 @@ int main(int argc, char* argv[]) {
     //printArray(count, c, true);
     printCmpLenResult(count, expectedNPOT, b, c);
 
+    zeroArray(SIZE, c);
+    printDesc("thrust compact, power-of-two");
+    count = StreamCompaction::Thrust::compact(SIZE, c, a);
+    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(count, c, true);
+    printCmpLenResult(count, expectedCount, b, c);
+
+    zeroArray(SIZE, c);
+    printDesc("thrust compact, non-power-of-two");
+    count = StreamCompaction::Thrust::compact(NPOT, c, a);
+    printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+    //printArray(count, c, true);
+    printCmpLenResult(count, expectedNPOT, b, c);
+
     system("pause"); // stop Win32 console from closing on exit
     delete[] a;
     delete[] b;
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..9a481fd 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -24,6 +24,11 @@ namespace StreamCompaction {
          */
         __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
             // TODO
+            size_t index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            bools[index] = idata[index] == 0 ? 0 : 1;
         }
 
         /**
@@ -33,6 +38,13 @@ namespace StreamCompaction {
         __global__ void kernScatter(int n, int *odata,
                 const int *idata, const int *bools, const int *indices) {
             // TODO
+            size_t index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= n) {
+                return;
+            }
+            if (bools[index] == 1) {
+                odata[indices[index]] = idata[index];
+            }
         }
 
     }
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index d2c1fed..ee86ae3 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -10,6 +10,10 @@
 #include <chrono>
 #include <stdexcept>
 
+#define BLOCK_SIZE 256
+#define GLOBAL_MEM 1
+#define SHARED_MEM 0
+
 #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
 #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
 
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..e4ecdb3 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -20,6 +20,10 @@ namespace StreamCompaction {
         void scan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            odata[0] = 0;
+            for (int i = 1; i < n; i++) {
+                odata[i] = odata[i - 1] + idata[i - 1];
+            }
             timer().endCpuTimer();
         }
 
@@ -31,8 +35,14 @@ namespace StreamCompaction {
         int compactWithoutScan(int n, int *odata, const int *idata) {
             timer().startCpuTimer();
             // TODO
+            int k = 0;
+            for (int i = 0; i < n; i++) {
+                if (idata[i] != 0) {
+                    odata[k++] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            return k;
         }
 
         /**
@@ -41,10 +51,27 @@ namespace StreamCompaction {
          * @returns the number of elements remaining after compaction.
          */
         int compactWithScan(int n, int *odata, const int *idata) {
+            int* nonzero = new int[n];
+            int* outIndex = new int[n];
             timer().startCpuTimer();
             // TODO
+            for (int i = 0; i < n; i++) {
+                nonzero[i] = idata[i] != 0;
+            }
+            outIndex[0] = 0;
+            for (int i = 1; i < n; i++) {
+                outIndex[i] = outIndex[i - 1] + nonzero[i - 1];
+            }
+            int cnt = outIndex[n - 1];
+            for (int i = 0; i < n; i++) {
+                if (nonzero[i]) {
+                    odata[outIndex[i]] = idata[i];
+                }
+            }
             timer().endCpuTimer();
-            return -1;
+            delete[] nonzero;
+            delete[] outIndex;
+            return cnt;
         }
     }
 }
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..f367a27 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,6 +3,10 @@
 #include "common.h"
 #include "efficient.h"
 
+#define NUM_BANKS 32 
+#define LOG_NUM_BANKS 5 
+#define CONFLICT_FREE_OFFSET(n) ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS)) 
+
 namespace StreamCompaction {
     namespace Efficient {
         using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +16,54 @@ namespace StreamCompaction {
             return timer;
         }
 
+        __global__ void kernUpSweep(int N, int offset, int* data) {
+            size_t k = (threadIdx.x + (blockIdx.x * blockDim.x)) * 2 * offset;
+            if (k >= N) {
+                return;
+            }
+            data[k + 2 * offset - 1] += data[k + offset - 1];
+        }
+
+        __global__ void kernDownSweep(int N, int offset, int* data) {
+            size_t k = (threadIdx.x + (blockIdx.x * blockDim.x)) * 2 * offset;
+            if (k >= N) {
+                return;
+            }
+            size_t left = k + offset - 1;
+            size_t right = k + 2 * offset - 1;
+            int temp = data[left];
+            data[left] = data[right];
+            data[right] += temp;
+        }
+
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            int N = 1 << ilog2ceil(n);
+            int* dev_data;
+            cudaMalloc((void**)&dev_data, N * sizeof(int));
+            checkCUDAErrorFn("malloc dev_data failed");
+            cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("copy idata to dev_data failed");
+            dim3 B((N + BLOCK_SIZE - 1) / BLOCK_SIZE);
             timer().startGpuTimer();
             // TODO
+            for (int i = 1; i < N; i <<= 1) {
+                kernUpSweep << <B, BLOCK_SIZE >> > (N, i, dev_data);
+                B.x = std::max(B.x >> 1, 1U);
+            }
+            cudaMemset(dev_data + N - 1, 0, sizeof(int));
+            checkCUDAErrorFn("set dev_data root to 0 failed");
+            for (int i = N >> 1; i > 0; i >>= 1) {
+                B.x = (N / (2 * i) + BLOCK_SIZE - 1) / BLOCK_SIZE;
+                kernDownSweep << <B, BLOCK_SIZE >> > (N, i, dev_data);
+            }
             timer().endGpuTimer();
+            cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorFn("copy dev_data to odata failed");
+            cudaFree(dev_data);
+            checkCUDAErrorFn("cudaFree dev_data failed");
         }
 
         /**
@@ -31,10 +76,60 @@ namespace StreamCompaction {
          * @returns      The number of elements remaining after compaction.
          */
         int compact(int n, int *odata, const int *idata) {
+            int N = 1 << ilog2ceil(n);
+            int* dev_idata;
+            int* dev_bools;
+            int* dev_index;
+            int* dev_odata;
+            int compressedLength;
+
+            cudaMalloc((void**)&dev_idata, N * sizeof(int));
+            checkCUDAErrorFn("malloc dev_idata failed");
+            cudaMemset(dev_idata, 0, N * sizeof(int));
+            checkCUDAErrorFn("zeroing dev_idata failed");
+            cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("copy idata to dev_idata failed");
+            cudaMalloc((void**)&dev_bools, N * sizeof(int));
+            checkCUDAErrorFn("malloc dev_bools failed");
+            cudaMalloc((void**)&dev_index, N * sizeof(int));
+            checkCUDAErrorFn("malloc dev_index failed");
+
+            cudaMalloc((void**)&dev_odata, N * sizeof(int));
+            checkCUDAErrorFn("malloc dev_odata failed");
+            
+            dim3 fullBlocksPerGrid((N + BLOCK_SIZE - 1) / BLOCK_SIZE);
+            dim3 B((N + BLOCK_SIZE - 1) / BLOCK_SIZE);
             timer().startGpuTimer();
             // TODO
+            StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, BLOCK_SIZE>>>(N, dev_bools, dev_idata);
+            cudaMemcpy(dev_index, dev_bools, N * sizeof(int), cudaMemcpyDeviceToDevice);
+            for (int i = 1; i < N; i <<= 1) {
+                kernUpSweep << <B, BLOCK_SIZE >> > (N, i, dev_index);
+                B.x = std::max(B.x >> 1, 1U);
+            }
+            cudaMemset(dev_index + N - 1, 0, sizeof(int));
+            checkCUDAErrorFn("set dev_bools root to 0 failed");
+            for (int i = N >> 1; i > 0; i >>= 1) {
+                B.x = (N / i + BLOCK_SIZE - 1) / BLOCK_SIZE;
+                kernDownSweep << <B, BLOCK_SIZE >> > (N, i, dev_index);
+            }
+            cudaMemcpy(&compressedLength, dev_index + N - 1, sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorFn("copy dev_index[N-1] to compressedLength failed");
+            StreamCompaction::Common::kernScatter << <fullBlocksPerGrid, BLOCK_SIZE >> > (n, dev_odata, dev_idata, dev_bools, dev_index);
             timer().endGpuTimer();
-            return -1;
+            cudaMemcpy(odata, dev_odata, compressedLength * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorFn("copy dev_odata to odata failed");
+
+            cudaFree(dev_idata);
+            checkCUDAErrorFn("cudaFree dev_idata failed");
+            cudaFree(dev_odata);
+            checkCUDAErrorFn("cudaFree dev_odata failed");
+            cudaFree(dev_bools);
+            checkCUDAErrorFn("cudaFree dev_bools failed");
+            cudaFree(dev_index);
+            checkCUDAErrorFn("cudaFree dev_index failed");
+
+            return compressedLength;
         }
     }
 }
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..5adfeba 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -3,6 +3,8 @@
 #include "common.h"
 #include "naive.h"
 
+// 48kb shmem per threadblock. What should block size be?
+
 namespace StreamCompaction {
     namespace Naive {
         using StreamCompaction::Common::PerformanceTimer;
@@ -11,15 +13,91 @@ namespace StreamCompaction {
             static PerformanceTimer timer;
             return timer;
         }
+
+#if GLOBAL_MEM
+        int* dev_buf1;
+        int* dev_buf2;
+
         // TODO: __global__
+        __global__ void kernArrayShiftRight(int N, int* odata, const int* idata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+            odata[index] = (index == 0) ? 0 : idata[index - 1];
+        }
+
+        __global__ void kernArrayAddWithOffset(int N, int offset, int* odata, const int* idata) {
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);
+            if (index >= N) {
+                return;
+            }
+            odata[index] = idata[index];
+            if (index >= offset) {
+                odata[index] += idata[index - offset];
+            }
+        }
+
+        /**
+         * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+         */
+        void scan(int n, int* odata, const int* idata) {
+            int N = 1 << ilog2ceil(n);
+            cudaMalloc((void**)&dev_buf1, N * sizeof(int));
+            checkCUDAErrorFn("malloc dev_buf1 failed");
+            cudaMalloc((void**)&dev_buf2, N * sizeof(int));
+            checkCUDAErrorFn("malloc dev_buf2 failed");
+            cudaMemcpy(dev_buf1, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+            checkCUDAErrorFn("copy idata to dev_buf1 failed");
+
+            dim3 fullBlocksPerGrid((N + BLOCK_SIZE - 1) / BLOCK_SIZE);
+
+            timer().startGpuTimer();
+            // TODO
+            for (int d = 1; d < N; d <<= 1) {
+                kernArrayAddWithOffset << <fullBlocksPerGrid, BLOCK_SIZE >> > (N, d, dev_buf2, dev_buf1);
+                std::swap(dev_buf1, dev_buf2);
+            }
+            kernArrayShiftRight<<<fullBlocksPerGrid, BLOCK_SIZE>>>(N, dev_buf2, dev_buf1);
+            timer().endGpuTimer();
+
+            cudaMemcpy(odata, dev_buf2, n * sizeof(int), cudaMemcpyDeviceToHost);
+            checkCUDAErrorFn("copy dev_buf2 to odata failed");
+            cudaFree(dev_buf1);
+            checkCUDAErrorFn("free dev_buf1 failed");
+            cudaFree(dev_buf2);
+            checkCUDAErrorFn("free dev_buf2 failed");
+        }
+#endif
+#if SHARED_MEM
+        __global__ void scan(float* g_odata, float* g_idata, int n) {
+            extern __shared__ float temp[]; // allocated on invocation    
+            int index = threadIdx.x + (blockIdx.x * blockDim.x);   
+            int pout = 0, pin = 1;   // Load input into shared memory.    
+                                     // This is exclusive scan, so shift right by one    
+                                     // and set first element to 0   
+            temp[pout*n + index] = (index > 0) ? g_idata[index-1] : 0;   
+            __syncthreads();   
+            for (int offset = 1; offset < n; offset <<= 1)   {     
+                pout = 1 - pout; // swap double buffer indices     
+                pin = 1 - pout;     
+                if (index >= offset)       
+                    temp[pout*n+index] += temp[pin*n+index - offset];     
+                else       
+                    temp[pout*n+index] = temp[pin*n+index];
+                __syncthreads();   
+            }   
+            g_odata[index] = temp[pout*n+index]; // write output 
+        } 
 
         /**
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
-        void scan(int n, int *odata, const int *idata) {
+        void scan(int n, int* odata, const int* idata) {
             timer().startGpuTimer();
             // TODO
             timer().endGpuTimer();
         }
+#endif
     }
 }
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..e3b1c84 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -3,6 +3,7 @@
 #include <thrust/device_vector.h>
 #include <thrust/host_vector.h>
 #include <thrust/scan.h>
+#include <thrust/remove.h>
 #include "common.h"
 #include "thrust.h"
 
@@ -18,11 +19,34 @@ namespace StreamCompaction {
          * Performs prefix-sum (aka scan) on idata, storing the result into odata.
          */
         void scan(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dev_in(idata, idata + n);
+            thrust::device_vector<int> dev_out(n);
             timer().startGpuTimer();
             // TODO use `thrust::exclusive_scan`
             // example: for device_vectors dv_in and dv_out:
             // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+            thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());
             timer().endGpuTimer();
+            thrust::copy_n(dev_out.begin(), n, odata);
+        }
+
+        struct isZero
+        {
+            __host__ __device__
+                bool operator()(const int x)
+            {
+                return (x == 0);
+            }
+        };
+
+        int compact(int n, int* odata, const int* idata) {
+            thrust::device_vector<int> dev_in(idata, idata + n);
+            timer().startGpuTimer();
+            // TODO
+            thrust::detail::normal_iterator<thrust::device_ptr<int>> endIter = thrust::remove_if(dev_in.begin(), dev_in.end(), isZero());
+            timer().endGpuTimer();
+            thrust::copy(dev_in.begin(), endIter, odata);
+            return endIter - dev_in.begin();
         }
     }
 }
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index fe98206..d5c82ec 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -7,5 +7,7 @@ namespace StreamCompaction {
         StreamCompaction::Common::PerformanceTimer& timer();
 
         void scan(int n, int *odata, const int *idata);
+
+        int compact(int n, int* odata, const int* idata);
     }
 }