diff --git a/README.md b/README.md
index 0e38ddb1..da80adbb 100644
--- a/README.md
+++ b/README.md
@@ -1,14 +1,220 @@
-CUDA Stream Compaction
-======================
+# CUDA Stream Compaction
+Work-efficient scan, stream compaction, and radix sort in CUDA.
-**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
+**Author:** Lu M.
+- [LinkedIn](https://www.linkedin.com/in/lu-m-673425323/)
+- [Personal Site](https://lu-m-dev.github.io)
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+**Tested System:**
+ - Windows 11 Home
+ - AMD Ryzen 7 5800HS @ 3.20GHz, 16GB RAM
+ - NVIDIA GeForce RTX 3060 Laptop GPU 6GB (Compute Capability 8.6)
-### (TODO: Your README)
+## Abstract
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project implements and compares several scan and stream compaction algorithms on both CPU and GPU. The goal is to evaluate the performance and scalability of:
+- CPU serial scan
+- GPU naive parallel scan
+- GPU work-efficient parallel scan
+- NVIDIA Thrust library (`thrust::exclusive_scan()`)
+
+![Demo](img/demo.gif)
+
+Performance is measured for scan and stream compaction over large arrays, highlighting scalability and efficiency. Thrust's library implementation is the fastest, while the work-efficient GPU scan significantly outperforms both the CPU and naive GPU approaches at scale. The naive scan never beats the CPU, likely due to inefficient thread utilization and memory access patterns.
+
+---
+
+## Build Instructions
+
+1. Clone the repository:
+ ```sh
+ git clone https://github.com/lu-m-dev/CUDA-stream-compaction.git
+ ```
+2. Navigate to the project directory:
+ ```sh
+ cd CUDA-stream-compaction
+ ```
+3. Build with CMake:
+ ```sh
+ cmake -B build -S . -G "Visual Studio 17 2022"
+ ```
+4. Open the solution in Visual Studio:
+ ```sh
+ cd build
+ start ./cis5650_stream_compaction_test.sln
+ ```
+
+---
+
+## Methods
+
+### CPU Sequential Scan
+
+
+The CPU scan is a simple exclusive prefix sum: iterate over the input array while accumulating a running total. It serves as the baseline for performance comparison.
+
+```c++
+StreamCompaction::CPU::scan(int n, int *out, const int *in)
+ out[0] = 0
+    for k = 1 to n - 1:
+ out[k] = out[k-1] + in[k-1]
+```
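The pseudocode maps directly to a few lines of C++. The sketch below is illustrative (not the project's exact `StreamCompaction::CPU::scan`) and computes the same exclusive prefix sum:

```cpp
#include <vector>

// Exclusive prefix sum: out[k] = in[0] + ... + in[k-1], with out[0] = 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    if (in.empty()) return out;
    out[0] = 0;  // identity element for addition
    for (std::size_t k = 1; k < in.size(); ++k) {
        out[k] = out[k - 1] + in[k - 1];
    }
    return out;
}
```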
+
+### Naive Parallel Scan (GPU)
+
+The naive parallel scan uses one thread per element and computes partial sums in a stepwise fashion. Each iteration doubles the offset, and because threads would otherwise overwrite values still needed by their neighbors, the implementation ping-pongs between two buffers.
+
+![Naive parallel scan](img/naive.png)
+
+---
+
+```c++
+StreamCompaction::Naive::scan(int n, int *out, const int *in)
+    for d = 1 to log2(n):
+        for all k in parallel:
+            if (k >= 2^(d-1)):
+                out[k] = in[k - 2^(d-1)] + in[k]
+            else:
+                out[k] = in[k]
+        swap(in, out)
+```
+
+This implementation is simple but suffers from inefficient memory access and thread divergence, which limits its scalability and performance on large arrays.
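The ping-pong pattern is easiest to see in a serial simulation. This host-side sketch (an illustration, not the CUDA kernel itself) swaps the two buffers between passes instead of copying:

```cpp
#include <algorithm>
#include <vector>

// Serial simulation of the naive scan's double buffering.
// Produces an inclusive scan; shifting right by one and inserting a
// leading 0 yields the exclusive result.
std::vector<int> naiveScanSim(std::vector<int> in) {
    int n = static_cast<int>(in.size());
    std::vector<int> out(n);
    for (int offset = 1; offset < n; offset <<= 1) {
        for (int k = 0; k < n; ++k) {  // one GPU pass: all k "in parallel"
            out[k] = (k >= offset) ? in[k - offset] + in[k] : in[k];
        }
        std::swap(in, out);  // ping-pong instead of a memcpy between passes
    }
    return in;  // the latest result lives in `in` after the final swap
}
```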
+
+### Work-Efficient Parallel Scan (GPU)
+
+The work-efficient scan improves parallelism and memory access by using an upsweep and downsweep phase. It operates in-place and is more suitable for large-scale data.
+
+#### Upsweep Phase
+
+![Upsweep phase](img/upsweep.png)
+
+---
+
+```c++
+for d = 0 to log2(n) - 1:
+ for all k = 0 to (n-1) by 2^(d+1) in parallel:
+ x[k + 2^(d+1) - 1] += x[k + 2^d - 1]
+```
+
+#### Downsweep Phase
+
+
+![Downsweep phase](img/downsweep.png)
+
+---
+
+```c++
+x[n-1] = 0
+for d = log2(n) - 1 down to 0:
+ for all k = 0 to n-1 by 2^(d+1) in parallel:
+ temp = x[k + 2^d - 1]
+ x[k + 2^d - 1] = x[k + 2^(d+1) - 1]
+ x[k + 2^(d+1) - 1] += temp
+```
+The complete work-efficient parallel scan is implemented in the function `StreamCompaction::Efficient::scan(int n, int *out, const int *in)`.
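As a sanity check, both phases can be run serially on the host. The sketch below is illustrative (the real work happens in CUDA kernels) and performs the in-place exclusive scan on a power-of-two-sized array:

```cpp
#include <vector>

// Serial up-sweep/down-sweep exclusive scan, in place.
// Assumes x.size() is a non-zero power of two (the GPU version pads to one).
void workEfficientScanSim(std::vector<int>& x) {
    int n = static_cast<int>(x.size());
    // Up-sweep: build partial sums bottom-up in a balanced-tree pattern.
    for (int stride = 1; stride < n; stride <<= 1) {
        for (int k = 0; k < n; k += 2 * stride) {
            x[k + 2 * stride - 1] += x[k + stride - 1];
        }
    }
    // Down-sweep: zero the root, then trade-and-add back down the tree.
    x[n - 1] = 0;
    for (int stride = n / 2; stride >= 1; stride >>= 1) {
        for (int k = 0; k < n; k += 2 * stride) {
            int t = x[k + stride - 1];
            x[k + stride - 1] = x[k + 2 * stride - 1];
            x[k + 2 * stride - 1] += t;
        }
    }
}
```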
+
+### Stream Compaction
+
+Stream compaction removes unwanted elements (e.g., zeros) from an array. The process involves:
+1. Mapping the input array to a boolean array (1 for keep, 0 for discard).
+2. Performing a scan on the boolean array to compute the output indices.
+3. Scattering the valid elements to their new positions.
+
+![Stream compaction example](img/compaction.png)
+
+Parallel stream compaction is implemented in the function `StreamCompaction::Efficient::compact(int n, int *out, const int *in)`. It calls the work-efficient parallel scan for **Step 2** described above.
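The three steps look like this in a serial host-side sketch (illustrative; the GPU version runs each step as a kernel and reuses the work-efficient scan):

```cpp
#include <vector>

// Serial map -> exclusive scan -> scatter, mirroring the three steps above.
std::vector<int> compactSim(const std::vector<int>& in) {
    int n = static_cast<int>(in.size());
    std::vector<int> bools(n), indices(n);
    for (int i = 0; i < n; ++i) {
        bools[i] = (in[i] != 0) ? 1 : 0;  // 1. map to 0/1 keep flags
    }
    int kept = 0;
    for (int i = 0; i < n; ++i) {         // 2. exclusive scan of the flags
        indices[i] = kept;
        kept += bools[i];
    }
    std::vector<int> out(kept);
    for (int i = 0; i < n; ++i) {         // 3. scatter survivors to new slots
        if (bools[i]) out[indices[i]] = in[i];
    }
    return out;
}
```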
+
+### Radix Sort
+
+Radix sort leverages scan operations to sort integers by processing each bit position. For each bit:
+1. Map input to a boolean array (true/false for bit value).
+2. Scan the negated boolean array to count zeros.
+3. Use the scan results to index and scatter elements into sorted positions.
+4. Repeat for each bit from least to most significant.
+
+![Radix sort, per-bit split](img/radix_1.png)
+
+---
+
+![Radix sort, scatter step](img/radix_2.png)
+
+Radix sort is implemented in the function `StreamCompaction::Efficient::sort(int n, int *out, const int *in)`. It calls the work-efficient parallel scan for **Step 2** described above.
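The per-bit split can be sketched serially as below (an illustration, not the project's GPU code, which performs the same steps with kernels). Each pass scans the negated bit to place the 0-bit elements, then appends the 1-bit elements after them, keeping the sort stable:

```cpp
#include <algorithm>
#include <vector>

// Serial LSD radix sort built on an exclusive scan of the negated bit.
// Assumes non-negative inputs smaller than 2^numBits.
std::vector<int> radixSortSim(std::vector<int> a, int numBits) {
    int n = static_cast<int>(a.size());
    std::vector<int> falseIdx(n), next(n);
    for (int bit = 0; bit < numBits; ++bit) {
        int totalFalses = 0;
        for (int i = 0; i < n; ++i) {  // exclusive scan of "bit is 0" flags
            falseIdx[i] = totalFalses;
            totalFalses += !((a[i] >> bit) & 1);
        }
        int trues = 0;
        for (int i = 0; i < n; ++i) {  // stable split-scatter
            if ((a[i] >> bit) & 1) next[totalFalses + trues++] = a[i];
            else next[falseIdx[i]] = a[i];
        }
        std::swap(a, next);
    }
    return a;
}
```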
+
+### Thrust Library
+
+NVIDIA's Thrust library provides highly optimized parallel primitives, including `thrust::exclusive_scan()` and `thrust::sort()`. These serve as benchmarks for professional GPU performance.
+
+**Thrust functions used in this project:**
+- `StreamCompaction::Thrust::scan(int n, int *out, const int *in)`
+- `StreamCompaction::Thrust::thrustSort(int n, int *out, const int *in)`
+
+---
+
+## Results
+
+Performance was measured on Release builds for input arrays ranging from $2^{21}$ (~2 million) to $2^{27}$ (~134 million) elements. The following plots illustrate scalability and efficiency:
+
+![Scan performance plot](img/plot_scan.png)
+
+**Scan Performance:**
+- At $2^{27}$ (134 million elements):
+ - Naive scan: 124 ms
+ - CPU sequential scan: 80 ms
+ - Work-efficient scan: 44 ms
+ - Thrust scan: 5 ms
+
+Thrust is the fastest, with work-efficient scan showing strong scalability. Naive scan is limited by memory and thread inefficiencies.
+
+![Compaction performance plot](img/plot_compact.png)
+
+**Compaction Performance:**
+- At $2^{27}$:
+ - CPU sequential (no scan): 220 ms
+ - CPU compact (with scan): 530 ms
+ - Parallel compact (work-efficient scan): 57 ms
+
+At this size, parallel compaction is roughly 4x faster than CPU compaction without scan and over 9x faster than CPU compaction with scan.
+
+**Power-of-Two vs Non-Power-of-Two:**
+CPU and naive algorithms are unaffected by array-size alignment. The work-efficient scan pads arrays to the next power of two, so non-power-of-two inputs only incur the padding cost. Measurements show alignment has negligible impact on elapsed time, so it is not discussed explicitly in this report.
+
+**Radix Sort:**
+Radix sort was implemented using work-efficient scan and compared to Thrust sort. Correctness was verified, but performance lags behind Thrust due to kernel launch overhead and array management.
+
+```
+**********************
+** RADIX SORT TESTS **
+**********************
+==== thrust sort, power-of-two ====
+ elapsed time: 74.9665ms (CUDA Measured)
+==== thrust sort, non-power-of-two ====
+ elapsed time: 40.6305ms (CUDA Measured)
+==== radix sort, power-of-two ====
+ elapsed time: 2170.37ms (CUDA Measured)
+ passed
+==== radix sort, non-power-of-two ====
+ elapsed time: 2170.83ms (CUDA Measured)
+ passed
+```
+
+Radix sort is correct but much slower than Thrust, likely because this implementation launches many small kernels per bit pass and shuffles data through several temporary arrays. Fusing kernels and reusing buffers are natural future optimizations.
+
+---
+
+## Discussion & Bloopers
+
+### Buffer Management in Naive Scan
+In the naive parallel scan, ping-pong buffer management is critical. Initially, `cudaMemcpy()` was used to copy output back to input between iterations, which made performance poor, even worse than the CPU. Switching to in-kernel buffer updates (`in[index] = out[index]`) improved performance significantly. This highlights the importance of minimizing host-device memory transfers and maximizing device-side computation.
+
+### Timing and Measurement
+Accurate timing is essential for fair benchmarking. Timers (`std::chrono` for CPU, CUDA timers for GPU) are placed around only the algorithmic code, excluding memory allocation and management. Initially, timers were embedded within scan functions, which conflicted with stream compaction timing. Refactoring the scan logic into helper functions allowed for modular timing and better organization.
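A minimal illustration of that discipline follows (the names here are hypothetical, invented for the example, not the project's actual helpers): allocation stays outside the timed region, and only the algorithm itself is measured.

```cpp
#include <chrono>
#include <vector>

// Time only the algorithmic work; setup stays outside the timed region.
// (timedScanMs is a hypothetical helper for illustration only.)
double timedScanMs(const std::vector<int>& in, std::vector<int>& out) {
    out.assign(in.size(), 0);                      // allocation: not timed
    if (in.empty()) return 0.0;
    auto t0 = std::chrono::high_resolution_clock::now();
    for (std::size_t k = 1; k < in.size(); ++k) {  // timed: the scan itself
        out[k] = out[k - 1] + in[k - 1];
    }
    auto t1 = std::chrono::high_resolution_clock::now();
    return std::chrono::duration<double, std::milli>(t1 - t0).count();
}
```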
+
+### Lessons Learned
+- Efficient memory access and thread utilization are crucial for GPU performance.
+- Professional libraries like Thrust are highly optimized and difficult to match with custom implementations.
+- Kernel launch overhead and array management can dominate runtime in complex algorithms like radix sort.
+- Modular code organization aids in benchmarking and debugging.
+
+---
+
+## References
+
+Figures and pseudocode adapted from [University of Pennsylvania CIS 5650](https://github.com/CIS5650-Fall-2025) course materials.
diff --git a/img/compaction.png b/img/compaction.png
new file mode 100644
index 00000000..e59b8652
Binary files /dev/null and b/img/compaction.png differ
diff --git a/img/demo.gif b/img/demo.gif
new file mode 100644
index 00000000..ff66d8ed
Binary files /dev/null and b/img/demo.gif differ
diff --git a/img/downsweep.png b/img/downsweep.png
new file mode 100644
index 00000000..95ad21e1
Binary files /dev/null and b/img/downsweep.png differ
diff --git a/img/example-1.png b/img/example-1.png
deleted file mode 100644
index 28633a6f..00000000
Binary files a/img/example-1.png and /dev/null differ
diff --git a/img/example-2.jpg b/img/example-2.jpg
deleted file mode 100644
index 984c2fd2..00000000
Binary files a/img/example-2.jpg and /dev/null differ
diff --git a/img/figure-39-2.jpg b/img/figure-39-2.jpg
deleted file mode 100644
index bc9f9da7..00000000
Binary files a/img/figure-39-2.jpg and /dev/null differ
diff --git a/img/figure-39-4.jpg b/img/figure-39-4.jpg
deleted file mode 100644
index 5888f20d..00000000
Binary files a/img/figure-39-4.jpg and /dev/null differ
diff --git a/img/naive.png b/img/naive.png
new file mode 100644
index 00000000..6280a049
Binary files /dev/null and b/img/naive.png differ
diff --git a/img/plot_compact.png b/img/plot_compact.png
new file mode 100644
index 00000000..33dab3a7
Binary files /dev/null and b/img/plot_compact.png differ
diff --git a/img/plot_scan.png b/img/plot_scan.png
new file mode 100644
index 00000000..78d75bf3
Binary files /dev/null and b/img/plot_scan.png differ
diff --git a/img/radix_1.png b/img/radix_1.png
new file mode 100644
index 00000000..ac68ad28
Binary files /dev/null and b/img/radix_1.png differ
diff --git a/img/radix_2.png b/img/radix_2.png
new file mode 100644
index 00000000..f5db532c
Binary files /dev/null and b/img/radix_2.png differ
diff --git a/img/upsweep.png b/img/upsweep.png
new file mode 100644
index 00000000..9176d8a1
Binary files /dev/null and b/img/upsweep.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 3d5c8820..ceff567d 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -6,149 +6,231 @@
* @copyright University of Pennsylvania
*/
-#include
#include
-#include
#include
+#include
#include
+
+#include
+
#include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
-const int NPOT = SIZE - 3; // Non-Power-Of-Two
+const int SIZE = 1 << 21; // feel free to change the size of array
+const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
-int main(int argc, char* argv[]) {
- // Scan tests
-
- printf("\n");
- printf("****************\n");
- printf("** SCAN TESTS **\n");
- printf("****************\n");
-
- genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
- a[SIZE - 1] = 0;
- printArray(SIZE, a, true);
-
- // initialize b using StreamCompaction::CPU::scan you implement
- // We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
- // At first all cases passed because b && c are all zeroes.
- zeroArray(SIZE, b);
- printDesc("cpu scan, power-of-two");
- StreamCompaction::CPU::scan(SIZE, b, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(SIZE, b, true);
-
- zeroArray(SIZE, c);
- printDesc("cpu scan, non-power-of-two");
- StreamCompaction::CPU::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
-
- zeroArray(SIZE, c);
- printDesc("naive scan, power-of-two");
- StreamCompaction::Naive::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
-
- /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan
- onesArray(SIZE, c);
- printDesc("1s array for finding bugs");
- StreamCompaction::Naive::scan(SIZE, c, a);
- printArray(SIZE, c, true); */
-
- zeroArray(SIZE, c);
- printDesc("naive scan, non-power-of-two");
- StreamCompaction::Naive::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(NPOT, b, c);
-
- zeroArray(SIZE, c);
- printDesc("work-efficient scan, power-of-two");
- StreamCompaction::Efficient::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
-
- zeroArray(SIZE, c);
- printDesc("work-efficient scan, non-power-of-two");
- StreamCompaction::Efficient::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
-
- zeroArray(SIZE, c);
- printDesc("thrust scan, power-of-two");
- StreamCompaction::Thrust::scan(SIZE, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
- printCmpResult(SIZE, b, c);
-
- zeroArray(SIZE, c);
- printDesc("thrust scan, non-power-of-two");
- StreamCompaction::Thrust::scan(NPOT, c, a);
- printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(NPOT, c, true);
- printCmpResult(NPOT, b, c);
-
- printf("\n");
- printf("*****************************\n");
- printf("** STREAM COMPACTION TESTS **\n");
- printf("*****************************\n");
-
- // Compaction tests
-
- genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
- a[SIZE - 1] = 0;
- printArray(SIZE, a, true);
-
- int count, expectedCount, expectedNPOT;
-
- // initialize b using StreamCompaction::CPU::compactWithoutScan you implement
- // We use b for further comparison. Make sure your StreamCompaction::CPU::compactWithoutScan is correct.
- zeroArray(SIZE, b);
- printDesc("cpu compact without scan, power-of-two");
- count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- expectedCount = count;
- printArray(count, b, true);
- printCmpLenResult(count, expectedCount, b, b);
-
- zeroArray(SIZE, c);
- printDesc("cpu compact without scan, non-power-of-two");
- count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- expectedNPOT = count;
- printArray(count, c, true);
- printCmpLenResult(count, expectedNPOT, b, c);
-
- zeroArray(SIZE, c);
- printDesc("cpu compact with scan");
- count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
- printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
- printArray(count, c, true);
- printCmpLenResult(count, expectedCount, b, c);
-
- zeroArray(SIZE, c);
- printDesc("work-efficient compact, power-of-two");
- count = StreamCompaction::Efficient::compact(SIZE, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
- printCmpLenResult(count, expectedCount, b, c);
-
- zeroArray(SIZE, c);
- printDesc("work-efficient compact, non-power-of-two");
- count = StreamCompaction::Efficient::compact(NPOT, c, a);
- printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(count, c, true);
- printCmpLenResult(count, expectedNPOT, b, c);
-
- system("pause"); // stop Win32 console from closing on exit
- delete[] a;
- delete[] b;
- delete[] c;
+int main(int argc, char *argv[]) {
+ // Scan tests
+
+ printf("\n");
+ printf("****************\n");
+ printf("** SCAN TESTS **\n");
+ printf("****************\n");
+
+ genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
+ a[SIZE - 1] = 0;
+ // printArray(SIZE, a, true);
+
+ // initialize b using StreamCompaction::CPU::scan you implement
+ // We use b for further comparison. Make sure your StreamCompaction::CPU::scan
+ // is correct. At first all cases passed because b && c are all zeroes.
+ zeroArray(SIZE, b);
+ printDesc("cpu scan, power-of-two");
+ StreamCompaction::CPU::scan(SIZE, b, a);
+ printElapsedTime(
+ StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(),
+ "(std::chrono Measured)");
+ // printArray(SIZE, b, true);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu scan, non-power-of-two");
+ StreamCompaction::CPU::scan(NPOT, c, a);
+ printElapsedTime(
+ StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(),
+ "(std::chrono Measured)");
+ // printArray(NPOT, c, true);
+ printCmpResult(NPOT, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("naive scan, power-of-two");
+ StreamCompaction::Naive::scan(SIZE, c, a);
+ printElapsedTime(
+ StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ /* For bug-finding only: Array of 1s to help find bugs in stream compaction or
+ scan onesArray(SIZE, c); printDesc("1s array for finding bugs");
+ StreamCompaction::Naive::scan(SIZE, c, a);
+ printArray(SIZE, c, true); */
+
+ zeroArray(SIZE, c);
+ printDesc("naive scan, non-power-of-two");
+ StreamCompaction::Naive::scan(NPOT, c, a);
+ printElapsedTime(
+ StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(SIZE, c, true);
+ printCmpResult(NPOT, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient scan, power-of-two");
+ StreamCompaction::Efficient::scan(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer()
+ .getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient scan, non-power-of-two");
+ StreamCompaction::Efficient::scan(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer()
+ .getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(NPOT, c, true);
+ printCmpResult(NPOT, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("thrust scan, power-of-two");
+ StreamCompaction::Thrust::scan(SIZE, c, a);
+ printElapsedTime(
+ StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("thrust scan, non-power-of-two");
+ StreamCompaction::Thrust::scan(NPOT, c, a);
+ printElapsedTime(
+ StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(NPOT, c, true);
+ printCmpResult(NPOT, b, c);
+
+ printf("\n");
+ printf("*****************************\n");
+ printf("** STREAM COMPACTION TESTS **\n");
+ printf("*****************************\n");
+
+ // Compaction tests
+
+ genArray(SIZE - 1, a, 4); // Leave a 0 at the end to test that edge case
+ a[SIZE - 1] = 0;
+ // printArray(SIZE, a, true);
+
+ int count, expectedCount, expectedNPOT;
+
+ // initialize b using StreamCompaction::CPU::compactWithoutScan you implement
+ // We use b for further comparison. Make sure your
+ // StreamCompaction::CPU::compactWithoutScan is correct.
+ zeroArray(SIZE, b);
+ printDesc("cpu compact without scan, power-of-two");
+ count = StreamCompaction::CPU::compactWithoutScan(SIZE, b, a);
+ printElapsedTime(
+ StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(),
+ "(std::chrono Measured)");
+ expectedCount = count;
+ // printArray(count, b, true);
+ printCmpLenResult(count, expectedCount, b, b);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu compact without scan, non-power-of-two");
+ count = StreamCompaction::CPU::compactWithoutScan(NPOT, c, a);
+ printElapsedTime(
+ StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(),
+ "(std::chrono Measured)");
+ expectedNPOT = count;
+ // printArray(count, c, true);
+ printCmpLenResult(count, expectedNPOT, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("cpu compact with scan");
+ count = StreamCompaction::CPU::compactWithScan(SIZE, c, a);
+ printElapsedTime(
+ StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(),
+ "(std::chrono Measured)");
+ // printArray(count, c, true);
+ printCmpLenResult(count, expectedCount, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient compact, power-of-two");
+ count = StreamCompaction::Efficient::compact(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer()
+ .getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(count, c, true);
+ printCmpLenResult(count, expectedCount, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("work-efficient compact, non-power-of-two");
+ count = StreamCompaction::Efficient::compact(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer()
+ .getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(count, c, true);
+ printCmpLenResult(count, expectedNPOT, b, c);
+
+ // Radix sort tests (compare to thrust sort result)
+ printf("\n");
+ printf("**********************\n");
+ printf("** RADIX SORT TESTS **\n");
+ printf("**********************\n");
+
+ int *d = new int[SIZE];
+ int *d_gold = new int[SIZE];
+ int *d_test = new int[SIZE];
+ int *e = new int[NPOT];
+ int *e_gold = new int[NPOT];
+ int *e_test = new int[NPOT];
+
+ genArray(SIZE - 1, d, 50); // Leave a 0 at the end to test that edge case
+ d[SIZE - 1] = 0;
+ // printArray(SIZE, d, true);
+ genArray(NPOT - 1, e, 50);
+ e[NPOT - 1] = 0;
+ // printArray(SIZE, e, true);
+
+ zeroArray(SIZE, d_gold);
+ printDesc("thrust sort, power-of-two");
+ StreamCompaction::Thrust::thrustSort(SIZE, d_gold, d);
+ printElapsedTime(
+ StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(SIZE, d_gold, true);
+
+ zeroArray(NPOT, e_gold);
+ printDesc("thrust sort, non-power-of-two");
+ StreamCompaction::Thrust::thrustSort(NPOT, e_gold, e);
+ printElapsedTime(
+ StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(NPOT, e_gold, true);
+
+ zeroArray(SIZE, d_test);
+ printDesc("radix sort, power-of-two");
+ StreamCompaction::Efficient::sort(SIZE, d_test, d);
+ printElapsedTime(StreamCompaction::Efficient::timer()
+ .getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(SIZE, d_test, true);
+ printCmpResult(SIZE, d_gold, d_test);
+
+  zeroArray(NPOT, e_test);
+ printDesc("radix sort, non-power-of-two");
+ StreamCompaction::Efficient::sort(NPOT, e_test, e);
+ printElapsedTime(StreamCompaction::Efficient::timer()
+ .getGpuElapsedTimeForPreviousOperation(),
+ "(CUDA Measured)");
+ // printArray(SIZE, e_test, true);
+ printCmpResult(NPOT, e_gold, e_test);
+
+ system("pause"); // stop Win32 console from closing on exit
+ delete[] a;
+ delete[] b;
+  delete[] c;
+  delete[] d;
+  delete[] d_gold;
+  delete[] d_test;
+  delete[] e;
+  delete[] e_gold;
+  delete[] e_test;
}
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d630..ed9cb3ef 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -1,39 +1,47 @@
#include "common.h"
void checkCUDAErrorFn(const char *msg, const char *file, int line) {
- cudaError_t err = cudaGetLastError();
- if (cudaSuccess == err) {
- return;
- }
+ cudaError_t err = cudaGetLastError();
+ if (cudaSuccess == err) {
+ return;
+ }
- fprintf(stderr, "CUDA error");
- if (file) {
- fprintf(stderr, " (%s:%d)", file, line);
- }
- fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
- exit(EXIT_FAILURE);
+ fprintf(stderr, "CUDA error");
+ if (file) {
+ fprintf(stderr, " (%s:%d)", file, line);
+ }
+ fprintf(stderr, ": %s: %s\n", msg, cudaGetErrorString(err));
+ exit(EXIT_FAILURE);
}
-
namespace StreamCompaction {
- namespace Common {
-
- /**
- * Maps an array to an array of 0s and 1s for stream compaction. Elements
- * which map to 0 will be removed, and elements which map to 1 will be kept.
- */
- __global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
- // TODO
- }
+namespace Common {
- /**
- * Performs scatter on an array. That is, for each element in idata,
- * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
- */
- __global__ void kernScatter(int n, int *odata,
- const int *idata, const int *bools, const int *indices) {
- // TODO
- }
+/**
+ * Maps an array to two output arrays of 0s and 1s for stream compaction. Elements
+ * which map to 0 will be removed, and elements which map to 1 will be kept.
+ */
+__global__ void kernMapToBoolean(int n, int *bools1, int *bools2, const int *idata) {
+ // TODO
+ int index = blockDim.x * blockIdx.x + threadIdx.x;
+ if (index < n) {
+ int val = (idata[index] != 0) ? 1 : 0;
+ bools1[index] = val;
+ bools2[index] = val;
+ }
+}
- }
+/**
+ * Performs scatter on an array. That is, for each element in idata,
+ * if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
+ */
+__global__ void kernScatter(int n, int *odata, const int *idata,
+ const int *bools, const int *indices) {
+ // TODO
+ int index = blockDim.x * blockIdx.x + threadIdx.x;
+ if (index < n && bools[index]) {
+ odata[indices[index]] = idata[index];
+ }
}
+} // namespace Common
+} // namespace StreamCompaction
diff --git a/stream_compaction/common.h b/stream_compaction/common.h
index d2c1fed9..9e02878a 100644
--- a/stream_compaction/common.h
+++ b/stream_compaction/common.h
@@ -12,6 +12,7 @@
#define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__)
#define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__)
+#define blockSize 128
/**
* Check for CUDA errors; print and exit if there was a problem.
@@ -32,7 +33,7 @@ inline int ilog2ceil(int x) {
namespace StreamCompaction {
namespace Common {
- __global__ void kernMapToBoolean(int n, int *bools, const int *idata);
+ __global__ void kernMapToBoolean(int n, int *bools1, int *bools2, const int *idata);
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices);
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa115..158b2b6c 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -1,50 +1,85 @@
#include
-#include "cpu.h"
#include "common.h"
+#include "cpu.h"
namespace StreamCompaction {
- namespace CPU {
- using StreamCompaction::Common::PerformanceTimer;
- PerformanceTimer& timer()
- {
- static PerformanceTimer timer;
- return timer;
- }
+namespace CPU {
+using StreamCompaction::Common::PerformanceTimer;
+PerformanceTimer &timer() {
+ static PerformanceTimer timer;
+ return timer;
+}
+
+/**
+ * CPU scan implementation without timer.
+ */
+void __scan(int n, int *odata, const int *idata) {
+ odata[0] = 0;
+ for (int i = 1; i < n; i++) {
+ odata[i] = odata[i - 1] + idata[i - 1];
+ }
+}
- /**
- * CPU scan (prefix sum).
- * For performance analysis, this is supposed to be a simple for loop.
- * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
- */
- void scan(int n, int *odata, const int *idata) {
- timer().startCpuTimer();
- // TODO
- timer().endCpuTimer();
- }
+/**
+ * CPU scan (prefix sum).
+ * For performance analysis, this is supposed to be a simple for loop.
+ * (Optional) For better understanding before starting moving to GPU, you can
+ * simulate your GPU scan in this function first.
+ */
+void scan(int n, int *odata, const int *idata) {
+ timer().startCpuTimer();
+ // TODO
+ __scan(n, odata, idata);
+ timer().endCpuTimer();
+}
- /**
- * CPU stream compaction without using the scan function.
- *
- * @returns the number of elements remaining after compaction.
- */
- int compactWithoutScan(int n, int *odata, const int *idata) {
- timer().startCpuTimer();
- // TODO
- timer().endCpuTimer();
- return -1;
- }
+/**
+ * CPU stream compaction without using the scan function.
+ *
+ * @returns the number of elements remaining after compaction.
+ */
+int compactWithoutScan(int n, int *odata, const int *idata) {
+ timer().startCpuTimer();
+ // TODO
+ int count = 0;
+ for (int i = 0; i < n; i++) {
+ if (idata[i]) {
+ odata[count] = idata[i];
+ count++;
+ }
+ }
+ timer().endCpuTimer();
+ return count;
+}
- /**
- * CPU stream compaction using scan and scatter, like the parallel version.
- *
- * @returns the number of elements remaining after compaction.
- */
- int compactWithScan(int n, int *odata, const int *idata) {
- timer().startCpuTimer();
- // TODO
- timer().endCpuTimer();
- return -1;
- }
+/**
+ * CPU stream compaction using scan and scatter, like the parallel version.
+ *
+ * @returns the number of elements remaining after compaction.
+ */
+int compactWithScan(int n, int *odata, const int *idata) {
+ // TODO
+ int *bArr = new int[n];
+ int *scanResult = new int[n];
+ int count = 0;
+ timer().startCpuTimer();
+ for (int i = 0; i < n; i++) {
+ bArr[i] = (idata[i] == 0) ? 0 : 1;
+ }
+ __scan(n, scanResult, bArr);
+ for (int i = 0; i < n; i++) {
+ if (bArr[i]) {
+ odata[scanResult[i]] = idata[i];
+ count++;
}
+ }
+ timer().endCpuTimer();
+ delete[] bArr;
+ delete[] scanResult;
+ bArr = nullptr;
+ scanResult = nullptr;
+ return count;
}
+} // namespace CPU
+} // namespace StreamCompaction
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346ee..71afbcc2 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -1,40 +1,284 @@
#include
#include
+
#include "common.h"
#include "efficient.h"
namespace StreamCompaction {
- namespace Efficient {
- using StreamCompaction::Common::PerformanceTimer;
- PerformanceTimer& timer()
- {
- static PerformanceTimer timer;
- return timer;
- }
-
- /**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
- */
- void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
- }
-
- /**
- * Performs stream compaction on idata, storing the result into odata.
- * All zeroes are discarded.
- *
- * @param n The number of elements in idata.
- * @param odata The array into which to store elements.
- * @param idata The array of elements to compact.
- * @returns The number of elements remaining after compaction.
- */
- int compact(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
- return -1;
- }
- }
+namespace Efficient {
+using StreamCompaction::Common::PerformanceTimer;
+PerformanceTimer &timer() {
+ static PerformanceTimer timer;
+ return timer;
+}
+
+__global__ void kernUpSweep(const int N, const int numNodes, const int stepSize,
+ int *odata) {
+ int index = blockDim.x * blockIdx.x + threadIdx.x;
+ if (index >= numNodes) return;
+ int k = stepSize * (index + 1) - 1;
+ int left = k - (stepSize >> 1);
+ odata[k] = odata[k] + odata[left];
+}
+
+__global__ void kernDownSweep(const int N, const int numNodes,
+ const int stepSize, int *odata) {
+ int index = blockIdx.x * blockDim.x + threadIdx.x;
+ if (index >= numNodes) return;
+ int k = stepSize * (index + 1) - 1;
+ int left = k - (stepSize >> 1);
+ int temp = odata[left];
+ odata[left] = odata[k];
+ odata[k] = temp + odata[k];
+}
+
+/**
+ * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+void scan(int n, int *odata, const int *idata) {
+ int N = (1 << ilog2ceil(n));
+ int *dev_odata;
+ cudaMalloc((void **)&dev_odata, N * sizeof(int));
+ checkCUDAError("efficientScan cudaMalloc dev_odata failed!");
+ cudaMemset(dev_odata, 0, N * sizeof(int));
+ checkCUDAError("efficientScan cudaMemset dev_odata failed!");
+ cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("efficientScan cudaMemcpy to device failed!");
+ cudaDeviceSynchronize();
+
+ timer().startGpuTimer();
+ for (int d = 0; d < ilog2ceil(n); d++) {
+ int numNodes = N >> (d + 1);
+ dim3 fullBlocksPerGrid((numNodes + blockSize - 1) / blockSize);
+    kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(N, numNodes, 1 << (d + 1),
+                                                  dev_odata);
+ checkCUDAError("efficientScan kernUpSweep failed!");
+ cudaDeviceSynchronize();
+ }
+ cudaMemset(dev_odata + N - 1, 0, sizeof(int)); // set last element to 0
+ checkCUDAError("efficientScan cudaMemset after UpSweep failed!");
+ cudaDeviceSynchronize();
+ for (int d = ilog2ceil(n) - 1; d >= 0; d--) {
+ int numNodes = N >> (d + 1);
+ dim3 fullBlocksPerGrid((numNodes + blockSize - 1) / blockSize);
+    kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(N, numNodes, 1 << (d + 1),
+                                                    dev_odata);
+ checkCUDAError("efficientScan kernDownSweep failed!");
+ cudaDeviceSynchronize();
+ }
+ cudaDeviceSynchronize();
+ timer().endGpuTimer();
+ cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("efficientScan cudaMemcpy to host failed!");
+ cudaFree(dev_odata);
+}
+
+/**
+ * Performs an exclusive scan in place on device array dev_odata.
+ * N must be a power of two.
+ */
+void dev_scan(int N, int *dev_odata) {
+ for (int d = 0; d < ilog2ceil(N); d++) {
+ int numNodes = N >> (d + 1);
+ dim3 fullBlocksPerGrid((numNodes + blockSize - 1) / blockSize);
+    kernUpSweep<<<fullBlocksPerGrid, blockSize>>>(N, numNodes, 1 << (d + 1),
+                                                  dev_odata);
+ checkCUDAError("efficientScan kernUpSweep failed!");
+ cudaDeviceSynchronize();
+ }
+ cudaMemset(dev_odata + N - 1, 0, sizeof(int)); // set last element to 0
+ checkCUDAError("efficientScan cudaMemset after UpSweep failed!");
+ cudaDeviceSynchronize();
+ for (int d = ilog2ceil(N) - 1; d >= 0; d--) {
+ int numNodes = N >> (d + 1);
+ dim3 fullBlocksPerGrid((numNodes + blockSize - 1) / blockSize);
+    kernDownSweep<<<fullBlocksPerGrid, blockSize>>>(N, numNodes, 1 << (d + 1),
+                                                    dev_odata);
+ checkCUDAError("efficientScan kernDownSweep failed!");
+ cudaDeviceSynchronize();
+ }
+}
+
+/**
+ * Performs stream compaction on idata, storing the result into odata.
+ * All zeroes are discarded.
+ *
+ * @param n The number of elements in idata.
+ * @param odata The array into which to store elements.
+ * @param idata The array of elements to compact.
+ * @returns The number of elements remaining after compaction.
+ */
+int compact(int n, int *odata, const int *idata) {
+ int *dev_idata;
+ int *dev_odata;
+ int *dev_bArr;
+ int *dev_scanResult;
+ int N = (1 << ilog2ceil(n));
+ cudaMalloc((void **)&dev_idata, N * sizeof(int));
+ checkCUDAError("efficientCompact cudaMalloc dev_idata failed!");
+ cudaMalloc((void **)&dev_odata, N * sizeof(int));
+ checkCUDAError("efficientCompact cudaMalloc dev_odata failed!");
+ cudaMalloc((void **)&dev_bArr, N * sizeof(int));
+ checkCUDAError("efficientCompact cudaMalloc dev_bArr failed!");
+ cudaMalloc((void **)&dev_scanResult, N * sizeof(int));
+ checkCUDAError("efficientCompact cudaMalloc dev_scanResult failed!");
+ cudaMemset(dev_idata, 0, N * sizeof(int));
+ checkCUDAError("efficientCompact cudaMemset dev_idata failed!");
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("efficientCompact cudaMemcpy dev_idata failed!");
+  cudaMemcpy(dev_odata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("efficientCompact cudaMemcpy dev_odata failed!");
+ cudaDeviceSynchronize();
+
+  dim3 fullBlocksPerGrid((N + blockSize - 1) / blockSize);
+
+ timer().startGpuTimer(); // start timer
+  StreamCompaction::Common::kernMapToBoolean<<<fullBlocksPerGrid, blockSize>>>(
+      N, dev_bArr, dev_scanResult, dev_idata);
+ checkCUDAError("efficientCompact kernMapToBoolean failed!");
+ cudaDeviceSynchronize();
+ dev_scan(N, dev_scanResult);
+ cudaDeviceSynchronize();
+ checkCUDAError("efficientCompact scan failed!");
+
+  StreamCompaction::Common::kernScatter<<<fullBlocksPerGrid, blockSize>>>(
+      N, dev_odata, dev_idata, dev_bArr, dev_scanResult);
+ cudaDeviceSynchronize();
+ checkCUDAError("efficientCompact kernScatter failed!");
+
+ timer().endGpuTimer(); // end timer
+
+ int lastElement, numElements;
+ cudaMemcpy(&lastElement, dev_idata + n - 1, sizeof(int),
+ cudaMemcpyDeviceToHost);
+ checkCUDAError("efficientCompact cudaMemcpy lastElement failed!");
+ cudaMemcpy(&numElements, dev_scanResult + n - 1, sizeof(int),
+ cudaMemcpyDeviceToHost);
+ checkCUDAError("efficientCompact cudaMemcpy numElements failed!");
+ numElements += (lastElement == 0) ? 0 : 1;
+ cudaMemcpy(odata, dev_odata, numElements * sizeof(int),
+ cudaMemcpyDeviceToHost);
+ checkCUDAError("efficientCompact cudaMemcpy to host failed!");
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ cudaFree(dev_bArr);
+ cudaFree(dev_scanResult);
+ return numElements;
+}
+
+/**
+ * Computes, for the given bit position lsb:
+ *   b array - the value of that bit for each element,
+ *   e array - the negation of the b array,
+ *   f array - a copy of the e array, used as input to the subsequent
+ *             in-place scan that yields destination indices for false keys.
+ */
+__global__ void kernComputeBEF(const int n, int *b, int *e, int *f,
+ const int *idata, const int lsb) {
+ int index = blockDim.x * blockIdx.x + threadIdx.x;
+ if (index < n) {
+ int bitmask = 1 << lsb;
+ int bVal = (idata[index] & bitmask) >> lsb;
+ int eVal = 1 - bVal;
+ b[index] = bVal;
+ e[index] = eVal;
+ f[index] = eVal;
+ }
+}
+
+/**
+ * Compute the t array - index for true values.
+ */
+__global__ void kernComputeT(const int n, int *t, const int *f,
+ const int totalFalses) {
+ int index = blockDim.x * blockIdx.x + threadIdx.x;
+ if (index < n) {
+ t[index] = index - f[index] + totalFalses;
+ }
+}
+
+/**
+ * Compute the d array - destination index.
+ */
+__global__ void kernComputeD(const int n, int *d, const int *b, const int *t,
+ int *f) {
+ int index = blockDim.x * blockIdx.x + threadIdx.x;
+ if (index < n) {
+ d[index] = b[index] ? t[index] : f[index];
+ }
+}
+
+__global__ void kernRadixScatter(const int n, int *odata, const int *idata,
+ const int *d) {
+ int index = blockDim.x * blockIdx.x + threadIdx.x;
+ if (index < n) {
+ odata[d[index]] = idata[index];
+ }
+}
+
+/**
+ * Performs radix sort on idata, storing the result into odata.
+ */
+void sort(int n, int *odata, const int *idata) {
+ int *dev_idata, *dev_odata, *dev_b, *dev_e, *dev_f, *dev_t, *dev_d;
+ int N = (1 << ilog2ceil(n));
+ cudaMalloc((void **)&dev_idata, n * sizeof(int));
+ checkCUDAError("efficientSort cudaMalloc dev_idata failed!");
+ cudaMalloc((void **)&dev_odata, n * sizeof(int));
+ checkCUDAError("efficientSort cudaMalloc dev_odata failed!");
+ cudaMalloc((void **)&dev_b, n * sizeof(int));
+ checkCUDAError("efficientSort cudaMalloc dev_b failed!");
+ cudaMalloc((void **)&dev_e, n * sizeof(int));
+ checkCUDAError("efficientSort cudaMalloc dev_e failed!");
+ cudaMalloc((void **)&dev_f, N * sizeof(int));
+ checkCUDAError("efficientSort cudaMalloc dev_f failed!");
+ cudaMalloc((void **)&dev_t, n * sizeof(int));
+ checkCUDAError("efficientSort cudaMalloc dev_t failed!");
+ cudaMalloc((void **)&dev_d, n * sizeof(int));
+ checkCUDAError("efficientSort cudaMalloc dev_d failed!");
+ cudaMemset(dev_f, 0, N * sizeof(int));
+ checkCUDAError("efficientSort cudaMemset dev_f failed!");
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("efficientSort cudaMemcpy dev_idata failed!");
+ cudaDeviceSynchronize();
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+ timer().startGpuTimer(); // start timer
+  // One split pass per bit; assumes non-negative inputs (the sign bit of
+  // negative values would need separate handling).
+  for (int lsb = 0; lsb < (int)(8 * sizeof(int)); lsb++) {
+    kernComputeBEF<<<fullBlocksPerGrid, blockSize>>>(n, dev_b, dev_e, dev_f,
+                                                     dev_idata, lsb);
+ checkCUDAError("efficientSort kernMapToBit failed!");
+ cudaDeviceSynchronize();
+ dev_scan(N, dev_f);
+ cudaDeviceSynchronize();
+ int totalFalses, lastFalse;
+ cudaMemcpy(&totalFalses, dev_f + n - 1, sizeof(int),
+ cudaMemcpyDeviceToHost);
+ cudaMemcpy(&lastFalse, dev_e + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ totalFalses += lastFalse;
+    kernComputeT<<<fullBlocksPerGrid, blockSize>>>(n, dev_t, dev_f,
+                                                   totalFalses);
+ checkCUDAError("efficientSort kernComputeT failed!");
+ cudaDeviceSynchronize();
+    kernComputeD<<<fullBlocksPerGrid, blockSize>>>(n, dev_d, dev_b, dev_t,
+                                                   dev_f);
+ checkCUDAError("efficientSort kernComputeD failed!");
+ cudaDeviceSynchronize();
+    kernRadixScatter<<<fullBlocksPerGrid, blockSize>>>(n, dev_odata, dev_idata,
+                                                       dev_d);
+ std::swap(dev_idata, dev_odata);
+ }
+ cudaDeviceSynchronize();
+ timer().endGpuTimer(); // end timer
+ cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("efficientSort cudaMemcpy to host failed!");
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ cudaFree(dev_b);
+ cudaFree(dev_e);
+ cudaFree(dev_f);
+ cudaFree(dev_t);
+ cudaFree(dev_d);
}
+} // namespace Efficient
+} // namespace StreamCompaction
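The up-sweep/down-sweep pair above implements a Blelloch work-efficient exclusive scan. A host-side model with the same index math as `kernUpSweep`/`kernDownSweep` (each level of the inner loops corresponds to one kernel launch; `blellochScan` is an illustrative name):

```cpp
#include <cassert>
#include <vector>

// Host-side model of the work-efficient scan (N must be a power of two).
void blellochScan(int N, std::vector<int>& data) {
    int levels = 0;
    while ((1 << levels) < N) levels++;
    for (int d = 0; d < levels; d++) {       // up-sweep: build partial sums
        int stepSize = 1 << (d + 1);
        int numNodes = N >> (d + 1);
        for (int i = 0; i < numNodes; i++) {
            int k = stepSize * (i + 1) - 1;
            int left = k - (stepSize >> 1);
            data[k] += data[left];
        }
    }
    data[N - 1] = 0;                         // clear the root
    for (int d = levels - 1; d >= 0; d--) {  // down-sweep: distribute prefixes
        int stepSize = 1 << (d + 1);
        int numNodes = N >> (d + 1);
        for (int i = 0; i < numNodes; i++) {
            int k = stepSize * (i + 1) - 1;
            int left = k - (stepSize >> 1);
            int temp = data[left];
            data[left] = data[k];
            data[k] += temp;
        }
    }
}
```

Because output position j depends only on inputs before j, padding the array with zeros up to the next power of two leaves the first n results unchanged, which is why the device code can zero-pad to N.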
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4fe..d54cb31b 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -9,5 +9,7 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata);
int compact(int n, int *odata, const int *idata);
+
+ void sort(int n, int *odata, const int *idata);
}
}
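The radix sort in `efficient.cu` repeats a stable "split" pass per bit, using the b/e/f/t/d arrays described in its comments. A host-side model of one pass (the name `splitPass` is illustrative):

```cpp
#include <cassert>
#include <vector>

// One stable radix split: keys whose bit at lsb is 0 move (in order) before
// keys whose bit is 1. Mirrors kernComputeBEF/T/D and kernRadixScatter.
void splitPass(std::vector<int>& keys, int lsb) {
    int n = (int)keys.size();
    std::vector<int> b(n), e(n), f(n), t(n), out(n);
    for (int i = 0; i < n; i++) {
        b[i] = (keys[i] >> lsb) & 1;  // bit value
        e[i] = 1 - b[i];              // negation
    }
    int running = 0;                  // exclusive scan of e into f
    for (int i = 0; i < n; i++) { f[i] = running; running += e[i]; }
    int totalFalses = f[n - 1] + e[n - 1];
    for (int i = 0; i < n; i++) t[i] = i - f[i] + totalFalses;
    for (int i = 0; i < n; i++) out[b[i] ? t[i] : f[i]] = keys[i];
    keys = out;
}
```

Stability of each pass is what makes iterating from the least to the most significant bit produce a fully sorted array.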
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 43088769..54e63d60 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -1,25 +1,62 @@
#include <cuda.h>
#include <cuda_runtime.h>
+
#include "common.h"
#include "naive.h"
namespace StreamCompaction {
- namespace Naive {
- using StreamCompaction::Common::PerformanceTimer;
- PerformanceTimer& timer()
- {
- static PerformanceTimer timer;
- return timer;
- }
- // TODO: __global__
-
- /**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
- */
- void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
- }
+namespace Naive {
+using StreamCompaction::Common::PerformanceTimer;
+PerformanceTimer &timer() {
+ static PerformanceTimer timer;
+ return timer;
+}
+__global__ void kernNaiveScan(int n, int offset, int *odata, const int *idata) {
+ int index = blockDim.x * blockIdx.x + threadIdx.x;
+ if (index < n) {
+ odata[index] = idata[index];
+ if (index >= offset) {
+ odata[index] = idata[index - offset] + idata[index];
}
+ }
+}
+
+/**
+ * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+void scan(int n, int *odata, const int *idata) {
+ int *dev_idata;
+ int *dev_odata;
+ cudaMalloc((void **)&dev_idata, n * sizeof(int));
+ checkCUDAError("naiveScan cudaMalloc dev_idata failed!");
+ cudaMalloc((void **)&dev_odata, n * sizeof(int));
+ checkCUDAError("naiveScan cudaMalloc dev_odata failed!");
+ cudaMemset(dev_idata, 0, n * sizeof(int));
+ checkCUDAError("naiveScan cudaMemset dev_idata failed!");
+ cudaMemcpy(dev_idata + 1, idata, (n - 1) * sizeof(int),
+ cudaMemcpyHostToDevice);
+ checkCUDAError("naiveScan cudaMemcpy to device failed!");
+ cudaDeviceSynchronize();
+
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+ timer().startGpuTimer();
+ for (int d = 1; d <= ilog2ceil(n); d++) {
+    kernNaiveScan<<<fullBlocksPerGrid, blockSize>>>(n, 1 << (d - 1), dev_odata,
+                                                    dev_idata);
+    // no explicit sync needed: kernels on the default stream execute in order
+ std::swap(dev_odata, dev_idata);
+ }
+ cudaDeviceSynchronize();
+ timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("naiveScan cudaMemcpy to host failed!");
+ cudaDeviceSynchronize();
+ cudaFree(dev_odata);
+ cudaFree(dev_idata);
}
+} // namespace Naive
+} // namespace StreamCompaction
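The naive scan above is a Hillis-Steele scan with ping-pong buffers; shifting the input right by one (with slot 0 zeroed) is what turns the inclusive-style passes into an exclusive scan. A host-side model (the name `naiveScan` is illustrative):

```cpp
#include <cassert>
#include <vector>

// Host-side model of the naive (Hillis-Steele) scan. The shift-by-one on
// input mirrors the cudaMemcpy(dev_idata + 1, idata, (n - 1) * ...) above.
std::vector<int> naiveScan(const std::vector<int>& idata) {
    int n = (int)idata.size();
    std::vector<int> in(n, 0), out(n);
    for (int i = 1; i < n; i++) in[i] = idata[i - 1];  // shift by one
    int levels = 0;
    while ((1 << levels) < n) levels++;
    for (int d = 1; d <= levels; d++) {
        int offset = 1 << (d - 1);
        for (int i = 0; i < n; i++)
            out[i] = (i >= offset) ? in[i - offset] + in[i] : in[i];
        std::swap(in, out);  // ping-pong, as in the kernel loop
    }
    return in;
}
```

Each level doubles the reach of the partial sums, so ilog2ceil(n) passes suffice, but the algorithm performs O(n log n) additions versus the work-efficient scan's O(n).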
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e7..e61f35fb 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -3,26 +3,39 @@
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/scan.h>
+#include <thrust/sort.h>
+
#include "common.h"
#include "thrust.h"
namespace StreamCompaction {
- namespace Thrust {
- using StreamCompaction::Common::PerformanceTimer;
- PerformanceTimer& timer()
- {
- static PerformanceTimer timer;
- return timer;
- }
- /**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
- */
- void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO use `thrust::exclusive_scan`
- // example: for device_vectors dv_in and dv_out:
- // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
- timer().endGpuTimer();
- }
- }
+namespace Thrust {
+using StreamCompaction::Common::PerformanceTimer;
+PerformanceTimer &timer() {
+ static PerformanceTimer timer;
+ return timer;
+}
+/**
+ * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+void scan(int n, int *odata, const int *idata) {
+  thrust::device_vector<int> dev_idata(idata, idata + n);
+  thrust::device_vector<int> dev_odata(n);
+ timer().startGpuTimer();
+ thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin());
+ timer().endGpuTimer();
+ thrust::copy(dev_odata.begin(), dev_odata.end(), odata);
+}
+
+void thrustSort(int n, int *odata, const int *idata) {
+  thrust::device_vector<int> dev_idata(idata, idata + n);
+ timer().startGpuTimer();
+ thrust::sort(thrust::device, dev_idata.begin(), dev_idata.end());
+ timer().endGpuTimer();
+ thrust::copy(dev_idata.begin(), dev_idata.end(), odata);
}
+} // namespace Thrust
+} // namespace StreamCompaction
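On the host, C++17's `std::exclusive_scan` computes the same result as `thrust::exclusive_scan` on the device, which is a convenient way to validate the GPU scans against a reference (the name `referenceScan` is illustrative):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Reference exclusive scan for validating the GPU implementations.
std::vector<int> referenceScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::exclusive_scan(in.begin(), in.end(), out.begin(), 0);  // identity 0
    return out;
}
```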
diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h
index fe98206b..a820543e 100644
--- a/stream_compaction/thrust.h
+++ b/stream_compaction/thrust.h
@@ -7,5 +7,7 @@ namespace StreamCompaction {
StreamCompaction::Common::PerformanceTimer& timer();
void scan(int n, int *odata, const int *idata);
+
+ void thrustSort(int n, int *odata, const int *idata);
}
}