diff --git a/README.md b/README.md
index 0e38ddb1..8e65c8bd 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,188 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Marcus Hedlund
+ * [LinkedIn](https://www.linkedin.com/in/marcushedlund/)
+* Tested on: Windows 11, Intel Core Ultra 9 185H @ 2.5 GHz 16GB, NVIDIA GeForce RTX 4070 Laptop GPU 8GB (Personal Computer)
-### (TODO: Your README)
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+# Overview
+In this project I implemented multiple versions of the exclusive scan (prefix sum) algorithm and used them to build stream compaction and radix sort. The implementations range from a baseline CPU version to a work efficient GPU version using shared memory and memory bank conflict avoidance. I also compared my implementations against Thrust's implementation of scan for performance benchmarking.
+
+First, a brief description of each algorithm:
+* Scan: Calculate all prefix sums of an array
+* Stream Compaction: Given an array, create a new array with all elements that fit a certain criteria
+* Radix Sort: Efficiently sort an integer array by processing individual bits
+
+||||
+|:--:|:--:|:--:|
+
+
Scan, Stream Compaction, and Radix Sort adapted from CIS 5650 slides
+
+# Implementations
+
+### CPU Baselines
+
+The CPU versions of all three algorithms are straightforward:
+* Scan: We use a for loop that iterates through the array while maintaining a running sum
+* Stream Compaction: We use a two-pointer approach while iterating through the array keeping track of both our current element and where the next desired element should go.
+* Sort: We simply use std::sort as a reference to check my GPU implementation of radix sort
+
+### Naive GPU Scan
+The naive scan works through the prefix sum problem in a series of kernel launches. In each pass, every thread updates its element by adding the value of an element at a certain offset earlier in the array. By updating the array in place and doubling the offset each pass, by the end of `ilog2(n)` total passes each element will hold the sum of all of the previous values.
+||
+|:--:|
+|GPU Naive Scan from CIS 5650 slides|
+
+### Work Efficient GPU Scan
+To improve on the naive approach, work efficient scan organizes the array as a balanced binary tree and computes the prefix sums in two phases. In the upsweep phase, threads do a parallel reduction that combines elements to create partial sums going up the tree until the root holds the total sum. The downsweep phase then traverses down the tree, piecing the sums together to compute the correct prefix sum values at each leaf.
+||
+|:--:|:--:|
+|Work Efficient Upsweep from CIS 5650 slides|Work Efficient Downsweep from CIS 5650 slides|
+
+
+### Shared Memory GPU Scan
+The shared memory GPU scan is based on the work efficient implementation but leverages shared memory over global memory to improve lookup speed. Each thread loads two elements from global memory into shared memory so that each block can process up to 2048 elements effectively. For arrays larger than 2048 elements, we first divide the original array into tiles of length 2048 that are then processed in parallel with the total sums of each tile being stored into a block sums array. This block sums array is then itself scanned using the shared memory implementations and the results are added back to each tile to compute the final prefix sums.
+
+
+
+
+ Shared Memory Scan from GPU Gems 39.2.4
+
+
+
+### Thrust Scan
+I also made a simple wrapper around Thrust's scan function to compare my implementations against.
+
+### GPU Stream Compaction
+The GPU implementation of stream compaction directly leverages scan. First, the input array is mapped to a boolean array based on which elements should be kept. Next, the boolean array is scanned to produce an array of target indices. Finally, a scatter step writes each desired element into its correct position in the compacted array by using both the boolean and scan outputs.
+
+|
+|:--:|
+|GPU Stream Compaction from CIS 5650 slides|
+
+### GPU Radix Sort
+Lastly, I implemented radix sort on the GPU leveraging both scan and scatter. Radix sort works by processing the binary representation of each integer starting from the least significant bit until the most significant. At each iteration, the algorithm partitions the array so that all elements with a zero in the current bit appear before any with a one, while preserving the relative order of elements in each of the partitions.
+|
+|:--:|
+|CPU Radix Sort from CIS 5650 slides|
+
+On the GPU each iteration is performed by first mapping each element to a boolean array based on the current bit. Then, in several steps including a scan pass we calculate the new positions to scatter each element to. Lastly, we scatter the elements based on the boolean and new position arrays. Repeating this process for each bit gives us the fully sorted array.
+|||
+|:--:|:--:|
+|GPU Radix Sort scan step from CIS 5650 slides| GPU Radix Sort scatter step from CIS 5650 slides|
+
+# Performance Analysis
+
+### Data Collection
+
+For performance measurement I used `std::chrono` to record CPU timings and CUDA events for GPU timings. All GPU measurements excluded initial and final memory operations such as copying data to and from the host and device for fair comparison. Additionally each data point was collected over ten trials and then averaged, and the complete data can be found in the [data](https://github.com/mhedlund7/Project2-Stream-Compaction/tree/main/data) folder.
+
+### Varying Block Count
+
+
+
+For all the implementations varying the block size did not have a large effect on performance. In general, a block size of 256 threads seemed to work pretty well across all implementations, though the work efficient scan showed slightly better performance at 128 threads and the naive scan was slightly better at 256 threads. Overall though the block size did not seem like a major factor towards increasing the speed of the implementations.
+
+### Varying Array Size
+
+|||
+|:--:|:--:|
+|Various Implementations: Time vs Array Size|Shared Memory vs Global Memory Comparison|
+
+When varying array size though the results were much clearer. The CPU and naive GPU scans performed the worst for large inputs, with the naive GPU scan surprisingly even falling behind the CPU likely because of how much global memory usage it needs each iteration. The work efficient GPU scan greatly improved over both the previous implementations and using shared memory optimized it even further with the largest gains at nearly a ten times speedup compared to the normal work efficient version. Thrust's scan however consistently outperformed all of my results although my shared memory implementation came quite close on smaller array sizes. By profiling with Nsight Compute I was able to look more into these differences:
+
+### Performance Comparisons
+
+|||
+|:--:|:--:|
+|Naive NSIGHT Compute Output|Work Efficient NSIGHT Compute Output|
+||||
+|Shared Memory NSIGHT Compute Output|Thrust NSIGHT Compute Output|
+
+In Nsight Compute each iteration of the naive implementation shows a memory throughput utilization of around 80% while the work efficient version also reaches around 80% early in the upsweep and late in the downsweep, but utilizes less memory when it launches fewer threads higher up in the tree. Even so it is clear that their performances were both primarily memory-bound. By contrast, the shared memory scan reduced memory throughput to around 25% during the tiled scan step, allowing it to have higher computation and greatly improve performance, but it was still restricted at around 84% when adding all of the block sums back to their respective tiles because it isn't leveraging shared memory during that step. Thrust's implementation also had lower memory throughput at around 65% and was able to better balance both its memory access and actual computation.
+
+### Radix Sort Array Size
+
+Lastly, for Radix Sort we can see a clear linear relationship as the size of the array grows, as expected (appears exponential here because of the log-scaled graph). I configured my radix sort to do 16 passes in order to support integers up to 1 million, but this number could be decreased to gain speed, or increased in order to handle larger integers if needed.
+
+### Sample Program Output
+I added additional tests for radix sort comparing against the result of std::sort and also tested my shared memory scan implementation. Radix sort can be called through `StreamCompaction::Radix::radix(int n, int *odata, const int *idata)`.
+```
+****************
+** SCAN TESTS **
+****************
+ [ 18 12 23 7 17 14 27 46 43 43 48 42 0 ... 24 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 0.0062ms (std::chrono Measured)
+ [ 0 18 30 53 60 77 91 118 164 207 250 298 340 ... 100095 100119 ]
+==== cpu scan, non-power-of-two ====
+ elapsed time: 0.0042ms (std::chrono Measured)
+ [ 0 18 30 53 60 77 91 118 164 207 250 298 340 ... 100026 100053 ]
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 0.3072ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 0.34816ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 0.841728ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two ====
+ elapsed time: 1.56262ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 1.9159ms (CUDA Measured)
+ passed
+==== thrust scan, non-power-of-two ====
+ elapsed time: 0.063488ms (CUDA Measured)
+ passed
+==== shared mem work-efficient scan, power-of-two ====
+ elapsed time: 0.176128ms (CUDA Measured)
+ [ 0 18 30 53 60 77 91 118 164 207 250 298 340 ... 100095 100119 ]
+ passed
+==== shared mem work-efficient scan, non-power-of-two ====
+ elapsed time: 0.294912ms (CUDA Measured)
+ [ 0 18 30 53 60 77 91 118 164 207 250 298 340 ... 100026 100053 ]
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 0 1 3 0 0 1 3 3 3 1 2 2 3 ... 2 0 ]
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 0.014ms (std::chrono Measured)
+ [ 1 3 1 3 3 3 1 2 2 3 3 2 1 ... 1 2 ]
+ passed
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 0.0136ms (std::chrono Measured)
+ [ 1 3 1 3 3 3 1 2 2 3 3 2 1 ... 2 1 ]
+ passed
+==== cpu compact with scan ====
+ elapsed time: 0.0253ms (std::chrono Measured)
+ [ 1 3 1 3 3 3 1 2 2 3 3 2 1 ... 1 2 ]
+ passed
+==== work-efficient compact, power-of-two ====
+ elapsed time: 0.90112ms (CUDA Measured)
+ passed
+==== work-efficient compact, non-power-of-two ====
+ elapsed time: 0.359424ms (CUDA Measured)
+ passed
+
+*****************************
+******** RADIX TESTS ********
+*****************************
+==== cpu sort, power-of-two ====
+ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
+==== radix, power-of-two ====
+ elapsed time: 10.2533ms (CUDA Measured)
+ passed
+==== cpu sort, non-power-of-two ====
+ [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 199 199 ]
+==== radix, non-power-of-two ====
+ elapsed time: 9.64915ms (CUDA Measured)
+ passed
+
+```
\ No newline at end of file
diff --git a/data/array_size_vs_time_data.csv b/data/array_size_vs_time_data.csv
new file mode 100644
index 00000000..3dcd839d
--- /dev/null
+++ b/data/array_size_vs_time_data.csv
@@ -0,0 +1,46 @@
+Array Size,CPUCompactWithScan,CPUCompactWithoutuScan,CPUScan,EfficientBankOptimizedSharedMemScan,EfficientCompact,EfficientScan,EfficientSharedMemScan,NaiveScan,RadixSort,ThrustScan
+256,0.00312,0.00018,0.0001,0.0139264,0.1664,0.330854,0.0183296,0.042496,0.0299008,0.224973
+1024,0.0041,0.00078,0.00024,0.0106496,0.179814,0.420966,0.0118784,0.0838656,0.0769024,0.0295936
+8192,0.03728,0.00944,0.00291,0.0196608,0.234394,0.443187,0.022528,0.0986112,0.108851,0.0304128
+32768,0.13038,0.05933,0.00854,0.0489184,0.262406,0.532586,0.0482336,0.177744,0.213395,0.0769472
+131072,0.49113,0.21795,0.03016,0.0634368,0.362787,0.607254,0.0605696,0.155574,0.168806,0.0520704
+1048576,4.091855,2.256526667,0.3241733333,0.111488,0.5056456667,1.014022667,0.137469,0.4068615,0.4617236667,0.146208
+4194304,16.8136,8.59004,2.07581,0.353306,1.32138,1.75386,0.422819,1.706,2.1252,0.203578
+16777216,68.0669,34.5437,8.79466,0.0025472,8.11009,7.4386,0.0025632,17.7327,15.6761,0.677619
+67108864,439.386,151.334,33.1646,0.0027392,34.6541,25.291,0.0029728,79.3013,93.8617,2.72899
+,,,,,,,,,,
+,,,,,,,,,,
+Array Size,CPUScan,NaiveScan,EfficientScan,ThrustScan,,,,,,
+256,0.0001,0.042496,0.330854,0.224973,,,,,,
+1024,0.00024,0.0838656,0.420966,0.0295936,,,,,,
+8192,0.00291,0.0986112,0.443187,0.0304128,,,,,,
+32768,0.00854,0.177744,0.532586,0.0769472,,,,,,
+131072,0.03016,0.155574,0.607254,0.0520704,,,,,,
+1048576,0.3241733333,0.4068615,1.014022667,0.146208,,,,,,
+4194304,2.07581,1.706,1.75386,0.203578,,,,,,
+16777216,8.79466,17.7327,7.4386,0.677619,,,,,,
+67108864,33.1646,79.3013,25.291,2.72899,,,,,,
+,,,,,,,,,,
+,,,,,,,,,,
+Array Size,EfficientScan,EfficientBankOptimizedSharedMemScan,,,,,,,,
+256,0.330854,0.0139264,,,,,,,,
+1024,0.420966,0.0106496,,,,,,,,
+8192,0.443187,0.0196608,,,,,,,,
+32768,0.532586,0.0489184,,,,,,,,
+131072,0.607254,0.0634368,,,,,,,,
+1048576,1.014022667,0.111488,,,,,,,,
+,,,,,,,,,,
+,,,,,,,,,,
+,,,,,,,,,,
+,,,,,,,,,,
+,,,,,,,,,,
+Array Size,RadixSort,,,,,,,,,
+256,0.0299008,,,,,,,,,
+1024,0.0769024,,,,,,,,,
+8192,0.108851,,,,,,,,,
+32768,0.213395,,,,,,,,,
+131072,0.168806,,,,,,,,,
+1048576,0.4617236667,,,,,,,,,
+4194304,2.1252,,,,,,,,,
+16777216,15.6761,,,,,,,,,
+67108864,93.8617,,,,,,,,,
\ No newline at end of file
diff --git a/data/block_size_vs_time_data.csv b/data/block_size_vs_time_data.csv
new file mode 100644
index 00000000..8ccf61af
--- /dev/null
+++ b/data/block_size_vs_time_data.csv
@@ -0,0 +1,6 @@
+Block Size,EfficientCompact,EfficientScan,NaiveScan,RadixSort
+64,0.481744,0.951005,0.547853,0.638733
+128,0.490736,0.943344,0.384598,0.373843
+256,0.48271,0.978797,0.344522,0.35759
+512,0.559203,1.09746,0.340048,0.595139
+1024,0.533606,1.0962,0.499978,0.465843
\ No newline at end of file
diff --git a/data/raw_data.csv b/data/raw_data.csv
new file mode 100644
index 00000000..a62566b3
--- /dev/null
+++ b/data/raw_data.csv
@@ -0,0 +1,216 @@
+implementation,block_size,array_size,power_of_two,time_ms
+CPUScan,64,1048576,1,0.29676
+NaiveScan,64,1048576,1,0.547853
+EfficientScan,64,1048576,1,0.951005
+EfficientCompact,64,1048576,1,0.481744
+CPUCompactWithScan,64,1048576,1,3.9045
+CPUCompactWithoutuScan,64,1048576,1,1.82854
+RadixSort,64,1048576,1,0.638733
+CPUScan,128,1048576,1,0.28057
+NaiveScan,128,1048576,1,0.384598
+EfficientScan,128,1048576,1,0.943344
+EfficientCompact,128,1048576,1,0.490736
+CPUCompactWithScan,128,1048576,1,3.67721
+CPUCompactWithoutuScan,128,1048576,1,2.09129
+RadixSort,128,1048576,1,0.373843
+CPUScan,256,1048576,1,0.31144
+NaiveScan,256,1048576,1,0.344522
+EfficientScan,256,1048576,1,0.978797
+EfficientCompact,256,1048576,1,0.48271
+CPUCompactWithScan,256,1048576,1,3.88545
+CPUCompactWithoutuScan,256,1048576,1,2.4543
+RadixSort,256,1048576,1,0.35759
+CPUScan,512,1048576,1,0.28999
+NaiveScan,512,1048576,1,0.340048
+EfficientScan,512,1048576,1,1.09746
+EfficientCompact,512,1048576,1,0.559203
+CPUCompactWithScan,512,1048576,1,4.62702
+CPUCompactWithoutuScan,512,1048576,1,2.53397
+RadixSort,512,1048576,1,0.595139
+CPUScan,1024,1048576,1,0.41882
+NaiveScan,1024,1048576,1,0.499978
+EfficientScan,1024,1048576,1,1.0962
+EfficientCompact,1024,1048576,1,0.533606
+CPUCompactWithScan,1024,1048576,1,4.39755
+CPUCompactWithoutuScan,1024,1048576,1,1.98094
+RadixSort,1024,1048576,1,0.465843
+CPUScan,256,256,1,0.0001
+NaiveScan,256,256,1,0.042496
+ThrustScan,256,256,1,0.224973
+EfficientScan,256,256,1,0.330854
+EfficientSharedMemScan,256,256,1,0.0183296
+EfficientBankOptimizedSharedMemScan,256,256,1,0.0139264
+EfficientCompact,256,256,1,0.1664
+CPUCompactWithScan,256,256,1,0.00312
+CPUCompactWithoutuScan,256,256,1,0.00018
+RadixSort,256,256,1,0.0299008
+CPUScan,256,1024,1,0.00024
+NaiveScan,256,1024,1,0.0838656
+ThrustScan,256,1024,1,0.0295936
+EfficientScan,256,1024,1,0.420966
+EfficientSharedMemScan,256,1024,1,0.0118784
+EfficientBankOptimizedSharedMemScan,256,1024,1,0.0106496
+EfficientCompact,256,1024,1,0.179814
+CPUCompactWithScan,256,1024,1,0.0041
+CPUCompactWithoutuScan,256,1024,1,0.00078
+RadixSort,256,1024,1,0.0769024
+CPUScan,256,8192,1,0.00291
+NaiveScan,256,8192,1,0.0986112
+ThrustScan,256,8192,1,0.0304128
+EfficientScan,256,8192,1,0.443187
+EfficientSharedMemScan,256,8192,1,0.022528
+EfficientBankOptimizedSharedMemScan,256,8192,1,0.0196608
+EfficientCompact,256,8192,1,0.234394
+CPUCompactWithScan,256,8192,1,0.03728
+CPUCompactWithoutuScan,256,8192,1,0.00944
+RadixSort,256,8192,1,0.108851
+CPUScan,256,32768,1,0.00854
+NaiveScan,256,32768,1,0.177744
+ThrustScan,256,32768,1,0.0769472
+EfficientScan,256,32768,1,0.532586
+EfficientSharedMemScan,256,32768,1,0.0482336
+EfficientBankOptimizedSharedMemScan,256,32768,1,0.0489184
+EfficientCompact,256,32768,1,0.262406
+CPUCompactWithScan,256,32768,1,0.13038
+CPUCompactWithoutuScan,256,32768,1,0.05933
+RadixSort,256,32768,1,0.213395
+CPUScan,256,131072,1,0.03016
+NaiveScan,256,131072,1,0.155574
+ThrustScan,256,131072,1,0.0520704
+EfficientScan,256,131072,1,0.607254
+EfficientSharedMemScan,256,131072,1,0.0605696
+EfficientBankOptimizedSharedMemScan,256,131072,1,0.0634368
+EfficientCompact,256,131072,1,0.362787
+CPUCompactWithScan,256,131072,1,0.49113
+CPUCompactWithoutuScan,256,131072,1,0.21795
+RadixSort,256,131072,1,0.168806
+CPUScan,256,1048576,1,0.34746
+NaiveScan,256,1048576,1,0.32417
+ThrustScan,256,1048576,1,0.146208
+EfficientScan,256,1048576,1,1.01733
+EfficientSharedMemScan,256,1048576,1,0.137469
+EfficientBankOptimizedSharedMemScan,256,1048576,1,0.111488
+EfficientCompact,256,1048576,1,0.485875
+CPUCompactWithScan,256,1048576,1,4.0594
+CPUCompactWithoutuScan,256,1048576,1,2.65012
+RadixSort,256,1048576,1,0.339194
+CPUScan,256,4194304,1,2.07581
+NaiveScan,256,4194304,1,1.706
+ThrustScan,256,4194304,1,0.203578
+EfficientScan,256,4194304,1,1.75386
+EfficientSharedMemScan,256,4194304,1,0.422819
+EfficientBankOptimizedSharedMemScan,256,4194304,1,0.353306
+EfficientCompact,256,4194304,1,1.32138
+CPUCompactWithScan,256,4194304,1,16.8136
+CPUCompactWithoutuScan,256,4194304,1,8.59004
+RadixSort,256,4194304,1,2.1252
+CPUScan,256,16777216,1,8.79466
+NaiveScan,256,16777216,1,17.7327
+ThrustScan,256,16777216,1,0.677619
+EfficientScan,256,16777216,1,7.4386
+EfficientSharedMemScan,256,16777216,1,0.0025632
+EfficientBankOptimizedSharedMemScan,256,16777216,1,0.0025472
+EfficientCompact,256,16777216,1,8.11009
+CPUCompactWithScan,256,16777216,1,68.0669
+CPUCompactWithoutuScan,256,16777216,1,34.5437
+RadixSort,256,16777216,1,15.6761
+CPUScan,256,67108864,1,33.1646
+NaiveScan,256,67108864,1,79.3013
+ThrustScan,256,67108864,1,2.72899
+EfficientScan,256,67108864,1,25.291
+EfficientSharedMemScan,256,67108864,1,0.0029728
+EfficientBankOptimizedSharedMemScan,256,67108864,1,0.0027392
+EfficientCompact,256,67108864,1,34.6541
+CPUCompactWithScan,256,67108864,1,439.386
+CPUCompactWithoutuScan,256,67108864,1,151.334
+RadixSort,256,67108864,1,93.8617
+CPUScan,256,253,0,0.0001
+NaiveScan,256,253,0,0.0405504
+ThrustScan,256,253,0,0.223738
+EfficientScan,256,253,0,0.29911
+EfficientSharedMemScan,256,253,0,0.0098304
+EfficientBankOptimizedSharedMemScan,256,253,0,0.0100352
+EfficientCompact,256,253,0,0.195891
+CPUCompactWithScan,256,253,0,0.00254
+CPUCompactWithoutuScan,256,253,0,0.0002
+RadixSort,256,253,0,0.0438272
+CPUScan,256,1021,0,0.00024
+NaiveScan,256,1021,0,0.073216
+ThrustScan,256,1021,0,0.0304128
+EfficientScan,256,1021,0,0.314061
+EfficientSharedMemScan,256,1021,0,0.0139264
+EfficientBankOptimizedSharedMemScan,256,1021,0,0.0156672
+EfficientCompact,256,1021,0,0.163526
+CPUCompactWithScan,256,1021,0,0.00671
+CPUCompactWithoutuScan,256,1021,0,0.00095
+RadixSort,256,1021,0,0.0693248
+CPUScan,256,8189,0,0.00176
+NaiveScan,256,8189,0,0.0780288
+ThrustScan,256,8189,0,0.0247808
+EfficientScan,256,8189,0,0.340275
+EfficientSharedMemScan,256,8189,0,0.0169984
+EfficientBankOptimizedSharedMemScan,256,8189,0,0.0134144
+EfficientCompact,256,8189,0,0.176026
+CPUCompactWithScan,256,8189,0,0.03127
+CPUCompactWithoutuScan,256,8189,0,0.01134
+RadixSort,256,8189,0,0.0753664
+CPUScan,256,32765,0,0.00675
+NaiveScan,256,32765,0,0.11104
+ThrustScan,256,32765,0,0.0371392
+EfficientScan,256,32765,0,0.388339
+EfficientSharedMemScan,256,32765,0,0.042592
+EfficientBankOptimizedSharedMemScan,256,32765,0,0.0389504
+EfficientCompact,256,32765,0,0.196758
+CPUCompactWithScan,256,32765,0,0.15536
+CPUCompactWithoutuScan,256,32765,0,0.05429
+RadixSort,256,32765,0,0.113571
+CPUScan,256,131069,0,0.02977
+NaiveScan,256,131069,0,0.118579
+ThrustScan,256,131069,0,0.0477472
+EfficientScan,256,131069,0,0.447466
+EfficientSharedMemScan,256,131069,0,0.0447392
+EfficientBankOptimizedSharedMemScan,256,131069,0,0.0427392
+EfficientCompact,256,131069,0,0.26376
+CPUCompactWithScan,256,131069,0,0.91422
+CPUCompactWithoutuScan,256,131069,0,0.57563
+RadixSort,256,131069,0,0.1316
+CPUScan,256,1048573,0,0.34776
+NaiveScan,256,1048573,0,0.24305
+ThrustScan,256,1048573,0,0.127475
+EfficientScan,256,1048573,0,0.603056
+EfficientSharedMemScan,256,1048573,0,0.0968768
+EfficientBankOptimizedSharedMemScan,256,1048573,0,0.0834528
+EfficientCompact,256,1048573,0,0.42945
+CPUCompactWithScan,256,1048573,0,5.33821
+CPUCompactWithoutuScan,256,1048573,0,1.92851
+RadixSort,256,1048573,0,0.236144
+CPUScan,256,4194301,0,1.91732
+NaiveScan,256,4194301,0,1.40028
+ThrustScan,256,4194301,0,0.179446
+EfficientScan,256,4194301,0,1.0084
+EfficientSharedMemScan,256,4194301,0,0.257606
+EfficientBankOptimizedSharedMemScan,256,4194301,0,0.216342
+EfficientCompact,256,4194301,0,0.955834
+CPUCompactWithScan,256,4194301,0,19.3956
+CPUCompactWithoutuScan,256,4194301,0,8.84729
+RadixSort,256,4194301,0,1.42202
+CPUScan,256,16777213,0,8.18109
+NaiveScan,256,16777213,0,15.0227
+ThrustScan,256,16777213,0,0.702128
+EfficientScan,256,16777213,0,7.0073
+EfficientSharedMemScan,256,16777213,0,0.0029728
+EfficientBankOptimizedSharedMemScan,256,16777213,0,0.0029952
+EfficientCompact,256,16777213,0,8.16237
+CPUCompactWithScan,256,16777213,0,78.1366
+CPUCompactWithoutuScan,256,16777213,0,35.9248
+RadixSort,256,16777213,0,16.7439
+CPUScan,256,67108861,0,41.9802
+NaiveScan,256,67108861,0,70.7106
+ThrustScan,256,67108861,0,2.96872
+EfficientScan,256,67108861,0,25.8569
+EfficientSharedMemScan,256,67108861,0,0.002896
+EfficientBankOptimizedSharedMemScan,256,67108861,0,0.0027552
+EfficientCompact,256,67108861,0,34.7935
+CPUCompactWithScan,256,67108861,0,602.122
+CPUCompactWithoutuScan,256,67108861,0,166.968
+RadixSort,256,67108861,0,70.2384
\ No newline at end of file
diff --git a/img/RadixSort.png b/img/RadixSort.png
new file mode 100644
index 00000000..c35d6dc1
Binary files /dev/null and b/img/RadixSort.png differ
diff --git a/img/Scan.png b/img/Scan.png
new file mode 100644
index 00000000..8bd3cdc0
Binary files /dev/null and b/img/Scan.png differ
diff --git a/img/StreamCompaction.png b/img/StreamCompaction.png
new file mode 100644
index 00000000..0bc07cc3
Binary files /dev/null and b/img/StreamCompaction.png differ
diff --git a/img/figure1.png b/img/figure1.png
new file mode 100644
index 00000000..24194f49
Binary files /dev/null and b/img/figure1.png differ
diff --git a/img/figure10.png b/img/figure10.png
new file mode 100644
index 00000000..39388abc
Binary files /dev/null and b/img/figure10.png differ
diff --git a/img/figure11.png b/img/figure11.png
new file mode 100644
index 00000000..cec111b1
Binary files /dev/null and b/img/figure11.png differ
diff --git a/img/figure12.png b/img/figure12.png
new file mode 100644
index 00000000..fddfa8d5
Binary files /dev/null and b/img/figure12.png differ
diff --git a/img/figure13.png b/img/figure13.png
new file mode 100644
index 00000000..587c79e1
Binary files /dev/null and b/img/figure13.png differ
diff --git a/img/figure14.png b/img/figure14.png
new file mode 100644
index 00000000..247479bf
Binary files /dev/null and b/img/figure14.png differ
diff --git a/img/figure15.png b/img/figure15.png
new file mode 100644
index 00000000..d7e07f71
Binary files /dev/null and b/img/figure15.png differ
diff --git a/img/figure16.png b/img/figure16.png
new file mode 100644
index 00000000..84d5dcde
Binary files /dev/null and b/img/figure16.png differ
diff --git a/img/figure17.png b/img/figure17.png
new file mode 100644
index 00000000..33146020
Binary files /dev/null and b/img/figure17.png differ
diff --git a/img/figure2.png b/img/figure2.png
new file mode 100644
index 00000000..2c6e4c1f
Binary files /dev/null and b/img/figure2.png differ
diff --git a/img/figure3.png b/img/figure3.png
new file mode 100644
index 00000000..8f6560fd
Binary files /dev/null and b/img/figure3.png differ
diff --git a/img/figure4.png b/img/figure4.png
new file mode 100644
index 00000000..dda95f6b
Binary files /dev/null and b/img/figure4.png differ
diff --git a/img/figure5.png b/img/figure5.png
new file mode 100644
index 00000000..56dd012c
Binary files /dev/null and b/img/figure5.png differ
diff --git a/img/figure6.png b/img/figure6.png
new file mode 100644
index 00000000..ccf14443
Binary files /dev/null and b/img/figure6.png differ
diff --git a/img/figure7.png b/img/figure7.png
new file mode 100644
index 00000000..9df5369a
Binary files /dev/null and b/img/figure7.png differ
diff --git a/img/figure8.png b/img/figure8.png
new file mode 100644
index 00000000..3a693639
Binary files /dev/null and b/img/figure8.png differ
diff --git a/img/figure9.png b/img/figure9.png
new file mode 100644
index 00000000..929948a8
Binary files /dev/null and b/img/figure9.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 3d5c8820..4e1409b3 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -11,15 +11,330 @@
#include
#include
#include
+#include
#include "testing_helpers.hpp"
+#include
+#include
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 12; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
int *c = new int[SIZE];
+
+// Functions for data collection
+static void setAllBlockSizes(int blockSize) {
+ StreamCompaction::Naive::setBlockSize(blockSize);
+ StreamCompaction::Efficient::setBlockSize(blockSize);
+ StreamCompaction::Radix::setBlockSize(blockSize);
+}
+
+static void fillScanArray(int n, int* a) {
+ genArray(n - 1, a, 50);
+ a[n - 1] = 0;
+}
+
+static void fillCompactArray(int n, int* a) {
+ genArray(n - 1, a, 4);
+ a[n - 1] = 0;
+}
+
+static void fillRadixArray(int n, int* a) {
+ genArray(n - 1, a, 2048);
+}
+
+static double getAvgCPUScanData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::CPU::scan(n, o, in);
+ sum += StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation();
+ }
+ return sum / numRuns;
+}
+
+static double getAvgThrustScanData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::Thrust::scan(n, o, in);
+ sum += StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation();
+ }
+ return sum / numRuns;
+}
+
+static double getAvgNaiveScanData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::Naive::scan(n, o, in);
+ sum += StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation();
+ }
+ return sum / numRuns;
+}
+
+static double getAvgEfficientScanData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::Efficient::scan(n, o, in);
+ sum += StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+ }
+ return sum / numRuns;
+}
+
+static double getAvgEfficientSharedMemScanData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::Efficient::sharedMemScan(n, o, in);
+ sum += StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+ }
+ return sum / numRuns;
+}
+
+static double getAvgCPUCompactWithoutScanData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::CPU::compactWithoutScan(n, o, in);
+ sum += StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation();
+
+ }
+ return sum / numRuns;
+}
+
+static double getAvgCPUCompactWithScanData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::CPU::compactWithScan(n, o, in);
+ sum += StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation();
+ }
+ return sum / numRuns;
+}
+
+static double getAvgEfficientCompactData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::Efficient::compact(n, o, in);
+ sum += StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation();
+ }
+ return sum / numRuns;
+}
+
+static double getAvgRadixnData(int n, int* o, int* in) {
+ const int numRuns = 10;
+ double sum = 0.0;
+ for (int i = 0; i < numRuns; i++) {
+ StreamCompaction::Radix::radix(n, o, in);
+ sum += StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation();
+ }
+ return sum / numRuns;
+}
+
+// Actually collect the data
+void collectData() {
+ const char* dataPath = "data.csv";
+ std::ofstream dataFile(dataPath);
+ if (!dataFile) {
+ return;
+ }
+ dataFile << "implementation,block_size,array_size,power_of_two,time_ms\n";
+
+ const int MAXN = (1 << 27) +1;
+ int* a = new int[MAXN];
+ int* b = new int[MAXN];
+ int* c = new int[MAXN];
+
+ // Block Size vs Time data using N = 20
+
+ const int blockSizes[] = { 64, 128, 256, 512, 1024 };
+ int powerOfTwoFlag = 1;
+ int currSize = 1 << 20;
+
+ for (int blockSize : blockSizes) {
+ setAllBlockSizes(blockSize);
+
+ // CPUScan
+ fillScanArray(currSize, a);
+ zeroArray(currSize, c);
+ double t = getAvgCPUScanData(currSize, c, a);
+ dataFile << "CPUScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ // NaiveScan
+ zeroArray(currSize, c);
+ t = getAvgNaiveScanData(currSize, c, a);
+ dataFile << "NaiveScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientScan
+ zeroArray(currSize, c);
+ t = getAvgEfficientScanData(currSize, c, a);
+ dataFile << "EfficientScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientStreamCompaction
+ fillCompactArray(currSize, a);
+ zeroArray(currSize, c);
+ t = getAvgEfficientCompactData(currSize, c, a);
+ dataFile << "EfficientCompact," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //CPUCompactWithScan
+ zeroArray(currSize, c);
+ t = getAvgCPUCompactWithScanData(currSize, c, a);
+ dataFile << "CPUCompactWithScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //CPUCompactWithoutuScan
+ zeroArray(currSize, c);
+ t = getAvgCPUCompactWithoutScanData(currSize, c, a);
+ dataFile << "CPUCompactWithoutuScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //Radix
+ fillRadixArray(currSize, a);
+ zeroArray(currSize, c);
+ t = getAvgNaiveScanData(currSize, c, a);
+ dataFile << "RadixSort," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+ }
+
+ // Array Size vs Time
+
+ int blockSize = 256;
+ setAllBlockSizes(blockSize);
+ const int arraySizes[] = { 1 << 8, 1 << 10, 1 << 13, 1 << 15, 1 << 17, 1 << 20, 1 << 22, 1 << 24, 1 << 26};
+ const int numSizes = 9;
+
+ // Powers of 2
+ for (int i = 0; i < numSizes; i++) {
+
+ int currSize = arraySizes[i];
+
+ // CPUScan
+ fillScanArray(currSize, a);
+ zeroArray(currSize, c);
+ double t = getAvgCPUScanData(currSize, c, a);
+ dataFile << "CPUScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ // NaiveScan
+ zeroArray(currSize, c);
+ t = getAvgNaiveScanData(currSize, c, a);
+ dataFile << "NaiveScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ // ThrustScan
+ zeroArray(currSize, c);
+ t = getAvgThrustScanData(currSize, c, a);
+ dataFile << "ThrustScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientScan
+ zeroArray(currSize, c);
+ t = getAvgEfficientScanData(currSize, c, a);
+ dataFile << "EfficientScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientSharedMemScan
+ zeroArray(currSize, c);
+ StreamCompaction::Efficient::setMemoryBankOptimized(0);
+ t = getAvgEfficientSharedMemScanData(currSize, c, a);
+ dataFile << "EfficientSharedMemScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientBankOptimizedSharedMemScan
+ zeroArray(currSize, c);
+ StreamCompaction::Efficient::setMemoryBankOptimized(1);
+ t = getAvgEfficientSharedMemScanData(currSize, c, a);
+ dataFile << "EfficientBankOptimizedSharedMemScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientStreamCompaction
+ fillCompactArray(currSize, a);
+ zeroArray(currSize, c);
+ t = getAvgEfficientCompactData(currSize, c, a);
+ dataFile << "EfficientCompact," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //CPUCompactWithScan
+ zeroArray(currSize, c);
+ t = getAvgCPUCompactWithScanData(currSize, c, a);
+ dataFile << "CPUCompactWithScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //CPUCompactWithoutuScan
+ zeroArray(currSize, c);
+ t = getAvgCPUCompactWithoutScanData(currSize, c, a);
+ dataFile << "CPUCompactWithoutuScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //Radix
+ fillRadixArray(currSize, a);
+ zeroArray(currSize, c);
+ t = getAvgNaiveScanData(currSize, c, a);
+ dataFile << "RadixSort," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+ }
+
+ // Non Powers of 2
+ powerOfTwoFlag = 0;
+ for (int i = 0; i < numSizes; i++) {
+
+ int currSize = arraySizes[i] - 3;
+
+ // CPUScan
+ fillScanArray(currSize, a);
+ zeroArray(currSize, c);
+ double t = getAvgCPUScanData(currSize, c, a);
+ dataFile << "CPUScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ // NaiveScan
+ zeroArray(currSize, c);
+ t = getAvgNaiveScanData(currSize, c, a);
+ dataFile << "NaiveScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ // ThrustScan
+ zeroArray(currSize, c);
+ t = getAvgThrustScanData(currSize, c, a);
+ dataFile << "ThrustScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientScan
+ zeroArray(currSize, c);
+ t = getAvgEfficientScanData(currSize, c, a);
+ dataFile << "EfficientScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientSharedMemScan
+ zeroArray(currSize, c);
+ StreamCompaction::Efficient::setMemoryBankOptimized(0);
+ t = getAvgEfficientSharedMemScanData(currSize, c, a);
+ dataFile << "EfficientSharedMemScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientBankOptimizedSharedMemScan
+ zeroArray(currSize, c);
+ StreamCompaction::Efficient::setMemoryBankOptimized(1);
+ t = getAvgEfficientSharedMemScanData(currSize, c, a);
+ dataFile << "EfficientBankOptimizedSharedMemScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //EfficientStreamCompaction
+ fillCompactArray(currSize, a);
+ zeroArray(currSize, c);
+ t = getAvgEfficientCompactData(currSize, c, a);
+ dataFile << "EfficientCompact," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //CPUCompactWithScan
+ zeroArray(currSize, c);
+ t = getAvgCPUCompactWithScanData(currSize, c, a);
+ dataFile << "CPUCompactWithScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //CPUCompactWithoutuScan
+ zeroArray(currSize, c);
+ t = getAvgCPUCompactWithoutScanData(currSize, c, a);
+ dataFile << "CPUCompactWithoutuScan," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+
+ //Radix
+ fillRadixArray(currSize, a);
+ zeroArray(currSize, c);
+ t = getAvgNaiveScanData(currSize, c, a);
+ dataFile << "RadixSort," << blockSize << "," << currSize << "," << powerOfTwoFlag << "," << t << "\n";
+ }
+
+ dataFile.close();
+}
+
int main(int argc, char* argv[]) {
+
+ // CollectData
+ //collectData();
+
// Scan tests
printf("\n");
@@ -95,6 +410,22 @@ int main(int argc, char* argv[]) {
//printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);
+ StreamCompaction::Efficient::setMemoryBankOptimized(1);
+
+ zeroArray(SIZE, c);
+ printDesc("shared mem work-efficient scan, power-of-two");
+ StreamCompaction::Efficient::sharedMemScan(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(SIZE, c);
+ printDesc("shared mem work-efficient scan, non-power-of-two");
+ StreamCompaction::Efficient::sharedMemScan(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ printArray(NPOT, c, true);
+ printCmpResult(NPOT, b, c);
+
printf("\n");
printf("*****************************\n");
printf("** STREAM COMPACTION TESTS **\n");
@@ -147,8 +478,41 @@ int main(int argc, char* argv[]) {
//printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);
+ printf("\n");
+ printf("*****************************\n");
+ printf("******** RADIX TESTS ********\n");
+ printf("*****************************\n");
+
+ genArray(SIZE - 1, a, 200);
+
+ // At first all cases passed because b && c are all zeroes.
+ zeroArray(SIZE, b);
+ printDesc("cpu sort, power-of-two");
+ StreamCompaction::CPU::sort(SIZE, b, a);
+ printArray(SIZE, b, true);
+
+ zeroArray(SIZE, c);
+ printDesc("radix, power-of-two");
+ StreamCompaction::Radix::radix(SIZE, c, a);
+ printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(SIZE, c, true);
+ printCmpResult(SIZE, b, c);
+
+ zeroArray(SIZE, b);
+ printDesc("cpu sort, non-power-of-two");
+ StreamCompaction::CPU::sort(NPOT, b, a);
+ printArray(NPOT, b, true);
+
+ zeroArray(SIZE, c);
+ printDesc("radix, non-power-of-two");
+ StreamCompaction::Radix::radix(NPOT, c, a);
+ printElapsedTime(StreamCompaction::Radix::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
+ //printArray(SIZE, c, true);
+ printCmpResult(NPOT, b, c);
+
system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
}
+
diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt
index 19511caa..5f82013c 100644
--- a/stream_compaction/CMakeLists.txt
+++ b/stream_compaction/CMakeLists.txt
@@ -4,6 +4,7 @@ set(headers
"naive.h"
"efficient.h"
"thrust.h"
+ "radix.h"
)
set(sources
@@ -12,6 +13,7 @@ set(sources
"naive.cu"
"efficient.cu"
"thrust.cu"
+ "radix.cu"
)
list(SORT headers)
@@ -19,10 +21,11 @@ list(SORT sources)
source_group(Headers FILES ${headers})
source_group(Sources FILES ${sources})
-
+find_package(CCCL REQUIRED)
add_library(stream_compaction ${sources} ${headers})
+target_link_libraries(stream_compaction CCCL::Thrust)
if(CMAKE_VERSION VERSION_LESS "3.23.0")
- set_target_properties(stream_compaction} PROPERTIES CUDA_ARCHITECTURES OFF)
+ set_target_properties(stream_compaction PROPERTIES CUDA_ARCHITECTURES OFF)
elseif(CMAKE_VERSION VERSION_LESS "3.24.0")
set_target_properties(stream_compaction PROPERTIES CUDA_ARCHITECTURES all-major)
else()
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d630..28e2d3e4 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -24,6 +24,8 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ bools[index] = idata[index] == 0 ? 0 : 1;
}
/**
@@ -33,6 +35,10 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (bools[index]) {
+ odata[indices[index]] = idata[index];
+ }
}
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa115..48c6b787 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -20,6 +20,11 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
+ int sum = 0;
+ for (int i = 0; i < n; i++) {
+ odata[i] = sum;
+ sum += idata[i];
+ }
timer().endCpuTimer();
}
@@ -30,9 +35,16 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
+ int writeOffset = 0;
+ for (int i = 0; i < n; i++) {
+ if (idata[i] != 0) {
+ odata[writeOffset] = idata[i];
+ writeOffset++;
+ }
+ }
// TODO
timer().endCpuTimer();
- return -1;
+ return writeOffset;
}
/**
@@ -43,8 +55,38 @@ namespace StreamCompaction {
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
+ int* temp = new int[n];
+ // create temp array
+ int* scanned = new int[n];
+ for (int i = 0; i < n; i++) {
+ temp[i] = idata[i] == 0 ? 0 : 1;
+ }
+
+ // scan temp array
+ int sum = 0;
+ for (int i = 0; i < n; i++) {
+ scanned[i] = sum;
+ sum += temp[i];
+ }
+
+ // scatter
+ for (int i = 0; i < n; i++) {
+ if (temp[i]) {
+ odata[scanned[i]] = idata[i];
+ }
+ }
+ int count = scanned[n - 1] + temp[n - 1];
+ timer().endCpuTimer();
+ return count;
+ }
+
+ // CPU sort for testing GPU radix sort
+ void sort(int n, int *odata, const int *idata) {
+ timer().startCpuTimer();
+ // TODO
+ memcpy(odata, idata, n * sizeof(int));
+ std::sort(odata, odata + n);
timer().endCpuTimer();
- return -1;
}
}
}
diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h
index 873c0476..222b77a3 100644
--- a/stream_compaction/cpu.h
+++ b/stream_compaction/cpu.h
@@ -11,5 +11,7 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata);
int compactWithScan(int n, int *odata, const int *idata);
+
+ void sort(int n, int* odata, const int* idata);
}
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346ee..8bd44324 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -5,6 +5,35 @@
namespace StreamCompaction {
namespace Efficient {
+
+ // Block variables
+ int blockSize = 128;
+ dim3 threadsPerBlock(blockSize);
+
+ bool MEMORY_BANK_OPTIMIIZED = 1;
+
+ // Data buffers
+ int* dev_idata;
+ int* dev_bools;
+ int* dev_scanned;
+ int* dev_indices;
+ int* dev_odata;
+ int* dev_blockSums;
+
+ // Macros for avoiding shared memory bank conflicts
+ #define NUM_BANKS 32
+ #define LOG_NUM_BANKS 5
+ #define CONFLICT_FREE_OFFSET(n) ((n) >> LOG_NUM_BANKS)
+
+ void setBlockSize(int newBlockSize) {
+ blockSize = newBlockSize;
+ threadsPerBlock = dim3(blockSize);
+ }
+
+ void setMemoryBankOptimized(bool memBankOptimized) {
+ MEMORY_BANK_OPTIMIIZED = memBankOptimized;
+ }
+
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
@@ -12,15 +41,268 @@ namespace StreamCompaction {
return timer;
}
+
+ __global__ void kernUpSweep(int nearestPow2, int currOffset, int* data) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ int parentIdx = (index + 1) * currOffset * 2 - 1;
+ int leftChildIdx = parentIdx - currOffset;
+ if (parentIdx >= nearestPow2 || leftChildIdx < 0 || leftChildIdx >= nearestPow2 || parentIdx < 0) {
+ return;
+ }
+ data[parentIdx] += data[leftChildIdx];
+ }
+
+ __global__ void kernDownSweep(int nearestPow2, int currOffset, int* data) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ int parentIdx = (index + 1) * currOffset * 2 - 1;
+ int leftChildIdx = parentIdx - currOffset;
+ if (parentIdx >= nearestPow2 || leftChildIdx < 0 || leftChildIdx >= nearestPow2 || parentIdx < 0) {
+ return;
+ }
+ int temp = data[leftChildIdx];
+ data[leftChildIdx] = data[parentIdx];
+ data[parentIdx] += temp;
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
// TODO
+ int iters = ilog2ceil(n);
+ int nearestPow2 = 1 << iters;
+
+ // set up device arrays to the nearest power of 2
+ cudaMalloc((void**)&dev_indices, (nearestPow2 + 1) * sizeof(int));
+ checkCUDAError("cudaMalloc dev_indices failed!");
+ cudaMemset(dev_indices, 0, (nearestPow2 + 1) * sizeof(int));
+ checkCUDAError("cudaMemset dev_indices failed!");
+ cudaMemcpy(dev_indices, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy to device failed!");
+
+ timer().startGpuTimer();
+
+ // upsweep
+ for (int d = 0; d < iters; d++) {
+ // calculate power of 2 offset with bitshift
+ int currOffset = 1 << d;
+ // Only call the number of threads that actually need to write no values in the current sweep level
+ dim3 fullBlocksPerGrid((nearestPow2 / (currOffset * 2) + blockSize - 1) / blockSize);
+ kernUpSweep<<>> (nearestPow2, currOffset, dev_indices);
+ checkCUDAError("kernUpSweep failed");
+ cudaError_t e = cudaDeviceSynchronize(); // runtime errors
+ if (e != cudaSuccess) { fprintf(stderr, "upsweep error: %s\n", cudaGetErrorString(e)); }
+ }
+
+ // Set last value after upsweep to 0
+ cudaMemset(dev_indices + nearestPow2 - 1, 0, sizeof(int));
+ // downsweep
+ for (int d = iters - 1; d >= 0; d--) {
+ // calculate power of 2 offset with bitshift
+ int currOffset = 1 << d;
+ // Only call the number of threads that actually need to write no values in the current sweep level
+ dim3 fullBlocksPerGrid((nearestPow2 / (currOffset * 2) + blockSize - 1) / blockSize);
+ kernDownSweep<<>> (nearestPow2, currOffset, dev_indices);
+ checkCUDAError("kernDownSweep failed");
+ cudaError_t e = cudaDeviceSynchronize(); // runtime errors
+ if (e != cudaSuccess) { fprintf(stderr, "downsweep error: %s\n", cudaGetErrorString(e)); }
+ }
+
timer().endGpuTimer();
+
+ cudaMemcpy(odata, dev_indices, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy to host failed!");
+
+ // Free device arrays
+ cudaFree(dev_indices);
+ checkCUDAError("cudaFree dev_indices failed!");
+
+ }
+
+ // bank conflict unoptimized version
+ __global__ void kernSharedMemScan(int n, int* odata, int* idata, int* blockSums) {
+ extern __shared__ int temp[];
+ // Only one block to maintain shared memory
+ int index = threadIdx.x;
+ int blockStartIndex = blockIdx.x * 2048;
+ int offset = 1;
+ //load entire input into shared mem
+ temp[2 * index] = idata[2 * index + blockStartIndex];
+ temp[2 * index + 1] = idata[2 * index + blockStartIndex + 1];
+ // upsweep
+ for (int d = n >> 1; d > 0; d >>= 1) {
+ __syncthreads();
+ if (index < d) {
+ int leftChild = offset * (2 * index + 1) - 1;
+ int parent = offset * (2 * index + 2) - 1;
+ temp[parent] += temp[leftChild];
+ }
+ offset *= 2;
+ }
+ __syncthreads();
+ // capture last elem in block sums then zero it out zero out last element of temp array
+ if (index == 0) {
+ blockSums[blockIdx.x] = temp[n - 1];
+ temp[n - 1] = 0;
+ }
+ //downsweep
+ for (int d = 1; d < n; d *= 2) {
+ offset >>= 1;
+ __syncthreads();
+ if (index < d) {
+ int leftChild = offset * (2 * index + 1) - 1;
+ int parent = offset * (2 * index + 2) - 1;
+ int saved = temp[leftChild];
+ temp[leftChild] = temp[parent];
+ temp[parent] += saved;
+ }
+ }
+ __syncthreads();
+ odata[2 * index + blockStartIndex] = temp[2 * index];
+ odata[2 * index + blockStartIndex + 1] = temp[2 * index + 1];
+
+ }
+
+ // bank optimized version
+ __global__ void kernSharedMemBankOptimizedScan(int n, int* odata, int* idata, int* blockSums) {
+ extern __shared__ int temp[];
+ // Only one block to maintain shared memory
+ int index = threadIdx.x;
+ int blockStartIndex = blockIdx.x * 2048;
+ int offset = 1;
+ //load entire input into shared mem
+ int dataToLoadA = index;
+ int dataToLoadB = index + (n / 2);
+ int bankOffsetA = CONFLICT_FREE_OFFSET(dataToLoadA);
+ int bankOffsetB = CONFLICT_FREE_OFFSET(dataToLoadB);
+
+ temp[dataToLoadA + bankOffsetA] = idata[dataToLoadA + blockStartIndex];
+ temp[dataToLoadB + bankOffsetB] = idata[dataToLoadB + blockStartIndex];
+
+ // upsweep
+ for (int d = n >> 1; d > 0; d >>= 1) {
+ __syncthreads();
+ if (index < d) {
+ int leftChild = offset * (2 * index + 1) - 1;
+ int parent = offset * (2 * index + 2) - 1;
+ leftChild += CONFLICT_FREE_OFFSET(leftChild);
+ parent += CONFLICT_FREE_OFFSET(parent);
+ temp[parent] += temp[leftChild];
+ }
+ offset *= 2;
+ }
+ __syncthreads();
+ // capture last elem in block sums then zero it out zero out last element of temp array
+ if (index == 0) {
+ blockSums[blockIdx.x] = temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)];
+ temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)] = 0;
+ }
+ //downsweep
+ for (int d = 1; d < n; d *= 2) {
+ offset >>= 1;
+ __syncthreads();
+ if (index < d) {
+ int leftChild = offset * (2 * index + 1) - 1;
+ int parent = offset * (2 * index + 2) - 1;
+ leftChild += CONFLICT_FREE_OFFSET(leftChild);
+ parent += CONFLICT_FREE_OFFSET(parent);
+ int saved = temp[leftChild];
+ temp[leftChild] = temp[parent];
+ temp[parent] += saved;
+ }
+ }
+ __syncthreads();
+ odata[dataToLoadA + blockStartIndex] = temp[dataToLoadA + bankOffsetA];
+ odata[dataToLoadB + blockStartIndex] = temp[dataToLoadB + bankOffsetB];
+
+ }
+
+ __global__ void kernAddBlockSums(int n, int* odata, int* blockSums) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= n) {
+ return;
+ }
+ int blockSumIndex = index / 2048;
+ odata[index] += blockSums[blockSumIndex];
}
+ void sharedMemScan(int n, int* odata, const int* idata) {
+ // max allowed in shared memory of one block
+ if (n > 1 << 22) {
+ timer().startGpuTimer();
+ timer().endGpuTimer();
+ return;
+ }
+ int iters = ilog2ceil(n);
+ int nearestPow2 = 1 << iters;
+
+ // each individual block can handle 2048 data points
+ const int SPLIT = 2048;
+
+ // set up device arrays to the nearest power of 2
+ cudaMalloc((void**)&dev_indices, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMalloc dev_indices failed!");
+ cudaMalloc((void**)&dev_odata, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata failed!");
+ cudaMalloc((void**)&dev_blockSums, (((nearestPow2 + SPLIT - 1) / SPLIT)) * sizeof(int));
+ checkCUDAError("cudaMalloc dev_blockSums failed!");
+ cudaMalloc((void**)&dev_scanned, (((nearestPow2 + SPLIT - 1) / SPLIT)) * sizeof(int));
+ checkCUDAError("cudaMalloc dev_scanned failed!");
+
+ cudaMemset(dev_indices, 0, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMemset dev_indices failed!");
+ cudaMemcpy(dev_indices, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy to device failed!");
+
+ timer().startGpuTimer();
+
+ const int maxThreadsPerBlock = 1024;
+ int scannedSoFar = 0;
+ int blocksNeeded = (nearestPow2 + 2048 - 1) / 2048;
+
+ if (blocksNeeded == 1) {
+ // only need one block
+ if (MEMORY_BANK_OPTIMIIZED) {
+ kernSharedMemBankOptimizedScan << <1, nearestPow2 / 2, (nearestPow2 + CONFLICT_FREE_OFFSET(nearestPow2))* sizeof(int) >> > (nearestPow2, dev_odata, dev_indices, dev_blockSums);
+ }
+ else {
+ kernSharedMemScan << <1, nearestPow2 / 2, nearestPow2 * sizeof(int) >> > (nearestPow2, dev_odata, dev_indices, dev_blockSums);
+ }
+ }
+ else {
+ // need multiple blocks and to scan block sums
+
+ if (MEMORY_BANK_OPTIMIIZED) {
+ kernSharedMemBankOptimizedScan<<>> (2048, dev_odata, dev_indices, dev_blockSums);
+ kernSharedMemBankOptimizedScan<<<1, (blocksNeeded + 1) / 2, (blocksNeeded + CONFLICT_FREE_OFFSET(blocksNeeded)) * sizeof(int) >>> (blocksNeeded, dev_scanned, dev_blockSums, dev_blockSums);
+
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+ kernAddBlockSums<<> > (n, dev_odata, dev_scanned);
+ }
+ else {
+ kernSharedMemScan<<>> (2048, dev_odata, dev_indices, dev_blockSums);
+ kernSharedMemScan<<< 1, (blocksNeeded + 1) / 2, blocksNeeded * sizeof(int) >>> (blocksNeeded, dev_scanned, dev_blockSums, dev_blockSums);
+
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+ kernAddBlockSums<<>> (n, dev_odata, dev_scanned);
+ }
+
+ }
+
+ timer().endGpuTimer();
+
+ // copy data back over
+ cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy to host failed!");
+
+ // free
+ cudaFree(dev_indices);
+ cudaFree(dev_odata);
+ cudaFree(dev_blockSums);
+ }
+
+
+
/**
* Performs stream compaction on idata, storing the result into odata.
* All zeroes are discarded.
@@ -31,10 +313,85 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
// TODO
+
+ int iters = ilog2ceil(n);
+ int nearestPow2 = 1 << iters;
+ // set up device arrays to the nearest power of 2
+ cudaMalloc((void**)&dev_bools, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMalloc dev_bools failed!");
+ cudaMalloc((void**)&dev_idata, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMalloc dev_idata failed!");
+ cudaMalloc((void**)&dev_indices, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMalloc dev_indices failed!");
+ cudaMalloc((void**)&dev_odata, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata failed!");
+
+ cudaMemset(dev_indices, 0, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMemset dev_indices failed!");
+ cudaMemset(dev_bools, 0, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMemset dev_bools failed!");
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy to device failed!");
+
+ timer().startGpuTimer();
+
+ // map
+ dim3 fullBlocksPerGrid((nearestPow2 + blockSize - 1) / blockSize);
+ Common::kernMapToBoolean<<<(n + blockSize - 1) / blockSize, threadsPerBlock>>> (n, dev_bools, dev_idata);
+
+ // scan
+
+ // Copy the bools to the indices array for scan
+ cudaMemcpy(dev_indices, dev_bools, nearestPow2 * sizeof(int), cudaMemcpyDeviceToDevice);
+ checkCUDAError("cudaMemcpy dev_bools to dev_indices failed!");
+
+ // upsweep
+ for (int d = 0; d < iters; d++) {
+ // calculate power of 2 offset with bitshift
+ int currOffset = 1 << d;
+ // Only call the number of threads that actually need to write no values in the current sweep level
+ dim3 fullBlocksPerGrid((nearestPow2 / (currOffset * 2) + blockSize - 1) / blockSize);
+ kernUpSweep<<>> (nearestPow2, currOffset, dev_indices);
+ }
+
+ // Set last value after upsweep to 0
+ cudaMemset(dev_indices + nearestPow2 - 1, 0, sizeof(int));
+
+ //downsweep
+ for (int d = iters - 1; d >= 0; d--) {
+ // calculate power of 2 offset with bitshift
+ int currOffset = 1 << d;
+ // Only call the number of threads that actually need to write no values in the current sweep level
+ dim3 fullBlocksPerGrid((nearestPow2 / (currOffset * 2) + blockSize - 1) / blockSize);
+ kernDownSweep<<>> (nearestPow2, currOffset, dev_indices);
+ }
+
+ // scatter
+ Common::kernScatter<<<(n + blockSize - 1) / blockSize, threadsPerBlock >>> (n, dev_odata, dev_idata, dev_bools, dev_indices);
+
timer().endGpuTimer();
- return -1;
+
+ // figure out num elements
+ int lastIndex;
+ int lastBool;
+ cudaMemcpy(&lastIndex, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ cudaMemcpy(&lastBool, dev_bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ int size = lastIndex + lastBool;
+ cudaMemcpy(odata, dev_odata, size * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy to host failed!");
+
+ // Free device arrays
+ cudaFree(dev_idata);
+ checkCUDAError("cudaFree dev_idata failed!");
+ cudaFree(dev_bools);
+ checkCUDAError("cudaFree dev_bools failed!");
+ cudaFree(dev_indices);
+ checkCUDAError("cudaFree dev_indices failed!");
+ cudaFree(dev_odata);
+ checkCUDAError("cudaFree dev_odata failed!");
+
+ return size;
}
}
}
diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h
index 803cb4fe..a9ba0032 100644
--- a/stream_compaction/efficient.h
+++ b/stream_compaction/efficient.h
@@ -8,6 +8,12 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata);
+ void sharedMemScan(int n, int* odata, const int* idata);
+
int compact(int n, int *odata, const int *idata);
+
+ void setBlockSize(int newBlockSize);
+
+ void setMemoryBankOptimized(bool memBankOptimized);
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 43088769..49383d8d 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -6,20 +6,83 @@
namespace StreamCompaction {
namespace Naive {
using StreamCompaction::Common::PerformanceTimer;
+
+ // Block variables
+ int blockSize = 128;
+ dim3 threadsPerBlock(blockSize);
+
+ // Data buffers to swap between each itertion
+ int* dev_dataBuf1;
+ int* dev_dataBuf2;
+
+ void setBlockSize(int newBlockSize) {
+ blockSize = newBlockSize;
+ threadsPerBlock = dim3(blockSize);
+ }
+
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}
// TODO: __global__
+
+ __global__ void kernNaiveScan(int n, int currOffset, int *odata, const int *idata) {
+ int index = threadIdx.x + (blockIdx.x * blockDim.x);
+ if (index >= n) {
+ return;
+ }
+ if (index >= currOffset) {
+ odata[index] = idata[index - currOffset] + idata[index];
+ }
+ else {
+ odata[index] = idata[index];
+ }
+ }
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
// TODO
+ int iters = ceil(log2(n));
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+ // set up device arrays
+ cudaMalloc((void**)&dev_dataBuf1, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_dataBuf1 failed!");
+ cudaMalloc((void**)&dev_dataBuf2, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_dataBuf2 failed!");
+ cudaMemcpy(dev_dataBuf1, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy to device failed!");
+
+ timer().startGpuTimer();
+
+ for (int d = 1; d <= iters; d++) {
+ // calculate power of 2 offset with bitshift
+ int currOffset = 1 << d - 1;
+ kernNaiveScan<<>> (n, currOffset, dev_dataBuf2, dev_dataBuf1);
+ int* temp = dev_dataBuf1;
+ dev_dataBuf1 = dev_dataBuf2;
+ dev_dataBuf2 = temp;
+ }
+
timer().endGpuTimer();
+
+ // Copy to host
+ if (n > 0) {
+ odata[0] = 0;
+ }
+ // dev_dataBuf1 is inclusive scan, so shift to make exclusive
+ cudaMemcpy(odata + 1, dev_dataBuf1, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy to host failed!");
+
+ // Free device arrays
+ cudaFree(dev_dataBuf1);
+ checkCUDAError("cudaFree dev_dataBuf1 failed!");
+ cudaFree(dev_dataBuf2);
+ checkCUDAError("cudaFree dev_dataBuf2 failed!");
+
}
}
}
diff --git a/stream_compaction/naive.h b/stream_compaction/naive.h
index 37dcb064..7e2e2e8a 100644
--- a/stream_compaction/naive.h
+++ b/stream_compaction/naive.h
@@ -7,5 +7,7 @@ namespace StreamCompaction {
StreamCompaction::Common::PerformanceTimer& timer();
void scan(int n, int *odata, const int *idata);
+
+ void setBlockSize(int newBlockSize);
}
}
diff --git a/stream_compaction/radix.cu b/stream_compaction/radix.cu
new file mode 100644
index 00000000..e0b7a713
--- /dev/null
+++ b/stream_compaction/radix.cu
@@ -0,0 +1,198 @@
+#include
+#include "cpu.h"
+
+#include "common.h"
+
+namespace StreamCompaction {
+ namespace Radix {
+ using StreamCompaction::Common::PerformanceTimer;
+ PerformanceTimer& timer()
+ {
+ static PerformanceTimer timer;
+ return timer;
+ }
+
+ // Block variables
+ int blockSize = 128;
+ dim3 threadsPerBlock(blockSize);
+
+ // Data buffers
+ int* dev_idata;
+ int* dev_bools;
+ int* dev_negbools;
+ int* dev_scanned;
+ int* dev_writingIndices;
+ int* dev_indices;
+ int* dev_scatterBools;
+ int* dev_odata;
+
+ void setBlockSize(int newBlockSize) {
+ blockSize = newBlockSize;
+ threadsPerBlock = dim3(blockSize);
+ }
+
+ __global__ void kernUpSweep(int nearestPow2, int currOffset, int* data) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ int parentIdx = (index + 1) * currOffset * 2 - 1;
+ int leftChildIdx = parentIdx - currOffset;
+ if (parentIdx >= nearestPow2) {
+ return;
+ }
+ data[parentIdx] += data[leftChildIdx];
+ }
+
+ __global__ void kernDownSweep(int nearestPow2, int currOffset, int* data) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ int parentIdx = (index + 1) * currOffset * 2 - 1;
+ int leftChildIdx = parentIdx - currOffset;
+ if (parentIdx >= nearestPow2) {
+ return;
+ }
+ int temp = data[leftChildIdx];
+ data[leftChildIdx] = data[parentIdx];
+ data[parentIdx] += temp;
+ }
+
+ __global__ void kernMapToBits(int n, int bit, const int* idata, int* bools) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= n) {
+ return;
+ }
+ // isolate the desired bit
+ bools[index] = (idata[index] >> bit) & 1;
+ }
+
+ __global__ void kernNegateArray(int n, const int* bools, int* negBools) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= n) {
+ return;
+ }
+ negBools[index] = 1 - bools[index];
+ }
+
+ __global__ void kernGetWritingIndices(int n, const int* scanned, int* writingIndices, int totalFalses) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= n) {
+ return;
+ }
+ writingIndices[index] = index - scanned[index] + totalFalses;
+ }
+
+ __global__ void kernGetScatterIndices(int n, const int* bools, const int* writingIndices, const int* scanned, int* indices) {
+ int index = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (index >= n) {
+ return;
+ }
+ indices[index] = bools[index] ? writingIndices[index] : scanned[index];
+ }
+
+
+ void radix(int n, int* odata, const int* idata) {
+ int iters = ilog2ceil(n);
+ int nearestPow2 = 1 << ilog2ceil(n);
+ // set up device arrays
+ cudaMalloc((void**)&dev_idata, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_idata failed!");
+ cudaMalloc((void**)&dev_bools, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_bools failed!");
+ cudaMalloc((void**)&dev_negbools, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_negbools failed!");
+ cudaMalloc((void**)&dev_scanned, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMalloc dev_scanned failed!");
+ cudaMalloc((void**)&dev_writingIndices, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_writingIndices failed!");
+ cudaMalloc((void**)&dev_indices, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_indices failed!");
+ cudaMalloc((void**)&dev_odata, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata failed!");
+ cudaMalloc((void**)&dev_scatterBools, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_scatterBools failed!");
+
+ cudaMemset(dev_scanned, 0, nearestPow2 * sizeof(int));
+ checkCUDAError("cudaMemset dev_scanned failed!");
+
+ cudaMemset(dev_scatterBools, 1, n * sizeof(int));
+ checkCUDAError("cudaMemset dev_scatterBools failed!");
+
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMemcpy to device failed!");
+
+ timer().startGpuTimer();
+
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+
+ // Cap at values at integer values up to million, can increase up to 32 for larger integers
+ const int numBits = 20;
+ for (int i = 0; i < numBits; i++) {
+ // split based on the ith least significant bit
+ // isolate current bit
+ kernMapToBits<<>>(n, i, dev_idata, dev_bools);
+ checkCUDAError("kernMapToBits failed!");
+
+ // Negate the bit values as bools
+ kernNegateArray<<>>(n, dev_bools, dev_negbools);
+ checkCUDAError("kernNegateArray failed!");
+
+ // Scan negated bools
+ cudaMemcpy(dev_scanned, dev_negbools, n * sizeof(int), cudaMemcpyDeviceToDevice);
+ checkCUDAError("cudaMemcpy to device failed!");
+ // upsweep
+ for (int d = 0; d < iters; d++) {
+ int currOffset = 1 << d;
+ dim3 fullBlocksPerGridScan((nearestPow2 / (currOffset * 2) + blockSize - 1) / blockSize);
+ kernUpSweep<<>> (nearestPow2, currOffset, dev_scanned);
+ }
+
+ // Set last value after upsweep to 0
+ cudaMemset(dev_scanned + nearestPow2 - 1, 0, sizeof(int));
+
+ //downsweep
+ for (int d = iters - 1; d >= 0; d--) {
+ int currOffset = 1 << d;
+ dim3 fullBlocksPerGridScan((nearestPow2 / (currOffset * 2) + blockSize - 1) / blockSize);
+ kernDownSweep<<>> (nearestPow2, currOffset, dev_scanned);
+ }
+
+ // Compute total number of falses
+ int totalFalses;
+ cudaMemcpy(&totalFalses, dev_scanned + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ int lastNegBool;
+ cudaMemcpy(&lastNegBool, dev_negbools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ totalFalses += lastNegBool;
+
+ // Compute writing indices
+ kernGetWritingIndices<<>>(n, dev_scanned, dev_writingIndices, totalFalses);
+ checkCUDAError("kernGetWritingIndices failed!");
+
+ // Get scatter indices
+ kernGetScatterIndices<<>>(n, dev_bools, dev_writingIndices, dev_scanned, dev_indices);
+ checkCUDAError("kernGetScatterIndices failed!");
+
+ // Scatter
+ Common::kernScatter <<>>(n, dev_odata, dev_idata, dev_scatterBools, dev_indices);
+ checkCUDAError("kernScatter failed!");
+
+ // Swap idata and odata
+ int* temp = dev_idata;
+ dev_idata = dev_odata;
+ dev_odata = temp;
+ }
+
+ timer().endGpuTimer();
+
+ // Copy back to host
+ cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy to host failed!");
+
+ // Free all device arrays
+ cudaFree(dev_idata);
+ cudaFree(dev_bools);
+ cudaFree(dev_negbools);
+ cudaFree(dev_scanned);
+ cudaFree(dev_writingIndices);
+ cudaFree(dev_indices);
+ cudaFree(dev_odata);
+ cudaFree(dev_scatterBools);
+ }
+ }
+}
diff --git a/stream_compaction/radix.h b/stream_compaction/radix.h
new file mode 100644
index 00000000..025bf921
--- /dev/null
+++ b/stream_compaction/radix.h
@@ -0,0 +1,13 @@
+#pragma once
+
+#include "common.h"
+
+namespace StreamCompaction {
+ namespace Radix {
+ StreamCompaction::Common::PerformanceTimer& timer();
+
+ void radix(int n, int *odata, const int *idata);
+
+ void setBlockSize(int newBlockSize);
+ }
+}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e7..224ae2ab 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,18 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ thrust::host_vector host_idata(idata, idata + n);
+ thrust::device_vector dev_idata = host_idata;
+ thrust::host_vector host_odata(idata, idata + n);
+ thrust::device_vector dev_odata = host_idata;
timer().startGpuTimer();
// TODO use `thrust::exclusive_scan`
// example: for device_vectors dv_in and dv_out:
// thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+ thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin());
+
timer().endGpuTimer();
+ thrust::copy(dev_odata.begin(), dev_odata.end(), odata);
}
}
}