diff --git a/README.md b/README.md
index 0e38ddb..50dc804 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,118 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Zhenzhong Tang
+ * [LinkedIn](https://www.linkedin.com/in/zhenzhong-anthony-tang-82334a210), [Instagram](https://instagram.com/toytag12), [personal website](https://toytag.net/)
+* Tested on: Windows 11 Pro 22H2, AMD EPYC 7V12 64-Core Processor (4 vCPU cores) @ 2.44GHz 28GiB, Tesla T4 16GiB (Azure)
-### (TODO: Your README)
+## Implementations
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+We have five different implementations to compare:
+- CPU Scan: iterates through the array and add the previous element to the current element.
+- GPU Naive Scan: each thread add its element to an element at some offset.
+- GPU Work Efficient Scan from Lecture: up-sweep and down-sweep.
+- GPU Work Efficient Scan from GPU Gems3
+ - Rearranged threads to prevent **warp partitioning**.
+ - Implemented **shared memory** acceleration.
+ - Optimized shared memory access to avoid **bank conflicts**.
+ - Recursive call to handle arbitrary array size without rounding to the next power of 2.
+- Thrust Scan: optimized implementation from NVIDIA.
+### Sample Output
+Tested with `int[2^29]` and CUDA Block Size 128.
+
+```
+****************
+** SCAN TESTS **
+****************
+ [ 39 29 12 48 27 43 42 11 9 8 5 1 5 ... 11 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 1370.71ms (std::chrono Measured)
+ [ 0 39 68 80 128 155 198 240 251 260 268 273 274 ... 263761477 263761488 ]
+==== cpu scan, non-power-of-two ====
+ elapsed time: 1369.71ms (std::chrono Measured)
+ [ 0 39 68 80 128 155 198 240 251 260 268 273 274 ... 263761350 263761388 ]
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 508.908ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 506.161ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 48.7929ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two ====
+ elapsed time: 48.7834ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 17.1458ms (CUDA Measured)
+ passed
+==== thrust scan, non-power-of-two ====
+ elapsed time: 16.7649ms (CUDA Measured)
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 3 0 0 0 1 3 1 3 3 1 3 0 1 ... 0 0 ]
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 1345.11ms (std::chrono Measured)
+ [ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 2 2 ]
+ passed
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 1300.07ms (std::chrono Measured)
+ [ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 1 2 ]
+ passed
+==== cpu compact with scan ====
+ elapsed time: 2031.72ms (std::chrono Measured)
+ [ 3 1 3 1 3 3 1 3 1 3 2 3 3 ... 2 2 ]
+ passed
+==== work-efficient compact, power-of-two ====
+ elapsed time: 183.325ms (CUDA Measured)
+ passed
+==== work-efficient compact, non-power-of-two ====
+ elapsed time: 180.469ms (CUDA Measured)
+ passed
+```
+
+
+## Performance Analysis
+Tests done with CUDA block size as 128 unless otherwise specified. And we choose to test array size with non-power-of-2 to represent more general use cases.
+
+### Array Size
+
+First, not shown in the graph, with less than or around a few thousands of elements, the CPU sequential scan is actually faster than the GPU implementations. This is because the overhead of copying data to GPU and back is too large compared to the actual computation time. The GPU implementations are all faster than the CPU sequential scan when the array size is large enough.
+
+
+
+Now let's look at the graph. With over 1 million elements, the CPU sequential scan starts to show inefficiency. Naive GPU implementation is actually a bit faster than the work efficient version from lecture. Based on GPU Gems3 Scan, the optimized work efficient scan is much faster than any previous methods. Thrust scan is the fastest and optimized implementation from NVIDIA.
+
+
+
+In conclusion, using thrust implementation as a baseline, the work efficient scan from lecture is around 10x to 20x slower. The optimized work efficient scan from gems3 is only around 1.5x to 3x slower than thrust, which is a big leap.
+
+### Block Size
+
+The influence of block size on performance is generally associated with hardware. It usually does not have a significant effect on the outcome.
+
+In our case, especially with work efficient scan based on gems3, the block size has a relatively large impact on performance. The optimal point is around 128, and the runtime is half as block size 16 and two thirds as block size 1024. This is because, the work efficient scan based on gems3 runs a full up-sweep and down-sweep scan in one kernel call. If block size is too small, then the kernel would be called recursively many times to complete the task. If block size is too large, then in the middle of the scan, a number of threads would be idle, waiting for other threads to finish. This is a waste of resources.
+
+
+
+All in All, the three GPU scan implementations follow the same trend, showing that the block size around 128 is the optimal point.
+
+### Why is My GPU Approach So Slow?
+
+This is true for relatively small size array and implementations from lecture slides. Copying to and from GPU, synchronizing threads, and other overheads are too large compared to the actual computation time. With a few thousand elements, the CPU could even store most of them in cache and run the computation even faster.
+
+However, when the array size is significantly bigger, to beyond 1 million, CPU starts to fall short. All GPU implementations are faster than CPU sequential scan. Thrust absolutely dominates the game. Our optimized work efficient scan from gems3 is only around 1.5x to 3x slower than thrust, not bad.
+
+So, fortunately, we did not encounter such issue. :D
+
+### Trace of the Thrust Scan
+
+
+
+
+We see the that thrust kernel call is very short. Our kernel is much slower. Thrust used static shared memory and more shared memory in total, and used more registers per thread. All these factors contribute to the performance difference.
diff --git a/img/Runtime Comparison for CPU and GPU Scan Bar.svg b/img/Runtime Comparison for CPU and GPU Scan Bar.svg
new file mode 100644
index 0000000..8a4a37c
--- /dev/null
+++ b/img/Runtime Comparison for CPU and GPU Scan Bar.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/img/Runtime Comparison for CPU and GPU Scan.svg b/img/Runtime Comparison for CPU and GPU Scan.svg
new file mode 100644
index 0000000..9caed2f
--- /dev/null
+++ b/img/Runtime Comparison for CPU and GPU Scan.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/img/Runtime Comparison for GPU Scan with different CUDA Block Size.svg b/img/Runtime Comparison for GPU Scan with different CUDA Block Size.svg
new file mode 100644
index 0000000..39e3ab8
--- /dev/null
+++ b/img/Runtime Comparison for GPU Scan with different CUDA Block Size.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/img/trace-gems3.png b/img/trace-gems3.png
new file mode 100644
index 0000000..6246a82
Binary files /dev/null and b/img/trace-gems3.png differ
diff --git a/img/trace-thrust.png b/img/trace-thrust.png
new file mode 100644
index 0000000..bd4bc2e
Binary files /dev/null and b/img/trace-thrust.png differ
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..c059448 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
#include
#include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 28; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
@@ -71,7 +71,6 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
- //printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);
zeroArray(SIZE, c);
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..698feb9 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -24,6 +24,10 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
+ int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (idx >= n) return;
+
+ bools[idx] = idata[idx] == 0 ? 0 : 1;
}
/**
@@ -33,6 +37,12 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
+ int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (idx >= n) return;
+
+ if (bools[idx] == 1) {
+ odata[indices[idx]] = idata[idx];
+ }
}
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..c5f4c13 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -5,6 +5,7 @@
namespace StreamCompaction {
namespace CPU {
+ bool disableScanTimer = false;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
@@ -18,9 +19,13 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startCpuTimer();
+ if (!disableScanTimer) timer().startCpuTimer();
// TODO
- timer().endCpuTimer();
+ odata[0] = 0;
+ for (int i = 1; i < n; i++) {
+ odata[i] = odata[i - 1] + idata[i - 1];
+ }
+ if (!disableScanTimer) timer().endCpuTimer();
}
/**
@@ -29,10 +34,16 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
+ int count = 0;
timer().startCpuTimer();
// TODO
+ for (int i = 0; i < n; i++) {
+ if (idata[i] != 0) {
+ odata[count++] = idata[i];
+ }
+ }
timer().endCpuTimer();
- return -1;
+ return count;
}
/**
@@ -41,10 +52,25 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
+ int* tmp = new int[n];
+ disableScanTimer = true;
timer().startCpuTimer();
// TODO
+ for (int i = 0; i < n; i++) {
+ tmp[i] = idata[i] == 0 ? 0 : 1;
+ }
+ scan(n, odata, tmp);
+ for (int i = 0; i < n; i++) {
+ if (tmp[i] != 0) {
+ // odata[i] <= i, so there is no race condition
+ odata[odata[i]] = idata[i];
+ }
+ }
timer().endCpuTimer();
- return -1;
+ disableScanTimer = false;
+ int count = odata[n - 1] + tmp[n - 1];
+ delete[] tmp;
+ return count;
}
}
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..d68003d 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -3,8 +3,19 @@
#include "common.h"
#include "efficient.h"
+/** default naive implementation from lecture (disable all defs)
+ * also three improved implementations of work-efficient scan
+ * 1. GEMS3 base
+ * 2. GEMS3 full impl
+ * 3. GEMS3 full impl with recursive scan (experimental)
+ */
+#define GEMS3 // tested, 5 ~ 10 times slower than thrust
+#define FULL // tested, 1.5 ~ 3 times slower than thrust
+//#define RECUR // works but not fully tested, this should work well for extremely large array, like size >= 2^32
+
namespace StreamCompaction {
namespace Efficient {
+ bool disableScanTimer = false;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
@@ -12,14 +23,263 @@ namespace StreamCompaction {
return timer;
}
+#ifdef GEMS3 // ref: https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
+
+#define BLOCK_SIZE 128
+#define THREADS_PER_BLOCK BLOCK_SIZE
+#define NUM_BLOCKS(n) (((n) + BLOCK_SIZE - 1) / BLOCK_SIZE)
+
+ __global__ void kernGEMS3WarpReorgWorkEfficientScanUpSweep(int n, int d, int offset, int* data) {
+ int thid = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (thid >= n || thid >= d) return;
+
+ int thidx2 = thid << 1;
+ int ai = offset * (thidx2 + 1) - 1;
+ int bi = ai + offset;
+
+ data[bi] += data[ai];
+ }
+
+ __global__ void kernGEMS3WarpReorgWorkEfficientScanDownSweep(int n, int d, int offset, int* data) {
+ int thid = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (thid >= n || thid >= d) return;
+
+ int thidx2 = thid << 1;
+ int ai = offset * (thidx2 + 1) - 1;
+ int bi = ai + offset;
+
+ int t = data[bi];
+ data[bi] += data[ai];
+ data[ai] = t;
+ }
+
+ void warpReorgScan(int n, int* dev_data) {
+ // upsweep
+ int offset = 1;
+ for (int d = n >> 1; d > 0; d >>= 1) {
+ kernGEMS3WarpReorgWorkEfficientScanUpSweep<<>>(n, d, offset, dev_data);
+ offset <<= 1;
+ }
+ cudaMemset(dev_data + n - 1, 0, sizeof(int));
+ // downsweep
+ for (int d = 1; d < n; d <<= 1) {
+ offset >>= 1;
+ kernGEMS3WarpReorgWorkEfficientScanDownSweep<<>>(n, d, offset, dev_data);
+ }
+ }
+
+#ifndef FULL
+ /**
+ * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+ void scan(int n, int* odata, const int* idata) {
+ int nextPow2_n = 1 << ilog2ceil(n);
+ int* dev_data;
+ cudaMalloc((void**)&dev_data, nextPow2_n * sizeof(int));
+ cudaMemset(dev_data, 0, nextPow2_n * sizeof(int));
+ cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMalloc, cudaMemset, cudaMemcpy dev_data failed!");
+ cudaDeviceSynchronize();
+ if (!disableScanTimer) timer().startGpuTimer();
+ // ----------------------------------
+ // TODO
+ warpReorgScan(nextPow2_n, dev_data);
+ // ----------------------------------
+ if (!disableScanTimer) timer().endGpuTimer();
+ // dev_data now contains an exclusive scan
+ cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_data failed!");
+ cudaFree(dev_data);
+ }
+#endif // !FULL
+#ifdef FULL
+
+#define ELEMENTS_PER_TILE (2 * BLOCK_SIZE)
+#define SHARED_MEM_SIZE (ELEMENTS_PER_TILE * sizeof(int))
+#define NUM_BANKS 32
+#define LOG_NUM_BANKS 5
+#define CONFLICT_FREE_OFFSET(n) ((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS))
+
+ __global__ void kernGEMS3FullWorkEfficientFixedSizeScan(int* odata, int* idata, int* incr) {
+ extern __shared__ int temp[];
+
+ int n = ELEMENTS_PER_TILE; // fixed size scan
+
+ int thid = threadIdx.x;
+ int thidx2 = thid << 1;
+ int blockOffset = n * blockIdx.x;
+
+ int ai = thid;
+ int bi = thid + (n >> 1);
+ int bankOffsetA = CONFLICT_FREE_OFFSET(ai);
+ int bankOffsetB = CONFLICT_FREE_OFFSET(bi);
+
+ temp[ai + bankOffsetA] = idata[blockOffset + ai];
+ temp[bi + bankOffsetB] = idata[blockOffset + bi];
+
+ int offset = 1;
+
+ for (int d = n >> 1; d > 0; d >>= 1) {
+ __syncthreads();
+ if (thid < d) {
+ int ai = offset * (thidx2 + 1) - 1;
+ int bi = ai + offset;
+ ai += CONFLICT_FREE_OFFSET(ai);
+ bi += CONFLICT_FREE_OFFSET(bi);
+
+ temp[bi] += temp[ai];
+ }
+ offset <<= 1;
+ }
+
+ __syncthreads();
+ if (thid == 0) {
+ incr[blockIdx.x] = temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)];
+ temp[n - 1 + CONFLICT_FREE_OFFSET(n - 1)] = 0;
+ }
+
+ for (int d = 1; d < n; d <<= 1) {
+ offset >>= 1;
+ __syncthreads();
+ if (thid < d) {
+ int ai = offset * (thidx2 + 1) - 1;
+ int bi = ai + offset;
+ ai += CONFLICT_FREE_OFFSET(ai);
+ bi += CONFLICT_FREE_OFFSET(bi);
+
+ int t = temp[ai];
+ temp[ai] = temp[bi];
+ temp[bi] += t;
+ }
+ }
+ __syncthreads();
+
+ odata[blockOffset + ai] = temp[ai + bankOffsetA];
+ odata[blockOffset + bi] = temp[bi + bankOffsetB];
+ }
+
+ __global__ void addIncr(int n, int* dev_data, int* dev_incr) {
+ int thid = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (thid >= n) return;
+ if (blockIdx.x > 0) dev_data[thid] += dev_incr[blockIdx.x >> 1];
+ }
+
+#ifdef RECUR // best for extremely large array, like size >= 2^32
+#define THRESHOULD 1 << 16
+ // recursive
+ void recurScan(int n, int* dev_odata, int* dev_idata, int* dev_incr) {
+ int numTiles = (n + ELEMENTS_PER_TILE - 1) / ELEMENTS_PER_TILE;
+
+ if (numTiles > THRESHOULD) {
+ kernGEMS3FullWorkEfficientFixedSizeScan<<>>(dev_odata, dev_idata, dev_incr);
+ recurScan(numTiles, dev_incr, dev_incr, dev_incr + numTiles);
+ addIncr<<>>(n, dev_odata, dev_incr);
+ } else {
+ // only once
+ cudaMemcpy(dev_odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToDevice);
+ checkCUDAError("cudaMemcpy dev_odata failed!");
+ int nextPow2_n = 1 << ilog2ceil(n);
+ warpReorgScan(nextPow2_n, dev_odata);
+ }
+ }
+#endif // RECUR
+
+ /**
+ * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+ void scan(int n, int* odata, const int* idata) {
+ int numTiles = (n + ELEMENTS_PER_TILE - 1) / ELEMENTS_PER_TILE;
+ int nPad = numTiles * ELEMENTS_PER_TILE;
+ int* dev_data;
+ cudaMalloc((void**)&dev_data, nPad * sizeof(int));
+ cudaMemset(dev_data, 0, nPad * sizeof(int));
+ cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMalloc, cudaMemset, cudaMemcpy dev_data failed!");
+ int* dev_incr;
+ // geometric series guarantees 2 * numTiles * sizeof(int) is enough
+ cudaMalloc((void**)&dev_incr, 2 * numTiles * sizeof(int));
+ checkCUDAError("cudaMalloc dev_incr failed!");
+ cudaDeviceSynchronize();
+ if (!disableScanTimer) timer().startGpuTimer();
+ // ----------------------------------
+ // TODO
+#ifndef RECUR
+ kernGEMS3FullWorkEfficientFixedSizeScan<<>>(dev_data, dev_data, dev_incr);
+ warpReorgScan(numTiles, dev_incr);
+ addIncr<<>>(n, dev_data, dev_incr);
+#else
+ recurScan(n, dev_data, dev_data, dev_incr);
+#endif // !RECUR
+ // ----------------------------------
+ if (!disableScanTimer) timer().endGpuTimer();
+ // dev_data now contains an exclusive scan
+ cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_data failed!");
+ cudaFree(dev_data);
+ cudaFree(dev_incr);
+ }
+#endif // FULL
+#else // naive from lecture
+ __global__ void kernNaiveWorkEfficientScanUpSweep(int n, int d, int* data) {
+ int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (k >= n) return;
+
+ int pow2_d = 1 << d;
+ int pow2_dp1 = pow2_d << 1;
+
+ if ((k & (pow2_dp1 - 1)) == 0) {
+ data[k + pow2_dp1 - 1] += data[k + pow2_d - 1];
+ }
+ }
+
+ __global__ void kernNaiveWorkEfficientScanDownSweep(int n, int d, int* data) {
+ int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (k >= n) return;
+
+ int pow2_d = 1 << d;
+ int pow2_dp1 = pow2_d << 1;
+
+ if ((k & (pow2_dp1 - 1)) == 0) {
+ int t = data[k + pow2_d - 1];
+ data[k + pow2_d - 1] = data[k + pow2_dp1 - 1];
+ data[k + pow2_dp1 - 1] += t;
+ }
+ }
+
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
+ int nextPow2_n = 1 << ilog2ceil(n);
+ int* dev_data;
+ cudaMalloc((void**)&dev_data, nextPow2_n * sizeof(int));
+ cudaMemset(dev_data, 0, nextPow2_n * sizeof(int));
+ cudaMemcpy(dev_data, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMalloc, cudaMemset, cudaMemcpy dev_data failed!");
+ cudaDeviceSynchronize();
+ if (!disableScanTimer) timer().startGpuTimer();
+ // ----------------------------------
// TODO
- timer().endGpuTimer();
+ int blockSize = 128;
+ dim3 threadsPerBlock(blockSize);
+ dim3 fullBlocksPerGrid((nextPow2_n + blockSize - 1) / blockSize);
+ // upsweep
+ for (int d = 0; d < ilog2ceil(n); d++) {
+ kernNaiveWorkEfficientScanUpSweep<<>>(nextPow2_n, d, dev_data);
+ }
+ cudaMemset(dev_data + nextPow2_n - 1, 0, sizeof(int));
+ // downsweep
+ for (int d = ilog2ceil(n) - 1; d >= 0; d--) {
+ kernNaiveWorkEfficientScanDownSweep<<>>(nextPow2_n, d, dev_data);
+ }
+ // ----------------------------------
+ if (!disableScanTimer) timer().endGpuTimer();
+ // dev_data now contains an exclusive scan
+ cudaMemcpy(odata, dev_data, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_data failed!");
+ cudaFree(dev_data);
}
+#endif // GEMS3
/**
* Performs stream compaction on idata, storing the result into odata.
@@ -31,10 +291,43 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
+ int* dev_idata;
+ cudaMalloc((void**)&dev_idata, n * sizeof(int));
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMalloc, cudaMemcpy dev_idata failed!");
+ int* dev_odata;
+ cudaMalloc((void**)&dev_odata, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata failed!");
+ int* dev_bools;
+ cudaMalloc((void**)&dev_bools, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_bools failed!");
+ int* dev_indices;
+ cudaMalloc((void**)&dev_indices, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_indices failed!");
+ cudaDeviceSynchronize();
+ disableScanTimer = true;
timer().startGpuTimer();
+ // ----------------------------------
// TODO
+ int blockSize = 128;
+ dim3 threadsPerBlock(blockSize);
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+ StreamCompaction::Common::kernMapToBoolean<<>>(n, dev_bools, dev_idata);
+ scan(n, dev_indices, dev_bools);
+ StreamCompaction::Common::kernScatter<<>>(n, dev_odata, dev_idata, dev_bools, dev_indices);
+ // ----------------------------------
timer().endGpuTimer();
- return -1;
+ disableScanTimer = false;
+ int count;
+ cudaMemcpy(&count, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ count += idata[n - 1] == 0 ? 0 : 1;
+ cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy failed!");
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ cudaFree(dev_bools);
+ cudaFree(dev_indices);
+ return count;
}
}
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..724a3aa 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -12,14 +12,46 @@ namespace StreamCompaction {
return timer;
}
// TODO: __global__
+ __global__ void kernNaiveScan(int n, int d, int *odata, const int *idata) {
+ int k = (blockIdx.x * blockDim.x) + threadIdx.x;
+ if (k >= n) return;
+
+ int pow2_dm1 = 1 << (d - 1);
+ if (k >= pow2_dm1) {
+ odata[k] = idata[k - pow2_dm1] + idata[k];
+ } else {
+ odata[k] = idata[k];
+ }
+ }
/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ int* dev_odata;
+ cudaMalloc((void**)&dev_odata, n * sizeof(int));
+ checkCUDAError("cudaMalloc dev_odata failed!");
+ int* dev_idata;
+ cudaMalloc((void**)&dev_idata, n * sizeof(int));
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("cudaMalloc and cudaMemcpy dev_idata failed!");
+ cudaDeviceSynchronize();
timer().startGpuTimer();
// TODO
+ int blockSize = 128;
+ dim3 threadsPerBlock(blockSize);
+ dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
+ for (int d = 1; d <= ilog2ceil(n); d++) {
+ kernNaiveScan<<>>(n, d, dev_odata, dev_idata);
+ std::swap(dev_odata, dev_idata);
+ }
timer().endGpuTimer();
+ // dev_idata (not dev_odata) now contains an inclusive scan
+ cudaMemcpy(odata + 1, dev_idata, (n - 1) * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("cudaMemcpy dev_odata failed!");
+ odata[0] = 0;
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
}
}
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..5b8894a 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -18,11 +18,16 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+ thrust::device_vector dev_idata(idata, idata + n);
+ thrust::device_vector dev_odata(odata, odata + n);
timer().startGpuTimer();
// TODO use `thrust::exclusive_scan`
// example: for device_vectors dv_in and dv_out:
// thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+ thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin());
+ // thrust::exclusive_scan(idata, idata + n, odata);
timer().endGpuTimer();
+ thrust::copy(dev_odata.begin(), dev_odata.end(), odata);
}
}
}