diff --git a/README.md b/README.md
index 0e38ddb..2984858 100644
--- a/README.md
+++ b/README.md
@@ -3,12 +3,152 @@ CUDA Stream Compaction
**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
-* (TODO) YOUR NAME HERE
- * (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Yuanqi Wang
+ * [LinkedIn](https://www.linkedin.com/in/yuanqi-wang-414b26106/), [GitHub](https://github.com/plasmas).
+* Tested on: Windows 11, i5-11600K @ 3.91GHz 32GB, RTX 2060 6GB (Personal Desktop)
-### (TODO: Your README)
+## Description
-Include analysis, etc. (Remember, this is public, so don't put
-anything here that you don't want to share with the world.)
+This project contains the following implementations:
+* CPU implementation of scan and stream compaction.
+* Naive GPU implementation of scan.
+* Work-efficient GPU implementation of scan and stream compaction.
+* Extra credit (Part 5): thread-count optimization for the work-efficient GPU implementation.
+
+The scan and stream compaction algorithms implemented here are based on [GPU Gems 3 Chap. 39: Parallel Prefix Sum (Scan) with CUDA](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda).
+
+## Analysis
+
+### 1. Block Size Optimization
+
+To determine the optimum block size for both naive and efficient GPU scan, I performed a exponential scan on block size and measure the time cost at different block size.
+
+![Time vs. block size](img/time_vs_block_size.svg)
+
+| block size | 1 | 2 | 4 | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024 |
+| -------------- | ------- | ------- | ------- | ------- | ------- | ------ | ------- | ------- | ------- | ------- | ------- |
+| naive (ms) | 1077.99 | 608.779 | 357.073 | 197.336 | 110.639 | 58.122 | 30.6088 | 27.9214 | 27.9087 | 28.1457 | 29.4622 |
+| efficient (ms) | 109.593 | 64.3612 | 32.4774 | 20.6583 | 13.6459 | 13.546 | 13.7829 | 13.4237 | 13.3781 | 13.716 | 13.6309 |
+
+The time cost of both implementations drops as the block size increases. The naive implementation plateaus at around a block size of 64, while the efficient implementation plateaus at around 16. Both reach their best times at a block size of 256, so 256 is used for all subsequent comparisons.
+
+An interesting observation is that the efficient implementation stops improving at a block size of 16, which is smaller than the warp size of 32 threads. Without further profiling, my best guess is that memory access (e.g., coalescing) happens to be near-optimal at a block size of 16.
+
+### 2. Comparison of GPU & CPU Implementations of Scan
+
+With the block size fixed at the optimum of 256, all implementations are timed (in ms) across a range of input array lengths.
+
+![Time vs. array length](img/time_vs_size.svg)
+
+| log(array length) | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 |
+| -------------------- | -------- | -------- | ------- | -------- | ------- | ------- | ------- | ------- | ------- | ------- |
+| CPU | 0.8013 | 1.5922 | 3.1945 | 6.4031 | 12.6507 | 25.3825 | 51.2388 | 98.7369 | 234.998 | 394.874 |
+| GPU (Naive) | 1.48966 | 4.04874 | 6.53651 | 14.7808 | 28.2829 | 56.4302 | 121.065 | 250.27 | 526.096 | 14634.9 |
+| GPU (Work-Efficient) | 1.03702 | 1.69731 | 3.36416 | 7.0199 | 13.6784 | 27.2429 | 52.4784 | 104.398 | 209.717 | 418.342 |
+| GPU (Thrust) | 0.485376 | 0.519936 | 0.72192 | 0.979072 | 1.81453 | 2.75165 | 4.78301 | 9.24336 | 16.9555 | 516.865 |
+
+On average, the Thrust implementation is significantly faster than all the others. The CPU and work-efficient implementations perform about the same, while the naive GPU implementation is the slowest.
+
+That the naive GPU implementation is slower than the work-efficient one is expected: it performs far more arithmetic work, and since neither kernel uses shared memory, that extra work translates directly into extra global-memory traffic.
+
+The work-efficient GPU implementation improves on the naive one in the sense that it lowers the total work to $O(n)$ additions, and it can finish in $O(\log n)$ steps given unlimited parallelism.
+
+### 3. Result & Bottleneck Analysis
+
+To identify the bottlenecks in the GPU implementations, I used NVIDIA Nsight Compute to profile the tests and gather metrics for each kernel call, and NVIDIA Nsight Systems to view the trace of all implementations.
+
+* Nsight System Trace for Naive & Work-Efficient Implementation
+
+![Nsight Systems trace: naive & work-efficient](img/naive_efficient.png)
+
+* Nsight System Trace for Thrust Implementation
+
+![Nsight Systems trace: Thrust](img/thrust.png)
+
+#### Naive GPU Implementation
+
+[Nsight Compute Profile Result](./img/naive.png)
+
+Judging from the profile of a single kernel invocation, the naive GPU implementation suffers from the following problems:
+1. High memory utilization and low SM utilization. SM throughput is only 21.86% while memory throughput is 88.41%, which suggests the naive implementation is memory-bound and underutilizes compute resources.
+2. Inefficient memory access. The memory workload analysis shows only a 29.27% hit rate in the L1/TEX cache, and the bandwidth between the L2 cache and device memory is over 40% utilized. Since this implementation uses no shared memory, frequent round trips to device memory are a likely source of performance loss.
+
+#### Efficient GPU Implementation
+
+[Nsight Compute Profile Result on Up-Sweep Kernel](./img/efficient-upsweep.png)
+
+[Nsight Compute Profile Result on Down-Sweep Kernel](./img/efficient-downsweep.png)
+
+The up-sweep and down-sweep kernels both suffer from the following problem:
+1. Low SM and memory utilization. Nsight Compute reports that the grid is too small to fill the available resources. This makes sense: the number of threads dispatched is halved (up-sweep) or doubled (down-sweep) at each level, so many invocations launch very few threads, and with a fixed block size of 256, a large portion of each block sits idle.
+
+#### Thrust GPU Implementation
+
+Based on the Nsight Systems trace of memory and SM usage, Thrust's scan is highly optimized, with the maximum number of SMs active and very high SM warp occupancy. The naive and work-efficient implementations, despite also showing high SM warp occupancy, leave over 20% of warps unallocated on the SMs.
+
+There are several optimizations that Thrust may use to boost performance:
+
+* Since there are gaps in SM usage within Thrust's timeline, Thrust may use a divide-and-conquer strategy, breaking the problem into smaller pieces to enhance performance.
+* Optimal block sizes: Thrust may choose the optimal block size based on the GPU and the workload, maximizing hardware usage and occupancy.
+* Optimal memory access: Thrust may ensure that accesses to global memory are coalesced, while also using shared memory and prefetching to boost memory efficiency.
+
+
+### 4. Log Sample
+
+Below is a sample log from a run on an array of size $2^{30}$.
+
+```
+****************
+** SCAN TESTS **
+****************
+ [ 11 12 37 10 15 24 35 11 24 21 49 5 17 ... 0 0 ]
+==== cpu scan, power-of-two ====
+ elapsed time: 394.874ms (std::chrono Measured)
+ [ 0 11 23 60 70 85 109 144 155 179 200 249 254 ... 527930117 527930117 ]
+==== cpu scan, non-power-of-two ====
+ elapsed time: 448.757ms (std::chrono Measured)
+ [ 0 11 23 60 70 85 109 144 155 179 200 249 254 ... 527930053 527930081 ]
+ passed
+==== naive scan, power-of-two ====
+ elapsed time: 14634.9ms (CUDA Measured)
+ passed
+==== naive scan, non-power-of-two ====
+ elapsed time: 14747.1ms (CUDA Measured)
+ passed
+==== work-efficient scan, power-of-two ====
+ elapsed time: 418.342ms (CUDA Measured)
+ passed
+==== work-efficient scan, non-power-of-two ====
+ elapsed time: 417.047ms (CUDA Measured)
+ passed
+==== thrust scan, power-of-two ====
+ elapsed time: 516.865ms (CUDA Measured)
+ passed
+==== thrust scan, non-power-of-two ====
+ elapsed time: 477.864ms (CUDA Measured)
+ passed
+
+*****************************
+** STREAM COMPACTION TESTS **
+*****************************
+ [ 2 2 3 2 0 1 2 1 3 0 1 2 1 ... 1 0 ]
+==== cpu compact without scan, power-of-two ====
+ elapsed time: 1982.29ms (std::chrono Measured)
+ [ 2 2 3 2 1 2 1 3 1 2 1 1 2 ... 3 1 ]
+ passed
+==== cpu compact without scan, non-power-of-two ====
+ elapsed time: 1984.63ms (std::chrono Measured)
+ [ 2 2 3 2 1 2 1 3 1 2 1 1 2 ... 1 3 ]
+ passed
+==== cpu compact with scan ====
+ elapsed time: 5014.15ms (std::chrono Measured)
+ [ 2 2 3 2 1 2 1 3 1 2 1 1 2 ... 3 1 ]
+ passed
+==== work-efficient compact, power-of-two ====
+ elapsed time: 31080.5ms (CUDA Measured)
+ passed
+==== work-efficient compact, non-power-of-two ====
+ elapsed time: 31348.9ms (CUDA Measured)
+ passed
+```
\ No newline at end of file
diff --git a/img/efficient-downsweep.png b/img/efficient-downsweep.png
new file mode 100644
index 0000000..6a93793
Binary files /dev/null and b/img/efficient-downsweep.png differ
diff --git a/img/efficient-upsweep.png b/img/efficient-upsweep.png
new file mode 100644
index 0000000..d1ed77d
Binary files /dev/null and b/img/efficient-upsweep.png differ
diff --git a/img/naive.png b/img/naive.png
new file mode 100644
index 0000000..87b93ec
Binary files /dev/null and b/img/naive.png differ
diff --git a/img/naive_efficient.png b/img/naive_efficient.png
new file mode 100644
index 0000000..9b0ce60
Binary files /dev/null and b/img/naive_efficient.png differ
diff --git a/img/thrust.png b/img/thrust.png
new file mode 100644
index 0000000..dfda11e
Binary files /dev/null and b/img/thrust.png differ
diff --git a/img/time_vs_block_size.svg b/img/time_vs_block_size.svg
new file mode 100644
index 0000000..facce78
--- /dev/null
+++ b/img/time_vs_block_size.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/img/time_vs_size.svg b/img/time_vs_size.svg
new file mode 100644
index 0000000..c3f031a
--- /dev/null
+++ b/img/time_vs_size.svg
@@ -0,0 +1 @@
+
\ No newline at end of file
diff --git a/src/main.cpp b/src/main.cpp
index 896ac2b..d4a3150 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -13,7 +13,7 @@
#include
#include "testing_helpers.hpp"
-const int SIZE = 1 << 8; // feel free to change the size of array
+const int SIZE = 1 << 25; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int *a = new int[SIZE];
int *b = new int[SIZE];
diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu
index 2ed6d63..61b0dfb 100644
--- a/stream_compaction/common.cu
+++ b/stream_compaction/common.cu
@@ -1,4 +1,5 @@
#include "common.h"
+#include "device_launch_parameters.h"
void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
@@ -24,6 +25,11 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
+ int idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (idx >= n) {
+ return;
+ }
+ bools[idx] = idata[idx] == 0 ? 0 : 1;
}
/**
@@ -33,6 +39,13 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
+ int idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (idx >= n) {
+ return;
+ }
+ if (bools[idx] == 1) {
+ odata[indices[idx]] = idata[idx];
+ }
}
}
diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu
index 719fa11..0b989d5 100644
--- a/stream_compaction/cpu.cu
+++ b/stream_compaction/cpu.cu
@@ -20,6 +20,11 @@ namespace StreamCompaction {
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
+ int running_sum = 0;
+ for (int i = 0; i < n; i++) {
+ odata[i] = running_sum;
+ running_sum += idata[i];
+ }
timer().endCpuTimer();
}
@@ -31,8 +36,15 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
+ int l_ptr = 0;
+ for (int r_ptr = 0; r_ptr < n; r_ptr++) {
+ if (idata[r_ptr] != 0) {
+ odata[l_ptr] = idata[r_ptr];
+ l_ptr++;
+ }
+ }
timer().endCpuTimer();
- return -1;
+ return l_ptr;
}
/**
@@ -41,10 +53,30 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
+ int* mask = new int[n];
timer().startCpuTimer();
// TODO
+ // 1/0 mask on idata
+ int running_sum = 0;
+ for (int i = 0; i < n; i++) {
+ mask[i] = idata[i] == 0 ? 0 : 1;
+ }
+ // scanning, exclusive prefix sum saved to odata
+ for (int i = 0; i < n; i++) {
+ odata[i] = running_sum;
+ running_sum += mask[i];
+ }
+ // relocate data
+ int remain_cnt = 0;
+ for (int i = 0; i < n; i++) {
+ if (mask[i] == 1) {
+ odata[odata[i]] = idata[i];
+ remain_cnt++;
+ }
+ }
timer().endCpuTimer();
- return -1;
+ delete[] mask;
+ return remain_cnt;
}
}
}
diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu
index 2db346e..1f6bf5a 100644
--- a/stream_compaction/efficient.cu
+++ b/stream_compaction/efficient.cu
@@ -2,39 +2,166 @@
#include <cuda_runtime.h>
#include "common.h"
#include "efficient.h"
+#include "device_launch_parameters.h"
+#include "nvtx3/nvToolsExt.h"
+
+#define BLOCK_SIZE 256
namespace StreamCompaction {
- namespace Efficient {
- using StreamCompaction::Common::PerformanceTimer;
- PerformanceTimer& timer()
- {
- static PerformanceTimer timer;
- return timer;
- }
-
- /**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
- */
- void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
- }
-
- /**
- * Performs stream compaction on idata, storing the result into odata.
- * All zeroes are discarded.
- *
- * @param n The number of elements in idata.
- * @param odata The array into which to store elements.
- * @param idata The array of elements to compact.
- * @returns The number of elements remaining after compaction.
- */
- int compact(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
- return -1;
- }
+ namespace Efficient {
+ using StreamCompaction::Common::PerformanceTimer;
+ PerformanceTimer& timer()
+ {
+ static PerformanceTimer timer;
+ return timer;
+ }
+
+ // padded_n must be a power of 2
+ __global__ void up_sweep(int* data, int d, int num_thds) {
+ int thd_idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (thd_idx >= num_thds) {
+ // block not full, terminate threads early
+ // number of early terminated threads < BLOCK_SIZE always
+ return;
+ }
+ // index of the element in the array that will be updated
+ int arr_idx = ((thd_idx + 1) << (d + 1)) - 1;
+ // index of the element whose value will be added to data[arr_idx]
+ int add_idx = arr_idx - (1 << d);
+ // update element
+ data[arr_idx] += data[add_idx];
+ }
+
+ __global__ void down_sweep(int* data, int d, int num_thds) {
+ int thd_idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (thd_idx >= num_thds) {
+ // block not full, terminate threads early
+ // number of early terminated threads < BLOCK_SIZE always
+ return;
+ }
+ // index of the left cell, which will inherit value from r_idx
+ int l_idx = (thd_idx << (d + 1)) + (1 << d) - 1;
+ int r_idx = l_idx + (1 << d);
+ int tmp = data[l_idx];
+ data[l_idx] = data[r_idx];
+ data[r_idx] += tmp;
+ }
+
+ __global__ void nullify_last_elem(int padded_n, int* data) {
+ data[padded_n - 1] = 0;
+ }
+
+ /**
+ * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+ void scan(int n, int *odata, const int *idata) {
+ int layer = ilog2ceil(n);
+ int padded_n = 1 << layer;
+ int num_thds = padded_n;
+ int *dev_buffer;
+ cudaMalloc((void**)&dev_buffer, padded_n * sizeof(int));
+ checkCUDAError("failed to cudaMalloc buffer");
+ cudaMemcpy(dev_buffer, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("failed to copy idata to buffer");
+ nvtxRangePushA("Work-Efficient Scan");
+ timer().startGpuTimer();
+ // TODO
+ for (int d = 0; d < layer; d++) {
+ // update #threads needed
+ num_thds >>= 1;
+ int grid_size = (num_thds + BLOCK_SIZE - 1) / BLOCK_SIZE;
+                up_sweep<<<grid_size, BLOCK_SIZE>>>(dev_buffer, d, num_thds);
+ }
+ nullify_last_elem<<<1, 1>>>(padded_n, dev_buffer);
+ for (int d = layer - 1; d >= 0; d--) {
+ int grid_size = (num_thds + BLOCK_SIZE - 1) / BLOCK_SIZE;
+                down_sweep<<<grid_size, BLOCK_SIZE>>>(dev_buffer, d, num_thds);
+ num_thds <<= 1;
+ }
+ timer().endGpuTimer();
+ cudaDeviceSynchronize();
+ nvtxRangePop();
+ cudaMemcpy(odata, dev_buffer, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("failed to copy buffer to odata");
+ cudaFree(dev_buffer);
+ checkCUDAError("failed to free dev_buffer");
+ }
+
+ /**
+ * Performs stream compaction on idata, storing the result into odata.
+ * All zeroes are discarded.
+ *
+ * @param n The number of elements in idata.
+ * @param odata The array into which to store elements.
+ * @param idata The array of elements to compact.
+ * @returns The number of elements remaining after compaction.
+ */
+ int compact(int n, int *odata, const int *idata) {
+ int layer = ilog2ceil(n);
+ int padded_n = 1 << layer;
+ int num_thds = padded_n;
+ int *dev_bool, *dev_idata, *dev_indices, *dev_odata;
+
+ // malloc all memory
+ cudaMalloc((void**)&dev_idata, padded_n * sizeof(int));
+ checkCUDAError("failed to cudaMalloc dev_idata");
+ cudaMalloc((void**)&dev_bool, padded_n * sizeof(int));
+ checkCUDAError("failed to malloc dev_bool");
+ cudaMalloc((void**)&dev_indices, padded_n * sizeof(int));
+ checkCUDAError("failed to malloc dev_indices");
+ cudaMalloc((void**)&dev_odata, padded_n * sizeof(int));
+ checkCUDAError("failed to malloc dev_odata");
+
+ // copy input to dev_idata
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("failed to copy idata to buffer");
+
+ timer().startGpuTimer();
+ // TODO
+ // create mask array in dev_bool
+ int grid_size = (num_thds + BLOCK_SIZE - 1) / BLOCK_SIZE;
+            Common::kernMapToBoolean<<<grid_size, BLOCK_SIZE>>>(padded_n, dev_bool, dev_idata);
+ // copy mask to dev_indices for in-place scan
+ cudaMemcpy(dev_indices, dev_bool, padded_n * sizeof(int), cudaMemcpyDeviceToDevice);
+
+ // in-place scan in dev_indices
+ for (int d = 0; d < layer; d++) {
+ // update #threads needed
+ num_thds >>= 1;
+ grid_size = (num_thds + BLOCK_SIZE - 1) / BLOCK_SIZE;
+                up_sweep<<<grid_size, BLOCK_SIZE>>>(dev_indices, d, num_thds);
+ }
+ nullify_last_elem<<<1, 1>>>(padded_n, dev_indices);
+ for (int d = layer - 1; d >= 0; d--) {
+ grid_size = (num_thds + BLOCK_SIZE - 1) / BLOCK_SIZE;
+                down_sweep<<<grid_size, BLOCK_SIZE>>>(dev_indices, d, num_thds);
+ num_thds <<= 1;
+ }
+
+ // scatter
+ grid_size = (num_thds + BLOCK_SIZE - 1) / BLOCK_SIZE;
+            Common::kernScatter<<<grid_size, BLOCK_SIZE>>>(padded_n, dev_odata, dev_idata, dev_bool, dev_indices);
+ timer().endGpuTimer();
+
+ // copy result back to odata
+ cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("failed to copy dev_odata to odata");
+
+ // compute number of remaining elements
+ int last_bool, last_index;
+ cudaMemcpy(&last_bool, dev_bool + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("failed to fetch the last element in dev_bool");
+ cudaMemcpy(&last_index, dev_indices + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("failed to fetch the last element in dev_indices");
+ int remaining_cnt = last_bool + last_index;
+
+ // cleanup
+ cudaFree(dev_idata);
+ cudaFree(dev_odata);
+ cudaFree(dev_bool);
+ cudaFree(dev_indices);
+ checkCUDAError("failed to free memories");
+ return remaining_cnt;
}
+ }
}
diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu
index 4308876..0851e04 100644
--- a/stream_compaction/naive.cu
+++ b/stream_compaction/naive.cu
@@ -2,24 +2,73 @@
#include <cuda_runtime.h>
#include "common.h"
#include "naive.h"
+#include "device_launch_parameters.h"
+#include "nvtx3/nvToolsExt.h"
+
+#define BLOCK_SIZE 256
namespace StreamCompaction {
- namespace Naive {
- using StreamCompaction::Common::PerformanceTimer;
- PerformanceTimer& timer()
- {
- static PerformanceTimer timer;
- return timer;
- }
- // TODO: __global__
+ namespace Naive {
+ using StreamCompaction::Common::PerformanceTimer;
+ PerformanceTimer& timer()
+ {
+ static PerformanceTimer timer;
+ return timer;
+ }
+ // TODO: __global__
+ __global__ void scan_single_aggregate(int n, int* odata, const int* idata, int offset) {
+ int idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (idx >= n) {
+ return;
+ }
+ odata[idx] = idx < offset ? idata[idx] : idata[idx] + idata[idx - offset];
+ }
+
+ __global__ void inclusive_to_exclusive(int n, int* incl, int* excl) {
+ int idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (idx >= n) {
+ return;
+ }
+ excl[idx] = idx == 0 ? 0 : incl[idx - 1];
+ }
+
+ __global__ void exclusive_to_inclusive(int n, int* excl, int* incl, int last_num) {
+ int idx = threadIdx.x + blockIdx.x * blockDim.x;
+ if (idx >= n) {
+ return;
+ }
+ incl[idx] = idx == n - 1 ? excl[n - 1] + last_num : excl[idx + 1];
+ }
- /**
- * Performs prefix-sum (aka scan) on idata, storing the result into odata.
- */
- void scan(int n, int *odata, const int *idata) {
- timer().startGpuTimer();
- // TODO
- timer().endGpuTimer();
- }
+ /**
+ * Performs prefix-sum (aka scan) on idata, storing the result into odata.
+ */
+ void scan(int n, int *odata, const int *idata) {
+ dim3 gridDim((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
+ int *dev_odata, *dev_idata;
+ cudaMalloc((void**)& dev_odata, n * sizeof(int));
+ checkCUDAError("failed to malloc dev_odata");
+ cudaMalloc((void**)& dev_idata, n * sizeof(int));
+ checkCUDAError("failed to malloc dev_idata");
+ cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice);
+ checkCUDAError("failed to copy idata to dev_idata");
+ nvtxRangePushA("Naive Scan");
+ timer().startGpuTimer();
+ // TODO
+ for (int d = 1; d <= ilog2ceil(n); d++) {
+ int offset = 1 << (d - 1);
+                scan_single_aggregate<<<gridDim, BLOCK_SIZE>>>(n, dev_odata, dev_idata, offset);
+ std::swap(dev_odata, dev_idata);
+ }
+            inclusive_to_exclusive<<<gridDim, BLOCK_SIZE>>>(n, dev_idata, dev_odata);
+ timer().endGpuTimer();
+ cudaDeviceSynchronize();
+ nvtxRangePop();
+ cudaMemcpy(odata, dev_odata, n * sizeof(int), cudaMemcpyDeviceToHost);
+ checkCUDAError("failed to copy dev_idata to odata");
+ cudaFree(dev_odata);
+ cudaFree(dev_idata);
+ checkCUDAError("cudaFree failed");
}
+ }
}
diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu
index 1def45e..31c9f09 100644
--- a/stream_compaction/thrust.cu
+++ b/stream_compaction/thrust.cu
@@ -5,6 +5,7 @@
#include <thrust/scan.h>
#include "common.h"
#include "thrust.h"
+#include "nvtx3/nvToolsExt.h"
namespace StreamCompaction {
namespace Thrust {
@@ -18,11 +19,18 @@ namespace StreamCompaction {
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
+            thrust::device_vector<int> dev_idata(idata, idata + n);
+            thrust::device_vector<int> dev_odata(n);
+ nvtxRangePushA("Thrust Scan");
timer().startGpuTimer();
// TODO use `thrust::exclusive_scan`
// example: for device_vectors dv_in and dv_out:
// thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
+ thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin());
timer().endGpuTimer();
+ cudaDeviceSynchronize();
+ nvtxRangePop();
+ thrust::copy(dev_odata.begin(), dev_odata.end(), odata);
}
}
}