diff --git a/README.md b/README.md index 0e38ddb1..8222ea2f 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,149 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +![](img/stream_compaction_diagram.png) -### (TODO: Your README) +* Oliver Hendrych + * [LinkedIn](https://www.linkedin.com/in/oliver-hendrych/) +* Tested on: Windows 11, i7-14700K @ 3.4GHz 32GB, RTX 4070 SUPER 12GB (Personal) -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +## Description +This project implements exclusive scan (prefix sum) on the CPU, with a naive GPU implementation, with a work-efficient GPU implementation, and finally by wrapping thrust calls. Leveraging these scan implementations, it also implements stream compaction, which is the process of removing certain (in this case, 0) elements from an array. Stream compaction is achieved through CPU implementations with and without using scan, and a GPU implementation, using the work-efficient scan. All implementations scale to any size of input array, including non-powers of two (efficiently). + +The work-efficient GPU implemenation achieves a 6.89x speedup over the CPU implementation at input sizes of 2^26 (3.98858ms vs 27.4916ms). + +## Performance + +### Block Size + +The runtime for various block sizes affecting the GPU implementations are recorded below. CPU and Thrust implementations were averaged and are included for comparison. + +#### Runtime (ms) vs Block Size +| Block Size | Scan | | | | Compaction | | | +|------------|---------|-------------|----------------------|--------------|--------------------|-----------------|----------------------| +| Block Size | CPU | GPU - Naive | GPU - Work-efficient | GPU - Thrust | CPU - without scan | CPU - with scan | GPU - Work-efficient | +| 4 | 27.4916 | 231.784 | 28.3375 | 2.0969 | 123.318 | 241.689 | 47.2288 | +| 8 | 27.4916 | 142.151 | 17.6827 | 2.0969 | 123.318 | 241.689 | 29.7216 | +| 16 | 27.4916 | 70.1123 | 7.82448 | 2.0969 | 123.318 | 241.689 | 13.8916 | +| 32 | 27.4916 | 37.9731 | 4.7431 | 2.0969 | 123.318 | 241.689 | 9.0921 | +| 64 | 27.4916 | 37.5656 | 3.98858 | 2.0969 | 123.318 | 241.689 | 8.68864 | +| 128 | 27.4916 | 37.524 | 4.54752 | 2.0969 | 123.318 | 241.689 | 8.55347 | +| 256 | 27.4916 | 36.7617 | 4.5695 | 2.0969 | 123.318 | 241.689 | 8.95078 | +| 512 | 27.4916 | 38.0105 | 4.38698 | 2.0969 | 123.318 | 241.689 | 8.64563 | +| 1024 | 27.4916 | 35.8252 | 5.87482 | 2.0969 | 123.318 | 241.689 | 10.196 | + +Below is a graph of the scan runtime across various block sizes (from the table above). Best (lowest) runtime for GPU implementations is marked. + +![](img/scan_runtime_vs_block_size.png) + +Below is a graph of the stream compaction runtime across various block sizes (from the table above). Best (lowest) runtime for GPU implementation is marked. + +![](img/stream_compaction_runtime_vs_block_size.png) + +#### Optimal + +The optimal block size for the naive GPU scan implementation was 1024, with block sizes from 32 to 1024 having similar runtimes. The optimal block size for the work-efficient GPU scan implementation was 64, and 128 for the work-efficient stream compaction (with 64 behaving similarly). The work-efficient GPU implementation should do better with increasing block sizes, as that reduces the number of layers (and therefore kernel invokations) that the algorithm must use. However, each thread requires a fair amount of resources to complete its operations. Therefore, it makes sense that having a smaller block size would increase performance, as more blocks could be scheduled on the same SM. + +### Input Length + +#### Runtime (ms) vs Input Length +| Input Length | CPU | GPU - Naive | GPU - Work-efficient | GPU - Thrust | +|--------------|---------|-------------|----------------------|--------------| +| 2^0 | 0.0001 | 0.002048 | 0.190464 | 0.131392 | +| 2^2 | 0.0001 | 0.216064 | 0.177152 | 0.083904 | +| 2^4 | 0.0001 | 0.079872 | 0.132096 | 0.053568 | +| 2^6 | 0.0002 | 0.120832 | 0.180224 | 0.055328 | +| 2^8 | 0.0002 | 0.073728 | 0.117792 | 0.079584 | +| 2^10 | 0.0006 | 0.195776 | 0.18432 | 0.134144 | +| 2^12 | 0.0018 | 0.183296 | 0.351232 | 0.07168 | +| 2^14 | 0.0069 | 0.12288 | 0.280576 | 0.405504 | +| 2^16 | 0.0223 | 0.311008 | 0.29504 | 0.130784 | +| 2^18 | 0.1128 | 0.460416 | 0.169536 | 0.347808 | +| 2^20 | 0.4782 | 0.363488 | 0.194592 | 0.374752 | +| 2^22 | 1.6566 | 0.812992 | 0.696896 | 0.564224 | +| 2^24 | 8.907 | 8.42656 | 1.42976 | 0.881376 | +| 2^26 | 27.4916 | 35.8252 | 3.98858 | 2.0969 | + +![](img/scan_runtime_vs_input_length.png) + +Note: both axes are on a logarithmic scale. + +As can be seen in the table and graph, the CPU implementation remains faster until the 2^20 (approximately 1 million) elements mark. We can also see that the GPU implementations remain largerly constant in their runtimes up to this point. This likely means that there is overhead involved in the GPU implementations that are not worth the cost until 2^20 elements are in the array (on this machine). As such, in actual deployments, it would make sense for the scan function to determine which implementation to use based on the size and on profiling done on the machine. Smaller arrays would be scanned on the CPU, while large ones would be processed on the GPU. + +#### Thrust Timeline + +![](img/nsight_systems_thrust.png) + +First, the device vector for the input is created and its contents are copied (`cub::DeviceFor::Bulk`), and the device vector for the output is created. Between the two `cudaEventRecord` (which happens after the first pictured `cudaStreamSynchronize`), the thrust invocation takes place, and we can see a few cuda calls. First, a `cudaMalloc` call appears. A couple helper library calls appear to get the kernel `DeviceScanInitKernel` and its name, then that kernel is run. Similarly, a couple cuda calls appear before a call to the actual `DeviceScanKernel`. Then, the device syncronizes, a `cudaFree` occurs (presumably to free the previously allocated memory), then the `cudaEventRecord` appears, marking the end of the thrust invokation. After that, a cudaMemcpy takes place (to copy the information back to the host). The allocation (and later free) are likely to create some internal buffers necessary for the scan implementation. The init kernel takes almost no time at all, especially compared to the actual scan kernel. + +## Observations + +![](img/nsight_compute.png) + +Looking at Nsight Compute, it would appear that the main bottleneck in all implementations is the memory. The naive implementation (`inclusiveScanAdd`) is consistently around 90% memory throughput, while at only 10% compute throughput. Similarly, while the work-efficient implementation (`kernBlockScanStoreSum`) achieves a much higher compute throughput (especially in the first layers/invocations), it reaches 97.49% memory throughput, so it would again appear to be limited by memory throughput. The final `kernAddSums` (the second longest invokation in the work-efficient implementation) also reaches 86% memory throughput, meaning it is also likely bandwidth limited. + +## Test Output + +Below is an output of the tests with a power-of-two size of `2^26` and a non-power-of-two size of `2^26-3`, with samples being from 0 to 50. This is near the limit of the size of the input arrays, as larger results in integer overflow near the ends of the array. A block size of 1024 was chosen for the naive implementation, and a block size of 64 was chosen for the work-efficient implementation, as +``` +**************** +** SCAN TESTS ** +**************** + [ 47 4 39 1 24 9 35 15 30 20 24 19 21 ... 12 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 27.1716ms (std::chrono Measured) + [ 0 47 51 90 91 115 124 159 174 204 224 248 267 ... 1643558162 1643558174 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 27.2493ms (std::chrono Measured) + [ 0 47 51 90 91 115 124 159 174 204 224 248 267 ... 1643558094 1643558110 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 35.9803ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 36.0246ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 4.49821ms (CUDA Measured) + [ 0 47 51 90 91 115 124 159 174 204 224 248 267 ... 1643558162 1643558174 ] + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 3.91286ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 2.12493ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 2.87011ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 0 1 3 2 2 2 2 3 2 0 1 3 2 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 122.684ms (std::chrono Measured) + [ 1 3 2 2 2 2 3 2 1 3 2 2 1 ... 1 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 121.005ms (std::chrono Measured) + [ 1 3 2 2 2 2 3 2 1 3 2 2 1 ... 1 1 ] + passed +==== cpu compact with scan ==== + elapsed time: 283.972ms (std::chrono Measured) + [ 1 3 2 2 2 2 3 2 1 3 2 2 1 ... 1 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 8.92416ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 8.2688ms (CUDA Measured) + passed +``` + +### Mistakes + +Original profiling was done including the `cudaMalloc`, `cudaMemcpy`, `cudaFree`, and other initial/final memory operations for the GPU implementations. This had a drastic impact on perceived performance, with the work-efficient implementation especially appearing to perform much worse. Originally, the allocations were included in the work-efficient scan due to their dynamic nature as a part of the dynamic number and size of the layers. + +A bug was encountered once when switching to the Release configuration and having large memory sizes, but did not appear in the debug builds. The first instinct was then that it was a race condition, and, indeed, a `__syncthreads()` was missing in the downsweep implementation, after the root of the tree was set to 0 but before the rest of the downsweep took place. \ No newline at end of file diff --git a/img/nsight_compute.png b/img/nsight_compute.png new file mode 100644 index 00000000..da2b5599 Binary files /dev/null and b/img/nsight_compute.png differ diff --git a/img/nsight_systems_thrust.png b/img/nsight_systems_thrust.png new file mode 100644 index 00000000..2db80f9e Binary files /dev/null and b/img/nsight_systems_thrust.png differ diff --git a/img/scan_runtime_vs_block_size.png b/img/scan_runtime_vs_block_size.png new file mode 100644 index 00000000..e0c10816 Binary files /dev/null and b/img/scan_runtime_vs_block_size.png differ diff --git a/img/scan_runtime_vs_input_length.png b/img/scan_runtime_vs_input_length.png new file mode 100644 index 00000000..bc5e603e Binary files /dev/null and b/img/scan_runtime_vs_input_length.png differ diff --git a/img/stream_compaction_diagram.png b/img/stream_compaction_diagram.png new file mode 100644 index 00000000..accc844a Binary files /dev/null and b/img/stream_compaction_diagram.png differ diff --git a/img/stream_compaction_runtime_vs_block_size.png b/img/stream_compaction_runtime_vs_block_size.png new file mode 100644 index 00000000..12dee9ba Binary files /dev/null and b/img/stream_compaction_runtime_vs_block_size.png differ diff --git a/src/main.cpp b/src/main.cpp index 3d5c8820..4ecbb083 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 26; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -29,6 +29,9 @@ int main(int argc, char* argv[]) { genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case a[SIZE - 1] = 0; + /*for (int i = 0; i < SIZE; i++) { + a[i] = 1; + }*/ printArray(SIZE, a, true); // initialize b using StreamCompaction::CPU::scan you implement @@ -71,7 +74,7 @@ int main(int argc, char* argv[]) { printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d630..f4632198 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,11 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int idx = blockDim.x * blockIdx.x + threadIdx.x; + if (idx > n) { + return; + } + bools[idx] = idata[idx] == 0 ? 0 : 1; } /** @@ -32,7 +36,13 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int idx = blockDim.x * blockIdx.x + threadIdx.x; + if (idx > n) { + return; + } + if (bools[idx]) { + odata[indices[idx]] = idata[idx]; + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed9..836d21e2 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -30,6 +30,10 @@ inline int ilog2ceil(int x) { return x == 1 ? 0 : ilog2(x - 1) + 1; } +inline int divup(int num, int denom) { + return (num + denom - 1) / denom; +} + namespace StreamCompaction { namespace Common { __global__ void kernMapToBoolean(int n, int *bools, const int *idata); diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa115..77737a1d 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -20,6 +20,11 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + int sum = 0; + for (int i = 0; i < n; i++) { + odata[i] = sum; + sum += idata[i]; + } timer().endCpuTimer(); } @@ -31,8 +36,15 @@ namespace StreamCompaction { int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + int index = 0; + for (int i = 0; i < n; i++) { + if (idata[i]) + { + odata[index++] = idata[i]; + } + } timer().endCpuTimer(); - return -1; + return index; } /** @@ -41,10 +53,28 @@ namespace StreamCompaction { * @returns the number of elements remaining after compaction. */ int compactWithScan(int n, int *odata, const int *idata) { + int* bool_array = new int[n]; + int* scanned_array = new int[n]; timer().startCpuTimer(); // TODO + for (int i = 0; i < n; i++) { + bool_array[i] = idata[i] == 0 ? 0 : 1; + } + int sum = 0; + for (int i = 0; i < n; i++) { + scanned_array[i] = sum; + sum += bool_array[i]; + } + for (int i = 0; i < n; i++) { + if (bool_array[i] == 1) { + odata[scanned_array[i]] = idata[i]; + } + } + int length = scanned_array[n - 1] + bool_array[n - 1]; timer().endCpuTimer(); - return -1; + delete [] bool_array; + delete [] scanned_array; + return length; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346ee..9e7f9d7b 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -1,5 +1,6 @@ #include #include +#include #include "common.h" #include "efficient.h" @@ -12,13 +13,159 @@ namespace StreamCompaction { return timer; } + using StreamCompaction::Common::kernMapToBoolean; + using StreamCompaction::Common::kernScatter; + + const int maxBlockSize = 64; + + // assumes padding to block size; + __global__ void kernBlockScan(int blockLog2Ceil, int* data) { + int blockOffset = blockDim.x * blockIdx.x; + int idx = threadIdx.x + 1; + // upsweep + for (int i = 1; i <= blockLog2Ceil; i++) { + int pos = idx * (1 << i) - 1; + if (pos < blockDim.x) { + pos += blockOffset; + int offset = 1 << (i - 1); + data[pos] = data[pos] + data[pos - offset]; + } + __syncthreads(); + } + // downsweep + // set root to 0 + + if (threadIdx.x == blockDim.x - 1) { + data[blockOffset + blockDim.x - 1] = 0; + } + __syncthreads(); + + for (int i = blockLog2Ceil; i > 0; i--) { + int pos = idx * (1 << i) - 1; + if (pos < blockDim.x) { + pos += blockOffset; + int offset = 1 << (i - 1); + int t = data[pos - offset]; + data[pos - offset] = data[pos]; + data[pos] = t + data[pos]; + } + + __syncthreads(); + } + } + + // assumes padding to block size; + __global__ void kernBlockScanStoreSum(int blockLog2Ceil, int* data, int* sums) { + int blockOffset = blockDim.x * blockIdx.x; + int idx = threadIdx.x + 1; + // upsweep + for (int i = 1; i <= blockLog2Ceil; i++) { + int pos = idx * (1 << i) - 1; + if (pos < blockDim.x) { + pos += blockOffset; + int offset = 1 << (i - 1); + data[pos] = data[pos] + data[pos - offset]; + } + __syncthreads(); + } + // store sum for cross block sum later + if (threadIdx.x == 0) { + sums[blockIdx.x] = data[blockOffset + blockDim.x - 1]; + } + __syncthreads(); + // downsweep + // set root to 0 + + if (threadIdx.x == blockDim.x - 1) { + data[blockOffset + blockDim.x - 1] = 0; + } + __syncthreads(); + + for (int i = blockLog2Ceil; i > 0; i--) { + int pos = idx * (1 << i) - 1; + if (pos < blockDim.x) { + pos += blockOffset; + int offset = 1 << (i - 1); + int t = data[pos - offset]; + data[pos - offset] = data[pos]; + data[pos] = t + data[pos]; + } + + __syncthreads(); + } + } + + // assumes padding to block size; + __global__ void kernAddSums(int* data, const int* sums) { + int idx = blockDim.x * (blockIdx.x) + threadIdx.x; + data[idx] += sums[blockIdx.x]; + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + + if (n <= 0) { + return; + } + + std::vector scanArrays{}; + std::vector scanArrayLens{}; + + // ceil to next maxBlockSize + int scanArrayLen = divup(n, maxBlockSize) * maxBlockSize; + while (scanArrayLen > maxBlockSize) { + int* d_array; + cudaMalloc(&d_array, scanArrayLen * sizeof(int)); + checkCUDAError("cudaMalloc scanArrayLen failed"); + cudaMemset(d_array, 0, scanArrayLen * sizeof(int)); + scanArrays.push_back(d_array); + scanArrayLens.push_back(scanArrayLen); + //fprintf(stderr, "Size %i\n", scanArrayLen); + // divide by maxBlockSize then ceil to it + scanArrayLen = divup(scanArrayLen / maxBlockSize, maxBlockSize) * maxBlockSize; + } + { + // scanArrayLen = maxBlockSize now + int* d_array; + cudaMalloc(&d_array, scanArrayLen * sizeof(int)); + checkCUDAError("cudaMalloc scanArrayLen failed"); + cudaMemset(d_array, 0, scanArrayLen * sizeof(int)); + scanArrays.push_back(d_array); + scanArrayLens.push_back(scanArrayLen); + } + + cudaMemcpy(scanArrays[0], idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy idata failed"); + timer().startGpuTimer(); - // TODO + + for (int i = 0; i < scanArrays.size() - 1; i++) { + int arrayLen = scanArrayLens[i]; + kernBlockScanStoreSum << < arrayLen / maxBlockSize, maxBlockSize >> > (ilog2ceil(maxBlockSize), scanArrays[i], scanArrays[i + 1]); + checkCUDAError("kernBlockScanStoreSum failed"); + } + kernBlockScan << <1, maxBlockSize >> > (ilog2ceil(maxBlockSize), scanArrays.back()); + checkCUDAError("kernBlockScan failed"); + + for (int i = scanArrays.size() - 2; i >= 0; i--) { + int arrayLen = scanArrayLens[i]; + kernAddSums << < arrayLen / maxBlockSize, maxBlockSize >> > (scanArrays[i], scanArrays[i + 1]); + checkCUDAError("kernAddSums failed"); + } + timer().endGpuTimer(); + + cudaMemcpy(odata, scanArrays[0], n * sizeof(int), cudaMemcpyDeviceToHost); + //cudaMemcpy(odata, scanArrays.back(), scanArrayLens.back() * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy odata failed"); + + for (int i = 0; i < scanArrays.size(); i++) { + cudaFree(scanArrays[i]); + checkCUDAError("cudaFree scanArrays[i] failed"); + } } /** @@ -31,10 +178,112 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { + if (n <= 0) { + return -1; + } + + int topArrayLen = divup(n, maxBlockSize) * maxBlockSize; + + int* d_idata; + cudaMalloc(&d_idata, topArrayLen * sizeof(int)); + checkCUDAError("cudaMalloc d_idata failed"); + cudaMemset(d_idata, 0, topArrayLen * sizeof(int)); + checkCUDAError("cudaMemset d_idata failed"); + cudaMemcpy(d_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy d_idata failed"); + + int* d_bools; + cudaMalloc(&d_bools, topArrayLen * sizeof(int)); + checkCUDAError("cudaMalloc d_bools failed"); + + + int* d_odata; + cudaMalloc(&d_odata, n * sizeof(int)); + checkCUDAError("cudaMalloc d_odata failed"); + + std::vector scanArrays{}; + std::vector scanArrayLens{}; + + // ceil to next maxBlockSize + int scanArrayLen = topArrayLen; + while (scanArrayLen > maxBlockSize) { + int* d_array; + cudaMalloc(&d_array, scanArrayLen * sizeof(int)); + checkCUDAError("kernMapToBoolean failed"); + cudaMemset(d_array, 0, scanArrayLen * sizeof(int)); + checkCUDAError("cudaMemset d_array failed"); + scanArrays.push_back(d_array); + scanArrayLens.push_back(scanArrayLen); + //fprintf(stderr, "Size %i\n", scanArrayLen); + // divide by maxBlockSize then ceil to it + scanArrayLen = divup(scanArrayLen / maxBlockSize, maxBlockSize) * maxBlockSize; + } + { + // scanArrayLen = maxBlockSize now + int* d_array; + cudaMalloc(&d_array, scanArrayLen * sizeof(int)); + checkCUDAError("cudaMalloc scanArrayLen failed"); + cudaMemset(d_array, 0, scanArrayLen * sizeof(int)); + checkCUDAError("cudaMemset d_array failed"); + scanArrays.push_back(d_array); + scanArrayLens.push_back(scanArrayLen); + } + timer().startGpuTimer(); - // TODO + + kernMapToBoolean << > > (topArrayLen, d_bools, d_idata); + checkCUDAError("kernMapToBoolean failed"); + + cudaMemcpy(scanArrays[0], d_bools, topArrayLen * sizeof(int), cudaMemcpyDeviceToDevice); + checkCUDAError("cudaMemcpy d_bools failed"); + + for (int i = 0; i < scanArrays.size() - 1; i++) { + int arrayLen = scanArrayLens[i]; + kernBlockScanStoreSum << < arrayLen / maxBlockSize, maxBlockSize >> > (ilog2ceil(maxBlockSize), scanArrays[i], scanArrays[i + 1]); + checkCUDAError("kernBlockScanStoreSum failed"); + } + kernBlockScan << <1, maxBlockSize >> > (ilog2ceil(maxBlockSize), scanArrays.back()); + checkCUDAError("kernBlockScan failed"); + + for (int i = scanArrays.size() - 2; i >= 0; i--) { + int arrayLen = scanArrayLens[i]; + kernAddSums << < arrayLen / maxBlockSize, maxBlockSize >> > (scanArrays[i], scanArrays[i + 1]); + checkCUDAError("kernAddSums failed"); + } + + int* d_indices = scanArrays[0]; + + + kernScatter << > > (n, d_odata, d_idata, d_bools, d_indices); + checkCUDAError("kernScatter failed"); + timer().endGpuTimer(); - return -1; + + cudaDeviceSynchronize(); + + int compactLen; + cudaMemcpy(&compactLen, &d_indices[topArrayLen - 1], sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy compactLen failed"); + int lastBool; + cudaMemcpy(&lastBool, &d_bools[topArrayLen - 1], sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy lastBool failed"); + compactLen += lastBool; + + cudaMemcpy(odata, d_odata, compactLen * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy d_odata failed"); + + for (int i = 0; i < scanArrays.size(); i++) { + cudaFree(scanArrays[i]); + checkCUDAError("cudaFree scanArrays[i] failed"); + } + cudaFree(d_bools); + checkCUDAError("cudaFree d_bools failed"); + cudaFree(d_idata); + checkCUDAError("cudaFree d_idata failed"); + cudaFree(d_odata); + checkCUDAError("cudaFree d_odata failed"); + + return compactLen; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 43088769..bb559f37 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -11,15 +11,62 @@ namespace StreamCompaction { static PerformanceTimer timer; return timer; } - // TODO: __global__ + + __global__ void inclusiveScanAdd(const int n, const int stride, const int* idata, int* odata) { + int idx = blockIdx.x * blockDim.x + threadIdx.x; + if (idx >= n) { + return; + } + if (stride > idx) { + odata[idx] = idata[idx]; + } + else + { + odata[idx] = idata[idx] + idata[idx - stride]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + + size_t sizeInBytes = n * sizeof(int); + + int* d_odata; + int* d_idata; + cudaMalloc(&d_odata, sizeInBytes); + checkCUDAError("cudaMalloc d_odata failed"); + cudaMalloc(&d_idata, sizeInBytes); + checkCUDAError("cudaMalloc d_idata failed"); + + cudaMemcpy(d_idata, idata, sizeInBytes, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy d_idata failed"); + + int blockSize = 1024; + int gridSize = divup(n, blockSize); + timer().startGpuTimer(); - // TODO + + for (int i = 0; i < ilog2ceil(n); i++) { + int stride = 1 << i; + inclusiveScanAdd << > > (n, stride, d_idata, d_odata); + checkCUDAError("inclusiveScanAdd failed"); + std::swap(d_odata, d_idata); + } + timer().endGpuTimer(); + + // due to the fact that we want to make the inclusive scan exclusive instead + odata[0] = 0; + cudaMemcpy(odata + 1, d_idata, sizeInBytes - sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy odata failed"); + + cudaFree(d_odata); + checkCUDAError("cudaFree d_odata failed"); + cudaFree(d_idata); + checkCUDAError("cudaFree d_idata failed"); + } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e7..fd3cc711 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,19 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::device_vector dv_idata(idata, idata + n); + thrust::device_vector dv_odata(n); + + timer().startGpuTimer(); + + thrust::exclusive_scan(dv_idata.begin(), dv_idata.end(), dv_odata.begin()); + timer().endGpuTimer(); + + thrust::copy(dv_odata.begin(), dv_odata.end(), odata); } } }