diff --git a/README.md b/README.md index 0e38ddb..372b3e5 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,196 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Utkarsh Dwivedi + * [LinkedIn](https://www.linkedin.com/in/udwivedi/), [personal website](https://utkarshdwivedi.com/) +* Tested on: Windows 11 Home, AMD Ryzen 7 5800H @ 3.2GHz 16 GB, Nvidia GeForce RTX 3060 Laptop GPU 6 GB -### (TODO: Your README) +# Introduction -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +This project is an implementation of the scan and stream compaction algorithms, both sequentially on the CPU as well as in parallel on the GPU using CUDA, and a performance analysis and comparison of different implementations of the algorithm. +## Algorithms + +### Scan +Scan computes the prefix sum of the array so that each element in the resulting array is the sum of all the elements occuring before it. This project contains the following implementations of scan: + +1. **Sequential CPU scan**: A simple scan algorithm that runs sequentially on the CPU using a for loop. +2. **Naive GPU parallel scan**: A scan algorithm that runs in parallel on the GPU and sums over adjacent elements while halving the array size each iteration +3. **Work efficient GPU parallel scan**: A more performant parallel scan on the GPU that treats the array like a binary tree and runs in two steps: an "up-sweep" and then a "down-sweep" + +**Note:** For both of the GPU scans, the algorithms only work for arrays with power of 2 size. For non-power of 2 size arrays, the data is padded with 0s until the array size becomes the next power of 2. This has potential to be improved, as at the current moment it has the potential of launching a large number of threads that compute unnecessary data. For an array of size **2n+1**, the array will be expanded to size **216**, padding the last (**2n+1 - 2n+1**) elements with 0s. This is acceptable for arrays with small values of **n**, but say for arrays of size **4097**, the array will be padded to **8192** elements, leading to **4095** unnecessary computations. + +### Stream Compaction +Stream compaction is an algorithm that, given an array and some condition, creates and returns a new array that contains only those elements from the original array that satisfy the given condition, and the order of the elements in the resulting array is preserved. In this project, stream compaction simply removes all `0s` from an array of integers, but it can easily be extended to work as a template on any given conditions. This project contains the following implementations of stream compaction: + +1. **Simple CPU stream compaction**: A simple algorithm that runs sequentially on the CPU using a for loop and simply removes all elements that are 0 +2. **CPU stream compaction using CPU scan**: A CPU side imitation of stream compaction in parallel, using the scan algorithm +3. **GPU stream compaction using GPU work efficient scan**: Proper implementation of stream compaction in parallel using the parallel scan algorithm on the GPU + +**Note**: The power of 2 note from the scan algorithms is still valid here, as the GPU stream compaction uses the GPU scan algorithm. + +# Performance Analysis + +## Time(ms) vs Block Size (optimizing block size) + +This project was first executed on an array of size **220** with different block sizes to roughly optimize them. + +| ![](./img/timeVsBlockSize.png) | +|:--:| +| **Scan**: time (ms) vs block size (# of threads per block) + +There is little gain in performance beyond 128 threads, so further performance analysis was done with block size set to **128**. + +**Note**: It is shocking to see the GPU work-efficient scan algorithm perform worse than the GPU naive scan algorithm. More info on that below. + +## Time (ms) vs Array Size + +### Scan + +Each of the scan implementations (CPU, GPU Naive, GPU Work-Efficient) was compared against each other with increasing array sizes. Additionally, the results were also compared against Nvidia's thrust library's `exclusive_scan`. + +| ![](./img/scanPerformance.png) | +|:--:| +| *Graph 1*. **Scan**: time (ms) vs array size, Block Size: 128, time scales linearly | + +In the above graph, time scales linearly on the Y-axis. For smaller arrays, there seems to be not much difference in performance in all four cases, but with increasing array size, thrust is the clear winner in terms of performance. + +| ![](./img/scanPerformanceLog.png) | +|:--:| +| *Graph 2*. **Scan**: time (ms) vs array size, Block Size: 128, time scales logarithmically | + +In the above graph, time scales logarithmically on the Y-axis. It is the same data as in *Graph 1*, but it more clearly shows what is going on in arrays of smaller sizes. **It is very interesting to see that the CPU scan is even faster than thrust for upto 100,000 array size!** + +What is surprising, as mentioned earlier, is that the work-efficient GPU scan algorithm is *slower* than both the CPU and also the naive GPU algorithm. This makes very little sense, because while it makes sense for the CPU version to be faster for smaller arrays, the naive version should not be faster than the work-efficient version. + +**What could be going wrong? Why are the GPU implementations so slow?** + +I believe this is happening due to a combination of multiple reasons. + +- Every iteration the up-sweep and down-sweep kernels are launched, the number of meaningful computations is halved. However, the number of threads remains the same, therefore the number of threads doing meaningful work is halved every iteration of the kernels. +- This would still be okay, except for the fact that the algorithm works in a way where every `s` threads do meaningful computations, where `s = stride`. This leads to branching in every single block, but the goal should be to minimize branching as much as possible. It would make more sense to rework the indexing in the algorithm so that the threads doing meaningful computations are shifted. Instead of every `s` thread doing the compute, each thread *until* `s` should do meaningful compute. The second half of the threads that do no work can be retired early for use in other compute. +- Neither the naive nor the work-efficient scans use shared memory, and therefore do not utilize CUDA's capability to the fullest. + +**Why is the thrust scan implementation significantly faster?** + +Thrust likely uses many optimizations in its implementation of scan, such as the ones mentioned above. + +Below is a screenshot of the Nsight timeline for this program. + +|![](img/nsight1.png)| +|:--:| +|Image 1| + +```diff ++ Green +Green sections show time spent in copying over memory from the device to the host. + +- Red +Red sections show time spent in copying over memory back from the host to device. + +@@ Purple @@ +Purple coloured sections are the actual kernels. +``` +The memory I/O operations in red and green take the most amount of time. This project's naive and work-efficient scan implementations take 27.4% of all kernel execution time, whereas thrust's `DeviceScanInitKernel` and `DeviceScanKernel` take only 1.2% of the total kernel execution time. + +If we zoom into the `kernelScan` section that takes 27.4% of all kernel execution time, we can see that there is not one, but several kernels executed at regular intervals. + +|![](img/nsight2.png)| +|:--:| +|Image 2| + +These kernels are this project's implementation of the naive and work-efficient kernels. The CPU launches many kernels based on the array size, where the number of threads doing compute reduces by half in every iteration, as explained in the previous sections. + +Compare that to the very tiny thrust's scan below (zoomed 100X so that the tiny section from the right side of Image 1 is visible): + +|![](img/nsight3.png)| +|:--:| +|Image 3| + +Thrust seems to only launch 2 kernels for arrays of arbitrary size: one to initialize the data, and another one to perform the actual scan. In addition to the optimisations mentioned above, it seems that thrust either performs the up-sweep and down-sweep simultaneously while also managing the iterations on the GPU side, or it uses an entirely more efficient algorithm altogether that performs the scan in-place. + +### Stream Compaction + +Each of the stream compaction implementations was compared against each other with increasing array sizes. + +| ![](./img/streamCompactionPerformance.png) | +|:--:| +| *Graph 3*. **Stream Compaction**: time (ms) vs array size, Block Size: 128, time scales linearly | + +In the above graph, time scales linearly on the Y-axis. These results make much more sense. While there is not much of a performance gain for smaller array sizes, we start seeing the benefits of massive parallelisation on the GPU with larger array sizes. In the case of 50M elements, the GPU work-efficient stream compact implementation performs, which is a 5x performance increase. While the CPU stream compaction is sequential (or imitates the GPU side and is even slower due to the additional overhead), the GPU scan is able to utilize massive parallelization and perform more compute simultaneously rather than in sequence. + +| ![](./img/streamCompactionPerformanceLog.png) | +|:--:| +| *Graph 3*. **Stream Compaction**: time (ms) vs array size, Block Size: 128, time scales logarithmically | + +In the above graph, time scales logarithmically on the Y-axis. It is the same data as in graph 3, but the logarithmic time scale helps see performance differences for array sizes less than 100,000. In this case, GPU stream compaction is slower than its CPU counterparts, up to 100X slower for arrays of size 100. + +This is due to the fact that the GPU work-efficient scan algorithm used inside the GPU stream compaction algorithm is slow, as explained in the previous section. + +# Sample Output + +This output was obtained with a block size of **128**, power of 2 array size of **216**, and non-power of array size of **216 - 3**. + +``` +**************** +** SCAN TESTS ** +**************** + [ 19 33 49 24 24 41 7 27 48 8 21 34 25 ... 17 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 0.0379ms (std::chrono Measured) + [ 0 19 52 101 125 149 190 197 224 272 280 301 335 ... 1605490 1605507 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 0.0317ms (std::chrono Measured) + [ 0 19 52 101 125 149 190 197 224 272 280 301 335 ... 1605434 1605446 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 0.158912ms (CUDA Measured) + [ 0 19 52 101 125 149 190 197 224 272 280 301 335 ... 1605490 1605507 ] + passed +==== naive scan, non-power-of-two ==== + elapsed time: 0.193856ms (CUDA Measured) + [ 0 19 52 101 125 149 190 197 224 272 280 301 335 ... 1605434 1605446 ] + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 0.326432ms (CUDA Measured) + [ 0 19 52 101 125 149 190 197 224 272 280 301 335 ... 1605490 1605507 ] + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 0.247808ms (CUDA Measured) + [ 0 19 52 101 125 149 190 197 224 272 280 301 335 ... 1605434 1605446 ] + passed +==== thrust scan, power-of-two ==== + elapsed time: 0.059744ms (CUDA Measured) + [ 0 19 52 101 125 149 190 197 224 272 280 301 335 ... 1605490 1605507 ] + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 0.051552ms (CUDA Measured) + [ 0 19 52 101 125 149 190 197 224 272 280 301 335 ... 1605434 1605446 ] + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 3 1 3 2 0 3 3 3 2 2 1 0 3 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 0.1817ms (std::chrono Measured) + [ 3 1 3 2 3 3 3 2 2 1 3 3 1 ... 3 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 0.1807ms (std::chrono Measured) + [ 3 1 3 2 3 3 3 2 2 1 3 3 1 ... 3 3 ] + passed +==== cpu compact with scan ==== + elapsed time: 0.2497ms (std::chrono Measured) + [ 3 1 3 2 3 3 3 2 2 1 3 3 1 ... 3 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 0.31232ms (CUDA Measured) + [ 3 1 3 2 3 3 3 2 2 1 3 3 1 ... 3 1 ] + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 0.395264ms (CUDA Measured) + [ 3 1 3 2 3 3 3 2 2 1 3 3 1 ... 3 3 ] + passed +Press any key to continue . . . +``` diff --git a/img/nsight1.png b/img/nsight1.png new file mode 100644 index 0000000..95325f7 Binary files /dev/null and b/img/nsight1.png differ diff --git a/img/nsight2.png b/img/nsight2.png new file mode 100644 index 0000000..2080f50 Binary files /dev/null and b/img/nsight2.png differ diff --git a/img/nsight3.png b/img/nsight3.png new file mode 100644 index 0000000..700cbdf Binary files /dev/null and b/img/nsight3.png differ diff --git a/img/scanPerformance.png b/img/scanPerformance.png new file mode 100644 index 0000000..6bb1fdb Binary files /dev/null and b/img/scanPerformance.png differ diff --git a/img/scanPerformanceLog.png b/img/scanPerformanceLog.png new file mode 100644 index 0000000..c61d68b Binary files /dev/null and b/img/scanPerformanceLog.png differ diff --git a/img/streamCompactionPerformance.png b/img/streamCompactionPerformance.png new file mode 100644 index 0000000..cc647db Binary files /dev/null and b/img/streamCompactionPerformance.png differ diff --git a/img/streamCompactionPerformanceLog.png b/img/streamCompactionPerformanceLog.png new file mode 100644 index 0000000..c51b8b9 Binary files /dev/null and b/img/streamCompactionPerformanceLog.png differ diff --git a/img/timeVsBlockSize.png b/img/timeVsBlockSize.png new file mode 100644 index 0000000..486eb97 Binary files /dev/null and b/img/timeVsBlockSize.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..cb027a5 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -51,48 +51,48 @@ int main(int argc, char* argv[]) { printDesc("naive scan, power-of-two"); StreamCompaction::Naive::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); - /* For bug-finding only: Array of 1s to help find bugs in stream compaction or scan - onesArray(SIZE, c); - printDesc("1s array for finding bugs"); - StreamCompaction::Naive::scan(SIZE, c, a); - printArray(SIZE, c, true); */ + // For bug-finding only: Array of 1s to help find bugs in stream compaction or scan + //onesArray(SIZE, c); + //printDesc("1s array for finding bugs"); + //StreamCompaction::Naive::scan(SIZE, c, a); + //printArray(SIZE, c, true); zeroArray(SIZE, c); printDesc("naive scan, non-power-of-two"); StreamCompaction::Naive::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, power-of-two"); StreamCompaction::Efficient::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("work-efficient scan, non-power-of-two"); StreamCompaction::Efficient::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(SIZE, c, true); + printArray(SIZE, c, true); printCmpResult(SIZE, b, c); zeroArray(SIZE, c); printDesc("thrust scan, non-power-of-two"); StreamCompaction::Thrust::scan(NPOT, c, a); printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(NPOT, c, true); + printArray(NPOT, c, true); printCmpResult(NPOT, b, c); printf("\n"); @@ -137,14 +137,14 @@ int main(int argc, char* argv[]) { printDesc("work-efficient compact, power-of-two"); count = StreamCompaction::Efficient::compact(SIZE, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedCount, b, c); zeroArray(SIZE, c); printDesc("work-efficient compact, non-power-of-two"); count = StreamCompaction::Efficient::compact(NPOT, c, a); printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); - //printArray(count, c, true); + printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); system("pause"); // stop Win32 console from closing on exit diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d63..b9f1ad9 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -23,7 +23,13 @@ namespace StreamCompaction { * which map to 0 will be removed, and elements which map to 1 will be kept. */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { - // TODO + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index >= n) + { + return; // invalid index + } + + bools[index] = idata[index] == 0 ? 0 : 1; } /** @@ -32,7 +38,16 @@ namespace StreamCompaction { */ __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { - // TODO + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index >= n) + { + return; // invalid index + } + + if (bools[index] == 1) + { + odata[indices[index]] = idata[index]; + } } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed..2b3c830 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -12,6 +12,7 @@ #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +#define BLOCK_SIZE 128 /** * Check for CUDA errors; print and exit if there was a problem. diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..743a3d0 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -19,7 +19,12 @@ namespace StreamCompaction { */ void scan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + // exclusive scan! + // odata[0] = 0; // don't need this line since odata is already filled with 0s + for (int i = 1; i < n; i++) + { + odata[i] = odata[i - 1] + idata[i - 1]; + } timer().endCpuTimer(); } @@ -30,9 +35,16 @@ namespace StreamCompaction { */ int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + int o = 0; + for (int i = 0; i < n; i++) + { + if (idata[i] != 0) + { + odata[o++] = idata[i]; + } + } timer().endCpuTimer(); - return -1; + return o; } /** @@ -42,9 +54,34 @@ namespace StreamCompaction { */ int compactWithScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); - // TODO + // generate temp array with 1 where element != 0 and 0 otherwise + int* temp = new int[n]; + for (int i = 0; i < n; i++) + { + temp[i] = idata[i] == 0 ? 0 : 1; + } + + // run scan, copy the code from scan() because we can't use scan() here + // because scan() runs its own timer and running a timer while its already + // running throws an exception. :) + //scan(n, odata, temp); + for (int i = 1; i < n; i++) + { + odata[i] = odata[i - 1] + temp[i - 1]; + } + + // scatter + for (int i = 0; i < n; i++) + { + if (temp[i] == 1) + { + odata[odata[i]] = idata[i]; + } + } + + delete[] temp; timer().endCpuTimer(); - return -1; + return odata[n - 1]; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..b634df5 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -12,13 +12,81 @@ namespace StreamCompaction { return timer; } + __global__ void kernelUpSweep(int n, int* idata, int d) + { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index >= n) + { + return; // invalid index + } + + int _d = 1 << d; // 2^d + int _d1 = 1 << (d + 1); // 2^(d+1) + if (index % _d1 == 0) + { + idata[index + _d1 - 1] += idata[index + _d - 1]; + } + } + + __global__ void kernelDownSweep(int n, int* idata, int d) + { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index >= n - 1) + { + return; // invalid index + } + + int _d = 1 << d; // 2^d + int _d1 = 1 << (d + 1); // 2^(d+1) + if (index % _d1 == 0) + { + int left = idata[index + _d - 1]; + idata[index + _d - 1] = idata[index + _d1 - 1]; + idata[index + _d1 - 1] += left; + } + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + int max = ilog2ceil(n); + int nNextPowerOf2 = 1 << max; + + int totalBlocks = (nNextPowerOf2 + BLOCK_SIZE - 1) / BLOCK_SIZE; + + int* dev_idata; + cudaMalloc((void**)&dev_idata, sizeof(int) * nNextPowerOf2); + checkCUDAError("cudaMalloc dev_idata failed"); + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy dev_idata failed"); + + // Pad end elements with 0 + // For me this was not required and the "unset" elements were automatically defaulting to 0 + // But some compilers may throw this off so I'll just manually do the padding + cudaMemset(&dev_idata[n], 0, sizeof(int) * (nNextPowerOf2 - n)); + checkCUDAError("cudaMemset dev_idata failed"); + timer().startGpuTimer(); - // TODO + // upsweep + for (int d = 0; d < max; d++) + { + kernelUpSweep<<>>(nNextPowerOf2, dev_idata, d); + } + + cudaMemset(dev_idata + nNextPowerOf2 - 1, 0, sizeof(int)); + + // downsweep + for (int d = max - 1; d >= 0; d--) + { + kernelDownSweep<<>>(nNextPowerOf2, dev_idata, d); + } timer().endGpuTimer(); + + cudaMemcpy(odata, dev_idata, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy dev_idata to odata failed"); + cudaFree(dev_idata); + checkCUDAError("cudaFree dev_idata failed"); } /** @@ -31,10 +99,72 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { + int* dev_idata; + cudaMalloc((void**)&dev_idata, sizeof(int) * n); + checkCUDAError("cudaMalloc dev_idata failed"); + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy dev_idata failed"); + + int* dev_odata; + cudaMalloc((void**)&dev_odata, sizeof(int) * n); + checkCUDAError("cudaMalloc dev_idata failed"); + + int* dev_bools; + cudaMalloc((void**)&dev_bools, sizeof(int) * n); + checkCUDAError("cudaMalloc dev_bools failed"); + cudaMemset(dev_bools, -1, sizeof(int) * n); + checkCUDAError("cudaMemset dev_bools failed"); + + int totalBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE; + + int max = ilog2ceil(n); + int nNextPowerOf2 = 1 << max; + int totalBlocksPow2 = (nNextPowerOf2 + BLOCK_SIZE - 1) / BLOCK_SIZE; + + int* dev_indices; + cudaMalloc((void**)&dev_indices, sizeof(int) * nNextPowerOf2); + checkCUDAError("cudaMalloc dev_indices failed"); + cudaMemset(dev_indices, 0, sizeof(int) * nNextPowerOf2); + checkCUDAError("cudaMemset dev_indices failed"); + timer().startGpuTimer(); - // TODO + // Map to bools array + StreamCompaction::Common::kernMapToBoolean<<>>(n, dev_bools, dev_idata); + + // Scan step + // Copy bools to indices array to run scan on + cudaMemcpy(dev_indices, dev_bools, sizeof(int) * n, cudaMemcpyDeviceToDevice); + // Upsweep + for (int d = 0; d < max; d++) + { + kernelUpSweep<<>> (nNextPowerOf2, dev_indices, d); + } + + cudaMemset(dev_indices + nNextPowerOf2 - 1, 0, sizeof(int)); + checkCUDAError("cudaMemset dev_indices failed"); + + // downsweep + for (int d = max - 1; d >= 0; d--) + { + kernelDownSweep<<>>(nNextPowerOf2, dev_indices, d); + } + + // Scatter step + StreamCompaction::Common::kernScatter<<>>(n, dev_odata, dev_idata, dev_bools, dev_indices); timer().endGpuTimer(); - return -1; + + int len; + cudaMemcpy(&len, dev_indices + nNextPowerOf2 - 1, sizeof(int), cudaMemcpyDeviceToHost); // length of compacted array + checkCUDAError("cudaMemcpy dev_indices last element failed"); + cudaMemcpy(odata, dev_odata, sizeof(int) * len, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy dev_odata failed"); + + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_bools); + cudaFree(dev_indices); + + return len; } } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..3c6297a 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -12,14 +12,73 @@ namespace StreamCompaction { return timer; } // TODO: __global__ + __global__ void kernelScan(int n, int* odata, const int* idata, int offset) + { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index >= n) + { + return; // invalid index + } + + if (index >= offset) + { + odata[index] = idata[index - offset] + idata[index]; + } + else + { + odata[index] = idata[index]; + } + } + + __global__ void kernelShiftRight(int n, int* odata, int* idata) + { + int index = (blockDim.x * blockIdx.x) + threadIdx.x; + if (index >= n) + { + return; // invalid index + } + + if (index == 0) + { + odata[index] = 0; + } + else + { + odata[index] = idata[index - 1]; + } + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + int* dev_idata; + int* dev_odata; + cudaMalloc((void**)&dev_idata, sizeof(int) * n); + checkCUDAError("cudaMalloc dev_idata failed"); + cudaMalloc((void**)&dev_odata, sizeof(int) * n); + checkCUDAError("cudaMalloc dev_odata failed"); + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy dev_idata failed"); + cudaMemcpy(dev_odata, odata, sizeof(int) * n, cudaMemcpyHostToDevice); + checkCUDAError("cudaMemcpy dev_odata failed"); + + int totalBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE; timer().startGpuTimer(); - // TODO + int max = ilog2ceil(n); + for (int d = 1; d <= max; d++) + { + kernelScan<<>>(n, dev_odata, dev_idata, 1 << d - 1); + std::swap(dev_odata, dev_idata); + } + kernelShiftRight<<>>(n, dev_odata, dev_idata); timer().endGpuTimer(); + cudaMemcpy(odata, dev_odata, sizeof(int) * n, cudaMemcpyDeviceToHost); + checkCUDAError("cudaMemcpy dev_odata to odata failed"); + cudaFree(dev_idata); + checkCUDAError("cudaFree dev_idata failed"); + cudaFree(dev_odata); + checkCUDAError("cudaFree dev_odata failed"); } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..6edc2bf 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -18,11 +18,16 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + thrust::host_vector host_idata_vec(idata, idata + n); + + // we can copy a host vector into device simply like this + thrust::device_vector dev_idata_vec = host_idata_vec; + timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` - // example: for device_vectors dv_in and dv_out: - // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::exclusive_scan(dev_idata_vec.begin(), dev_idata_vec.end(), dev_idata_vec.begin()); timer().endGpuTimer(); + + cudaMemcpy(odata, dev_idata_vec.data().get(), sizeof(int) * n, cudaMemcpyDeviceToHost); } } }