diff --git a/README.md b/README.md index 0e38ddb1..d38bc703 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,196 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Zhuoran Li + * [LinkedIn](https://www.linkedin.com/in/zhuoran-li-856658244/) +* Tested on: Windows 11, AMD Ryzen 5 5600H @ 3.30 GHz 16.0GB, NVIDIA GeForce RTX 3050 Laptop GPU 4GB -### (TODO: Your README) +## 1. Overview +This project implements the parallel **prefix sum (scan)** algorithm (naive and work-efficient), as well as two of its applications: **stream compaction** and **radix sort**. -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +### cpu.cu +- `CPU::scan`, `CPU::compactWithoutScan` and `CPU::compactWithScan` provide serial implementations used as references for testing. +### naive.cu +- `Naive::scan`: Performs an exclusive scan within each block by iteratively doubling the offset and using shared memory as a ping-pong buffer. Each thread contributes by summing its own element with a neighbor at the current offset. + + **For arrays larger than a block**, I compute a block sum (the last element of a block's scan + the last element of the input array within the block) for each block and then scan the block sum array on the host. The resulting offsets are written back to the device and added to each block's output. + + +### efficient.cu +- `Efficient::scan`: Contains a binary tree-based algorithm, including an up-sweep and a down-sweep phase. Like the naive scan, it relies on shared memory for cross-block communication. + + Since the algorithm assumes a binary tree structure, we need to pad the array to the next power of 2. To do that, I allocated another array on device with padded size, initialize all elements with 0, and copy the initial input array into the padded array. +- `Efficient::compact`: The stream compaction algorithm first maps the input array into a boolean array, marking elements with 1 and 0. An exclusive scan is then performed on this boolean array to produce target indices for the valid elements. Finally, a scatter kernel places the valid elements into their new compacted positions, and the total count is obtained from the last scanned index plus the last boolean value. + + +## 2. Extra Features +### 2.1. Shared Memory Allocation +Both `Naive::scan` and `Efficient::scan` load the input array into shared memory to improve performance. Shared memory is allocated dynamically at kernel launch, with the required size specified as part of the kernel configuration. + +Using `Efficient::scan()` as an example: + +- **Single-block case:** If only *one* block is used, we allocate `2 * n * sizeof(int)` bytes of shared memory, where `n` is the array length. + +- **General case:** For arrays of arbitrary length processed across multiple blocks, we instead allocate `2 * blockDim.x * sizeof(int)` bytes per block, which is how I implemented here. + +``` +kernEfficientScan<<>>(pad_n, d_odata, d_idata, d_blockSum); +``` +Inside the kernel, the input array is first copied into a shared memory buffer: +``` +extern __shared__ int temp[]; +temp[2 * thid] = idata[2 * id]; +temp[2 * thid + 1] = idata[2 * id + 1]; +``` + +Notice here `thid = threadIdx.x`, `id = threadIdx.x + blockIdx.x * blockDim.x`. + +In the initial version, each thread simply loaded two consecutive elements into `temp`. In the optimized version, additional steps are taken to avoid bank conflicts (see [2.2. Avoid Bank Conflicts](#2.2.-avoid-bank-conflicts)). + + +### 2.2. Avoid Bank Conflicts +To avoid bank conflicts in `Efficient::scan`, we add a small offset to each index so that consecutive threads are distributed across different banks. Compared to the implementation in [GPU Gem 3 Ch 39-3](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), the key difference lies in how the input array is loaded **across blocks**. + +The reference code uses an offset of `n / 2` when assigning indices. In contrast, my implementation uses `blockDim.x` as the offset. This adjustment is necessary because each thread must access elements based on its **global index** rather than just its local position within a block. + +Accordingly, calculate indices relative to the start of each block: +``` +int ai = thid; +int bi = thid + blockDim.x; +// ... +int blockStart = blockIdx.x * size; +temp[ai + bankOffsetA] = idata[blockStart + ai]; +temp[bi + bankOffsetB] = idata[blockStart + bi]; +``` + +### 2.3. Radix Sort +I implemented radix sort in `Efficient::sort` inside efficient.cu. To do this, I wrote two kernels in `common.cu`: `Common::kernMapToBit` and `Common::kernRadixScatter`. + +The process works like this: +- Since an integer has 32 bits, I run 32 passes. +- In each pass, I call `Common::kernMapToBit` to extract the current bit from each integer. +- Then, I call `Efficient::sort` to compute the array of indices for elements where the bit is 0. +- Next, I call `Common::kernRadixScatter` to scatter the elements into their correct positions using those indices. + +For radix sort, I just use global memory because I'm lazy :). + +I implement a thrust call using `thrust::sort` in `thrust.cu` as the reference. Finally, I added two tests in `main.cpp` to test the correctness of my radix sort. + +The usage of radix sort: +``` +StreamCompaction::Efficient::sort(SIZE, c, a); +``` + +The given array and the output: +``` + [ 2 10 6 37 7 38 48 27 19 31 40 30 31 ... 48 0 ] +==== thrust sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ] +==== work-efficient sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ] + passed +``` + +### 2.4. Thrust::remove_if +I also implemented stream compaction using Thrust in `Thrust::compact` with the `thrust::remove_if` function. The code is placed in `thrust.cu`, and I added tests after the compaction tests in `main.cpp` to compare it with my own implementations. + +## 3. Performance Analysis +### Block size optimization +![](images/graph3.png) + +Based on the test, performance is best with the block size of 512. + +### Scan +![](images/graph1.png) + +For smaller array sizes, CPU performs better than GPU due to lower overhead in launching kernels and managing memory transfers. However, as the array size grows, the parallelism of the GPU becomes more effective, and all GPU algorithms begin to perform better than CPU. Among the GPU implementations, the work-efficient scan consistently works faster than the naive version, while Thrust achieves the best performance overall. + +### Stream Compaction +![](images/graph2.png) + +Stream compaction shows similar pattern - for smaller array size, CPU performs better. When the array size become larger, the cost of CPU grows much more faster, and GPU algorithms beats the CPU's. Thrust still performs better. + +### Nsight Analysis +Thrust: `thrust::exclusive_scan` and `thrust::remove_if` +![](images/report1.png) + +Work-efficient scan: `kernEfficientScan` +![](images/report2.png) + +Stream compaction: `kernMapToBoolean` and `kernScatter` +![](images/report3.png) + +The memory throughput in thrust kernels is very high, meaning that data is loaded efficiently and bandwidth is well utilized. My work-efficient scan shows much lower memory throughput, meaning it does not achieve ideal bandwidth. Furthermore, the compute throughput of my kernels is also quite low, meaning that threads are not being kept fully busy. Thrust's scan and compaction kernels, instead, achieve relatively high compute throughput. + +### Test output (array size: 2^24) +``` +**************** +** SCAN TESTS ** +**************** + [ 46 10 13 38 0 15 26 33 48 46 2 37 47 ... 37 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 11.078ms (std::chrono Measured) + [ 0 46 56 69 107 107 122 148 181 229 275 277 314 ... 410917379 410917416 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 11.6785ms (std::chrono Measured) + [ 0 46 56 69 107 107 122 148 181 229 275 277 314 ... 410917332 410917365 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 5.64531ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 5.0217ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 3.21946ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 2.62963ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 1.34349ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 1.5104ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 2 2 0 3 3 0 3 3 2 0 0 2 0 ... 0 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 33.1457ms (std::chrono Measured) + [ 2 2 3 3 3 3 2 2 1 3 1 1 2 ... 2 2 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 34.8257ms (std::chrono Measured) + [ 2 2 3 3 3 3 2 2 1 3 1 1 2 ... 1 2 ] + passed +==== cpu compact with scan ==== + elapsed time: 83.9328ms (std::chrono Measured) + [ 2 2 3 3 3 3 2 2 1 3 1 1 2 ... 2 2 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 2.35418ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 5.15379ms (CUDA Measured) + passed +==== thrust compact, power-of-two ==== + elapsed time: 1.65043ms (CUDA Measured) + passed +==== thrust compact, non-power-of-two ==== + elapsed time: 1.6233ms (CUDA Measured) + passed + +********************** +** RADIX SORT TESTS ** +********************** + [ 3 38 24 29 29 19 1 28 4 18 6 26 39 ... 10 0 ] +==== thrust sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ] +==== work-efficient sort, power-of-two ==== + [ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ] + passed +``` diff --git a/images/graph1.png b/images/graph1.png new file mode 100644 index 00000000..5c057a82 Binary files /dev/null and b/images/graph1.png differ diff --git a/images/graph2.png b/images/graph2.png new file mode 100644 index 00000000..6aeb30b1 Binary files /dev/null and b/images/graph2.png differ diff --git a/images/graph3.png b/images/graph3.png new file mode 100644 index 00000000..d73f6a02 Binary files /dev/null and b/images/graph3.png differ diff --git a/images/report1.png b/images/report1.png new file mode 100644 index 00000000..abb35aa9 Binary files /dev/null and b/images/report1.png differ diff --git a/images/report2.png b/images/report2.png new file mode 100644 index 00000000..285f306d Binary files /dev/null and b/images/report2.png differ diff --git a/images/report3.png b/images/report3.png new file mode 100644 index 00000000..3c37ae51 Binary files /dev/null and b/images/report3.png differ diff --git a/src/main.cpp b/src/main.cpp index 3d5c8820..5a7d3e2c 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -13,7 +13,7 @@ #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int SIZE = 1 << 24; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -147,6 +147,46 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + // Thrust compaction + zeroArray(SIZE, c); + printDesc("thrust compact, power-of-two"); + count = StreamCompaction::Thrust::compact(SIZE, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(SIZE, c, true); + printCmpLenResult(count, expectedCount, b, c); + + zeroArray(SIZE, c); + printDesc("thrust compact, non-power-of-two"); + count = StreamCompaction::Thrust::compact(NPOT, c, a); + printElapsedTime(StreamCompaction::Thrust::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(NPOT, c, true); + printCmpLenResult(count, expectedNPOT, b, c); + + + // Radix sort tests + printf("\n"); + printf("**********************\n"); + printf("** RADIX SORT TESTS **\n"); + printf("**********************\n"); + + genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + zeroArray(SIZE, b); // Use as reference + printDesc("thrust sort, power-of-two"); + StreamCompaction::Thrust::sort(SIZE, b, a); + //printElapsedTime(StreamCompaction::Thrust::timer().getCpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(SIZE, b, true); + + zeroArray(SIZE, c); + printDesc("work-efficient sort, power-of-two"); + StreamCompaction::Efficient::sort(SIZE, c, a); + //printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; diff --git a/stream_compaction/common.cu b/stream_compaction/common.cu index 2ed6d630..175a1eaa 100644 --- a/stream_compaction/common.cu +++ b/stream_compaction/common.cu @@ -24,6 +24,14 @@ namespace StreamCompaction { */ __global__ void kernMapToBoolean(int n, int *bools, const int *idata) { // TODO + int id = threadIdx.x + blockIdx.x * blockDim.x; + if (id >= n) return; + + if (idata[id] == 0) { + bools[id] = 0; + } else { + bools[id] = 1; + } } /** @@ -33,6 +41,35 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices) { // TODO + int id = threadIdx.x + blockIdx.x * blockDim.x; + if (id >= n) return; + + if (bools[id] == 1) { + odata[indices[id]] = idata[id]; + } + } + + __global__ void kernMapToBit(int n, int* revBits, const int* idata, int pass) { + int id = threadIdx.x + blockIdx.x * blockDim.x; + if (id >= n) return; + + int bit = (idata[id] >> pass) & 1; + revBits[id] = 1 - bit; + } + + __global__ void kernRadixScatter(int n, int totalFalses, int* odata, const int* revBits, const int* falses, const int* idata) { + int id = threadIdx.x + blockIdx.x * blockDim.x; + if (id >= n) return; + + int t = id - falses[id] + totalFalses; + + __syncthreads(); + + int index = revBits[id] ? falses[id] : t; + + __syncthreads(); + + odata[index] = idata[id]; } } diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed9..e8e87c3b 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -9,6 +9,7 @@ #include #include #include +#include #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) @@ -37,6 +38,11 @@ namespace StreamCompaction { __global__ void kernScatter(int n, int *odata, const int *idata, const int *bools, const int *indices); + __global__ void kernMapToBit(int n, int* bits, const int* idata, int pass); + + __global__ void kernRadixScatter(int n, int totalFalses, int* odata, + const int* revBits, const int* falses, const int* idata); + /** * This class is used for timing the performance * Uncopyable and unmovable diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa115..1a1bcc4c 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -1,5 +1,6 @@ #include #include "cpu.h" +#include #include "common.h" @@ -17,9 +18,45 @@ namespace StreamCompaction { * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int* odata, const int* idata) { timer().startCpuTimer(); // TODO + + // -------------- Option 1: Single for loop ---------------- + odata[0] = 0; + for (int i = 1; i < n; ++i) { + odata[i] = odata[i - 1] + idata[i - 1]; + } + + // --------------- Option 2: Simulate GPU ------------------- + //int layer = ilog2ceil(n); + + //// Copy the input to the ping pong buffer + //int* ppdata = new int[n]; + //for (int i = 0; i < n; ++i) { + // ppdata[i] = idata[i]; + //} + + //// Scan + //for (int d = 1; d <= layer; ++d) { + // int offset = 1 << (d - 1); + // for (int i = 0; i < n; ++i) { + // if (i >= offset) { + // odata[i] = ppdata[i - offset] + ppdata[i]; + // } else { + // odata[i] = ppdata[i]; + // } + // } + // std::swap(odata, ppdata); + //} + //// Shift right + //odata[0] = 0; + //for (int i = 1; i < n; ++i) { + // odata[i] = ppdata[i - 1]; + //} + //delete[] ppdata; + + // ----------------------------------------------------------- timer().endCpuTimer(); } @@ -31,8 +68,16 @@ namespace StreamCompaction { int compactWithoutScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + int idx = 0; + for (int i = 0; i < n; ++i) { + if (idata[i] != 0) { + odata[idx] = idata[i]; + idx++; + } + } + timer().endCpuTimer(); - return -1; + return idx; } /** @@ -43,8 +88,38 @@ namespace StreamCompaction { int compactWithScan(int n, int *odata, const int *idata) { timer().startCpuTimer(); // TODO + // Create a boolean array + int* boolVal = new int[n]; + int cnt = 0; + for (int i = 0; i < n; ++i) { + if (idata[i] == 0) { + boolVal[i] = 0; + } else { + boolVal[i] = 1; + cnt++; + } + } + + // Scan + int* scanVal = new int[n]; + scanVal[0] = 0; + for (int i = 1; i < n; ++i) { + scanVal[i] = scanVal[i - 1] + boolVal[i - 1]; + } + + // Scatter + for (int i = 0; i < n; ++i) { + if (boolVal[i]) { + int idx = scanVal[i]; + odata[idx] = idata[i]; + } + } + + delete[] boolVal; + delete[] scanVal; + timer().endCpuTimer(); - return -1; + return cnt; } } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346ee..cc7d0d8a 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -3,6 +3,10 @@ #include "common.h" #include "efficient.h" +#define NUM_BANKS 16 +#define LOG_NUM_BANKS 4 +#define CONFLICT_FREE_OFFSET(n)((n) >> NUM_BANKS + (n) >> (2 * LOG_NUM_BANKS)) + namespace StreamCompaction { namespace Efficient { using StreamCompaction::Common::PerformanceTimer; @@ -12,13 +16,162 @@ namespace StreamCompaction { return timer; } + __global__ void kernEfficientScan(int n, int* odata, const int* idata, int* blockSum) { + //int id = blockIdx.x * blockDim.x + threadIdx.x; + int thid = threadIdx.x; + int size = 2 * blockDim.x; + int offset = 1; + + // Shared memory array + extern __shared__ int temp[]; + + // Load input into shared memory + // Avoid bandk conflict + int ai = thid; + int bi = thid + blockDim.x; + + int bankOffsetA = CONFLICT_FREE_OFFSET(ai); + int bankOffsetB = CONFLICT_FREE_OFFSET(bi); + + int blockStart = blockIdx.x * size; + + temp[ai + bankOffsetA] = idata[blockStart + ai]; + temp[bi + bankOffsetB] = idata[blockStart + bi]; + + //temp[2 * thid] = idata[2 * id]; + //temp[2 * thid + 1] = idata[2 * id + 1]; + __syncthreads(); + + // Up sweep + for (int d = size >> 1; d > 0; d >>= 1) { + __syncthreads(); + if (thid < d) { + int ai = offset * (2 * thid + 1) - 1; + int bi = offset * (2 * thid + 2) - 1; + ai += CONFLICT_FREE_OFFSET(ai); + bi += CONFLICT_FREE_OFFSET(bi); + temp[bi] += temp[ai]; + } + offset *= 2; + } + + + // Set the root for down sweep tree to 0 + if (thid == 0) { + blockSum[blockIdx.x] = temp[size - 1]; + //temp[size - 1] = 0; + temp[size - 1 + CONFLICT_FREE_OFFSET(size - 1)] = 0; + } + + // Down sweep + for (int d = 1; d < size; d *= 2) { + offset >>= 1; + __syncthreads(); + if (thid < d) { + int ai = offset * (2 * thid + 1) - 1; + int bi = offset * (2 * thid + 2) - 1; + ai += CONFLICT_FREE_OFFSET(ai); + bi += CONFLICT_FREE_OFFSET(bi); + int t = temp[ai]; + temp[ai] = temp[bi]; + temp[bi] += t; + } + } + __syncthreads(); + + // Write results + odata[blockStart + ai] = temp[ai + bankOffsetA]; + odata[blockStart + bi] = temp[bi + bankOffsetB]; + + //odata[2 * id] = temp[2 * thid]; + //odata[2 * id + 1] = temp[2 * thid + 1]; + } + + + __global__ void kernAddBlockOffset(int n, int* odata, const int* blockOffset) { + int id = blockIdx.x * blockDim.x + threadIdx.x; + if (id >= n / 2) return; + odata[2 * id] += blockOffset[blockIdx.x]; + odata[2 * id + 1] += blockOffset[blockIdx.x]; + } + + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO + // Pad + int pad_n = 1 << ilog2ceil(n); + + // Assign hreads and blocks + int threads = 512; + int blocks = (pad_n / 2 + threads - 1) / threads; // ceil + + // Allocate memory on host + int* blockSum = new int[blocks]; + int* blockOffset = new int[blocks]; + + // Allocate & copy memory on device + int* d_odata, *d_idata, *d_blockSum, *d_blockOffset; + + cudaMalloc(&d_odata, pad_n * sizeof(int)); + checkCUDAError("Efficient::scan::cudaMalloc d_odata fails!"); + + cudaMalloc(&d_idata, pad_n * sizeof(int)); + checkCUDAError("Efficient::scan::cudaMalloc d_idata fails!"); + + cudaMalloc(&d_blockSum, blocks * sizeof(int)); + checkCUDAError("Efficient::scan::cudaMalloc d_blockSum fails!"); + + cudaMalloc(&d_blockOffset, blocks * sizeof(int)); + checkCUDAError("Efficient::scan::cudaMalloc d_blockOffset fails!"); + + cudaMemset(d_idata, 0, pad_n * sizeof(int)); + checkCUDAError("Efficient::scan::cudaMemset d_idata fails!"); + + cudaMemcpy(d_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("Efficient::scan::cudaMemcpyHostToDevice fails!"); + + // Handle array of arbitrary length - create and set blockSum + cudaMemset(d_blockSum, 0, blocks * sizeof(int)); // Initialize the sum to 0 + checkCUDAError("Efficient::scan::cudaMemset d_blockSum fails!"); + + timer().startGpuTimer(); + + // Call kernEfficientScan + kernEfficientScan<<>>(pad_n, d_odata, d_idata, d_blockSum); + checkCUDAError("Efficient::scan::kernEfficientScan fails!"); + + // Handle array of arbitrary length - scan the blockSum (on CPU) + cudaMemcpy(blockSum, d_blockSum, blocks * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("Efficient::scan::cudaMemcpyDeviceToHost fails!"); + + blockOffset[0] = 0; + for (int i = 1; i < blocks; ++i) { + blockOffset[i] = blockOffset[i - 1] + blockSum[i - 1]; + } + + // Handle array of arbitrary length - add the offset back to array + cudaMemcpy(d_blockOffset, blockOffset, blocks * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("Efficient::scan::cudaMemcpyHostToDevice fails!"); + + // Call kernAddBlockOffset + kernAddBlockOffset<<>>(pad_n, d_odata, d_blockOffset); + checkCUDAError("Efficient::scan::kernAddBlockOffset fails!"); + timer().endGpuTimer(); + + // Copy the value back + cudaMemcpy(odata, d_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("Efficient::scan::cudaMemcpyDeviceToHost fails!"); + + delete[] blockSum; + delete[] blockOffset; + cudaFree(d_idata); + cudaFree(d_odata); + cudaFree(d_blockSum); + cudaFree(d_blockOffset); } /** @@ -31,10 +184,119 @@ namespace StreamCompaction { * @returns The number of elements remaining after compaction. */ int compact(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO - timer().endGpuTimer(); - return -1; + int threads = 512; + int blocks = (n + threads - 1) / threads; // ceil + + // Allocate memory on device + int* d_odata, *d_idata, *d_bools, *d_indices; + + cudaMalloc(&d_odata, n * sizeof(int)); + checkCUDAError("Efficient::compact::cudaMalloc d_odata fails!"); + + cudaMalloc(&d_idata, n * sizeof(int)); + checkCUDAError("Efficient::compact::cudaMalloc d_idata fails!"); + + cudaMalloc(&d_bools, n * sizeof(int)); + checkCUDAError("Efficient::compact::cudaMalloc d_bools fails!"); + + cudaMalloc(&d_indices, n * sizeof(int)); + checkCUDAError("Efficient::compact::cudaMalloc d_indices fails!"); + + cudaMemcpy(d_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("Efficient::compact::cudaMemcpyHostToDevice fails!"); + + //timer().startGpuTimer(); + + // Create boolean array + Common::kernMapToBoolean<<>>(n, d_bools, d_idata); + checkCUDAError("Efficient::compact::kernMapToBoolean fails!"); + + // Create indices array through exclusive scan + scan(n, d_indices, d_bools); + + // Scatter + Common:: kernScatter<<>>(n, d_odata, d_idata, d_bools, d_indices); + checkCUDAError("Efficient::compact::kernScatter fails!"); + + //timer().endGpuTimer(); + + cudaMemcpy(odata, d_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("Efficient::compact::cudaMemcpyDeviceToHost fails!"); + + // Get the count + int lastIndex, lastBool; + cudaMemcpy(&lastIndex, d_indices + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&lastBool, d_bools + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + int count = lastIndex + lastBool; + + cudaFree(d_odata); + cudaFree(d_idata); + cudaFree(d_bools); + cudaFree(d_indices); + + return count; + } + + void sort(int n, int* odata, const int* idata) { + int threads = 512; // temp + int blocks = (n + threads - 1) / threads; // temp + + // Allocate memory on device + int* d_odata, *d_idata, *d_revBits, *d_falses, *d_trues, *d_indices; + + cudaMalloc(&d_odata, n * sizeof(int)); + checkCUDAError("Efficient::sort::cudaMalloc d_odata fails!"); + + cudaMalloc(&d_idata, n * sizeof(int)); + checkCUDAError("Efficient::sort::cudaMalloc d_idata fails!"); + + cudaMalloc(&d_revBits, n * sizeof(int)); + checkCUDAError("Efficient::sort::cudaMalloc d_bits fails!"); + + cudaMalloc(&d_falses, n * sizeof(int)); + checkCUDAError("Efficient::sort::cudaMalloc d_falses fails!"); + + cudaMalloc(&d_trues, n * sizeof(int)); + checkCUDAError("Efficient::sort::cudaMalloc d_trues fails!"); + + cudaMalloc(&d_indices, n * sizeof(int)); + checkCUDAError("Efficient::sort::cudaMalloc d_indices fails!"); + + // Copy memory to device + cudaMemcpy(d_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("Efficient::sort::cudaMemcpyHostToDevice fails!"); + + for (int i = 0; i < 32; ++i) { + // Map to bit + Common::kernMapToBit<<>>(n, d_revBits, d_idata, i); + checkCUDAError("Common::kernMapToBit fails!"); + + // Scan the reversed bits + scan(n, d_falses, d_revBits); + + // Get total falses + int lastRevBit, lastFalse; + cudaMemcpy(&lastRevBit, d_revBits + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&lastFalse, d_falses + (n - 1), sizeof(int), cudaMemcpyDeviceToHost); + int totalFalses = lastRevBit + lastFalse; + + // Get indices + Common::kernRadixScatter<<>>(n, totalFalses, d_odata, d_revBits, d_falses, d_idata); + + int* temp = d_idata; + d_idata = d_odata; + d_odata = temp; + } + + cudaMemcpy(odata, d_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + + cudaFree(d_odata); + cudaFree(d_idata); + cudaFree(d_revBits); + cudaFree(d_falses); + cudaFree(d_trues); + cudaFree(d_indices); } } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4fe..ce56eadf 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -9,5 +9,7 @@ namespace StreamCompaction { void scan(int n, int *odata, const int *idata); int compact(int n, int *odata, const int *idata); + + void sort(int n, int* odata, const int* idata); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 43088769..e973e301 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -13,13 +13,121 @@ namespace StreamCompaction { } // TODO: __global__ + + __global__ void kernNaiveScan(int n, int* odata, const int* idata, int* blockSum) { + int id = blockIdx.x * blockDim.x + threadIdx.x; + int thid = threadIdx.x; + int size = blockDim.x; + + if (id >= n) return; + + // Shared memory array + extern __shared__ int temp[]; + + // Set ping pong buffer flags + int pout = 0, pin = 1; + + // Load array + temp[pout * size + thid] = (thid > 0) ? idata[id - 1] : 0; // thread 0 -> 0, others -> idata[id - 1] + __syncthreads(); + + // Scan within the block + for (int offset = 1; offset < n; offset *= 2) { + pout = 1 - pout; + pin = 1 - pout; + if (thid >= offset) { + temp[pout * size + thid] = temp[pin * size + thid] + temp[pin * size + thid - offset]; + } + else { + temp[pout * size + thid] = temp[pin * size + thid]; + } + __syncthreads(); + } + + odata[id] = temp[pout * size + thid]; + + // Write block sum + if (threadIdx.x == blockDim.x - 1 || id == n - 1) { + int blockEnd = (n < (blockIdx.x + 1) * size) ? n : ((blockIdx.x + 1) * size); // last global index in this block + blockSum[blockIdx.x] = odata[blockEnd - 1] + idata[blockEnd - 1]; + } + } + + __global__ void kernAddBlockOffset(int n, int* odata, const int* blockOffset) { + int id = blockIdx.x * blockDim.x + threadIdx.x; + if (id >= n) return; + odata[id] += blockOffset[blockIdx.x]; + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { - timer().startGpuTimer(); // TODO + int threads = 512; + int blocks = (n + threads - 1) / threads; // ceil + + // Allocate memory on host + int* blockSum = new int[blocks]; + int* blockOffset = new int[blocks]; + + // Allocate & copy memory on device + int* d_odata, *d_idata, *d_blockSum, *d_blockOffset; + + cudaMalloc(&d_odata, n * sizeof(int)); + checkCUDAError("Naive::cudaMalloc d_odata fails!"); + + cudaMalloc(&d_idata, n * sizeof(int)); + checkCUDAError("Naive::cudaMalloc d_idata fails!"); + + cudaMalloc(&d_blockSum, blocks * sizeof(int)); + checkCUDAError("Naive::cudaMalloc d_blockSum fails!"); + + cudaMalloc(&d_blockOffset, blocks * sizeof(int)); + checkCUDAError("Naive::cudaMalloc d_blockOffset fails!"); + + cudaMemcpy(d_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("Naive::cudaMemcpyHostToDevice fails!"); + + // Handle array of arbitrary length - create and set blockSum + cudaMemset(d_blockSum, 0, blocks * sizeof(int)); // Initialize the sum to 0 + checkCUDAError("Naive::cudaMemset d_blockSum fails!"); + + timer().startGpuTimer(); + + // Call kernNaiveScan + kernNaiveScan<<>>(n, d_odata, d_idata, d_blockSum); + checkCUDAError("Naive::kernNaiveScan fails!"); + + // Handle array of arbitrary length - scan the blockSum (on CPU) + cudaMemcpy(blockSum, d_blockSum, blocks * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("Naive::cudaMemcpyDeviceToHost fails!"); + + blockOffset[0] = 0; + for (int i = 1; i < blocks; ++i) { + blockOffset[i] = blockOffset[i - 1] + blockSum[i - 1]; + } + + // Handle array of arbitrary length - add the offset back to array + cudaMemcpy(d_blockOffset, blockOffset, blocks * sizeof(int), cudaMemcpyHostToDevice); + checkCUDAError("Naive::cudaMemcpyHostToDevice fails!"); + + // Call kernAddBlockOffset + kernAddBlockOffset<<>>(n, d_odata, d_blockOffset); + checkCUDAError("Naive::kernAddBlockOffset fails!"); + timer().endGpuTimer(); + + // Copy the value back + cudaMemcpy(odata, d_odata, n * sizeof(int), cudaMemcpyDeviceToHost); + checkCUDAError("Naive::cudaMemcpyDeviceToHost fails!"); + + delete[] blockSum; + delete[] blockOffset; + cudaFree(d_idata); + cudaFree(d_odata); + cudaFree(d_blockSum); + cudaFree(d_blockOffset); } } } diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e7..91d81f41 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -3,6 +3,8 @@ #include #include #include +#include +#include #include "common.h" #include "thrust.h" @@ -18,11 +20,54 @@ namespace StreamCompaction { * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ void scan(int n, int *odata, const int *idata) { + thrust::host_vector in(idata, idata + n); + thrust::device_vector d_in = in; + thrust::device_vector d_out(n); + timer().startGpuTimer(); + // TODO use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + + thrust::exclusive_scan(d_in.begin(), d_in.end(), d_out.begin()); + + timer().endGpuTimer(); + + thrust::copy(d_out.begin(), d_out.end(), odata); + } + + struct is_zero + { + __host__ __device__ + bool operator()(const int x) const { + return x == 0; + } + }; + + int compact(int n, int* odata, const int* idata) { + thrust::device_vector d_in(idata, idata + n); + + timer().startGpuTimer(); + + auto new_end = thrust::remove_if(d_in.begin(), d_in.end(), is_zero()); + + // Get the new array size + int new_size = new_end - d_in.begin(); + timer().endGpuTimer(); + + thrust::copy(d_in.begin(), new_end, odata); + + return new_size; + } + + void sort(int n, int* odata, const int* idata) { + thrust::device_vector d_in(idata, idata + n); + + thrust::sort(d_in.begin(), d_in.end()); + + thrust::copy(d_in.begin(), d_in.end(), odata); } } } diff --git a/stream_compaction/thrust.h b/stream_compaction/thrust.h index fe98206b..74d6a15d 100644 --- a/stream_compaction/thrust.h +++ b/stream_compaction/thrust.h @@ -7,5 +7,9 @@ namespace StreamCompaction { StreamCompaction::Common::PerformanceTimer& timer(); void scan(int n, int *odata, const int *idata); + + int compact(int n, int* odata, const int* idata); + + void sort(int n, int* odata, const int* idata); } }