diff --git a/README.md b/README.md index 0e38ddb..4c55cd8 100644 --- a/README.md +++ b/README.md @@ -3,12 +3,206 @@ CUDA Stream Compaction **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2** -* (TODO) YOUR NAME HERE - * (TODO) [LinkedIn](), [personal website](), [twitter](), etc. -* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab) +* Gehan Zheng + * [LinkedIn](https://www.linkedin.com/in/gehan-zheng-05877b24a/), [personal website](https://grahamzen.github.io/). +* Tested on: Windows 10, AMD Ryzen 7 5800H @ 3.2GHz 16GB, GeForce RTX 3060 Laptop 6144MB (Personal Laptop) -### (TODO: Your README) +## Features -Include analysis, etc. (Remember, this is public, so don't put -anything here that you don't want to share with the world.) +This project implements the following algorithms on CPU and GPU. All extra parts are implemented: +* Scan + + * CPU Scan (Part 1) + * Naive GPU Scan (Part 2) + * Efficient GPU Scan (Part 3, 5) + * Thrust Scan (Part 4) + * GPU Scan Using Shared Memory && Hardware Optimization (Part 7) +* Stream Compaction + + * CPU Stream Compaction (Part 1) + * GPU Stream Compaction (Part 3) +* Radix Sort (Part 6) + * Shared Memory Optimization + +### Comments on Part 5 + +When writing kernel functions, I use stride to compact threads in less warps from the beginning so it is faster than cpu approach for larger array size. + +### Comments on Part 7 + +GPU Scan Using Shared Memory is implemented through using following scheme (courtesy of slides(Parallel Algorithms)): + +![1695064484197](image/README/1695064484197.png) + + We can perform action of storing data into shared memory for each block, and then perform scan on each block. + Instead of performing inclusive scan as in the slides, I perform exclusive scan on each block, so that in the last step, the result is naturally exclusive scan on the whole array. Within this step, we can also compute the sum of each block inside the same kernel. Since I am performing exclusive scan, Block sums needed to be computed instead of simply using the last element of each block (still in the same kernel). + +![1695064518909](image/README/1695064518909.png) + +After we have the block sums, we can perform exclusive scan on the block sums, and then add the result to each block. However, when # of elements is large like $2^{28}$, a better way is to reuse the kernel we have written for the first step since # of block sums is still large. We will do this until # of block sums is small enough to be computed on CPU. Obviously, we should also continuously add the block sums back to the array until we get the final result. It requires much more cudaMalloc if we call cudaMalloc for computing each block sum. So I use a buffer to store both the block sums and the array, and using different offset to access them. Thanks Chang Liu for this idea. + +For reducing bank conflict, I simply add an offset to the index of shared memory. I have tried using the scheme in [GPU Gem Ch 39](https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda), but it is not faster than the scheme I am using now. + +## Performance Analysis + +### Choosing Block Size + +I tested the performance of different block sizes on scan. In the table below, time is in milliseconds. + +| Block Size | 32 | 64 | 128 | 256 | 512 | +| ----------------------------------------------------- | -------: | ------: | -------: | -------: | -------: | +| Naive scan, power-of-two | 353.852 | 263.32 | 259.132 | 266.057 | 259.93 | +| Naive scan, non-power-of-two | 354.332 | 265.482 | 260.186 | 262.097 | 259.205 | +| Work-efficient scan, power-of-two | 81.0401 | 82.6295 | 83.5005 | 83.4684 | 83.6312 | +| Work-efficient scan, non-power-of-two | 81.1159 | 82.5769 | 84.338 | 83.9051 | 83.9558 | +| Work-efficient scan (shared memory), power-of-two | 46.8579 | 27.9556 | 22.0506 | 23.4973 | 26.1206 | +| Work-efficient scan (shared memory), non-power-of-two | 46.3985 | 28.2063 | 22.0158 | 23.4613 | 26.032 | +| Thrust scan, power-of-two | 9.61946 | 9.43411 | 9.40954 | 9.66554 | 9.69523 | +| Thrust scan, non-power-of-two | 9.2928 | 9.81402 | 9.55187 | 9.22931 | 9.29382 | +| Average | 122.8136 | 96.1773 | 93.77304 | 95.17262 | 94.73296 | + +Though the performance difference is not significant, from the table above, we can still easily find that the best block size is 128, which outperforms other block sizes in most cases. + +### Comparing different algorithms + +* Scan + +For clarity, I only plot the performance of non-power-of-two scan. + +![scan](image/README/scan.png) + +Excitingly, in several cases ($2^{20}, 2^{22}$), work-efficient scan using shared memory is faster than Thrust scan. My guess is that Thrust scan isn't well-optimized for these array sizes because it is designed for larger array sizes and it's hard to optimize for every array size. + +* Stream Compaction +![compact](image/README/compact.png) + +* Radix Sort + +For Radix sort, I tested performance of sorting integers within range [0, 2^31-1]. + +![sort](image/README/sort.png) + +From the figure above, we can see that Radix sort is much faster than std::sort when array size is comparatively large, and it is even faster when optimizing for shared memory. + + +* Comparing Scan w/ and w/o bank conflict reduction + +![bankconflict](image/README/bankconflict.png) + +We can see a roughly 20% performance improvement when using bank conflict reduction, and the performance improvement is more significant when array size is large. +#### Guess about implementation of Thrust Scan + +![1695092955546](image/README/1695092955546.png) + +From the profile we can see that Thrust Scan launches 3 kernels. I confirm that they are launched in this order by observing the order of the kernels while profiling. +Since DeviceScanKernel uses 56 registers, much more than any other kernels, I guess it is the kernel for computing scan. From the name I guess what another two kernels has done is to prepare for the context of DeviceScanKernel kernel. + + +Since I don't know any details about the implementation of Thrust Scan, I will not explain the performance of Thrust Scan. + +### Phenomena we can observe from the figure of Scan +* For small array size, CPU scan is much faster than GPU scan. + * It is because the overhead of launching kernel is much larger than the time spent on computing scan. As the array size increases, the time spent on computing scan becomes more significant, and GPU scan outperforms CPU scan. + +* Naive scan is always slower than CPU scan, and work-efficient scan is faster than CPU scan when array size is large enough. + * It is because Naive scan launches kernel much more times than work-efficient scan, and when array size is large enough, the time spent on computing scan is much larger than the overhead of launching kernel. Though work-efficient scan is faster than CPU scan, it accesses data using global memory, which has a much lower bandwidth than shared memory. + +### Bottleneck of Scans + +* Naive Scan + * Memory I/O: Global Memory Bandwidth + * Computation: Warps' Low Occupancy due to memory access pattern (large stride). + +* Work-efficient Scan + * Memory I/O: Global Memory Bandwidth. + +* Work-efficient Scan Using Shared Memory: + * Computation: Overhead of launching kernels. + + +### Output of the test program + +I added tests for work-efficient scan with shared memory(both power-of-two and non-power-of-two), std::sort, and Radix sort with and without shared memory(both power-of-two and non-power-of-two). + +``` + +**************** +** SCAN TESTS ** +**************** + [ 23 22 5 42 29 39 9 20 41 37 14 4 27 ... 22 0 ] +==== cpu scan, power-of-two ==== + elapsed time: 245.021ms (std::chrono Measured) + [ 0 23 45 50 92 121 160 169 189 230 267 281 285 ... -2015456254 -2015456232 ] +==== cpu scan, non-power-of-two ==== + elapsed time: 195.01ms (std::chrono Measured) + [ 0 23 45 50 92 121 160 169 189 230 267 281 285 ... -2015456389 -2015456348 ] + passed +==== naive scan, power-of-two ==== + elapsed time: 260.406ms (CUDA Measured) + passed +==== naive scan, non-power-of-two ==== + elapsed time: 259.134ms (CUDA Measured) + passed +==== work-efficient scan, power-of-two ==== + elapsed time: 83.6105ms (CUDA Measured) + passed +==== work-efficient scan, non-power-of-two ==== + elapsed time: 83.6329ms (CUDA Measured) + passed +==== work-efficient scan (shared memory), power-of-two ==== + elapsed time: 22.0134ms (CUDA Measured) + passed +==== work-efficient scan (shared memory), non-power-of-two ==== + elapsed time: 21.939ms (CUDA Measured) + passed +==== thrust scan, power-of-two ==== + elapsed time: 9.83552ms (CUDA Measured) + passed +==== thrust scan, non-power-of-two ==== + elapsed time: 9.20576ms (CUDA Measured) + passed + +***************************** +** STREAM COMPACTION TESTS ** +***************************** + [ 3 0 2 0 3 3 3 2 3 0 2 3 1 ... 1 0 ] +==== cpu compact without scan, power-of-two ==== + elapsed time: 461.709ms (std::chrono Measured) + [ 3 2 3 3 3 2 3 2 3 1 1 3 3 ... 2 1 ] + passed +==== cpu compact without scan, non-power-of-two ==== + elapsed time: 461.435ms (std::chrono Measured) + [ 3 2 3 3 3 2 3 2 3 1 1 3 3 ... 3 1 ] + passed +==== cpu compact with scan ==== + elapsed time: 834.647ms (std::chrono Measured) + [ 3 2 3 3 3 2 3 2 3 1 1 3 3 ... 2 1 ] + passed +==== work-efficient compact, power-of-two ==== + elapsed time: 102.577ms (CUDA Measured) + passed +==== work-efficient compact, non-power-of-two ==== + elapsed time: 103.391ms (CUDA Measured) + passed + +***************************** +** RADIX-SORT TESTS ** +***************************** + [ 22645 21411 14651 15547 10324 2673 8146 27956 23148 22752 30878 16323 20512 ... 24259 0 ] +==== cpu std::sort, power-of-two ==== + elapsed time: 12380.9ms (std::chrono Measured) +==== radix-sort, power-of-two ==== + elapsed time: 3215.3ms (CUDA Measured) + passed +==== radix-sort (shared memory), power-of-two ==== + elapsed time: 1301.83ms (CUDA Measured) + passed +==== cpu std::sort, non-power-of-two ==== + elapsed time: 12305.5ms (std::chrono Measured) +==== radix-sort, non-power-of-two ==== + elapsed time: 3214.63ms (CUDA Measured) +==== radix-sort (shared memory), non-power-of-two ==== + elapsed time: 1301.99ms (CUDA Measured) + passed +``` diff --git a/image/README/1695064484197.png b/image/README/1695064484197.png new file mode 100644 index 0000000..9356801 Binary files /dev/null and b/image/README/1695064484197.png differ diff --git a/image/README/1695064497269.png b/image/README/1695064497269.png new file mode 100644 index 0000000..12f90bd Binary files /dev/null and b/image/README/1695064497269.png differ diff --git a/image/README/1695064518909.png b/image/README/1695064518909.png new file mode 100644 index 0000000..e68beef Binary files /dev/null and b/image/README/1695064518909.png differ diff --git a/image/README/1695092955546.png b/image/README/1695092955546.png new file mode 100644 index 0000000..709dc55 Binary files /dev/null and b/image/README/1695092955546.png differ diff --git a/image/README/bankconflict.png b/image/README/bankconflict.png new file mode 100644 index 0000000..0301756 Binary files /dev/null and b/image/README/bankconflict.png differ diff --git a/image/README/compact.png b/image/README/compact.png new file mode 100644 index 0000000..9d78de1 Binary files /dev/null and b/image/README/compact.png differ diff --git a/image/README/scan.png b/image/README/scan.png new file mode 100644 index 0000000..9bdeb3e Binary files /dev/null and b/image/README/scan.png differ diff --git a/image/README/sort.png b/image/README/sort.png new file mode 100644 index 0000000..b452ceb Binary files /dev/null and b/image/README/sort.png differ diff --git a/src/main.cpp b/src/main.cpp index 896ac2b..fd5d2e3 100644 --- a/src/main.cpp +++ b/src/main.cpp @@ -10,10 +10,12 @@ #include #include #include +#include #include #include "testing_helpers.hpp" -const int SIZE = 1 << 8; // feel free to change the size of array +const int NUMBITS = 31; // for radix sort +const int SIZE = 1 << 28; // feel free to change the size of array const int NPOT = SIZE - 3; // Non-Power-Of-Two int *a = new int[SIZE]; int *b = new int[SIZE]; @@ -81,6 +83,20 @@ int main(int argc, char* argv[]) { //printArray(NPOT, c, true); printCmpResult(NPOT, b, c); + zeroArray(SIZE, c); + printDesc("work-efficient scan (shared memory), power-of-two"); + StreamCompaction::Efficient::scanShared(SIZE, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(SIZE, c, true); + printCmpResult(SIZE, b, c); + + zeroArray(SIZE, c); + printDesc("work-efficient scan (shared memory), non-power-of-two"); + StreamCompaction::Efficient::scanShared(NPOT, c, a); + printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(NPOT, c, true); + printCmpResult(NPOT, b, c); + zeroArray(SIZE, c); printDesc("thrust scan, power-of-two"); StreamCompaction::Thrust::scan(SIZE, c, a); @@ -147,6 +163,51 @@ int main(int argc, char* argv[]) { //printArray(count, c, true); printCmpLenResult(count, expectedNPOT, b, c); + printf("\n"); + printf("*****************************\n"); + printf("** RADIX-SORT TESTS **\n"); + printf("*****************************\n"); + + genArray(SIZE - 1, a, (1 << NUMBITS) - 1); // Leave a 0 at the end to test that edge case + a[SIZE - 1] = 0; + printArray(SIZE, a, true); + + zeroArray(SIZE, b); + printDesc("cpu std::sort, power-of-two"); + StreamCompaction::CPU::sort(SIZE, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + + StreamCompaction::RadixSort::sort(SIZE, c, a, NUMBITS); + printDesc("radix-sort, power-of-two"); + printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printCmpLenResult(SIZE, SIZE, b, c); + + + zeroArray(SIZE, c); + StreamCompaction::RadixSort::sortShared(SIZE, c, a, NUMBITS); + printDesc("radix-sort (shared memory), power-of-two"); + printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + printCmpLenResult(SIZE, SIZE, b, c); + + + zeroArray(SIZE, b); + printDesc("cpu std::sort, non-power-of-two"); + StreamCompaction::CPU::sort(NPOT, b, a); + printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)"); + + zeroArray(SIZE, c); + StreamCompaction::RadixSort::sort(NPOT, c, a, NUMBITS); + printDesc("radix-sort, non-power-of-two"); + printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + + zeroArray(SIZE, c); + StreamCompaction::RadixSort::sortShared(NPOT, c, a, NUMBITS); + printDesc("radix-sort (shared memory), non-power-of-two"); + printElapsedTime(StreamCompaction::RadixSort::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)"); + //printArray(NPOT, c, true); + + printCmpLenResult(NPOT, NPOT, b, c); + system("pause"); // stop Win32 console from closing on exit delete[] a; delete[] b; diff --git a/stream_compaction/CMakeLists.txt b/stream_compaction/CMakeLists.txt index 567795b..8656c2f 100644 --- a/stream_compaction/CMakeLists.txt +++ b/stream_compaction/CMakeLists.txt @@ -4,6 +4,7 @@ set(headers "naive.h" "efficient.h" "thrust.h" + "radixSort.h" ) set(sources @@ -12,6 +13,7 @@ set(sources "naive.cu" "efficient.cu" "thrust.cu" + "radixSort.cu" ) list(SORT headers) diff --git a/stream_compaction/common.h b/stream_compaction/common.h index d2c1fed..5a014a7 100644 --- a/stream_compaction/common.h +++ b/stream_compaction/common.h @@ -9,9 +9,12 @@ #include #include #include +#include #define FILENAME (strrchr(__FILE__, '/') ? strrchr(__FILE__, '/') + 1 : __FILE__) #define checkCUDAError(msg) checkCUDAErrorFn(msg, FILENAME, __LINE__) +#define BLOCKSPERGRID(n, blockSize) ((n + blockSize - 1) / blockSize) +constexpr int blockSize = 128; /** * Check for CUDA errors; print and exit if there was a problem. @@ -21,6 +24,7 @@ void checkCUDAErrorFn(const char *msg, const char *file = NULL, int line = -1); inline int ilog2(int x) { int lg = 0; while (x >>= 1) { + if (x <= 0) { throw std::runtime_error("Dead loop while shifting"); } ++lg; } return lg; @@ -32,6 +36,47 @@ inline int ilog2ceil(int x) { namespace StreamCompaction { namespace Common { + class devDataBuffer { + private: + int* dev_data; + int totalSize, size_; + std::vector sizes; + std::vector offsets; + public: + devDataBuffer(int n, int blockSize) :totalSize(0), size_(0) { + int extendedSize = BLOCKSPERGRID(n, blockSize) * blockSize; + while (extendedSize > 1) { + if (extendedSize < blockSize) { + break; + } + size_++; + sizes.push_back(extendedSize); + offsets.push_back(totalSize); + totalSize += extendedSize; + extendedSize = BLOCKSPERGRID(extendedSize, blockSize); + } + cudaMalloc((void**)&dev_data, sizeof(int) * totalSize); + } + ~devDataBuffer() { + cudaFree(dev_data); + } + int* operator[](int i) const { + return dev_data + offsets[i]; + } + int* data() const { + return dev_data; + } + int size() const { + return size_; + } + int memCnt()const { + return totalSize; + } + int sizeAt(int i) const { + return sizes[i]; + } + + }; __global__ void kernMapToBoolean(int n, int *bools, const int *idata); __global__ void kernScatter(int n, int *odata, diff --git a/stream_compaction/cpu.cu b/stream_compaction/cpu.cu index 719fa11..6988efc 100644 --- a/stream_compaction/cpu.cu +++ b/stream_compaction/cpu.cu @@ -17,9 +17,14 @@ namespace StreamCompaction { * For performance analysis, this is supposed to be a simple for loop. * (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int* odata, const int* idata) { timer().startCpuTimer(); - // TODO + // DONE + odata[0] = 0; + for (int i = 1; i < n; i++) + { + odata[i] = odata[i - 1] + idata[i - 1]; + } timer().endCpuTimer(); } @@ -28,11 +33,18 @@ namespace StreamCompaction { * * @returns the number of elements remaining after compaction. */ - int compactWithoutScan(int n, int *odata, const int *idata) { + int compactWithoutScan(int n, int* odata, const int* idata) { timer().startCpuTimer(); - // TODO + // DONE + int oi = 0, ii = 0; + while (ii < n) { + if (idata[ii] != 0) { + odata[oi++] = idata[ii]; + } + ii++; + } timer().endCpuTimer(); - return -1; + return oi; } /** @@ -40,11 +52,32 @@ namespace StreamCompaction { * * @returns the number of elements remaining after compaction. */ - int compactWithScan(int n, int *odata, const int *idata) { + int compactWithScan(int n, int* odata, const int* idata) { + int* notZero = new int[n]; timer().startCpuTimer(); - // TODO + // DONE + notZero[0] = 0; + for (int i = 0; i < n - 1; i++) + { + notZero[i + 1] = notZero[i] + (idata[i] == 0 ? 0 : 1); + } + int num = notZero[n - 1]; + for (int i = 0; i < n; i++) + { + if (idata[i] != 0) { + odata[notZero[i]] = idata[i]; + } + } + timer().endCpuTimer(); + delete[]notZero; + return num; + } + + void sort(int n, int* odata, const int* idata) { + memcpy(odata, idata, sizeof(int) * n); + timer().startCpuTimer(); + std::sort(odata, odata + n); timer().endCpuTimer(); - return -1; } } } diff --git a/stream_compaction/cpu.h b/stream_compaction/cpu.h index 873c047..4261454 100644 --- a/stream_compaction/cpu.h +++ b/stream_compaction/cpu.h @@ -11,5 +11,7 @@ namespace StreamCompaction { int compactWithoutScan(int n, int *odata, const int *idata); int compactWithScan(int n, int *odata, const int *idata); + + void sort(int n, int *odata, const int* idata); } } diff --git a/stream_compaction/efficient.cu b/stream_compaction/efficient.cu index 2db346e..a8eaac7 100644 --- a/stream_compaction/efficient.cu +++ b/stream_compaction/efficient.cu @@ -2,6 +2,8 @@ #include #include "common.h" #include "efficient.h" +#define REDUCE_BANK_CONFLICT 1 +#define CONFLICT_FREE_OFFSET(n) ((n) >> 6) namespace StreamCompaction { namespace Efficient { @@ -12,15 +14,199 @@ namespace StreamCompaction { return timer; } + __device__ __forceinline__ int computeIndex(int idx) { +#if REDUCE_BANK_CONFLICT + return idx + CONFLICT_FREE_OFFSET(idx); +#else + return idx; +#endif + } + + __global__ void kernUpSweep(const int n, int* data, const int stride) { + int index = blockIdx.x * blockDim.x + threadIdx.x + 1; + if (index > n)return; + int real_i = index * stride * 2 - 1; + data[real_i] += data[real_i - stride]; + } + + __global__ void kernDownSweep(const int n, int* data, const int stride) { + int index = blockIdx.x * blockDim.x + threadIdx.x + 1; + if (index > n)return; + int real_i = index * stride * 2 - 1; + int t = data[real_i]; + data[real_i] += data[real_i - stride]; + data[real_i - stride] = t; + } + + __global__ void kernPrescanInplace(int n, int* data, int* blockSum) { + // allocated on invocation + extern __shared__ int temp[]; + extern __shared__ int last; + int thid = threadIdx.x; + int blockOffset = blockIdx.x * blockDim.x; + int index = blockOffset + thid; + if (index > n) + return; + + int offset = 1; + temp[computeIndex(thid)] = data[index]; + if (thid == blockDim.x - 1) + last = temp[computeIndex(blockDim.x - 1)]; + __syncthreads(); + + // build sum in place up the tree + for (int d = blockDim.x >> 1; d > 0; d >>= 1) + { + __syncthreads(); + if (thid < d) + { + int ai = offset * (2 * thid + 1) - 1; + int bi = ai + offset; + temp[computeIndex(bi)] += temp[computeIndex(ai)]; + } + offset *= 2; + } + // clear the last element + if (thid == 0) { temp[computeIndex(blockDim.x - 1)] = 0; } + __syncthreads(); + // traverse down tree & build scan + for (int d = 1; d < blockDim.x; d *= 2) + { + offset >>= 1; + __syncthreads(); + if (thid < d) + { + int ai = offset * (2 * thid + 1) - 1; + int bi = ai + offset; + int t = temp[computeIndex(ai)]; + temp[computeIndex(ai)] = temp[computeIndex(bi)]; + temp[computeIndex(bi)] += t; + } + } + __syncthreads(); + data[index] = temp[computeIndex(thid)]; + if (thid == 0) { + blockSum[blockIdx.x] = last + temp[computeIndex(blockDim.x - 1)]; + } + } + + __global__ void kernAddBlockSum(int* data, const int* blockSum, int n) { + extern __shared__ int sum; + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n)return; + if (threadIdx.x == 0) + sum = blockSum[blockIdx.x]; + __syncthreads(); + data[index] += sum; + } + + void scanInplace(int n, int* dev_data) { + if (n != 1 << ilog2ceil(n)) + throw std::runtime_error("n is not pow of 2"); + int strideMax = n >> 1; + dim3 fullBlocksPerGrid; + for (int i = 1, int n = strideMax; i < strideMax; i <<= 1, n >>= 1) + { + fullBlocksPerGrid = BLOCKSPERGRID(n, blockSize); + kernUpSweep << > > (n, dev_data, i); + } + cudaMemset(&dev_data[n - 1], 0, sizeof(int)); + for (int i = strideMax, int n = 1; i >= 1; i >>= 1, n <<= 1) + { + fullBlocksPerGrid = BLOCKSPERGRID(n, blockSize); + kernDownSweep << > > (n, dev_data, i); + } + } + /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int* odata, const int* idata) { + int dMax = ilog2ceil(n); + int extended_n = 1 << dMax; + int* dev_data; + cudaMalloc((void**)&dev_data, sizeof(int) * extended_n); + cudaMemset(dev_data, 0, sizeof(int) * extended_n); + cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice); timer().startGpuTimer(); - // TODO + // DONE + scanInplace(extended_n, dev_data); timer().endGpuTimer(); + cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost); + cudaFree(dev_data); + } + + void scanSharedNaive_(int n, int* dev_data, int* dev_blockSum, int blockSize) { + dim3 fullBlocksPerGrid = BLOCKSPERGRID(n, blockSize); + kernPrescanInplace << > > (n, dev_data, dev_blockSum); + scanInplace(n / blockSize, dev_blockSum); + kernAddBlockSum << > > (dev_data, dev_blockSum, n); } + void scanSharedNaive(int n, int* odata, const int* idata) { + int dMax = ilog2ceil(n); + int extended_n = 1 << dMax; + if (extended_n < blockSize) { + scan(n, odata, idata); + return; + } + int* dev_data; + int* dev_blockSum; + cudaMalloc((void**)&dev_data, sizeof(int) * extended_n); + cudaMalloc((void**)&dev_blockSum, sizeof(int) * extended_n); + cudaMemset(dev_data, 0, sizeof(int) * extended_n); + cudaMemcpy(dev_data, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + timer().startGpuTimer(); + scanSharedNaive_(extended_n, dev_data, dev_blockSum, blockSize); + timer().endGpuTimer(); + cudaMemcpy(odata, dev_data, sizeof(int) * n, cudaMemcpyDeviceToHost); + cudaFree(dev_data); + cudaFree(dev_blockSum); + } + void scanSharedInplace(int n, Common::devDataBuffer& buffer) { + int i = 0; + for (i = 0; i < buffer.size() - 1; i++) { + kernPrescanInplace << > > (buffer.sizeAt(i), buffer[i], buffer[i + 1]); + } + scanInplace(buffer.sizeAt(i), buffer[i]); + for (int i = buffer.size() - 1; i >= 1; i--) { + kernAddBlockSum << > > (buffer[i - 1], buffer[i], buffer.sizeAt(i - 1)); + } + } + void scanShared(int n, int* odata, const int* idata) { + int dMax = ilog2ceil(n); + int extended_n = 1 << dMax; + if (extended_n < blockSize) { + scan(n, odata, idata); + return; + } + Common::devDataBuffer buffer(extended_n, blockSize); + cudaMemcpy(buffer.data(), idata, sizeof(int) * n, cudaMemcpyHostToDevice); + timer().startGpuTimer(); + int i = 0; + for (i = 0; i < buffer.size() - 1; i++) { + kernPrescanInplace << > > (buffer.sizeAt(i), buffer[i], buffer[i + 1]); + } + scanInplace(buffer.sizeAt(i), buffer[i]); + for (int i = buffer.size() - 1; i >= 1; i--) { + kernAddBlockSum << > > (buffer[i - 1], buffer[i], buffer.sizeAt(i - 1)); + } + timer().endGpuTimer(); + cudaMemcpy(odata, buffer.data(), sizeof(int) * n, cudaMemcpyDeviceToHost); + } + + __global__ void kernMapToBoolean(int n, int* bools, const int* idata) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index > n)return; + if (idata[index] != 0) + bools[index] = 1; + } + __global__ void kernScatter(int n, int* odata, const int* idata, const int* bools, const int* indices) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index > n)return; + if (idata[index] != 0) + odata[indices[index]] = idata[index]; + } /** * Performs stream compaction on idata, storing the result into odata. * All zeroes are discarded. @@ -30,11 +216,28 @@ namespace StreamCompaction { * @param idata The array of elements to compact. * @returns The number of elements remaining after compaction. */ - int compact(int n, int *odata, const int *idata) { + int compact(int n, int* odata, const int* idata) { + int extended_n = 1 << ilog2ceil(n); + int* dev_idata, * dev_odata, * dev_indices, num; + cudaMalloc((void**)&dev_idata, sizeof(int) * extended_n); + cudaMalloc((void**)&dev_odata, sizeof(int) * n); + cudaMalloc((void**)&dev_indices, sizeof(int) * extended_n); + cudaMemcpy(dev_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice); + cudaMemset(dev_odata, 0, sizeof(int) * n); + cudaMemset(dev_indices, 0, sizeof(int) * extended_n); + dim3 fullBlocksPerGrid = BLOCKSPERGRID(n, blockSize); timer().startGpuTimer(); - // TODO + // DONE + kernMapToBoolean << > > (n, dev_indices, dev_idata); + scanInplace(extended_n, dev_indices); + cudaMemcpy(&num, &dev_indices[extended_n - 1], sizeof(int), cudaMemcpyDeviceToHost); + kernScatter << > > (n, dev_odata, dev_idata, dev_idata, dev_indices); timer().endGpuTimer(); - return -1; + cudaMemcpy(odata, dev_odata, sizeof(int) * num, cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_indices); + return num; } } } diff --git a/stream_compaction/efficient.h b/stream_compaction/efficient.h index 803cb4f..3150e5c 100644 --- a/stream_compaction/efficient.h +++ b/stream_compaction/efficient.h @@ -4,10 +4,17 @@ namespace StreamCompaction { namespace Efficient { + StreamCompaction::Common::PerformanceTimer& timer(); - void scan(int n, int *odata, const int *idata); + void scanInplace(int n, int* dev_data); + + void scan(int n, int* odata, const int* idata); + void scanSharedInplace(int n, Common::devDataBuffer& buffer); + + void scanShared(int n, int* odata, const int* idata); + void scanSharedNaive(int n, int* odata, const int* idata); - int compact(int n, int *odata, const int *idata); + int compact(int n, int* odata, const int* idata); } } diff --git a/stream_compaction/naive.cu b/stream_compaction/naive.cu index 4308876..96c4d72 100644 --- a/stream_compaction/naive.cu +++ b/stream_compaction/naive.cu @@ -12,14 +12,36 @@ namespace StreamCompaction { return timer; } // TODO: __global__ - + __global__ void kernNaiveScan(const int n, const int d, int* odata, int* idata) + { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n)return; + if (index >= d) + odata[index] = idata[index - d] + idata[index]; + else + odata[index] = idata[index]; + } /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int* odata, const int* idata) { + int dMax = ilog2ceil(n); + int* dev_idata, * dev_odata; + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + cudaMemcpy(dev_idata+1, idata, (n - 1) * sizeof(int), cudaMemcpyHostToDevice); + dim3 fullBlocksPerGrid = BLOCKSPERGRID(n, blockSize); timer().startGpuTimer(); - // TODO + // DONE + for (int i = 1; i <= dMax; i++) + { + kernNaiveScan << > > (n, 1 << (i - 1), dev_odata, dev_idata); + std::swap(dev_odata, dev_idata); + } timer().endGpuTimer(); + cudaMemcpy(odata, dev_idata, n*sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + cudaFree(dev_odata); } } } diff --git a/stream_compaction/radixSort.cu b/stream_compaction/radixSort.cu new file mode 100644 index 0000000..76107da --- /dev/null +++ b/stream_compaction/radixSort.cu @@ -0,0 +1,96 @@ +#include +#include +#include "efficient.h" +#include "radixSort.h" + +namespace StreamCompaction { + namespace RadixSort { + using StreamCompaction::Common::PerformanceTimer; + PerformanceTimer& timer() + { + static PerformanceTimer timer; + return timer; + } + + __global__ void kernMapToBoolean(int n, int* odata, const int* idata, int mask) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) { + return; + } + odata[index] = (idata[index] & mask) == 0; + } + + __global__ void kernScatter(int n, int* odata, const int* idata, const int* falses, int mask, int totalFalses) { + int index = blockIdx.x * blockDim.x + threadIdx.x; + if (index >= n) { + return; + } + int falsesIndex = falses[index]; + if ((idata[index] & mask) == 0) { + odata[falsesIndex] = idata[index]; + } + else { + odata[index - falsesIndex + totalFalses] = idata[index]; + } + } + + void sort(int n, int* odata, const int* idata, int numBits) { + int extended_n = 1 << ilog2ceil(n); + dim3 fullBlocksPerGrid = BLOCKSPERGRID(n, blockSize); + int* dev_idata; + int* dev_odata; + int* dev_falses; + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + cudaMalloc((void**)&dev_falses, extended_n * sizeof(int)); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + timer().startGpuTimer(); + unsigned maxMask = 1 << numBits; + for (unsigned mask = 1; mask < maxMask; mask <<= 1) + { + kernMapToBoolean << > > (n, dev_falses, dev_idata, mask); + StreamCompaction::Efficient::scanInplace(extended_n, dev_falses); + int totalFalses = 0, tmp_back = 0; + cudaMemcpy(&totalFalses, dev_falses + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&tmp_back, dev_idata + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + totalFalses += (tmp_back & mask) == 0; + kernScatter << > > (n, dev_odata, dev_idata, dev_falses, mask, totalFalses); + std::swap(dev_idata, dev_odata); + } + timer().endGpuTimer(); + cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + cudaFree(dev_odata); + cudaFree(dev_falses); + } + + void sortShared(int n, int* odata, const int* idata, int numBits) { + int extended_n = 1 << ilog2ceil(n); + dim3 fullBlocksPerGrid = BLOCKSPERGRID(n, blockSize); + int* dev_idata; + int* dev_odata; + cudaMalloc((void**)&dev_idata, n * sizeof(int)); + cudaMalloc((void**)&dev_odata, n * sizeof(int)); + Common::devDataBuffer buffer(extended_n, blockSize); + cudaMemcpy(dev_idata, idata, n * sizeof(int), cudaMemcpyHostToDevice); + timer().startGpuTimer(); + unsigned int maxMask = 1 << numBits; + for (unsigned mask = 1; mask < maxMask; mask <<= 1) + { + kernMapToBoolean << > > (n, buffer.data(), dev_idata, mask); + StreamCompaction::Efficient::scanSharedInplace(extended_n, buffer); + int totalFalses = 0, tmp_back = 0; + cudaMemcpy(&totalFalses, buffer.data() + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + cudaMemcpy(&tmp_back, dev_idata + n - 1, sizeof(int), cudaMemcpyDeviceToHost); + totalFalses += (tmp_back & mask) == 0; + kernScatter << > > (n, dev_odata, dev_idata, buffer.data(), mask, totalFalses); + std::swap(dev_idata, dev_odata); + } + timer().endGpuTimer(); + cudaMemcpy(odata, dev_idata, n * sizeof(int), cudaMemcpyDeviceToHost); + cudaFree(dev_idata); + cudaFree(dev_odata); + } + + } +} diff --git a/stream_compaction/radixSort.h b/stream_compaction/radixSort.h new file mode 100644 index 0000000..5a53f38 --- /dev/null +++ b/stream_compaction/radixSort.h @@ -0,0 +1,12 @@ +#pragma once + +#include "common.h" + +namespace StreamCompaction { + namespace RadixSort { + StreamCompaction::Common::PerformanceTimer& timer(); + + void sort(int n, int* odata, const int* idata, int numBits); + void sortShared(int n, int* odata, const int* idata, int numBits); + } +} diff --git a/stream_compaction/thrust.cu b/stream_compaction/thrust.cu index 1def45e..a157a51 100644 --- a/stream_compaction/thrust.cu +++ b/stream_compaction/thrust.cu @@ -17,12 +17,19 @@ namespace StreamCompaction { /** * Performs prefix-sum (aka scan) on idata, storing the result into odata. */ - void scan(int n, int *odata, const int *idata) { + void scan(int n, int* odata, const int* idata) { + thrust::host_vector hst_idata(idata, idata + n); + thrust::device_vector dev_idata = hst_idata; + thrust::device_vector dev_odata(n); + timer().startGpuTimer(); - // TODO use `thrust::exclusive_scan` + // DONE use `thrust::exclusive_scan` // example: for device_vectors dv_in and dv_out: // thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin()); + thrust::exclusive_scan(dev_idata.begin(), dev_idata.end(), dev_odata.begin()); timer().endGpuTimer(); + + thrust::copy(dev_odata.begin(), dev_odata.end(), odata); } } }