Skip to content
37 changes: 31 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,12 +3,37 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
Han Wang

### (TODO: Your README)
Tested on: Windows 11, 11th Gen Intel(R) Core(TM) i9-11900H @ 2.50GHz 22GB, GTX 3070 Laptop GPU

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### Analysis
**Roughly optimize the block sizes of each of your implementations for minimal run time on your GPU.**

**(You shouldn't compare unoptimized implementations to each other!)
Compare all of these GPU Scan implementations (Naive, Work-Efficient, and Thrust) to the serial CPU version of Scan. Plot a graph of the comparison (with array size on the independent axis).**

![Unlock FPS](img/graph1.png)

Based on my observation in my text, there are three phenomenons:
1. The block size seems to not influence the output that much.
2. The naive gpu approach is slower than the efficient approach.
3. Though I didn't plot out the output of the CPU scan, the CPU operation seems to be actually faster than the GPU operation.

The first phenomenon seems not really reasonable. I expect that with more block size, we can improve the run time speed in the parallel operation. But I think I need to keep the truth I just post the real output. The second phenomenon seems reasonable. The book explained the high-speed algorithm. The third phenomenon might be because the GPU approach might spend more time accessing the shared memory compared to the CPU's fast access. For the current data size, the benefit of the parallel operation cannot cover the loss of that.


**Don't mix up CpuTimer and GpuTimer.
To guess at what might be happening inside the Thrust implementation (e.g. allocation, memory copy), take a look at the Nsight timeline for its execution. Your analysis here doesn't have to be detailed, since you aren't even looking at the code for the implementation.
Write a brief explanation of the phenomena you see here.**

The observation is that the thrust implementation is slower than the GPU and CPU approach. Based on my knowledge, I think it is more possible that the thrust implementation might spend large amounts of time on memory I/O operation. I trust that the algorithm of the thrust might be fast and reliable.



**Can you find the performance bottlenecks? Is it memory I/O? Computation? Is it different for each implementation?
Paste the output of the test program into a triple-backtick block in your README.**

Because our time check excludes the init and end memory operation, based on my observation and my hypothesis, I think that the memory I/O computation might be there but didn't actually influence me. At least I didn't observe the performance bottlenecks.

![Unlock FPS](img/output.png)
Binary file added img/graph1.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added img/output.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
24 changes: 24 additions & 0 deletions stream_compaction/common.cu
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,20 @@ namespace StreamCompaction {
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) {
return;
}
if (idata[idx] > 0) {
bools[idx] = 1;
}
else{
bools[idx] = 0;
}




}

/**
Expand All @@ -33,6 +47,16 @@ namespace StreamCompaction {
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) {
return;
}
if (bools[idx] > 0) {
odata[indices[idx]] = idata[idx];
}



}

}
Expand Down
60 changes: 55 additions & 5 deletions stream_compaction/cpu.cu
Original file line number Diff line number Diff line change
Expand Up @@ -18,9 +18,17 @@ namespace StreamCompaction {
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();

// TODO
timer().endCpuTimer();
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i - 1];

}




}

/**
Expand All @@ -31,8 +39,17 @@ namespace StreamCompaction {
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int output = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[output] = idata[i];
output += 1;
}
}


timer().endCpuTimer();
return -1;
return output;
}

/**
Expand All @@ -41,10 +58,43 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();


// TODO

int* checked = new int[n];
int* preCheck = new int[n];
int counter = 0;

timer().startCpuTimer();

for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
preCheck[i] = 1;
counter += 1;
}
else {
preCheck[i] = 0;
}
}

scan(n, checked, preCheck);

for (int i = 0; i < n; i++) {
if (preCheck[i]==1) {
odata[checked[i]] = idata[i];
}

}
timer().endCpuTimer();
return -1;

delete[] checked;
delete[] preCheck;


return counter;


}
}
}
146 changes: 142 additions & 4 deletions stream_compaction/efficient.cu
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,7 @@
#include "common.h"
#include "efficient.h"

#define blockSize 128
namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
Expand All @@ -11,14 +12,96 @@ namespace StreamCompaction {
static PerformanceTimer timer;
return timer;
}
__global__ void upSweep(int n, int base, int* idata) {
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) {
return;
}
int k = idx * (1 << base + 1);
if (k >= n) {
return;
}

idata[k + (1 << base + 1) - 1] += idata[k + (1 << base) - 1];
}
// referemce to book page algorithm 4
__global__ void downSweep(int n, int base, int* idata) {
int idx = (blockIdx.x * blockDim.x) + threadIdx.x;
if (idx >= n) {
return;
}
int k = idx * (1 << base + 1);
if (k >= n) {
return;
}

int t = idata[k + (1 << base) - 1];
idata[k + (1 << base) - 1] = idata[k + (1 << base + 1) - 1];
idata[k + (1 << base + 1) - 1] += t;

}

void processScan(int n, int ending, int* gpu_idata) {


for (int i = 0; i < ilog2ceil(n); i++) {

dim3 fullBlocksPerGrid((ending / (1 << (i + 1)) + blockSize - 1) / blockSize);
upSweep <<<fullBlocksPerGrid, blockSize >>> (n, i, gpu_idata);
}
cudaMemset(&gpu_idata[ending - 1], 0, sizeof(int));
checkCUDAError("error in loop 0");

for (int i = ilog2ceil(n) -1; i>=0; i--) {

dim3 fullBlocksPerGrid((ending / (1 << (i + 1)) + blockSize - 1) / blockSize);
downSweep <<<fullBlocksPerGrid, blockSize >>> (n, i, gpu_idata);


}
checkCUDAError("error in loop 0111");
}

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
timer().startGpuTimer();

// TODO
int* gpu_odataa;
int* gpu_idataa;
int ending = 1 << ilog2ceil(n);

cudaMalloc((void**)&gpu_odataa, ending * sizeof(int));
cudaMalloc((void**)&gpu_idataa, ending * sizeof(int));
checkCUDAError("memory error 0101!!!!!");
cudaMemset(gpu_odataa, 0, ending * sizeof(int));
checkCUDAError("memory error 0102!!!!!");
cudaMemset(gpu_idataa, 0, ending * sizeof(int));
checkCUDAError("memory error 0103!!!!!");
cudaMemcpy(gpu_idataa, idata, sizeof(int) * n, cudaMemcpyHostToDevice);

checkCUDAError("memory error 01!!!!!");


timer().startGpuTimer();
processScan(n, ending, gpu_idataa);
timer().endGpuTimer();


checkCUDAError("error in loop final process!!!!!");
int* temp = gpu_odataa;
gpu_odataa = gpu_idataa;
gpu_idataa = temp;

cudaMemcpy(odata, gpu_odataa, sizeof(int) * n, cudaMemcpyDeviceToHost);

checkCUDAError("memory error 02!!!!!");

cudaFree(gpu_odataa);
cudaFree(gpu_idataa);


}

/**
Expand All @@ -31,10 +114,65 @@ namespace StreamCompaction {
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
timer().startGpuTimer();
//timer().startGpuTimer();



dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
//dim3 numBlocks((n - 1 + blockSize - 1) / blockSize);

int* gpu_odata;
int* gpu_idata;

int ending = 1 << ilog2ceil(n);
int* gpu_bool;
int* gup_sum;


cudaMalloc((void**)&gpu_odata, n * sizeof(int));
cudaMalloc((void**)&gpu_idata, n * sizeof(int));
cudaMalloc((void**)&gpu_bool, n * sizeof(int));
cudaMalloc((void**)&gup_sum, n * sizeof(int));

cudaMemcpy(gpu_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);

checkCUDAError("memory error 01!!!!!");

Common::kernMapToBoolean << <fullBlocksPerGrid, blockSize >> > (n, gpu_bool, gpu_idata);

scan(n, gup_sum, gpu_bool);

checkCUDAError("memory error 02!!!!!");

Common::kernScatter << <fullBlocksPerGrid, blockSize >> > (n, gpu_odata, gpu_idata, gpu_bool, gup_sum);




// TODO
timer().endGpuTimer();
return -1;
//timer().endGpuTimer();

int counter = -1;


cudaMemcpy(odata, gpu_odata, sizeof(int) * n, cudaMemcpyDeviceToHost);

cudaMemcpy(&counter, &gup_sum[n-1], sizeof(int), cudaMemcpyDeviceToHost);
if (idata[n - 1] != 0) {
counter += 1;
}


checkCUDAError("memory error 023!!!!!");
cudaFree(gpu_odata);
cudaFree(gpu_idata);

cudaFree(gpu_bool);
cudaFree(gup_sum);



return counter;
}
}
}
Loading