6 changes: 6 additions & 0 deletions .clang-format
@@ -0,0 +1,6 @@
---
Language: Cpp
IndentWidth: 4
TabWidth: 4
UseTab: Never
ColumnLimit: 100
55 changes: 55 additions & 0 deletions .clangd
@@ -0,0 +1,55 @@
# Apply this config conditionally to all C files
If:
PathMatch: .*\.(c|h)$
CompileFlags:
Compiler: /usr/bin/gcc

---

# Apply this config conditionally to all C++ files
If:
PathMatch: .*\.(c|h)pp
CompileFlags:
Compiler: /usr/bin/g++

---

# Apply this config conditionally to all CUDA files
If:
PathMatch: .*\.cuh?
CompileFlags:
Compiler: /usr/local/cuda/bin/nvcc

---

# Tweak the clangd parse settings for all files
CompileFlags:
Add:
# report all errors
- "-ferror-limit=0"
- "-I/usr/local/cuda/include/cccl"
Remove:
# strip CUDA fatbin args
- "-Xfatbin*"
# strip CUDA arch flags
- "-gencode*"
- "--generate-code*"
# strip CUDA flags unknown to clang
- "-ccbin*"
- "--compiler-options*"
- "--expt-extended-lambda"
- "--expt-relaxed-constexpr"
- "-forward-unknown-to-host-compiler"
- "-Werror=cross-execution-space-call"
- "-arch=native"
- "--options-file"
- "-G"

Hover:
ShowAKA: No
InlayHints:
Enabled: No
Diagnostics:
Suppress:
- "variadic_device_fn"
- "attributes_not_allowed"
11 changes: 10 additions & 1 deletion .gitignore
@@ -25,7 +25,8 @@ build
.LSOverride

# Icon must end with two \r
Icon
Icon


# Thumbnails
._*
@@ -256,6 +257,11 @@ bld/
# Uncomment if you have tasks that create the project's static files in wwwroot
#wwwroot/

.vscode/*
!.vscode/launch.json
!.vscode/extensions.json
!.vscode/settings.json

# MSTest test Results
[Tt]est[Rr]esult*/
[Bb]uild[Ll]og.*
@@ -269,6 +275,9 @@ TestResult.xml
[Rr]eleasePS/
dlldata.c

# Clangd cache
.cache/clangd

# DNX
project.lock.json
artifacts/
25 changes: 25 additions & 0 deletions .vscode/launch.json
@@ -0,0 +1,25 @@
{
"$schema": "vscode://schemas/launch",
// Use IntelliSense to learn about possible attributes.
// Hover to view descriptions of existing attributes.
// For more information, visit: https://go.microsoft.com/fwlink/?linkid=830387
"version": "0.2.0",
"configurations": [
{
"name": "CUDA C++: Launch",
"type": "cuda-gdb",
"request": "launch",
"environment": [
{"name": "WAYLAND_DISPLAY", "value": ""},
{"name": "XDG_SESSION_TYPE", "value": "x11"}
],
"program": "${workspaceFolder}/build/bin/cis5650_stream_compaction_test",
"cwd": "${workspaceFolder}"
},
{
"name": "CUDA C++: Attach",
"type": "cuda-gdb",
"request": "attach"
}
]
}
15 changes: 15 additions & 0 deletions .vscode/settings.json
@@ -0,0 +1,15 @@
{
"files.associations": {
"*.cu": "cuda-cpp"
},
"[cpp]": {
"editor.defaultFormatter": "llvm-vs-code-extensions.vscode-clangd"
},
"[cuda-cpp]": {
"editor.defaultFormatter": "llvm-vs-code-extensions.vscode-clangd"
},
"[python]": {
"editor.defaultFormatter": "charliermarsh.ruff",
"editor.tabSize": 4,
},
}
150 changes: 143 additions & 7 deletions README.md
@@ -1,14 +1,150 @@
CUDA Stream Compaction
======================

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**
![](img/graphs/scan_performance_nonpow2.png)

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
Q: Why is Thrust so slow here??? A: Because I forgot to build the binaries in release mode.

### (TODO: Your README)
**University of Pennsylvania, CIS 5650: GPU Programming and Architecture, Project 2**

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
* Thomas Shaw
* [LinkedIn](https://www.linkedin.com/in/thomas-shaw-54468b222), [personal website](https://tlshaw.me), [GitHub](https://github.com/printer83mph), etc.
* Tested on: Fedora 42, Ryzen 7 5700x @ 4.67GHz, 32GB, RTX 2070 8GB


## Features

- CUDA implementations of exclusive scan and stream compaction!
- The work-efficient algorithm speeds up scan even further!
  - It runs segmented blocks in-place concurrently, with no risk of race conditions, on arbitrarily sized input... and it can be adapted to use shared memory!
- Faster scan than the CPU!!
- GPU radix sort implementation
  - (tested against a CPU version in `main`)
- Python [analysis module](./analysis/README.md) for spitting out tidy CSV files of performance measurements
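The work-efficient scan follows the Blelloch upsweep/downsweep pattern. Below is a minimal CPU sketch of the idea, under the assumption that the input length is a power of two (the GPU version pads non-power-of-two inputs with the identity); the actual implementation launches each pass as a CUDA kernel, and the helper name here is illustrative:

```cpp
#include <cassert>
#include <vector>

// Blelloch-style exclusive scan: the upsweep builds a binary tree of
// partial sums in place, then the downsweep distributes prefix sums
// back down the tree. Assumes data.size() is a power of two.
std::vector<int> exclusiveScan(std::vector<int> data) {
    int n = static_cast<int>(data.size());
    // Upsweep (reduce): each pass combines pairs at increasing strides.
    for (int stride = 1; stride < n; stride *= 2)
        for (int i = 2 * stride - 1; i < n; i += 2 * stride)
            data[i] += data[i - stride];
    data[n - 1] = 0; // plant the identity at the root
    // Downsweep: push partial sums down, swapping left child with parent.
    for (int stride = n / 2; stride >= 1; stride /= 2)
        for (int i = 2 * stride - 1; i < n; i += 2 * stride) {
            int left = data[i - stride];
            data[i - stride] = data[i];
            data[i] += left;
        }
    return data;
}
```

Both loops touch O(n) elements in total, which is the source of the "work-efficient" name.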


## Performance Analysis

Python scripts have been created in `analysis/` for easier stat collection. The [README](./analysis/README.md) within provides info on how to run these.

### Block Size Optimizations

Let's look at how performance varies with block size, all normalized:

![](img/graphs/block_sizes_all_normalized.png)

Performance fluctuates differently for each algorithm across block sizes from 64 to 1024.

The plots below show each algorithm independently:
| | | |
|---|---|---|
| ![](img/graphs/block_sizes_scan_naive.png) Naive Scan | ![](img/graphs/block_sizes_scan_work_efficient.png) Work Efficient Scan | ![](img/graphs/block_sizes_scan_thrust.png) Thrust Scan |
| | ![](img/graphs/block_sizes_compact.png) Stream Compaction (Work-Efficient Scan) | ![](img/graphs/block_sizes_radix.png) Radix |

It seems that the optimal block size for this machine is somewhere between 128 and 512, but that really depends on the algorithm.
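For context, the block size also fixes the grid size of each kernel launch via a ceiling division, so larger blocks mean fewer (but heavier) blocks per launch. A small sketch of that computation (`fullBlocksPerGrid` is an illustrative helper name, not necessarily the one used in this project):

```cpp
#include <cassert>

// Number of blocks needed so that gridDim * blockSize covers n elements.
int fullBlocksPerGrid(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```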

### Scan Implementation Comparisons

The plots below show how performance changes with array size, both for power-of-two sizes and for non-power-of-two sizes.

![](img/graphs/scan_performance_nonpow2.png)

![](img/graphs/scan_performance_pow2.png)

Our Work-Efficient GPU Scan outperforms the Naive one at almost all array sizes. This is to be expected.
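The gap is expected: the naive (Hillis–Steele) scan performs O(n log n) additions, while the work-efficient scan performs only O(n). A minimal CPU sketch of the naive approach, double-buffered the way the CUDA version is across kernel launches (one pass per launch); the helper name is illustrative:

```cpp
#include <cassert>
#include <utility>
#include <vector>

// Hillis-Steele scan: every pass adds in the element `stride` positions
// back, doubling the stride each time; the inclusive result is shifted
// right at the end to produce an exclusive scan.
std::vector<int> naiveExclusiveScan(const std::vector<int>& in) {
    int n = static_cast<int>(in.size());
    std::vector<int> a = in, b(n);
    for (int stride = 1; stride < n; stride *= 2) {
        for (int i = 0; i < n; ++i)
            b[i] = (i >= stride) ? a[i] + a[i - stride] : a[i];
        std::swap(a, b); // double-buffer, as the GPU version must
    }
    std::vector<int> out(n);
    out[0] = 0; // identity
    for (int i = 1; i < n; ++i) out[i] = a[i - 1];
    return out;
}
```

Each of the log n passes touches all n elements, which is where the extra work comes from.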

The CPU eventually falls behind the work-efficient solution.

It seems that Thrust is generally much faster than our solutions. What black magic are they working?

### Nsight Analysis

![](img/graphs/nsight_work_efficient.png)

Above is the timeline view of our work-efficient scan in Nsight Systems.

![](img/graphs/nsight_thrust.png)

And above is Thrust's.

Thrust appears to perform the entire operation inside a single kernel. It likely exploits shared-memory access patterns, or otherwise reorders its memory accesses and computation to cut down on global memory traffic.
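Stream compaction itself is a map–scan–scatter pipeline; each step below corresponds to a kernel launch in the GPU version. A CPU sketch of the three steps (helper name illustrative):

```cpp
#include <cassert>
#include <vector>

// Compact: keep the non-zero elements, preserving order.
std::vector<int> compact(const std::vector<int>& in) {
    int n = static_cast<int>(in.size());
    // 1. Map each element to a 0/1 "keep" flag.
    std::vector<int> flags(n);
    for (int i = 0; i < n; ++i) flags[i] = (in[i] != 0) ? 1 : 0;
    // 2. Exclusive scan of the flags: each kept element's output index.
    std::vector<int> idx(n);
    int kept = 0;
    for (int i = 0; i < n; ++i) { idx[i] = kept; kept += flags[i]; }
    // 3. Scatter kept elements into their computed slots.
    std::vector<int> out(kept);
    for (int i = 0; i < n; ++i)
        if (flags[i]) out[idx[i]] = in[i];
    return out;
}
```

On the GPU, step 2 is exactly the exclusive scan benchmarked above, which is why compaction performance tracks scan performance so closely.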

### Test output at n = 2^27

```
****************
** SCAN TESTS **
****************
[ 45 9 27 23 1 22 41 22 38 25 35 19 49 ... 35 0 ]
==== cpu scan, power-of-two ====
elapsed time: 195.421ms (std::chrono Measured)
[ 0 45 54 81 104 105 127 168 190 228 253 288 307 ... -1006580612 -1006580577 ]
==== cpu scan, non-power-of-two ====
elapsed time: 195.83ms (std::chrono Measured)
[ 0 45 54 81 104 105 127 168 190 228 253 288 307 ... -1006580641 -1006580637 ]
passed
==== naive scan, power-of-two ====
elapsed time: 178.568ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 160.998ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 151.997ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 155.299ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 1000.16ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 999.244ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 2 3 3 0 3 3 2 0 2 0 0 0 ... 3 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 342.001ms (std::chrono Measured)
[ 1 2 3 3 3 3 2 2 3 1 3 2 3 ... 2 3 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 342.025ms (std::chrono Measured)
[ 1 2 3 3 3 3 2 2 3 1 3 2 3 ... 1 1 ]
passed
==== cpu compact with scan ====
elapsed time: 1084.35ms (std::chrono Measured)
[ 1 2 3 3 3 3 2 2 3 1 3 2 3 ... 2 3 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 218.355ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 170.299ms (CUDA Measured)
passed

*****************************
** RADIX SORT TESTS **
*****************************
[ 26 47 38 48 23 28 48 10 6 9 30 37 21 ... 17 0 ]
==== cpu sort, power-of-two ====
elapsed time: 17055.4ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== cpu sort, non-power-of-two ====
elapsed time: 17048.5ms (std::chrono Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
==== gpu radix sort, power-of-two ====
elapsed time: 5650.03ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
passed
==== gpu radix sort, non-power-of-two ====
elapsed time: 5632.99ms (CUDA Measured)
[ 0 0 0 0 0 0 0 0 0 0 0 0 0 ... 49 49 ]
passed
```

### Radix Sort GPU implementation

As seen in the output above, the GPU radix sort implementation is significantly faster than the CPU implementation at larger values of n. For further research, it should be compared against quicksort or another well-established CPU sorting algorithm.
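The radix sort builds on the same scan primitive: each LSD pass stably partitions the keys by one bit, using an exclusive scan over the "bit is 0" flags to compute destinations. A CPU sketch of this structure (assumes non-negative keys; helper names are illustrative):

```cpp
#include <cassert>
#include <utility>
#include <vector>

// One LSD radix pass: stable split on the given bit. The exclusive scan
// of isZero gives each zero-bit element its slot; one-bit elements go
// after all zeros, in their original relative order.
static void radixPass(std::vector<int>& a, int bit) {
    int n = static_cast<int>(a.size());
    std::vector<int> isZero(n), idx(n);
    for (int i = 0; i < n; ++i) isZero[i] = !((a[i] >> bit) & 1);
    int zeros = 0;
    for (int i = 0; i < n; ++i) { idx[i] = zeros; zeros += isZero[i]; }
    std::vector<int> out(n);
    int onesSeen = 0;
    for (int i = 0; i < n; ++i)
        if (isZero[i]) out[idx[i]] = a[i];
        else out[zeros + onesSeen++] = a[i];
    a = std::move(out);
}

// Full sort: one stable pass per bit, least significant first.
std::vector<int> radixSort(std::vector<int> a) {
    for (int bit = 0; bit < 31; ++bit) radixPass(a, bit);
    return a;
}
```

On the GPU, the per-bit flag scan is the scan kernel again, so radix sort amortizes the same primitive across 31 passes.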