Commit 1b2f3c7

Merge pull request #1 from deaneeth/dev
Introduces the v2.0.0 release of TinyGPU
2 parents 34d3f92 + b6e8937 commit 1b2f3c7

29 files changed, +925 -100 lines changed

.github/workflows/ci.yml

Lines changed: 3 additions & 3 deletions
@@ -2,17 +2,17 @@ name: 🧪 CI
 
 on:
   push:
-    branches: [ main, master ]
+    branches: [ main, master, dev ]
   pull_request:
-    branches: [ main, master ]
+    branches: [ main, master, dev ]
 
 jobs:
   build:
     runs-on: ubuntu-latest
 
     strategy:
       matrix:
-        python-version: [ "3.11", "3.12" ]
+        python-version: [ "3.11", "3.12", "3.13" ]
 
     steps:
       - name: 🧰 Checkout repository

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -5,3 +5,5 @@ __pycache__/
 *$py.class
 /.pytest_cache
 
+.ruff_cache
+

README.md

Lines changed: 74 additions & 22 deletions
@@ -1,17 +1,38 @@
11
# TinyGPU 🐉⚡
22

3-
[![PyPI version](https://img.shields.io/badge/version-1.0.0-blue.svg)](https://pypi.org/project/tinygpu)
3+
[![PyPI version](https://img.shields.io/badge/version-2.0.0-blue.svg)](https://pypi.org/project/tinygpu)
44
[![Python 3.13](https://img.shields.io/badge/Python-3.13-blue.svg)](https://www.python.org/downloads/)
55
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
66
[![CI](https://github.com/deaneeth/tinygpu/actions/workflows/ci.yml/badge.svg)](https://github.com/deaneeth/tinygpu/actions)
7+
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
8+
[![Tests](https://img.shields.io/github/actions/workflow/status/deaneeth/tinygpu/ci.yml?label=tests)](https://github.com/deaneeth/tinygpu/actions)
79

810
TinyGPU is a **tiny educational GPU simulator** - inspired by [Tiny8](https://github.com/sql-hkr/tiny8), designed to demonstrate how GPUs execute code in parallel. It models a small **SIMT (Single Instruction, Multiple Threads)** system with per-thread registers, global memory, synchronization barriers, branching, and a minimal GPU-like instruction set.
911

1012
> 🎓 *Built for learning and visualization - see how threads, registers, and memory interact across cycles!*
11-
13+
1214
| Odd-Even Sort | Reduction |
1315
|---------------|------------|
14-
| ![Odd-Even Sort](outputs/run_odd_even_sort/run_odd_even_sort_20251025-205516.gif) | ![Reduction](outputs/run_reduce_sum/run_reduce_sum_20251025-210237.gif) |
16+
| ![Odd-Even Sort](src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif) | ![Reduction](src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif) |
17+
18+
---
19+
20+
## 🚀 What's New in v2.0.0
21+
22+
- **Enhanced Instruction Set**:
23+
- Added `SHLD` and `SHST` for robust shared memory operations.
24+
- Improved `SYNC` semantics for better thread coordination.
25+
- **Visualizer Improvements**:
26+
- Export execution as GIFs with enhanced clarity.
27+
- Added support for saving visuals directly from the simulator.
28+
- **Refactored Core**:
29+
- Simplified step semantics for better extensibility.
30+
- Optimized performance for larger thread counts.
31+
- **CI/CD Updates**:
32+
- Integrated linting (`ruff`, `black`) and testing workflows.
33+
- Automated builds and tests on GitHub Actions.
34+
- **Documentation**:
35+
- Expanded examples and added detailed usage instructions.
1536

1637
---
1738

@@ -51,10 +72,11 @@ TinyGPU was built as a **learning-first GPU simulator** - simple enough for begi
5172
> 🧭 TinyGPU aims to make GPU learning *intuitive, visual, and interactive* - from classroom demos to self-guided exploration.
5273
5374
---
75+
5476
## ✨ Highlights
5577

5678
- 🧩 **GPU-like instruction set:**
57-
`SET`, `ADD`, `MUL`, `LD`, `ST`, `JMP`, `BNE`, `BEQ`, `SYNC`, `CSWAP`.
79+
`SET`, `ADD`, `MUL`, `LD`, `ST`, `JMP`, `BNE`, `BEQ`, `SYNC`, `CSWAP`, `SHLD`, `SHST`.
5880
- 🧠 **Per-thread registers & PCs** - each thread executes the same kernel independently.
5981
- 🧱 **Shared global memory** for inter-thread operations.
6082
- 🔄 **Synchronization barriers** (`SYNC`) for parallel coordination.
@@ -69,31 +91,39 @@ TinyGPU was built as a **learning-first GPU simulator** - simple enough for begi
6991

7092
## 🖼️ Example Visuals
7193

72-
> Located in `examples/` - you can generate these GIFs yourself.
94+
> Located in `src/outputs/` - run the example scripts to generate these GIFs (they're saved under `src/outputs/<script_name>/`).
7395
74-
| Odd-Even Sort | Reduction |
75-
|---------------|------------|
76-
| ![Odd-Even Sort](outputs/run_odd_even_sort/run_odd_even_sort_20251025-205516.gif) | ![Reduction](outputs/run_reduce_sum/run_reduce_sum_20251025-210237.gif) |
96+
| Example | Description | GIF Preview |
97+
|---------|-------------|-------------|
98+
| Vector Add | Parallel vector addition (A+B -> C) | ![Vector Add](src/outputs/run_vector_add/run_vector_add_20251026-212734.gif) |
99+
| Block Shared Sum | Per-block shared memory sum example | ![Block Shared Sum](src/outputs/run_block_shared_sum/run_block_shared_sum_20251026-212542.gif) |
100+
| Odd-Even Sort | GPU-style odd-even transposition sort | ![Odd-Even Sort](src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif) |
101+
| Parallel Reduction | Sum reduction across an array | ![Reduction](src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif) |
102+
| Sync Test | Synchronization / barrier demonstration | ![Sync Test](src/outputs/run_sync_test/run_sync_test_20251027-000818.gif) |
103+
| Loop Test | Branching and loop behavior demo | ![Test Loop](src/outputs/run_test_loop/run_test_loop_20251026-212814.gif) |
104+
| Compare Test | Comparison and branching example | ![Test CMP](src/outputs/run_test_cmp/run_test_cmp_20251026-212823.gif) |
105+
| Kernel Args Test | Demonstrates passing kernel arguments | ![Kernel Args](src/outputs/run_test_kernel_args/run_test_kernel_args_20251026-212830.gif) |
77106

78107
---
79108

80109
## 🚀 Quickstart
81110

82111
### Clone and install
112+
83113
```bash
84114
git clone https://github.com/deaneeth/tinygpu.git
85115
cd tinygpu
86116
pip install -e .
87117
pip install -r requirements-dev.txt
88-
````
118+
```
89119

90120
### Run an example
91121

92122
```bash
93123
python -m examples.run_odd_even_sort
94124
```
95125

96-
> Produces: `examples/odd_even_sort.gif` — a visual GPU-style sorting process.
126+
> Produces: `src/outputs/run_odd_even_sort/run_odd_even_sort_*.gif` — a visual GPU-style sorting process.
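Note on the line above: the GIF name ends in a timestamp, so each run produces a new file. A minimal sketch of picking the most recent one follows. The helper `latest_gif` is hypothetical and not part of TinyGPU; it assumes the `YYYYMMDD-HHMMSS` suffix seen in the file names of this commit, which sorts chronologically when sorted as text.

```python
from pathlib import Path


def latest_gif(outputs_root, script_name):
    """Return the newest GIF for a runner script, or None if none exist.

    Hypothetical helper: assumes files are named
    <script_name>_<YYYYMMDD-HHMMSS>.gif under <outputs_root>/<script_name>/.
    """
    run_dir = Path(outputs_root) / script_name
    # Fixed-width timestamps sort lexicographically in chronological order.
    gifs = sorted(run_dir.glob(f"{script_name}_*.gif"))
    return gifs[-1] if gifs else None


print(latest_gif("src/outputs", "run_odd_even_sort"))
```

Sorting by name rather than by modification time keeps the result stable after a fresh clone, where file timestamps on disk no longer reflect when the GIFs were generated.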
97127
98128
### Other examples
99129

@@ -108,30 +138,50 @@ python -m examples.run_sync_test
108138

109139
## 🧩 Project Layout
110140

111-
```
112-
tinygpu/
141+
```text
142+
.
143+
├─ .github/
144+
│ └─ workflows/
145+
│ └─ ci.yml
146+
├─ docs/
147+
│ └─ index.md
113148
├─ examples/
114-
│ ├─ vector_add.tgpu
149+
│ ├─ odd_even_sort_tmp.tgpu
115150
│ ├─ odd_even_sort.tgpu
116151
│ ├─ reduce_sum.tgpu
117-
│ ├─ run_vector_add.py
118152
│ ├─ run_odd_even_sort.py
119153
│ ├─ run_reduce_sum.py
154+
│ ├─ run_sync_test.py
120155
│ ├─ run_test_loop.py
121-
│ └─ run_sync_test.py
122-
156+
│ ├─ run_vector_add.py
157+
│ ├─ sync_test.tgpu
158+
│ ├─ test_loop.tgpu
159+
│ └─ vector_add.tgpu
160+
├─ src/outputs/
161+
│ ├─ run_block_shared_sum/
162+
│ ├─ run_odd_even_sort/
163+
│ ├─ run_reduce_sum/
164+
│ ├─ run_sync_test/
165+
│ ├─ run_test_cmp/
166+
│ ├─ run_test_kernel_args/
167+
│ ├─ run_test_loop/
168+
│ └─ run_vector_add/
123169
├─ src/
124170
│ └─ tinygpu/
171+
│ ├─ __init__.py
125172
│ ├─ assembler.py
126173
│ ├─ gpu.py
127174
│ ├─ instructions.py
128-
│ ├─ visualizer.py
129-
│ └─ __init__.py
130-
175+
│ └─ visualizer.py
131176
├─ tests/
177+
│ ├─ test_assembler.py
178+
│ ├─ test_gpu_core.py
179+
│ ├─ test_gpu.py
180+
│ └─ test_programs.py
181+
├─ LICENSE
132182
├─ pyproject.toml
133-
├─ requirements-dev.txt
134-
└─ README.md
183+
├─ README.md
184+
└─ requirements-dev.txt
135185
```
136186

137187
---
@@ -156,6 +206,8 @@ TinyGPU uses a **minimal instruction set** designed for clarity and education -
156206
| `BNE Ra, Rb, target` | Branch if not equal. | Jump to `target` if `Ra != Rb`. |
157207
| `SYNC` | *(no operands)* | Synchronization barrier — all threads must reach this point before continuing. |
158208
| `CSWAP addrA, addrB` | Compare-and-swap memory values. | If `mem[addrA] > mem[addrB]`, swap them. Used for sorting. |
209+
| `SHLD Rd, saddr` | Load shared memory into register. | `Rd = shared_mem[saddr]` |
210+
| `SHST addr, Rs` | Store register into shared memory. | `shared_mem[addr] = Rs` |
159211
| `CMP Rd, Ra, Rb` *(optional)* | Compare and set flag or register. | Used internally for extended examples (e.g., prefix-scan). |
160212
| `NOP` *(optional)* | *(no operands)* | No operation; placeholder instruction. |
161213

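Note on the instruction table above: `CSWAP` is described as the building block for sorting. A minimal Python sketch of how repeated compare-and-swap phases yield an odd-even transposition sort follows. It models only the semantics stated in the table (`if mem[addrA] > mem[addrB], swap them`); the function names are illustrative and are not TinyGPU APIs.

```python
def cswap(mem, a, b):
    """Model of CSWAP: swap mem[a] and mem[b] if mem[a] > mem[b]."""
    if mem[a] > mem[b]:
        mem[a], mem[b] = mem[b], mem[a]


def odd_even_sort(mem):
    """Sort mem in place using alternating even and odd CSWAP phases."""
    n = len(mem)
    for phase in range(n):
        # Even phases compare pairs (0,1), (2,3), ...;
        # odd phases compare pairs (1,2), (3,4), ...
        start = phase % 2
        # Each loop iteration stands in for one thread. The pairs within a
        # phase are disjoint, so their relative order does not matter.
        for i in range(start, n - 1, 2):
            cswap(mem, i, i + 1)
        # On the simulator, a SYNC barrier would separate the phases here.
    return mem


print(odd_even_sort([5, 1, 4, 2, 3]))
```

Running `n` phases is enough to sort `n` elements, which is why the outer loop is bounded by the array length.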
@@ -267,7 +319,7 @@ MIT - see [LICENSE](LICENSE)
267319

268320
## 🌟 Credits & Inspiration
269321

270-
❤️ Built by [Deaneeth](https://github.com/deaneeth)
322+
❤️ Built by [Deaneeth](https://github.com/deaneeth)
271323

272324
> Inspired by the educational design of [Tiny8 CPU Simulator](https://github.com/sql-hkr/tiny8).
273325

docs/index.md

Lines changed: 123 additions & 0 deletions
@@ -0,0 +1,123 @@
1+
# TinyGPU 🐉⚡ — v2.0.0
2+
3+
[![Release v2.0.0](https://img.shields.io/badge/release-v2.0.0-blue.svg)](https://github.com/deaneeth/tinygpu/releases/tag/v2.0.0)
4+
[![Python 3.13](https://img.shields.io/badge/Python-3.13-blue.svg)](https://www.python.org/downloads/)
5+
[![License: MIT](https://img.shields.io/badge/license-MIT-green.svg)](LICENSE)
6+
[![CI](https://github.com/deaneeth/tinygpu/actions/workflows/ci.yml/badge.svg)](https://github.com/deaneeth/tinygpu/actions)
7+
[![Tests](https://img.shields.io/github/actions/workflow/status/deaneeth/tinygpu/ci.yml?label=tests)](https://github.com/deaneeth/tinygpu/actions)
8+
[![Code Style: Black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
9+
10+
TinyGPU is a **tiny educational GPU simulator** — a minimal SIMT-style simulator with:
11+
12+
- Per-thread registers & program counters
13+
- Shared global memory and per-block shared memory
14+
- A small GPU-style ISA and assembler
15+
- Visualizer and GIF export for educational animations
16+
17+
> 🎓 *Built for learning and visualization - see how threads, registers, and memory interact across cycles!*
18+
19+
---
20+
21+
## 🚀 What's New in v2.0.0
22+
23+
- **Enhanced Instruction Set**:
24+
- Added `SHLD` and `SHST` for robust shared memory operations.
25+
- Improved `SYNC` semantics for better thread coordination.
26+
- **Visualizer Improvements**:
27+
- Export execution as GIFs with enhanced clarity.
28+
- Added support for saving visuals directly from the simulator.
29+
- **Refactored Core**:
30+
- Simplified step semantics for better extensibility.
31+
- Optimized performance for larger thread counts.
32+
- **CI/CD Updates**:
33+
- Integrated linting (`ruff`, `black`) and testing workflows.
34+
- Automated builds and tests on GitHub Actions.
35+
- **Documentation**:
36+
- Expanded examples and added detailed usage instructions.
37+
38+
---
39+
40+
## Quick Screenshots / Demos
41+
42+
### Odd–Even Transposition Sort
43+
44+
![Odd-Even Sort](../src/outputs/run_odd_even_sort/run_odd_even_sort_20251026-212558.gif)
45+
46+
### Parallel Reduction (Sum)
47+
48+
![Reduce Sum](../src/outputs/run_reduce_sum/run_reduce_sum_20251026-212712.gif)
49+
50+
---
51+
52+
## Getting Started
53+
54+
Clone and install (editable):
55+
56+
```bash
57+
git clone https://github.com/deaneeth/tinygpu.git
58+
cd tinygpu
59+
pip install -e .
60+
pip install -r requirements-dev.txt
61+
```
62+
63+
Run a demo (odd-even sort):
64+
65+
```bash
66+
python -m examples.run_odd_even_sort
67+
```
68+
69+
> Produces: `src/outputs/run_odd_even_sort/run_odd_even_sort_*.gif` - a visual GPU-style sorting process.
70+
71+
---
72+
73+
## Examples & Runners
74+
75+
- `examples/run_vector_add.py` — simple parallel vector add
76+
- `examples/run_vector_add_kernel.py` — vector add with kernel arguments
77+
- `examples/run_test_loop.py` — branch/loop test (sum 1..4)
78+
- `examples/run_test_cmp.py` — comparison and branching test
79+
- `examples/run_test_kernel_args.py` — kernel arguments test
80+
- `examples/run_odd_even_sort.py` — odd-even transposition sort (GIF)
81+
- `examples/run_reduce_sum.py` — parallel reduction (GIF)
82+
- `examples/run_block_shared_sum.py` — per-block shared memory example
83+
- `examples/run_sync_test.py` — synchronization test
84+
- `examples/debug_repl.py` — interactive REPL debugger
85+
86+
---
87+
88+
## Instruction Set (Quick Reference)
89+
90+
| **Instruction** | **Operands** | **Description** |
91+
|-----------------------------|------------------------------------------|-----------------|
92+
| `SET Rd, imm` | `Rd` = destination register, `imm` = immediate value | Set register `Rd` to an immediate constant. |
93+
| `ADD Rd, Ra, Rb` | `Rd` = destination, `Ra` + `Rb` | Add two registers and store result in `Rd`. |
94+
| `ADD Rd, Ra, imm` | `Rd` = destination, `Ra` + immediate | Add register and immediate value. |
95+
| `MUL Rd, Ra, Rb` | Multiply two registers. | `Rd = Ra * Rb` |
96+
| `MUL Rd, Ra, imm` | Multiply register by immediate. | `Rd = Ra * imm` |
97+
| `LD Rd, addr` | Load from memory address into register. | `Rd = mem[addr]` |
98+
| `LD Rd, Rk` | Load from address in register `Rk`. | `Rd = mem[Rk]` |
99+
| `ST addr, Rs` | Store register into memory address. | `mem[addr] = Rs` |
100+
| `ST Rk, Rs` | Store value from `Rs` into memory at address in register `Rk`. | `mem[Rk] = Rs` |
101+
| `SHLD Rd, saddr` | Load from shared memory into register. | `Rd = shared_mem[saddr]` |
102+
| `SHST saddr, Rs` | Store register into shared memory. | `shared_mem[saddr] = Rs` |
103+
| `CSWAP addrA, addrB` | Compare-and-swap memory values. | If `mem[addrA] > mem[addrB]`, swap them. Used for sorting. |
104+
| `CMP Ra, Rb` | Compare and set flags. | Set Z/N/G flags based on `Ra - Rb`. |
105+
| `BRGT target` | Branch if greater. | Jump to `target` if G flag set. |
106+
| `BRLT target` | Branch if less. | Jump to `target` if N flag set. |
107+
| `BRZ target` | Branch if zero. | Jump to `target` if Z flag set. |
108+
| `JMP target` | Label or immediate. | Unconditional jump — sets PC to `target`. |
109+
| `SYNC` | *(no operands)* | Global synchronization barrier — all threads must reach this point. |
110+
| `SYNCB` | *(no operands)* | Block-level synchronization barrier. |
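
Note on the quick reference above: `CMP` sets flags and `BRGT`, `BRLT`, and `BRZ` test them. A minimal Python sketch of that flag model follows. The table only says the flags are "based on `Ra - Rb`", so the exact definitions used here (Z when the difference is zero, N when it is negative, G when it is positive) are an assumption, and the function names are illustrative.

```python
def cmp_flags(ra, rb):
    """Model of CMP Ra, Rb: derive Z/N/G flags from Ra - Rb (assumed rules)."""
    diff = ra - rb
    return {"Z": diff == 0, "N": diff < 0, "G": diff > 0}


def branch_taken(op, flags):
    """Return True if the given flag-based branch would jump."""
    tested_flag = {"BRGT": "G", "BRLT": "N", "BRZ": "Z"}[op]
    return flags[tested_flag]


flags = cmp_flags(2, 4)
print(flags, branch_taken("BRLT", flags))
```

Under this model exactly one flag is set after every `CMP`, so a loop such as `CMP R2, 4` followed by `BRLT sum_loop` keeps iterating while `R2` is below 4.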
111+
112+
---
113+
114+
## Publishing & Contributing
115+
116+
- See `.github/workflows/ci.yml` for CI and packaging
117+
- To propose changes, open a PR. For bug reports, open an issue.
118+
119+
---
120+
121+
## License
122+
123+
MIT — See [LICENSE](../LICENSE).

examples/block_shared_sum.tgpu

Lines changed: 37 additions & 0 deletions
@@ -0,0 +1,37 @@
+; block_shared_sum.tgpu
+; R5 = block_id, R6 = thread_in_block, R7 = tid
+; R0 -> temp
+; R1 -> base (global base index for each block is block_id * block_stride)
+; We'll assume runner sets up base_addr per block in memory (or use a simple scheme)
+
+; Each thread loads its input and stores it into shared[thread_in_block]
+; Then threads synchronize at block barrier and thread 0 sums the shared
+; values and writes the block sum to memory at address (100 + block_id).
+
+; Load own value from memory[tid] (R7 contains tid)
+LD R3, R7          ; R3 = memory[tid]
+SHST R6, R3        ; shared[thread_in_block] = R3
+SYNCB              ; wait for block
+
+; Only thread with thread_in_block == 0 performs the reduction
+CMP R6, 0
+BRGT not_zero      ; if R6 > 0 jump to not_zero (i.e., only R6==0 continues)
+
+SET R4, 0          ; R4 = sum
+SET R2, 0          ; R2 = loop index
+sum_loop:
+SHLD R0, R2        ; R0 = shared[R2]
+ADD R4, R4, R0     ; R4 += R0
+ADD R2, R2, 1
+CMP R2, 4          ; compare with TPB (4)
+BRLT sum_loop
+
+; write sum to memory at 100 + block_id (R5 holds block_id)
+SET R1, 100
+ADD R1, R1, R5
+ST R1, R4
+
+JMP done_block
+not_zero:
+done_block:
+; end
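
Note on the kernel above: a plain Python reference model can help check what the kernel is expected to write. The sketch below is sequential and is not the TinyGPU runner. It assumes `tid = block_id * TPB + thread_in_block`, a block size of 4 (the bound used in `CMP R2, 4`), and a flat memory list large enough to hold address `100 + block_id`.

```python
TPB = 4  # threads per block, matching the `CMP R2, 4` bound in the kernel


def block_shared_sum(memory, num_blocks, tpb=TPB):
    """Sequential model of block_shared_sum.tgpu (assumed thread layout)."""
    for block_id in range(num_blocks):
        shared = [0] * tpb
        # Phase 1 (LD + SHST): every thread copies memory[tid] into
        # shared[thread_in_block].
        for thread_in_block in range(tpb):
            tid = block_id * tpb + thread_in_block
            shared[thread_in_block] = memory[tid]
        # SYNCB: all stores in the block are visible before the sum starts.
        # Phase 2 (sum_loop): only thread 0 of the block accumulates.
        total = 0
        for i in range(tpb):
            total += shared[i]
        # ST R1, R4 with R1 = 100 + block_id.
        memory[100 + block_id] = total
    return memory


memory = [0] * 128
memory[:8] = [1, 2, 3, 4, 5, 6, 7, 8]
block_shared_sum(memory, num_blocks=2)
print(memory[100], memory[101])  # prints: 10 26
```

Comparing these values with the simulator's memory after a run is a quick way to confirm that the barrier and the shared-memory addressing behave as intended.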
