midiareshadi/dataflow-bench

dataflow-bench

GPU implementation and benchmarking of dense dataflow strategies and sparse SpMSpM algorithms for LLM-scale matrix workloads on NVIDIA T4 and AMD MI300x.

Motivated by dataflow analysis in [LOCAL] and sparse tensor computation in [Maple], this repo implements these ideas directly on GPU hardware using Triton — bridging the gap between algorithmic analysis and real GPU performance.


Key Findings

Dense Matmul — Dataflow Strategy

78 benchmark data points across 13 LLaMA-7B shapes × 3 dataflows × 2 GPUs:

  • IS wins on MI300x — all 13/13 shapes. HBM3 bandwidth absorbs dataflow differences.
  • WS wins on T4 — 11/13 shapes. Bandwidth constraints make weight reuse critical.
  • OS is worst on both — 3.3x penalty on T4, only 1.26x on MI300x.
  • Dataflow sensitivity: T4 shows 3x spread between best/worst. MI300x only 1.25x.

→ On MI300x, dataflow choice barely matters. On T4, it matters significantly.
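
The three dataflows differ in which operand is held fixed while the other two stream past it. The sketch below is a minimal pure-Python illustration, assuming the usual definitions of weight-stationary (WS), input-stationary (IS), and output-stationary (OS) for `C = A @ B`, with `A` as the input and `B` as the weight. It is not the repo's Triton code; the function names are hypothetical and the loops work on single elements rather than tiles.

```python
def matmul_os(A, B):
    """Output-stationary: each C[i][j] stays in an accumulator until finished."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0.0                      # the stationary operand: one output
            for k in range(K):
                acc += A[i][k] * B[k][j]   # inputs and weights stream past
            C[i][j] = acc                  # written exactly once
    return C


def matmul_ws(A, B):
    """Weight-stationary: each B[k][j] is loaded once and reused for all rows."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for k in range(K):
        for j in range(N):
            w = B[k][j]                    # the stationary operand: one weight
            for i in range(M):
                C[i][j] += A[i][k] * w     # partial sums are updated repeatedly
    return C


def matmul_is(A, B):
    """Input-stationary: each A[i][k] is loaded once and reused for all columns."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for i in range(M):
        for k in range(K):
            x = A[i][k]                    # the stationary operand: one input
            for j in range(N):
                C[i][j] += x * B[k][j]     # partial sums are updated repeatedly
    return C


A = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]     # 2x3 input
B = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]   # 3x2 weight
results = [f(A, B) for f in (matmul_os, matmul_ws, matmul_is)]
```

All three orderings compute the same product. What changes is which operand is re-read from memory and how often partial sums are written back, and that is the traffic a bandwidth-limited GPU is sensitive to.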

[Figure: dense winner heatmap]

Sparse SpMSpM — Algorithm Choice

144 benchmark data points across 6 attention shapes × 3 sparsity levels × 4 algorithms × 2 GPUs:

  • Outer-product wins universally — 36/36 cases on both T4 and MI300x.
  • All three Triton algorithms beat vendor (cuSPARSE/rocSPARSE) — 54/54 cases.
  • Max speedup: 198x over rocSPARSE on MI300x at 70% sparsity.
  • GPU reverses the CPU algorithm ranking from [GAMMA] — Gustavson ≈ Inner-product on GPU.
  • Outer-product benefits most from MI300x: 4.4x faster than on T4, versus 2.4x for Gustavson.

→ Specialized Triton kernels beat vendor libraries by orders of magnitude for LLM attention sparsity.
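
For readers unfamiliar with the three SpMSpM formulations, the sketch below shows how each one traverses the operands. It is a pure-Python illustration on `{row: {col: value}}` dictionaries, assuming the textbook definitions of inner-product, outer-product, and Gustavson (row-wise) multiplication. The function names are hypothetical and none of this is the repo's Triton code.

```python
from collections import defaultdict


def transpose(rows):
    """Turn a row-major {row: {col: val}} map into a column-major one."""
    cols = defaultdict(dict)
    for r, entries in rows.items():
        for c, v in entries.items():
            cols[c][r] = v
    return dict(cols)


def spmspm_inner(A, B):
    """Inner product: C[i][j] = dot(row i of A, column j of B)."""
    B_cols = transpose(B)
    C = {}
    for i, a_row in A.items():
        for j, b_col in B_cols.items():
            shared = a_row.keys() & b_col.keys()   # index intersection
            if shared:
                C.setdefault(i, {})[j] = sum(a_row[k] * b_col[k] for k in shared)
    return C


def spmspm_outer(A, B):
    """Outer product: sum over k of (column k of A) x (row k of B)."""
    A_cols = transpose(A)
    C = {}
    for k, a_col in A_cols.items():
        b_row = B.get(k, {})
        for i, a in a_col.items():
            for j, b in b_row.items():
                row = C.setdefault(i, {})
                row[j] = row.get(j, 0.0) + a * b   # merge partial products
    return C


def spmspm_gustavson(A, B):
    """Gustavson (row-wise): row i of C = sum over k of A[i][k] * (row k of B)."""
    C = {}
    for i, a_row in A.items():
        acc = {}                                   # accumulator for one output row
        for k, a in a_row.items():
            for j, b in B.get(k, {}).items():
                acc[j] = acc.get(j, 0.0) + a * b
        if acc:
            C[i] = acc
    return C


A = {0: {0: 1.0, 2: 2.0}, 1: {1: 3.0}}
B = {0: {1: 4.0}, 1: {0: 5.0}, 2: {1: 6.0}}
C_inner, C_outer, C_gust = spmspm_inner(A, B), spmspm_outer(A, B), spmspm_gustavson(A, B)
```

The inner-product form spends work on index intersections that may turn out empty, the outer-product form never multiplies by a missing entry but must merge partial products, and Gustavson keeps one accumulator per output row. How these trade-offs map onto GPU hardware is what the benchmarks measure.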

[Figure: sparse winner heatmap]


Repository Structure

dataflow-bench/
├── workloads/          # LLaMA-7B matrix shape definitions (13 shapes)
├── dense/              # Dense matmul dataflow comparison
│   ├── kernels/        # WS, IS, OS Triton kernels
│   ├── results/        # T4 and MI300x benchmark CSVs
│   └── analysis/       # Scripts, 5 figures, report
└── sparse/             # Sparse SpMSpM algorithm comparison
    ├── sparsity/       # Top-k mask generator, CSR utilities
    ├── algorithms/     # Inner, Outer, Gustavson, Vendor kernels
    ├── results/        # T4 and MI300x benchmark CSVs
    └── analysis/       # Scripts, 5 figures, report
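
As a rough picture of what a top-k mask generator and a CSR utility do, here is a minimal pure-Python sketch. It assumes the mask keeps the `k` largest-magnitude entries in each row and that CSR uses the standard `(values, col_indices, row_ptr)` layout. The function names and the per-row masking choice are assumptions, not the contents of `sparse/sparsity/`.

```python
def topk_mask(row, k):
    """Return a 0/1 mask keeping the k largest-magnitude entries of one row."""
    order = sorted(range(len(row)), key=lambda j: abs(row[j]), reverse=True)
    keep = set(order[:k])
    return [1 if j in keep else 0 for j in range(len(row))]


def dense_to_csr(matrix):
    """Convert a dense list-of-lists into CSR arrays (values, col_indices, row_ptr)."""
    values, col_indices, row_ptr = [], [], [0]
    for row in matrix:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_indices.append(j)
        row_ptr.append(len(values))        # where the next row starts
    return values, col_indices, row_ptr


scores = [[0.1, 0.7, 0.05, 0.15],
          [0.4, 0.1, 0.3, 0.2]]
k = 2  # keep 2 of 4 entries per row, i.e. 50% sparsity
masked = [[v * m for v, m in zip(row, topk_mask(row, k))] for row in scores]
values, col_indices, row_ptr = dense_to_csr(masked)
```

With a fixed `k` per row, every row has the same number of nonzeros, so `row_ptr` advances in equal steps.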

Hardware

| Hardware | Architecture | Memory | Bandwidth |
|---|---|---|---|
| NVIDIA T4 | Turing, Tensor Cores | 16GB GDDR6 | 300 GB/s |
| AMD MI300x | CDNA3, Matrix Cores | 192GB HBM3 | 5.3 TB/s |

Workloads

13 LLaMA-7B matrix shapes:

| Group | Shapes | Size |
|---|---|---|
| Attention projections (Q/K/V/Out) | M1–M4 | 4096×4096×4096 |
| FFN layers (gate/up/down) | M5–M7 | up to 4096×11008×4096 |
| Attention scores Q×Kᵀ | M8–M10 | seq 512/1024/2048 |
| Attention output scores×V | M11–M13 | seq 512/1024/2048 |
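
To put the shapes in context against the two memory systems, the sketch below estimates the FLOPs and the minimum data movement for one GEMM, then the time each GPU would need just to move that data at the bandwidths listed under Hardware. It assumes fp16 operands (2 bytes per element), which the README does not state, and the helper name `gemm_cost` is hypothetical. The square 4096×4096×4096 projection shape is used, so the result does not depend on how the three dimensions are ordered.

```python
def gemm_cost(m, n, k, bytes_per_elem=2):
    """FLOPs and minimum bytes moved for C[m,n] = A[m,k] @ B[k,n].

    bytes_per_elem=2 assumes fp16 operands; the README does not state the dtype.
    """
    flops = 2 * m * n * k                                  # one multiply + one add per MAC
    min_bytes = bytes_per_elem * (m * k + k * n + m * n)   # each matrix touched once
    return flops, min_bytes


# Peak memory bandwidth in bytes/s, taken from the Hardware table.
BANDWIDTH = {"T4": 300e9, "MI300x": 5.3e12}

flops, min_bytes = gemm_cost(4096, 4096, 4096)             # attention projection shape
intensity = flops / min_bytes                              # FLOPs per byte at perfect reuse
for name, bw in BANDWIDTH.items():
    floor_us = min_bytes / bw * 1e6                        # memory-bound lower limit
    print(f"{name}: at least {floor_us:.1f} us just to move the data")
```

A dataflow that re-reads operands moves more than `min_bytes`, and every extra byte costs roughly 18 times more time on the T4 than on the MI300x at these peak bandwidths. This fits the larger spread between dataflows reported on the T4.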

Quick Start

git clone https://github.com/midiareshadi/dataflow-bench.git
cd dataflow-bench
pip install torch triton pandas numpy matplotlib

Dense dataflow benchmark (NVIDIA T4)

python3 dense/benchmark_t4.py

Dense dataflow benchmark (AMD MI300x)

pip install torch --index-url https://download.pytorch.org/whl/rocm6.0
python3 dense/benchmark_mi300x.py

Sparse SpMSpM benchmark (NVIDIA T4)

python3 sparse/benchmark_t4.py

Sparse SpMSpM benchmark (AMD MI300x)

python3 sparse/benchmark_mi300x.py

Reproduce analysis and figures

python3 dense/analysis/analyze.py && python3 dense/analysis/plot.py
python3 sparse/analysis/analyze.py && python3 sparse/analysis/plot.py
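
The win counts and best/worst spreads quoted under Key Findings are the kind of summary an analysis step derives from raw timings. The sketch below shows one way to compute them in pure Python. The record layout `(hardware, shape, dataflow, latency_ms)` and the numbers are made up for illustration and do not reflect the repo's CSV schema or results.

```python
from collections import Counter

# Hypothetical records: (hardware, shape, dataflow, latency_ms). Values are
# made up for illustration and are not results from this repo.
records = [
    ("T4", "M1", "WS", 1.0), ("T4", "M1", "IS", 1.2), ("T4", "M1", "OS", 3.0),
    ("T4", "M5", "WS", 2.5), ("T4", "M5", "IS", 2.0), ("T4", "M5", "OS", 6.0),
]


def summarize(records):
    """Count wins per (hardware, dataflow) and the best/worst spread per case."""
    by_case = {}
    for hw, shape, dataflow, ms in records:
        by_case.setdefault((hw, shape), {})[dataflow] = ms
    wins, spread = Counter(), {}
    for (hw, shape), timings in by_case.items():
        best = min(timings, key=timings.get)               # lowest latency wins
        wins[(hw, best)] += 1
        spread[(hw, shape)] = max(timings.values()) / timings[best]
    return wins, spread


wins, spread = summarize(records)
```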

Detailed Results


References
