
Work-Stealing-based Persistent Kernel #64

Open

neoblizz wants to merge 22 commits into main from neoblizz/work-stealing

Conversation

@neoblizz (Member) commented Feb 5, 2026

Motivation

Let workgroups dynamically steal tile IDs at runtime instead of relying on a fixed, static partitioning.

Getting Started

git clone -b neoblizz/work-stealing https://github.com/ROCm/tritonBLAS
cd tritonBLAS
pip install -e .

# Install latest triton
git clone https://github.com/triton-lang/triton
cd triton
pip install -e .

# Work-stealing CU sweep (304 to 32 CUs)
python benchmarks/tritonblas_matmul.py \
    --input-yaml datasets/bench_8k.yaml \
    --work-stealing \
    --cu-sweep \
    --cu-sweep-max-remove 34 \
    --counters-per-xcd 1 \
    --output-csv results_ws_cu_sweep.csv

python benchmarks/torch_matmul.py \
    --input-yaml datasets/bench_8k.yaml \
    --cu-sweep \
    --cu-sweep-max-remove 34 \
    --output-csv results_torch_cu_sweep.csv

python tools/plot_cu_sweep.py \
    --persistent results_persistent_sweep.csv \
    --torch      results_torch_cu_sweep.csv \
    --ws-cpc 1   results_ws_cu_sweep.csv \
    -o cu_sweep_plot.png

Copilot AI review requested due to automatic review settings (February 5, 2026 20:34)

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a work-stealing-based persistent GEMM kernel that dynamically allocates tile IDs across compute units instead of using fixed partitioning. The implementation uses per-XCD (chiplet) atomic counters to reduce contention compared to global atomic operations. The work-stealing kernel is exposed as an opt-in feature through a new work_stealing parameter in the matmul APIs.
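
For intuition, here is a minimal sketch of the per-XCD tile-acquisition loop described above, written in Triton. The buffer name tile_counters, the NUM_XCDS constexpr, and the per-XCD slicing scheme are assumptions drawn from this summary, not the actual code in persistent_gemm_work_stealing.py:

import triton
import triton.language as tl

@triton.jit
def ws_tile_loop_sketch(tile_counters,   # one int32 counter per XCD
                        num_tiles,       # total number of output tiles
                        NUM_XCDS: tl.constexpr):
    pid = tl.program_id(0)
    xcd = pid % NUM_XCDS  # counter owned by this workgroup's chiplet

    # Each XCD drains its own slice of the tile space; workgroups on that XCD
    # claim tile ids dynamically instead of receiving a fixed static range.
    tiles_per_xcd = tl.cdiv(num_tiles, NUM_XCDS)
    start = xcd * tiles_per_xcd
    end = tl.minimum(start + tiles_per_xcd, num_tiles)

    tile_id = start + tl.atomic_add(tile_counters + xcd, 1)
    while tile_id < end:
        # ... compute one GEMM output tile for tile_id here ...
        tile_id = start + tl.atomic_add(tile_counters + xcd, 1)

The real kernel also handles epilogue and stream-K details omitted here; the point is only that tile IDs come from an atomic counter shared by the CUs of one XCD rather than from a static per-workgroup schedule.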

Changes:

  • Added MatmulConfig class to pre-allocate and manage GPU buffers for kernel launches (tile counters, stream-K locks/partials)
  • Implemented work-stealing kernel with per-XCD atomic tile counters in persistent_gemm_work_stealing.py
  • Extended all matmul APIs with optional work_stealing and config parameters to support the new kernel (a usage sketch follows below)
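
A hedged usage sketch of the opt-in path; the MatmulConfig constructor argument shown is an assumption based on this summary, not the exact signature:

import torch
import tritonblas

A = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)
B = torch.randn(8192, 8192, device="cuda", dtype=torch.float16)

# Default path: torch-like call; buffers are managed internally.
C = tritonblas.matmul(A, B)

# Opt-in work-stealing path: allocate counters/locks once in a MatmulConfig
# and reuse it across calls (constructor arguments here are illustrative).
cfg = tritonblas.MatmulConfig(device="cuda")
C_ws = tritonblas.matmul(A, B, work_stealing=True, config=cfg)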

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 22 comments.

File-by-file summary:

  • include/tritonblas/matmul.py: Added MatmulConfig class for buffer management; integrated work_stealing parameter and ws_persistent_matmul kernel; refactored buffer allocation to use config objects.
  • include/tritonblas/kernels/persistent_gemm_work_stealing.py: New work-stealing kernel implementation with per-XCD atomic counters and dynamic tile assignment.
  • include/tritonblas/kernels/__init__.py: Exported the ws_persistent_matmul kernel.
  • include/tritonblas/__init__.py: Exported MatmulConfig and matmul_preamble in the public API.
  • tests/test_work_stealing.py: Standalone test with custom module loading to verify work-stealing kernel correctness and performance.
  • benchmarks/benchmark_work_stealing.py: Comprehensive benchmark comparing work-stealing against static persistent, stream-K, and torch.matmul.


@ryanswann-amd self-requested a review February 12, 2026 17:29

@ryanswann-amd (Collaborator) left a comment

I think we need to think more about how we intend people to use matmul.

b: torch.Tensor,
c: torch.Tensor,
selector,
config: MatmulConfig,

Collaborator:

I'm pretty confident this makes the matmul call non-torch-like. Do we want the user to manage the locks tensor (by passing it in via the MatmulConfig object), or do we want tritonBLAS to manage it internally (like hipblasLT)?

Collaborator:

@asunderwood is a good example of a torch user. Thoughts?

Collaborator:

This particular change isn't an issue because, from a torch API perspective, the user won't be calling persistent_matmul_lt() themselves. They'll instead call matmul(), which does have a new, non-torch kwarg, but it's not the first we've added, and as long as it has a default value that lets a user skip it, it retains torch compatibility.

Comment on lines 188 to 228

Collaborator:

This results in an allocation on every call when not using a config, right? For example, if I just call C = tritonblas.matmul(A, B).

else:
locks = torch.empty(grids, device="cuda", dtype=torch.uint8)
P = torch.empty(grids, block_size, device="cuda", dtype=torch.float32)
locks = torch.empty(grids, device=cfg.device, dtype=torch.uint8)

Collaborator:

This results in an allocation on every call to C = tritonblas.matmul(A, B) if I don't pre-allocate buffers. This means that to get performance (and avoid an allocation each time), users have to pass in pre-allocated buffers, which is what we were trying to avoid with the previous approach.

Collaborator:

Aren't there already locks in the config being passed in? Shouldn't we just use those instead of re-allocating?
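
A minimal sketch of the reuse being suggested here, assuming the config object carries pre-allocated locks/P buffers; the attribute and helper names are hypothetical:

import torch

def _streamk_buffers(cfg, grids, block_size):
    if cfg is not None and getattr(cfg, "locks", None) is not None:
        # Reuse the buffers allocated once inside MatmulConfig.
        return cfg.locks, cfg.P
    # Fallback: allocate per call, which is the overhead flagged above.
    locks = torch.empty(grids, device="cuda", dtype=torch.uint8)
    P = torch.empty(grids, block_size, device="cuda", dtype=torch.float32)
    return locks, P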

neoblizz and others added 9 commits February 18, 2026 21:46
Resolve merge conflicts from 'Support async copy (#72)':

- origami.py: Keep both work-stealing params (total_cus, active_cus) and
  main's num_stages param. Keep work-stealing's multi-version origami API
  handling for select_workgroup_mapping over main's simpler version.

- matmul.py: Use main's getattr(selector, "num_stages", 2) for proper
  num_stages propagation. Keep work-stealing's pre-allocated buffer
  optimization for locks/P but use a.device (from main) in fallback path.

@ryanswann-amd (Collaborator) left a comment

Looks good other than the weird diff file.

arch.diff (Outdated)
Comment on lines +1 to +12
diff --git a/shared/origami/python/origami_module.cpp b/shared/origami/python/origami_module.cpp
index a85c5da..c4b9b07 100644
--- a/shared/origami/python/origami_module.cpp
+++ b/shared/origami/python/origami_module.cpp
@@ -154,6 +154,7 @@ NB_MODULE(origami, m) {
size_t,
std::tuple<double, double, double>>())
.def("print", &hardware_t::print)
+ .def_rw("arch", &hardware_t::arch)
.def_rw("N_CU", &hardware_t::N_CU)
.def_rw("lds_capacity", &hardware_t::lds_capacity)
.def_rw("mem1_perf_ratio", &hardware_t::mem1_perf_ratio)

Collaborator:

We should not commit diff files.

Point setup.py to ryaswann/tritonblas_expose_arch branch of
rocm-libraries which includes the .def_rw("arch", &hardware_t::arch)
change. Remove arch.diff and the git apply step since the change is
now baked into the upstream commit.