
Stream-k mxfp4 kernel #74

Open

xiaohuguo2023 wants to merge 19 commits into main from streamk_mxfp4

Conversation

@xiaohuguo2023
Member

  • Add Stream-K mxfp4 kernel
  • Add unit and performance tests

@xiaohuguo2023 xiaohuguo2023 changed the base branch from support_async_copy to main March 15, 2026 21:34
Copilot AI review requested due to automatic review settings March 15, 2026 21:37
Contributor

Copilot AI left a comment


Pull request overview

This PR adds Stream‑K load balancing support for FP4 GEMM in tritonblas, along with an expanded FP4 test/benchmark suite and a pytest marker for performance benchmarks.

Changes:

  • Add a new Triton kernel implementing Stream‑K for FP4 matmul and wire it into the matmul_fp4 dispatch.
  • Extend FP4 tests to cover Stream‑K correctness/behavior and add Stream‑K performance benchmark tests.
  • Register a performance pytest marker in pyproject.toml.
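For readers new to the technique: classic tile-based GEMM assigns one output tile per program, which leaves compute units idle when the tile count does not divide the grid evenly; Stream‑K instead splits the flattened K‑iteration space evenly across a fixed grid, so a program may start and finish mid‑tile. A minimal sketch of that partitioning idea, using illustrative names (`num_tiles`, `iters_per_tile`, `sk_grid`) rather than anything from this PR:

```python
# Illustrative sketch of Stream-K work partitioning (not this PR's kernel code).
def streamk_partition(num_tiles: int, iters_per_tile: int, sk_grid: int) -> None:
    """Split num_tiles * iters_per_tile K-iterations evenly across sk_grid programs."""
    total_iters = num_tiles * iters_per_tile
    base, rem = divmod(total_iters, sk_grid)
    start = 0
    for pid in range(sk_grid):
        end = start + base + (1 if pid < rem else 0)  # first `rem` programs take one extra iteration
        if end > start:
            # A program may begin and finish mid-tile; such partial tiles are
            # combined later in a fix-up phase using a scratch buffer and locks.
            print(f"program {pid}: iterations [{start}, {end}), "
                  f"tiles {start // iters_per_tile}..{(end - 1) // iters_per_tile}")
        start = end

# Example: 10 output tiles x 8 K-iterations shared by 6 programs.
streamk_partition(num_tiles=10, iters_per_tile=8, sk_grid=6)
```

Tiles that end up split across programs are finished by a fix-up phase, which is what the locks/P scratch buffers discussed below support.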

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

  • include/tritonblas/matmul.py: Adds FP4 Stream‑K dispatch and scratch-buffer handling; updates Stream‑K buffer reuse logic.
  • include/tritonblas/kernels/fp4_streamk_gemm.py: New FP4 Stream‑K GEMM Triton kernel implementation.
  • include/tritonblas/kernels/__init__.py: Exports the new FP4 Stream‑K kernel symbol.
  • tests/test_matmul_fp4.py: Adds Stream‑K FP4 correctness tests and multiple Stream‑K benchmark tests.
  • pyproject.toml: Defines the performance pytest marker.
Comments suppressed due to low confidence (1)

include/tritonblas/matmul.py:201

  • Global Stream-K scratch buffers (_global_locks / _global_P) are allocated once on the default CUDA device at import time. The reuse path here does not verify that a.device matches those buffers, which can crash or silently misbehave when matmul runs on a non-default GPU (e.g. cuda:1). Consider gating reuse on device match (or keeping a per-device cache of scratch buffers) and falling back to per-call allocation otherwise.
    # Reuse pre-allocated global buffers at runtime, but allocate fresh during
    # torch.compile tracing (FakeTensorMode cannot slice real tensors).
    if not torch.compiler.is_compiling() and grids <= MAX_SMS and block_size <= MAX_BLOCK_SIZE:
        locks = _global_locks[:grids]
        P = _global_P[:grids, :block_size]
    else:
        locks = torch.empty(grids, device=a.device, dtype=torch.uint8)
        P = torch.empty(grids, block_size, device=a.device, dtype=torch.float32)
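One way to act on this suggestion is a per-device cache keyed on torch.device, with a fallback to per-call allocation during tracing or for oversized requests. A hedged sketch; `_scratch_cache` and `_get_streamk_scratch` are hypothetical names, not code from this PR:

```python
import torch

# Hypothetical per-device scratch cache along the lines the review suggests;
# the names and sizing policy here are illustrative, not code from this PR.
_scratch_cache: dict[torch.device, tuple[torch.Tensor, torch.Tensor]] = {}

def _get_streamk_scratch(device, grids, block_size, max_sms, max_block_size):
    # Allocate fresh during torch.compile tracing (FakeTensorMode cannot
    # slice real tensors) or when the request exceeds the cached capacity.
    if torch.compiler.is_compiling() or grids > max_sms or block_size > max_block_size:
        locks = torch.empty(grids, device=device, dtype=torch.uint8)
        P = torch.empty(grids, block_size, device=device, dtype=torch.float32)
        return locks, P
    if device not in _scratch_cache:
        _scratch_cache[device] = (
            torch.empty(max_sms, device=device, dtype=torch.uint8),
            torch.empty(max_sms, max_block_size, device=device, dtype=torch.float32),
        )
    locks, P = _scratch_cache[device]
    return locks[:grids], P[:grids, :block_size]
```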


Comment on lines 568 to 576
def matmul_fp4(
    a: torch.Tensor,
    b: torch.Tensor,
    c: torch.Tensor,
    a_scales: torch.Tensor,
    b_scales: torch.Tensor,
    block_m: Optional[int] = None,  # Overrides Origami value
    block_n: Optional[int] = None,  # Overrides Origami value
    block_k: Optional[int] = None,  # Overrides Origami value
    group_size_m: int = 8,  # Overrides Origami value
    num_warps: int = 8,
    num_stages: int = 2,
    enable_streamk: bool = False,
    sk_grid: Optional[int] = None,
):
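For orientation, a hedged call-site example based only on the signature above. The import path, tensor shapes, dtypes, and FP4/scale packing conventions are assumptions, not confirmed by this PR:

```python
import torch
from tritonblas import matmul_fp4  # import path assumed from the file layout above

M, N, K = 4096, 4096, 8192

# Assumed mxfp4 convention: two 4-bit values packed per uint8 along K, with
# one e8m0 scale per 32 elements; actual shapes/dtypes depend on tritonblas.
a = torch.randint(0, 256, (M, K // 2), device="cuda", dtype=torch.uint8)
b = torch.randint(0, 256, (N, K // 2), device="cuda", dtype=torch.uint8)
c = torch.empty((M, N), device="cuda", dtype=torch.bfloat16)
a_scales = torch.full((M, K // 32), 127, device="cuda", dtype=torch.uint8)  # e8m0 bias 127 => scale 1.0
b_scales = torch.full((N, K // 32), 127, device="cuda", dtype=torch.uint8)

# Stream-K dispatch is opt-in; sk_grid pins the program count (e.g. one per
# CU), and the library presumably chooses a default when it is omitted.
matmul_fp4(a, b, c, a_scales, b_scales, enable_streamk=True, sk_grid=304)
```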
Comment on lines +531 to +540
grids = total_programs_streamk
block_size = BLK_M * BLK_N

if not torch.compiler.is_compiling() and grids <= MAX_SMS and block_size <= MAX_BLOCK_SIZE:
locks = _global_locks[:grids]
P = _global_P[:grids, :block_size]
else:
locks = torch.empty(grids, device=a.device, dtype=torch.uint8)
P = torch.empty(grids, block_size, device=a.device, dtype=torch.float32)
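For context on what these buffers are for: in Stream‑K, a tile whose K‑iterations are split across programs is finished by a fix‑up step in which contributing programs deposit their partial accumulators into P and set their locks flag, and the tile's owner waits on those flags before summing the partials into its own result. A sequential analogue of that reduction (illustrative only; the real synchronization is spin-locks between concurrent GPU programs):

```python
import torch

# Sequential analogue of the Stream-K fix-up reduction (illustrative only; on
# the GPU, contributors write P and set locks while the owner spin-waits).
def fixup_reduce(owner_acc, P, locks, contributors):
    for pid in contributors:
        assert locks[pid] == 1, "real kernel spins until this contributor is done"
        owner_acc += P[pid].view_as(owner_acc)  # fold in the partial accumulator
    return owner_acc

BLK_M, BLK_N = 4, 4
P = torch.randn(3, BLK_M * BLK_N)         # one partial tile row per program
locks = torch.ones(3, dtype=torch.uint8)  # all contributors finished
acc = torch.zeros(BLK_M, BLK_N)
fixup_reduce(acc, P, locks, contributors=[1, 2])
```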

