Cache modifier predict with origami #69

Open
xiaohuguo2023 wants to merge 7 commits into main from cache_modifier_predict
Conversation

@xiaohuguo2023 (Member)

  • Cast strides to int64 to avoid int32 overflow for large tensors.
  • Update Origami to support async_copy.
  • Update Origami to support Triton cache_modifier prediction for global loads / buffer loads.
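The int64 promotion matters because 32-bit index arithmetic silently wraps once row * stride exceeds 2**31 - 1. A minimal sketch of the failure mode and the fix in plain Python/NumPy (not the actual kernel code; the row/stride values are made up for illustration):

```python
import numpy as np

# Hypothetical large-tensor offset: row index times leading stride.
row, stride = 200_000, 32_768
full = row * stride  # 6_553_600_000, exceeds 2**31 - 1

# What 32-bit pointer arithmetic would produce (two's-complement wraparound).
wrapped = ((full + 2**31) % 2**32) - 2**31

# Promoting to int64 before multiplying preserves the true offset.
fixed = int(np.int64(row) * np.int64(stride))

print(wrapped)        # a negative, garbage offset
print(fixed == full)  # True
```

The same reasoning applies to the pid_m/pid_n tile indices: the cast has to happen before the multiply, not after, or the wraparound has already occurred.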

Copilot AI (Contributor) left a comment:

Pull request overview

This PR updates the Origami-based matmul selector and Triton kernels to better handle large-tensor indexing and to automatically choose Triton cache_modifier hints (e.g., non-temporal/streaming loads) based on problem shape and the selected tile.

Changes:

  • Add Triton-aware LDS (shared memory) usage validation to Origami tile selection and re-select from constrained configs when needed.
  • Detect and plumb Origami-derived cache hints into Triton cache_modifier for A/B loads.
  • Promote stride/index arithmetic to int64 in StreamK + staged views to avoid int32 overflow on large tensors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Per-file summary:
  • include/tritonblas/origami.py: Adds LDS validation, cache-hint detection, and exposes cache_modifier_{a,b} for kernels.
  • include/tritonblas/matmul.py: Wires selector-provided cache modifiers into persistent + streamk kernel launches.
  • include/tritonblas/kernels/streamk_gemm.py: Casts key strides/indices to int64 to avoid overflow in pointer arithmetic.
  • include/tritonblas/kernels/stages/tile.py: Promotes pid_m/pid_n to int64 when forming tile indices to prevent overflow.
  • include/tritonblas/kernels/stages/matrix_view.py: Casts view strides to int64 to avoid overflow when computing pointer grids.


Comment on lines +169 to +178
@property
def cache_modifier_a(self):
    """Triton cache_modifier string for operand A loads, or None for default."""
    return self._cache_hints_to_modifier(self._result.config.cache_hints_a)

@property
def cache_modifier_b(self):
    """Triton cache_modifier string for operand B loads, or None for default."""
    return self._cache_hints_to_modifier(self._result.config.cache_hints_b)

Copilot AI commented on Feb 19, 2026:
New selector behavior (_validate_lds_usage, _detect_cache_hints, and the cache_modifier_{a,b} properties) isn’t covered by existing tests (no tests reference cache modifiers today). Add at least one test that exercises a shape expected to trigger nontemporal hints and asserts the selector returns the expected cache modifier strings (and/or that the kernel can be launched with those modifiers).

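A minimal test along these lines could look like the sketch below. Since constructing the real selector requires a full problem setup, the mapping is pinned down here through a stand-in copy of `_cache_hints_to_modifier` taken from the diff above; a real test would instead build the selector for a shape expected to trigger streaming hints and assert on its `cache_modifier_a`/`cache_modifier_b` properties:

```python
# Stand-in mirroring the selector's _cache_hints_to_modifier from the diff,
# so the expected hint -> modifier mapping can be asserted without
# importing tritonblas.
def cache_hints_to_modifier(cache_hints):
    if cache_hints == 0:
        return None      # default (.ca) - cache at all levels
    elif cache_hints == 1:
        return ".cg"     # GLC only
    return ".cs"         # SLC or NT set - streaming / nontemporal

def test_nontemporal_bitmask_streams():
    # Origami's NonTemporal bit (value 4) should map to the streaming modifier.
    assert cache_hints_to_modifier(4) == ".cs"

def test_default_hint_is_none():
    assert cache_hints_to_modifier(0) is None

test_nontemporal_bitmask_streams()
test_default_hint_is_none()
```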
Comment on lines +183 to +198
@staticmethod
def _cache_hints_to_modifier(cache_hints):
    """Map Origami cache_hints integer to Triton cache_modifier string.

    The Origami/Tensile NonTemporal value is a bitmask: [bit2=NT][bit1=SLC][bit0=GLC].
    Triton on AMD exposes three cache modifiers that control GLC/SLC bits:
        .ca (None) - cache at all levels (GLC=0, SLC=0)
        .cg        - cache at global level only (GLC=1, SLC=0)
        .cs        - cache streaming / nontemporal (GLC=1, SLC=1)
    """
    if cache_hints == 0:
        return None   # default (.ca) - cache at all levels
    elif cache_hints == 1:
        return ".cg"  # GLC only - cache at global/L2 level, bypass L1
    else:
        return ".cs"  # SLC or NT set - streaming / nontemporal
A Collaborator commented:

In this example, you have options for no caching, cache_hint = 2 and everything else. The no caching makes sense, but the function you're using to decide on the cache hints below only ever returns 4, which would make the cache_hints == 1 statement unnecessary?

@ryanswann-amd (Collaborator)

Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418

Basically "4" should result in nt flag set on loads in the assembly.
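Following the bitmask layout documented in the snippet above ([bit2=NT][bit1=SLC][bit0=GLC]), the cross-check of which load flags each hint value should set could be sketched as below; treating GLC/SLC as the sc0/sc1 bits is an assumption to be verified against the linked tensilelite table:

```python
def decode_nontemporal(value):
    """Decode the Origami/Tensile NonTemporal bitmask into per-load flag bits."""
    return {
        "glc": bool(value & 0b001),  # assumed to correspond to sc0
        "slc": bool(value & 0b010),  # assumed to correspond to sc1
        "nt":  bool(value & 0b100),  # nontemporal
    }

# Per the review comment, hint value 4 should set only the nt flag on loads.
print(decode_nontemporal(4))  # {'glc': False, 'slc': False, 'nt': True}
```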

xiaohuguo2023 and others added 4 commits February 19, 2026 21:09
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@xiaohuguo2023 (Member, Author)

> Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
>
> Basically "4" should result in nt flag set on loads in the assembly.

Can you also point me to where Origami predicts nontemporal? I feel I may misunderstand the nontemporal implementation.

@ryanswann-amd (Collaborator)

> > Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
> > Basically "4" should result in nt flag set on loads in the assembly.
>
> Can you also point me to where Origami predicts nontemporal? I feel I may misunderstand the nontemporal implementation.

https://github.com/ROCm/rocm-libraries/blob/1c7af2df08362d60001890e19f1bba7206208553/shared/origami/src/origami/gemm.cpp#L928-L961

@xiaohuguo2023 (Member, Author)

> Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
>
> Basically "4" should result in nt flag set on loads in the assembly.

Ah, so Origami doesn't support .cg?

@ryanswann-amd (Collaborator)

> > Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
> > Basically "4" should result in nt flag set on loads in the assembly.
>
> Ah, so Origami doesn't support .cg?

.cg is cache only in L1, I think? Origami predicts between cache nowhere and cache everywhere right now.

@xiaohuguo2023 (Member, Author)

> > > Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
> > > Basically "4" should result in nt flag set on loads in the assembly.
> >
> > Ah, so Origami doesn't support .cg?
>
> .cg is cache only in L1, I think? Origami predicts between cache nowhere and cache everywhere right now.

Hm, .cg means cache at L2 or below; it's actually quite important in Triton. I would be quite surprised if hipBLASLt doesn't need this?

@ryanswann-amd (Collaborator)

We did not see a case in which just avoiding the LLC resulted in a significant performance improvement; are you seeing cases in Triton where this is beneficial?

ryanswann-amd pushed a commit to ryanswann-amd/tritonBLAS that referenced this pull request Apr 12, 2026