Cache modifier predict with origami #69

Open
xiaohuguo2023 wants to merge 7 commits into main from cache_modifier_predict
Conversation

@xiaohuguo2023 (Member)

  • Cast strides to int64 to avoid int32 overflow for large tensors.
  • Update Origami to support async_copy.
  • Update Origami to support Triton cache_modifier prediction for global loads / buffer loads.
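The int64 promotion matters because 32-bit index arithmetic silently wraps once row * stride exceeds 2**31 - 1. A minimal sketch of the failure mode and the fix in plain Python/NumPy (not the actual kernel code; the row/stride values are made up for illustration):

```python
import numpy as np

# Hypothetical large-tensor offset: row index times leading stride.
row, stride = 200_000, 32_768
full = row * stride  # 6_553_600_000, exceeds 2**31 - 1

# What 32-bit pointer arithmetic would produce (two's-complement wraparound).
wrapped = ((full + 2**31) % 2**32) - 2**31

# Promoting to int64 before multiplying preserves the true offset.
fixed = int(np.int64(row) * np.int64(stride))

print(wrapped)        # a negative, garbage offset
print(fixed == full)  # True
```

The same reasoning applies to the pid_m/pid_n tile indices: the cast has to happen before the multiply, not after, or the wraparound has already occurred.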

Copilot AI (Contributor) left a comment:

Pull request overview

This PR updates the Origami-based matmul selector and Triton kernels to better handle large-tensor indexing and to automatically choose Triton cache_modifier hints (e.g., non-temporal/streaming loads) based on problem shape and the selected tile.

Changes:

  • Add Triton-aware LDS (shared memory) usage validation to Origami tile selection and re-select from constrained configs when needed.
  • Detect and plumb Origami-derived cache hints into Triton cache_modifier for A/B loads.
  • Promote stride/index arithmetic to int64 in StreamK + staged views to avoid int32 overflow on large tensors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.

Per-file summary:
  • include/tritonblas/origami.py: Adds LDS validation, cache-hint detection, and exposes cache_modifier_{a,b} for kernels.
  • include/tritonblas/matmul.py: Wires selector-provided cache modifiers into persistent + streamk kernel launches.
  • include/tritonblas/kernels/streamk_gemm.py: Casts key strides/indices to int64 to avoid overflow in pointer arithmetic.
  • include/tritonblas/kernels/stages/tile.py: Promotes pid_m/pid_n to int64 when forming tile indices to prevent overflow.
  • include/tritonblas/kernels/stages/matrix_view.py: Casts view strides to int64 to avoid overflow when computing pointer grids.


Comment on lines +169 to +178
@property
def cache_modifier_a(self):
    """Triton cache_modifier string for operand A loads, or None for default."""
    return self._cache_hints_to_modifier(self._result.config.cache_hints_a)

@property
def cache_modifier_b(self):
    """Triton cache_modifier string for operand B loads, or None for default."""
    return self._cache_hints_to_modifier(self._result.config.cache_hints_b)

Copilot AI commented on Feb 19, 2026:
New selector behavior (_validate_lds_usage, _detect_cache_hints, and the cache_modifier_{a,b} properties) isn’t covered by existing tests (no tests reference cache modifiers today). Add at least one test that exercises a shape expected to trigger nontemporal hints and asserts the selector returns the expected cache modifier strings (and/or that the kernel can be launched with those modifiers).

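A minimal test along these lines could look like the sketch below. Since constructing the real selector requires a full problem setup, the mapping is pinned down here through a stand-in copy of `_cache_hints_to_modifier` taken from the diff above; a real test would instead build the selector for a shape expected to trigger streaming hints and assert on its `cache_modifier_a`/`cache_modifier_b` properties:

```python
# Stand-in mirroring the selector's _cache_hints_to_modifier from the diff,
# so the expected hint -> modifier mapping can be asserted without
# importing tritonblas.
def cache_hints_to_modifier(cache_hints):
    if cache_hints == 0:
        return None      # default (.ca) - cache at all levels
    elif cache_hints == 1:
        return ".cg"     # GLC only
    return ".cs"         # SLC or NT set - streaming / nontemporal

def test_nontemporal_bitmask_streams():
    # Origami's NonTemporal bit (value 4) should map to the streaming modifier.
    assert cache_hints_to_modifier(4) == ".cs"

def test_default_hint_is_none():
    assert cache_hints_to_modifier(0) is None

test_nontemporal_bitmask_streams()
test_default_hint_is_none()
```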
Comment on lines +183 to +198
@staticmethod
def _cache_hints_to_modifier(cache_hints):
    """Map Origami cache_hints integer to Triton cache_modifier string.

    The Origami/Tensile NonTemporal value is a bitmask: [bit2=NT][bit1=SLC][bit0=GLC].
    Triton on AMD exposes three cache modifiers that control GLC/SLC bits:
        .ca (None) - cache at all levels (GLC=0, SLC=0)
        .cg        - cache at global level only (GLC=1, SLC=0)
        .cs        - cache streaming / nontemporal (GLC=1, SLC=1)
    """
    if cache_hints == 0:
        return None   # default (.ca) - cache at all levels
    elif cache_hints == 1:
        return ".cg"  # GLC only - cache at global/L2 level, bypass L1
    else:
        return ".cs"  # SLC or NT set - streaming / nontemporal
A Collaborator commented:

In this example, you have options for no caching, cache_hint = 2 and everything else. The no caching makes sense, but the function you're using to decide on the cache hints below only ever returns 4, which would make the cache_hints == 1 statement unnecessary?

@ryanswann-amd (Collaborator)

Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418

Basically "4" should result in nt flag set on loads in the assembly.
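Following the bitmask layout documented in the snippet above ([bit2=NT][bit1=SLC][bit0=GLC]), the cross-check of which load flags each hint value should set could be sketched as below; treating GLC/SLC as the sc0/sc1 bits is an assumption to be verified against the linked tensilelite table:

```python
def decode_nontemporal(value):
    """Decode the Origami/Tensile NonTemporal bitmask into per-load flag bits."""
    return {
        "glc": bool(value & 0b001),  # assumed to correspond to sc0
        "slc": bool(value & 0b010),  # assumed to correspond to sc1
        "nt":  bool(value & 0b100),  # nontemporal
    }

# Per the review comment, hint value 4 should set only the nt flag on loads.
print(decode_nontemporal(4))  # {'glc': False, 'slc': False, 'nt': True}
```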

xiaohuguo2023 and others added 4 commits February 19, 2026 21:09
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@xiaohuguo2023 (Member, Author)

> Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
>
> Basically "4" should result in nt flag set on loads in the assembly.

Can you also point me to where Origami predicts nontemporal? I feel I may misunderstand the nontemporal implementation.

@ryanswann-amd (Collaborator)

> > Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
> > Basically "4" should result in nt flag set on loads in the assembly.
>
> Can you also point me to where Origami predicts nontemporal? I feel I may misunderstand the nontemporal implementation.

https://github.com/ROCm/rocm-libraries/blob/1c7af2df08362d60001890e19f1bba7206208553/shared/origami/src/origami/gemm.cpp#L928-L961

@xiaohuguo2023 (Member, Author)

> Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
>
> Basically "4" should result in nt flag set on loads in the assembly.

Ah, so Origami doesn't support .cg?

@ryanswann-amd (Collaborator)

> > Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
> > Basically "4" should result in nt flag set on loads in the assembly.
>
> Ah, so Origami doesn't support .cg?

.cg is cache only in L1, I think? Origami predicts between cache nowhere and cache everywhere right now.

@xiaohuguo2023 (Member, Author)

> > > Also, can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite: https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418
> > > Basically "4" should result in nt flag set on loads in the assembly.
> >
> > Ah, so Origami doesn't support .cg?
>
> .cg is cache only in L1, I think? Origami predicts between cache nowhere and cache everywhere right now.

Hm, .cg means cache at L2 or below; it's actually quite important in Triton. I would be quite surprised if hipBLASLt doesn't need this?

@ryanswann-amd (Collaborator)

We did not see a case in which just avoiding the LLC resulted in a significant performance improvement; are you seeing cases in Triton where this is beneficial?

ryanswann-amd pushed a commit to ryanswann-amd/tritonBLAS that referenced this pull request Apr 12, 2026