Cache modifier predict with origami#69
Conversation
xiaohuguo2023
commented
Feb 19, 2026
- Cast strides to int64 to avoid int32 overflow for large tensors
- update origami to support async_copy
- update origami to support triton cache_modifier predict for global loads/buffer_loads.
There was a problem hiding this comment.
Pull request overview
This PR updates the Origami-based matmul selector and Triton kernels to better handle large-tensor indexing and to automatically choose Triton cache_modifier hints (e.g., non-temporal/streaming loads) based on problem shape and the selected tile.
Changes:
- Add Triton-aware LDS (shared memory) usage validation to Origami tile selection and re-select from constrained configs when needed.
- Detect and plumb Origami-derived cache hints into Triton
cache_modifierfor A/B loads. - Promote stride/index arithmetic to
int64in StreamK + staged views to avoid int32 overflow on large tensors.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| include/tritonblas/origami.py | Adds LDS validation, cache-hint detection, and exposes cache_modifier_{a,b} for kernels. |
| include/tritonblas/matmul.py | Wires selector-provided cache modifiers into persistent + streamk kernel launches. |
| include/tritonblas/kernels/streamk_gemm.py | Casts key strides/indices to int64 to avoid overflow in pointer arithmetic. |
| include/tritonblas/kernels/stages/tile.py | Promotes pid_m/pid_n to int64 when forming tile indices to prevent overflow. |
| include/tritonblas/kernels/stages/matrix_view.py | Casts view strides to int64 to avoid overflow when computing pointer grids. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| @property | ||
| def cache_modifier_a(self): | ||
| """Triton cache_modifier string for operand A loads, or None for default.""" | ||
| return self._cache_hints_to_modifier(self._result.config.cache_hints_a) | ||
|
|
||
| @property | ||
| def cache_modifier_b(self): | ||
| """Triton cache_modifier string for operand B loads, or None for default.""" | ||
| return self._cache_hints_to_modifier(self._result.config.cache_hints_b) | ||
|
|
There was a problem hiding this comment.
New selector behavior (_validate_lds_usage, _detect_cache_hints, and the cache_modifier_{a,b} properties) isn’t covered by existing tests (no tests reference cache modifiers today). Add at least one test that exercises a shape expected to trigger nontemporal hints and asserts the selector returns the expected cache modifier strings (and/or that the kernel can be launched with those modifiers).
| @staticmethod | ||
| def _cache_hints_to_modifier(cache_hints): | ||
| """Map Origami cache_hints integer to Triton cache_modifier string. | ||
|
|
||
| The Origami/Tensile NonTemporal value is a bitmask: [bit2=NT][bit1=SLC][bit0=GLC]. | ||
| Triton on AMD exposes three cache modifiers that control GLC/SLC bits: | ||
| .ca (None) - cache at all levels (GLC=0, SLC=0) | ||
| .cg - cache at global level only (GLC=1, SLC=0) | ||
| .cs - cache streaming / nontemporal (GLC=1, SLC=1) | ||
| """ | ||
| if cache_hints == 0: | ||
| return None # default (.ca) - cache at all levels | ||
| elif cache_hints == 1: | ||
| return ".cg" # GLC only - cache at global/L2 level, bypass L1 | ||
| else: | ||
| return ".cs" # SLC or NT set - streaming / nontemporal |
There was a problem hiding this comment.
In this example, you have options for no caching, cache_hint = 2 and everything else. The no caching makes sense, but the function you're using to decide on the cache hints below only ever returns 4, which would make the cache_hints == 1 statement unnecessary?
|
Also can we validate that the cache hints you're giving result in the same assembly flags (sc0, sc1, sc0+sc1, nt) as tensilelite : https://github.com/ROCm/rocm-libraries/blob/4b01840243f57be0296757ddef4075b063bd65b7/shared/tensile/Tensile/Common.py#L1412-L1418 Basically "4" should result in nt flag set on loads in the assembly. |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
can you also point to me where origami predict nontemporal ? I feel I may misunderstand the nontemporal implementation ? |
|
ah, so origami doesn't support .cg ? |
.cg is cache only in L1 I think? Origami predicts between cache nowhere and cache everywhere right now. |
hm, .cg means cache L2 or below, it's actually quite important in triton. I would quite be surprised if hipblaslt doesn't need this ? |
|
We did not see a case in which just avoiding LLC resulted in significant performance improvement, are you seeing cases in triton where this is beneficial? |