
Conversation

@Hardcode84 (Contributor) commented Dec 28, 2025

Decompose memrefs early, materializing index calculations, LLVM GEPs, and LLVM loads/stores explicitly. This makes index calculations more amenable to integer range and (future) uniformity analyses. Add an integer narrowing pass to the pipeline; it reduces the VGPR count on the mxfp gemm.

Some operations with non-trivial lowering (buffer casts and amdgpu.gather-to-lds) are kept in memref land but converted to 0-D memrefs, so their indexing is exposed as well.

This pass is essentially an alternative memref-to-LLVM lowering, and it is compatible with the existing upstream one: for operations it does not lower, it generates MemRefDescriptor shims.

  • What's wrong with the default upstream memref-to-llvm lowering? The default lowering uses MemRefDescriptors, which hide all index calculations.
  • Why not use 1D/0D memrefs? I want the `* sizeof(T)` part of the index calculation to be materialized explicitly. An early POC used memref<?xi8> and memref.view, but it required a lot of memref casts back and forth.
  • Why not the ptr dialect? I had a ptr-dialect POC as well, but the ptr dialect is currently incomplete, and at this point it is a carbon copy of the llvm dialect, providing no useful abstraction.
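
As a rough illustration of what "materializing index calculations" means, a 2-D load might decompose along these lines (a hand-written sketch, not actual pass output; the affine map, op spellings, and function names are illustrative):

```mlir
// Before: the index arithmetic is hidden inside memref.load.
func.func @load(%m: memref<16x32xf32>, %i: index, %j: index) -> f32 {
  %v = memref.load %m[%i, %j] : memref<16x32xf32>
  return %v : f32
}

// After (sketch): the linearized byte offset (%i * 32 + %j) * sizeof(f32)
// is computed explicitly, so range/uniformity analyses can see it, and the
// access becomes a plain GEP + load on a bare pointer.
func.func @load_decomposed(%base: !llvm.ptr, %i: index, %j: index) -> f32 {
  %byteOff = affine.apply affine_map<()[s0, s1] -> ((s0 * 32 + s1) * 4)>()[%i, %j]
  %off = arith.index_cast %byteOff : index to i64
  %ptr = llvm.getelementptr %base[%off] : (!llvm.ptr, i64) -> !llvm.ptr, i8
  %v = llvm.load %ptr : !llvm.ptr -> f32
  return %v : f32
}
```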

Signed-off-by: Ivan Butygin <ivan.butygin@gmail.com>
This reverts commit 81d9747.

This reverts commit 548b353.

@Hardcode84 changed the title from "Decompose memref early" to "Decompose memrefs early" on Dec 28, 2025
Copilot AI left a comment

Pull request overview

This PR introduces an early memref decomposition pass that explicitly materializes index calculations, LLVM GEPs, and loads/stores. This enables better integer range and uniformity analyses by exposing index arithmetic early in the compilation pipeline. The changes demonstrate a tangible benefit with VGPR count reduction from 160 to 140 in the mxfp gemm test case.

Key changes:

  • Adds water-memref-decomposition pass that converts multi-dimensional memref operations to explicit pointer arithmetic with affine maps
  • Integrates integer narrowing pass (arith-int-range-narrowing) into the compilation pipeline to leverage exposed index calculations
  • Reorders pipeline to run memref decomposition before affine lowering, enabling better optimization opportunities

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

Summary per file:

  • wave_lang/kernel/wave/water.py — Updates the compilation pipeline to add memref decomposition before affine lowering and integrates integer range optimizations
  • water/tools/water-opt/water-opt.cpp — Registers the arithmetic integer range narrowing pass
  • water/test/Transforms/memref-decomposition.mlir — Comprehensive test suite covering load/store, vector operations, reinterpret_cast, and AMD GPU-specific operations
  • water/lib/Transforms/MemrefDecomposition.cpp — Core implementation of the memref decomposition pass with type converter and pattern rewriters
  • water/lib/Transforms/CMakeLists.txt — Adds the new source file and required dependencies (AMDGPU, SCFTransforms)
  • water/include/water/Transforms/Passes.td — Defines the new pass with documentation and dialect dependencies
  • tests/kernel/wave_gemm_mxfp_test.py — Updates the expected VGPR count from 160 to 140 and adjusts waitcount expectations to reflect the optimization impact



using namespace mlir;

namespace {
Nit: there is no point in having static functions inside an anonymous namespace. LLVM style says to prefer static functions and only use namespaces for classes.


namespace {

static Value getValue(OpBuilder &rewriter, Location loc, OpFoldResult in) {
Please add documentation to all top-level entities.


static SmallVector<Value> getValues(OpBuilder &rewriter, Location loc,
                                    ArrayRef<OpFoldResult> in) {
  SmallVector<Value> result;
Nit: reserve before pushing back in a loop.

Comment on lines +51 to +57
static SmallVector<Value> flatten(ArrayRef<ValueRange> values) {
SmallVector<Value> result;
for (ValueRange value : values)
llvm::append_range(result, value);

return result;
}
Would llvm::concat<Value> work instead? It avoids allocation/copy.

Comment on lines +60 to +61
static std::tuple<LogicalResult, Value, SmallVector<OpFoldResult>,
                  SmallVector<OpFoldResult>>
It's strange to have LogicalResult as part of the tuple instead of FailureOr. I'd consider having vectors as SmallVectorImpl & operands and using null Value as error marker. Multi-element tuples tend to be unreadable at callsites.

}

/// Generate a GEP op with the given buffer and byte offset.
static Value GEP(OpBuilder &builder, Location loc, Value buffer, Value offset) {
Nit: createGEP would not break the naming style guide, and it is generally nicer to indicate when some IR may be created.

/// adjusted pointer.
static Value getFlattenMemref(OpBuilder &rewriter, Location loc, Value source,
                              Type loadType, ArrayRef<OpFoldResult> sizes,
                              unsigned typeBit, ArrayRef<OpFoldResult> strides,
What is typeBit? Is it the bitwidth of the elemental type in bits (it should be named accordingly!). Why in bits? What happens if it is not divisible by 8?

zero, sizes, strides,
getAsOpFoldResult(indices));

AffineExpr mul = rewriter.getAffineSymbolExpr(0) * (typeBit / 8);
Nit: can this use some more descriptive name?


  unsigned alignment = loadOp.getAlignment().value_or(0);
  rewriter.replaceOpWithNewOp<LLVM::LoadOp>(loadOp, loadType, ptr, alignment,
                                            /*volatile_*/ false,
Nit: can't we just carry over the volatile flag?


  unsigned alignment = storeOp.getAlignment().value_or(0);
  rewriter.replaceOpWithNewOp<LLVM::StoreOp>(storeOp, valueToStore, ptr,
                                             alignment, /*volatile_*/ false,
Ditto
