Problem Description
When the problem dimensions are not divisible by the tile size, performance is significantly worse (roughly an order of magnitude) than when they are divisible.
Operating System
Ubuntu 22.04.2 LTS (Jammy Jellyfish)
CPU
AMD EPYC 9575F 64-Core Processor
GPU
AMD Instinct MI355X
ROCm Version
ROCm 7.0.0
ROCm Component
No response
Steps to Reproduce
Force the tile size through the function arguments, then run the FP4 kernel with a problem size that is not divisible by that tile size.
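To make the reproduction condition concrete, here is a minimal sketch that checks whether a given problem size leaves a partial tile in any dimension. It does not call the FP4 kernel. The helper names (`tile_coverage`, `is_divisible`), the M/N/K naming, and all tile and problem sizes are hypothetical values chosen for illustration, not the ones used in this report.

```python
def tile_coverage(m, n, k, tile_m, tile_n, tile_k):
    """Per dimension: number of tiles, leftover elements, and padded extent."""
    dims = {"M": (m, tile_m), "N": (n, tile_n), "K": (k, tile_k)}
    report = {}
    for name, (size, tile) in dims.items():
        tiles = (size + tile - 1) // tile  # integer ceiling division
        report[name] = {
            "tiles": tiles,
            "remainder": size % tile,  # non-zero means the last tile is partial
            "padded": tiles * tile,    # extent the tiles actually cover
        }
    return report


def is_divisible(report):
    """True only if no dimension leaves a partial tile."""
    return all(entry["remainder"] == 0 for entry in report.values())


# Hypothetical sizes: one case that tiles evenly, one that does not.
aligned = tile_coverage(4096, 4096, 4096, 256, 256, 128)
ragged = tile_coverage(4097, 4095, 4100, 256, 256, 128)

print("aligned divisible:", is_divisible(aligned))
print("ragged divisible:", is_divisible(ragged))
for name, entry in ragged.items():
    print(name, entry)
```

Any problem size for which `is_divisible` returns `False` under the forced tile size should fall into the slow case described above.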
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
No response
Additional Information
No response