Releases: ROCm/flashinfer

v0.5.3+amd.2

15 Apr 18:02
d981804

Added

  • Add Jupyter notebook tutorial for using amd-flashinfer on ROCm (#213) @diptorupd
  • Add ROCm profiler module (flashinfer/rocm_profiler/) and FA2 single-prefill benchmark driver using rocprofv3 (#205) @diptorupd
  • Gate torch.compile integration behind FLASHINFER_USE_TORCH_CUSTOM_OPS, with HIP pytest coverage (#210) @demandal25
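
As a rough illustration of the opt-in gate in #210, the sketch below reads FLASHINFER_USE_TORCH_CUSTOM_OPS from the environment and reports whether the custom-op path should be taken. The helper name `use_torch_custom_ops` and the set of accepted truthy values are assumptions made for this example, not the project's actual parsing logic.

```python
import os

# Hypothetical helper: the real flag parsing in amd-flashinfer may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def use_torch_custom_ops(environ=None):
    """Return True when FLASHINFER_USE_TORCH_CUSTOM_OPS holds a truthy value."""
    env = os.environ if environ is None else environ
    value = env.get("FLASHINFER_USE_TORCH_CUSTOM_OPS", "")
    return value.strip().lower() in _TRUTHY


if __name__ == "__main__":
    # Leaving the variable unset keeps the default (non-custom-op) path.
    print("custom ops enabled:", use_torch_custom_ops())
```

Defaulting to "off" when the variable is unset matches the opt-in behavior the changelog entry describes.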

Changed

  • FA2 prefill on HIP: improve occupancy and throughput on CDNA3 via LDS-aware CTA_TILE_Q selection and shared-memory budget capping (#209) @diptorupd
  • Add k128B_16Row swizzle mode for prefill shared memory to reduce LDS bank conflicts on CDNA3 (#207) @diptorupd
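
The idea behind LDS-aware tile selection in #209 can be sketched as picking the largest query tile whose shared-memory footprint fits a per-workgroup budget. Everything below is an illustrative assumption: the candidate tile sizes, the footprint formula (one Q tile plus one K and one V tile), the 64 KiB LDS capacity, and the occupancy target are not taken from the kernel source.

```python
# Illustrative only: the real heuristic lives in the HIP prefill dispatch code.
LDS_BYTES_PER_CU = 64 * 1024  # assumed LDS capacity per compute unit


def smem_bytes(cta_tile_q, cta_tile_kv, head_dim, dtype_bytes=2):
    """Assumed footprint: one Q tile plus one K tile and one V tile."""
    q_bytes = cta_tile_q * head_dim * dtype_bytes
    kv_bytes = 2 * cta_tile_kv * head_dim * dtype_bytes
    return q_bytes + kv_bytes


def select_cta_tile_q(head_dim, cta_tile_kv=32, min_ctas_per_cu=2,
                      candidates=(128, 64, 32, 16)):
    """Pick the largest candidate tile that fits the capped LDS budget."""
    # Capping the budget leaves room for several workgroups per compute unit.
    budget = LDS_BYTES_PER_CU // min_ctas_per_cu
    for tile_q in candidates:  # candidates are ordered largest first
        if smem_bytes(tile_q, cta_tile_kv, head_dim) <= budget:
            return tile_q
    raise ValueError("no candidate tile fits the shared-memory budget")
```

Under these assumptions, a larger head dimension or a stricter occupancy target pushes the selection toward a smaller query tile.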

Fixed

  • HIP FA2 on CDNA3: correct bfloat16 row-sum MFMA intrinsic selection and MFMA C/D row indexing for custom masks with GQA (#214) @subhajitdchow
  • Editable installs: ensure JIT sees current headers under include/ with scikit-build-core redirect mode (symlink at install + get_include_paths fix) (#208) @diptorupd

Maintenance


Contributors: @diptorupd, @demandal25, @subhajitdchow

Summary: This point release focuses on ROCm prefill correctness and performance, developer tooling, and install ergonomics. It fixes silent bf16 softmax scaling errors and custom-mask/GQA mis-indexing in HIP FA2, and improves CDNA3 occupancy and LDS bank behavior for FA2 prefill. It also adds an opt-in torch.compile path, a rocprofv3-based profiling kit, and a notebook tutorial. Editable installs now pick up header edits without reinstalling.

v0.5.3+amd.1

18 Mar 16:04
c4a66de

Updated Upstream

Added

  • Port new upstream RoPE kernels to ROCm: fused RoPE + FP8 quantization and paged KV cache append (#196) @diptorupd
  • Add test_hip_utils (#197) @rtmadduri
  • Add AITER version check to FlashInfer and configure AITER during Docker build (#191) @rtmadduri
  • Add AITER_ROCM_VERSION (#193) @rtmadduri
  • Enable gfx950 support for FlashInfer+ROCm (#188) @rtmadduri
  • Add prefill and decode aliases to sys.modules (#185) @diptorupd
  • Support environments built by TheRock build system (#172) @eppaneamd
  • Run tests in parallel with pytest-xdist and multi-GPU scheduling (#176) @diptorupd
  • JIT core and env: IS_HIP block with cpp_ext_hip, JitSpecRegistry, and amd-jit-cache integration (#173) @diptorupd
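
Registering an alias in `sys.modules`, as #185 does for prefill and decode, lets an import of one dotted name resolve to a module that is already loaded under another name. The sketch below uses made-up module names (`demo_pkg`, `demo_pkg._impl_prefill`) rather than FlashInfer's real package layout.

```python
import importlib
import sys
import types

# Build a stand-in package and implementation module (hypothetical names).
pkg = types.ModuleType("demo_pkg")
pkg.__path__ = []  # mark it as a package so dotted submodule names make sense
impl = types.ModuleType("demo_pkg._impl_prefill")
impl.single_prefill = lambda: "prefill called"

sys.modules["demo_pkg"] = pkg
sys.modules["demo_pkg._impl_prefill"] = impl

# Alias: "demo_pkg.prefill" now resolves to the same module object.
sys.modules["demo_pkg.prefill"] = impl
pkg.prefill = impl

# The import system finds the alias in sys.modules and returns it directly.
aliased = importlib.import_module("demo_pkg.prefill")
```

Because both names point at one module object, state and monkeypatches applied through either name are visible through the other.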

Changed

Fixed

  • Fix AITER prefill graph-capture path + update page-size support (#170) @rtmadduri
  • Add support for both CUDA and HIP generator headers in sampling.cu (#180) @eppaneamd
  • Move HIP_VISIBLE_DEVICES setting to top-level conftests.py (#183) @diptorupd
  • Fix HIP import errors blocking canary test on ROCm (pynvml lazy import, fp4 guard, modules_hip generate_additional_params, mypy) (#173) @diptorupd
  • Fix flaky Sampling Tests (#189) @rtmadduri
  • Reduce sampling threshold from 0.99 to 0.98 (#190) @diptorupd
  • Improve test_logits_processor_hip tolerances (#192) @rtmadduri
  • Fix arch check (#187) @diptorupd

Maintenance


Contributors: @diptorupd, @rtmadduri, @eppaneamd

Summary: This release rebases amd-flashinfer onto the upstream v0.5.3 tag. It adds a full port of the fused RoPE + FP8 quantization and paged KV cache append kernels to ROCm, gfx950 (MI350) support, and AITER version checking with Docker build integration. JIT and env layers now use a dedicated IS_HIP path with cpp_ext_hip and amd-jit-cache. Fixes include AITER prefill graph-capture and page-size handling, HIP/CUDA generator header compatibility in sampling, and ROCm import/canary test fixes. Testing improvements include parallel pytest runs with multi-GPU scheduling, test_hip_utils, and TheRock build system support.
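
One common way to schedule parallel test workers across GPUs is to derive a device index from the pytest-xdist worker id and export it through HIP_VISIBLE_DEVICES before any GPU library is imported. The sketch below shows that mapping; the round-robin policy and the helper names `gpu_for_worker` and `pin_worker_to_gpu` are assumptions, not the repository's actual conftest logic.

```python
import os


def gpu_for_worker(worker_id, num_gpus):
    """Map an xdist worker id such as "gw3" to a GPU index, round-robin."""
    if num_gpus <= 0:
        raise ValueError("num_gpus must be positive")
    if not worker_id.startswith("gw"):
        # Not running under xdist (worker id "master"): use the first GPU.
        return 0
    return int(worker_id[2:]) % num_gpus


def pin_worker_to_gpu(num_gpus, environ=os.environ):
    """Set HIP_VISIBLE_DEVICES for the current worker and return its value."""
    # pytest-xdist exports PYTEST_XDIST_WORKER in each worker process.
    worker_id = environ.get("PYTEST_XDIST_WORKER", "master")
    environ["HIP_VISIBLE_DEVICES"] = str(gpu_for_worker(worker_id, num_gpus))
    return environ["HIP_VISIBLE_DEVICES"]
```

In a setup like this, the pinning would need to run in a top-level conftest so that it takes effect before test modules import the GPU runtime.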


v0.3.1+amd.1

23 Feb 16:27
66921c7

Updated Upstream

Added

Changed

Removed

Fixed

  • Fix device contextmanager to use per-call context instead of setting default globally (#146) @diptorupd
  • Fix minor issues in pytorch_hip.py @diptorupd
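
The difference between a per-call device context and a global default, the subject of #146, can be shown with a save-and-restore context manager. The sketch below is a pure-Python stand-in: `_current_device`, `current_device`, and `device_scope` are hypothetical names that model the runtime's current-device state and are not FlashInfer or PyTorch APIs.

```python
import contextlib

_current_device = 0  # stand-in for the runtime's current-device state


def current_device():
    """Return the device index that is currently active."""
    return _current_device


@contextlib.contextmanager
def device_scope(index):
    """Switch devices for the duration of one call, then restore the old one."""
    global _current_device
    previous = _current_device
    _current_device = index
    try:
        yield
    finally:
        # Restoring in `finally` keeps the caller's device intact on errors too.
        _current_device = previous


with device_scope(2):
    inside = current_device()
after = current_device()
```

A helper that only set the device and never restored it would leave every later call running on the new device, which is the kind of global side effect the fix avoids.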

Maintenance

  • Tech debt reduction: remove superficial diffs and unused code (#152, #153) @diptorupd
  • Update pre-commit hooks with AMD-specific configuration @diptorupd

Contributors: @diptorupd, @demandal25, @rtmadduri

Summary: This release updates amd-flashinfer to use the upstream v0.3.1 tag and adds significant new functionality. Key additions include AITER backend support for both single and batch prefill, full sampling and quantization module ports to ROCm/HIP, and CUDA graph support for paged batch prefill. Infrastructure improvements include isolation of HIP kernels into a dedicated csrc_rocm directory, updated AOT build infrastructure, and initial AMD-specific code coverage tooling.

v0.2.5+amd.2

23 Jan 17:26
45135a5

Added

Changed

Removed

  • Removes leftover src and all tvm bindings (#99) @diptorupd
  • Remove verbose CMake installation messages for editable JIT (#97) @demandal25
  • Chore: Refactors the codebase to remove libflashinfer (#88) @diptorupd
  • Remove xfail markers about HIP support from pytests (#92) @demandal25
  • Chore: Reduce tech debt by removing CUDA sections from generic/prefill.cuh (#87) @diptorupd
  • Removes the test_transpose_4x4_half_registers (#11) @diptorupd

Fixed

  • Add custom ROCm version scheme to fix wheels version naming (#110) @diptorupd
  • Fix datatypes for HIP when using customized attention kernels (#111) @demandal25
  • Fix partition-kv=True case and memory allocation issues in batch prefill (#89) @demandal25
  • Fixes the single prefill kernel dispatch for HEAD_DIM_QK values greater than 64 (#86) @diptorupd
  • Fix/threadblock sync mdo (#62) @diptorupd
  • Fix batch prefill example script for ragged kv cache (#73) @demandal25
  • Fixes to the single prefill dispatch for HIP devices (#64) @diptorupd
  • Skip failing C++ tests and fix mma_debug_utils (#59) @diptorupd
  • Fix Log-sum-exp (LSE) write back for single prefill kernels for CDNA3 (#42) @diptorupd
  • Implemented fix for the write_o_reg_gmem kernel (#39) @diptorupd
  • Fix a few more leftover SPDX headers (#38) @diptorupd
  • Fix SPDX headers for AMD authored files (#37) @diptorupd
  • Improvements to the S-matrix (s_frag) materialization to LDS for debugging (#20) @diptorupd
  • Fix the pipe at the end of a table (#19) @demandal25
  • Fix some compiler warnings in Cxx unit tests (#13) @diptorupd
  • Adds debug utility functions for CDNA3 MMA ops. (#3) @diptorupd
  • Fixes fragment loading to properly pack 16b values into a 32b register (#2) @diptorupd
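
The release tags on this page follow the pattern `<upstream version>+amd.<revision>`, which matches the local version identifier syntax of PEP 440. A minimal sketch of splitting and ordering such strings is shown below; the function name and the regular expression are illustrative and are not the version scheme implemented in #110.

```python
import re

# Optional leading "v", dotted upstream version, then the "+amd.N" local part.
_TAG = re.compile(r"^v?(\d+(?:\.\d+)*)\+amd\.(\d+)$")


def parse_amd_version(tag):
    """Split a tag like "v0.5.3+amd.2" into ((0, 5, 3), 2)."""
    match = _TAG.match(tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    upstream = tuple(int(part) for part in match.group(1).split("."))
    return upstream, int(match.group(2))


tags = ["v0.3.1+amd.1", "v0.5.3+amd.2", "v0.2.5+amd.2", "v0.5.3+amd.1"]
# Sort by upstream version first, then by AMD revision, newest first.
newest_first = sorted(tags, key=parse_amd_version, reverse=True)
```

Comparing the parsed tuples rather than the raw strings keeps the ordering correct once any version component reaches two digits.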

Maintenance

Miscellaneous


Contributors: @diptorupd, @rtmadduri, @demandal25, @clint

Summary: This release brings full prefill kernel support to ROCm, including single and batch prefill with paged and ragged KV cache. Major performance improvements include k128B swizzle mode and FP8 support. Significant infrastructure improvements include complete pytest coverage, an improved build system, and an update to ROCm 7.1.1 / PyTorch 2.8.0.