fix(hip): clamp max_num_kv_chunks to avoid SIGFPE in single decode on CPX devices#1
Closed
demandal25 wants to merge 1 commit into
Closed
fix(hip): clamp max_num_kv_chunks to avoid SIGFPE in single decode on CPX devices#1demandal25 wants to merge 1 commit into
demandal25 wants to merge 1 commit into
Conversation
In the partition-KV path of SingleDecodeWithKVCacheDispatched, max_num_kv_chunks was computed as `max_grid_size / num_kv_heads` without a lower bound. When num_kv_heads exceeds max_grid_size — e.g. MI308X CPX exposes 20 CUs while a shape uses 32 kv-heads — the integer division underflows to 0 and the subsequent `ceil_div(seq_len, 0)` raises SIGFPE in the kernel-launch host code. The existing guard only catches `num_blocks_per_sm == 0`, not this divisor underflow. Clamp the result to >=1 so the path falls back to one CTA per kv-head (no further KV split), which is the correct behavior when the device cannot fit all kv-heads simultaneously. Reproduces with `tests/attention/test_logits_cap.py::test_single_decode_logits_soft_cap` at (seq_len=257, num_heads=32, head_dim=256, soft_cap=1.0) on a 20-CU device. After the fix, all 450 tests in test_logits_cap.py pass. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Owner
Author
|
Wrong base — re-opening against ROCm/flashinfer:amd-integration. |
demandal25
added a commit
that referenced
this pull request
May 20, 2026
…raph, return_lse (ROCm#234) ## Summary The AITER PA v1 decode backend on `amd-integration` has three call patterns that produce wrong output, hard crashes, or unhelpful `NotImplementedError`s. This PR fixes each one at the level it's actually broken at, rather than blanket-disabling AITER. | Case | Behavior on `amd-integration` | This PR | |---|---|---| | Sliding-window attention (`window_left >= 0`) | AITER selected. Wrapper passes `sliding_window = window_left` (off-by-one), and `window_left = 0` collides with AITER's "disabled" sentinel — silently wrong output. | **AITER runs** with corrected convention mapping (`window_left + 1`). | | `use_cuda_graph=True` with explicit `backend="aiter"` | AITER selected. Per-plan scalars (`max_kv_len`, `max_blocks_per_seq`) are baked into the captured graph; replay against a larger batch launches with an under-sized grid. | Clear `ValueError` at `plan()` time. (auto-select already routes to `fa2`.) | | `run(return_lse=True)` | Raises `NotImplementedError("AITER decode backend does not currently return LSE")`. | Transparent dispatch through a pre-built FA2 shadow plan; one-time warning at `plan()` time so the per-call backend switch is not silent. | ## Why each fix ### 1. Sliding window — wrapper convention bug, not a kernel gap AITER PA v1's kernel *does* implement window masking (`csrc/cpp_itfs/pa/pa_kernels.cuh:457`): ```cpp if (local_token_idx + i < context_len - sliding_window) tmp = -FLT_MAX; ``` gated by the `sliding_window_enabled` template flag (set by the compile step at `csrc/cpp_itfs/pa/pa_v1.py:144`). The wrapper already plumbs `sliding_window` through `_aiter_pa_v1_resolve` and the run-time call. The bug on trunk is a **convention difference**: - FlashInfer: `window_left = W` → query at position `kv_len-1` sees positions `[kv_len-1-W, kv_len-1]` = `W+1` tokens. - AITER: `sliding_window = S` (with `S > 0` enabling the mask) → admits `S` tokens. - `S = 0` is AITER's compile-time "disabled" sentinel. Trunk passes `sliding_window = window_left` — off by one, plus `window_left = 0` (one visible token) collides with AITER's disabled sentinel and silently returns full attention. Fixed: `sliding_window = window_left + 1` when `window_left >= 0`, else 0. This keeps AITER on the hot path for sliding-window models (Gemma, Mistral, etc.) instead of giving up perf to FA2. ### 2. CUDA graph — wrapper-level limitation, hard-rejected in explicit path AITER's launch grid is computed at `plan()` time from `max_kv_len` and `max_blocks_per_seq` of the *current* batch and passed by value to the kernel launch. Under CUDA-graph capture these scalars are baked into the captured graph and can't be widened on replay against a larger batch. Supporting this properly would require capturing with worst-case dimensions — a new API parameter (e.g. `max_seq_len_per_request`) — which is out of scope for this PR. The auto-select fallback to FA2 stays in place; the explicit `backend="aiter"` path (which on trunk silently produces broken launches) now raises: ``` ValueError: AITER decode backend is incompatible with CUDA-graph capture: the kernel's launch grid is sized from per-plan scalars (max_kv_len, max_blocks_per_seq) that are baked into the captured graph at capture time. Use backend='fa2' for CUDA-graph workflows, or backend='auto' which routes around this automatically. ``` ### 3. return_lse — replace NotImplementedError with transparent fallback AITER PA v1 does not output LSE (only `out`; the kernel computes per-partition `exp_sums`/`max_logits` internally for split-K but does not expose normalized LSE). Trunk raises `NotImplementedError` at `run()` time, breaking any caller that needs LSE under an AITER plan. This PR pre-builds an FA2 decode plan at AITER `plan()` time and dispatches through it whenever `return_lse=True` arrives at `run()`. This is the only correct option since `return_lse` is per-call, not per-plan. Two details worth flagging: - The shadow plan uses the real `window_left` (and the corresponding template flag), so it produces correct LSE under sliding-window AITER plans (now supported per fix #1). - A one-time-per-device warning is emitted at AITER plan() time. Without it, a user toggling `return_lse=True` on a hot path would silently move from AITER → FA2 with no signal. ## Tests added `tests/rocm_tests/test_batch_decode_aiter_hip.py`: - `test_batch_decode_aiter_sliding_window_vs_fa2` — AITER↔FA2 parity over `window_left ∈ {0, 31, 127, 1023}`, fp16/bf16, batch sizes, GQA ratios, including the saturation regime (`window_left >= max_kv_len-1`) that exercises the kernel's no-op masking branch. - `test_batch_decode_aiter_return_lse_via_fa2` — verifies (a) `return_lse=False` still runs through AITER, (b) `return_lse=True` falls back to the shadow FA2 plan and returns `(output, lse)` matching the pure-FA2 reference, with and without sliding window. - `test_batch_decode_auto_routes_cuda_graph_to_fa2` — `backend="auto"` + `use_cuda_graph=True` resolves to `fa2`. - Extended `test_batch_decode_aiter_rejects_invalid_config` — explicit `backend="aiter"` + `use_cuda_graph=True` raises a `ValueError` mentioning "CUDA-graph". ## Test plan - [x] `pytest tests/rocm_tests/test_batch_decode_aiter_hip.py -v` — **178 passed** in 69 s. New parity + LSE-fallback + CUDA-graph rejection cases. - [x] `pytest tests/rocm_tests/test_sliding_window_hip.py -m "not slow"` — **1248 passed** in 60 s. Exercises the AITER path on every sliding-window decode shape that was previously silently wrong on trunk. - [x] `pytest tests/rocm_tests/test_batch_decode_kernels_hip.py -m "not slow" -n auto --reruns 2` — **1872 passed** in 174 s. Covers the broader decode matrix including `return_lse=True` (which on trunk raised `NotImplementedError` under AITER) and CUDA-graph wrappers (which now route to FA2 cleanly). ## API impact - `BatchDecodeWithPagedKVCacheWrapper` docstring updated to document the AITER-specific constraints (CUDA-graph incompatible; sliding-window supported transparently; `return_lse` falls back to FA2). - No public signature changes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
tests/attention/test_logits_cap.py::test_single_decode_logits_soft_capSIGFPEs at(seq_len=257, num_heads=32, head_dim=256, soft_cap=1.0)on MI308X CPX (20 CUs).SingleDecodeWithKVCacheDispatched's partition-KV path,max_num_kv_chunks = max_grid_size / num_kv_headsunderflows to 0 whennum_kv_heads > max_grid_size(e.g. 20 CUs × 1 block/SM = 20 < 32 kv-heads). The next line then callsceil_div(seq_len, 0)→ SIGFPE in the host launch code.max_num_kv_chunksto>= 1. With the clamp, the kernel falls back to one CTA per kv-head (no further KV split) — the correct behavior when the device can't fit all kv-heads simultaneously.The existing guard at line 700 only catches
num_blocks_per_sm == 0; it does not cover this divisor-underflow case.Test plan
seq_len ∈ {256, 257, 320, 729, 33001},head_dim=256,num_heads=32) returns valid output instead of crashing.pytest tests/attention/test_logits_cap.py— all 450 cases pass (was crashing on first head_dim=256 / num_heads=32 / seq_len>256 case).🤖 Generated with Claude Code