fix(hip): clamp max_num_kv_chunks to avoid SIGFPE in single decode on CPX devices by demandal25 · Pull Request #1 · demandal25/flashinfer

demandal25 · 2026-04-29T04:39:30Z

Summary

tests/attention/test_logits_cap.py::test_single_decode_logits_soft_cap SIGFPEs at (seq_len=257, num_heads=32, head_dim=256, soft_cap=1.0) on MI308X CPX (20 CUs).
Root cause: in SingleDecodeWithKVCacheDispatched's partition-KV path, max_num_kv_chunks = max_grid_size / num_kv_heads truncates to 0 when num_kv_heads > max_grid_size (e.g. 20 CUs × 1 block/SM = 20 < 32 kv-heads). The next line then calls ceil_div(seq_len, 0), an integer division by zero that raises SIGFPE in the host launch code.
Fix: clamp max_num_kv_chunks to >= 1. With the clamp, the kernel falls back to one CTA per kv-head (no further KV split) — the correct behavior when the device can't fit all kv-heads simultaneously.

The existing guard at line 700 only catches num_blocks_per_sm == 0; it does not cover this zero-divisor case.
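
The arithmetic behind the crash and the clamp can be modeled in a few lines. This is a minimal Python sketch of the integer math only, not the actual host code: the helper names `ceil_div` and `max_num_kv_chunks` mirror the identifiers in the description, but their signatures and the `clamp` flag are assumptions made for illustration. Python raises `ZeroDivisionError` where the C++ host code traps with SIGFPE.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division; raises ZeroDivisionError when b == 0."""
    return (a + b - 1) // b


def max_num_kv_chunks(max_grid_size: int, num_kv_heads: int, clamp: bool) -> int:
    """Model of the chunk-count computation (hypothetical signature).

    Integer division truncates toward zero, so the result is 0 whenever
    num_kv_heads > max_grid_size unless it is clamped to at least 1.
    """
    chunks = max_grid_size // num_kv_heads
    return max(chunks, 1) if clamp else chunks


if __name__ == "__main__":
    # Shape from the description: 20 CUs x 1 block per CU, 32 kv-heads, seq_len 257.
    max_grid_size, num_kv_heads, seq_len = 20, 32, 257

    unclamped = max_num_kv_chunks(max_grid_size, num_kv_heads, clamp=False)
    print("unclamped:", unclamped)  # 0, so the next ceil_div divides by zero

    clamped = max_num_kv_chunks(max_grid_size, num_kv_heads, clamp=True)
    print("clamped:", clamped)  # 1
    print("tokens per chunk:", ceil_div(seq_len, clamped))  # 257, i.e. no KV split
```

With the clamp, each kv-head covers the whole sequence in a single chunk, which matches the "one CTA per kv-head" fallback described above.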

Test plan

Minimal repro (seq_len ∈ {256, 257, 320, 729, 33001}, head_dim=256, num_heads=32) returns valid output instead of crashing.
pytest tests/attention/test_logits_cap.py — all 450 cases pass (was crashing on first head_dim=256 / num_heads=32 / seq_len>256 case).
Run full ROCm test suite in CI to confirm no regressions in non-CPX paths.

🤖 Generated with Claude Code

In the partition-KV path of SingleDecodeWithKVCacheDispatched, max_num_kv_chunks was computed as `max_grid_size / num_kv_heads` without a lower bound. When num_kv_heads exceeds max_grid_size (e.g. MI308X CPX exposes 20 CUs while a shape uses 32 kv-heads), the integer division truncates to 0 and the subsequent `ceil_div(seq_len, 0)` raises SIGFPE in the kernel-launch host code. The existing guard only catches `num_blocks_per_sm == 0`, not this zero divisor.

Clamp the result to >=1 so the path falls back to one CTA per kv-head (no further KV split), which is the correct behavior when the device cannot fit all kv-heads simultaneously.

Reproduces with `tests/attention/test_logits_cap.py::test_single_decode_logits_soft_cap` at (seq_len=257, num_heads=32, head_dim=256, soft_cap=1.0) on a 20-CU device. After the fix, all 450 tests in test_logits_cap.py pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

demandal25 · 2026-04-29T04:45:00Z

Wrong base — re-opening against ROCm/flashinfer:amd-integration.

…raph, return_lse (ROCm#234)

## Summary

The AITER PA v1 decode backend on `amd-integration` has three call patterns that produce wrong output, hard crashes, or unhelpful `NotImplementedError`s. This PR fixes each one at the level it's actually broken at, rather than blanket-disabling AITER.

| Case | Behavior on `amd-integration` | This PR |
|---|---|---|
| Sliding-window attention (`window_left >= 0`) | AITER selected. Wrapper passes `sliding_window = window_left` (off-by-one), and `window_left = 0` collides with AITER's "disabled" sentinel: silently wrong output. | **AITER runs** with corrected convention mapping (`window_left + 1`). |
| `use_cuda_graph=True` with explicit `backend="aiter"` | AITER selected. Per-plan scalars (`max_kv_len`, `max_blocks_per_seq`) are baked into the captured graph; replay against a larger batch launches with an under-sized grid. | Clear `ValueError` at `plan()` time. (auto-select already routes to `fa2`.) |
| `run(return_lse=True)` | Raises `NotImplementedError("AITER decode backend does not currently return LSE")`. | Transparent dispatch through a pre-built FA2 shadow plan; one-time warning at `plan()` time so the per-call backend switch is not silent. |

## Why each fix

### 1. Sliding window: wrapper convention bug, not a kernel gap

AITER PA v1's kernel *does* implement window masking (`csrc/cpp_itfs/pa/pa_kernels.cuh:457`):

```cpp
if (local_token_idx + i < context_len - sliding_window)
  tmp = -FLT_MAX;
```

gated by the `sliding_window_enabled` template flag (set by the compile step at `csrc/cpp_itfs/pa/pa_v1.py:144`). The wrapper already plumbs `sliding_window` through `_aiter_pa_v1_resolve` and the run-time call.

The bug on trunk is a **convention difference**:

- FlashInfer: `window_left = W` → query at position `kv_len-1` sees positions `[kv_len-1-W, kv_len-1]` = `W+1` tokens.
- AITER: `sliding_window = S` (with `S > 0` enabling the mask) → admits `S` tokens.
- `S = 0` is AITER's compile-time "disabled" sentinel.
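
The two conventions can be compared by counting how many tokens the last query position sees under each. The sketch below is a plain-Python model written for illustration, not code from FlashInfer or AITER: all three function names are hypothetical, and the AITER side simply mirrors the masking condition quoted above (`token_idx < context_len - sliding_window` is masked, `sliding_window = 0` disables the mask).

```python
def flashinfer_visible_tokens(window_left: int, kv_len: int) -> int:
    """Tokens visible to the last query under FlashInfer's window_left convention.

    window_left < 0 means no window; window_left = W admits W + 1 tokens.
    """
    if window_left < 0:
        return kv_len
    return min(window_left + 1, kv_len)


def aiter_visible_tokens(sliding_window: int, kv_len: int) -> int:
    """Tokens visible under the AITER convention modeled from the quoted mask.

    sliding_window = 0 is the "disabled" sentinel; S > 0 admits S tokens.
    """
    if sliding_window <= 0:
        return kv_len
    return min(sliding_window, kv_len)


def to_aiter_sliding_window(window_left: int) -> int:
    """Convention mapping: window_left + 1 when a window is set, else 0."""
    return window_left + 1 if window_left >= 0 else 0


if __name__ == "__main__":
    kv_len = 64
    for w in (-1, 0, 31):
        want = flashinfer_visible_tokens(w, kv_len)
        passed_through = aiter_visible_tokens(w if w >= 0 else 0, kv_len)
        mapped = aiter_visible_tokens(to_aiter_sliding_window(w), kv_len)
        print(f"window_left={w}: want={want} passthrough={passed_through} mapped={mapped}")
```

Passing `window_left` through unchanged is off by one for `W > 0` and degenerates to full attention at `W = 0`, while the `+ 1` mapping agrees with the FlashInfer count in every case of this model.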

Trunk passes `sliding_window = window_left`: off by one, plus `window_left = 0` (one visible token) collides with AITER's disabled sentinel and silently returns full attention. Fixed: `sliding_window = window_left + 1` when `window_left >= 0`, else 0. This keeps AITER on the hot path for sliding-window models (Gemma, Mistral, etc.) instead of giving up perf to FA2.

### 2. CUDA graph: wrapper-level limitation, hard-rejected in explicit path

AITER's launch grid is computed at `plan()` time from `max_kv_len` and `max_blocks_per_seq` of the *current* batch and passed by value to the kernel launch. Under CUDA-graph capture these scalars are baked into the captured graph and can't be widened on replay against a larger batch. Supporting this properly would require capturing with worst-case dimensions, i.e. a new API parameter (e.g. `max_seq_len_per_request`), which is out of scope for this PR.

The auto-select fallback to FA2 stays in place; the explicit `backend="aiter"` path (which on trunk silently produces broken launches) now raises:

```
ValueError: AITER decode backend is incompatible with CUDA-graph capture: the kernel's launch grid is sized from per-plan scalars (max_kv_len, max_blocks_per_seq) that are baked into the captured graph at capture time. Use backend='fa2' for CUDA-graph workflows, or backend='auto' which routes around this automatically.
```

### 3. return_lse: replace NotImplementedError with transparent fallback

AITER PA v1 does not output LSE (only `out`; the kernel computes per-partition `exp_sums`/`max_logits` internally for split-K but does not expose normalized LSE). Trunk raises `NotImplementedError` at `run()` time, breaking any caller that needs LSE under an AITER plan.

This PR pre-builds an FA2 decode plan at AITER `plan()` time and dispatches through it whenever `return_lse=True` arrives at `run()`. This is the only correct option since `return_lse` is per-call, not per-plan.
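
The per-call dispatch described here can be sketched with a toy wrapper. Everything in this block is hypothetical and exists only to illustrate the pattern: `ShadowPlanDecoder` is not a FlashInfer class, the two callables stand in for the AITER and FA2 backends, and the warning is issued once per process rather than once per device as in the PR.

```python
import warnings
from typing import Callable, Tuple, Union

Output = float
OutputWithLse = Tuple[float, float]


class ShadowPlanDecoder:
    """Toy model: a primary backend without LSE plus a pre-built fallback."""

    _warned = False  # simplification: one warning per process

    def __init__(
        self,
        primary: Callable[[float], Output],
        shadow: Callable[[float], OutputWithLse],
    ) -> None:
        self._primary = primary  # returns the output only
        self._shadow = shadow    # returns (output, lse)

    def plan(self) -> None:
        """Prepare both paths up front and warn once about the fallback."""
        if not ShadowPlanDecoder._warned:
            warnings.warn(
                "return_lse=True will be served by the fallback backend",
                stacklevel=2,
            )
            ShadowPlanDecoder._warned = True

    def run(self, q: float, return_lse: bool = False) -> Union[Output, OutputWithLse]:
        """Choose the backend per call, because return_lse is a run() argument."""
        if return_lse:
            return self._shadow(q)
        return self._primary(q)


if __name__ == "__main__":
    decoder = ShadowPlanDecoder(
        primary=lambda q: q * 2.0,
        shadow=lambda q: (q * 2.0, 0.5),
    )
    decoder.plan()
    print(decoder.run(3.0))                   # primary path, output only
    print(decoder.run(3.0, return_lse=True))  # fallback path, (output, lse)
```

Building the fallback at `plan()` time keeps `run()` free of planning work, which is the property the description relies on when it calls the switch transparent.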

Two details worth flagging:

- The shadow plan uses the real `window_left` (and the corresponding template flag), so it produces correct LSE under sliding-window AITER plans (now supported per fix #1).
- A one-time-per-device warning is emitted at AITER plan() time. Without it, a user toggling `return_lse=True` on a hot path would silently move from AITER → FA2 with no signal.

## Tests added

`tests/rocm_tests/test_batch_decode_aiter_hip.py`:

- `test_batch_decode_aiter_sliding_window_vs_fa2`: AITER↔FA2 parity over `window_left ∈ {0, 31, 127, 1023}`, fp16/bf16, batch sizes, GQA ratios, including the saturation regime (`window_left >= max_kv_len-1`) that exercises the kernel's no-op masking branch.
- `test_batch_decode_aiter_return_lse_via_fa2`: verifies (a) `return_lse=False` still runs through AITER, (b) `return_lse=True` falls back to the shadow FA2 plan and returns `(output, lse)` matching the pure-FA2 reference, with and without sliding window.
- `test_batch_decode_auto_routes_cuda_graph_to_fa2`: `backend="auto"` + `use_cuda_graph=True` resolves to `fa2`.
- Extended `test_batch_decode_aiter_rejects_invalid_config`: explicit `backend="aiter"` + `use_cuda_graph=True` raises a `ValueError` mentioning "CUDA-graph".

## Test plan

- [x] `pytest tests/rocm_tests/test_batch_decode_aiter_hip.py -v`: **178 passed** in 69 s. New parity + LSE-fallback + CUDA-graph rejection cases.
- [x] `pytest tests/rocm_tests/test_sliding_window_hip.py -m "not slow"`: **1248 passed** in 60 s. Exercises the AITER path on every sliding-window decode shape that was previously silently wrong on trunk.
- [x] `pytest tests/rocm_tests/test_batch_decode_kernels_hip.py -m "not slow" -n auto --reruns 2`: **1872 passed** in 174 s. Covers the broader decode matrix including `return_lse=True` (which on trunk raised `NotImplementedError` under AITER) and CUDA-graph wrappers (which now route to FA2 cleanly).

## API impact

- `BatchDecodeWithPagedKVCacheWrapper` docstring updated to document the AITER-specific constraints (CUDA-graph incompatible; sliding-window supported transparently; `return_lse` falls back to FA2).
- No public signature changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>

demandal25 closed this Apr 29, 2026

demandal25 wants to merge 1 commit into amd-integration from fix/single-decode-sigfpe-cpx
