Skip to content

fix(hip): clamp max_num_kv_chunks to avoid SIGFPE in single decode on CPX devices#1

Closed
demandal25 wants to merge 1 commit into
amd-integrationfrom
fix/single-decode-sigfpe-cpx
Closed

fix(hip): clamp max_num_kv_chunks to avoid SIGFPE in single decode on CPX devices#1
demandal25 wants to merge 1 commit into
amd-integrationfrom
fix/single-decode-sigfpe-cpx

Conversation

@demandal25

Copy link
Copy Markdown
Owner

Summary

  • tests/attention/test_logits_cap.py::test_single_decode_logits_soft_cap SIGFPEs at (seq_len=257, num_heads=32, head_dim=256, soft_cap=1.0) on MI308X CPX (20 CUs).
  • Root cause: in SingleDecodeWithKVCacheDispatched's partition-KV path, max_num_kv_chunks = max_grid_size / num_kv_heads underflows to 0 when num_kv_heads > max_grid_size (e.g. 20 CUs × 1 block/SM = 20 < 32 kv-heads). The next line then calls ceil_div(seq_len, 0) → SIGFPE in the host launch code.
  • Fix: clamp max_num_kv_chunks to >= 1. With the clamp, the kernel falls back to one CTA per kv-head (no further KV split) — the correct behavior when the device can't fit all kv-heads simultaneously.

The existing guard at line 700 only catches num_blocks_per_sm == 0; it does not cover this divisor-underflow case.

Test plan

  • Minimal repro (seq_len ∈ {256, 257, 320, 729, 33001}, head_dim=256, num_heads=32) returns valid output instead of crashing.
  • pytest tests/attention/test_logits_cap.py — all 450 cases pass (was crashing on first head_dim=256 / num_heads=32 / seq_len>256 case).
  • Run full ROCm test suite in CI to confirm no regressions in non-CPX paths.

🤖 Generated with Claude Code

In the partition-KV path of SingleDecodeWithKVCacheDispatched, max_num_kv_chunks
was computed as `max_grid_size / num_kv_heads` without a lower bound. When
num_kv_heads exceeds max_grid_size — e.g. MI308X CPX exposes 20 CUs while a
shape uses 32 kv-heads — the integer division underflows to 0 and the
subsequent `ceil_div(seq_len, 0)` raises SIGFPE in the kernel-launch host code.

The existing guard only catches `num_blocks_per_sm == 0`, not this divisor
underflow. Clamp the result to >=1 so the path falls back to one CTA per
kv-head (no further KV split), which is the correct behavior when the device
cannot fit all kv-heads simultaneously.

Reproduces with `tests/attention/test_logits_cap.py::test_single_decode_logits_soft_cap`
at (seq_len=257, num_heads=32, head_dim=256, soft_cap=1.0) on a 20-CU device.
After the fix, all 450 tests in test_logits_cap.py pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
@demandal25

Copy link
Copy Markdown
Owner Author

Wrong base — re-opening against ROCm/flashinfer:amd-integration.

@demandal25 demandal25 closed this Apr 29, 2026
demandal25 added a commit that referenced this pull request May 20, 2026
…raph, return_lse (ROCm#234)

## Summary

The AITER PA v1 decode backend on `amd-integration` has three call
patterns that produce wrong output, hard crashes, or unhelpful
`NotImplementedError`s. This PR fixes each one at the level it's
actually broken at, rather than blanket-disabling AITER.

| Case | Behavior on `amd-integration` | This PR |
|---|---|---|
| Sliding-window attention (`window_left >= 0`) | AITER selected.
Wrapper passes `sliding_window = window_left` (off-by-one), and
`window_left = 0` collides with AITER's "disabled" sentinel — silently
wrong output. | **AITER runs** with corrected convention mapping
(`window_left + 1`). |
| `use_cuda_graph=True` with explicit `backend="aiter"` | AITER
selected. Per-plan scalars (`max_kv_len`, `max_blocks_per_seq`) are
baked into the captured graph; replay against a larger batch launches
with an under-sized grid. | Clear `ValueError` at `plan()` time.
(auto-select already routes to `fa2`.) |
| `run(return_lse=True)` | Raises `NotImplementedError("AITER decode
backend does not currently return LSE")`. | Transparent dispatch through
a pre-built FA2 shadow plan; one-time warning at `plan()` time so the
per-call backend switch is not silent. |

## Why each fix

### 1. Sliding window — wrapper convention bug, not a kernel gap

AITER PA v1's kernel *does* implement window masking
(`csrc/cpp_itfs/pa/pa_kernels.cuh:457`):

```cpp
if (local_token_idx + i < context_len - sliding_window)
    tmp = -FLT_MAX;
```

gated by the `sliding_window_enabled` template flag (set by the compile
step at `csrc/cpp_itfs/pa/pa_v1.py:144`). The wrapper already plumbs
`sliding_window` through `_aiter_pa_v1_resolve` and the run-time call.

The bug on trunk is a **convention difference**:

- FlashInfer: `window_left = W` → query at position `kv_len-1` sees
positions `[kv_len-1-W, kv_len-1]` = `W+1` tokens.
- AITER: `sliding_window = S` (with `S > 0` enabling the mask) → admits
`S` tokens.
- `S = 0` is AITER's compile-time "disabled" sentinel.

Trunk passes `sliding_window = window_left` — off by one, plus
`window_left = 0` (one visible token) collides with AITER's disabled
sentinel and silently returns full attention. Fixed: `sliding_window =
window_left + 1` when `window_left >= 0`, else 0.

This keeps AITER on the hot path for sliding-window models (Gemma,
Mistral, etc.) instead of giving up perf to FA2.

### 2. CUDA graph — wrapper-level limitation, hard-rejected in explicit
path

AITER's launch grid is computed at `plan()` time from `max_kv_len` and
`max_blocks_per_seq` of the *current* batch and passed by value to the
kernel launch. Under CUDA-graph capture these scalars are baked into the
captured graph and can't be widened on replay against a larger batch.

Supporting this properly would require capturing with worst-case
dimensions — a new API parameter (e.g. `max_seq_len_per_request`) —
which is out of scope for this PR. The auto-select fallback to FA2 stays
in place; the explicit `backend="aiter"` path (which on trunk silently
produces broken launches) now raises:

```
ValueError: AITER decode backend is incompatible with CUDA-graph capture:
the kernel's launch grid is sized from per-plan scalars (max_kv_len,
max_blocks_per_seq) that are baked into the captured graph at capture time.
Use backend='fa2' for CUDA-graph workflows, or backend='auto' which routes
around this automatically.
```

### 3. return_lse — replace NotImplementedError with transparent
fallback

AITER PA v1 does not output LSE (only `out`; the kernel computes
per-partition `exp_sums`/`max_logits` internally for split-K but does
not expose normalized LSE). Trunk raises `NotImplementedError` at
`run()` time, breaking any caller that needs LSE under an AITER plan.

This PR pre-builds an FA2 decode plan at AITER `plan()` time and
dispatches through it whenever `return_lse=True` arrives at `run()`.
This is the only correct option since `return_lse` is per-call, not
per-plan. Two details worth flagging:

- The shadow plan uses the real `window_left` (and the corresponding
template flag), so it produces correct LSE under sliding-window AITER
plans (now supported per fix #1).
- A one-time-per-device warning is emitted at AITER plan() time. Without
it, a user toggling `return_lse=True` on a hot path would silently move
from AITER → FA2 with no signal.

## Tests added

`tests/rocm_tests/test_batch_decode_aiter_hip.py`:

- `test_batch_decode_aiter_sliding_window_vs_fa2` — AITER↔FA2 parity
over `window_left ∈ {0, 31, 127, 1023}`, fp16/bf16, batch sizes, GQA
ratios, including the saturation regime (`window_left >= max_kv_len-1`)
that exercises the kernel's no-op masking branch.
- `test_batch_decode_aiter_return_lse_via_fa2` — verifies (a)
`return_lse=False` still runs through AITER, (b) `return_lse=True` falls
back to the shadow FA2 plan and returns `(output, lse)` matching the
pure-FA2 reference, with and without sliding window.
- `test_batch_decode_auto_routes_cuda_graph_to_fa2` — `backend="auto"` +
`use_cuda_graph=True` resolves to `fa2`.
- Extended `test_batch_decode_aiter_rejects_invalid_config` — explicit
`backend="aiter"` + `use_cuda_graph=True` raises a `ValueError`
mentioning "CUDA-graph".

## Test plan

- [x] `pytest tests/rocm_tests/test_batch_decode_aiter_hip.py -v` —
**178 passed** in 69 s. New parity + LSE-fallback + CUDA-graph rejection
cases.
- [x] `pytest tests/rocm_tests/test_sliding_window_hip.py -m "not slow"`
— **1248 passed** in 60 s. Exercises the AITER path on every
sliding-window decode shape that was previously silently wrong on trunk.
- [x] `pytest tests/rocm_tests/test_batch_decode_kernels_hip.py -m "not
slow" -n auto --reruns 2` — **1872 passed** in 174 s. Covers the broader
decode matrix including `return_lse=True` (which on trunk raised
`NotImplementedError` under AITER) and CUDA-graph wrappers (which now
route to FA2 cleanly).

## API impact

- `BatchDecodeWithPagedKVCacheWrapper` docstring updated to document the
AITER-specific constraints (CUDA-graph incompatible; sliding-window
supported transparently; `return_lse` falls back to FA2).
- No public signature changes.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant