
Fix Gemma softcap F16 overflow NaN and scheduler hang (#2058)#2076

Closed
glaziermag wants to merge 1 commit into EricLBuehler:master from glaziermag:fix-2058-gemma-clean

Conversation

@glaziermag
Contributor

@glaziermag glaziermag commented Apr 8, 2026

Fixes #2058.

Two independent fixes

1. Softcap NaN in F16/BF16 (naive attention backend)

Gemma models apply an attention softcap to the logits: tanh(logit / softcap) * softcap. The naive SDPA backend was performing the tanh in the input dtype (F16 or BF16). In F16, values outside approximately ±65504 overflow to ±Inf before the tanh, and tanh of ±Inf computed in reduced precision yields NaN, producing silent NaN logits.

Fix: Promote attention scores to F32 before computing the softcap tanh, then cast back to the original dtype.

Scope note: This fix applies to the CPU/naive fallback path in attention/backends/naive.rs. The primary CUDA (FlashAttention) and Metal SDPA backends have their own softcap handling and are not changed here.
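The failure and the fix can be sketched in plain Rust. Stable Rust has no native f16, so this sketch reproduces the analogous problem one precision level up (f32 standing in for F16, f64 for F32). `naive_tanh` mimics a reduced-precision tanh built from exponentials, which is one plausible way tanh of a large value becomes NaN; the actual kernel details in naive.rs may differ, and all names here are illustrative, not the PR's code.

```rust
/// Reduced-precision tanh built from exponentials (illustrative assumption):
/// exp(x) overflows f32 to +Inf for x > ~88.7, so for large inputs this
/// evaluates Inf / Inf = NaN.
fn naive_tanh(x: f32) -> f32 {
    let (e, em) = (x.exp(), (-x).exp());
    (e - em) / (e + em)
}

/// Softcap applied entirely in the narrow dtype: a large attention score
/// drives the tanh to NaN, which then poisons the softmax silently.
fn naive_softcap(logit: f32, cap: f32) -> f32 {
    naive_tanh(logit / cap) * cap
}

/// The fix pattern: promote to the wider type, apply the softcap tanh
/// there, then cast back to the original dtype.
fn softcap_promoted(logit: f32, cap: f32) -> f32 {
    let wide = (logit as f64) / (cap as f64);
    (wide.tanh() * cap as f64) as f32
}

fn main() {
    // The narrow-dtype path produces NaN for a large score...
    assert!(naive_softcap(1.0e6, 50.0).is_nan());
    // ...while the promoted path saturates cleanly at ±cap.
    assert!((softcap_promoted(1.0e6, 50.0) - 50.0).abs() < 1e-3);
    println!("ok");
}
```

The same promote-compute-demote pattern is what the PR applies to the F16/BF16 score tensor before the tanh.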

2. SequenceState::Error missing from is_finished_paged_attn

When a sequence enters SequenceState::Error, it was not recognized as "finished" by the paged attention scheduler. This caused KV blocks to remain allocated and the scheduler to stall waiting for capacity that would never be returned.

Fix: Add SequenceState::Error to the match arms in is_finished_paged_attn().
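A minimal sketch of the scheduler-side fix. The real SequenceState enum in mistralrs-core/src/sequence.rs has more variants and is_finished_paged_attn is a method on the sequence, so the names and shape here are assumptions for illustration only.

```rust
/// Hypothetical subset of the sequence lifecycle states.
#[derive(Debug, PartialEq)]
enum SequenceState {
    RunningCompletion,
    Done,
    Error,
}

/// A sequence in a terminal state must report "finished" so the paged
/// attention scheduler frees its KV blocks. Before the fix, Error was
/// absent from this match, so errored sequences kept their blocks
/// allocated and the scheduler waited forever for capacity.
fn is_finished_paged_attn(state: &SequenceState) -> bool {
    matches!(state, SequenceState::Done | SequenceState::Error)
}

fn main() {
    assert!(is_finished_paged_attn(&SequenceState::Error));
    assert!(!is_finished_paged_attn(&SequenceState::RunningCompletion));
    println!("ok");
}
```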

Files changed

  • mistralrs-core/src/attention/backends/naive.rs
  • mistralrs-core/src/sequence.rs

@glaziermag glaziermag marked this pull request as ready for review April 8, 2026 02:41
@glaziermag
Contributor Author

Update (2026-04-15): Rebased onto origin/master (upstream HEAD). The branch was previously based on the fork's master, which was behind by ~40 commits. No code changes: both fixes (softcap F32 promotion in naive.rs and SequenceState::Error in is_finished_paged_attn) are unchanged.

@glaziermag
Contributor Author

Closing in favor of atomic split PRs for single-responsibility review:

The two fixes are independent and should be reviewable/mergeable separately.

@glaziermag glaziermag closed this Apr 16, 2026


Successfully merging this pull request may close these issues.

Gemma 4 E2B: inference hangs on complex prompts via MultimodalModelBuilder (GPU confirmed)
