Fix: Resolve NaN Logit Memory Leak During Sequential Benchmarks (Issue #2095) #2108

Open
glaziermag wants to merge 1 commit into EricLBuehler:master from glaziermag:fix/bench-iteration-leak

@glaziermag (Contributor)

Fixes memory corruption and sequence state leakage during sequential mistralrs bench iterations (#2095).

Cause

When benchmarking long generations offline (e.g., --gen-len >= 512) across sequential iterations, the engine accumulates orphaned sequences. The mistralrs engine reuses the same context array for every throughput loop on a per-benchmark thread, so as these sequences pile up, their unevicted KV caches overrun their bounds and feed corrupted (NaN) logits into the multinomial sampler, which makes the engine core panic at runtime.
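To illustrate why NaN logits are fatal at the sampling stage, here is a minimal, hypothetical sketch of softmax plus inverse-CDF multinomial sampling. It is not the mistral.rs or candle sampler; `softmax_checked` and `sample_multinomial` are illustrative names. A single NaN logit poisons the softmax normalizer, and an unchecked cumulative draw would then compare against NaN (always false) and silently fall through, so the sketch rejects NaN up front, much as candle's "Tensors contains NaN" check does in the log below.

```rust
/// Hypothetical helper (not mistral.rs code): softmax over logits that
/// returns an error instead of producing NaN probabilities.
fn softmax_checked(logits: &[f32]) -> Result<Vec<f32>, String> {
    if logits.iter().any(|x| x.is_nan()) {
        return Err("logits contain NaN".to_string());
    }
    // Subtract the max logit for numerical stability before exponentiating.
    let max = logits.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits.iter().map(|&x| (x - max).exp()).collect();
    let sum: f32 = exps.iter().sum();
    Ok(exps.iter().map(|&e| e / sum).collect())
}

/// Hypothetical inverse-CDF multinomial draw, given a uniform sample `u` in [0, 1).
/// Returns `None` only for an empty distribution.
fn sample_multinomial(probs: &[f32], u: f32) -> Option<usize> {
    let mut acc = 0.0f32;
    for (i, &p) in probs.iter().enumerate() {
        acc += p;
        // If `acc` were NaN, this comparison would always be false.
        if u < acc {
            return Some(i);
        }
    }
    // Rounding can leave `acc` slightly below 1.0; fall back to the last index.
    probs.len().checked_sub(1)
}

fn main() {
    let probs = softmax_checked(&[1.0f32, 2.0, 3.0]).unwrap();
    println!("sampled index: {:?}", sample_multinomial(&probs, 0.5));

    match softmax_checked(&[1.0f32, f32::NAN, 3.0]) {
        Ok(_) => println!("unexpected: NaN accepted"),
        Err(e) => println!("rejected: {e}"),
    }
}
```

Returning an error keeps the failure explicit; the actual engine instead panics inside candle, which is the behavior reported in the Before log.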

Changes

  • mistralrs-cli/src/commands/bench.rs: Sends Request::TerminateAllSeqsNextStep immediately after the warmup block and between consecutive benchmark iterations, so the engine drops all in-flight sequence state before the next iteration begins.
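The reset pattern can be sketched with a toy engine driven over a `std::sync::mpsc` channel. Only the variant name `TerminateAllSeqsNextStep` comes from this PR; `MockEngine`, `run_bench`, and the `Generate` variant are hypothetical stand-ins, not the real mistral.rs types or bench.rs code. Sequences in the mock never finish on their own, modeling the orphaned sequences described under Cause.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical, simplified request type; only `TerminateAllSeqsNextStep`
/// mirrors a name from the PR.
#[derive(Debug)]
enum Request {
    Generate { seq_id: u64 },
    TerminateAllSeqsNextStep,
}

/// Toy engine tracking in-flight sequences (a stand-in for per-sequence KV cache state).
struct MockEngine {
    rx: Receiver<Request>,
    active: Vec<u64>,
    terminate_pending: bool,
}

impl MockEngine {
    /// Drain queued requests, then run one scheduling step.
    fn step(&mut self) {
        while let Ok(req) = self.rx.try_recv() {
            match req {
                Request::Generate { seq_id } => self.active.push(seq_id),
                Request::TerminateAllSeqsNextStep => self.terminate_pending = true,
            }
        }
        if self.terminate_pending {
            // Drop every in-flight sequence (a real engine would also free its cache).
            self.active.clear();
            self.terminate_pending = false;
        }
    }
}

/// Hypothetical bench loop: `warmup` rounds, then `iterations` measured rounds.
/// When `reset` is true, a terminate request follows every round.
/// Returns the number of live sequences observed in each measured round.
fn run_bench(
    tx: &Sender<Request>,
    engine: &mut MockEngine,
    warmup: u64,
    iterations: u64,
    reset: bool,
) -> Vec<usize> {
    let mut live_per_iter = Vec::new();
    for i in 0..(warmup + iterations) {
        tx.send(Request::Generate { seq_id: i }).unwrap();
        engine.step();
        if i >= warmup {
            live_per_iter.push(engine.active.len());
        }
        if reset {
            tx.send(Request::TerminateAllSeqsNextStep).unwrap();
            engine.step();
        }
    }
    live_per_iter
}

fn main() {
    for reset in [false, true] {
        let (tx, rx) = channel();
        let mut engine = MockEngine { rx, active: Vec::new(), terminate_pending: false };
        let live = run_bench(&tx, &mut engine, 1, 5, reset);
        println!("reset={reset}: live sequences per iteration = {live:?}");
    }
}
```

With `--warmup 1 --iterations 5` semantics, the mock reports [2, 3, 4, 5, 6] live sequences without the reset and [1, 1, 1, 1, 1] with it, which is the growth the fix is meant to stop.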

Empirical Baseline (GCP g2-standard-32, NVIDIA L4 Tensor Core GPU)

Before

Running raw pipelines on L4 instances with native f16 or GGUF weights:

cargo run --release --features cuda,flash-attn --bin mistralrs -- bench -m mistralai/Mistral-7B-Instruct-v0.1 --from-uqff /home/gabe/mistral-7b-instruct-v0.1.Q4_K_M.gguf --paged-attn off --prompt-len 512 --gen-len 512 --iterations 5 --warmup 1
2026-04-14T03:58:22.059553Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2026-04-14T03:58:22.080989Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2026-04-14T03:58:22.178496Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
thread 'main' panicked at /home/gabe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.7.1/src/tensor.rs:328:18:
Tensors contains NaN.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

After

cargo run --release --features cuda,flash-attn --bin mistralrs -- bench -m mistralai/Mistral-7B-Instruct-v0.1 --from-uqff /home/gabe/mistral-7b-instruct-v0.1.Q4_K_M.gguf --paged-attn off --prompt-len 512 --gen-len 512 --iterations 5 --warmup 1

(Completes all 5/5 generation iterations with no NaN panics or numerical-instability warnings across GPU threading models.)

@glaziermag force-pushed the fix/bench-iteration-leak branch from c77e476 to 816eeda, April 14, 2026 04:59
@glaziermag force-pushed the fix/bench-iteration-leak branch from 816eeda to ea333da, April 14, 2026 05:00