Fix: Resolve NaN Logit Memory Leak During Sequential Benchmarks (Issue #2095) by glaziermag · Pull Request #2108 · EricLBuehler/mistral.rs

glaziermag · 2026-04-14T04:51:05Z

Fixes memory corruption and sequence-state leakage across sequential `mistralrs bench` iterations (#2095).

Cause

When long generations are benchmarked offline (e.g., `--gen-len` >= 512) over sequential iterations, the engine accumulates orphaned sequences. The mistralrs engine reuses the same context for every throughput loop on the benchmark thread. As a result, KV caches from earlier iterations are never evicted. As sequences pile up, that stale state runs past its bounds into the multinomial sampler, and the engine core eventually panics.
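The following toy model (not mistral.rs code) shows the mechanism more concretely. It uses a fixed-capacity KV cache that is shared across benchmark rounds and never evicted. The `ToyKvCache` type, the `first_failing_round` helper and the 4096-slot capacity are all hypothetical. The explicit bounds check stands in for the unchecked overrun described above. In #2095, that overrun surfaced as a NaN panic rather than as a clean error.

```rust
/// Hypothetical fixed-capacity KV cache shared by every sequence the engine admits.
/// The capacity and the explicit bounds check are illustrative only; the real
/// failure in #2095 showed up as a NaN panic, not a clean error.
struct ToyKvCache {
    capacity: usize,
    used: usize,
}

impl ToyKvCache {
    /// Admit a sequence that needs `tokens` slots, or report that it would
    /// write past the end of the cache.
    fn admit(&mut self, tokens: usize) -> Result<(), String> {
        if self.used + tokens > self.capacity {
            return Err(format!(
                "needs {} slots but only {} of {} are free",
                tokens,
                self.capacity - self.used,
                self.capacity
            ));
        }
        self.used += tokens;
        Ok(())
    }
}

/// Runs `rounds` benchmark rounds against one cache that is never evicted
/// between rounds. Returns the 1-based round that fails, if any.
fn first_failing_round(capacity: usize, tokens_per_round: usize, rounds: usize) -> Option<usize> {
    let mut cache = ToyKvCache { capacity, used: 0 };
    for round in 1..=rounds {
        if let Err(msg) = cache.admit(tokens_per_round) {
            eprintln!("round {round}: {msg}");
            return Some(round);
        }
        // Nothing evicts the finished sequence here: this is the leak.
    }
    None
}

fn main() {
    // Hypothetical 4096-slot cache; 512 prompt + 512 generated tokens per round;
    // 1 warmup + 5 measured rounds (matching --warmup 1 --iterations 5).
    match first_failing_round(4096, 1024, 6) {
        Some(r) => println!("overflow on round {r}"),
        None => println!("all rounds fit"),
    }
}
```

With 1024 tokens per round and 6 rounds, round 5 is the first that no longer fits. In this toy, a larger cache only postpones the failure. Evicting between rounds is what keeps usage bounded.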

Changes

`mistralrs-cli/src/commands/bench.rs`: Sends `Request::TerminateAllSeqsNextStep` to the engine immediately after the warmup block and between consecutive benchmark iterations. This flushes the sequence state from one iteration before the next one starts.
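The following minimal sketch illustrates the flush-between-iterations pattern. It uses a toy engine thread fed by a `std::sync::mpsc` channel. Only the `TerminateAllSeqsNextStep` variant name comes from this PR. The `Request` enum is otherwise a hypothetical stand-in: its other variants, the payloads, the `engine_loop` and `run_bench` helpers, and the 1024-token figure are all invented for illustration. The toy also drops sequences immediately rather than on the next engine step.

```rust
use std::sync::mpsc::{self, Receiver, Sender};
use std::thread;

/// Hypothetical request type for this sketch only. The variant name
/// `TerminateAllSeqsNextStep` mirrors the PR; everything else is invented.
enum Request {
    /// Start a sequence occupying `tokens` KV-cache slots; the engine replies
    /// with the total slots in use after admitting it.
    Generate { tokens: usize, reply: Sender<usize> },
    /// Drop every live sequence (immediately, in this toy).
    TerminateAllSeqsNextStep,
    /// Stop the engine thread.
    Shutdown,
}

/// Toy engine loop that tracks the KV-cache usage of live sequences.
fn engine_loop(rx: Receiver<Request>) {
    let mut live: Vec<usize> = Vec::new();
    while let Ok(req) = rx.recv() {
        match req {
            Request::Generate { tokens, reply } => {
                live.push(tokens);
                let in_use: usize = live.iter().sum();
                let _ = reply.send(in_use);
            }
            Request::TerminateAllSeqsNextStep => live.clear(),
            Request::Shutdown => break,
        }
    }
}

/// Runs `warmup + iterations` rounds and returns the usage reported for each
/// round. With `flush`, a terminate request follows every round.
fn run_bench(warmup: usize, iterations: usize, tokens: usize, flush: bool) -> Vec<usize> {
    let (tx, rx) = mpsc::channel();
    let engine = thread::spawn(move || engine_loop(rx));

    let mut usage = Vec::new();
    for _ in 0..(warmup + iterations) {
        let (reply_tx, reply_rx) = mpsc::channel();
        tx.send(Request::Generate { tokens, reply: reply_tx }).unwrap();
        usage.push(reply_rx.recv().unwrap());
        if flush {
            // A single Sender delivers in order, so this is handled before
            // the next round's Generate request.
            tx.send(Request::TerminateAllSeqsNextStep).unwrap();
        }
    }

    tx.send(Request::Shutdown).unwrap();
    engine.join().unwrap();
    usage
}

fn main() {
    // 512 prompt + 512 generated tokens per round, 1 warmup + 5 iterations.
    println!("without flush: {:?}", run_bench(1, 5, 1024, false));
    println!("with flush:    {:?}", run_bench(1, 5, 1024, true));
}
```

Without the flush, the reported usage grows every round (1024, 2048, ... 6144). With the flush, it stays at 1024. In this toy, the ordering guarantee of a single channel sender is what makes a terminate request sent between rounds take effect before the next round starts.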

Empirical Baseline (GCP `g2-standard-32`, NVIDIA L4)

Before

Running the pipeline on an L4 instance with native f16 or GGUF weights:

```bash
cargo run --release --features cuda,flash-attn --bin mistralrs -- bench -m mistralai/Mistral-7B-Instruct-v0.1 --from-uqff /home/gabe/mistral-7b-instruct-v0.1.Q4_K_M.gguf --paged-attn off --prompt-len 512 --gen-len 512 --iterations 5 --warmup 1
```

```text
2026-04-14T03:58:22.059553Z  INFO mistralrs_core::pipeline::normal: Prompt chunk size is 1024.
2026-04-14T03:58:22.080989Z  INFO mistralrs_core::utils::normal: Detected minimum CUDA compute capability 8.9
2026-04-14T03:58:22.178496Z  INFO mistralrs_core::utils::normal: DType selected is BF16.
thread 'main' panicked at /home/gabe/.cargo/registry/src/index.crates.io-6f17d22bba15001f/candle-core-0.7.1/src/tensor.rs:328:18:
Tensors contains NaN.
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

After

```bash
cargo run --release --features cuda,flash-attn --bin mistralrs -- bench -m mistralai/Mistral-7B-Instruct-v0.1 --from-uqff /home/gabe/mistral-7b-instruct-v0.1.Q4_K_M.gguf --paged-attn off --prompt-len 512 --gen-len 512 --iterations 5 --warmup 1
```

(All 5 benchmark iterations complete without any logged numerical instability or NaN panics.)

glaziermag force-pushed the fix/bench-iteration-leak branch from c77e476 to 816eeda on April 14, 2026 04:59

fix(bench): gracefully flush sequences between benchmark iterations

ea333da

glaziermag force-pushed the fix/bench-iteration-leak branch from 816eeda to ea333da on April 14, 2026 05:00

glaziermag wants to merge 1 commit into EricLBuehler:master from glaziermag:fix/bench-iteration-leak
