feat: add NIAH eval to cache_rate_tester #9
Open
ziqifan617 wants to merge 7 commits
Conversation
Thinking models (e.g. Qwen3) consume their entire 200-token budget in <think> reasoning before writing the answer, causing eval prompts to always fail. Separate eval_output_tokens (default 512) lets eval prompts get a larger budget without inflating the output budget for regular perf-measurement requests. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Aggregate eval pass/fail across all requests (not just peak concurrency) so every NIAH probe in the run contributes to the summary number. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sustained mode builds AggregatedMetrics from period DataFrames rather than calling calculate_aggregated_metrics, so eval fields were never populated. Sum eval_total and eval_passed across all periods. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Show eval_accuracy in the terminal at the end of the run for both fixed and sustained modes. Print in warning color when accuracy < 100%. Show '-' when eval is not enabled (--eval-mode none). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
run_command_*.sh was missing --eval-mode, --eval-fraction, --eval-passkey-digits, --eval-output-tokens since save_run_command was not updated when the eval flags were added. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Qwen3-32B at high concurrency (c512 TP=4) exhausted the 512-token budget mid-think, causing 2 consistent failures on the same prompt. 1024 gives thinking models sufficient headroom. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Motivation
At high concurrency, KV cache corruption or block-reuse bugs can cause the model to emit wrong or garbage
output — a failure mode that is completely invisible in TTFT and throughput metrics alone. This PR makes it
detectable with zero external dependencies and no separate eval phase.
What this adds
An optional in-band output-correctness eval for `cache_rate_tester.py`, based on the passkey retrieval task introduced in Landmark Attention: Random-Access Infinite Context Length for Transformers (Mohtashami & Jaggi, 2023).
When `--eval-mode niah` is enabled, a configurable fraction of working-set prompts (default 10%) are replaced with needle-in-a-haystack probes: a coherent English haystack of the same context length as the rest of the test, with a random N-digit passkey embedded at a random position (10–90% depth), followed by a retrieval question. Eval prompts are interleaved with the regular synthetic prompts throughout the timed run, so they exercise the same KV cache behavior and the same concurrency pressure as every other request.
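For concreteness, here is a minimal sketch of how such a probe can be constructed. Everything in it is illustrative rather than the PR's actual implementation: the function name and token-to-word ratio are assumptions, and the real haystack is coherent English text rather than repeated filler.

```python
import random

def make_niah_probe(context_tokens: int, passkey_digits: int = 7,
                    tokens_per_word: float = 1.3) -> tuple[str, str]:
    """Build one needle-in-a-haystack probe; returns (prompt, expected_passkey)."""
    passkey = "".join(random.choice("0123456789") for _ in range(passkey_digits))

    # Filler sized to roughly match the context length of the regular prompts.
    filler = "The grass is green. The sky is blue. The sun is yellow.".split()
    n_words = max(1, int(context_tokens / tokens_per_word))
    haystack = (filler * (n_words // len(filler) + 1))[:n_words]

    # Embed the needle at a random depth between 10% and 90%.
    cut = int(len(haystack) * random.uniform(0.10, 0.90))
    prompt = (" ".join(haystack[:cut])
              + f" The pass key is {passkey}. Remember it. "
              + " ".join(haystack[cut:])
              + "\n\nWhat is the pass key? Answer with the digits only.")
    return prompt, passkey
```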
Each response is graded by substring match against the expected passkey. With greedy decoding and a healthy
cache, any modern model trivially retrieves the passkey — a sustained drop below 100% accuracy is a strong
signal that something is wrong with the inference path, not noise.
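The grading step is a plain substring check, which makes it robust to `<think>` preambles and surrounding prose: only the exact digit string has to appear somewhere in the response. A sketch (the function name is hypothetical; field names are chosen to mirror the CSV columns described under Output below):

```python
def grade_niah_response(response: str, expected_passkey: str) -> dict:
    """Grade one eval probe by plain substring match (illustrative sketch)."""
    return {
        "eval_expected": expected_passkey,
        "eval_passed": expected_passkey in response,  # exact digit string anywhere
        "eval_response_excerpt": response[:300],      # first 300 chars, for debugging
    }
```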
We use 7-digit passkeys by default, giving a false-positive rate of ~10⁻⁷ per response (with 10⁷ equally likely passkeys, the chance that an unrelated response happens to contain the expected digit string is about one in ten million) — effectively zero across hundreds of eval probes per run.
New flags
- `--eval-mode {none,niah}` (default: `none`) — `none` preserves all existing behavior.
- `--eval-fraction FLOAT` (default: `0.1`) — fraction of working-set prompts replaced with NIAH probes.
- `--eval-passkey-digits INT` (default: `7`) — number of digits in the random passkey.
- `--eval-output-tokens INT` (default: `512`) — output-token budget for eval prompts. Thinking models spend tokens on `<think>` reasoning before writing the answer, so eval prompts need a larger budget than regular requests without inflating the perf-measurement budget.
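For orientation, the flags might be wired up roughly as below. This is an argparse sketch under the assumption that `cache_rate_tester.py` uses argparse; it is not the PR's literal code.

```python
import argparse

parser = argparse.ArgumentParser()  # sketch only; the real script defines many more flags
parser.add_argument("--eval-mode", choices=["none", "niah"], default="none",
                    help="'none' preserves all existing behavior")
parser.add_argument("--eval-fraction", type=float, default=0.1,
                    help="fraction of working-set prompts replaced with NIAH probes")
parser.add_argument("--eval-passkey-digits", type=int, default=7,
                    help="number of digits in the random passkey")
parser.add_argument("--eval-output-tokens", type=int, default=512,
                    help="output-token budget for eval prompts; thinking models need "
                         "headroom for <think> reasoning")
args = parser.parse_args()
```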
Output
`detailed_results_*.csv` — per-request columns on every eval probe:
- `eval_expected` — the passkey that should appear in the response
- `eval_passed` — `True`/`False`
- `eval_response_excerpt` — first 300 chars of the response (for debugging failures)

`sustained_periods_*.csv` — per 30-second assessment window:
- `eval_total`, `eval_passed`, `eval_accuracy` — shows when during the run accuracy degraded

`summary_*.csv` — single aggregate across all requests in the run:
- `eval_total`, `eval_passed`, `eval_accuracy`
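A quick way to slice a finished run is to aggregate the per-request CSV directly. A sketch assuming pandas and the column layout above; the filename is a placeholder:

```python
import pandas as pd

# Placeholder filename; real runs write detailed_results_<timestamp>.csv
df = pd.read_csv("detailed_results_example.csv")

probes = df[df["eval_expected"].notna()]               # only the NIAH probe rows
passed = probes["eval_passed"].astype(str).eq("True")  # robust to bool/str serialization
print(f"eval probes: {len(probes)}, accuracy: {passed.mean():.1%}")

# Inspect failures together with their response excerpts
failures = probes.loc[~passed, ["eval_expected", "eval_response_excerpt"]]
print(failures.to_string(index=False))
```

Sample run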