feat: add NIAH eval to cache_rate_tester #9
Open
ziqifan617 wants to merge 7 commits
Conversation
Thinking models (e.g. Qwen3) consume their entire 200-token budget in <think> reasoning before writing the answer, causing eval prompts to always fail. Separate eval_output_tokens (default 512) lets eval prompts get a larger budget without inflating the output budget for regular perf-measurement requests. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Aggregate eval pass/fail across all requests (not just peak concurrency) so every NIAH probe in the run contributes to the summary number. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Sustained mode builds AggregatedMetrics from period DataFrames rather than calling calculate_aggregated_metrics, so eval fields were never populated. Sum eval_total and eval_passed across all periods. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Show eval_accuracy in the terminal at the end of the run for both fixed and sustained modes. Print in warning color when accuracy < 100%. Show '-' when eval is not enabled (--eval-mode none). Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
run_command_*.sh was missing --eval-mode, --eval-fraction, --eval-passkey-digits, --eval-output-tokens since save_run_command was not updated when the eval flags were added. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Qwen3-32B at high concurrency (c512 TP=4) exhausted the 512-token budget mid-think, causing 2 consistent failures on the same prompt. 1024 gives thinking models sufficient headroom. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Motivation
At high concurrency, KV cache corruption or block-reuse bugs can cause the model to emit wrong or garbage
output — a failure mode that is completely invisible in TTFT and throughput metrics alone. This PR makes it
detectable with zero external dependencies and no separate eval phase.
What this adds
An optional in-band output-correctness eval for `cache_rate_tester.py`, based on the passkey retrieval task introduced in Landmark Attention: Random-Access Infinite Context Length for Transformers (Mohtashami & Jaggi, 2023).
When `--eval-mode niah` is enabled, a configurable fraction of working-set prompts (default 10%) are replaced with needle-in-a-haystack probes: a coherent English haystack of the same context length as the rest of the test, with a random N-digit passkey embedded at a random position (10–90% depth), followed by a retrieval question. Eval prompts are interleaved with the regular synthetic prompts throughout the timed run, so they exercise the same KV cache behavior and the same concurrency pressure as every other request.
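For concreteness, here is a minimal sketch of how such a probe can be constructed. Everything in it is illustrative rather than the PR's actual implementation: the function name and token-to-word ratio are assumptions, and the real haystack is coherent English text rather than repeated filler.

```python
import random

def make_niah_probe(context_tokens: int, passkey_digits: int = 7,
                    tokens_per_word: float = 1.3) -> tuple[str, str]:
    """Build one needle-in-a-haystack probe; returns (prompt, expected_passkey)."""
    passkey = "".join(random.choice("0123456789") for _ in range(passkey_digits))

    # Filler sized to roughly match the context length of the regular prompts.
    filler = "The grass is green. The sky is blue. The sun is yellow.".split()
    n_words = max(1, int(context_tokens / tokens_per_word))
    haystack = (filler * (n_words // len(filler) + 1))[:n_words]

    # Embed the needle at a random depth between 10% and 90%.
    cut = int(len(haystack) * random.uniform(0.10, 0.90))
    prompt = (" ".join(haystack[:cut])
              + f" The pass key is {passkey}. Remember it. "
              + " ".join(haystack[cut:])
              + "\n\nWhat is the pass key? Answer with the digits only.")
    return prompt, passkey
```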
Each response is graded by substring match against the expected passkey. With greedy decoding and a healthy
cache, any modern model trivially retrieves the passkey — a sustained drop below 100% accuracy is a strong
signal that something is wrong with the inference path, not noise.
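The grading step is a plain substring check, which makes it robust to `<think>` preambles and surrounding prose: only the exact digit string has to appear somewhere in the response. A sketch (the function name is hypothetical; field names are chosen to mirror the CSV columns described under Output below):

```python
def grade_niah_response(response: str, expected_passkey: str) -> dict:
    """Grade one eval probe by plain substring match (illustrative sketch)."""
    return {
        "eval_expected": expected_passkey,
        "eval_passed": expected_passkey in response,  # exact digit string anywhere
        "eval_response_excerpt": response[:300],      # first 300 chars, for debugging
    }
```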
We use 7-digit passkeys by default, giving a false-positive rate of ~10⁻⁷ per response (with 10⁷ equally likely passkeys, the chance that an unrelated response happens to contain the expected digit string is about one in ten million) — effectively zero across hundreds of eval probes per run.
New flags
- `--eval-mode {none,niah}` (default: `none`) — `none` preserves all existing behavior.
- `--eval-fraction FLOAT` (default: `0.1`) — fraction of working-set prompts replaced with NIAH probes.
- `--eval-passkey-digits INT` (default: `7`) — number of digits in the random passkey.
- `--eval-output-tokens INT` (default: `512`) — output-token budget for eval prompts. Thinking models spend tokens on `<think>` reasoning before writing the answer, so eval prompts need a larger budget than regular requests without inflating the perf-measurement budget.
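For orientation, the flags might be wired up roughly as below. This is an argparse sketch under the assumption that `cache_rate_tester.py` uses argparse; it is not the PR's literal code.

```python
import argparse

parser = argparse.ArgumentParser()  # sketch only; the real script defines many more flags
parser.add_argument("--eval-mode", choices=["none", "niah"], default="none",
                    help="'none' preserves all existing behavior")
parser.add_argument("--eval-fraction", type=float, default=0.1,
                    help="fraction of working-set prompts replaced with NIAH probes")
parser.add_argument("--eval-passkey-digits", type=int, default=7,
                    help="number of digits in the random passkey")
parser.add_argument("--eval-output-tokens", type=int, default=512,
                    help="output-token budget for eval prompts; thinking models need "
                         "headroom for <think> reasoning")
args = parser.parse_args()
```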
Output
`detailed_results_*.csv` — per-request columns on every eval probe:
- `eval_expected` — the passkey that should appear in the response
- `eval_passed` — `True`/`False`
- `eval_response_excerpt` — first 300 chars of the response (for debugging failures)

`sustained_periods_*.csv` — per 30-second assessment window:
- `eval_total`, `eval_passed`, `eval_accuracy` — shows when during the run accuracy degraded

`summary_*.csv` — single aggregate across all requests in the run:
- `eval_total`, `eval_passed`, `eval_accuracy`
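A quick way to slice a finished run is to aggregate the per-request CSV directly. A sketch assuming pandas and the column layout above; the filename is a placeholder:

```python
import pandas as pd

# Placeholder filename; real runs write detailed_results_<timestamp>.csv
df = pd.read_csv("detailed_results_example.csv")

probes = df[df["eval_expected"].notna()]               # only the NIAH probe rows
passed = probes["eval_passed"].astype(str).eq("True")  # robust to bool/str serialization
print(f"eval probes: {len(probes)}, accuracy: {passed.mean():.1%}")

# Inspect failures together with their response excerpts
failures = probes.loc[~passed, ["eval_expected", "eval_response_excerpt"]]
print(failures.to_string(index=False))
```

Sample run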