feat: add NIAH eval to cache_rate_tester #9

Open

ziqifan617 wants to merge 7 commits into callanjfox:master from ziqifan617:ziqif/add-eval

Conversation

ziqifan617 commented May 15, 2026

Motivation

At high concurrency, KV cache corruption or block-reuse bugs can cause the model to emit wrong or garbage
output — a failure mode that is completely invisible in TTFT and throughput metrics alone. This PR makes it
detectable with zero external dependencies and no separate eval phase.

What this adds

An optional in-band output correctness eval for cache_rate_tester.py, based on the passkey retrieval task
introduced in Landmark Attention: Random-Access Infinite Context Length for Transformers (Mohtashami & Jaggi,
2023).

When --eval-mode niah is enabled, a configurable fraction of working-set prompts (default 10%) are replaced
with needle-in-a-haystack probes: a coherent English haystack of the same context length as the rest of the
test, with a random N-digit passkey embedded at a random position (10–90% depth), followed by a retrieval
question. Eval prompts are interleaved with the regular synthetic prompts throughout the timed run, so they
exercise the same KV cache behavior and the same concurrency pressure as every other request.
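
For concreteness, here is a minimal sketch of how such a probe can be assembled. The filler sentence, the prompt wording, and the ~4-characters-per-token sizing are illustrative assumptions, not the exact strings this PR uses:

```python
import random

def make_niah_probe(context_tokens: int, passkey_digits: int = 7) -> tuple[str, str]:
    """Build a needle-in-a-haystack probe: filler context with a random
    passkey embedded at a random depth, followed by a retrieval question.
    Returns (prompt, expected_passkey)."""
    passkey = "".join(random.choices("0123456789", k=passkey_digits))
    needle = f" The pass key is {passkey}. Remember it. "

    # Illustrative filler; the real tester uses coherent English text.
    filler = "The grass is green. The sky is blue. The sun is bright. "
    # Roughly size the haystack to the target context (~4 chars per token).
    sentences = [filler] * max(1, context_tokens * 4 // len(filler))

    # Embed the needle at a random depth between 10% and 90%.
    depth = random.uniform(0.10, 0.90)
    sentences.insert(int(len(sentences) * depth), needle)

    question = "\n\nWhat is the pass key mentioned above? Answer with the digits only."
    return "".join(sentences) + question, passkey
```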

Each response is graded by substring match against the expected passkey. With greedy decoding and a healthy
cache, any modern model trivially retrieves the passkey — a sustained drop below 100% accuracy is a strong
signal that something is wrong with the inference path, not noise.

We use 7-digit passkeys by default, giving a false-positive rate of ~10⁻⁷ per response — effectively zero
across hundreds of eval probes per run.
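
Grading is a plain substring check, roughly:

```python
def grade_response(response: str, expected_passkey: str) -> bool:
    # Pass iff the expected passkey appears anywhere in the response.
    # A wrong response only matches by chance if it happens to contain
    # that exact 7-digit string, hence the ~1e-7 false-positive rate.
    return expected_passkey in response
```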

New flags

| Flag | Default | Description |
| --- | --- | --- |
| `--eval-mode {none,niah}` | `none` | Enable NIAH eval. `none` preserves all existing behavior. |
| `--eval-fraction FLOAT` | `0.1` | Fraction of working-set prompts replaced with eval probes. |
| `--eval-passkey-digits INT` | `7` | Digits in the random passkey. Higher = lower false-positive rate. |
| `--eval-output-tokens INT` | `512` | Output token budget for eval prompts. Thinking models (e.g. Qwen3) consume 200–400 tokens in `<think>` reasoning before writing the answer, so eval prompts need a larger budget than regular requests, without inflating the perf-measurement budget. |
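
For example, enabling the eval with the defaults spelled out (the tester's non-eval flags are omitted here):

```
python cache_rate_tester.py \
    --eval-mode niah \
    --eval-fraction 0.1 \
    --eval-passkey-digits 7 \
    --eval-output-tokens 512
```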

Output

detailed_results_*.csv — per-request columns on every eval probe (see the analysis sketch after this list):

  • eval_expected — the passkey that should appear in the response
  • eval_passed — True / False
  • eval_response_excerpt — first 300 chars of the response (for debugging failures)

sustained_periods_*.csv — per 30-second assessment window:

  • eval_total, eval_passed, eval_accuracy — shows when during the run accuracy degraded

summary_*.csv — single aggregate across all requests in the run:

  • eval_total, eval_passed, eval_accuracy
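
A quick way to slice the per-request CSV for failed probes — the column names come from the list above; the glob pattern and the boolean normalization are assumptions about how the CSV is serialized:

```python
import glob
import pandas as pd

# Load the per-request results for the most recent run.
path = sorted(glob.glob("detailed_results_*.csv"))[-1]
df = pd.read_csv(path)

# Keep only NIAH probe rows; eval_passed may parse as bool or as the
# string "True"/"False" depending on how the CSV was written.
probes = df[df["eval_expected"].notna()].copy()
probes["passed"] = probes["eval_passed"].astype(str).eq("True")

print(f"accuracy: {probes['passed'].mean():.1%} "
      f"({probes['passed'].sum()}/{len(probes)})")
for _, row in probes[~probes["passed"]].iterrows():
    print(row["eval_expected"], "->", str(row["eval_response_excerpt"])[:120])
```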

Eval grading only activates at cache_hit_rate=100. At mixed cache rates the tester appends random
gibberish to each prompt to drive the desired miss fraction, which would clobber the retrieval question; this
is enforced automatically.
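
A minimal sketch of that guard — the flag and attribute names are assumptions, and the actual enforcement in cache_rate_tester.py may differ:

```python
# Hypothetical guard; whether the tester errors out or silently disables
# grading at mixed cache rates is an implementation detail.
if args.eval_mode == "niah" and args.cache_hit_rate != 100:
    print("NIAH eval requires cache_hit_rate=100; disabling eval grading.")
    args.eval_mode = "none"
```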

Sample run

   Period 3: Running at concurrency 32 for 30.0s
      Prefills: 7, Contributing: 7, Launched: 9
      Input: 25,641 tok/s | Output: 1,510 tok/s (streaming-based)
      Avg TTFT: 12.676s | P95 TTFT: 16.847s | P99 TTFT: 17.203s
      Avg ITL: 22.18ms | avg_output_tokens: 374.1 tok/s
      Eval: 7/7 passed (100.0%)
      Concurrency 32 = max-concurrency → MAX_REACHED

    [... periods 4–18 omitted for brevity, all Eval: X/X passed (100.0%) ...]

    Period 19: Running at concurrency 32 for 30.0s
      Prefills: 2, Contributing: 2, Launched: 3
      Input: 21,463 tok/s | Output: 1,265 tok/s (streaming-based)
      Avg TTFT: 16.441s | P95 TTFT: 19.832s | P99 TTFT: 20.104s
      Avg ITL: 24.61ms | avg_output_tokens: 391.2 tok/s
      Eval: 2/2 passed (100.0%)
      Concurrency 32 = max-concurrency → MAX_REACHED

  ================================================================================
  ✓ All tests complete!
  ================================================================================
  Results saved to: /tmp/cache_rate_qwen3-32b-tp1-kvbm-v2-eval_c32_g1g2

  Total continuous tests completed: 1

  ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════
  Final Summary - All Test Results
  ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════
     Context   Cache%   Requests    Input Tok   Output Tok      Input/s     Output/s   Avg TTFT    Conc   EvalAcc
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
      32,000     100%         437       12.71M        0.75M      23,309        1,372     13.410s      32    100.0%
  ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       TOTAL              437          12.71M        0.75M
  ══════════════════════════════════════════════════════════════════════════════════════════════════════════════════

ziqifan617 marked this pull request as draft May 15, 2026 00:17

ziqifan617 and others added 2 commits May 14, 2026 18:17

Thinking models (e.g. Qwen3) consume their entire 200-token budget in
<think> reasoning before writing the answer, causing eval prompts to
always fail. Separate eval_output_tokens (default 512) lets eval prompts
get a larger budget without inflating the output budget for regular
perf-measurement requests.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Aggregate eval pass/fail across all requests (not just peak concurrency)
so every NIAH probe in the run contributes to the summary number.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

ziqifan617 marked this pull request as ready for review May 15, 2026 02:54

ziqifan617 and others added 4 commits May 15, 2026 08:56

Sustained mode builds AggregatedMetrics from period DataFrames rather
than calling calculate_aggregated_metrics, so eval fields were never
populated. Sum eval_total and eval_passed across all periods.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Shows eval_accuracy in the terminal at end of run for both fixed and
sustained mode. Prints in warning color when accuracy < 100%. Shows '-'
when eval is not enabled (--eval-mode none).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

run_command_*.sh was missing --eval-mode, --eval-fraction,
--eval-passkey-digits, --eval-output-tokens since save_run_command
was not updated when the eval flags were added.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Qwen3-32B at high concurrency (c512 TP=4) exhausted the 512-token
budget mid-think, causing 2 consistent failures on the same prompt.
1024 gives thinking models sufficient headroom.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>