
Inference Server Cache Performance Testing Suite

A toolkit for testing and benchmarking LLM inference servers and their KV Cache offload performance.

Quick Start

# Install dependencies (using uv - recommended)
uv pip install -r requirements.txt

# Or with pip
pip install -r requirements.txt

# Run a simple single-prompt test
uv run python single_prompt_tester.py \
    --api-endpoint http://localhost:8000 \
    --min-tokens 1000 \
    --max-tokens 128000 \
    --output-dir results

# Test various cache hit rates (sustained mode - default)
uv run python cache_rate_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 8000 32000 64000 \
    --working-set-size 2000000 \
    --max-ttft 2.0 \
    --output-dir output

# Test performance across different memory tiers
uv run python working_set_tester.py \
    --api-endpoint http://localhost:8000 \
    --context-sizes 30000 \
    --min-working-set-size 100000 \
    --max-working-set-size 5000000 \
    --working-set-increments 10

# Replay real agentic coding traces
uv run python trace_replay_tester.py \
    --api-endpoint http://localhost:8000 \
    --trace-directory traces \
    --output-dir output \
    --start-users 5 \
    --max-users 50 \
    --max-ttft 2.0 \
    --test-duration 300

Core Testing Tools

| Tool | Purpose | Documentation |
| --- | --- | --- |
| single_prompt_tester.py | Baseline cache performance with cold start vs. cached comparisons | Docs |
| cache_rate_tester.py | Performance at various cache hit rates (0-100%) with a fixed working set | Docs |
| working_set_tester.py | Performance across varying working set sizes for memory tier analysis | Docs |
| trace_replay_tester.py | Replay real agentic coding traces with realistic cache patterns and timing (beta) | Docs |

Included Traces

The traces/ directory contains 739 traces with 59,204 requests, including 19 traces with nested sub-agent conversations (70 sub-agents total). Generated from real Claude Code sessions captured via claude-code-proxy as part of research at WEKA on the Augmented Memory Grid product. See agentic-coding-analysis for the trace generation tools.

Traces include:

  • Local hash_ids (hash_id_scope: "local") — hash IDs are scoped per conversation, with sub-agents sharing the parent's namespace
  • Timing breakdown — api_time (server processing) and think_time (client delay) per request, enabling flexible replay timing strategies (see the loading sketch below)
  • Sub-agent nesting — sub-agents embedded in parent traces with their own tool/system tokens and request arrays
  • One request per turn — no streaming/non-streaming pairing (proxy bug fixed upstream)

All traces are fully anonymized — no conversation IDs, timestamps, or real agent IDs.
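
A minimal sketch of loading a trace and walking its requests. The field names (tool_tokens, system_tokens, api_time, think_time, hash_ids) come from the description above, but the file name and exact schema here are assumptions; check the Trace Replay Documentation for the real layout.

import json
from pathlib import Path

# Hypothetical file name and top-level layout (assumption).
trace = json.loads(Path("traces/example_trace.json").read_text())

print("tool tokens:  ", trace["tool_tokens"])    # shared tools prefix
print("system tokens:", trace["system_tokens"])  # system prompt

for i, req in enumerate(trace["requests"]):
    # Per-request timing breakdown: server processing vs. client think time.
    print(f"request {i}: api_time={req['api_time']}s "
          f"think_time={req['think_time']}s "
          f"blocks={len(req['hash_ids'])}")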

Trace statistics:

| Metric | Min | P25 | Median | P75 | Max | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| Requests per trace | 3 | 19 | 48 | 101 | 1,178 | 80 |
| Starting input tokens | 6,343 | 18,726 | 24,775 | 105,061 | 851,858 | 92,473 |
| Ending input tokens | 11,053 | 94,686 | 137,529 | 208,960 | 894,371 | 181,659 |
| Input tokens per request | 43,475 (P10) | | 109,903 (P50) | | 300,118 (P90) | 139,924 |
| Output tokens per request | 103 (P10) | | 218 (P50) | | 937 (P90) | 446 |
| Conversation duration (min) | 0.1 | 23 | 62 | 141 | 4,952 | 240 |
| Think time per request (s) | 0 | 6 | 10 | 30 | 78,638 | 199 |
| API time per request (s) | 1.2 | 4.4 | 6.5 | 11.0 | 300.0 | 10.2 |

Cache hit rates

Cache rates are computed from hash_id overlap between consecutive requests within each conversation.
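
For illustration, that overlap can be computed with plain set intersection; a minimal sketch, noting that the exact definition behind the table (e.g., weighting hash blocks by token count) may differ:

def conversation_cache_rates(requests):
    """Hit rate per request: fraction of the request's hash_ids that also
    appeared in the immediately preceding request of the same conversation."""
    rates = []
    prev = set()
    for req in requests:
        ids = set(req["hash_ids"])
        if prev and ids:  # the first request of a conversation is a full miss
            rates.append(len(ids & prev) / len(ids))
        prev = ids
    return rates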

| Scenario | Min | P25 | Median | P75 | Max | Mean |
| --- | --- | --- | --- | --- | --- | --- |
| Conversation-only (no warm prefix) | 61% | 92% | 96% | 97% | 100% | 93% |
| With 75% warm prefix | 67% | 93% | 96% | 97% | 100% | 94% |

System and tools prefix impact: Each trace includes tool_tokens (median ~10,500) and system_tokens (median ~2,600) — a combined prefix of ~15,000 tokens (median), representing ~60% of the median first-request input. In a production deployment this prefix is shared across all conversations and stays warm in the KV cache, so the first request of any conversation gets a significant cache hit that isn't captured by the conversation-only rate above. The "75% warm prefix" row models 75% of the tools+system prefix being cached on the first request; this primarily benefits short conversations, where the first-request miss dominates. For longer conversations the prefix is a small fraction of total tokens and the impact is minimal (~1 percentage point).
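
A rough worked example of the first-request effect, using the median figures above (an illustration, not necessarily the exact model behind the table):

prefix_tokens = 15_000         # median tools + system prefix (from above)
first_request_tokens = 24_775  # median starting input tokens (from above)

# Conversation-only: the first request hits nothing in the cache.
# With 75% of the shared prefix warm, it already hits on roughly:
hit = 0.75 * prefix_tokens / first_request_tokens
print(f"{hit:.0%}")  # ~45% of the first request's input tokens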

Utility Scripts

| Tool | Purpose |
| --- | --- |
| generate_index.py | Generate HTML dashboard from test results |
| regenerate_graphs.py | Regenerate graphs from existing CSV data |
| combine_graphs.py | Combine results from multiple runs for comparison |
| kv_cache_size.py | Calculate KV cache memory requirements (see the sketch below) |

See Utilities Documentation for details.
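
For context on kv_cache_size.py, here is a minimal sketch of the standard KV-cache sizing arithmetic; the script's actual inputs and model database may differ, and the example model config below is an assumption for illustration.

def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes=2):
    # Factor of 2 for the separate K and V tensors per layer;
    # dtype_bytes=2 corresponds to FP16/BF16.
    return 2 * layers * kv_heads * head_dim * dtype_bytes * tokens

# Example: a Llama-3.1-70B-like config (80 layers, 8 KV heads via GQA,
# head_dim 128) at BF16:
print(kv_cache_bytes(1, layers=80, kv_heads=8, head_dim=128))  # 327,680 B (~320 KiB/token)
print(kv_cache_bytes(128_000, 80, 8, 128) / 2**30)             # ~39 GiB per full 128k context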

Installation

We recommend using uv for fast, reliable Python package management.

# Clone the repository
git clone https://github.com/callanjfox/kv-cache-tester.git
cd kv-cache-tester

# Install dependencies with uv (recommended)
uv pip install -r requirements.txt

# Or with pip
pip install -r requirements.txt

# Verify installation
uv run python single_prompt_tester.py --help

Understanding the Metrics

  • TTFT (Time To First Token): Latency before the first token is generated; the primary metric for cache effectiveness (see the sketch after this list)
  • TTLT (Time To Last Token): Total request completion time
  • ITL (Inter-Token Latency): Time between generated tokens
  • Input/Output Throughput: Tokens processed/generated per second
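
A sketch of how these metrics fall out of per-token arrival timestamps on a streaming response; illustrative only, as the testers' exact accounting may differ.

def latency_metrics(request_start, token_times):
    """Derive the metrics above from per-token arrival timestamps."""
    ttft = token_times[0] - request_start    # Time To First Token
    ttlt = token_times[-1] - request_start   # Time To Last Token
    itl = [b - a for a, b in zip(token_times, token_times[1:])]  # Inter-Token Latency
    output_tps = len(token_times) / ttlt     # output throughput (tokens/s)
    return ttft, ttlt, itl, output_tps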

Testing Methodology

For a detailed guide on testing methodology and how to use these tools effectively, see: Evaluating Management of KV Cache within an Inference System

Note: This guide will be updated soon with the latest methodology covering trace replay testing and multi-endpoint support.

Real-World Trace Replay

Beta: trace_replay_tester.py is in beta and under active development; expect updates.

The trace_replay_tester.py tool replays real agentic coding traces to benchmark inference performance with realistic cache hit patterns, timing, and message structures. See the Trace Replay Documentation for full details.
