TurboQuant KV Cache Evaluation

A simple benchmarking framework that compares Baseline and TurboQuant-simulated KV cache strategies on CPU-only vLLM inference with TinyLlama-1.1B.

What this does: runs two vLLM configurations and measures latency (E2E p50/p95/p99), throughput (tok/s), RAM usage, and output quality (Token-F1).
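As a rough illustration of the latency and throughput metrics listed above, the sketch below computes nearest-rank percentiles and tokens per second from a list of per-request timings. The function names and the sample numbers are hypothetical; the repository's own compute_metrics.py may use a different percentile method.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(latencies_s, tokens_generated):
    """Summarize end-to-end latencies (seconds) and total generated tokens."""
    total_time = sum(latencies_s)  # assumes requests ran one at a time
    return {
        "p50_s": percentile(latencies_s, 50),
        "p95_s": percentile(latencies_s, 95),
        "p99_s": percentile(latencies_s, 99),
        "tok_per_s": tokens_generated / total_time,
    }


# Made-up timings for four sequential requests.
stats = summarize([1.0, 2.0, 3.0, 4.0], tokens_generated=800)
print(stats)
```

With only a handful of prompts, p95 and p99 both collapse to the slowest request, so tail percentiles are only meaningful on larger prompt sets.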

Overview

TurboQuant is a GPU-native KV cache quantization technique in vLLM that compresses Key-Value tensors to INT8/FP8, reducing memory by 4–8x. This repo simulates its effects using CPU-available knobs and documents all assumptions made.

Configuration	dtype	max_model_len	Memory budget	What it simulates
Baseline	float32	2048	90%	Standard vLLM, no optimization
TurboQuant	float16	1024	50%	FP16 ≈ 2x KV memory reduction; truncated context ≈ eviction pressure
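To make the memory effect of the two rows concrete, here is a back-of-the-envelope calculation of KV cache size for one sequence at full context length. The model shape (22 layers, 4 KV heads, head dimension 64) is an assumption based on the published TinyLlama-1.1B configuration, not something stated in this repo, and the helper name is hypothetical.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, dtype_bytes):
    """Bytes needed to hold K and V for one sequence of seq_len tokens."""
    # Factor 2 covers the separate Key and Value tensors in every layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * dtype_bytes


# Assumed TinyLlama-1.1B shape (grouped-query attention with 4 KV heads).
TINYLLAMA = dict(num_layers=22, num_kv_heads=4, head_dim=64)

baseline = kv_cache_bytes(**TINYLLAMA, seq_len=2048, dtype_bytes=4)    # float32
turboquant = kv_cache_bytes(**TINYLLAMA, seq_len=1024, dtype_bytes=2)  # float16

MIB = 1024 * 1024
print(f"baseline:   {baseline / MIB:.0f} MiB per full-length sequence")
print(f"turboquant: {turboquant / MIB:.0f} MiB per full-length sequence")
```

Under these assumptions the halved dtype and the halved context each contribute a factor of 2, which is why the simulated configuration needs about a quarter of the per-sequence KV memory.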

Key result on CPU: TurboQuant uses ~6% less RAM but has higher latency than the baseline. This is expected: FP16 has no hardware acceleration on most x86 CPUs.

Repository Structure

vLLMTurboQuantKVCacheCPU
│
├── setup.sh                  # Environment setup
├── Dockerfile
│
├── results/
│   ├── results.json          # Saved benchmark results
│   └── figures/              # Auto-generated PNG plots
│
├── scripts/
│   ├── main.py               # Main benchmark script
│   ├── config.yaml           # All parameters
│   ├── prompts.json          # Benchmark prompts (editable)
│   ├── generate_plots.py     # Generates plots from the results files
│   └── compute_metrics.py    # Detailed metrics
└── .gitignore

Prerequisites

Requirement	Minimum	Recommended
OS	WSL2 Ubuntu 22.04	WSL2 Ubuntu 22.04
RAM	8 GB	16 GB
CPU	Any x86_64 with AVX2	4+ cores
Disk	5 GB free	10 GB free
Python	3.12	3.12
conda	Any	Miniconda
GPU	Not required	Not required

Check AVX2 support (required by the vLLM CPU wheel):
grep -m1 avx2 /proc/cpuinfo && echo "AVX2 supported" || echo "AVX2 NOT found"
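The same check can be done from Python, which may be handy inside a setup or pre-flight script. This is a minimal sketch that parses the Linux /proc/cpuinfo format; the function names are hypothetical and not part of this repo.

```python
def has_cpu_flag(cpuinfo_text, flag="avx2"):
    """Return True if `flag` is listed on the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            _, _, values = line.partition(":")
            return flag in values.split()  # whole-token match
    return False


def host_has_avx2(path="/proc/cpuinfo"):
    """Read the host's cpuinfo (Linux/WSL2 only) and look for AVX2."""
    with open(path) as fh:
        return has_cpu_flag(fh.read())


# Shortened sample in the /proc/cpuinfo layout.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx avx2 fma\n"
print(has_cpu_flag(sample))
```

On a real machine, call host_has_avx2() instead of passing sample text.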

Installation

Pick one of two paths:

Option A — Local (WSL2 / Linux)

Requirements: WSL2 Ubuntu 22.04, 8 GB+ RAM, x86_64 CPU with AVX2 (any CPU after ~2013), Python 3.12, conda.

1. If using WSL2, increase the memory limit first (open PowerShell, not WSL):

notepad "$env:USERPROFILE\.wslconfig"

Paste this (adjust memory to ~75% of your total RAM):

[wsl2]
memory=12GB
swap=8GB
processors=4

Then restart WSL2:

wsl --shutdown
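The 75% rule of thumb from step 1 can be turned into a small helper that prints a .wslconfig body for a given machine. This is only a sketch: the function name and the default swap value are illustrative, taken from the example above rather than from any tool in this repo.

```python
def wslconfig(total_ram_gb, processors, swap_gb=8, fraction=0.75):
    """Build .wslconfig text that gives WSL2 roughly `fraction` of total RAM."""
    memory_gb = max(1, int(total_ram_gb * fraction))  # round down to whole GB
    return (
        "[wsl2]\n"
        f"memory={memory_gb}GB\n"
        f"swap={swap_gb}GB\n"
        f"processors={processors}\n"
    )


# A 16 GB machine with 4 cores reproduces the example values above.
print(wslconfig(total_ram_gb=16, processors=4))
```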

2. Run the setup script:

chmod +x setup.sh && ./setup.sh

This installs conda, creates the environment, installs the vLLM CPU wheel, and downloads the model (~2.2 GB). Takes 5–10 minutes.

3. Run the benchmark:

conda activate vllm-cpu
cd scripts
VLLM_CPU_KVCACHE_SPACE=4 python3 main.py

Results are saved to results/results.json.

Option B — Docker (easiest, no setup required)

docker build -t turboquant-bench .
docker run --rm -v $(pwd)/results:/app/results turboquant-bench

Results land in ./results/results.json. Done.

To pass CLI flags:

docker run --rm -v $(pwd)/results:/app/results turboquant-bench \
  python3 main.py --prompt-types short --output results/results_001.json

Common Commands

# Preview what will run without loading the model
cd scripts
python3 main.py --dry-run

# Run only short prompts (~5 min)
VLLM_CPU_KVCACHE_SPACE=4 python3 main.py --prompt-types short

# Run only one config
VLLM_CPU_KVCACHE_SPACE=4 python3 main.py --only baseline
VLLM_CPU_KVCACHE_SPACE=4 python3 main.py --only turboquant

# Custom output file
VLLM_CPU_KVCACHE_SPACE=4 python3 main.py --output results/run_01.json

# Generate plots (after results.json exists)
python3 generate_plots.py

Configuration

All parameters are in config.yaml — no code changes needed. Key settings:

Setting	Default	What it does
`model`	TinyLlama-1.1B-Chat-v1.0	Model to benchmark
`max_tokens`	200	Max tokens to generate per prompt
`temperature`	0.0	0 = deterministic/reproducible
`warmup_requests`	1	Warmup passes before measuring

The two configs being compared:

	Baseline	TurboQuant
dtype	float32	float16
max_model_len	2048	1024
Memory budget	90%	50%
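One way to picture how the two columns could map onto vLLM is shown below. This is a hedged sketch, not the code in main.py: it assumes the vLLM `LLM` constructor's `dtype` and `max_model_len` parameters, assumes the Hugging Face model ID `TinyLlama/TinyLlama-1.1B-Chat-v1.0`, and leaves the memory budget as a plain field because this README does not say which vLLM setting it drives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunConfig:
    name: str
    dtype: str
    max_model_len: int
    memory_budget: float  # fraction from the table; its vLLM mapping is not shown here


CONFIGS = {
    "baseline": RunConfig("baseline", "float32", 2048, 0.90),
    "turboquant": RunConfig("turboquant", "float16", 1024, 0.50),
}


def llm_kwargs(cfg, model="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    """Keyword arguments for vllm.LLM; only dtype and max_model_len are mapped."""
    return {"model": model, "dtype": cfg.dtype, "max_model_len": cfg.max_model_len}


def build_llm(cfg):
    """Create the engine for one config (requires vLLM to be installed)."""
    from vllm import LLM  # imported lazily so the sketch loads without vLLM

    return LLM(**llm_kwargs(cfg))


for cfg in CONFIGS.values():
    print(cfg.name, llm_kwargs(cfg))
```

Keeping both configurations in one table-like structure makes it easy to confirm that only the intended knobs differ between runs.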

To add or edit prompts, edit prompts.json. Each entry:

{
  "my_prompt": {
    "type": "short",
    "note": "Description",
    "text": "The actual prompt sent to the model"
  }
}

Valid types: short, long, multiturn.
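Before a long run it can help to check an edited prompts.json for typos. The validator below is a minimal sketch based on the entry format shown above; it treats `type` and `text` as required and `note` as optional, which is an assumption, and the function name is hypothetical.

```python
import json

VALID_TYPES = {"short", "long", "multiturn"}
REQUIRED_KEYS = {"type", "text"}  # assumption: "note" is optional


def validate_prompts(prompts):
    """Return a list of human-readable problems; an empty list means all entries look fine."""
    errors = []
    for name, entry in prompts.items():
        missing = REQUIRED_KEYS - entry.keys()
        if missing:
            errors.append(f"{name}: missing {sorted(missing)}")
        elif entry["type"] not in VALID_TYPES:
            errors.append(f"{name}: invalid type {entry['type']!r}")
    return errors


raw = """
{
  "my_prompt": {"type": "short", "note": "Description", "text": "Hello"},
  "bad_prompt": {"type": "medium", "text": "Hi"}
}
"""
problems = validate_prompts(json.loads(raw))
print(problems)
```

To check the real file, load it with json.load(open("prompts.json")) and pass the result to validate_prompts.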

Understanding Results

The benchmark prints a comparison table and one of three verdicts:

ADOPT — quality preserved, meaningful RAM/speed gains
CONDITIONAL — acceptable quality, validate on your own prompts
REJECT — quality degraded too much
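To illustrate how a verdict like these could be derived, the sketch below computes a whitespace-token F1 between a baseline output and a TurboQuant output, then applies thresholds. The thresholds (0.9, 0.7, 5% RAM saving) and function names are hypothetical placeholders; the actual cut-offs and tokenization used by main.py are not documented here.

```python
from collections import Counter


def token_f1(reference, candidate):
    """F1 over whitespace tokens, counting each shared token at most as often as it appears in both."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return float(ref == cand)  # 1.0 only when both are empty
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def verdict(f1, ram_saving_pct, adopt_f1=0.9, reject_f1=0.7, min_saving_pct=5.0):
    """Map quality and RAM saving to a verdict; all thresholds are illustrative."""
    if f1 < reject_f1:
        return "REJECT"
    if f1 >= adopt_f1 and ram_saving_pct >= min_saving_pct:
        return "ADOPT"
    return "CONDITIONAL"


f1 = token_f1("the cat sat", "the cat ran")
print(round(f1, 3), verdict(f1, ram_saving_pct=6.0))
```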

Results are also saved as JSON with per-prompt breakdowns.

Note on CPU results: TurboQuant will appear ~20% slower on CPU, which is expected because FP16 has no hardware acceleration on most x86 CPUs. On a GPU with INT8 Tensor Cores, the results are expected to invert, with TurboQuant showing a 15–40% throughput improvement.

Troubleshooting

Process killed silently during model load → WSL2 ran out of memory. Increase memory= in .wslconfig and run wsl --shutdown.

No module named 'vllm._C' → vLLM not installed correctly. Run:

pip uninstall vllm vllm-cpu -y
pip install vllm-cpu --extra-index-url https://download.pytorch.org/whl/cpu

AVX2 not supported → Use the Docker path (Option B above).

Plots don't display in WSL2 → Expected. Find the PNG files in Windows Explorer at \\wsl$\Ubuntu\home\YOUR_USER\...\results\figures\.

conda: command not found → Run source ~/.bashrc or restart your terminal.
