| Project Page | Paper |
Tangram is a serving system that makes non-uniform KV cache compression practical for multi-turn LLM serving. It is built on top of vLLM.
Highlights
- Up to ~5× memory savings — head group page reclaims fragmented KV cache memory
- Up to 2.3× throughput — static budget allocation removes scheduling overhead
- Minimal accuracy loss — non-uniform KV compression preserves the heads that matter
Core techniques
- Deterministic Budget Allocation — static per-head memory footprint, no runtime scheduling overhead
- Head Group Page — clusters heads by retention demand with independent, vectorized page tables
- Ahead-of-Time (AOT) Load Balancing — offline workload partitioning for uniform SM utilization
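To make the Head Group Page idea more concrete, here is a minimal sketch of clustering heads by retention demand and sizing one page table per group. It is an illustration only, not Tangram's implementation: the function name `group_heads_by_budget`, the `page_size` parameter, and the sample budgets are all hypothetical, and the real system may group heads and size pages differently.

```python
from typing import Dict, List


def group_heads_by_budget(
    budgets: List[int], group_size: int, page_size: int
) -> List[Dict]:
    """Cluster heads with similar token budgets (hypothetical helper).

    budgets[h] is the number of KV tokens head h is allowed to retain.
    Each group gets its own page table, sized by the largest budget in it.
    """
    if group_size <= 0 or page_size <= 0:
        raise ValueError("group_size and page_size must be positive")

    # Sort head indices by budget so heads with similar demand land together.
    order = sorted(range(len(budgets)), key=lambda h: budgets[h], reverse=True)

    groups = []
    for start in range(0, len(order), group_size):
        heads = order[start:start + group_size]
        # Heads in one group share a page table, so the group needs enough
        # pages for its most demanding head (ceiling division).
        max_budget = max(budgets[h] for h in heads)
        pages = -(-max_budget // page_size)
        groups.append({"heads": heads, "pages": pages})
    return groups


# Assumed example: 8 heads, 4 heads per group, 16 tokens per page.
budgets = [4096, 512, 3584, 256, 3072, 3968, 128, 384]
groups = group_heads_by_budget(budgets, group_size=4, page_size=16)
for g in groups:
    print(g)
```

In this toy setup the high-demand heads form one group and the low-demand heads another, so the second group needs far fewer pages than it would under a single table sized for the largest head.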
vLLM is a fast and easy-to-use library for LLM inference and serving. See the vLLM documentation for the underlying engine and supported models.
Install from source:
pip install -e .

Head Group paging is on by default; add --enable-compression for non-uniform compression.
vllm serve /path/to/Qwen2.5-7B-Instruct-1M \
--enable-compression \
--compression-ratio 0.3

By default, --max-model-len follows the model's max_position_embeddings
and the server listens on port 8000; override either with the standard vLLM
flags if you need a different value.
- --enable-compression / --compression-ratio R: non-uniform KV cache compression with retention fraction R.
- --page-group-size N: head group page (default 4).
enforce_eager, max_num_batched_tokens, and enable_prefix_caching are
auto-set when these features are on. Compression sizing
(--compression-chunk-size, --compression-window-size,
--compression-n-sink-tokens) can be tuned if needed.
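As a rough guide to what a retention fraction means for memory, the sketch below estimates KV cache size before and after compression. It is a back-of-the-envelope calculation under stated assumptions, not how Tangram sizes its cache: it ignores the chunk, window, and sink-token settings as well as paging overhead, and the layer count, KV head count, head dimension, and context length are illustrative values that should be replaced with those from your model's config.json.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    num_tokens: int,
    dtype_bytes: int = 2,
) -> int:
    """Approximate KV cache size for one sequence.

    The leading factor of 2 accounts for storing both keys and values.
    dtype_bytes=2 assumes a 16-bit cache dtype (fp16 or bf16).
    """
    return 2 * num_layers * num_kv_heads * head_dim * num_tokens * dtype_bytes


def retained_tokens(context_len: int, ratio: float) -> int:
    """Tokens kept on average when a fraction `ratio` is retained."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    return round(context_len * ratio)


# Illustrative (assumed) model shape and workload.
layers, kv_heads, head_dim = 28, 4, 128
context_len = 100_000
ratio = 0.3  # same value as --compression-ratio 0.3

full = kv_cache_bytes(layers, kv_heads, head_dim, context_len)
kept = kv_cache_bytes(
    layers, kv_heads, head_dim, retained_tokens(context_len, ratio)
)
print(f"full: {full / 1e9:.2f} GB, compressed: {kept / 1e9:.2f} GB")
```

Because the compression is non-uniform, individual heads keep more or fewer tokens than this average; the estimate only describes the total.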
- Qwen3-4B
- Qwen2.5-7B (remove the DCA config from config.json)
- Qwen3-32B
- Llama-3.1-8B-Instruct
- FlashAttention
Test a completion:
curl -s http://127.0.0.1:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{"model": "/path/to/Qwen2.5-7B-Instruct-1M",
"prompt": "Tangram is",
"max_tokens": 32,
"temperature": 0}'
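The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the serve command above is running locally on port 8000 and returns the usual OpenAI-style completions response; the helper names `build_completion_request` and `complete` are hypothetical.

```python
import json
import urllib.request


def build_completion_request(
    base_url: str,
    model: str,
    prompt: str,
    max_tokens: int = 32,
    temperature: float = 0.0,
) -> urllib.request.Request:
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the text of the first choice."""
    req = build_completion_request(base_url, model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["text"]


# Example call (requires a running server):
# print(complete("http://127.0.0.1:8000",
#                "/path/to/Qwen2.5-7B-Instruct-1M",
#                "Tangram is"))
```

Keeping request construction separate from the network call makes it easy to inspect the payload without a server running.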