Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

Tangram is a serving system that makes non-uniform KV cache compression practical for multi-turn LLM serving. It is built on top of vLLM.

Highlights

Up to ~5× memory savings — head group page reclaims fragmented KV cache memory
Up to 2.3× throughput — static budget allocation removes scheduling overhead
Minimal accuracy loss — non-uniform KV compression preserves the heads that matter

Core techniques

Deterministic Budget Allocation — static per-head memory footprint, no runtime scheduling overhead
Head Group Page — clusters heads by retention demand with independent, vectorized page tables
Ahead-of-Time (AOT) Load Balancing — offline workload partitioning for uniform SM utilization
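
To make the Head Group Page idea concrete, here is a minimal sketch of why grouping heads with similar retention demand can save pages. It is not Tangram's implementation: the function names, the page size of 16 tokens, the per-head token counts, and the rule that each group is padded to its largest head are all assumptions made for illustration.

```python
import math

def group_heads(retained_tokens, group_size=4):
    """Sort head indices by retained-token count and chunk them into groups.

    Hypothetical helper: group_size mirrors the --page-group-size default of 4.
    """
    order = sorted(range(len(retained_tokens)), key=lambda h: retained_tokens[h])
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]

def pages_needed(retained_tokens, groups, page_size=16):
    """Count pages when every head in a group is padded to the group's largest head."""
    total = 0
    for group in groups:
        longest = max(retained_tokens[h] for h in group)
        total += len(group) * math.ceil(longest / page_size)
    return total

# Made-up per-head budgets: four heads keep few tokens, four keep many.
retained = [64, 2048, 1950, 96, 1900, 80, 2000, 112]

# Uniform layout: one group containing all heads, padded to the largest head.
uniform = pages_needed(retained, [list(range(len(retained)))])

# Grouped layout: heads with similar demand share a group.
grouped = pages_needed(retained, group_heads(retained, group_size=4))

print(uniform, grouped)
```

With these made-up numbers the uniform layout needs 1024 pages and the grouped layout 540, because the four low-demand heads no longer pay for the longest head's allocation.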

Built on vLLM

Tangram is built on top of vLLM, a fast and easy-to-use library for LLM inference and serving. See the vLLM documentation for the underlying engine and supported models.

Install from source:

pip install -e .

Quick Start

Head Group paging is on by default; add --enable-compression for non-uniform compression.

vllm serve /path/to/Qwen2.5-7B-Instruct-1M \
    --enable-compression \
    --compression-ratio 0.3

By default --max-model-len follows the model's max_position_embeddings and the server listens on port 8000; override with the standard vLLM flags if you need a different value.

--enable-compression / --compression-ratio R: enable non-uniform KV cache compression with retention fraction R (0.3 in the example above).
--page-group-size N: Head Group Page group size (default 4).
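
As a rough guide to what a retention fraction means for memory, the sketch below estimates KV cache size before and after compression. It assumes R acts as an average fraction of retained tokens across heads and ignores any window or sink tokens kept on top of that. The function name and the layer, head, and dimension values are illustrative assumptions, not values read from a model config.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim,
                   dtype_bytes=2, retention=1.0):
    """Estimate KV cache size in bytes.

    The factor 2 covers the key and value tensors; dtype_bytes=2 assumes
    16-bit storage. retention scales the token count (assumed average).
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * dtype_bytes
    return round(per_token * num_tokens * retention)

# Illustrative shape only; take the real values from your model's config.json.
full = kv_cache_bytes(100_000, num_layers=28, num_kv_heads=4, head_dim=128)
kept = kv_cache_bytes(100_000, num_layers=28, num_kv_heads=4, head_dim=128,
                      retention=0.3)

print(f"full: {full / 2**30:.2f} GiB, retained at R=0.3: {kept / 2**30:.2f} GiB")
```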

enforce_eager, max_num_batched_tokens, and enable_prefix_caching are auto-set when these features are on. Compression sizing (--compression-chunk-size, --compression-window-size, --compression-n-sink-tokens) can be tuned if needed.

Supported Models

Qwen3-4B
Qwen2.5-7B (remove the DCA (Dual Chunk Attention) config from config.json first)
Qwen3-32B
Llama-3.1-8B-Instruct
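
For the Qwen2.5-7B note above, a small script can strip the DCA entry from config.json. This is a sketch under one assumption: that the entry is stored under the top-level key dual_chunk_attention_config. Check your own config.json and pass a different key if it differs. The helper name is hypothetical.

```python
import json
from pathlib import Path

# Assumed key name; verify it against your model's config.json.
DCA_KEY = "dual_chunk_attention_config"

def strip_dca_config(config_path, key=DCA_KEY):
    """Remove `key` from a config.json, keeping a .bak copy of the original.

    Returns True if the key was present and removed, False otherwise.
    """
    path = Path(config_path)
    original = path.read_text(encoding="utf-8")
    config = json.loads(original)
    if key not in config:
        return False
    del config[key]
    # Save the untouched original next to the file before overwriting it.
    path.with_name(path.name + ".bak").write_text(original, encoding="utf-8")
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return True

# Usage (path is a placeholder):
# strip_dca_config("/path/to/Qwen2.5-7B-Instruct-1M/config.json")
```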

Supported Attention Backend

FlashAttention

Test a completion against the server started in Quick Start:

curl -s http://127.0.0.1:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "/path/to/Qwen2.5-7B-Instruct-1M",
         "prompt": "Tangram is",
         "max_tokens": 32,
         "temperature": 0}'
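
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above; the helper names are hypothetical, and it assumes the server returns the usual OpenAI-style completions response with the text under choices[0].text.

```python
import json
import urllib.request

def build_completion_request(model, prompt, base_url="http://127.0.0.1:8000",
                             max_tokens=32, temperature=0):
    """Build a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def complete(model, prompt, **kwargs):
    """Send the request and return the generated text (needs a running server)."""
    request = build_completion_request(model, prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]

# Usage, with the server from Quick Start running:
# print(complete("/path/to/Qwen2.5-7B-Instruct-1M", "Tangram is"))
```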
