# TurboQuant

First open-source implementation of Google's TurboQuant KV cache compression.

Compress your LLM's KV cache to 4 bits. Save VRAM. Run longer contexts. Drop-in for HuggingFace.

```python
from turboquant import TurboQuantCache

cache = TurboQuantCache(bits=4)
outputs = model.generate(..., past_key_values=cache)
```

That's it. Three lines to compress your KV cache.

## What is this?

When LLMs generate text, they store key-value pairs for every token they've seen. This KV cache grows with context length and eats your VRAM. At 32K tokens on an 8B model, the KV cache alone uses ~4.6 GB.
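
For intuition, here's the back-of-the-envelope arithmetic. The layer, head, and dimension values below are illustrative numbers for a Llama-style 8B model with grouped-query attention, not values taken from this repo:

```python
# Rough FP16 KV cache size for a Llama-style 8B model (illustrative config)
layers, kv_heads, head_dim = 32, 8, 128   # typical 8B GQA configuration
seq_len, bytes_per_elem = 32_768, 2       # 32K context at FP16

# 2x for keys and values
kv_bytes = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem
print(f"{kv_bytes / 1e9:.1f} GB")         # ~4.3 GB; ~4x less at 4 bits
```

The exact figure depends on the model's attention layout, which is why this estimate lands near, not exactly on, the ~4.6 GB quoted above.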

TurboQuant compresses this cache to 4 bits per element (from 16), cutting memory by ~4x. It does this using a clever trick from Google's paper: rotate the vectors randomly, then quantize each coordinate independently using an optimal codebook derived from probability theory.

The result: same quality output, way less VRAM.

## Install

```bash
pip install turboquant
```

Or from source:

```bash
git clone https://github.com/back2matching/turboquant
cd turboquant
pip install -e .
```

## Quick Start

### Drop into any HuggingFace model

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from turboquant import TurboQuantCache
import torch

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-3B-Instruct", dtype=torch.float16, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-3B-Instruct")

# Create compressed cache
cache = TurboQuantCache(bits=4)

# Use it like normal
inputs = tokenizer("Hello, how are you?", return_tensors="pt").to(model.device)
outputs = model(**inputs, past_key_values=cache, use_cache=True)
```
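
For a full `generate()` call, pass a fresh cache, as in the teaser at the top of this README (a minimal continuation of the snippet above; `max_new_tokens=50` is just an illustrative setting):

```python
# Generate with a compressed cache, then decode the output
cache = TurboQuantCache(bits=4)
gen = model.generate(**inputs, past_key_values=cache, max_new_tokens=50)
print(tokenizer.decode(gen[0], skip_special_tokens=True))
```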

### Run the inference server

TurboQuant ships with an OpenAI-compatible inference server. Point any OpenAI client at it.

```bash
turboquant-server --model Qwen/Qwen2.5-3B-Instruct --bits 4 --port 8000
```

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":100}'
```
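
The same request from Python with the official `openai` client. This sketch assumes the server, like the curl example, doesn't enforce an API key, and that the `model` field is permissive:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local TurboQuant server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-3B-Instruct",  # may be ignored; the curl example omits it
    messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=100,
)
print(resp.choices[0].message.content)
```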

### Use the core algorithms directly

```python
from turboquant import TurboQuantMSE

# Quantize any vectors (KV cache heads, embeddings, etc.)
tq = TurboQuantMSE(dim=128, bits=4, device='cuda')

# Quantize
indices, norms = tq.quantize(vectors)  # vectors: (N, 128)

# Dequantize
vectors_hat = tq.dequantize(indices, norms)
```
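
To sanity-check reconstruction quality, you can measure the roundtrip error directly. This sketch uses only the API shown above, on synthetic Gaussian vectors; how the benchmark table below normalizes its MSE may differ:

```python
import torch
from turboquant import TurboQuantMSE

tq = TurboQuantMSE(dim=128, bits=4, device='cuda')
vectors = torch.randn(1024, 128, device='cuda')

indices, norms = tq.quantize(vectors)
vectors_hat = tq.dequantize(indices, norms)

# Relative squared error per vector (normalization may differ from the
# MSE definition used in the benchmark table)
err = ((vectors - vectors_hat) ** 2).sum(-1) / (vectors ** 2).sum(-1)
print(f"mean relative error: {err.mean().item():.4f}")
```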

## Benchmarks (RTX 4080 16GB)

### Qwen2.5-3B-Instruct

| KV Mode | Peak VRAM | VRAM Saved | Speed | Output Quality |
|---------|-----------|------------|-------|---------------|
| FP16 (baseline) | 6,922 MB | -- | 28 tok/s | Perfect |
| **TurboQuant 4-bit** | **6,448 MB** | **474 MB** | 17 tok/s | Good |
| TurboQuant 3-bit | 6,448 MB | 474 MB | 20 tok/s | Degraded on small models |

VRAM savings scale linearly with context length. At 32K tokens on an 8B model, expect ~3 GB saved.

### Algorithm Verification

| Bits | Measured MSE | Theoretical Bound | Compression |
|------|--------------|-------------------|-------------|
| 1 | 0.362 | 0.680 | 12.8x |
| 2 | 0.129 | 0.170 | 7.1x |
| 3 | 0.049 | 0.043 | 4.9x |
| 4 | 0.020 | 0.011 | 3.8x |
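
The compression column is consistent with storing one FP32 norm alongside each quantized 128-dim vector. That's an inference from the numbers, not a documented storage format:

```python
# 128 coords at FP16 vs. `bits` per coord plus one FP32 norm per vector;
# this reproduces the table's compression column
for bits in (1, 2, 3, 4):
    ratio = (128 * 16) / (128 * bits + 32)
    print(f"{bits}-bit: {ratio:.1f}x")  # 12.8x, 7.1x, 4.9x, 3.8x
```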

## How It Works

TurboQuant uses three ideas from the paper:

1. **Random rotation**: Multiply each KV vector by a random orthogonal matrix. This spreads the information evenly across all coordinates, making them nearly independent.

2. **Optimal codebook**: Each coordinate now follows a predictable Beta distribution. We compute the mathematically optimal quantization levels for this distribution. No training data needed.

3. **Residual window**: The most recent 128 tokens stay in full FP16 precision. Only older tokens get compressed. This preserves quality for the tokens attention focuses on most.

The rotation is computed once (not per-token) and the codebook is derived analytically. No calibration, no fine-tuning; it works with any model out of the box.
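
Here's a conceptual sketch of the rotate-then-quantize pipeline. This is not the library's internals: the uniform level grid stands in for the paper's analytically optimal codebook, and the residual-window bookkeeping is omitted:

```python
import torch

def random_rotation(dim: int, seed: int = 0) -> torch.Tensor:
    # One fixed random orthogonal matrix, built once via QR of a Gaussian matrix
    g = torch.Generator().manual_seed(seed)
    q, r = torch.linalg.qr(torch.randn(dim, dim, generator=g))
    return q * torch.sign(torch.diagonal(r))  # sign fix for a uniform (Haar) rotation

def quantize_coords(x: torch.Tensor, levels: torch.Tensor):
    # Scalar-quantize every coordinate to its nearest codebook level
    idx = (x.unsqueeze(-1) - levels).abs().argmin(-1)
    return idx, levels[idx]

dim = 128
R = random_rotation(dim)

# Stand-in codebook: 16 evenly spaced levels for 4 bits. The real codebook
# uses levels optimized for the post-rotation coordinate distribution
# (the Beta distribution above), not a uniform grid.
levels = torch.linspace(-3, 3, 16) / dim ** 0.5

v = torch.randn(10, dim)                    # pretend these are KV vectors
norms = v.norm(dim=-1, keepdim=True)
rotated = (v / norms) @ R.T                 # rotate the unit-normalized vectors
idx, deq = quantize_coords(rotated, levels) # store idx (4 bits/coord) + norms
v_hat = (deq @ R) * norms                   # un-rotate and restore the norms
```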

## When to Use This

**Good fit:**
- You're running long contexts (8K+ tokens) on a VRAM-constrained GPU
- You're serving multiple users and need to fit more KV caches in memory
- You want to run a bigger model by freeing VRAM from KV cache
- Standard transformer models (Llama, Mistral, Qwen2.5)

**Not a good fit:**
- Very short contexts (< 1K tokens) where KV cache is tiny anyway
- Hybrid architectures with recurrent layers (Qwen3.5, Mamba) that already have small KV caches
- Tasks requiring exact bit-level precision (use FP16)
- 3-bit on models smaller than 8B (quality degrades noticeably)

## Comparison with Alternatives

| Method | Where It Runs | Bits | Setup |
|--------|---------------|------|-------|
| **TurboQuant** | Any HuggingFace model | 3-4 | `pip install turboquant` |
| Ollama q8_0 KV | Ollama only | 8 | `OLLAMA_KV_CACHE_TYPE=q8_0` |
| Ollama q4_0 KV | Ollama only | 4 | `OLLAMA_KV_CACHE_TYPE=q4_0` |
| vLLM FP8 KV | vLLM only | 8 | `kv_cache_dtype="fp8"` |
| KIVI | Research code | 2 | Not pip-installable |

TurboQuant is the only pip-installable sub-8-bit KV cache compression library that works with any HuggingFace model.

## Paper

This implements the algorithm from:

**TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate**
Amir Zandieh, Majid Daliri, Majid Hadian, Vahab Mirrokni
ICLR 2026 | [arXiv:2504.19874](https://arxiv.org/abs/2504.19874)

This is an independent implementation, not affiliated with Google Research.

## License

Apache 2.0