QuashKV

Near-optimal KV cache compression and vector quantization based on the TurboQuant paper (ICLR 2026, Google Research).

QuashKV compresses high-dimensional vectors (LLM key/value caches, embedding databases) to 2–4 bits per coordinate with proven distortion guarantees, enabling:

4–5× KV cache compression with minimal quality loss
Zero-training vector search (180,000× faster indexing than Product Quantization)
Drop-in HuggingFace integration (vLLM scaffolding included)

Key Features

Feature	Description
Lloyd-Max codebooks	Optimal scalar quantizers solved against the exact coordinate PDF
MSE quantizer	Random rotation Π + per-coordinate Lloyd-Max quantization
Inner product quantizer	(b-1)-bit MSE + 1-bit QJL for unbiased IP estimation
Bit-packing	1/2/3/4-bit packing with lossless round-trip
Mixed precision	Per-channel adaptive bit allocation (outlier channels get more bits)
Fused kernels	PyTorch fallbacks + Triton GPU kernels (compress, decompress, fused attention)
HF cache	Drop-in `DynamicCache` replacement for HuggingFace models
vLLM backend	`QuashKVModelManager` architectural scaffolding (registration requires GPU testing)
NN search	`QuashIndex` for approximate maximum inner-product search

Installation

pip install -e .

With optional dependencies:

pip install -e ".[dev]"      # pytest, benchmarks
pip install -e ".[triton]"   # Triton GPU kernels
pip install -e ".[bench]"    # transformers, datasets, matplotlib
pip install -e ".[all]"      # everything

Requirements: Python ≥ 3.10, PyTorch ≥ 2.1, SciPy ≥ 1.10

Quick Start

KV Cache Compression

import torch
from quashkv import QuashKVEngine

engine = QuashKVEngine(head_dim=128, key_bits=3, value_bits=3)

# Compress and store KV pairs
keys = torch.randn(1, 8, 512, 128)    # (batch, heads, seq_len, head_dim)
values = torch.randn(1, 8, 512, 128)
engine.append(keys, values)

# Attend over compressed cache
queries = torch.randn(1, 8, 1, 128)
output = engine.attention(queries)     # (1, 8, 1, 128)

print(f"Compression ratio: {engine.compression_ratio():.1f}x")
# → Compression ratio: 4.9x

HuggingFace Integration

from quashkv.integrations.hf_cache import QuashKVCache

# Drop-in replacement for DynamicCache
cache = QuashKVCache(num_layers=32, head_dim=128)

# Use with any HuggingFace model:
# outputs = model.generate(**inputs, past_key_values=cache)

# Or auto-detect from model config:
# cache = QuashKVCache.from_model(model)

Vector Similarity Search

from quashkv import QuashIndex

# Build index (no training needed!)
index = QuashIndex(dim=128, bits=3)
index.add(database_vectors)  # (N, 128) float tensor

# Search
scores, indices = index.search(queries, k=10)

# Evaluate
recall = index.recall_at_k(queries, database_vectors, k=10)
print(f"Recall@10: {recall:.3f}")

Mixed Precision (Outlier Handling)

from quashkv import MixedPrecisionMSEQuantizer

# 2.5 effective bits: 32 outlier channels at 3b, 96 regular at 2b
quantizer = MixedPrecisionMSEQuantizer(
    d=128, outlier_bits=3, regular_bits=2, n_outliers=32,
)
quantizer.calibrate(calibration_data)  # (N, 128) sample

indices, norms = quantizer.compress(vectors)
reconstructed = quantizer.decompress(indices, norms)

Architecture

quashkv/
├── codebook.py          # Lloyd-Max optimal scalar quantizer
├── quantizer.py         # MSEQuantizer + InnerProductQuantizer
├── engine.py            # QuashKVEngine orchestrator
├── packing.py           # Bit-packing (1-4 bit → bytes)
├── mixed_precision.py   # Per-channel adaptive bit allocation
├── constants.py         # Block sizes, defaults
├── integrations/
│   ├── hf_cache.py      # HuggingFace DynamicCache replacement
│   └── vllm_backend.py  # vLLM backend scaffolding (not yet registered)
├── nn_search/
│   └── index.py         # QuashIndex for vector search
└── triton_kernels/
    ├── compress.py      # Fused normalize→rotate→quantize
    ├── decompress.py    # Fused dequantize→unrotate→rescale
    └── attention.py     # Online-softmax quantized attention

Theoretical Guarantees

From the TurboQuant paper (Zandieh et al., ICLR 2026):

Metric	Bound	3-bit value
MSE distortion	$\leq \sqrt{\frac{3\pi}{2}} \cdot 4^{-b}$	≤ 0.042
IP distortion	$\leq \sqrt{\frac{3\pi}{2}} \cdot \frac{\|\|y\|\|^2}{d} \cdot 4^{-b}$	≤ 0.042/d
Optimality gap	$\frac{\text{upper bound}}{\text{lower bound}} \approx 2.72$	Near-optimal

The inner product estimator is unbiased: $E[\langle y, \tilde{x} \rangle] = \langle y, x \rangle$.

Real-Model Validation

Validated on real LLM KV caches using a Tesla T4 GPU (Kaggle). Both models use head_dim=64.

3-bit Compression (Key: IP Quantizer, Value: MSE Quantizer)

Model	Layers	KV Heads	Key Cosine	Value Cosine	Attention Cosine
Qwen/Qwen2.5-0.5B	24	2 (GQA)	0.9409	0.9838	0.8869
TinyLlama/TinyLlama-1.1B-Chat-v1.0	22	4 (GQA)	0.9420	0.9836	0.9237

Bit-Width Sweep

Bits	Key Cosine (Qwen / TinyLlama)	Value Cosine (Qwen / TinyLlama)	Compression
2-bit	0.7988 / 0.8011	0.9421 / 0.9417	8.0x
3-bit	0.9409 / 0.9420	0.9838 / 0.9836	5.3x
4-bit	0.9835 / 0.9837	0.9956 / 0.9956	4.0x

Memory Savings (3-bit, packed)

Metric	Value
Compression ratio	4.57x
Memory saved	78.1%

Long-Sequence Stability (TinyLlama 1.1B)

Sequence Length	Key Cosine	Value Cosine	Original	Compressed
122 tokens	0.9420	0.9836	2.6 MB	0.6 MB
1,040 tokens	0.9420	0.9835	22.3 MB	5.0 MB

Quality is perfectly stable — identical cosine similarity at 8.5x longer sequences. Compression ratio holds at 4.57x regardless of sequence length.

Key cosine similarity (not raw MSE) is the correct quality metric for keys because the IP quantizer preserves direction. Value MSE stays well within the theoretical bound (0.002 vs 0.035).

See notebooks/kaggle_validation.ipynb for the full reproducible notebook.

A100 GPU Benchmarks

All numbers collected on an NVIDIA A100-SXM4-80GB, PyTorch 2.8.0+cu128, Triton enabled.

Fused Attention Speedup

The fused Triton attention kernel (online softmax, no KV materialization) vs. the engine's standard decompress-then-attend path:

KV Length	Engine (naive)	Fused Attention	Speedup
64	0.40 ms	0.20 ms	2.0×
256	1.06 ms	0.24 ms	4.4×
1,024	3.71 ms	0.50 ms	7.4×
4,096	14.65 ms	1.55 ms	9.4×

Fused compression is also ~2× faster: 0.13 ms (fused) vs. 0.27 ms (standard) at seq_len=256.

WikiText-2 Perplexity — TinyLlama 1.1B (22 layers, head_dim=64)

Baseline perplexity: 6.97 (4096 tokens, sliding window, no compression)

Bits	Key Cosine	Value Cosine	Compressed PPL	Δ PPL	Compression	Memory Saved
2	0.8011	0.9411	31.56	+24.59	6.40×	84.4%
3	0.9411	0.9831	7.09	+0.12	4.57×	78.1%
4	0.9832	0.9954	6.97	+0.00	3.56×	71.9%

2-bit degrades badly on TinyLlama (head_dim=64) because each coordinate carries more information. With Mistral 7B (head_dim=128), 2-bit is far more usable — see below.

WikiText-2 Perplexity — Mistral 7B (32 layers, head_dim=128)

Baseline perplexity: 5.06 (4096 tokens, fp16, no compression)

Bits	Key Cosine	Value Cosine	Compressed PPL	Δ PPL	Compression	Memory Saved
2	0.7997	0.9403	5.88	+0.82	7.11×	85.9%
3	0.9404	0.9829	5.10	+0.04	4.92×	79.7%
4	0.9829	0.9953	5.07	+0.01	3.76×	73.4%

3-bit compression adds only +0.04 perplexity (5.06 → 5.10) while saving 80% memory — effectively lossless. 4-bit is indistinguishable from baseline (+0.01). Even 2-bit only adds +0.82 with 86% memory savings.

End-to-End Generation — TinyLlama 1.1B (A100)

Side-by-side text generation comparing standard (uncompressed) vs. compressed KV cache:

Metric	Standard	3-bit Compressed	2-bit Compressed
Decode speed	65 tok/s	3.3 tok/s	3.4 tok/s
KV cache size	2.4 MB	0.5 MB	0.4 MB
Compression	1.0×	4.6×	6.4×
Memory saved	—	78%	84%

3-bit sample (prompt: "In a small village nestled between mountains,"):

Standard: a young woman named Lily lives a simple life. She works as a cook in a local inn, and her days are filled with the sounds of birds chirping and the rustling of leaves in the wind…

3-bit compressed: a young woman named Lily had a secret. She had been raised by her grandmother, who had been a kind and loving woman, but Lily had never known her mother…

3-bit outputs are fluent and coherent — the autoregressive path diverges but produces equally valid continuations. 2-bit (1-bit MSE + 1-bit QJL for keys) shows visible quality degradation with repetitive tokens, confirming that 3-bit is the practical sweet spot for generation quality.

End-to-End Generation — Mistral 7B (A100)

Metric	Standard	3-bit Compressed
Decode speed	47 tok/s	2.3 tok/s
KV cache size	13.8 MB	2.8 MB
Compression	1.0×	4.9×
Memory saved	—	80%

Sample output (prompt: "The key insight behind transformer models is that"):

Standard: the order of words in a sentence doesn't matter as much as the relationships between words. This is a powerful idea, but it's also a bit of a double-edged sword…

3-bit compressed: the order of words in a sentence doesn't matter as much as the relationships between words. This is a powerful idea, and it's the reason why transformer models can be trained on a relatively small amount of data…

28% token match on this prompt — the first 25 tokens are identical. Mistral 7B's larger head_dim (128 vs 64) means the compression noise is proportionally smaller, producing higher-quality compressed output than TinyLlama.

The compressed decode speed reflects per-token decompress→forward→recompress overhead; the fused Triton kernel eliminates decompression entirely for attention-bound workloads (9.4× speedup benchmarked separately).

# Run generation demo
python benchmarks/generation_demo.py
python benchmarks/generation_demo.py --bits 2      # 2-bit aggressive
python benchmarks/generation_demo.py --model mistralai/Mistral-7B-Instruct-v0.3 --dtype float16

Run Benchmarks

# Latency (GPU required, --triton-compare for head-to-head)
python benchmarks/latency_bench.py --device cuda --triton-compare

# Perplexity (downloads model + WikiText-2)
python benchmarks/perplexity_eval.py --model TinyLlama/TinyLlama-1.1B-Chat-v1.0 --bits 2 3 4 --max-tokens 2048
python benchmarks/perplexity_eval.py --model mistralai/Mistral-7B-v0.3 --bits 2 3 4 --max-tokens 4096 --dtype float16

# Quality (MSE, cosine similarity, compression ratio)
python -m benchmarks.quality_bench

# NN search (recall@k, indexing time, memory)
python -m benchmarks.nn_search_bench
python -m benchmarks.nn_search_bench --large  # 100K vectors

Testing

pip install -e ".[dev]"
pytest

175 tests covering all modules: codebook, quantizer, engine, packing, HF cache, vLLM backend, NN search, Triton fallbacks, mixed precision, and end-to-end integration.

Comparison with Related Work

Feature Matrix

Aspect	QuashKV	Anirudh's cuTile impl	KIVI	PolarQuant
Hardware	Any GPU (A100/H100/4090)	B200 only	Any	Any
Kernel framework	PyTorch + Triton kernels	cuTile (Blackwell-locked)	PyTorch	PyTorch
End-to-end generation	✅ Side-by-side demo	✅ Qwen 2.5 demo	❌	❌
Compressed perplexity eval	✅ WikiText-2 (2 models)	❌	❌	❌
Vector search	Full NN search module	❌	N/A	N/A
Outlier handling	Mixed-precision adaptive	❌	Per-channel	None
Serving integration	HuggingFace (production) + vLLM (scaffolding)	Standalone notebook	HuggingFace	HuggingFace
Theoretical guarantees	Full paper bounds verified	Partial	None	None
Bit-width configs	2, 2.5, 3, 3.5, 4-bit	3-bit only	2-bit	4-bit
Multi-model validation	TinyLlama, Mistral 7B	Qwen 2.5-1.5B only	Llama	Llama
Test coverage	175 tests	~5 tests	Minimal	Minimal
Lines of code	~3,280	~2,000	—	—

Where cuTile Is Stronger (and why)

Raw kernel performance. Anirudh's fused attention kernel runs on NVIDIA B200 with cuTile — a Blackwell-native DSL that compiles directly to Tensor Core instructions, hardware TMA (Tensor Memory Accelerator) prefetch, and TMEM→MMA paths that don't exist on older architectures. His generation demo hits 144.7 tok/s on B200 vs our 47 tok/s on A100 for a comparable model size. This isn't close.

Specific Blackwell advantages cuTile exploits that we can't touch:

Hardware TMA prefetch with latency hints (latency=2, latency=4) — overlaps memory loads with Tensor Core compute at the hardware level
Block swizzling for L2 cache line optimization (~6% latency reduction at 16K tokens)
Native exp2 with flush-to-zero — Blackwell's fast-path softmax in hardware
Approximate division with rounding_mode=APPROX — hardware reciprocal instead of software division
Pi matrix resident in shared memory (32KB) — stays loaded for the entire fused attention pass, zero reloads

These are real hardware-level wins that Triton on Ampere/Hopper simply cannot replicate. cuTile generates code that's purpose-built for Blackwell's memory hierarchy.

On-chip V decompression. His fused kernel decompresses value indices → centroids → un-rotates via Pi → weighted accumulation in a single pass without ever writing decompressed values to HBM. Our Triton fused attention kernel operates on compressed KV directly and avoids the decompress round-trip, but we don't have the same level of on-chip data reuse because Triton's programming model is more constrained than cuTile's explicit tile-level control.

Where QuashKV Is Stronger (and why)

1. Portability. cuTile is locked to Blackwell (B200/B100). QuashKV runs on any GPU with Triton support — A100, H100, L40S, RTX 4090, even older V100s for the PyTorch fallback path. This matters because:

Most researchers and startups don't have B200 access
Most cloud GPU instances are still A100/H100
Reproducibility requires hardware accessibility

2. Completeness of the paper. We implement every component from the TurboQuant paper:

MSE quantizer (Algorithm 1) ✅
Inner product quantizer with QJL (Algorithm 2) ✅
Mixed-precision / outlier channel splitting (Section 4.2) ✅
Approximate nearest neighbor search (Section 5) ✅
Full theoretical bound verification ✅

cuTile implements the core compress/decompress/attention pipeline (which is the most important part) but doesn't cover NN search, mixed precision, or multi-bitwidth configurations.

3. Compressed perplexity — the metric that actually matters. Cosine similarity (cuTile reports ~0.985) tells you reconstruction quality, but it doesn't tell you whether the model's output is degraded. We measured actual WikiText-2 perplexity with compressed KV cache:

Mistral 7B 3-bit: +0.04 perplexity (5.06 → 5.10) — effectively lossless
TinyLlama 3-bit: +0.12 perplexity (6.97 → 7.09)
Mistral 7B 4-bit: +0.01 perplexity — indistinguishable from baseline

No other TurboQuant implementation has published compressed perplexity numbers. This is the strongest evidence that the theoretical guarantees translate to real model quality.

4. Multi-model validation. cuTile tested on Qwen 2.5-1.5B only. We validated on two architecturally different models (TinyLlama head_dim=64, Mistral 7B head_dim=128), which revealed that 2-bit quality depends heavily on head_dim — a finding you can't discover from a single model.

5. Production integration. QuashKV provides a drop-in replacement for HuggingFace's DynamicCache that works with model.generate(). A vLLM backend is architecturally implemented (QuashKVModelManager, QuashKVPageManager) but registration requires GPU testing with vLLM's attention metadata builders — it's not yet plug-and-play for vLLM.

6. Flexible bit-widths. 2, 2.5, 3, 3.5, and 4-bit configurations, with per-channel mixed precision that allocates extra bits to outlier channels. cuTile is hardcoded to 3-bit (2-bit MSE + 1-bit QJL for keys, 3-bit MSE for values).

7. Test coverage. 175 tests covering every module end-to-end. This isn't about vanity — it means any researcher can fork the repo, change something, and know immediately if they broke the math.

Where We're Honest About Limitations

1. Decode speed. Our compressed generation runs at 2.3–3.4 tok/s (decompress → forward → recompress per token) vs 47–65 tok/s uncompressed. cuTile's fused kernel avoids this by operating directly on compressed data during generation. Our fused Triton attention kernel (benchmarked at 9.4× speedup) isn't yet wired into the generation loop — it's benchmarked standalone. This is our biggest gap.

2. Kernel throughput. Triton generates good GPU code, but it can't match cuTile's Blackwell-specific optimizations. On equivalent hardware, cuTile would be faster for the core attention+decompress path. We trade peak performance for portability.

3. Scale. We validated on TinyLlama 1.1B and Mistral 7B. cuTile's Qwen 2.5-1.5B is comparable in size. Neither implementation has been tested on truly large models (70B+, MoE architectures) where per-layer error accumulation across 80+ layers is the real challenge.

4. No benchmark on matching hardware. We can't do an apples-to-apples latency comparison because we don't have B200 access. Our A100 numbers and their B200 numbers measure different things.

Why We Built This

The short answer: Anirudh built the right thing on hardware nobody has. We built the version anyone can use.

TurboQuant is a beautiful paper — near-optimal vector quantization with provable guarantees, elegant math (random rotation → independent coordinates → Lloyd-Max is optimal), and practical KV cache compression. When we saw Anirudh's implementation, three things stood out:

B200-only is a dealbreaker for most people. The paper's algorithms are hardware-agnostic. The math works on any GPU — why should the implementation be locked to one chip that costs $30K+ and has limited cloud availability?
The paper has more to offer than KV compression. TurboQuant's NN search application (180,000× faster indexing than Product Quantization) and mixed-precision adaptive quantization are significant contributions that deserved implementation.
Benchmarks need to go deeper. Cosine similarity and generation demos are great proof-of-concepts, but the research community needs compressed perplexity — the actual metric that determines whether you'd deploy this in production. We measured it, and the results (+0.04 PPL at 3-bit on 7B) are strong enough to publish.

The implementations are complementary. If you have a B200 and want maximum throughput, use cuTile. If you want to understand, experiment with, or deploy TurboQuant on accessible hardware, use QuashKV.

API Reference

Core Classes

QuashKVEngine — Main orchestrator. append(keys, values) compresses, attention(queries) attends.
MSEQuantizer — Rotation + Lloyd-Max for MSE-optimal compression.
InnerProductQuantizer — Two-stage (MSE + QJL) for unbiased IP estimation.
LloydMaxCodebook — Precomputed optimal scalar quantizer.
QuashIndex — Approximate MIPS index with add(), search(), recall_at_k().
MixedPrecisionMSEQuantizer / MixedPrecisionIPQuantizer — Adaptive bit allocation.

Integration Classes

QuashKVCache — HuggingFace DynamicCache replacement.
QuashKVModelManager / QuashKVPageManager — vLLM integration scaffolding (not yet registered).

Utilities

pack_bits(indices, bits) / unpack_bits(packed, bits, d) — Bit-packing.
fused_compress_mse() / fused_compress_ip() — Fused compression kernels.
fused_quantized_attention() — Online-softmax attention over compressed cache.

Citation

Based on the TurboQuant paper:

@inproceedings{zandieh2025turboquant,
  title={TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate},
  author={Zandieh, Amir and Daliri, Majid and Hadian, Majid and Mirrokni, Vahab},
  booktitle={ICLR},
  year={2026}
}

License

Apache 2.0

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
benchmarks		benchmarks
notebooks		notebooks
quashkv		quashkv
scripts		scripts
tests		tests
.gitignore		.gitignore
BLOG_POST.md		BLOG_POST.md
README.md		README.md
RUNPOD_GUIDE.txt		RUNPOD_GUIDE.txt
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

QuashKV

Key Features

Installation

Quick Start

KV Cache Compression

HuggingFace Integration

Vector Similarity Search

Mixed Precision (Outlier Handling)

Architecture

Theoretical Guarantees

Real-Model Validation

3-bit Compression (Key: IP Quantizer, Value: MSE Quantizer)

Bit-Width Sweep

Memory Savings (3-bit, packed)

Long-Sequence Stability (TinyLlama 1.1B)

A100 GPU Benchmarks

Fused Attention Speedup

WikiText-2 Perplexity — TinyLlama 1.1B (22 layers, head_dim=64)

WikiText-2 Perplexity — Mistral 7B (32 layers, head_dim=128)

End-to-End Generation — TinyLlama 1.1B (A100)

End-to-End Generation — Mistral 7B (A100)

Run Benchmarks

Testing

Comparison with Related Work

Feature Matrix

Where cuTile Is Stronger (and why)

Where QuashKV Is Stronger (and why)

Where We're Honest About Limitations

Why We Built This

API Reference

Core Classes

Integration Classes

Utilities

Citation

License

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages