
AdapTQ: Adaptive Streaming Vector Quantization


AdapTQ is a production-grade C++17 KV cache quantization engine for LLM inference on edge and memory-constrained systems.

The Integration Pitch: AdapTQ is an optional KV-cache backend. It runs entirely on the CPU, requires no model changes, and fits into existing inference pipelines with minimal adapter-style wrapper logic.

🚀 Quickstart

import torch
from adaptq import AdaptQAttention

# 1. Initialize the drop-in PyTorch wrapper (4-bit quantization by default)
layer = AdaptQAttention(dim=128, heads=4)

# 2. Forward pass on (batch, heads, dim) tensors produced during generation
out = layer(q=torch.randn(1, 4, 128), k=torch.randn(1, 4, 128), v=torch.randn(1, 4, 128))

📊 Real-world Benchmarks

Tested on standard AVX2 desktop hardware (4 heads, dim=128, caching up to 4096 tokens). Note: sequence lengths < 256 are excluded from these figures, since short sequences are routed to standard FP32 execution by the hybrid fallback (a minimal sketch of that routing follows the table below).

Metric           Result (Seq ≥ 256)
Latency          p50: 877.9 µs | p95: 2161.8 µs
Stable speedup   ~10.18x vs NumPy FP32 equivalent
Throughput       ~1,139 tokens/sec
Memory           2.10 MB vs 8.39 MB for FP16 (4.0x smaller)
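The hybrid fallback can be pictured as a simple length check before attention. This is an illustrative sketch only: the threshold constant and the fp32_attention / quantized_attention helpers are placeholders, not AdapTQ's actual internals.

HYBRID_THRESHOLD = 256  # sequences shorter than this stay on the plain FP32 path

def route_attention(q, k_cache, v_cache, quantized_attention, fp32_attention):
    """Send short sequences to exact FP32 attention, long ones to the quantized path."""
    seq_len = k_cache.shape[0]          # cache laid out with the sequence dimension first
    if seq_len < HYBRID_THRESHOLD:
        return fp32_attention(q, k_cache, v_cache)       # exact baseline
    return quantized_attention(q, k_cache, v_cache)      # AdapTQ quantized path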

Quantization Fidelity (Honest Metrics)

We apply a targeted $\pm 3\sigma$ soft-clip to the FWHT-rotated coefficient distribution, leaving the Max-Lloyd codebooks untouched. Measured reconstruction quality against the FP32 baseline (a small sketch of the clipping step and both metrics follows below):

  • Cosine Similarity: ~0.947 (1.000 = identical)
  • Mean Squared Error (MSE): ~1.8e-04
[Figure: adaptq_realtime_bench (real-time benchmark plot)]
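A minimal NumPy sketch of what a $\pm 3\sigma$ soft-clip does and how the two metrics above are computed. The tanh-based soft_clip below is a generic illustration rather than AdapTQ's exact clipping function, and the random vector merely stands in for FWHT-rotated cache coefficients; the repository numbers above measure the full quantization pipeline, not the clip alone.

import numpy as np

def soft_clip(x, n_sigma=3.0):
    # Compress values beyond ±n_sigma·std instead of hard-truncating them
    limit = n_sigma * x.std()
    return limit * np.tanh(x / limit)

rng = np.random.default_rng(0)
coeffs = rng.standard_normal(4096)       # stand-in for FWHT-rotated coefficients
clipped = soft_clip(coeffs)

cosine = coeffs @ clipped / (np.linalg.norm(coeffs) * np.linalg.norm(clipped))
mse = np.mean((coeffs - clipped) ** 2)
print(f"cosine={cosine:.4f}  mse={mse:.2e}")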

🏗 Architecture & Features

  • Unified SIMD Pipeline: 2, 3, and 4-bit decoding share a single, quad-unrolled branchless loop using AVX2 intrinsics. No scalar fallbacks in the hot path.
  • Fast Hadamard Rotation (HAR): an $O(d \log d)$, fully in-place rotation that smooths outliers before codebook matching (a minimal sketch follows this list).
  • Zero Heap Allocations: Pure stack/thread-local memory buffers in the hot path.
  • Precomputed LUTs: Dot products execute directly against packed indices in SIMD registers, avoiding full dequantization inside the attention kernel.
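The rotation referenced in the Fast Hadamard Rotation bullet can be written as the standard in-place Walsh-Hadamard butterfly. The NumPy version below is a sketch of the algorithm only; the production hot path is the AVX2 C++ kernel described above, not this Python loop.

import numpy as np

def fwht_inplace(x):
    """In-place Fast Walsh-Hadamard Transform: O(d log d) butterflies, no scratch buffer."""
    d = len(x)
    assert d & (d - 1) == 0, "head dimension must be a power of two"
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    x *= 1.0 / np.sqrt(d)   # orthonormal scaling, so the rotation preserves vector norms

v = np.random.randn(128)    # one head-dim vector, e.g. a key row
fwht_inplace(v)             # outlier mass is spread across all 128 dimensions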

🛠 Installation & Integration

🔹 Install from PyPI (recommended)

pip install adaptq

🔹 Install from source (latest/dev)

git clone https://github.com/l3tchupkt/adaptq.git
cd adaptq

# Build Python bindings (PyBind11)
pip install .

Native Python API

Use the native Python API to work with plain NumPy arrays, bypassing framework tensors entirely:

import numpy as np
from adaptq import Engine

engine = Engine(dim=128, heads=4, bits=4, capacity=2048)
k, v, q = np.random.randn(4, 128), np.random.randn(4, 128), np.random.randn(4, 128)

engine.append(k, v)
output = engine.compute(q)
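Because the engine is a streaming cache, append and compute can be interleaved once per decoded token. Below is a minimal decode-style loop using the same Engine API as above; the random arrays stand in for real per-token keys, values, and queries, and per-token interleaving is assumed from the streaming design rather than taken from the API docs.

import numpy as np
from adaptq import Engine

engine = Engine(dim=128, heads=4, bits=4, capacity=2048)

for t in range(512):
    k = np.random.randn(4, 128)   # new key for this token, one row per head
    v = np.random.randn(4, 128)   # new value for this token
    engine.append(k, v)           # quantize and append to the streaming KV cache
    q = np.random.randn(4, 128)
    out = engine.compute(q)       # attention over all tokens cached so far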

llama.cpp Adapter

Use AdapTQ as a replacement for the native KV cache during the GGML compute phase. (Requires the llm_build_kqv hooks; see /integration/llama_cpp_patch.md for the full unified patch.)

#include "adapters/adapter_llamacpp.h"
LlamaCppAdaptQAdapter adapter(n_heads, head_dim, bits, capacity, seed, v_mass, hybrid_thr);
adapter.feed_kv(head, key_array, val_array, token_pos);
adapter.attention(head, query_array, out_array);

📝 License

See the repository license. Developed from the work described in AdapTQ: Adaptive Streaming Vector Quantization for Edge-Deployed Large Language Models.
