v0.8.1 worked but leaked ~65 KB per ask() call, because the Python wrapper couldn't safely call libc.free() on a pointer allocated inside libquant.dylib's malloc heap (cross-zone abort on macOS arm64).
v0.8.2 adds a tiny new export to the public C API:
```c
// quant.h
void quant_free_string(char* str);
```

The implementation lives in the same translation unit as `quant_ask`, so its `free()` call uses the dylib's malloc zone — same heap, no abort. The Python wrapper now calls `lib.quant_free_string(ptr)` instead of skipping the free.
Backwards compat: the binding uses hasattr(lib, 'quant_free_string') so older single-headers loaded via QUANTCPP_LIB=... continue to work (with the old leak behavior). New installs from PyPI 0.8.2 ship the updated quant.h.
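A minimal sketch of that guard (illustrative only; `free_response` is a hypothetical helper name, and `lib` stands for the `ctypes.CDLL` handle the binding already holds):

```python
import ctypes

def free_response(lib, ptr):
    """Release a string returned by quant_ask, if the loaded library
    exports quant_free_string (0.8.2+); otherwise skip the free,
    matching the 0.8.1 leak-on-purpose behavior."""
    if hasattr(lib, "quant_free_string"):
        lib.quant_free_string.argtypes = [ctypes.c_void_p]
        lib.quant_free_string.restype = None
        lib.quant_free_string(ptr)
        return True   # freed inside the dylib's own malloc zone
    return False      # old single-header: known, accepted leak
```

On a `ctypes.CDLL`, `hasattr` performs the actual symbol lookup, so the probe is exactly "does this build export the function".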
Verified: `Model("model.gguf").ask("hi")` × 3 in a row, clean exit, no abort, no leak warning under faulthandler.
This is the 7th correction logged in v0.6.x → v0.8.x. Found by running Model.ask repeatedly in the v0.8.1 verification cycle. Goal: stay 100% self-found before any external user reports a regression.
`kv_compress=1`/`=2` re-enable in Python bindings — still requires `quant.h` regeneration against the v0.8.0+ multi-file source (the bundled header is an Apr-6 snapshot whose UNIFORM_4B path aborts on Llama). Tracked in Issue #18. Will land when we regenerate the single header.
Two critical bugs were found in the v0.8.0 Python bindings within hours of publishing — by running an end-user simulation (pip install in a clean venv → Model("file.gguf").ask("question")). Both bugs were live for v0.8.0; v0.8.1 fixes them.
Root cause: the Python wrapper defaulted to kv_compress=1, which routed through the bundled quant.h's UNIFORM_4B KV path. The single-header is an Apr-6 snapshot that pre-dates the v0.8.0 multi-file source by several days, and that older KV path aborts on Llama-architecture models.
Fix: default kv_compress=0 (no KV compression) in v0.8.1. Non-zero values warn and fall back. The CLI quant binary, which uses the multi-file engine, continues to work with all KV types.
A real fix waits on a fresh quant.h regen against the v0.8.0+ tree (tracked as v0.8.2).
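The v0.8.1 stopgap can be sketched as follows (a hypothetical helper; the real binding's parameter handling may differ in detail):

```python
import warnings

def resolve_kv_compress(requested: int) -> int:
    """v0.8.1 behavior sketch: default to 0 (no KV compression);
    non-zero values warn and fall back until the bundled quant.h
    is regenerated (Issue #18)."""
    if requested != 0:
        warnings.warn(
            "kv_compress=%d needs a regenerated quant.h; "
            "falling back to 0" % requested
        )
    return 0
```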
Root cause: quant_ask allocates the response string inside libquant.dylib's malloc heap. The Python wrapper called ctypes.CDLL(None).free(ptr) to release it — but on macOS arm64, that handle resolves to a different malloc zone than the dylib's. Cross-zone free → abort.
Fix: skip the explicit free in v0.8.1. We accept a ~65 KB leak per ask() call as a temporary tradeoff; quant_free_ctx / quant_free_model release the bulk of the memory at end of session. Tracked: add quant_free_string(void*) wrapper to quant.h in v0.8.2.
This is corrections #5 and #6 in the project history (after the four in v0.6.x → v0.7.x). Both were caught by the project's own end-user-simulation testing, before any external user reported them. The pattern stands: publish, simulate the user, fix in hours.
PyPI 0.8.0 should be yanked; we strongly recommend upgrading to 0.8.1. Note that yanking only hides a release from new `pip install` resolution — anyone with a pinned `==0.8.0` install can still use it.
Round 10/11's NEON vqtbl1q_s8 / vqtbl2q_s8 table-lookup pattern is now mirrored on x86 AVX2 for all four turbo_kv attention variants. The breakthrough that achieved fp32 parity on Apple Silicon now extends to Linux/Windows x86-64 builds.
| Variant | NEON instruction | AVX2 instruction(s) | Layout |
|---|---|---|---|
| 4b | `vqtbl1q_s8` | `_mm_shuffle_epi8` | 16-entry codebook fits in 1 register |
| 5b | `vqtbl2q_s8` | 2× `_mm_shuffle_epi8` + `_mm_blendv_epi8` | 32-entry codebook split low/high |
| 5b_fast | `vqtbl2q_s8` | same as 5b, no bit-unpack | direct 1-byte index loads |
| 3b | `vqtbl1q_s8` (lower 8) | `_mm_shuffle_epi8` | 8-entry fits trivially |
The 32-entry codebook (5b/5b_fast) needs the BLENDV bit-trick on AVX2 since PSHUFB is per-lane 16-entry only. Performance is unmeasured on x86 in this release (CI builds & runs the new tests; benchmarking deferred to v0.8.x).
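The bit-trick can be illustrated in pure Python (an emulation of the lane semantics, not the real intrinsics): shuffle the low and high 16-entry halves separately, then blend on bit 4 of each 5-bit index, shifted up into the byte's sign bit so BLENDV's mask convention applies.

```python
def pshufb(table16, idx_bytes):
    """Emulate _mm_shuffle_epi8 on one lane: each output byte is
    table16[idx & 0x0F] (high-bit zeroing rule ignored for clarity)."""
    return [table16[i & 0x0F] for i in idx_bytes]

def blendv(a, b, mask):
    """Emulate _mm_blendv_epi8: pick b where the mask byte's
    high (sign) bit is set, else a."""
    return [bv if m & 0x80 else av for av, bv, m in zip(a, b, mask)]

def lookup_32_entry(codebook32, indices):
    """5-bit gather: shuffle both 16-entry halves, then blend on
    bit 4 of each index, shifted into the sign-bit position."""
    lo, hi = codebook32[:16], codebook32[16:]
    mask = [(i << 3) & 0xFF for i in indices]   # bit 4 -> bit 7
    return blendv(pshufb(lo, indices), pshufb(hi, indices), mask)
```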
Tests added:
- `TurboKVRegression.KV_5B_FAST_AttentionCosine` — was missing coverage; now exercises 5b_fast on synthetic Gaussian keys (cosine > 0.999).
Added tq_metal_diag_get/reset() flush counter so the PPL tool prints flushes/token and ops/flush at end of run. Reproducing the issue's exact command on Llama 3.2 3B Q8_0 turbo_kv_4b shows 0 flushes/token — Metal batch path is never entered for Q8_0 weights because the gate layer_has_gguf requires gguf_w* (Q4_K on-the-fly path). Metal=ON and Metal=OFF are now identical in throughput on this model.
The remaining suspected slowdown sources (Q4_K + tq_metal_forward_layer Q4 path) are documented as next steps in the issue. The diag counter unblocks anyone with the right model from getting empirical numbers in one command.
`tools/quant.c` gains `--save-logits` and `--kl-baseline` for two-pass KL measurement against an fp32 baseline:

```
quant model.gguf --ppl text.txt -k fp32 --save-logits base.bin
quant model.gguf --ppl text.txt -k turbo_kv_4b --kl-baseline base.bin
# → "KL divergence (baseline || quantized): mean = 0.157466 over 1040 tokens"
```

This is the standard llama-perplexity-style validation needed by the upcoming llama.cpp PR (docs/pr/2026-04-09-llama-cpp-pr-draft.md).
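The metric itself is a plain mean of token-level KL over the vocabulary. A reference computation (illustrative Python over hypothetical per-token logits lists; the tool computes this in C from the saved logits):

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mean_kl(baseline_logits, quantized_logits):
    """Mean KL(baseline || quantized) over token positions,
    mirroring the metric printed by --kl-baseline."""
    total = 0.0
    for base, quant in zip(baseline_logits, quantized_logits):
        p, q = softmax(base), softmax(quant)
        total += sum(pi * math.log(pi / qi)
                     for pi, qi in zip(p, q) if pi > 0)
    return total / len(baseline_logits)
```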
- vdotq query quantization (v0.9.0 candidate): replacing the int8→fp32→fma chain with `vdotq_s32(int8_codebook, int8_query)` gave +6% speed but a +1.5% PPL regression on turbo_kv_4b. The cosine test (>0.99) was not sensitive enough to catch it; PPL gating caught it. Reverted; documented in memory `feedback_vdotq_query_quant_tradeoff`.
- WASM SIMD port: requires un-stubbing turbo_kv attention in `quant.h` (single-header) first. Tracked for v0.8.1.
After Round 10 (turbo_kv_4b at fp32 parity via vqtbl1q_s8), Round 11 applied the same SIMD codebook lookup pattern to the remaining production variants. The lookup side scales beautifully (1 instruction per 16 lanes for any small codebook), but the bit-unpack side is the new bottleneck for non-byte-aligned packing.
Llama 3.2 3B PPL eval, 3 runs each (CPU-only, no Metal):
| Type | Round 10 → Round 11 | Δ | vs FP32 (R11) | PPL Δ |
|---|---|---|---|---|
| FP32 | 17.87 → 18.43 t/s | +3% | baseline | — |
| `turbo_kv_3b` | 16.10 → 16.57 t/s | +3% | −10.1% | +13.3% |
| `turbo_kv_4b` ⭐ | 18.17 → 18.17 t/s | parity (R10 stable) | −1.4% ✅ | +3.8% |
| `turbo_kv_5b` 🏆 | 15.43 → 16.80 t/s | +9% | −8.8% | +0.7% |
| Type | Bit packing | Unpack | Result |
|---|---|---|---|
| 4b | byte-aligned (2 nibbles/byte) | pure SIMD `vandq_u8` + `vshrq_n_u8` | parity ✅ |
| 3b | bit-aligned (irregular 3-bit fields) | uint64 read + scalar shifts | −10.1% |
| 5b | bit-aligned (irregular 5-bit fields) | uint64 read + scalar shifts | −8.8% |
For 3-bit and 5-bit, 16 indices straddle byte boundaries irregularly. We use the fastest scalar unpack we found (uint64 read + 16 scalar shifts + vld1q_u8) but it costs ~16 instructions per 16-element iteration. The lookup itself is 1 instruction. So the unpack now dominates for 3b/5b.
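The scalar unpack can be illustrated in Python (an emulation of the wide-read trick; LSB-first field order is an assumption, consistent with the packing convention the project describes elsewhere):

```python
def pack_3bit(indices):
    """Pack 16 3-bit indices LSB-first into 6 bytes (48 bits)."""
    word = 0
    for pos, idx in enumerate(indices):
        word |= (idx & 0x7) << (3 * pos)
    return word.to_bytes(6, "little")

def unpack_3bit(packed):
    """Mirror of the C fast path: one wide integer read, then one
    shift+mask per index -- ~16 scalar ops per 16-element block,
    vs a single table-lookup instruction for the gather itself."""
    word = int.from_bytes(packed, "little")
    return [(word >> (3 * pos)) & 0x7 for pos in range(16)]
```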
While investigating other optimization axes, we discovered that the GGUF Q4 matmul code (tq_gguf_quants.c:1561) already uses vqtbl1q_s8 for the codebook lookup. That's why fp32 and turbo_kv have identical matmul time (38.6 vs 38.9 ms in profile) — they both share the same NEON tbl matmul kernel.
This is why Round 10 worked: we'd been using NEON tbl in matmul since v0.5, but had built the attention path with scalar gather. Once we applied the same primitive to attention, the gap closed. Round 11 extended it to 3b/5b but hit the bit-packing constraint.
- 5b/3b at full parity. The remaining gap is in the unpack, not the lookup. Closing it needs either (a) a layout change (1-byte-per-index, sacrificing compression), (b) a SIMD bit-extraction trick, or (c) acceptance. We chose (c) for v0.7.1 with honest disclosure.
- `turbo_kv_4bo` / `turbo_kv_3bo` — research types, still on Round 9 path
- AVX2 / WASM SIMD ports of the NEON tbl pattern — separate session
| File | Change |
|---|---|
| `src/core/tq_turbo_kv.c::tq_turbo_kv_3b_attention_ref` | NEON tbl + uint64 unpack |
| `src/core/tq_turbo_kv.c::tq_turbo_kv_5b_attention_ref` | NEON tbl + uint64 unpack |
| `README.md`, `README.ko.md` | Round 11 numbers |
| `CHANGELOG.md` | This entry |
| Memory `feedback_simd_unpack_constraint.md` | Documents the byte-alignment constraint for future work |
35/35 tests pass. PPL unchanged.
After 10 rounds of Karpathy iteration (3 sessions), turbo_kv_4b now runs at fp32 KV parity on Llama 3.2 3B PPL eval. This is the breakthrough we've been chasing for 3 sessions:
| Type | Bytes/block | Compression | PPL | Δ vs FP32 | tok/s | vs FP32 speed |
|---|---|---|---|---|---|---|
| FP32 KV | — | 1× | 13.56 | — | 17.9 | baseline |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.08 | +3.8% | 18.7 | +4.5% ⬆ |
Rounds 1–9 had been optimizing local fusions in the inner loop without measuring where time was actually going. Profile data at long context (PPL eval, seq_len ~950) finally revealed the diff:
- matmul: 38.6ms (fp32) vs 38.9ms (turbo_kv_4b) — same code path
- attention: 15.7ms (fp32) vs 19.8ms (turbo_kv_4b) — +4.1ms
- The entire 8% gap was in attention, and the entire 4.1ms was in the inner dot-product loop
Root cause: turbo_kv inner loop was scalar (LUT load + mul + add per element) while fp32 was 4-way NEON SIMD. About 2× more instructions per element. The dequant lookup had become compute-bound, not memory-bound.
Fix (Round 10): NEON 16-entry table lookup via vqtbl1q_s8.
- Quantize the 16 Lloyd-Max-Gaussian centroids to int8 once at startup
- Per-block: load 16 bytes of mse_indices = 32 nibbles
- Split low/high nibbles via `vandq_u8` + `vshrq_n_u8`
- `vqtbl1q_s8` for centroid gather (1 instruction, 16 lanes)
- Convert int8 → int16 → fp32, multiply by per-block scale, FMA against query
- 32 elements per iteration vs the previous 8 elements scalar
The int8 codebook discretization loses ~1% precision (well below the regression test threshold of cosine ≥ 0.99). PPL improved from 14.33 to 14.08 — the discretization happens to align favorably (or it's regression-to-mean, both directions are within noise).
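Put together, the per-block inner loop amounts to the following (a scalar Python emulation of the NEON steps; the codebook values and the low-then-high output order are illustrative assumptions):

```python
def dequant_block_4b(packed_index_bytes, int8_codebook, scale):
    """Emulate the Round 10 steps on one block's packed nibbles:
    nibble split -> 16-entry table gather -> per-block scale."""
    lo = [b & 0x0F for b in packed_index_bytes]   # vandq_u8
    hi = [b >> 4 for b in packed_index_bytes]     # vshrq_n_u8
    idx = lo + hi                                 # 2 nibbles per byte
    gathered = [int8_codebook[i] for i in idx]    # vqtbl1q_s8 gather
    return [scale * g for g in gathered]          # int8 -> fp32 scale
```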
| Model | turbo_kv_4b speed gap (R9 → R10) | PPL Δ vs FP32 |
|---|---|---|
| SmolLM2 135M | -14.5% → -3.1% | +5.7% |
| Llama 3.2 1B | -16.3% → -1.3% | +5.4% |
| Llama 3.2 3B | -8.4% → +4.5% ⬆ | +3.8% |
All three models show massive speed improvement. Llama 3.2 3B (3-run average +0.8%, single run +4.5%) is now at parity or slightly faster than fp32 KV. Smaller models still have a small gap because relative attention overhead dominates.
| Before | After |
|---|---|
| "92% of fp32 speed at 7× compression" | "PARITY with fp32 speed at 7× compression" |
turbo_kv_4b is now strictly Pareto-dominant over uniform_4b: better PPL, better speed, comparable compression. And it's the first KV quantization in the project that gives 7× memory savings without speed loss vs fp32.
- Block layout (still 72 bytes per block)
- Public API
- Quality regression tests pass (cosine ≥ 0.99 for 4b)
- 5b and 3b variants — still on the Round 9 scalar path (planned for v0.7.1)
- `src/core/tq_turbo_kv.c::tq_turbo_kv_4b_attention_ref` — NEON tbl inner loop
- `README.md` / `README.ko.md` — headline tables show parity
- This CHANGELOG entry
35/35 tests pass. CI green.
P3 (Metal compute graph for KV attention) investigation revealed that the existing Metal backend (TQ_BUILD_METAL=ON) is net negative on every model size we tested — 13–40% slower than CPU-only. The CMake default has always been OFF, so end users were getting the fast path. But our internal benchmarks (including all numbers in v0.6.0–v0.6.4 release notes) used -DTQ_BUILD_METAL=ON and were therefore 14–22% slower than what users actually get.
| Type | Bytes/block | tok/s (Metal OFF) | vs FP32 | PPL Δ |
|---|---|---|---|---|
| FP32 KV | — | 18.13 | baseline | — |
| `turbo_kv_4b` ⭐ | 72 | 16.60 | −8.4% | +5.7% |
| `turbo_kv_3b` | 56 | 15.77 | −13.0% | +13.3% |
| `turbo_kv_5b` 🏆 | 88 | 15.43 | −14.9% | +0.7% |
| `turbo_kv_4bo` | 96 | 15.20 | −16.2% | +2.5% |
| `uniform_4b` | 68 | 13.27 | −26.8% | +7.7% |
The relative gaps to FP32 are essentially unchanged (turbo_kv_4b is still ~8% slower) — both paths got the same ~20% speedup from removing Metal overhead. Pareto rankings unchanged.
| Model | Metal OFF speedup vs Metal ON |
|---|---|
| SmolLM2 135M | neutral (within noise) |
| Llama 3.2 1B | +13–17% |
| Llama 3.2 3B | +14–22% |
| Gemma 4 26B | +40% |
Even on the largest model (Gemma 4 26B), Metal is net negative. Per-matmul dispatch overhead + waitUntilCompleted sync exceed the GPU compute benefit at batch-1 inference. Filed issue #16 with investigation plan.
| File | Change |
|---|---|
| `README.md`, `README.ko.md` | Re-baselined headline tables and ASCII charts. New build note linking to issue #16. |
| `CHANGELOG.md` | This entry. |
| Issue #16 | Filed: Metal backend is currently slower than CPU-only |
No source code changes — the CMake default was already OFF. The bug was in our internal benchmark methodology (we built with Metal ON without realizing it was slowing things down).
This is now the fourth honest correction we've caught and fixed before it spread:
- v0.6.0: "lossless 7× compression" → measured "+6.3% PPL on Llama 3.2 3B"
- v0.6.4: "turbo_kv beats fp32 KV speed" → measured "−7% vs fp32 (NEON)"
- v0.6.5: "benchmarks with Metal" → re-measured "benchmarks without Metal (which is the user default)"
- v0.6.5 (post-release): "Tim Dettmers gave us direct feedback" → "Tim's general comment to a thread we participate in happened to apply to us; we incorporated it voluntarily, not as a direct response". Earlier docs and the v0.6.4 commit message overstated the relationship; the substance of HIGGS attribution is unchanged but the framing has been corrected in README, README.ko, the arXiv draft, and `bench/results/turboquant_reproduction.md`.
Each correction was caught by the validation discipline documented in our feedback_validation_first memory. Validation > marketing.
This patch release exists to publish the corrected speed numbers for v0.6.3 prominently. The v0.6.3 release shipped with the wrong headline ('turbo_kv beats fp32 KV speed') because the fp32 attention path was being compared in unoptimized scalar form. After NEON fix, the honest gap is −7% to −12%, not +5% to +10%.
- `tq_transformer.c`: NEON-optimized the fp32 attention path (commit `4490c83`). FP32 attention went from 12.6 → 14.83 tok/s on Llama 3.2 3B (+18% standalone improvement).
- `README.md` / `README.ko.md`: corrected the headline tables and ASCII charts to reflect the honest fp32-NEON comparison (commit `33b6315`).
- GitHub release notes for v0.6.3: updated with a prominent Correction notice at the top.
- `tq_transformer.c`: Round 8 prefetch attempt reverted (no measurable benefit on Apple M1 Pro). Round 9 strided-attention not pursued (would require ABI change with no clear win).
| Type | Avg tok/s | vs FP32 | PPL Δ | Compression |
|---|---|---|---|---|
| FP32 KV (NEON) | 14.63 | baseline | — | 1× |
| `turbo_kv_4b` ⭐ default | 13.57 | −7.2% | +5.7% | 7.1× |
| `turbo_kv_3b` | 13.13 | −10.2% | +13.3% | 9.1× |
| `turbo_kv_5b` 🏆 quality | 12.90 | −11.8% | +0.7% | 5.8× |
- Validation is the most valuable step. It found the wrong claim before it spread to users.
- The Round 5 win is real. turbo_kv_4b went from 6.9 → 13.6 tok/s (+97%). Just the comparison baseline was wrong.
- Local optimum reached. Rounds 8 and 9 (prefetch, strided gather) gave no measurable improvement. Further wins would need structural changes (e.g., a different KV cache memory layout, or true parallel attention dispatch).
- Pareto improvement is still real. turbo_kv_4b dominates `uniform_4b` on quality (14.33 vs 14.60 PPL) AND speed (13.57 vs 11.7 tok/s) AND compression (7.1× vs 7.5× — close enough).
Correction: this entry originally claimed 'turbo_kv beats fp32 KV speed'. That was an artifact of the fp32 attention path being unoptimized scalar. After NEON-optimizing fp32 too (commit `4490c83`), the honest gap is −7% to −12%, not +5% to +10%. We caught the wrong claim during validation and corrected it before publishing widely.
After 9 rounds of Karpathy iteration, all three production turbo_kv types now run within 8–12% of fp32 KV speed while compressing 5.8–9.1×:
| Type | Bytes/block | tok/s | vs FP32 | PPL | Δ vs FP32 |
|---|---|---|---|---|---|
| FP32 KV (NEON) | — | 14.83 | baseline | 13.56 | — |
| `turbo_kv_4b` ⭐ | 72 | 13.67 | −7.8% | 14.33 | +5.7% |
| `turbo_kv_3b` | 56 | 13.4 | −9.6% | 15.36 | +13.3% |
| `turbo_kv_5b` 🏆 | 88 | 13.13 | −11.5% | 13.65 | +0.7% |
The biggest win came from `tq_transformer.c`. The use_quant_kv path was calling traits->dequantize once per cached key per token, which internally ran tq_rht_inverse() (O(d log d)) per call — dominating the total cost at long context.
Round 5 changes the inner loop to use the type's optimized traits->attention kernel, which:
- Pre-rotates the query ONCE per layer
- Does fused dequant + dot product per block in rotated space
- Skips per-position inverse RHT entirely
Old slow path is preserved as a fallback for the complex cases: QK-norm-on-stored-keys, k_highres_window, sliding-window attention.
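Why skipping the per-position inverse is safe: the RHT is orthonormal, so the score ⟨R⁻¹k, q⟩ equals ⟨k, Rq⟩, and the query can be rotated once instead of inverse-rotating every cached key. A toy demonstration (pure Python; the engine's actual RHT implementation differs in detail):

```python
import random

def hadamard(vec):
    """Fast Walsh-Hadamard transform with orthonormal scaling."""
    v, n, h = list(vec), len(vec), 1
    while h < n:
        for i in range(0, n, h * 2):
            for j in range(i, i + h):
                v[j], v[j + h] = v[j] + v[j + h], v[j] - v[j + h]
        h *= 2
    return [x / n ** 0.5 for x in v]

def rht(vec, signs):
    """Rotate: random sign flip, then orthonormal Hadamard."""
    return hadamard([s * x for s, x in zip(signs, vec)])

def rht_inverse(vec, signs):
    """Inverse rotation: Hadamard, then the same sign flip."""
    return [s * x for s, x in zip(signs, hadamard(vec))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

random.seed(0)
d = 8
signs = [random.choice((-1, 1)) for _ in range(d)]
query = [random.gauss(0, 1) for _ in range(d)]
keys = [[random.gauss(0, 1) for _ in range(d)] for _ in range(4)]
stored = [rht(k, signs) for k in keys]      # cache holds rotated keys

# old path: inverse-rotate EVERY cached key, O(d log d) per position
slow = [dot(rht_inverse(s, signs), query) for s in stored]
# new path: rotate the query ONCE, plain dots in rotated space
rq = rht(query, signs)
fast = [dot(s, rq) for s in stored]
assert all(abs(a - b) < 1e-9 for a, b in zip(slow, fast))
```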
| Round | What changed | Llama 3.2 3B turbo_kv_4b tok/s |
|---|---|---|
| 0 | Baseline (per-position dequant + inline dot) | 6.9 |
| 1 | Single-pass dequant with hoisted LUT | 7.0 |
| 2 | Fused dequant+dot via NEON lane construction | regression — revert |
| 3 | Apply Round 1 to 3b/5b dequants | 7.0 |
| 4 | Pure scalar fused with 4 accumulators | 7.0 |
| 5 | transformer uses traits->attention (no per-pos RHT inverse) | 13.5 ✅ |
| 6 | Hoist LUT in 4bo/3bo dequants | 13.9 |
PPL changed slightly across the FP reordering (0.3–0.5% increase per type, all within the regression test cosine ≥ 0.99/0.999 thresholds). 35/35 tests pass.
- New tracking issue #15 follow-up notes for per-head rotation seeds and Llama 3.1 8B + LongBench-E reproduction (still open)
- 🆕 `turbo_kv_4bo` / `turbo_kv_3bo` — per-block outlier handling research types. Each block stores the K=8 channels with the largest |rotated[i]| as exact FP16 values that overwrite the codebook reconstruction at dequant time. This is a simpler local form of the per-channel outlier handling described in the Google TurboQuant paper.
- Karpathy-loop validation: per-channel outliers cut the PPL gap by more than half on Llama 3.2 3B (4b: +5.3% → 4bo: +2.2%). Effect is model-dependent — see notes below.
- Issue #15 progress: closes the per-channel outlier handling exploration item. 5b remains the recommended quality option; 4bo/3bo ship as experimental.
| Type | Bytes/block | Compression | PPL | Δ vs FP32 | Status |
|---|---|---|---|---|---|
| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% | aggressive |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% | production |
| `turbo_kv_3bo` 🧪 | 80 | 6.4× | 14.03 | +3.5% | research |
| `turbo_kv_5b` 🏆 quality | 88 | 5.8× | 13.60 | +0.34% | production |
| `turbo_kv_4bo` 🧪 | 96 | 5.3× | 13.86 | +2.2% | research |
Per-channel outlier handling is data-dependent:
- On Llama 3.2 3B (head_dim=128, heavier tails), `3bo` Pareto-improves over `4b`
- On SmolLM2 135M (smaller dimensions), `3bo` regresses past `4b` because the 3-bit base is too coarse
- `4bo` is dominated by `5b` on both models — slightly bigger and slightly worse
Until per-model auto-selection is implemented, the Pareto-optimal recommendations remain turbo_kv_4b (default) and turbo_kv_5b (quality). The outlier types are exposed for researchers and benchmarking.
- 🆕 `turbo_kv_5b` — near-lossless KV at +0.34% PPL on Llama 3.2 3B. Uses a 32-level Lloyd-Max-Gaussian codebook (Max 1960 Table I) on RHT-rotated values. 88-byte block (vs 72 for 4b). The new quality-maximizing option for users who can spare 22% more KV memory than 4b.
- Regression tests — three deterministic synthetic-data tests pin the attention cosine quality of `turbo_kv_4b` (≥0.99) and `turbo_kv_5b` (≥0.999), and assert 5b ≥ 4b on the same data. Future Karpathy-loop iterations cannot regress past these thresholds without failing CI.
| Type | Bytes/block | Compression | PPL | Δ vs FP32 |
|---|---|---|---|---|
| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
| `turbo_kv_5b` 🏆 | 88 | 5.8× | 13.60 | +0.34% |
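The compression column follows directly from the block layout arithmetic, assuming a 128-element block (128 × 4 bytes = 512 bytes at fp32; the block size is an assumption, chosen because it reproduces every ratio in the table):

```python
# Assumed: 128 fp32 elements per block = 512 uncompressed bytes.
FP32_BLOCK_BYTES = 128 * 4

for name, block_bytes in [("turbo_kv_3b", 56),
                          ("turbo_kv_4b", 72),
                          ("turbo_kv_5b", 88)]:
    ratio = FP32_BLOCK_BYTES / block_bytes
    print(f"{name}: {ratio:.1f}x")
# → turbo_kv_3b: 9.1x / turbo_kv_4b: 7.1x / turbo_kv_5b: 5.8x
```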
- `TurboKVRegression.KV_4B_AttentionCosine` — pins ≥ 0.99
- `TurboKVRegression.KV_5B_AttentionCosine` — pins ≥ 0.999
- `TurboKVRegression.KV_5B_BeatsKV_4B` — invariant: more bits ⇒ ≥ accuracy
35/35 tests pass on macOS / Linux / Windows.
- 🏆 turbo_kv_4b is the new champion — beats both `uniform_4b` and llama.cpp `q4_0` KV at the same 4-bit budget on Llama 3.2 3B (PPL 14.28 vs 14.41 vs ~14.99). Reached after 6 rounds of Karpathy-loop iteration starting from a literal port of Google TurboQuant (ICLR 2026).
- CLI default switched — `quant model.gguf` now uses `turbo_kv_4b` automatically. `uniform_4b` remains available via `-k uniform_4b`.
- Honest TurboQuant reproduction story — full ablation history, public issue #14, no overstated claims. The shipped `turbo_kv_*` is structurally simpler than the paper (single-stage RHT + Lloyd-Max codebook + ‖x‖) but empirically beats the literal two-stage port on our benchmark.
- @quantcpp/wasm npm package — `npm install @quantcpp/wasm` to drop a 192KB GGUF inference engine into any web project.
- Windows CI green — pthread_cond_wait SRWLOCK deadlock fixed, MSVC `__builtin_*` shims, /tmp paths in tests, M_PI in test_neon_scalar. 35/35 tests pass on macOS / Linux / Windows.
- Public PR & issue triage — PR #12 (5 critical bug fixes from MChorfa) cherry-picked into main; PR #13 reformatting noise rejected, examples README + CMake separation salvaged.
The turbo_kv_3b / turbo_kv_4b block layouts changed in this release. The qjl_signs field is gone — Karpathy-loop ablation showed it contributed byte-identical zero to attention scores. The freed 16 bytes per block are now used for a 2× larger Lloyd-Max codebook. Same total block size, finer reconstruction, single-stage estimator.
| KV type | Bits/elem | Llama 3.2 3B PPL | Δ vs FP32 |
|---|---|---|---|
| FP32 baseline | 32 | 13.56 | — |
| `turbo_kv_4b` ⭐ | 4 | 14.28 | +5.3% |
| `uniform_4b` | 4 | 14.41 | +6.3% |
| `turbo_kv_3b` | 3 | 15.39 | +13.5% |
| llama.cpp q4_0 KV (rough) | 4 | ~14.99 | +10.6% |
- New `docs/positioning.md` — quant.cpp = the single-header C reference engine for the embedded niche (iOS, Android, WASM, MSVC, microcontrollers, game engines)
- README repositioned to honest "production = turbo_kv_4b, research = building blocks" framing with full PPL methodology
- Citations to Google TurboQuant, PolarQuant, QJL papers added throughout
- `wasm/package.json` + ESM `index.mjs` + `index.d.ts` for npm publishing
- `examples/README.md` (cherry-picked from PR #13) — comprehensive embedding examples doc
- CMake `TQ_BUILD_EXAMPLES` option, single-header examples link only against libm + threads
- Windows CI test timeouts bumped to 600s for slow non-vectorized builds
- `tq_qjl.c`: NaN guard requires `dim > 0`
- `tq_uniform.c`: heap-allocate Q8 query buffer (was 512B stack array)
- `tq_transformer.c`: NULL-check key/value cache calloc results
- `tq_ops.c`: Windows pthread_cond_wait must use `SleepConditionVariableSRW`, not the `CS` variant (caused test_ops deadlock on first Windows green run)
- Per-channel outlier handling (Google paper's 32-channel split)
- Paper-faithful Llama 3.1 8B + LongBench-E reproduction
- 5-bit codebook variant for higher quality at ~5 bpc
- Gemma 4 26B-A4B MoE — Full support for 128-expert MoE, dual-FFN, hybrid attention, QK-norm, learned RoPE, GeGLU
- Llama 3.2 3B — Verified at 17 tok/s with correct code generation
- 7x KV cache compression — 350K tokens on 16GB Mac (was 50K), PPL +0.0%
- quant.h synced — Single header now includes IQ3_XXS, Gemma 4, Llama 3 support
- WASM browser demo — 192KB binary, GitHub Pages deployed
- MSVC Windows — Visual Studio 2019/2022 compilation support
- Metal GPU infrastructure — 7 compute kernels, full-layer pipeline (disabled for batch-1, ready for batch inference)
- Gemma 4 MoE: 10 architecture bugs fixed (dual-FFN loading, layer_output_scale, attention_scale, V-norm, etc.)
- QK-norm aware KV compression: auto FP32 keys for sparse distributions (cosine 0.62 → 1.00)
- IQ3_XXS dequantization with 256-entry grid codebook
- NEON fused dot for IQ3_XXS, IQ4_NL, Q8_0 (two-accumulator with prefetch)
- GeGLU NEON (fast tanh via Schraudolph approximation)
- GGUF embedding: skip 2.8GB FP32 allocation, use Q6_K fused dot directly
- Gemma 4 26B: 549ms → 257ms/token (-53%)
- SmolLM2 135M: 96 tok/s (CPU NEON Q4×Q8)
- Llama 3.2 3B: 17 tok/s
- KV compression: uniform_4b + Q4V = 6.9x, delta 3b + Q4V = 8.5x
- MSVC: pthread shim (CRITICAL_SECTION), _Interlocked atomics, QueryPerformanceCounter
- WASM: Emscripten build (192KB), drag-and-drop GGUF, streaming output
- Metal: RoPE, GELU, softmax, attention_qk, attention_v, kv_cache_write, matmul_tq_q4_fast kernels
- GitHub Pages: live demo at quantumaikr.github.io/quant.cpp
- docs/api.md: Full C API reference (730 lines) — single-header + full library
- docs/custom-quantization.md: Step-by-step guide to add new KV quantization types
- docs/papers/quant_cpp_tech_report.md: Arxiv tech report draft
- ROADMAP.md: Project direction ("LLM의 SQLite" + research platform)
- 3 embedding examples: minimal (30 lines), chat (60 lines), KV compare
- Fix Gemma 4 NaN regression (model_type set after hybrid detection)
- Fix Llama GQA head_dim misdetection (was 64 instead of 128)
- Fix TQ_STATIC_ASSERT no-op in C mode (now C11 _Static_assert)
- Fix stack buffer overflow in attention debug (recon[256] → recon[512])
- Fix Accelerate macro redefinition warning
- llama.cpp Discussion #20969: shared TurboQuant implementation status
- vLLM-omni Issue #2215: KV compression RFC contribution
- Closed Issues #5 (Q3_K_M), #7 (API docs), #10 (MSVC), #11 (static_assert)
- Self-contained inference engine — loads Qwen3.5-0.8B, generates text at 14 tok/s on CPU
- 17x faster than PyTorch CPU, 1.4x faster than PyTorch on Apple GPU
- Q8 weight quantization — 4x memory reduction (2.1 GB → 533 MB), `-q` flag
- Streaming BF16 — embed/lm_head kept as mmap'd BF16, saves ~1 GB
- Multi-threaded matmul — 4-thread pthread, 1.56x speedup
- DeltaNet + Self-Attention — full Qwen3.5 hybrid architecture in C
- HuggingFace BPE tokenizer — 248K vocab, encode/decode
- KV cache quantized in inference — Q4 keys, integer Q4×Q8 attention
- Integer-domain attention: 2.9-4.8x faster than FP32 on Apple Silicon (ARM NEON `vdotq_s32`)
- Real model validated: Qwen3.5-0.8B KV cache, cosine 0.994 (A+)
- 8 quantization types including mixed precision outlier and RHT pre-rotation
- K/V asymmetric: independent key/value bit allocation (K4V2 = 9.8x compression)
- Community validated: r/LocalLLaMA findings integrated
The single biggest performance breakthrough: instead of dequantizing Q4 keys to FP32, quantize the query to Q8 and compute integer dot products directly.
Before (v0.6): Q4 key → dequantize → FP32 dot = 0.49x vs FP32 (SLOWER)
After (v0.7): Q4 key × Q8 query → integer dot = 2.9-4.8x vs FP32 (FASTER)
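In scalar terms, the scheme looks like this (an illustrative Python sketch with symmetric per-vector quantization; the real kernel packs the Q4 side and does the integer accumulation with NEON `vdotq_s32`):

```python
def quantize_sym(vec, bits):
    """Symmetric quantization to signed integers with one scale."""
    qmax = (1 << (bits - 1)) - 1
    scale = max(abs(v) for v in vec) / qmax or 1.0
    return [round(v / scale) for v in vec], scale

def int_dot_attention_score(key_fp32, query_fp32):
    """Q4 key x Q8 query: one integer dot product, two scales --
    no per-element dequantization of the key."""
    k_int, k_scale = quantize_sym(key_fp32, 4)   # stored once in cache
    q_int, q_scale = quantize_sym(query_fp32, 8) # done once per query
    acc = sum(k * q for k, q in zip(k_int, q_int))  # integer-only loop
    return acc * k_scale * q_scale
```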
Fair NEON-vs-NEON benchmark (Apple M-series, median of 7 runs):
- dim=128, seq=2048: FP32 22.8μs → Int Q4×Q8 7.8μs (2.9x)
- dim=256, seq=2048: FP32 57.7μs → Int Q4×Q8 12.5μs (4.6x)
- Larger head_dim benefits more (Q4 data fits in L1 cache)
- 7 quantization types: PolarQuant (3/4b), QJL (1b), quant.cpp (3/4b), Uniform (2/4b)
- Direct attention kernels: QJL Hamming distance, PolarQuant cos/sin LUT (no dequantization needed)
- Self-contained block formats with ONNX-compliant LSB-first bit packing
- O(1) type traits dispatch table (llama.cpp pattern)
- Thread-safe API with pthread mutex (TSan verified)
- Cross-platform math constants (TQ_PI/TQ_PI_2, no M_PI dependency)
- Paged KV cache with block-table mapping (vLLM pattern)
- Progressive compression: 3-tier automatic degradation by age, O(1) append
- Copy-on-Write for beam search (ref_count based)
- Value cache quantization and retrieval
- CPU Generic (reference C11, zero external dependencies)
- ARM NEON optimized (5.74x speedup over generic)
- x86 AVX2 stubs ready for implementation
- CUDA kernels: 7 files (polar, qjl, turbo, fused_cache, value, common, dispatch)
- Metal compute shaders: 7 files (polar, qjl, turbo, fused_cache, value, common, dispatch)
- A/B test: uniform_4b achieves cosine 0.995 vs FP16 — A+ grade, virtually lossless
- Real model validation: cosine 0.991 on Qwen3.5-0.5B KV cache patterns (4 layers, 14 heads)
- Per-layer analysis: quality consistent across depth (cosine >0.98 for uniform_4b)
- Roundtrip MSE: 0.0014 (synthetic), 0.0025 (real model data)
- Quantize throughput: 1.4M elements/ms
- Attention throughput: 137K queries/sec
- Compression ratio: 7.53x (uniform_4b)
- SIMD speedup: 4.0x (NEON vs generic)
- 13 C++ test suites (Google Test): polar, qjl, turbo, uniform, value, paged_cache, progressive, simd_neon, simd_avx2, threading, edge_cases, attention_all_types, llamacpp_integration
- 22 Python tests (unittest): bindings, roundtrip, attention, types
- Total: 35 tests, 100% pass rate
- Sanitizers: ASan + UBSan + TSan clean
- llama.cpp: GGML type registration (7 types, base offset 256), CLI parser with 21 aliases, from_float/to_float/vec_dot wrappers, 10 integration tests
- Python: ctypes bindings with NumPy support, pip installable (`pip install -e .`), quant.cpp class with quantize_keys/dequantize_keys/attention methods
- vLLM: integration scaffold with README guide
- Examples: minimal.c (10 lines), standalone.c, ab_test.c, demo_real_model.c, benchmark_types.cpp, python_quickstart.py, llamacpp_integration.cpp
- Integer overflow protection in size calculations
- NULL pointer and buffer size validation on all public APIs
- Edge case defense: seq_len=0, head_dim<2, odd dimensions
- TQ_ERR_BUFFER_TOO_SMALL error code
- tq_type_from_name() / tq_type_count() convenience functions
- BPE values computed from actual struct sizes
- 5-dimension scoring harness: structure/correctness/quality/performance/integration
- Hierarchical Harness methodology (Karpathy AutoResearch + ClawTeam multi-agent)
- Agent definitions (.claude/agents/): architect, core-dev, perf-dev, qa
- Skill definitions (.claude/skills/): orchestrate, develop, score, qa
- Slash commands (.claude/commands/): /score, /develop, /harness, /spawn-team, /merge-gate
- PRD documents: v0.1 through v0.4
- WBS documents: v0.1 through v0.4
- refs/ absorption audit with checklist
| Model | Context | FP16 Cache | quant.cpp | Saved |
|---|---|---|---|---|
| Llama-3.2-3B | 64K | 7.00 GB | 0.93 GB | 87% |
| Qwen3.5-0.5B | 128K | 10.50 GB | 2.79 GB | 73% |
| Phi-3-mini | 16K | 6.00 GB | 1.59 GB | 73% |
- quant.cpp (ICLR 2026) — arXiv:2504.19874
- QJL (AAAI 2025) — arXiv:2406.03482
- PolarQuant (AISTATS 2026) — arXiv:2502.02617
- Harness plugin (revfactory/harness) — agent team methodology