
Commit 475872c

unamedkr and Claude committed
v0.6.1: regression tests + turbo_kv_5b in CHANGELOG/ROADMAP
Three new deterministic regression tests in test_turbo_kv.cpp using synthetic Gaussian-with-outliers key/query vectors:

- TurboKVRegression.KV_4B_AttentionCosine — pins cos ≥ 0.99
- TurboKVRegression.KV_5B_AttentionCosine — pins cos ≥ 0.999
- TurboKVRegression.KV_5B_BeatsKV_4B — invariant: more bits ⇒ accuracy at least as high

These tests are deterministic (no model file needed), run in < 1 s, and catch any future Karpathy-loop iteration that would regress past the Variant F quality thresholds. The synthetic data generator (synth_keys) injects ~3% outliers at ±5x scale to mimic real transformer KV statistics.

Also documents turbo_kv_5b in CHANGELOG.md and ROADMAP.md as a v0.6.1 patch release on top of v0.6.0.

35/35 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 87e14cb commit 475872c

File tree: 3 files changed (+186, −5 lines)

CHANGELOG.md

Lines changed: 23 additions & 0 deletions

@@ -1,5 +1,28 @@
 # Changelog
 
+## [0.6.1] — 2026-04-08
+
+### Highlights
+
+- **🆕 `turbo_kv_5b` — near-lossless KV** at +0.34% PPL on Llama 3.2 3B. Uses a 32-level Lloyd-Max Gaussian codebook (Max 1960, Table I) on RHT-rotated values. 88-byte block (vs 72 for 4b). The new quality-maximizing option for users who can spare 22% more KV memory than 4b.
+- **Regression tests** — three deterministic synthetic-data tests pin the attention cosine quality of `turbo_kv_4b` (≥ 0.99) and `turbo_kv_5b` (≥ 0.999), and assert 5b ≥ 4b on the same data. Future Karpathy-loop iterations cannot regress past these thresholds without failing CI.
+
+### KV quantization quality (Llama 3.2 3B, FP32 = 13.56 PPL)
+
+| Type | Bytes/block | Compression | PPL | Δ vs FP32 |
+|---|---:|---:|---:|---:|
+| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
+| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
+| **`turbo_kv_5b`** 🏆 | 88 | 5.8× | **13.60** | **+0.34%** |
+
+### Tests
+
+- `TurboKVRegression.KV_4B_AttentionCosine` — pins ≥ 0.99
+- `TurboKVRegression.KV_5B_AttentionCosine` — pins ≥ 0.999
+- `TurboKVRegression.KV_5B_BeatsKV_4B` — invariant: more bits ⇒ ≥ accuracy
+
+35/35 tests pass on macOS / Linux / Windows.
+
 ## [0.6.0] — 2026-04-08
 
 ### Highlights

ROADMAP.md

Lines changed: 8 additions & 5 deletions

@@ -49,14 +49,16 @@ The world's simplest way to add LLM to a C/C++ project.
 A C reference engine for KV cache quantization research.
 
 ### Production-ready
-- [x] **`turbo_kv_4b`** — RHT + 4-bit Lloyd-Max codebook, beats `uniform_4b` and llama.cpp `q4_0` KV at the same bit budget (Llama 3.2 3B PPL 14.28, +5.3% vs FP32)
+- [x] **`turbo_kv_5b` 🏆** — RHT + 5-bit (32-level) Lloyd-Max codebook, near-lossless (Llama 3.2 3B PPL 13.60, +0.34% vs FP32). Quality-maximizing option.
+- [x] **`turbo_kv_4b` ⭐ default** — RHT + 4-bit Lloyd-Max codebook, beats `uniform_4b` and llama.cpp `q4_0` KV at the same bit budget (Llama 3.2 3B PPL 14.28, +5.3% vs FP32)
 - [x] **`turbo_kv_3b`** — RHT + 3-bit Lloyd-Max codebook (PPL 15.39, +13.5%)
 - [x] `uniform_4b` KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
 - [x] `uniform_4b` + Q4 V combo (6.9x KV memory reduction)
 - [x] Delta compression (P-frame encoding)
 - [x] QK-norm aware compression (Gemma 4 / hybrid attention models)
 - [x] Plugin architecture (3 functions to add new type)
-- [x] 35 unit tests
+- [x] Regression tests pinning `turbo_kv_4b/5b` quality
+- [x] 35 unit tests across macOS / Linux / Windows
 
 ### Building blocks
 - [x] Random Hadamard Transform (`tq_rht.c`)
@@ -67,9 +69,10 @@ A C reference engine for KV cache quantization research.
 ### TurboQuant paper reproduction (issue #14, partially resolved)
 - [x] Identify the gap in literal port (commit 4da6915 — QJL contributes byte-identical zero)
 - [x] Variant F: drop QJL stage, double codebook size (commit ac3c46a — beats baseline)
-- [ ] Per-channel outlier handling (Google paper's 32-channel split)
-- [ ] Paper-faithful Llama 3.1 8B + LongBench-E reproduction
-- [ ] 5-bit codebook variant for ~5 bpc quality budget
+- [x] 5-bit codebook variant for ~5 bpc quality budget (commit 87e14cb)
+- [x] Regression tests pinning quality (commit on this release)
+- [ ] Per-channel outlier handling (Google paper's 32-channel split) — issue #15
+- [ ] Paper-faithful Llama 3.1 8B + LongBench-E reproduction — issue #15
 
 ### Planned (after Direction 2 reproduction)
 - [ ] "Add Your Own Type" tutorial polish (docs/custom-quantization.md)

tests/test_turbo_kv.cpp

Lines changed: 155 additions & 0 deletions

@@ -471,3 +471,158 @@ TEST(TurboKV, ZeroInput) {
     EXPECT_NEAR(output_4b[i], 0.0f, 1e-4f);
   }
 }
+
+/* ============================================================
+ * Regression tests: Variant F quality must not regress.
+ *
+ * These tests synthesize attention scores from realistic key/query
+ * vectors and assert that quantized scores are highly correlated with
+ * FP32 reference scores. The thresholds are calibrated to the
+ * Variant F implementation that achieves Llama 3.2 3B PPL 14.28 (4b)
+ * and 13.60 (5b). Any regression that drops below these thresholds
+ * will fail CI before it reaches users.
+ *
+ * Update history:
+ *   2026-04-08  Initial calibration after Variant F shipped.
+ * ============================================================ */
+
+extern "C" {
+void tq_turbo_kv_5b_quantize_ref(const float* src, void* dst, int n);
+void tq_turbo_kv_5b_attention_ref(const float* query, const void* kv,
+                                  float* scores, int seq_len, int head_dim);
+}
+
+namespace {
+
+/* Generate realistic key/query vectors: per-coordinate Gaussian + a few
+ * scaled outliers, mimicking real transformer KV statistics. */
+static void synth_keys(std::vector<std::vector<float>>& keys, int n_keys,
+                       int dim, uint32_t seed) {
+  std::mt19937 rng(seed);
+  std::normal_distribution<float> nd(0.0f, 1.0f);
+  keys.resize(n_keys);
+  for (int k = 0; k < n_keys; k++) {
+    keys[k].resize(dim);
+    for (int i = 0; i < dim; i++) keys[k][i] = nd(rng) * 0.1f;
+    /* Inject a few outliers (~3% of dims, ±5x scale) */
+    for (int o = 0; o < dim / 32; o++) {
+      int idx = (int)(rng() % (uint32_t)dim);
+      keys[k][idx] *= ((rng() & 1) ? 5.0f : -5.0f);
+    }
+  }
+}
+
+static void synth_query(std::vector<float>& q, int dim, uint32_t seed) {
+  std::mt19937 rng(seed);
+  std::normal_distribution<float> nd(0.0f, 1.0f);
+  q.resize(dim);
+  for (int i = 0; i < dim; i++) q[i] = nd(rng) * 0.1f;
+}
+
+/* Compute FP32 reference attention scores: scores[s] = <q, keys[s]> */
+static void fp32_attention(const std::vector<float>& q,
+                           const std::vector<std::vector<float>>& keys,
+                           std::vector<float>& scores) {
+  int n = (int)keys.size();
+  int dim = (int)q.size();
+  scores.resize(n);
+  for (int s = 0; s < n; s++) {
+    float dot = 0.0f;
+    for (int d = 0; d < dim; d++) dot += q[d] * keys[s][d];
+    scores[s] = dot;
+  }
+}
+
+}  // namespace
+
+TEST(TurboKVRegression, KV_4B_AttentionCosine) {
+  const int dim = TQ_BK;   // 128
+  const int n_keys = 256;  // realistic context length
+
+  std::vector<std::vector<float>> keys;
+  synth_keys(keys, n_keys, dim, /*seed=*/0xC0FFEE);
+  std::vector<float> q;
+  synth_query(q, dim, /*seed=*/0xBADC0DE);
+
+  /* FP32 reference */
+  std::vector<float> ref_scores;
+  fp32_attention(q, keys, ref_scores);
+
+  /* Quantize keys with turbo_kv_4b */
+  std::vector<block_tq_turbo_kv_4b> blocks(n_keys);
+  for (int s = 0; s < n_keys; s++) {
+    memset(&blocks[s], 0, sizeof(blocks[s]));
+    tq_turbo_kv_4b_quantize_ref(keys[s].data(), &blocks[s], dim);
+  }
+
+  /* Compute estimated attention scores */
+  std::vector<float> est_scores(n_keys);
+  tq_turbo_kv_4b_attention_ref(q.data(), blocks.data(), est_scores.data(),
+                               n_keys, dim);
+
+  double cos = compute_cosine(ref_scores.data(), est_scores.data(), n_keys);
+  /* Variant F achieves cos > 0.999 on this synthetic distribution.
+   * Calibrated threshold: 0.99 to allow noise but catch any major regression. */
+  EXPECT_GT(cos, 0.99) << "turbo_kv_4b attention cosine regressed below 0.99";
+}
+
+TEST(TurboKVRegression, KV_5B_AttentionCosine) {
+  const int dim = TQ_BK;
+  const int n_keys = 256;
+
+  std::vector<std::vector<float>> keys;
+  synth_keys(keys, n_keys, dim, /*seed=*/0xC0FFEE);
+  std::vector<float> q;
+  synth_query(q, dim, /*seed=*/0xBADC0DE);
+
+  std::vector<float> ref_scores;
+  fp32_attention(q, keys, ref_scores);
+
+  std::vector<block_tq_turbo_kv_5b> blocks(n_keys);
+  for (int s = 0; s < n_keys; s++) {
+    memset(&blocks[s], 0, sizeof(blocks[s]));
+    tq_turbo_kv_5b_quantize_ref(keys[s].data(), &blocks[s], dim);
+  }
+
+  std::vector<float> est_scores(n_keys);
+  tq_turbo_kv_5b_attention_ref(q.data(), blocks.data(), est_scores.data(),
+                               n_keys, dim);
+
+  double cos = compute_cosine(ref_scores.data(), est_scores.data(), n_keys);
+  /* 5-bit is near-lossless: must beat the 4-bit threshold by a wide margin. */
+  EXPECT_GT(cos, 0.999) << "turbo_kv_5b attention cosine regressed below 0.999";
+}
+
+TEST(TurboKVRegression, KV_5B_BeatsKV_4B) {
+  /* Strict invariant: 5-bit must always be at least as accurate as 4-bit
+   * on the same data, otherwise something is structurally wrong. */
+  const int dim = TQ_BK;
+  const int n_keys = 256;
+
+  std::vector<std::vector<float>> keys;
+  synth_keys(keys, n_keys, dim, /*seed=*/42);
+  std::vector<float> q;
+  synth_query(q, dim, /*seed=*/137);
+
+  std::vector<float> ref;
+  fp32_attention(q, keys, ref);
+
+  std::vector<block_tq_turbo_kv_4b> b4b(n_keys);
+  std::vector<block_tq_turbo_kv_5b> b5b(n_keys);
+  for (int s = 0; s < n_keys; s++) {
+    memset(&b4b[s], 0, sizeof(b4b[s]));
+    memset(&b5b[s], 0, sizeof(b5b[s]));
+    tq_turbo_kv_4b_quantize_ref(keys[s].data(), &b4b[s], dim);
+    tq_turbo_kv_5b_quantize_ref(keys[s].data(), &b5b[s], dim);
+  }
+
+  std::vector<float> sc4b(n_keys), sc5b(n_keys);
+  tq_turbo_kv_4b_attention_ref(q.data(), b4b.data(), sc4b.data(), n_keys, dim);
+  tq_turbo_kv_5b_attention_ref(q.data(), b5b.data(), sc5b.data(), n_keys, dim);
+
+  double cos4 = compute_cosine(ref.data(), sc4b.data(), n_keys);
+  double cos5 = compute_cosine(ref.data(), sc5b.data(), n_keys);
+
+  EXPECT_GE(cos5, cos4)
+      << "5-bit must be at least as accurate as 4-bit (5b=" << cos5
+      << ", 4b=" << cos4 << ")";
+}
