
Commit 475872c

unamedkr and Claude committed
v0.6.1: regression tests + turbo_kv_5b in CHANGELOG/ROADMAP
Three new deterministic regression tests in test_turbo_kv.cpp using synthetic Gaussian-with-outliers key/query vectors:

- TurboKVRegression.KV_4B_AttentionCosine — pins cos ≥ 0.99
- TurboKVRegression.KV_5B_AttentionCosine — pins cos ≥ 0.999
- TurboKVRegression.KV_5B_BeatsKV_4B — invariant: more bits ⇒ accuracy at least as high

These tests are deterministic (no model file needed), run in < 1 s, and catch any future Karpathy-loop iteration that would regress past the Variant F quality thresholds. The synthetic data generator (synth_keys) injects ~3% outliers at ±5x scale to mimic real transformer KV statistics.

Also documents turbo_kv_5b in CHANGELOG.md and ROADMAP.md as a v0.6.1 patch release on top of v0.6.0.

35/35 tests pass.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 87e14cb commit 475872c

File tree: 3 files changed (+186, −5 lines)

CHANGELOG.md

Lines changed: 23 additions & 0 deletions

@@ -1,5 +1,28 @@
 # Changelog
 
+## [0.6.1] — 2026-04-08
+
+### Highlights
+
+- **🆕 `turbo_kv_5b` — near-lossless KV** at +0.34% PPL on Llama 3.2 3B. Uses a 32-level Lloyd-Max Gaussian codebook (Max 1960, Table I) on RHT-rotated values. 88-byte block (vs 72 for 4b). The new quality-maximizing option for users who can spare 22% more KV memory than 4b.
+- **Regression tests** — three deterministic synthetic-data tests pin the attention cosine quality of `turbo_kv_4b` (≥ 0.99) and `turbo_kv_5b` (≥ 0.999), and assert 5b ≥ 4b on the same data. Future Karpathy-loop iterations cannot regress past these thresholds without failing CI.
+
+### KV quantization quality (Llama 3.2 3B, FP32 = 13.56 PPL)
+
+| Type | Bytes/block | Compression | PPL | Δ vs FP32 |
+|---|---:|---:|---:|---:|
+| `turbo_kv_3b` | 56 | 9.1× | 15.39 | +13.5% |
+| `turbo_kv_4b` ⭐ default | 72 | 7.1× | 14.28 | +5.3% |
+| **`turbo_kv_5b`** 🏆 | 88 | 5.8× | **13.60** | **+0.34%** |
+
+### Tests
+
+- `TurboKVRegression.KV_4B_AttentionCosine` — pins ≥ 0.99
+- `TurboKVRegression.KV_5B_AttentionCosine` — pins ≥ 0.999
+- `TurboKVRegression.KV_5B_BeatsKV_4B` — invariant: more bits ⇒ ≥ accuracy
+
+35/35 tests pass on macOS / Linux / Windows.
+
 ## [0.6.0] — 2026-04-08
 
 ### Highlights

ROADMAP.md

Lines changed: 8 additions & 5 deletions

@@ -49,14 +49,16 @@ The world's simplest way to add LLM to a C/C++ project.
 A C reference engine for KV cache quantization research.
 
 ### Production-ready
-- [x] **`turbo_kv_4b`** — RHT + 4-bit Lloyd-Max codebook, beats `uniform_4b` and llama.cpp `q4_0` KV at the same bit budget (Llama 3.2 3B PPL 14.28, +5.3% vs FP32)
+- [x] **`turbo_kv_5b` 🏆** — RHT + 5-bit (32-level) Lloyd-Max codebook, near-lossless (Llama 3.2 3B PPL 13.60, +0.34% vs FP32). Quality-maximizing option.
+- [x] **`turbo_kv_4b` ⭐ default** — RHT + 4-bit Lloyd-Max codebook, beats `uniform_4b` and llama.cpp `q4_0` KV at the same bit budget (Llama 3.2 3B PPL 14.28, +5.3% vs FP32)
 - [x] **`turbo_kv_3b`** — RHT + 3-bit Lloyd-Max codebook (PPL 15.39, +13.5%)
 - [x] `uniform_4b` KV quantization (4–7x compression, +6.3% PPL on Llama 3.2 3B)
 - [x] `uniform_4b` + Q4 V combo (6.9x KV memory reduction)
 - [x] Delta compression (P-frame encoding)
 - [x] QK-norm aware compression (Gemma 4 / hybrid attention models)
 - [x] Plugin architecture (3 functions to add new type)
-- [x] 35 unit tests
+- [x] Regression tests pinning `turbo_kv_4b/5b` quality
+- [x] 35 unit tests across macOS / Linux / Windows
 
 ### Building blocks
 - [x] Random Hadamard Transform (`tq_rht.c`)
@@ -67,9 +69,10 @@ A C reference engine for KV cache quantization research.
 ### TurboQuant paper reproduction (issue #14, partially resolved)
 - [x] Identify the gap in literal port (commit 4da6915 — QJL contributes byte-identical zero)
 - [x] Variant F: drop QJL stage, double codebook size (commit ac3c46a — beats baseline)
-- [ ] Per-channel outlier handling (Google paper's 32-channel split)
-- [ ] Paper-faithful Llama 3.1 8B + LongBench-E reproduction
-- [ ] 5-bit codebook variant for ~5 bpc quality budget
+- [x] 5-bit codebook variant for ~5 bpc quality budget (commit 87e14cb)
+- [x] Regression tests pinning quality (commit on this release)
+- [ ] Per-channel outlier handling (Google paper's 32-channel split) — issue #15
+- [ ] Paper-faithful Llama 3.1 8B + LongBench-E reproduction — issue #15
 
 ### Planned (after Direction 2 reproduction)
 - [ ] "Add Your Own Type" tutorial polish (docs/custom-quantization.md)

tests/test_turbo_kv.cpp

Lines changed: 155 additions & 0 deletions

@@ -471,3 +471,158 @@ TEST(TurboKV, ZeroInput) {
     EXPECT_NEAR(output_4b[i], 0.0f, 1e-4f);
   }
 }
+
+/* ============================================================
+ * Regression tests: Variant F quality must not regress.
+ *
+ * These tests synthesize attention scores from realistic key/query
+ * vectors and assert that quantized scores are highly correlated with
+ * FP32 reference scores. The thresholds are calibrated to the
+ * Variant F implementation that achieves Llama 3.2 3B PPL 14.28 (4b)
+ * and 13.60 (5b). Any regression that drops below these thresholds
+ * will fail CI before it reaches users.
+ *
+ * Update history:
+ *   2026-04-08  Initial calibration after Variant F shipped.
+ * ============================================================ */
+
+extern "C" {
+void tq_turbo_kv_5b_quantize_ref(const float* src, void* dst, int n);
+void tq_turbo_kv_5b_attention_ref(const float* query, const void* kv,
+                                  float* scores, int seq_len, int head_dim);
+}
+
+namespace {
+
+/* Generate realistic key/query vectors: per-coordinate Gaussian + a few
+ * scaled outliers, mimicking real transformer KV statistics. */
+static void synth_keys(std::vector<std::vector<float>>& keys, int n_keys,
+                       int dim, uint32_t seed) {
+  std::mt19937 rng(seed);
+  std::normal_distribution<float> nd(0.0f, 1.0f);
+  keys.resize(n_keys);
+  for (int k = 0; k < n_keys; k++) {
+    keys[k].resize(dim);
+    for (int i = 0; i < dim; i++) keys[k][i] = nd(rng) * 0.1f;
+    /* Inject a few outliers (~3% of dims, ±5x scale) */
+    for (int o = 0; o < dim / 32; o++) {
+      int idx = (int)(rng() % (uint32_t)dim);
+      keys[k][idx] *= ((rng() & 1) ? 5.0f : -5.0f);
+    }
+  }
+}
+
+static void synth_query(std::vector<float>& q, int dim, uint32_t seed) {
+  std::mt19937 rng(seed);
+  std::normal_distribution<float> nd(0.0f, 1.0f);
+  q.resize(dim);
+  for (int i = 0; i < dim; i++) q[i] = nd(rng) * 0.1f;
+}
+
+/* Compute FP32 reference attention scores: scores[s] = <q, keys[s]> */
+static void fp32_attention(const std::vector<float>& q,
+                           const std::vector<std::vector<float>>& keys,
+                           std::vector<float>& scores) {
+  int n = (int)keys.size();
+  int dim = (int)q.size();
+  scores.resize(n);
+  for (int s = 0; s < n; s++) {
+    float dot = 0.0f;
+    for (int d = 0; d < dim; d++) dot += q[d] * keys[s][d];
+    scores[s] = dot;
+  }
+}
+
+}  // namespace
+
+TEST(TurboKVRegression, KV_4B_AttentionCosine) {
+  const int dim = TQ_BK;   // 128
+  const int n_keys = 256;  // realistic context length
+
+  std::vector<std::vector<float>> keys;
+  synth_keys(keys, n_keys, dim, /*seed=*/0xC0FFEE);
+  std::vector<float> q;
+  synth_query(q, dim, /*seed=*/0xBADC0DE);
+
+  /* FP32 reference */
+  std::vector<float> ref_scores;
+  fp32_attention(q, keys, ref_scores);
+
+  /* Quantize keys with turbo_kv_4b */
+  std::vector<block_tq_turbo_kv_4b> blocks(n_keys);
+  for (int s = 0; s < n_keys; s++) {
+    memset(&blocks[s], 0, sizeof(blocks[s]));
+    tq_turbo_kv_4b_quantize_ref(keys[s].data(), &blocks[s], dim);
+  }
+
+  /* Compute estimated attention scores */
+  std::vector<float> est_scores(n_keys);
+  tq_turbo_kv_4b_attention_ref(q.data(), blocks.data(), est_scores.data(),
+                               n_keys, dim);
+
+  double cos = compute_cosine(ref_scores.data(), est_scores.data(), n_keys);
+  /* Variant F achieves cos > 0.999 on this synthetic distribution.
+   * Calibrated threshold: 0.99 to allow noise but catch any major regression. */
+  EXPECT_GT(cos, 0.99) << "turbo_kv_4b attention cosine regressed below 0.99";
+}
+
+TEST(TurboKVRegression, KV_5B_AttentionCosine) {
+  const int dim = TQ_BK;
+  const int n_keys = 256;
+
+  std::vector<std::vector<float>> keys;
+  synth_keys(keys, n_keys, dim, /*seed=*/0xC0FFEE);
+  std::vector<float> q;
+  synth_query(q, dim, /*seed=*/0xBADC0DE);
+
+  std::vector<float> ref_scores;
+  fp32_attention(q, keys, ref_scores);
+
+  std::vector<block_tq_turbo_kv_5b> blocks(n_keys);
+  for (int s = 0; s < n_keys; s++) {
+    memset(&blocks[s], 0, sizeof(blocks[s]));
+    tq_turbo_kv_5b_quantize_ref(keys[s].data(), &blocks[s], dim);
+  }
+
+  std::vector<float> est_scores(n_keys);
+  tq_turbo_kv_5b_attention_ref(q.data(), blocks.data(), est_scores.data(),
+                               n_keys, dim);
+
+  double cos = compute_cosine(ref_scores.data(), est_scores.data(), n_keys);
+  /* 5-bit is near-lossless: must beat the 4-bit threshold by a wide margin. */
+  EXPECT_GT(cos, 0.999) << "turbo_kv_5b attention cosine regressed below 0.999";
+}
+
+TEST(TurboKVRegression, KV_5B_BeatsKV_4B) {
+  /* Strict invariant: 5-bit must always be at least as accurate as 4-bit
+   * on the same data, otherwise something is structurally wrong. */
+  const int dim = TQ_BK;
+  const int n_keys = 256;
+
+  std::vector<std::vector<float>> keys;
+  synth_keys(keys, n_keys, dim, /*seed=*/42);
+  std::vector<float> q;
+  synth_query(q, dim, /*seed=*/137);
+
+  std::vector<float> ref;
+  fp32_attention(q, keys, ref);
+
+  std::vector<block_tq_turbo_kv_4b> b4b(n_keys);
+  std::vector<block_tq_turbo_kv_5b> b5b(n_keys);
+  for (int s = 0; s < n_keys; s++) {
+    memset(&b4b[s], 0, sizeof(b4b[s]));
+    memset(&b5b[s], 0, sizeof(b5b[s]));
+    tq_turbo_kv_4b_quantize_ref(keys[s].data(), &b4b[s], dim);
+    tq_turbo_kv_5b_quantize_ref(keys[s].data(), &b5b[s], dim);
+  }
+
+  std::vector<float> sc4b(n_keys), sc5b(n_keys);
+  tq_turbo_kv_4b_attention_ref(q.data(), b4b.data(), sc4b.data(), n_keys, dim);
+  tq_turbo_kv_5b_attention_ref(q.data(), b5b.data(), sc5b.data(), n_keys, dim);
+
+  double cos4 = compute_cosine(ref.data(), sc4b.data(), n_keys);
+  double cos5 = compute_cosine(ref.data(), sc5b.data(), n_keys);
+
+  EXPECT_GE(cos5, cos4)
+      << "5-bit must be at least as accurate as 4-bit (5b=" << cos5
+      << ", 4b=" << cos4 << ")";
+}
