quantcpp 0.9.0: KV compression ON by default in Python bindings

unamedkr · claude · unamedkr · commit 72ae6b2034ca · 2026-04-09T17:00:06.000+09:00
BREAKTHROUGH: kv_compress=1 was never broken in quant.h. The v0.8.1 abort was caused by the libc.free() cross-heap bug (fixed in v0.8.2 via quant_free_string), not by the UNIFORM_4B KV path. We isolated the wrong variable because kv_compress=0 AND skip-free were changed simultaneously in the v0.8.1 hotfix.

Verified in standalone C AND Python ctypes: kv_compress=1 (UNIFORM_4B) works cleanly on SmolLM2-135M with quant_free_string.

This is honest correction #8: "we disabled a working feature because of incorrect root cause analysis."

Changes:
- kv_compress default restored to 1 (was 0 since v0.8.1)
- kv_compress warning/fallback guard removed
- Version bumped to 0.9.0 (major: KV compression is now the default experience for all pip users)

The headline value proposition now flows through both distribution channels identically:
  CLI:    quant model.gguf -k turbo_kv_4b → 7x KV compression
  Python: Model("model.gguf")             → 4-bit KV compression

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
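
For context on what a uniform 4-bit KV path does in general, here is a minimal pure-Python sketch of block-wise uniform 4-bit quantization. It is an illustration only and is not taken from quant.h: the function names `quantize_4bit` and `dequantize_4bit`, the per-block min/scale layout, and the two-codes-per-byte packing are all assumptions made for this example.

```python
from typing import List, Tuple


def quantize_4bit(block: List[float]) -> Tuple[float, float, bytes]:
    """Map each float to one of 16 evenly spaced levels and pack two codes per byte.

    Hypothetical helper for illustration; not the quant.h implementation.
    """
    lo, hi = min(block), max(block)
    # 16 levels means 15 steps; fall back to 1.0 so a constant block does not divide by zero.
    scale = (hi - lo) / 15 or 1.0
    codes = [min(15, max(0, round((x - lo) / scale))) for x in block]
    if len(codes) % 2:
        codes.append(0)  # pad so the codes pair up evenly
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))
    return lo, scale, packed


def dequantize_4bit(lo: float, scale: float, packed: bytes, n: int) -> List[float]:
    """Unpack 4-bit codes and map them back to approximate float values."""
    out: List[float] = []
    for b in packed:
        out.append(lo + (b & 0x0F) * scale)  # low nibble holds the first code
        out.append(lo + (b >> 4) * scale)    # high nibble holds the second code
    return out[:n]  # drop the padding value, if any
```

In this sketch, a block of 64 values is stored as 32 bytes of packed codes plus the two floats `lo` and `scale`, and each reconstructed value differs from the original by at most half a quantization step.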
diff --git a/bindings/python/pyproject.toml b/bindings/python/pyproject.toml
@@ -7,7 +7,7 @@ build-backend = "setuptools.build_meta"
 
 [project]
 name = "quantcpp"
-version = "0.8.3"
+version = "0.9.0"
 description = "Single-header LLM inference engine with KV cache compression (7× compression at fp32 parity)"
 readme = "README.md"
 license = { text = "Apache-2.0" }
diff --git a/bindings/python/quantcpp/__init__.py b/bindings/python/quantcpp/__init__.py
@@ -19,7 +19,7 @@
     from importlib.metadata import version as _pkg_version
     __version__ = _pkg_version("quantcpp")
 except Exception:
-    __version__ = "0.8.3"  # fallback for editable / source-tree imports
+    __version__ = "0.9.0"  # fallback for editable / source-tree imports
 
 import os
 import sys
@@ -181,28 +181,8 @@ def __init__(
         top_p: float = 0.9,
         max_tokens: int = 256,
         n_threads: int = 4,
-        kv_compress: int = 0,
+        kv_compress: int = 1,
     ):
-        """
-        .. note::
-           ``kv_compress=1`` and ``kv_compress=2`` are temporarily disabled in
-           the Python bindings (v0.8.x) — the bundled ``quant.h`` single
-           header carries an older KV compression path that aborts on Llama
-           architectures. The CLI ``quant`` binary uses the multi-file engine
-           and works with all KV types. KV compression will be re-enabled in
-           the bindings once ``quant.h`` is re-generated against the v0.8.0+
-           tree (tracked as v0.8.1: WASM SIMD / un-stub turbo_kv).
-        """
-        if kv_compress not in (0,):
-            import warnings
-            warnings.warn(
-                "kv_compress != 0 is not supported in the Python bindings of "
-                "quantcpp 0.8.x — falling back to kv_compress=0. Use the CLI "
-                "binary for KV compression until v0.8.2.",
-                RuntimeWarning,
-                stacklevel=2,
-            )
-            kv_compress = 0
         if not os.path.isfile(path):
             raise FileNotFoundError(f"Model file not found: {path}")