docs: v0.8.1 CHANGELOG retrospective + README pip install / PyPI badges

unamedkr · claude · unamedkr · commit 1bc4611605de · 2026-04-09T15:20:55.000+09:00
CHANGELOG: lead with 0.8.1 hotfix entry that names the two bugs (kv_compress=1 default abort + cross-heap libc.free abort), explains how they were caught (end-user simulation in clean venv), and ties them into the project's honest-correction track record (#5 and #6).

README: surface `pip install quantcpp` at the top of the page, with PyPI version + python-versions badges replacing the stale v0.5.0 release badge. Quick-start code example uses Model().ask() and the streaming generate(). A NOTE flags the temporary kv_compress=0 default in the bindings and points readers at the CHANGELOG for context.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
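
For reviewers, the two behaviours named above can be sketched in a few lines of Python. This is an illustration only, not the actual `quantcpp` binding code: `resolve_kv_compress` and `copy_c_string` are hypothetical helper names, and the only library calls assumed are the standard `ctypes.string_at` and `warnings.warn`. The first helper mirrors "non-zero values warn and fall back"; the second copies a returned C string and releases it only through a deallocator supplied by the allocating library, never through an unrelated `free`.

```python
import ctypes
import warnings


def resolve_kv_compress(requested: int = 0) -> int:
    """Return the KV-compression mode this sketch would actually use.

    Hypothetical helper: any non-zero request emits a warning and
    falls back to 0 (no KV compression).
    """
    if requested != 0:
        warnings.warn(
            f"kv_compress={requested} is unavailable in this sketch; using 0",
            RuntimeWarning,
            stacklevel=2,
        )
        return 0
    return requested


def copy_c_string(ptr, free_fn=None) -> str:
    """Copy a NUL-terminated C string at address `ptr` into a Python str.

    `ptr` is an integer address, as returned by a ctypes function whose
    restype is c_void_p. The buffer is released only via `free_fn`, which
    should be a deallocator exported by the library that allocated it.
    With free_fn=None the buffer is left untouched (a deliberate leak)
    instead of being handed to a different allocator.
    """
    if not ptr:
        return ""
    # string_at copies the bytes, so the result does not alias the C buffer.
    text = ctypes.string_at(ptr).decode("utf-8", errors="replace")
    if free_fn is not None:
        free_fn(ptr)
    return text
```

Once a library-side deallocator exists (the CHANGELOG tracks `quant_free_string(void*)` for v0.8.2), it could be passed as `free_fn`; until then, leaving `free_fn=None` corresponds to the accepted per-call leak.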
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,5 +1,35 @@
 # Changelog
 
+## [0.8.1] — 2026-04-09 (Python bindings hotfix)
+
+### `pip install quantcpp` is now actually usable
+
+Two critical bugs were found in the v0.8.0 Python bindings within hours of publishing — by running an end-user simulation (`pip install` in a clean venv → `Model("file.gguf").ask("question")`). Both bugs were live for v0.8.0; v0.8.1 fixes them.
+
+#### Bug 1: `Model("file.gguf").ask(...)` aborted on macOS arm64
+
+Root cause: the Python wrapper defaulted to `kv_compress=1`, which routed through the bundled `quant.h`'s UNIFORM_4B KV path. The single-header is an Apr-6 snapshot that pre-dates the v0.8.0 multi-file source by several days, and that older KV path aborts on Llama-architecture models.
+
+Fix: default `kv_compress=0` (no KV compression) in v0.8.1. Non-zero values emit a warning and fall back to `0`. The CLI `quant` binary, which uses the multi-file engine, continues to work with all KV types.
+
+A real fix waits on a fresh `quant.h` regen against the v0.8.0+ tree (tracked as v0.8.2).
+
+#### Bug 2: `quant_ask` return string crashed `libc.free(ptr)`
+
+Root cause: `quant_ask` allocates the response string inside `libquant.dylib`'s malloc heap. The Python wrapper called `ctypes.CDLL(None).free(ptr)` to release it — but on macOS arm64, that handle resolves to a different malloc zone than the dylib's. Cross-zone free → abort.
+
+Fix: skip the explicit free in v0.8.1. We accept a ~65 KB leak per `ask()` call as a temporary tradeoff; `quant_free_ctx` / `quant_free_model` release the bulk of the memory at end of session. Tracked: add `quant_free_string(void*)` wrapper to `quant.h` in v0.8.2.
+
+### Honest correction track record
+
+These are corrections #5 and #6 in the project history (after the four in v0.6.x → v0.7.x). Both were caught by the project's own end-user-simulation testing, before any external user reported them. The pattern stands: **publish, simulate the user, fix in hours.**
+
+### v0.8.0 status
+
+PyPI 0.8.0 should be yanked; we strongly recommend upgrading to 0.8.1. Yanking only hides the release from new, unpinned `pip install` runs: anyone who pins `==0.8.0` can still install and use it.
+
+---
+
 ## [0.8.0] — 2026-04-09
 
 ### Cross-platform SIMD: AVX2 port of turbo_kv attention
diff --git a/README.md b/README.md
@@ -12,18 +12,42 @@
 </p>
 
 <p align="center">
-  <a href="https://github.com/quantumaikr/quant.cpp/releases/tag/v0.5.0"><img src="https://img.shields.io/badge/release-v0.5.0-blue" alt="Release"></a>
+  <a href="https://pypi.org/project/quantcpp/"><img src="https://img.shields.io/pypi/v/quantcpp.svg?label=PyPI&color=blue" alt="PyPI"></a>
+  <a href="https://pypi.org/project/quantcpp/"><img src="https://img.shields.io/pypi/pyversions/quantcpp.svg" alt="Python versions"></a>
+  <a href="https://github.com/quantumaikr/quant.cpp/releases/latest"><img src="https://img.shields.io/github/v/release/quantumaikr/quant.cpp?label=release" alt="Release"></a>
   <a href="#"><img src="https://img.shields.io/badge/license-Apache%202.0-blue" alt="License"></a>
-  <a href="#"><img src="https://img.shields.io/badge/tests-34%20pass-brightgreen" alt="Tests"></a>
-  <a href="#"><img src="https://img.shields.io/badge/score-99.2%25-brightgreen" alt="Score"></a>
+  <a href="#"><img src="https://img.shields.io/badge/tests-35%20pass-brightgreen" alt="Tests"></a>
   <br>
   <a href="#"><img src="https://img.shields.io/badge/models-7%20verified-blue" alt="Models"></a>
   <a href="https://quantumaikr.github.io/quant.cpp/"><img src="https://img.shields.io/badge/WASM_demo-192KB-purple" alt="WASM"></a>
-  <a href="#"><img src="https://img.shields.io/badge/platforms-macOS%20%7C%20Linux%20%7C%20Windows%20%7C%20WASM-orange" alt="Platforms"></a>
+  <a href="#"><img src="https://img.shields.io/badge/platforms-macOS%20%7C%20Linux%20%7C%20WASM-orange" alt="Platforms"></a>
 </p>
 
 ---
 
+## Install
+
+```bash
+pip install quantcpp
+```
+
+```python
+from quantcpp import Model
+
+m = Model("model.gguf")
+print(m.ask("What is 2+2?"))
+
+# Streaming
+for tok in m.generate("Once upon a time"):
+    print(tok, end="", flush=True)
+```
+
+Pre-built wheels are available for Linux x86_64, Linux aarch64, and macOS arm64 (Python 3.9–3.13). Other platforms fall back to the source distribution, which compiles `quant.h` automatically: no external dependencies, just a C compiler.
+
+> **Note (v0.8.x):** the Python bindings currently default to `kv_compress=0` (no KV compression). KV compression is fully working in the CLI `quant` binary; bringing it to the bindings is tracked for v0.8.2 (regenerated single-header). See [CHANGELOG](CHANGELOG.md#081--2026-04-09-python-bindings-hotfix) for details.
+
+---
+
 ## The Problem
 
 LLM memory is dominated by the **KV cache**, not model weights. At 32K context, a 8B model's KV cache consumes **4GB** — more than the model itself. Every existing engine stores KV in FP16. We compress it.