Background
v0.8.1 shipped with two known limitations in the Python bindings:
- `kv_compress=0` is the default (no KV compression). The bundled `quant.h` is an Apr-6 snapshot whose UNIFORM_4B path aborts on Llama-architecture models. The CLI `quant` binary uses the multi-file engine and works fine with all KV types.
- `quant_ask` returns a string that is leaked (~65 KB per call). Calling `libc.free(ptr)` from ctypes crashes on macOS arm64 because Python's libc handle resolves to a different malloc zone than `libquant.dylib`'s. We currently skip the free.
Both are documented in CHANGELOG 0.8.1 and the bindings README.
Scope of v0.8.2
1. Regenerate quant.h
The single header at quant.h is dated 2026-04-06 and pre-dates several fixes that landed on main between then and v0.8.0:
- Round 10/11 NEON tbl breakthrough (turbo_kv_4b at fp32 parity)
- AVX2 port (v0.8.0)
- Gemma 4 architecture fixes
- Windows MSVC shims
- KV path fixes for Llama architectures (the first limitation listed in Background)
Action: figure out the regeneration process (search for any existing amalgamate/merge script, or write one), regenerate, verify v0.8.0 features are present, run tests.
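If no amalgamation script turns up, a fallback could look roughly like the sketch below. It assumes the single header is produced by recursively inlining local `#include "..."` lines exactly once each; the function name `amalgamate` and the file layout are hypothetical, and the real process may differ (for example, it may need a fixed file order or an implementation guard macro).

```python
import re
from pathlib import Path

# Matches local includes such as: #include "foo.h"  (system <...> includes are left alone)
LOCAL_INCLUDE = re.compile(r'^\s*#\s*include\s+"([^"]+)"\s*$')


def amalgamate(entry, include_dirs, seen=None):
    """Return the text of `entry` with each local include inlined at most once."""
    seen = set() if seen is None else seen
    entry = Path(entry).resolve()
    if entry in seen:
        return ""  # already inlined earlier in the output
    seen.add(entry)

    out = [f"/* ---- begin {entry.name} ---- */"]
    for line in entry.read_text().splitlines():
        match = LOCAL_INCLUDE.match(line)
        target = None
        if match:
            # Look next to the including file first, then in the extra include dirs.
            for base in [entry.parent, *map(Path, include_dirs)]:
                candidate = base / match.group(1)
                if candidate.is_file():
                    target = candidate
                    break
        if target is not None:
            out.append(amalgamate(target, include_dirs, seen))
        else:
            out.append(line)  # system headers and unresolved includes stay as written
    out.append(f"/* ---- end {entry.name} ---- */")
    return "\n".join(out)
```

The begin/end markers make it easier to diff the regenerated header against the Apr-6 snapshot and to confirm that the files carrying the v0.8.0 fixes were actually pulled in.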
2. Add quant_free_string to the public API
Tiny new export in quant.h:
```c
// Free a string returned by quant_ask. Always use this — never call libc free directly,
// because the string may live in a different malloc zone than the caller's libc.
void quant_free_string(char* str);
```
Implementation: a one-line `if (str) free(str);` in the library's .c file, so the free goes through the dylib's own malloc heap rather than the caller's.
Then update the Python wrapper to call `lib.quant_free_string(ptr)` instead of skipping the free.
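On the Python side, the wrapper change might look like the following sketch. The helper names `bind_free_string` and `take_owned_string` are hypothetical, and the argument list of `quant_ask` is not shown here, so it is left undeclared. One assumption worth checking in the real wrapper: `quant_ask` needs `restype = ctypes.c_void_p`, because a `c_char_p` restype makes ctypes convert the result to `bytes` and discard the raw pointer, leaving nothing to pass to `quant_free_string`.

```python
import ctypes


def bind_free_string(lib):
    """Declare the prototype of quant_free_string on a loaded library handle."""
    lib.quant_free_string.argtypes = [ctypes.c_void_p]
    lib.quant_free_string.restype = None
    return lib.quant_free_string


def take_owned_string(ptr, free_fn):
    """Copy a NUL-terminated C string into a Python str, then release it via free_fn."""
    if not ptr:
        return None  # NULL result: nothing to copy, nothing to free
    try:
        # string_at copies the bytes up to the first NUL, so the copy outlives the free.
        return ctypes.string_at(ptr).decode("utf-8", errors="replace")
    finally:
        free_fn(ptr)  # always hand the pointer back to the library's own allocator
```

Freeing in a `finally` block means the buffer is released even if decoding raises, which keeps the ~65 KB per-call leak from reappearing on error paths.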
3. Re-enable kv_compress=1 / =2 in the Python bindings
Once the regenerated quant.h includes the fixed Llama KV path, drop the warning + fallback in bindings/python/quantcpp/__init__.py:Model.__init__:
```python
# Remove the v0.8.1 guard:
if kv_compress not in (0,):
    warnings.warn(...)
    kv_compress = 0
```
4. End-user smoke test in CI
Add to `.github/workflows/publish.yml` (or a new test workflow): in each wheel-build matrix entry, after `cibuildwheel`, run a small `Model("test_model.gguf").ask("hi")` call against a tiny GGUF model committed to the repo. Block the publish step if the smoke test aborts. This catches any future regression of bugs 1 & 2 before PyPI sees them.
The test model can be a 10-MB SmolLM2-135M Q4_K_M, or even smaller, stored as a Git LFS object or downloaded from Hugging Face in CI.
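One way to structure the smoke test is to run the model call in a child interpreter, so that a native abort shows up as a non-zero exit code instead of taking the test runner down with it. The sketch below assumes that `Model` is importable from `quantcpp` and that `ask` returns a non-empty string; the helper names `smoke_snippet` and `run_isolated` are hypothetical.

```python
import subprocess
import sys


def smoke_snippet(model_path, kv_compress):
    """Build the source of a tiny script that loads the model and asks one question."""
    return (
        "import warnings\n"
        "from quantcpp import Model\n"
        "warnings.simplefilter('error')  # any warning from here on fails the run\n"
        f"reply = Model({model_path!r}, kv_compress={kv_compress!r}).ask('hi')\n"
        "assert reply, 'empty reply'\n"
    )


def run_isolated(snippet, timeout=300):
    """Run a snippet in a fresh interpreter and return (returncode, stderr)."""
    proc = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    # On POSIX a negative return code means the child was killed by a signal (e.g. SIGABRT).
    return proc.returncode, proc.stderr
```

A CI step could then loop over `kv_compress` values 0, 1 and 2, call `run_isolated(smoke_snippet("test_model.gguf", k))`, and fail the job on any non-zero return code, printing the captured stderr for diagnosis.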
Acceptance criteria
- `quant.h` regenerated from v0.8.0+ tree, all 35 tests pass against it
- `quant_free_string` exposed and used by Python wrapper
- `Model("file.gguf", kv_compress=1).ask("hi")` works with no abort and no warning
Why this matters
Until v0.8.2 lands, the headline value prop ("7× KV compression") is only available via the CLI binary, not via `pip install quantcpp`. We have two distribution channels saying different things. v0.8.2 unifies them.