v0.8.2: regenerate quant.h + quant_free_string + re-enable KV compression in Python bindings #18

@unamedkr

Background

v0.8.1 shipped with two known limitations in the Python bindings:

  1. kv_compress=0 default (no KV compression). The bundled quant.h is an Apr-6 snapshot whose UNIFORM_4B path aborts on Llama-architecture models. The CLI quant binary uses the multi-file engine and works fine with all KV types.

  2. quant_ask returns a leaked string (~65 KB per call). Calling libc.free(ptr) from ctypes crashes on macOS arm64 because Python's libc handle resolves to a different malloc zone than libquant.dylib's. We currently skip the free.

Both are documented in CHANGELOG 0.8.1 and the bindings README.

Scope of v0.8.2

1. Regenerate quant.h

The single header at quant.h is dated 2026-04-06 and pre-dates several fixes that landed on main between then and v0.8.0:

  • Round 10/11 NEON tbl breakthrough (turbo_kv_4b at fp32 parity)
  • AVX2 port (v0.8.0)
  • Gemma 4 architecture fixes
  • Windows MSVC shims
  • KV path fixes for Llama architectures (Bug 1 above)

Action: figure out the regeneration process (search for any existing amalgamate/merge script, or write one), regenerate, verify v0.8.0 features are present, run tests.
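If no amalgamation script turns up, one possible starting point is sketched below. It is an assumption-laden illustration, not the project's actual process: the function name amalgamate, the banner comments, and the idea of passing an explicit list of root files are all hypothetical. It inlines each local #include "..." exactly once and leaves system includes untouched.

```python
import re
from pathlib import Path

# Matches local includes such as: #include "foo.h"  (system <...> includes are left alone)
LOCAL_INCLUDE = re.compile(r'^\s*#\s*include\s+"([^"]+)"\s*$')


def amalgamate(roots, include_dirs):
    """Return one text blob in which every resolvable local #include is inlined once.

    roots        -- files to emit, in order (hypothetical: headers first, then sources)
    include_dirs -- extra directories searched for local includes
    """
    seen = set()
    out = []

    def find(name, dirs):
        # First directory that actually contains the file wins.
        for d in dirs:
            candidate = Path(d) / name
            if candidate.is_file():
                return candidate
        return None

    def emit(path):
        path = Path(path).resolve()
        if path in seen:
            return  # already inlined earlier; skip to avoid duplicate definitions
        seen.add(path)
        out.append(f"/* ---- begin {path.name} ---- */")
        for line in path.read_text().splitlines():
            match = LOCAL_INCLUDE.match(line)
            if match:
                target = find(match.group(1), [path.parent, *include_dirs])
                if target is not None:
                    emit(target)
                    continue  # drop the #include line itself
            out.append(line)  # unresolved includes are kept verbatim
        out.append(f"/* ---- end {path.name} ---- */")

    for root in roots:
        emit(root)
    return "\n".join(out) + "\n"
```

The real script would still need whatever implementation guard and file ordering the current quant.h uses; those details are not specified here and have to be taken from the existing header.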

2. Add quant_free_string to the public API

Tiny new export in quant.h:

```c
// Free a string returned by quant_ask. Always use this — never call libc free directly,
// because the string may live in a different malloc zone than the caller's libc.
void quant_free_string(char* str);
```

Implementation: literally if (str) free(str); compiled into the library itself, so the call resolves to the same allocator (the dylib's malloc zone) that produced the string.

Then update the Python wrapper: lib.quant_free_string(ptr) instead of skipping the free.
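A minimal sketch of what the wrapper side could look like, assuming the bindings load the library through ctypes. The helper names bind_free_string and take_owned_string are hypothetical, and the signature of quant_ask is not restated here. One ctypes detail matters: the function that returns the string needs restype = ctypes.c_void_p rather than c_char_p, because c_char_p converts the result to bytes and discards the raw pointer that has to be handed back to quant_free_string.

```python
import ctypes


def bind_free_string(lib):
    """Declare the signature of quant_free_string on an already-loaded library.

    `lib` is assumed to be the ctypes handle the bindings already hold.
    """
    lib.quant_free_string.argtypes = [ctypes.c_void_p]
    lib.quant_free_string.restype = None
    return lib.quant_free_string


def take_owned_string(ptr, free_fn):
    """Copy the NUL-terminated C string at `ptr` into a Python str, then release it.

    `ptr` is the raw address returned by the library (None or 0 means no result).
    `free_fn` should be the library's own quant_free_string, never libc's free.
    """
    if not ptr:
        return None
    try:
        # string_at copies the bytes, so the Python object outlives the C buffer.
        return ctypes.string_at(ptr).decode("utf-8", errors="replace")
    finally:
        free_fn(ptr)
```

Copying first and freeing in a finally block means the buffer is released even if decoding raises.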

3. Re-enable kv_compress=1 / =2 in the Python bindings

Once the regenerated quant.h includes the fixed Llama KV path, drop the warning + fallback in bindings/python/quantcpp/__init__.py:Model.__init__:

```python
# Remove the v0.8.1 guard:
if kv_compress not in (0,):
    warnings.warn(...)
    kv_compress = 0
```

4. End-user smoke test in CI

Add to .github/workflows/publish.yml (or a new test workflow): in each wheel-build matrix entry, after cibuildwheel, run a small Model("test_model.gguf").ask("hi") call against a tiny GGUF model committed to the repo. Block publish if the smoke test aborts. This catches any future regression of bugs 1 & 2 BEFORE PyPI sees them.

The test model can be a 10-MB SmolLM2-135M Q4_K_M, or even smaller. It can be stored as a Git LFS object or downloaded from Hugging Face in CI.
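One way the smoke step could be written is sketched below. It runs the call in a child interpreter, because a native abort kills the process that triggers it and the parent then sees a nonzero exit status instead of dying itself. The names SMOKE_SNIPPET and run_isolated are hypothetical, the model path is a placeholder, and passing -W error (which turns Python warnings into errors) is an assumption about how strictly "no warning" should be enforced.

```python
import subprocess
import sys

# Hypothetical smoke-test body; the model path and prompt are placeholders.
SMOKE_SNIPPET = (
    "from quantcpp import Model\n"
    "reply = Model('test_model.gguf', kv_compress=1).ask('hi')\n"
    "assert reply, 'empty reply'\n"
)


def run_isolated(code, timeout=300):
    """Run `code` in a fresh interpreter and return (ok, returncode).

    -W error promotes every Python warning to an exception, so the v0.8.1
    fallback warning would fail the run. Unrelated warnings raised while
    importing dependencies would fail it too, which may be too strict.
    """
    proc = subprocess.run(
        [sys.executable, "-W", "error", "-c", code],
        timeout=timeout,
    )
    # On POSIX a process killed by a signal (for example SIGABRT) reports a
    # negative return code; any nonzero value counts as failure here.
    return proc.returncode == 0, proc.returncode
```

A workflow step would call run_isolated(SMOKE_SNIPPET) after the wheel is installed and exit with a nonzero status when ok is false, which blocks the publish job.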

Acceptance criteria

  • Fresh quant.h regenerated from v0.8.0+ tree, all 35 tests pass against it
  • quant_free_string exposed and used by Python wrapper
  • Model("file.gguf", kv_compress=1).ask("hi") works with no abort and no warning
  • CI smoke test enforced before publish
  • CHANGELOG 0.8.2 entry with measurements (Llama 3.2 1B PPL with kv_compress=1 vs CLI baseline — must match within noise)
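The "must match within noise" criterion could be made mechanical with a check along these lines. The function name ppl_within_noise and the default 1% relative tolerance are assumptions; the actual noise floor should come from repeated CLI runs.

```python
import math


def ppl_within_noise(binding_ppl, cli_ppl, rel_tol=0.01):
    """Return True when the two perplexities agree to within `rel_tol` (relative).

    The default tolerance is a placeholder, not a measured noise floor.
    Non-finite values (NaN, inf) never count as a match.
    """
    if not (math.isfinite(binding_ppl) and math.isfinite(cli_ppl)):
        return False
    return math.isclose(binding_ppl, cli_ppl, rel_tol=rel_tol)
```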

Why this matters

Until v0.8.2 lands, the headline value prop ("7× KV compression") is only available via the CLI binary, not via pip install quantcpp. We have two distribution channels saying different things. v0.8.2 unifies them.
