v0.8.2: regenerate quant.h + quant_free_string + re-enable KV compression in Python bindings

## Background

v0.8.1 shipped with two known limitations in the Python bindings:

1. **`kv_compress=0` default** (no KV compression). The bundled `quant.h` is an Apr-6 snapshot whose UNIFORM_4B path aborts on Llama-architecture models. The CLI `quant` binary uses the multi-file engine and works fine with all KV types.

2. **`quant_ask` returns a leaked string** (~65 KB per call). Calling `libc.free(ptr)` from ctypes crashes on macOS arm64 because Python's libc handle resolves to a different malloc zone than `libquant.dylib`'s. We currently skip the free.

Both are documented in [CHANGELOG 0.8.1](https://github.com/quantumaikr/quant.cpp/blob/main/CHANGELOG.md#081--2026-04-09-python-bindings-hotfix) and the [bindings README](https://github.com/quantumaikr/quant.cpp/blob/main/bindings/python/README.md).

## Scope of v0.8.2

### 1. Regenerate `quant.h`

The single header at `quant.h` is dated 2026-04-06 and predates several changes that landed on `main` between then and v0.8.0:
- Round 10/11 NEON tbl breakthrough (turbo_kv_4b at fp32 parity)
- AVX2 port (v0.8.0)
- Gemma 4 architecture fixes
- Windows MSVC shims
- KV path fixes for Llama architectures (limitation 1 in Background)

Action: establish the regeneration process (look for an existing amalgamate/merge script, or write one), regenerate the header, verify that the v0.8.0 features listed above are present, and run the tests.
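If no amalgamation script turns up, one possible starting point is sketched below. It is not the project's actual tooling: the function names `amalgamate` and `_find` are hypothetical, and it assumes the multi-file engine can be flattened by expanding local `#include "..."` lines in place, each file at most once. System includes (`<...>`) and local includes that cannot be resolved are left untouched.

```python
import re
from pathlib import Path

# Matches local includes such as: #include "foo/bar.h"
_LOCAL_INCLUDE = re.compile(r'^\s*#\s*include\s+"([^"]+)"\s*$')


def _find(name, dirs):
    """Return the first existing file called `name` under `dirs`, else None."""
    for d in dirs:
        candidate = Path(d) / name
        if candidate.is_file():
            return candidate
    return None


def amalgamate(entry, include_dirs, _seen=None):
    """Return the text of `entry` with local #include "..." lines expanded.

    Hypothetical helper: each file is inlined at most once, in first-use order.
    """
    seen = set() if _seen is None else _seen
    entry = Path(entry).resolve()
    if entry in seen:
        return ""  # already inlined earlier; emit nothing the second time
    seen.add(entry)

    out = [f"/* ---- begin {entry.name} ---- */"]
    for line in entry.read_text(encoding="utf-8").splitlines():
        match = _LOCAL_INCLUDE.match(line)
        if match:
            # Search next to the including file first, then the include dirs.
            target = _find(match.group(1), [entry.parent, *include_dirs])
            if target is not None:
                out.append(amalgamate(target, include_dirs, seen))
                continue
        out.append(line)  # ordinary line, system include, or unresolved include
    out.append(f"/* ---- end {entry.name} ---- */")
    return "\n".join(out)
```

The begin/end markers make it easier to diff a regenerated header against the Apr-6 snapshot and to confirm which source files contributed to it.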

### 2. Add `quant_free_string` to the public API

Tiny new export in `quant.h`:

```c
// Free a string returned by quant_ask. Always use this; never call libc free directly,
// because the string may live in a different malloc zone than the caller's libc.
void quant_free_string(char* str);
```

Implementation: literally `if (str) free(str);` inside the .c, so that the call resolves to the same allocator `libquant.dylib` used when it created the string.

Then update the Python wrapper to call `lib.quant_free_string(ptr)` instead of skipping the free.
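A minimal sketch of what the wrapper side could look like follows. The helper names `bind_string_api` and `take_owned_string` are hypothetical, and because the full signature of `quant_ask` is not shown in this issue, only its return type is configured. The point worth noting is that the return type needs to be `c_void_p`: with `c_char_p`, ctypes converts the result to `bytes` and the raw pointer needed for the free is lost.

```python
import ctypes


def bind_string_api(lib):
    """Declare ctypes signatures for the string-returning API (hypothetical helper).

    quant_ask's argument types are deliberately left alone here; they are
    not specified in this issue.
    """
    # Keep the raw address so it can be handed back to quant_free_string.
    lib.quant_ask.restype = ctypes.c_void_p
    lib.quant_free_string.argtypes = [ctypes.c_void_p]
    lib.quant_free_string.restype = None


def take_owned_string(ptr, free_fn):
    """Copy a NUL-terminated C string into a Python str, then release it.

    `free_fn` is expected to be lib.quant_free_string, so the memory is
    returned to the allocator that produced it.
    """
    if not ptr:
        return None
    try:
        return ctypes.string_at(ptr).decode("utf-8", errors="replace")
    finally:
        free_fn(ptr)  # runs even if decoding raises
```

Inside the wrapper this would be used roughly as `take_owned_string(lib.quant_ask(...), lib.quant_free_string)`, with the arguments to `quant_ask` unchanged from today.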

### 3. Re-enable `kv_compress=1` / `=2` in the Python bindings

Once the regenerated `quant.h` includes the fixed Llama KV path, drop the warning + fallback in `bindings/python/quantcpp/__init__.py:Model.__init__`:

```python
# Remove the v0.8.1 guard:
if kv_compress not in (0,):
    warnings.warn(...)
    kv_compress = 0
```

### 4. End-user smoke test in CI

Add to `.github/workflows/publish.yml` (or a new test workflow): in each wheel-build matrix entry, after `cibuildwheel`, install the built wheel and run a small `Model("test_model.gguf").ask("hi")` call against a tiny GGUF model. Block the publish step if the smoke test aborts. This catches any future regression of the two Background limitations before a broken wheel reaches PyPI.

The test model can be SmolLM2-135M Q4_K_M (on the order of 100 MB) or something even smaller, either stored as a Git LFS object or downloaded from Hugging Face in CI.
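One way to make "block publish if the smoke test aborts" robust is to run the call in a child interpreter, so that a native `abort()` only kills the child and surfaces as a non-zero exit code. The sketch below assumes the `quantcpp` package and the `Model(...).ask(...)` call described in this issue; `run_isolated`, `smoke_test`, and the 120-second timeout are hypothetical choices, not existing project code.

```python
import subprocess
import sys

# Code executed in the child interpreter; the model path arrives as argv[1].
SMOKE_SNIPPET = (
    "import sys\n"
    "from quantcpp import Model\n"
    "print(Model(sys.argv[1]).ask('hi'))\n"
)


def run_isolated(code, args=(), timeout=120):
    """Run `code` in a fresh interpreter and return (ok, returncode).

    A crash or abort in native code ends only the child process; the
    parent sees a non-zero return code. A timeout yields (False, None).
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code, *args],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False, None
    return proc.returncode == 0, proc.returncode


def smoke_test(model_path):
    """Return True if the bindings answer a trivial prompt without crashing."""
    ok, _ = run_isolated(SMOKE_SNIPPET, [str(model_path)])
    return ok
```

A workflow step could call `smoke_test` on the downloaded model and exit non-zero when it returns `False`, which is enough to stop a later publish job that depends on it.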

## Acceptance criteria

- [ ] Fresh `quant.h` regenerated from v0.8.0+ tree, all 35 tests pass against it
- [ ] `quant_free_string` exposed and used by Python wrapper
- [ ] `Model("file.gguf", kv_compress=1).ask("hi")` works with no abort and no warning
- [ ] CI smoke test enforced before publish
- [ ] CHANGELOG 0.8.2 entry with measurements (Llama 3.2 1B PPL with kv_compress=1 vs CLI baseline — must match within noise)
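The "no warning" part of the third criterion can be checked mechanically rather than by eye. The helper below is a sketch using only the standard `warnings` module; `call_without_warnings` is a hypothetical name and is not part of the bindings.

```python
import warnings


def call_without_warnings(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and fail if it emitted any warning."""
    with warnings.catch_warnings(record=True) as caught:
        # Record every warning, including ones normally shown only once.
        warnings.simplefilter("always")
        result = fn(*args, **kwargs)
    if caught:
        messages = "; ".join(str(w.message) for w in caught)
        raise AssertionError(f"unexpected warnings: {messages}")
    return result
```

In a test this might wrap the constructor, for example `call_without_warnings(Model, "file.gguf", kv_compress=1)`, so that a reintroduced v0.8.1-style fallback fails the test instead of silently resetting `kv_compress` to 0.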

## Why this matters

Until v0.8.2 lands, the headline value prop ("7× KV compression") is only available via the CLI binary, not via `pip install quantcpp`. We have two distribution channels saying different things. v0.8.2 unifies them.
