
Commit eb41858

unamedkr and claude committed
docs/pr: Reddit r/LocalLLaMA v0.8.1 + pip install draft (EN + KO)
Refreshed announcement targeting the 'pip install quantcpp' channel. Headlines the actual user gesture (one shell command) and code snippet. Honest framing on what we are NOT (faster than llama.cpp on GPU, a llama.cpp replacement) preserved from the prior parity drafts. Includes the v0.8.1 hotfix story as the 6th honest correction — treating retraction track record as a marketing asset, not a weakness. Pre-post checklist + response strategy section so the user can ship this without needing more context after a long break. NOT auto-posted. The user owns the timing decision (and the PyPI 0.8.0 yank must happen first). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 107b50f commit eb41858

2 files changed, +166 −0 lines changed
Lines changed: 83 additions & 0 deletions
# Reddit r/LocalLLaMA — quantcpp v0.8.1 + `pip install` (KO)

**Suggested title:** `[Project] quantcpp 0.8.1 — single-header KV-compressed LLM engine, now on PyPI`

**Suggested flair:** `Resources` or `Other`

---

## Body

We just published **quantcpp 0.8.1** — a single-header C inference engine focused on KV cache compression research. It is now installable from PyPI:

```bash
pip install quantcpp
```

```python
from quantcpp import Model

m = Model("model.gguf")
print(m.ask("What is 2+2?"))
```

Pre-built wheels for Linux x86_64, Linux aarch64, and macOS arm64 (CPython 3.9–3.13). Other platforms fall back to the sdist, which compiles `quant.h` automatically — zero runtime dependencies.

### What it is

- **Single header (`quant.h`, ~16K LOC, ~646 KB)** — drop one file into any C project and it works. No CMake, no submodules.
- **7 KV cache quantization types** in one engine, all reproducible from public papers (TurboQuant, PolarQuant, QJL).
- **Pure C, zero dependencies** — runs anywhere a C compiler runs: iOS, Android, WASM, microcontrollers, MSVC.
- **Multi-channel distribution**: PyPI, GGUF integration, a llama.cpp PR draft, single-header drop-in.

### Headline result (Llama 3.2 3B, M-series, CPU-only, 957-token PPL eval, 3-run avg)

| KV type | tok/s | vs FP32 | PPL | ΔPPL | Compression |
|---|---:|---:|---:|---:|---:|
| FP32 | 17.93 | baseline | 13.56 | 0% | 1.0× |
| **turbo_kv_4b** | 18.13 | **+1.1%** | 14.08 | +3.8% | **7.1×** |
| turbo_kv_5b_fast 🆕 | 17.53 | −2.2% | 13.65 | +0.7% | 3.76× |
| turbo_kv_5b | 16.93 | −5.6% | 13.65 | +0.7% | 5.8× |
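As a sanity check on the 7.1× figure: a 4-bit payload alone would give exactly 8×, so the headline number implies some per-block metadata. A back-of-envelope sketch — the block size of 32 and the fp16 per-block scale are our assumptions for illustration, not values taken from the quantcpp source:

```python
# Effective bits per KV value for a block-quantized 4-bit format.
# ASSUMPTIONS (not from the repo): 32 values per block, one fp16 scale/block.
FP32_BITS = 32
PAYLOAD_BITS = 4      # quantized codebook index per value
BLOCK_SIZE = 32       # values sharing one scale (assumed)
SCALE_BITS = 16       # fp16 scale per block (assumed)

effective_bits = PAYLOAD_BITS + SCALE_BITS / BLOCK_SIZE   # 4.5 bits/value
ratio = FP32_BITS / effective_bits
print(f"compression vs fp32: {ratio:.1f}x")               # 7.1x
```

Under these assumptions the metadata overhead is exactly what pulls the ratio from a clean 8× down to the 7.1× in the table.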
The **`turbo_kv_4b`** path achieves fp32 *speed parity* at 7.1× KV compression on Apple Silicon. The kernel that makes this possible is a single NEON instruction (`vqtbl1q_s8`) — a 16-entry codebook lookup. This is Round 10 of the public Karpathy-loop log. v0.8.0 ported the same pattern to AVX2 (`_mm_shuffle_epi8`), extending it to Linux/Windows x86-64.
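For readers who have not met `vqtbl1q_s8`: it performs sixteen parallel byte lookups into a 16-entry table, which is exactly the shape of a 4-bit codebook dequant. A scalar Python reference of that step follows; the codebook values are invented for illustration and are not quantcpp's actual table:

```python
# Illustrative 16-entry codebook (made-up values, NOT quantcpp's table).
CODEBOOK = [-112, -80, -56, -40, -28, -20, -12, -6,
               0,   6,  12,  20,  28,  40,  64, 112]

def dequant_4bit(packed: bytes) -> list[int]:
    """Scalar reference for the codebook-lookup kernel.

    Each byte packs two 4-bit indices (low nibble first). On NEON,
    vqtbl1q_s8 performs this 16-entry lookup for 16 lanes in one
    instruction; _mm_shuffle_epi8 is the AVX2 counterpart.
    """
    out = []
    for byte in packed:
        out.append(CODEBOOK[byte & 0x0F])   # low nibble -> codebook entry
        out.append(CODEBOOK[byte >> 4])     # high nibble -> codebook entry
    return out

print(dequant_4bit(bytes([0x10, 0xF2])))    # [-112, -80, -56, 112]
```

The SIMD win is that the sixteen lookups happen in a single instruction instead of a per-element loop.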
### What we are *not* claiming

- We do not claim to be faster than llama.cpp on GPU. llama.cpp + Metal/CUDA wins production throughput by 5–10×. Our value is in **CPU/embedded** settings, where dispatch overhead dominates GPU compute, and in the **research velocity** of porting new quant methods quickly.
- We are not a llama.cpp replacement. llama.cpp supports 100+ architectures vs. our 7.
- The Python bindings in 0.8.x default to **`kv_compress=0`** (no KV compression). The CLI binary works with all KV types; bringing them into the bindings is tracked for v0.8.2 (regenerated single header). The `pip install` package loads and generates; KV compression comes in the next release.

### Honesty track record

This is the project's **6th self-correction** in the 0.6.x → 0.8.x series. We caught both v0.8.0 Python binding bugs (a default-path abort and a cross-heap `libc.free()`) within hours of publishing, by running an end-user simulation in a clean venv. v0.8.1 is the hotfix. PyPI 0.8.0 is being yanked.

We treat corrections as the project's primary trust asset and log them in the [CHANGELOG](https://github.com/quantumaikr/quant.cpp/blob/main/CHANGELOG.md) the same way we log features.

### Links

- **PyPI**: https://pypi.org/project/quantcpp/
- **GitHub**: https://github.com/quantumaikr/quant.cpp
- **Reproduction harness**: 11 Karpathy-loop rounds documented in `bench/results/turboquant_reproduction.md`
- **Karpathy-loop scoring**: 6 dimensions including a 10-year-position guard (single-header LOC, zero deps, papers ported, honest-correction count) — failures block CI

### We welcome feedback

1. If you `pip install quantcpp` and `Model("your.gguf").ask("hi")` does not exit cleanly on **your** OS/Python version, please open an issue with the trace. The wheel matrix today is Linux x86_64/aarch64 + macOS arm64; everything else uses the sdist (source compile).
2. The llama.cpp PR draft for `TQ_TURBO_KV_4B` is at `docs/pr/2026-04-09-llama-cpp-pr-draft.md`. Anyone with llama.cpp ggml internals experience who wants to co-author the actual port is welcome.

---

## Pre-post checklist (for the user)

- [ ] Yank PyPI 0.8.0 first (https://pypi.org/manage/project/quantcpp/release/0.8.0/) → Options → Yank → enter a reason
- [ ] Confirm `pip install quantcpp` resolves to **0.8.1** in a fresh venv
- [ ] Test the code snippet (`Model("file.gguf").ask("...")`) one more time on the platform you will post from
- [ ] Decide on the title flair — `Resources` is least likely to be auto-removed by mods
- [ ] Pin a comment linking to the v0.8.1 release notes
- [ ] Prepare an answer to "how is this different from llama.cpp" — use the "What we are *not* claiming" section

## Notes for response strategy

- "You're slower than llama.cpp" → agree, point to the "What we are *not* claiming" section
- "Isn't this just llama.cpp + a Python binding?" → explain the single header (drop-in for any C project, no submodules, no CMake), the 7 KV quant types, and research velocity (KIVI/HIGGS, etc.)
- If quant researchers such as Tim Dettmers or Amir Zandieh comment → engage politely; they are the actual target audience for the research-velocity pillar
- Downvote brigade → do not delete the post. The honesty track record is the moat; deletion erodes it.
Lines changed: 83 additions & 0 deletions
# Reddit r/LocalLLaMA — quantcpp v0.8.1 + `pip install` (EN)

**Suggested title:** `[Project] quantcpp 0.8.1 — single-header KV-compressed LLM engine, now on PyPI`

**Suggested flair:** `Resources` or `Other`

---

## Body

We just shipped **quantcpp 0.8.1** — a single-header C inference engine focused on **KV cache compression research**, now installable from PyPI:

```bash
pip install quantcpp
```

```python
from quantcpp import Model

m = Model("model.gguf")
print(m.ask("What is 2+2?"))
```

Pre-built wheels for Linux x86_64, Linux aarch64, and macOS arm64 (CPython 3.9–3.13). Other platforms fall back to the source distribution and compile `quant.h` automatically — zero runtime dependencies.

### What it is

- **Single header (`quant.h`, ~16K LOC, ~646 KB)** — drop one file into any C project; no CMake, no submodules.
- **7 KV cache quantization types** in one engine, all reproducible from public papers (TurboQuant, PolarQuant, QJL).
- **Pure C, zero deps** — runs everywhere a C compiler runs (iOS, Android, WASM, microcontrollers, MSVC).
- **Multi-channel distribution**: PyPI, GGUF integration, llama.cpp PR draft (filed separately), single-header drop-in.

### Headline result (Llama 3.2 3B, M-series, CPU-only, 957-token PPL eval, 3-run avg)

| KV type | tok/s | vs FP32 | PPL | ΔPPL | Compression |
|---|---:|---:|---:|---:|---:|
| FP32 | 17.93 | baseline | 13.56 | 0% | 1.0× |
| **turbo_kv_4b** | 18.13 | **+1.1%** | 14.08 | +3.8% | **7.1×** |
| turbo_kv_5b_fast 🆕 | 17.53 | −2.2% | 13.65 | +0.7% | 3.76× |
| turbo_kv_5b | 16.93 | −5.6% | 13.65 | +0.7% | 5.8× |
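One way to see where 7.1× comes from: 4-bit payloads alone would compress fp32 by exactly 8×, so the table's figure implies some per-block metadata. A back-of-envelope sketch — the block size of 32 and the fp16 per-block scale are assumptions for illustration, not values read out of the quantcpp source:

```python
# Effective bits per KV value for a block-quantized 4-bit format.
# ASSUMPTIONS (illustrative, not from the repo): 32 values per block,
# one fp16 scale per block.
FP32_BITS = 32
PAYLOAD_BITS = 4        # quantized codebook index per value
BLOCK_SIZE = 32         # values sharing one scale (assumed)
SCALE_BITS = 16         # fp16 scale per block (assumed)

effective_bits = PAYLOAD_BITS + SCALE_BITS / BLOCK_SIZE   # 4.5 bits/value
ratio = FP32_BITS / effective_bits
print(f"compression vs fp32: {ratio:.1f}x")               # 7.1x
```

Under these assumptions, per-block scale overhead is exactly what turns the naive 8× into the reported 7.1×.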
The **`turbo_kv_4b`** path achieves fp32 *speed parity* at 7.1× KV compression on Apple Silicon. The kernel that gets us there is a single NEON instruction (`vqtbl1q_s8`) doing a 16-entry codebook lookup — Round 10 of our public Karpathy-loop log. v0.8.0 ports the same pattern to AVX2 (`_mm_shuffle_epi8`) for Linux/Windows x86-64.
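If you haven't met `vqtbl1q_s8`: it does sixteen parallel byte lookups into a 16-entry table, which is exactly the shape of a 4-bit codebook dequant. A scalar Python reference of that step, with codebook values invented for illustration (not quantcpp's actual table):

```python
# Illustrative 16-entry codebook (made-up values, NOT quantcpp's table).
CODEBOOK = [-112, -80, -56, -40, -28, -20, -12, -6,
               0,   6,  12,  20,  28,  40,  64, 112]

def dequant_4bit(packed: bytes) -> list[int]:
    """Scalar reference for the codebook-lookup kernel.

    Each byte packs two 4-bit indices (low nibble first). On NEON,
    vqtbl1q_s8 performs this 16-entry lookup for 16 lanes in a single
    instruction; _mm_shuffle_epi8 is the AVX2 counterpart.
    """
    out = []
    for byte in packed:
        out.append(CODEBOOK[byte & 0x0F])   # low nibble -> codebook entry
        out.append(CODEBOOK[byte >> 4])     # high nibble -> codebook entry
    return out

print(dequant_4bit(bytes([0x10, 0xF2])))    # [-112, -80, -56, 112]
```

The SIMD version collapses the loop body into one table-lookup instruction per 16 values, which is why the 4-bit path can match fp32 speed.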
### What we are NOT claiming

- We are **not faster than llama.cpp on GPU**. llama.cpp + Metal/CUDA wins production throughput by 5–10×. Our value is on CPU/embedded, where dispatch overhead dominates GPU compute, and on the **research velocity** of porting new quant methods quickly.
- We are **not a llama.cpp replacement**. llama.cpp supports 100+ archs; we support 7 (the ones we benchmark).
- The Python bindings in 0.8.x default to **`kv_compress=0`** (no KV compression). The CLI binary works with all KV types; bringing them to the bindings is tracked for v0.8.2 (regenerated single header). The `pip install` package will load + generate; KV compression comes next release.

### Honesty track record

This is the project's **6th self-correction** in the 0.6.x → 0.8.x series. We caught both v0.8.0 Python binding bugs (a default-path abort and a cross-heap `libc.free()`) within hours of publishing, by running an end-user simulation in a clean venv. v0.8.1 is the hotfix. PyPI 0.8.0 is being yanked.

We treat retractions as the project's primary trust asset and log them in the [CHANGELOG](https://github.com/quantumaikr/quant.cpp/blob/main/CHANGELOG.md) the same way we log features.

### Links

- **PyPI**: https://pypi.org/project/quantcpp/
- **GitHub**: https://github.com/quantumaikr/quant.cpp
- **Reproduction harness**: 11 Karpathy-loop rounds documented at `bench/results/turboquant_reproduction.md`
- **Karpathy-loop scoring**: 6 dimensions including a 10-year-position guard (single-header LOC, zero deps, papers ported, honest-correction count) — failures break CI

### What we'd love feedback on

1. If you `pip install quantcpp` and `Model("your.gguf").ask("hi")` doesn't return cleanly on **your** OS / Python version, please open an issue with the trace. The wheel matrix is Linux x86_64/aarch64 + macOS arm64 today; everything else uses the sdist (source compile).
2. The llama.cpp PR draft for `TQ_TURBO_KV_4B` is at `docs/pr/2026-04-09-llama-cpp-pr-draft.md`. If anyone with llama.cpp ggml internals experience wants to co-author the actual port, we'd welcome the help.

---

## Pre-post checklist (for the user posting)

- [ ] Yank PyPI 0.8.0 first (https://pypi.org/manage/project/quantcpp/release/0.8.0/) → Options → Yank → reason text
- [ ] Confirm `pip install quantcpp` resolves to **0.8.1** in a fresh venv
- [ ] Test the code snippet (`Model("file.gguf").ask("...")`) one more time on the target platform you'll mention
- [ ] Decide on the title flair — `Resources` is least likely to be auto-removed by mods
- [ ] Pin a comment with the link to the v0.8.1 release notes
- [ ] Be ready to respond to "how does this compare to llama.cpp" — the answer is in the "What we are NOT claiming" section above

## Notes for response strategy

- If anyone says "you're slower than llama.cpp" → agree, point to "What we are NOT claiming"
- If anyone says "this is just llama.cpp + a Python binding" → point to the single header (drop into any C project, no submodule, no CMake), 7 KV quant types, and research velocity (KIVI/HIGGS/etc.)
- If Tim Dettmers, Amir Zandieh, or other quant-research authors comment → engage thoughtfully; they're the actual target audience for the research-velocity pillar
- If a downvote brigade hits → leave the post up, do not delete. The honesty track record is the moat; deletion erodes it.
