"LLM의 SQLite" — 가장 작고, 가장 읽기 쉽고, 가장 쉽게 임베딩할 수 있는 LLM 엔진.
Two directions:
- Embedding Engine: quant.h 단일 헤더(15K LOC)로 어디든 LLM 추가
- KV Compression Research: 함수 3개로 새 양자화 타입 추가 가능한 연구 플랫폼
Non-goals: GPU 속도 경쟁 (llama.cpp 영역), 배치 서빙 (vLLM 영역), 학습
quant.cpp is a minimal C inference engine for local LLM with KV cache compression. 72K LOC, pure C, zero dependencies. Supports 7 architectures via GGUF. Killer feature: KV cache compression — 7x compression with PPL +0.0% vs FP32. Ships as quant.h (15K LOC single header) and WASM (192KB).
include/turboquant/ — Public C API (turboquant.h, tq_types.h, tq_spec.h)
src/core/ — Algorithms (tq_polar.c, tq_qjl.c, tq_turbo.c, tq_uniform.c, tq_traits.c)
src/cache/ — Paged cache + progressive compression
src/backend/cpu/ — CPU kernels (generic, AVX2, NEON)
src/backend/cuda/ — CUDA kernels
src/backend/metal/ — Metal compute shaders (7 kernels: matmul, rmsnorm, rope, attention, etc.)
src/engine/ — GGUF loader, transformer forward, tokenizer, generate
tests/ — Google Test unit tests (34 tests)
wasm/ — Browser demo (quant.wasm 192KB + index.html)
docs/ — API reference, custom quantization guide, tech report
examples/ — Embedding examples (minimal, chat, kv_compare)
bench/ — Performance benchmarks
spec/ — Format specification + test vectors
integrations/ — llama.cpp, vLLM, Python bindings
harness/ — Autonomous development harness (run.sh, team.toml)
docs/prd_v0.1.md— Full product requirementsdocs/wbs_v0.1.md— Work breakdown structure with checklistsprogram.md— Current agent task specification (READ THIS FIRST)score.sh— Automated 5-dimension scoring (0.0 ~ 1.0)
refs/QJL/— QJL Python/CUDA implementation (Amir Zandieh)refs/PolarQuant/— PolarQuant Python/Triton implementationrefs/llama.cpp/— llama.cpp fork with TQ1/TQ2 weight quantizationrefs/vllm/— vLLM KV cache infrastructurerefs/onnx/— ONNX quantization operator specification
cmake -B build -DCMAKE_BUILD_TYPE=Release -DTQ_BUILD_TESTS=ON -DTQ_BUILD_BENCH=ON
cmake --build build -j$(sysctl -n hw.ncpu 2>/dev/null || nproc)
ctest --test-dir build --output-on-failurebash score.sh # Full 5-dimension evaluation
bash score.sh --quick # Build + correctness only (fast iteration)
bash score.sh --bench # Performance only
bash score.sh --quality # Quantization quality onlyThe score measures 5 dimensions:
- Structure (10w) — Headers, sources, tests, specs exist
- Correctness (11w) — Builds, tests pass, zero warnings
- Quality (11w) — Roundtrip MSE < 0.01, attention cosine > 0.99
- Performance (14w) — Throughput, compression ratio 5x, SIMD speedup 4x
- Integration (6w) — llama.cpp/vLLM/Python plugins, docs, examples
- C11 for core library, C++17 for tests/CUDA/Metal wrappers
- No external dependencies in core (libc/libm only)
- Every block struct must have static_assert size verification
- Every public function must have a unit test
- ONNX LSB-first bit-packing convention for all quantized formats
- Use refs/ code as algorithm reference — port to C, don't wrap Python
| Type | Key Bits | Algorithm | Block Size |
|---|---|---|---|
| TQ_POLAR_3B | 3 | PolarQuant | 128 |
| TQ_POLAR_4B | 4 | PolarQuant | 128 |
| TQ_QJL_1B | 1 | QJL sign hash | 256 |
| TQ_TURBO_3B | 3 | Polar 2b + QJL 1b | 128 |
| TQ_TURBO_4B | 4 | Polar 3b + QJL 1b | 128 |
| TQ_UNIFORM_4B | 4 | Min-Max | 128 |
| TQ_UNIFORM_2B | 2 | Min-Max | 128 |
Each module has exclusive file ownership. When working in parallel, agents must only modify their assigned files:
| Module | Owned Files | Dependencies |
|---|---|---|
foundation |
CMakeLists.txt, include/**, src/core/tq_traits.c |
None |
polar |
src/core/tq_polar.*, tests/test_polar.* |
foundation |
qjl |
src/core/tq_qjl.*, tests/test_qjl.* |
foundation |
turbo |
src/core/tq_turbo.*, tests/test_turbo.* |
polar, qjl |
uniform |
src/core/tq_uniform.*, src/core/tq_value_quant.*, tests/test_uniform.*, tests/test_value.* |
foundation |
cache |
src/cache/**, tests/test_paged_cache.*, tests/test_progressive.* |
foundation |
simd-neon |
src/backend/cpu/** |
polar, qjl |
gpu-cuda |
src/backend/cuda/** |
polar, qjl |
gpu-metal |
src/backend/metal/** |
polar, qjl |
bench |
bench/**, spec/**, tests/reference/** |
all core |
integration |
integrations/**, bindings/**, examples/** |
all |
이 프로젝트는 Harness 패턴을 적용한 에이전트 팀 구조를 사용한다.
| Agent | Role | Subagent Type |
|---|---|---|
architect |
기술 리더, 작업 분해/위임, Merge Gate | Leader |
core-dev |
알고리즘 구현, 테스트 작성 | general-purpose |
perf-dev |
SIMD/GPU 최적화, 벤치마크 | general-purpose |
qa |
경계면 교차 비교, 통합 정합성 검증 | general-purpose |
| Skill | Trigger | Description |
|---|---|---|
orchestrate |
"개발 시작", "하네스 실행" | 팀 구성 → 위임 → Merge Gate → 루프 |
develop |
"다음 항목", 모듈명 | Karpathy 루프: score → implement → verify |
score |
"점수", "현재 상태" | 5차원 자동 평가 + 병목 분석 |
qa |
"검증", 머지 전 | 경계면 교차 비교 (5대 경계면) |
- 경계면 집중: 단일 함수가 아니라 함수 간/모듈 간 데이터 흐름 검증
- 점진적 QA: 전체 완성 후 1회가 아닌, 모듈 완성 직후 즉시 검증
- 교차 비교: API 출력과 실제 블록 데이터를 동시에 읽고 비교
- 회귀 방지: 발견된 버그는 반드시 테스트로 영구 방어
This project uses a Hierarchical Harness that combines two methodologies:
Single agent, sequential, safe. Score → modify → score → revert-if-worse → repeat.
Multiple agents in isolated git worktrees, parallel, fast. Leader delegates, workers execute.
Outer Loop (Leader — Karpathy pattern):
1. bash score.sh → measure full project score
2. Identify lowest-scoring dimension
3. Delegate independent modules to Workers (ClawTeam)
4. Wait for workers to complete
5. Merge Gate: merge each worker's results one-by-one
- If score improves or stays: keep merge
- If score drops: revert that merge
6. Repeat from 1
Inner Loop (Workers — each runs own Karpathy loop):
- Work in isolated git worktree
- Only modify files in their module ownership
- score → modify → score → revert-if-worse
- Report completion via clawteam task update
| Score Range | Phase | Strategy |
|---|---|---|
| 0.00 ~ 0.05 | Foundation | Single agent, sequential (CMake, headers, types) |
| 0.05 ~ 0.30 | Core Algorithms | 3 parallel workers (polar, qjl, uniform) |
| 0.30 ~ 0.60 | Advanced | 4 parallel workers (turbo, cache, simd, bench) |
| 0.60 ~ 1.00 | Fine-tuning | Single agent, precision optimization |
When merging worker results back to main:
- Save current HEAD
- Merge worker branch
- Run
bash score.sh --quick - If score dropped →
git reset --hardto saved HEAD - If score OK → proceed to next worker
/score— Run the scoring harness and show results/develop— Start autonomous development (one WBS item per round)/harness— Launch the full hierarchical harness/spawn-team— Spawn ClawTeam parallel workers for current phase
# Recommended: Hierarchical hybrid mode
./harness/run.sh --target 0.9
# Single agent Karpathy loop only
./harness/run.sh --single --rounds 50
# ClawTeam parallel workers only
./harness/run.sh --parallel-only
# Manual team spawn
clawteam launch harness/team.toml --goal "Build quant.cpp" --workspace