quant.cpp — Agent Development Guide

Project Vision

"LLM의 SQLite" — 가장 작고, 가장 읽기 쉽고, 가장 쉽게 임베딩할 수 있는 LLM 엔진.

Two directions:

Embedding Engine: quant.h 단일 헤더(15K LOC)로 어디든 LLM 추가
KV Compression Research: 함수 3개로 새 양자화 타입 추가 가능한 연구 플랫폼

Non-goals: GPU 속도 경쟁 (llama.cpp 영역), 배치 서빙 (vLLM 영역), 학습

Project Overview

quant.cpp is a minimal C inference engine for local LLM with KV cache compression. 72K LOC, pure C, zero dependencies. Supports 7 architectures via GGUF. Killer feature: KV cache compression — 7x compression with PPL +0.0% vs FP32. Ships as quant.h (15K LOC single header) and WASM (192KB).

Architecture

include/turboquant/   — Public C API (turboquant.h, tq_types.h, tq_spec.h)
src/core/             — Algorithms (tq_polar.c, tq_qjl.c, tq_turbo.c, tq_uniform.c, tq_traits.c)
src/cache/            — Paged cache + progressive compression
src/backend/cpu/      — CPU kernels (generic, AVX2, NEON)
src/backend/cuda/     — CUDA kernels
src/backend/metal/    — Metal compute shaders (7 kernels: matmul, rmsnorm, rope, attention, etc.)
src/engine/           — GGUF loader, transformer forward, tokenizer, generate
tests/                — Google Test unit tests (34 tests)
wasm/                 — Browser demo (quant.wasm 192KB + index.html)
docs/                 — API reference, custom quantization guide, tech report
examples/             — Embedding examples (minimal, chat, kv_compare)
bench/                — Performance benchmarks
spec/                 — Format specification + test vectors
integrations/         — llama.cpp, vLLM, Python bindings
harness/              — Autonomous development harness (run.sh, team.toml)

Key Documents

docs/prd_v0.1.md — Full product requirements
docs/wbs_v0.1.md — Work breakdown structure with checklists
program.md — Current agent task specification (READ THIS FIRST)
score.sh — Automated 5-dimension scoring (0.0 ~ 1.0)

Reference Implementations (DO NOT MODIFY)

refs/QJL/ — QJL Python/CUDA implementation (Amir Zandieh)
refs/PolarQuant/ — PolarQuant Python/Triton implementation
refs/llama.cpp/ — llama.cpp fork with TQ1/TQ2 weight quantization
refs/vllm/ — vLLM KV cache infrastructure
refs/onnx/ — ONNX quantization operator specification

Build & Test

cmake -B build -DCMAKE_BUILD_TYPE=Release -DTQ_BUILD_TESTS=ON -DTQ_BUILD_BENCH=ON
cmake --build build -j$(sysctl -n hw.ncpu 2>/dev/null || nproc)
ctest --test-dir build --output-on-failure

Scoring (5-Dimension Harness)

bash score.sh              # Full 5-dimension evaluation
bash score.sh --quick      # Build + correctness only (fast iteration)
bash score.sh --bench      # Performance only
bash score.sh --quality    # Quantization quality only

The score measures 5 dimensions:

Structure (10w) — Headers, sources, tests, specs exist
Correctness (11w) — Builds, tests pass, zero warnings
Quality (11w) — Roundtrip MSE < 0.01, attention cosine > 0.99
Performance (14w) — Throughput, compression ratio 5x, SIMD speedup 4x
Integration (6w) — llama.cpp/vLLM/Python plugins, docs, examples

Coding Standards

C11 for core library, C++17 for tests/CUDA/Metal wrappers
No external dependencies in core (libc/libm only)
Every block struct must have static_assert size verification
Every public function must have a unit test
ONNX LSB-first bit-packing convention for all quantized formats
Use refs/ code as algorithm reference — port to C, don't wrap Python

Quantization Types

Type	Key Bits	Algorithm	Block Size
TQ_POLAR_3B	3	PolarQuant	128
TQ_POLAR_4B	4	PolarQuant	128
TQ_QJL_1B	1	QJL sign hash	256
TQ_TURBO_3B	3	Polar 2b + QJL 1b	128
TQ_TURBO_4B	4	Polar 3b + QJL 1b	128
TQ_UNIFORM_4B	4	Min-Max	128
TQ_UNIFORM_2B	2	Min-Max	128

Module Ownership (Conflict-Free Zones)

Each module has exclusive file ownership. When working in parallel, agents must only modify their assigned files:

Module	Owned Files	Dependencies
`foundation`	`CMakeLists.txt`, `include/**`, `src/core/tq_traits.c`	None
`polar`	`src/core/tq_polar.`, `tests/test_polar.`	foundation
`qjl`	`src/core/tq_qjl.`, `tests/test_qjl.`	foundation
`turbo`	`src/core/tq_turbo.`, `tests/test_turbo.`	polar, qjl
`uniform`	`src/core/tq_uniform.`, `src/core/tq_value_quant.`, `tests/test_uniform.`, `tests/test_value.`	foundation
`cache`	`src/cache/*`, `tests/test_paged_cache.`, `tests/test_progressive.*`	foundation
`simd-neon`	`src/backend/cpu/**`	polar, qjl
`gpu-cuda`	`src/backend/cuda/**`	polar, qjl
`gpu-metal`	`src/backend/metal/**`	polar, qjl
`bench`	`bench/`, `spec/`, `tests/reference/**`	all core
`integration`	`integrations/`, `bindings/`, `examples/**`	all

Agent Team & Skills (Harness Architecture)

이 프로젝트는 Harness 패턴을 적용한 에이전트 팀 구조를 사용한다.

Agents (.claude/agents/)

Agent	Role	Subagent Type
`architect`	기술 리더, 작업 분해/위임, Merge Gate	Leader
`core-dev`	알고리즘 구현, 테스트 작성	general-purpose
`perf-dev`	SIMD/GPU 최적화, 벤치마크	general-purpose
`qa`	경계면 교차 비교, 통합 정합성 검증	general-purpose

Skills (.claude/skills/)

Skill	Trigger	Description
`orchestrate`	"개발 시작", "하네스 실행"	팀 구성 → 위임 → Merge Gate → 루프
`develop`	"다음 항목", 모듈명	Karpathy 루프: score → implement → verify
`score`	"점수", "현재 상태"	5차원 자동 평가 + 병목 분석
`qa`	"검증", 머지 전	경계면 교차 비교 (5대 경계면)

QA 원칙 (from refs/harness)

경계면 집중: 단일 함수가 아니라 함수 간/모듈 간 데이터 흐름 검증
점진적 QA: 전체 완성 후 1회가 아닌, 모듈 완성 직후 즉시 검증
교차 비교: API 출력과 실제 블록 데이터를 동시에 읽고 비교
회귀 방지: 발견된 버그는 반드시 테스트로 영구 방어

Development Methodology: Hierarchical Harness

This project uses a Hierarchical Harness that combines two methodologies:

Methodology A: Karpathy AutoResearch Loop

Single agent, sequential, safe. Score → modify → score → revert-if-worse → repeat.

Methodology B: ClawTeam Multi-Agent

Multiple agents in isolated git worktrees, parallel, fast. Leader delegates, workers execute.

Combined: Hierarchical Harness

Outer Loop (Leader — Karpathy pattern):
  1. bash score.sh → measure full project score
  2. Identify lowest-scoring dimension
  3. Delegate independent modules to Workers (ClawTeam)
  4. Wait for workers to complete
  5. Merge Gate: merge each worker's results one-by-one
     - If score improves or stays: keep merge
     - If score drops: revert that merge
  6. Repeat from 1

Inner Loop (Workers — each runs own Karpathy loop):
  - Work in isolated git worktree
  - Only modify files in their module ownership
  - score → modify → score → revert-if-worse
  - Report completion via clawteam task update

Phase Transitions (Score-Based)

Score Range	Phase	Strategy
0.00 ~ 0.05	Foundation	Single agent, sequential (CMake, headers, types)
0.05 ~ 0.30	Core Algorithms	3 parallel workers (polar, qjl, uniform)
0.30 ~ 0.60	Advanced	4 parallel workers (turbo, cache, simd, bench)
0.60 ~ 1.00	Fine-tuning	Single agent, precision optimization

Merge Gate Protocol

When merging worker results back to main:

Save current HEAD
Merge worker branch
Run bash score.sh --quick
If score dropped → git reset --hard to saved HEAD
If score OK → proceed to next worker

Slash Commands

/score — Run the scoring harness and show results
/develop — Start autonomous development (one WBS item per round)
/harness — Launch the full hierarchical harness
/spawn-team — Spawn ClawTeam parallel workers for current phase

Running the Harness

# Recommended: Hierarchical hybrid mode
./harness/run.sh --target 0.9

# Single agent Karpathy loop only
./harness/run.sh --single --rounds 50

# ClawTeam parallel workers only
./harness/run.sh --parallel-only

# Manual team spawn
clawteam launch harness/team.toml --goal "Build quant.cpp" --workspace

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

quant.cpp — Agent Development Guide

Project Vision

Project Overview

Architecture

Key Documents

Reference Implementations (DO NOT MODIFY)

Build & Test

Scoring (5-Dimension Harness)

Coding Standards

Quantization Types

Module Ownership (Conflict-Free Zones)

Agent Team & Skills (Harness Architecture)

Agents (.claude/agents/)

Skills (.claude/skills/)

QA 원칙 (from refs/harness)

Development Methodology: Hierarchical Harness

Methodology A: Karpathy AutoResearch Loop

Methodology B: ClawTeam Multi-Agent

Combined: Hierarchical Harness

Phase Transitions (Score-Based)

Merge Gate Protocol

Slash Commands

Running the Harness

FilesExpand file tree

CLAUDE.md

Latest commit

History

CLAUDE.md

File metadata and controls

quant.cpp — Agent Development Guide

Project Vision

Project Overview

Architecture

Key Documents

Reference Implementations (DO NOT MODIFY)

Build & Test

Scoring (5-Dimension Harness)

Coding Standards

Quantization Types

Module Ownership (Conflict-Free Zones)

Agent Team & Skills (Harness Architecture)

Agents (.claude/agents/)

Skills (.claude/skills/)

QA 원칙 (from refs/harness)

Development Methodology: Hierarchical Harness

Methodology A: Karpathy AutoResearch Loop

Methodology B: ClawTeam Multi-Agent

Combined: Hierarchical Harness

Phase Transitions (Score-Based)

Merge Gate Protocol

Slash Commands

Running the Harness