Releases: redchupa/lumen
v0.5.0 — Measure-driven cycle (gap 1.376× → 1.304×)
Headline
| Metric | v0.4.0 | v0.5.0 | Change |
|---|---|---|---|
| tg32 @ 1 thread (Lumen vs ggml) | 41.85 vs 40.40 | 45.64 vs 40.40 | Lumen +13% |
| tg32 @ 8 threads (Lumen vs ggml) | 65.5 vs 90.9 | 67.4 vs 87.9 | gap 1.376× → 1.304× |
| pp32 (LUMEN_PREFILL=1, opt-in) | n/a | 54.15 tok/s | new path |
Tokens remain bit-identical to v0.4.0. Same Qwen2.5-0.5B-Q8_0 model, same AMD Zen 4 host.
What shipped
- Phase 8.A — ThreadPool rewrite (mpsc + per-task allocation + mutex receiver → atomic next_task counter + generation-based wake). LUMEN_THREADS=N env var added for scaling experiments.
- Phase 8.E.1 — Q8 prefill via N=1 fan-out dispatcher. Sidesteps the slow N>1 codegen body (2.3-2.6× slower than N=1×N at every Qwen projection shape, confirmed by Phase 8.D.5 micro-bench). LUMEN_PREFILL=1 opt-in.
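The atomic next_task counter from Phase 8.A can be illustrated with a small sketch. This is not Lumen's pool: it assumes scoped std threads instead of persistent workers, leaves out the generation-based wake entirely, and `run_indexed` is a hypothetical name.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Runs `task(i)` for every i in 0..n_tasks on `n_threads` scoped workers.
/// Workers claim indices from one shared atomic counter, so there is no
/// channel, no per-task allocation, and no mutex around a receiver.
/// (Hypothetical helper; the persistent pool and wake logic are omitted.)
pub fn run_indexed<F>(n_tasks: usize, n_threads: usize, task: F)
where
    F: Fn(usize) + Sync,
{
    let next_task = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..n_threads.max(1) {
            s.spawn(|| loop {
                // fetch_add hands each index to exactly one worker.
                let i = next_task.fetch_add(1, Ordering::Relaxed);
                if i >= n_tasks {
                    break;
                }
                task(i);
            });
        }
    });
}
```

Each worker pays one atomic increment per task, which is the per-call overhead the rewrite is described as targeting.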
What didn't ship but stays in tree (default-off)
These are the measure-driven negative results from this cycle. Infrastructure preserved; activation waits for either a better host (ZMM on native 512-bit silicon) or the kernel rework (Phase 8.E.2).
- AVX-512 ZMM kernels (Phase 7.R/S/T). Zen 4 implements AVX-512 as double-pumped 256-bit, so ZMM ops don't deliver a lane-width win on this host.
- chunk_rows L2-fit cap (8.B), software prefetch encoders (8.C), forward_layer_prefill_jit + Model::forward_prefill_jit (8.D.1-3). All measured net-negative on the current N>1 codegen.
11-cycle pattern (v0.5 measure-driven decisions)
| Phase | Hypothesis | Result |
|---|---|---|
| 7.M | Q8 native int dot | net-neutral |
| 7.N | VNNI single-acc | -3.6% |
| 7.O | VNNI 4-acc | -2.7% (revived in 7.P) |
| 7.R/S/T | AVX-512 ZMM lane width | -4.5% |
| 8.A | atomic ThreadPool | +9% 1t, +3% 8t ✓ |
| 8.B | chunk L2-fit cap | -3.5% (reverted) |
| 8.C | software prefetch | 8t -49% (reverted) |
| 8.D.3 | prefill batching | -2.9× pp32 |
| 8.D.3 retro | attention is the cause | 1% only (false) |
| 8.D.5 | N>1 codegen is slow | confirmed ✓ |
| 8.E.1 | N=1 fan-out fixes plumbing | +2.39× pp32 ✓ |
Net: 1 merged, 1 diagnostic, 1 partial fix vs 7 reverts. Gap closed 1.376× → 1.304×.
What's next (see ROADMAP.md)
- Phase 8.E.2 — Q8 N>1 codegen rewrite (port the 7.G 4-accumulator YMM pattern). Expected to close most of the 8t gap.
- ARM64 backend — Apple Silicon / Graviton / Raspberry Pi support.
- Q4 native matmul — enables Qwen2.5-1.5B and 3B at the same memory footprint.
Retrospective blog series
16 posts across the v0.1-v0.5 build, including the v0.5-cycle retros, are indexed at docs/blog/INDEX.md.
Lumen v0.4.0 — shape-aware VNNI/fp32 hybrid dispatch
v0.4.0 ships the result of four measurement-led phases (7.M → 7.P): three attempts to push int-dot product wins through, each disproved by bench, and a final hybrid dispatch that combines them all into the actual win. Tokens stay bit-identical to the naive Rust reference across every commit.
tg32 (Qwen2.5-0.5B-Q8_0, Windows, 8 worker threads on Zen 4)
| Path | tok/s | vs ggml -t 8 |
|---|---|---|
| Lumen v0.3.0 (Q8×F32 4-acc, custom pool) | ~60 | 1.51× slower |
| Lumen v0.4.0 (per-shape VNNI/fp32 hybrid) | ~65 | 1.39× slower |
| ggml -t 8 | 90.90 | 1.0× |
Sanity: Lumen 8t (65) is 1.58× faster than ggml -t 1 (41.32).
v0.1.0 → v0.4.0 spread: 4.43 → 65.3 tok/s = 14.7× decode in ~5 days.
What landed (and why the path zig-zagged)
Phase 7.M — Q8×Q8 AVX2 int dot
Built the full Q8×Q8 fused matmul infrastructure: 7 new AVX2 integer encoders (vpsignb, vpmaddubsw, vpmaddwd, vmovdqu, vpaddd, vpcmpeqd, vpsrlw_imm8), the IR pattern (Param Q8 + Param Q8 + 2× Dequantize + MatMul), MatmulJitCache::get_or_compile_q8q8, model.rs activation quantization pipeline. Measured net-neutral on Zen 4 — gate_up/lm_head/qkv/wo all improved, down_matmul regressed +27%. Default-off, infrastructure kept.
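A scalar sketch of what a Q8×Q8 path computes per 32-element block may help here. It assumes a ggml-style symmetric quantizer (scale = max|x| / 127), which the release notes do not spell out, and stores the scale as f32 to stay std-only. `BlockQ8`, `quantize_block` and `dot_q8q8` are hypothetical names, not Lumen's API.

```rust
/// One 32-element block: a scale plus 32 signed 8-bit quants.
/// (Assumption: f32 scale for simplicity; Q8_0 stores it as fp16.)
pub struct BlockQ8 {
    pub d: f32,
    pub qs: [i8; 32],
}

/// Quantizes 32 activations to int8 with a per-block scale d = max|x| / 127.
pub fn quantize_block(x: &[f32; 32]) -> BlockQ8 {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    let inv = if d > 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; 32];
    for (q, v) in qs.iter_mut().zip(x) {
        *q = (v * inv).round() as i8;
    }
    BlockQ8 { d, qs }
}

/// Integer dot of two blocks, converted back to f32 once per block.
pub fn dot_q8q8(w: &BlockQ8, a: &BlockQ8) -> f32 {
    let sum: i32 = w
        .qs
        .iter()
        .zip(&a.qs)
        .map(|(&p, &q)| p as i32 * q as i32)
        .sum();
    w.d * a.d * sum as f32
}
```

The integer sum is the part an AVX2 or VNNI kernel would vectorize; the two scales are applied once per block afterwards.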
Phase 7.N — VNNI vpdpbusd encoders + 3-way dispatch
Added VEX-256 (AVX-VNNI client CPUs) and EVEX-256 (AVX-512 VNNI servers) forms of vpdpbusd plus runtime CPU detection. EVEX prefix encoded by hand (4 bytes: 0x62 + P0/P1/P2). The Q8×Q8 unit test confirmed that the EVEX form runs on Zen 4 even though the avxvnni CPUID bit is clear. Measured -3.6% on Zen 4 with the single-accumulator design. Default-off, infrastructure kept.
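To make the 4-byte EVEX prefix concrete, here is a sketch of encoding the register-direct EVEX-256 form of vpdpbusd (EVEX.256.66.0F38.W0 50 /r). It follows my reading of the Intel SDM field layout and is deliberately narrow: registers ymm0..ymm7 only, no masking, no broadcast, no memory operands. `encode_vpdpbusd_evex256` is a hypothetical name, and the bytes are worth checking against a disassembler before relying on them.

```rust
/// Encodes `vpdpbusd ymm_dst, ymm_src1, ymm_src2` (EVEX-256, register-direct).
/// Sketch limits: registers 0..=7, no opmask, no broadcast, no memory operand.
pub fn encode_vpdpbusd_evex256(dst: u8, src1: u8, src2: u8) -> [u8; 6] {
    assert!(dst < 8 && src1 < 8 && src2 < 8, "sketch covers ymm0..ymm7 only");
    // P0: inverted R, X, B, R' (all 1 for low registers), opcode map 0F38 = 0b10.
    let p0: u8 = 0b1111_0000 | 0b10;
    // P1: W=0, inverted vvvv (src1), one fixed 1 bit, pp=01 (66 prefix).
    let p1: u8 = ((!src1 & 0x0F) << 3) | 0b100 | 0b01;
    // P2: z=0, L'L=01 (256-bit), b=0, inverted V'=1, aaa=000 (no mask).
    let p2: u8 = (0b01 << 5) | (1 << 3);
    // ModRM: mod=11 (register), reg=dst, rm=src2.
    let modrm: u8 = 0xC0 | (dst << 3) | src2;
    [0x62, p0, p1, p2, 0x50, modrm]
}
```

For `vpdpbusd ymm1, ymm2, ymm3` this yields `62 F2 6D 28 50 CB`. Extended registers would require clearing the inverted R/X/B/R'/V' bits, which this sketch does not handle.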
Phase 7.O — 4-accumulator VNNI kernel
Applied Phase 7.G's 4-acc trick to VNNI: unroll the kb loop by 4 and feed each block into its own fp32 sub-accumulator. Win64-safe via ymm6 save/restore on the stack. Still measured -2.7% on Zen 4: gate_up/qkv/wo/lm_head won as predicted (-9% to -28%) but down_matmul still regressed +27%, roughly canceling the gains. Default-off, infrastructure kept.
Phase 7.P — per-shape hybrid dispatch (the real win)
Three phases of data made the answer obvious: VNNI wins on short-K matmuls (K_blocks=28: -9% to -28%) and loses on long-K (K_blocks=152: +27%) because of CPU OoO window saturation in the longer inner loop. Solution: dispatch per matmul shape.

    fn use_vnni_for_matmul(d_in: usize) -> bool {
        has_vnni() && (d_in / 32) <= 64
    }

Routes Qwen's 5 short-K matmuls (qkv/wo/gate/up/lm_head) to VNNI and FFN-down to fp32 4-acc. +4.5% bench (62.5 → 65.3 tok/s), -13.3% per-step profile. Default-on.
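As a worked instance of the dispatch rule above: K_blocks=28 corresponds to d_in = 28 × 32 = 896 and K_blocks=152 to d_in = 4864. The sketch below passes CPU detection in as a flag so it runs on any host; `Kernel` and `pick_kernel` are hypothetical names introduced for illustration.

```rust
/// Q8_0 block length along K, matching the `d_in / 32` in the rule.
const BLOCK: usize = 32;

/// Same threshold as `use_vnni_for_matmul`, with the CPU check
/// passed in as a parameter (assumption made for testability).
pub fn use_vnni(has_vnni: bool, d_in: usize) -> bool {
    has_vnni && (d_in / BLOCK) <= 64
}

#[derive(Debug, PartialEq)]
pub enum Kernel {
    Vnni,
    Fp32FourAcc,
}

/// Picks a kernel per matmul shape: short K goes to VNNI, long K to fp32 4-acc.
pub fn pick_kernel(has_vnni: bool, d_in: usize) -> Kernel {
    if use_vnni(has_vnni, d_in) {
        Kernel::Vnni
    } else {
        Kernel::Fp32FourAcc
    }
}
```

With VNNI available, `pick_kernel(true, 896)` selects the VNNI kernel and `pick_kernel(true, 4864)` falls back to fp32 4-acc, matching the short-K versus long-K split described in the text.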
v0.4.0 architectural promise
Every piece of code from 7.M, 7.N, 7.O is now on the production hot path via 7.P — no dead code. The 3-way CPU dispatch (avx512vnni > avxvnni > AVX2 fallback) and 3-way codegen dispatch (4-acc / single-acc / AVX2 chain) all light up depending on host and shape. External runtime deps still just thiserror.
v0.4 → v0.5
Single clear next move: AVX-512 ZMM fp32 + ZMM VNNI. The remaining 1.39× gap to ggml is almost entirely lane width — ggml uses ZMM (512-bit), we use YMM (256-bit). Extending the EVEX encoder family to ZMM ops and writing new ZMM kernels is the v0.5 work.
Read more
- BENCHMARK.md — full breakdown
- docs/blog/phase7p-v0.4.0-shape-aware-ko.md — v0.4 retrospective (KR)
- docs/blog/phase7m-7n-int-dot-honest-ko.md — Phase 7.M+7.N candid post-mortem (KR)
- docs/blog/phase7j-v0.3.0-profile-and-parallel-ko.md — v0.3 retrospective (KR)
Lumen v0.3.0 — profile, parallelize, 3.1× faster than v0.2.0
A day after v0.2.0, Lumen's Qwen2.5-0.5B decode runs 3.1× faster when allowed to use multiple threads, and +72% even on a single thread. Tokens remain bit-identical to the naive Rust reference.
tg32 (Qwen2.5-0.5B-Q8_0, Windows AVX2)
Single thread
| Path | tok/s | vs ggml |
|---|---|---|
| Lumen v0.2.0 | 17.97 | 2.30× slower |
| Lumen v0.3.0 | ~31 | 1.32× slower |
| ggml -t 1 | 41.32 | 1.0× |
8 threads (default on the test machine)
| Path | tok/s | vs ggml -t 8 |
|---|---|---|
| Lumen v0.3.0 | ~56 | 1.62× slower |
| ggml -t 8 | 90.90 | 1.0× |
Sanity check: Lumen 8t (56) is 1.36× faster than ggml 1t (41.32). Multi-thread scaling: Lumen 1.81×, ggml 2.20× — narrowing that gap is v0.4.
What landed
Phase 7.E.0 — profile decode time distribution
StepTimer accumulator + generate_greedy_jit_timed + an #[ignore] test that runs 32 decode tokens and prints a sorted breakdown. Showed:

    layer/gate_up_matmul   692ms  36.1%
    lm_head_matmul         446ms  23.2%
    layer/down_matmul      414ms  21.6%
    layer/silu_mul         155ms   8.1%
    layer/qkv_matmul        83ms   4.3%
    layer/wo_matmul         65ms   3.4%
    layer/rope              43ms   2.2%
    layer/attention         17ms   0.9%  ← killed the planned flash-attn phase
    (rest)                         <1%

88.6% matmul, 0.9% attention compute. The originally planned Phase 7.E (flash-style attention) would have saved at most fractions of a percent — deprioritized on the spot.
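A per-step accumulator of the kind described can be sketched in a few lines. This is an illustration under assumptions, not Lumen's StepTimer: the method names `add` and `breakdown` are hypothetical, and callers are assumed to measure each step themselves (for example with `std::time::Instant`).

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Accumulates wall time per named step across many decode iterations.
#[derive(Default)]
pub struct StepTimer {
    totals: HashMap<&'static str, Duration>,
}

impl StepTimer {
    /// Adds one measurement to the running total for `step`.
    pub fn add(&mut self, step: &'static str, elapsed: Duration) {
        *self.totals.entry(step).or_default() += elapsed;
    }

    /// Returns (step, total, percent of all recorded time), largest first.
    pub fn breakdown(&self) -> Vec<(&'static str, Duration, f64)> {
        let all: f64 = self.totals.values().map(Duration::as_secs_f64).sum();
        let denom = all.max(f64::MIN_POSITIVE); // avoid dividing by zero
        let mut rows: Vec<_> = self
            .totals
            .iter()
            .map(|(&k, &d)| (k, d, 100.0 * d.as_secs_f64() / denom))
            .collect();
        rows.sort_by(|a, b| b.1.cmp(&a.1));
        rows
    }
}
```

Sorting by total time puts the dominant steps first, which is what made the matmul versus attention split visible at a glance.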
Phase 7.G — 4-accumulator Q8 N=1 decode kernel
Same dependency-chain fix from Phase 7.C, now applied to the Q8 kernel. 4 independent ymm accumulators replace a single ymm0 chain. Q8 matmuls landed 1.87–1.95× faster essentially across the board — bounded by memory bandwidth, not FMA throughput. Caught the Win64 ABI xmm6-xmm15 callee-saved trap (unit test passed; Qwen e2e produced all-NaN logits because RMSNorm's xmm6 state got clobbered).
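The dependency-chain idea can be shown in scalar Rust. This is an analogy rather than the kernel itself: the real code keeps four ymm accumulators, while the sketch keeps four f32 partial sums, and `dot_4acc` is a hypothetical name.

```rust
/// Dot product with four independent partial sums, so consecutive
/// multiply-adds do not wait on one another's result.
pub fn dot_4acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let mut ca = a.chunks_exact(4);
    let mut cb = b.chunks_exact(4);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        for lane in 0..4 {
            acc[lane] += x[lane] * y[lane];
        }
    }
    // Tail elements that did not fill a group of four.
    let tail: f32 = ca
        .remainder()
        .iter()
        .zip(cb.remainder())
        .map(|(x, y)| x * y)
        .sum();
    // Combine the four chains only once, at the end.
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

Splitting the sum reassociates floating-point additions, so results can differ in the last bits from a single-chain loop.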
Phase 7.H — vectorize SiLU and elementwise mul
std::arch::x86_64 only, no extra deps. SiLU uses an inlined polynomial expf:
- range-reduce: x = n*ln2 + r via FMA
- exp(r) ≈ degree-5 Horner polynomial
- 2^n packed into integer exponent bits, reinterpreted as f32
SiLU 87× faster (153ms → 1.76ms / 32 tokens). SiLU+mul went from 13.9% to 0.4% of decode.
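The three steps of the polynomial expf can be written out in scalar form. This is a sketch under assumptions: the coefficients are plain Taylor terms chosen for illustration (the release does not list the ones Lumen uses), the clamp bounds are mine, and `exp_poly` and `silu` are hypothetical names.

```rust
use std::f32::consts::{LN_2, LOG2_E};

/// Polynomial exp for f32: range reduction, degree-5 Horner, exponent packing.
/// Coefficients are Taylor terms (an assumption, not Lumen's actual constants).
pub fn exp_poly(x: f32) -> f32 {
    // Clamp so that n + 127 stays a valid normal-number exponent.
    let x = x.clamp(-87.0, 88.0);
    // Range-reduce: x = n*ln2 + r, with r computed by one fused multiply-add.
    let n = (x * LOG2_E).round();
    let r = n.mul_add(-LN_2, x);
    // exp(r) by a degree-5 polynomial in Horner form.
    let mut p: f32 = 1.0 / 120.0;
    p = p * r + 1.0 / 24.0;
    p = p * r + 1.0 / 6.0;
    p = p * r + 0.5;
    p = p * r + 1.0;
    p = p * r + 1.0;
    // 2^n: write n + 127 into the exponent field and reinterpret as f32.
    let two_n = f32::from_bits(((n as i32 + 127) as u32) << 23);
    two_n * p
}

/// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + exp_poly(-x))
}
```

After range reduction |r| is at most ln2/2, so the degree-5 truncation error stays at a few parts per million; a vectorized version would apply the same steps to 8 lanes at once.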
Phase 7.I — RoPE inv_freq + sin_cos hoist
The original inner loop recomputed base.powf(...) and theta.sin_cos() per (t, h, i). For Qwen2 decode that's 14× redundant on Q and 2× on K. Hoist them out: 3.7× faster on RoPE (43ms → 12ms).
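A sketch of the hoist: the sin/cos table depends only on the position and the pair index, so it can be built once per token and reused by every head. The half-split pairing (element i rotated with element i + head_dim/2) is an assumption about the layout, and `rope_sin_cos` and `apply_rope` are hypothetical names.

```rust
/// Builds the (sin, cos) table for one position: one entry per rotated pair.
/// Computed once per token instead of once per (token, head, pair).
pub fn rope_sin_cos(pos: usize, head_dim: usize, base: f32) -> Vec<(f32, f32)> {
    let half = head_dim / 2;
    (0..half)
        .map(|i| {
            let inv_freq = base.powf(-2.0 * i as f32 / head_dim as f32);
            (pos as f32 * inv_freq).sin_cos()
        })
        .collect()
}

/// Rotates every head in `heads` (laid out back to back) using the shared table.
/// Assumes half-split pairing: element i pairs with element i + head_dim/2.
pub fn apply_rope(heads: &mut [f32], head_dim: usize, table: &[(f32, f32)]) {
    let half = head_dim / 2;
    assert_eq!(table.len(), half);
    for head in heads.chunks_exact_mut(head_dim) {
        for i in 0..half {
            let (s, c) = table[i];
            let (a, b) = (head[i], head[i + half]);
            head[i] = a * c - b * s;
            head[i + half] = a * s + b * c;
        }
    }
}
```

With 14 query heads and 2 KV heads sharing one table, the powf and sin_cos calls drop by those factors, which matches the redundancy quoted above.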
Phase 7.J — multi-thread Q8 matmul with rayon
Splits the M (output row) dimension of large Q8 matmuls across rayon worker threads. A per-call work-based threshold (M × K_blocks ≥ 100K) avoids regressions on small matmuls that don't have enough compute per row to outrun rayon's dispatch overhead. Single-thread ~31 → 8-thread ~56 tok/s (1.81×).
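The threshold and the row split can be sketched without rayon. Two assumptions are made here: "100K" is read as 100,000, and rows are divided into contiguous near-equal ranges. `should_parallelize` and `split_rows` are hypothetical names, and the shapes in the usage note are illustrative only.

```rust
use std::ops::Range;

/// Work threshold from the text: parallelize only when M × K_blocks ≥ 100K.
/// (Assumption: 100K means 100_000.)
pub const MIN_PARALLEL_WORK: usize = 100_000;

pub fn should_parallelize(m: usize, k_blocks: usize) -> bool {
    m * k_blocks >= MIN_PARALLEL_WORK
}

/// Splits output rows 0..m into at most `parts` contiguous, near-equal ranges.
pub fn split_rows(m: usize, parts: usize) -> Vec<Range<usize>> {
    let parts = parts.clamp(1, m.max(1));
    let (base, extra) = (m / parts, m % parts);
    let mut start = 0;
    (0..parts)
        .map(|p| {
            // The first `extra` ranges take one additional row each.
            let len = base + usize::from(p < extra);
            let r = start..start + len;
            start += len;
            r
        })
        .collect()
}
```

For example, a 4864-row matmul with 28 K-blocks clears the threshold (136,192 units of work), while an 896-row matmul with 28 K-blocks (25,088) stays on one thread.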
v0.3 → v0.4 candidate moves
- Custom thread pool to replace rayon for tighter per-call overhead (likely closes the 1.81×→2.20× scaling gap).
- Q8-native int dot product (vpmaddubsw/vpdpbusd) instead of the dequant→fp32 FMA path. The biggest single-thread move left.
- Q8 token embeddings (545MB → 144MB; modest decode impact).
- Prefill batch path for long prompts.
Read more
- BENCHMARK.md — full breakdown
- docs/blog/phase7j-v0.3.0-profile-and-parallel-ko.md — v0.3.0 retrospective (KR), including the Win64 ABI gotcha
- docs/blog/phase7d-v0.2.0-q8-native-ko.md — v0.2.0 retrospective (KR)
- docs/blog/phase7-v0.1.0-end-to-end-ko.md — v0.1.0 end-to-end story (KR)
Lumen v0.2.0 — Q8-native decode, 4× faster than v0.1.0
Five days after v0.1.0, Lumen's single-thread Qwen2.5-0.5B decode runs 4.05× faster while still producing bit-identical tokens to the naive Rust reference.
tg32 (single thread, Windows AVX2)
| Path | tok/s | vs llama.cpp |
|---|---|---|
| Lumen naive Rust matmul | 2.91 | 14.2× slower |
| Lumen JIT v0.1.0 | 4.43 | 9.3× slower |
| Lumen JIT v0.1.0 + 1×N 4-acc decode (Phase 7.C) | 5.08 | 8.1× slower |
| Lumen JIT v0.2.0 (+ Q8-native fused matmul) | 17.97 | 2.30× slower |
| llama.cpp (ggml b9174 AVX2) | 41.32 | 1.0× |
What landed
Phase 7.C — 1×N 4-accumulator decode kernel (+15%)
The 1×8 AVX2 path used a single ymm accumulator, so each kk step issued one vfmadd231ps into a live dependency chain. A new dispatcher branch routes M=1, N % 32 == 0 shapes (every weight matmul in autoregressive decode) to a kernel that runs 4 independent FMA chains over 32 output columns. Decode matmul throughput went up, but the full forward pass rose only 15% (4.43 → 5.08 tok/s) because matmul wasn't the only bottleneck.
Phase 7.D — Q8-native fused matmul (+254%)
The real win. v0.1.0 dequantized Q8_0 weights to F32 at load time (~640MB → ~2.4GB in memory) and ran the matmul on the fp32 buffer. v0.2.0 keeps weights in their native Q8_0 ggml block layout end-to-end and feeds them into a new fused kernel (emit_quant_matmul_q8_n1_body) that:
- broadcasts the per-block fp16 scale d
- vectorizes K-direction over each 32-element block (8-wide × 4 unrolled)
- horizontally reduces ymm0 to a scalar with vhaddps + vextractf128 + vaddss
- writes one output row per pass
No load-time dequant pass. No per-call fp32 materialization. transpose_for_jit drops from 1.94s to 700ns because the F32 weights that needed transposing are now Q8 weights that don't.
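A scalar reference for what the fused kernel computes may make the data flow clearer. It is a sketch, not the emitted code: the scale is held as f32 (Q8_0 stores fp16), the scale is applied once per block rather than per element, and `BlockQ8`, `q8_row_dot` and `q8_matvec` are hypothetical names.

```rust
/// One Q8_0-style block: scale `d` plus 32 int8 weights.
/// (Assumption: f32 scale to stay std-only; the ggml layout uses fp16.)
pub struct BlockQ8 {
    pub d: f32,
    pub qs: [i8; 32],
}

/// One output element: dot of a quantized weight row with f32 activations,
/// dequantizing on the fly instead of materializing an f32 weight buffer.
pub fn q8_row_dot(row: &[BlockQ8], x: &[f32]) -> f32 {
    assert_eq!(x.len(), row.len() * 32);
    let mut acc = 0.0f32;
    for (blk, xs) in row.iter().zip(x.chunks_exact(32)) {
        let mut partial = 0.0f32;
        for (&q, &v) in blk.qs.iter().zip(xs) {
            partial += q as f32 * v;
        }
        acc += blk.d * partial; // apply the per-block scale once
    }
    acc
}

/// N=1 decode matmul: one pass per output row.
pub fn q8_matvec(rows: &[Vec<BlockQ8>], x: &[f32]) -> Vec<f32> {
    rows.iter().map(|r| q8_row_dot(r, x)).collect()
}
```

Because the weights are read in their stored layout, nothing has to be transposed or expanded ahead of time, which is the property the paragraph above credits for the load-time savings.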
End-to-end: 5.08 → 17.97 tok/s (3.54×).
Phase 7.E — release plumbing
Cargo workspace version bumped to 0.2.0. README perf table and roadmap refreshed. New retrospective blog post (KR).
Correctness
qwen_generate_jit_matches_naive_and_speed continues to pass — the JIT path produces the same token sequence as the pure Rust reference on the same prompt ("안녕" → "안녕하세요, 저는").
What's next (toward v0.3)
- Flash-style attention (matters most on long context)
- Multi-thread prefill (near-linear on physical cores)
- AVX-512 dispatch on supporting CPUs
- Q8 token embeddings (the last big fp32 buffer in the model)
Read more
- BENCHMARK.md — full breakdown
- docs/blog/phase7d-v0.2.0-q8-native-ko.md — retrospective (KR)
- docs/blog/phase7-v0.1.0-end-to-end-ko.md — v0.1.0 end-to-end story (KR)
Lumen v0.1.0 — end-to-end LLM inference compiler in Rust
First public milestone of Lumen — a from-scratch LLM inference compiler + runtime in Rust.
What's in this release
- Self-built x86_64 instruction encoder (REX/VEX, ModR/M, SIB, AVX2+F16C+FMA)
- Register-tile (4×8) matmul: 57 GFLOPS / 19× scalar in standalone bench
- IR-level auto-synthesis of dequant×matmul fused kernels
- GGUF v3 reader + GPT-2 byte-level BPE tokenizer
- Full Qwen2.5-0.5B (24 layers, GQA 14Q/2KV, RoPE, SiLU FFN, KV cache) forward pass
- Shape-keyed JIT cache (MatmulJitCache) feeding every weight matmul
- Bit-identical tokens between naive Rust path and JIT path
Production runtime dependencies: thiserror — that's the entire list.
Benchmark (Qwen2.5-0.5B-Instruct-Q8_0, single thread, Windows AVX2)
| Path | tg32 tok/s | vs llama.cpp |
|---|---|---|
| Lumen naive Rust matmul | 2.91 | 14.2× slower |
| Lumen JIT (4×8 tile / 1×8 AVX2) | 4.43 | 9.3× slower |
| llama.cpp (ggml, b9174 AVX2) | 41.32 | 1.0× |
Honest gap breakdown and next-10× roadmap in BENCHMARK.md.
What's next (toward v0.2)
- Cache-blocked matmul (outer block loop around the 4×8 tile)
- Q8×F32 fused matmul on the model forward path (no dequant pass)
- Flash-style attention
- Multi-thread prefill
Read the story
- End-to-end retrospective (KR): docs/blog/phase7-v0.1.0-end-to-end-ko.md
- Architecture: docs/ARCHITECTURE.md
- Plan: PLAN.md