Releases: redchupa/lumen

v0.5.0 — Measure-driven cycle (gap 1.376× → 1.304×)

20 May 21:57

Headline

Metric | v0.4.0 | v0.5.0 | Change
tg32 @ 1 thread (Lumen vs ggml) | 41.85 vs 40.40 | 45.64 vs 40.40 | Lumen +13%
tg32 @ 8 threads (Lumen vs ggml) | 65.5 vs 90.9 | 67.4 vs 87.9 | gap 1.376× → 1.304×
pp32 (LUMEN_PREFILL=1, opt-in) | n/a | 54.15 tok/s | new path

Tokens remain bit-identical to v0.4.0. Same Qwen2.5-0.5B-Q8_0 model, same AMD Zen 4 host.

What shipped

  • Phase 8.A — ThreadPool rewrite (mpsc + per-task allocation + mutex receiver → atomic next_task counter + generation-based wake). LUMEN_THREADS=N env var added for scaling experiments.
  • Phase 8.E.1 — Q8 prefill via N=1 fan-out dispatcher. Sidesteps the slow N>1 codegen body (2.3-2.6× slower than N=1×N at every Qwen projection shape, confirmed by Phase 8.D.5 micro-bench). LUMEN_PREFILL=1 opt-in.
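
The atomic next_task idea from Phase 8.A can be sketched roughly as below. This is a minimal illustration built on std::thread::scope, not Lumen's pool: the `run_tasks` helper and its signature are hypothetical, and the generation-based wake (persistent workers that sleep between batches) is left out.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical helper: run `task(i)` for every i in 0..n_tasks on `n_threads` workers.
/// Workers claim indices from one shared atomic counter, so the hot path has
/// no channel, no per-task allocation, and no mutex.
pub fn run_tasks<F>(n_threads: usize, n_tasks: usize, task: F)
where
    F: Fn(usize) + Sync,
{
    let next_task = AtomicUsize::new(0);
    thread::scope(|s| {
        for _ in 0..n_threads.max(1) {
            s.spawn(|| loop {
                // fetch_add hands each index to exactly one worker.
                let i = next_task.fetch_add(1, Ordering::Relaxed);
                if i >= n_tasks {
                    break;
                }
                task(i);
            });
        }
    });
}
```

A real pool would keep its workers alive across calls; spawning per call, as here, only keeps the sketch short.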

What didn't ship but stays in tree (default-off)

These are the measure-driven negative results from this cycle. Infrastructure preserved; activation waits for either a better host (ZMM on native 512-bit silicon) or the kernel rework (Phase 8.E.2).

  • AVX-512 ZMM kernels (Phase 7.R/S/T). Zen 4 implements AVX-512 as double-pumped 256-bit, so ZMM ops don't deliver a lane-width win on this host.
  • chunk_rows L2-fit cap (8.B), software prefetch encoders (8.C), forward_layer_prefill_jit + Model::forward_prefill_jit (8.D.1-3). All measured net-negative on the current N>1 codegen.

11-cycle pattern (v0.5 measure-driven decisions)

Phase | Hypothesis | Result
7.M | Q8 native int dot | net-neutral
7.N | VNNI single-acc | -3.6%
7.O | VNNI 4-acc | -2.7% (revived in 7.P)
7.R/S/T | AVX-512 ZMM lane width | -4.5%
8.A | atomic ThreadPool | +9% 1t, +3% 8t ✓
8.B | chunk L2-fit cap | -3.5% (reverted)
8.C | software prefetch | 8t -49% (reverted)
8.D.3 | prefill batching | -2.9× pp32
8.D.3 retro | attention is the cause | 1% only (false)
8.D.5 | N>1 codegen is slow | confirmed ✓
8.E.1 | N=1 fan-out fixes plumbing | +2.39× pp32 ✓

Net: 1 merged, 1 diagnostic, 1 partial fix vs 7 reverts. Gap closed 1.376× → 1.304×.

What's next (see ROADMAP.md)

  1. Phase 8.E.2 — Q8 N>1 codegen rewrite (port the 7.G 4-accumulator YMM pattern). Expected to close most of the 8t gap.
  2. ARM64 backend — Apple Silicon / Graviton / Raspberry Pi support.
  3. Q4 native matmul — enables Qwen2.5-1.5B and 3B at the same memory footprint.

Retrospective blog series

16 posts across the v0.1-v0.5 build, indexed at docs/blog/INDEX.md. The v0.5-cycle retros:

Lumen v0.4.0 — shape-aware VNNI/fp32 hybrid dispatch

18 May 00:13

v0.4.0 ships the result of four measurement-led phases (7.M → 7.P): three attempts to push int-dot product wins through, each disproved by bench, and a final hybrid dispatch that combines them all into the actual win. Tokens stay bit-identical to the naive Rust reference across every commit.

tg32 (Qwen2.5-0.5B-Q8_0, Windows, 8 worker threads on Zen 4)

Path | tok/s | vs ggml -t 8
Lumen v0.3.0 (Q8×F32 4-acc, custom pool) | ~60 | 1.51× slower
Lumen v0.4.0 (per-shape VNNI/fp32 hybrid) | ~65 | 1.39× slower
ggml -t 8 | 90.90 | 1.0×

Sanity: Lumen 8t (65) is 1.58× faster than ggml -t 1 (41.32).

v0.1.0 → v0.4.0 spread: 4.43 → 65.3 tok/s = 14.7× decode in ~5 days.

What landed (and why the path zig-zagged)

Phase 7.M — Q8×Q8 AVX2 int dot

Built the full Q8×Q8 fused matmul infrastructure: 7 new AVX2 integer encoders (vpsignb, vpmaddubsw, vpmaddwd, vmovdqu, vpaddd, vpcmpeqd, vpsrlw_imm8), the IR pattern (Param Q8 + Param Q8 + 2× Dequantize + MatMul), MatmulJitCache::get_or_compile_q8q8, model.rs activation quantization pipeline. Measured net-neutral on Zen 4 — gate_up/lm_head/qkv/wo all improved, down_matmul regressed +27%. Default-off, infrastructure kept.
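
As a scalar illustration of what the Q8×Q8 path computes (activation quantization followed by an integer dot product), here is a minimal sketch. The `QuantBlock` type and both function names are hypothetical, the scale is kept as f32 for simplicity, and the symmetric amax/127 scaling is an assumption about the quantizer rather than something this release states.

```rust
/// One quantized block: 32 signed bytes plus a scale.
/// Assumption: f32 scale for simplicity; ggml's Q8_0 stores the scale as fp16.
pub struct QuantBlock {
    pub d: f32,
    pub qs: [i8; 32],
}

/// Quantize 32 activations to int8 with a symmetric per-block scale.
pub fn quantize_block(x: &[f32; 32]) -> QuantBlock {
    let amax = x.iter().fold(0.0f32, |m, v| m.max(v.abs()));
    let d = amax / 127.0;
    let inv = if d > 0.0 { 1.0 / d } else { 0.0 };
    let mut qs = [0i8; 32];
    for (q, v) in qs.iter_mut().zip(x.iter()) {
        *q = (v * inv).round() as i8;
    }
    QuantBlock { d, qs }
}

/// Integer dot product of two blocks, scaled back to f32 once per block.
pub fn dot_q8q8(w: &QuantBlock, a: &QuantBlock) -> f32 {
    let mut acc: i32 = 0;
    for i in 0..32 {
        acc += w.qs[i] as i32 * a.qs[i] as i32;
    }
    acc as f32 * w.d * a.d
}
```

The inner loop stays in integers and the two scales are applied once per block, which is the work the AVX2 integer encoders listed above would vectorize.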

Phase 7.N — VNNI vpdpbusd encoders + 3-way dispatch

Added VEX-256 (AVX-VNNI, client CPUs) and EVEX-256 (AVX-512 VNNI, servers) forms of vpdpbusd plus runtime CPU detection. The EVEX prefix is encoded by hand (4 bytes: 0x62 + P0/P1/P2). The Q8×Q8 unit test confirmed that the EVEX form runs on Zen 4 even though the avxvnni CPUID bit is clear. Measured -3.6% on Zen 4 with the single-accumulator design. Default-off, infrastructure kept.

Phase 7.O — 4-accumulator VNNI kernel

Applied Phase 7.G's 4-acc trick to VNNI: unroll the kb loop by 4 and feed each block into its own fp32 sub-accumulator. Win64-safe via ymm6 save/restore on the stack. Still measured -2.7% on Zen 4: gate_up/qkv/wo/lm_head won as predicted (-9% to -28%), but down_matmul still regressed +27%, which canceled those gains. Default-off, infrastructure kept.

Phase 7.P — per-shape hybrid dispatch (the real win)

Three phases of data made the answer obvious: VNNI wins on short-K matmuls (K_blocks=28: -9% to -28%) and loses on long-K (K_blocks=152: +27%) because of CPU OoO window saturation in the longer inner loop. Solution: dispatch per matmul shape.

// K_blocks = d_in / 32 (32 elements per Q8_0 block); VNNI only for short-K shapes.
fn use_vnni_for_matmul(d_in: usize) -> bool {
    has_vnni() && (d_in / 32) <= 64
}

Routes Qwen's 5 short-K matmuls (qkv/wo/gate/up/lm_head) to VNNI and FFN-down to fp32 4-acc. +4.5% bench (62.5 → 65.3 tok/s), -13.3% per-step profile. Default-on.
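
To make the threshold concrete: with 32-element blocks, K_blocks=28 corresponds to d_in = 896 and K_blocks=152 to d_in = 4864. The sketch below is a variant of the rule above that takes the CPU capability as a parameter so it can run on any host; `pick_kernel`, the `Kernel` enum, and the constant names are hypothetical.

```rust
/// Elements per Q8_0 block.
const BLOCK: usize = 32;
/// Short-K cutoff in blocks, taken from the dispatch rule above.
const MAX_VNNI_K_BLOCKS: usize = 64;

#[derive(Debug, PartialEq)]
pub enum Kernel {
    Vnni,
    Fp32FourAcc,
}

/// Same decision as `use_vnni_for_matmul`, with the capability passed in explicitly.
pub fn pick_kernel(d_in: usize, has_vnni: bool) -> Kernel {
    if has_vnni && d_in / BLOCK <= MAX_VNNI_K_BLOCKS {
        Kernel::Vnni
    } else {
        Kernel::Fp32FourAcc
    }
}
```

With VNNI available, `pick_kernel(896, true)` selects the VNNI kernel and `pick_kernel(4864, true)` falls back to fp32 4-acc.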

v0.4.0 architectural promise

Every piece of code from 7.M, 7.N, 7.O is now on the production hot path via 7.P — no dead code. The 3-way CPU dispatch (avx512vnni > avxvnni > AVX2 fallback) and 3-way codegen dispatch (4-acc / single-acc / AVX2 chain) all light up depending on host and shape. External runtime deps still just thiserror.

v0.4 → v0.5

Single clear next move: AVX-512 ZMM fp32 + ZMM VNNI. The remaining 1.39× gap to ggml is almost entirely lane width — ggml uses ZMM (512-bit), we use YMM (256-bit). Extending the EVEX encoder family to ZMM ops and writing new ZMM kernels is the v0.5 work.

Lumen v0.3.0 — profile, parallelize, 4× faster than v0.2.0

17 May 06:36

A day after v0.2.0, Lumen's Qwen2.5-0.5B decode runs 3.1× faster when allowed to use multiple threads, and +72% even on a single thread. Tokens remain bit-identical to the naive Rust reference.

tg32 (Qwen2.5-0.5B-Q8_0, Windows AVX2)

Single thread

Path | tok/s | vs ggml
Lumen v0.2.0 | 17.97 | 2.30× slower
Lumen v0.3.0 | ~31 | 1.32× slower
ggml -t 1 | 41.32 | 1.0×

8 threads (default on the test machine)

Path | tok/s | vs ggml -t 8
Lumen v0.3.0 | ~56 | 1.62× slower
ggml -t 8 | 90.90 | 1.0×

Sanity check: Lumen 8t (56) is 1.36× faster than ggml 1t (41.32). Multi-thread scaling: Lumen 1.81×, ggml 2.20× — narrowing that gap is v0.4.

What landed

Phase 7.E.0 — profile decode time distribution

StepTimer accumulator + generate_greedy_jit_timed + an #[ignore] test that runs 32 decode tokens and prints a sorted breakdown. Showed:

layer/gate_up_matmul  692ms  36.1%
      lm_head_matmul  446ms  23.2%
   layer/down_matmul  414ms  21.6%
      layer/silu_mul  155ms   8.1%
    layer/qkv_matmul   83ms   4.3%
     layer/wo_matmul   65ms   3.4%
          layer/rope   43ms   2.2%
     layer/attention   17ms   0.9%   ← killed the planned flash-attn phase
       (rest)              <1%

88.6% matmul, 0.9% attention compute. The originally planned Phase 7.E (flash-style attention) would have saved at most fractions of a percent — deprioritized on the spot.
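
A per-step accumulator of this kind can be quite small. The following is a sketch in the spirit of StepTimer, not its real API: the struct layout and the `add` and `breakdown` methods are assumptions.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Hypothetical accumulator: total elapsed time per named step.
#[derive(Default)]
pub struct StepTimer {
    totals: HashMap<&'static str, Duration>,
}

impl StepTimer {
    /// Add elapsed time to a named step.
    pub fn add(&mut self, step: &'static str, elapsed: Duration) {
        *self.totals.entry(step).or_default() += elapsed;
    }

    /// Steps sorted by total time, largest first, with their share in percent.
    pub fn breakdown(&self) -> Vec<(&'static str, Duration, f64)> {
        let total: f64 = self.totals.values().map(Duration::as_secs_f64).sum();
        let mut rows: Vec<_> = self
            .totals
            .iter()
            .map(|(name, d)| {
                let pct = if total > 0.0 {
                    100.0 * d.as_secs_f64() / total
                } else {
                    0.0
                };
                (*name, *d, pct)
            })
            .collect();
        rows.sort_by(|a, b| b.1.cmp(&a.1));
        rows
    }
}
```

In use, each step would be wrapped with `let t0 = std::time::Instant::now();` before and `timer.add("layer/rope", t0.elapsed());` after, and the breakdown printed once the decode loop finishes.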

Phase 7.G — 4-accumulator Q8 N=1 decode kernel

Same dependency-chain fix from Phase 7.C, now applied to the Q8 kernel. 4 independent ymm accumulators replace a single ymm0 chain. Q8 matmuls landed 1.87–1.95× faster essentially across the board — bounded by memory bandwidth, not FMA throughput. Caught the Win64 ABI xmm6-xmm15 callee-saved trap (unit test passed; Qwen e2e produced all-NaN logits because RMSNorm's xmm6 state got clobbered).
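
The dependency-chain point can be shown in scalar Rust. This is a sketch of the general technique only (four independent partial sums instead of one), with a hypothetical `dot_4acc` function; the actual kernel is JIT-emitted AVX2 operating on Q8 blocks.

```rust
/// Dot product with four independent partial sums.
/// Each accumulator forms its own dependency chain, so consecutive
/// multiply-adds do not wait on one another; the sums are combined at the end.
pub fn dot_4acc(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let mut ca = a.chunks_exact(4);
    let mut cb = b.chunks_exact(4);
    for (x, y) in ca.by_ref().zip(cb.by_ref()) {
        for lane in 0..4 {
            acc[lane] += x[lane] * y[lane];
        }
    }
    // Tail elements that do not fill a group of four.
    let mut tail = 0.0f32;
    for (x, y) in ca.remainder().iter().zip(cb.remainder()) {
        tail += x * y;
    }
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

Splitting the sum changes the order of floating-point additions, so the low-order bits can differ from a single-chain loop.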

Phase 7.H — vectorize SiLU and elementwise mul

std::arch::x86_64 only, no extra deps. SiLU uses an inlined polynomial expf:

  • range-reduce: x = n*ln2 + r via FMA
  • exp(r) ≈ degree-5 Horner polynomial
  • 2^n packed into integer exponent bits, reinterpreted as f32

SiLU 87× faster (153ms → 1.76ms / 32 tokens). SiLU+mul went from 13.9% to 0.4% of decode.
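
The three steps can be written out in scalar form as follows. This is a sketch under assumptions: it uses plain Taylor coefficients (the release does not say which polynomial Lumen uses), and the `exp_approx` and `silu` names and the clamp bounds are illustrative. The real path is vectorized with std::arch::x86_64.

```rust
use std::f32::consts::{LN_2, LOG2_E};

/// Scalar sketch: range reduction, degree-5 polynomial, exponent-bit packing.
/// Assumption: Taylor coefficients; a tuned minimax polynomial would be more accurate.
pub fn exp_approx(x: f32) -> f32 {
    // Clamp so that n + 127 stays a valid normal f32 exponent.
    let x = x.clamp(-87.0, 88.0);
    // x = n*ln2 + r with |r| <= ln2/2; the FMA computes r = x - n*ln2.
    let n = (x * LOG2_E).round();
    let r = (-n).mul_add(LN_2, x);
    // exp(r) via a degree-5 polynomial in Horner form.
    let mut p: f32 = 1.0 / 120.0;
    p = p * r + 1.0 / 24.0;
    p = p * r + 1.0 / 6.0;
    p = p * r + 0.5;
    p = p * r + 1.0;
    p = p * r + 1.0;
    // 2^n: write the biased exponent into the f32 exponent field and reinterpret.
    let two_n = f32::from_bits(((n as i32 + 127) as u32) << 23);
    p * two_n
}

/// SiLU(x) = x * sigmoid(x) = x / (1 + exp(-x)).
pub fn silu(x: f32) -> f32 {
    x / (1.0 + exp_approx(-x))
}
```

Because |r| stays below ln2/2, the degree-5 truncation error is on the order of 1e-6 relative, which is why so short a polynomial is enough after range reduction.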

Phase 7.I — RoPE inv_freq + sin_cos hoist

The original inner loop recomputed base.powf(...) and theta.sin_cos() per (t, h, i). For Qwen2 decode that's 14× redundant on Q and 2× on K. Hoist them out: 3.7× faster on RoPE (43ms → 12ms).
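
The hoist can be illustrated like this: compute the sin/cos table once per position and reuse it for every head. The function name, the flat `[head][dim]` layout, and the (i, i + head_dim/2) pairing are assumptions made for the sketch; Lumen's layout may differ.

```rust
/// Rotate `n_heads` heads of size `head_dim` in place for one position `pos`.
/// Assumptions: rotation pairs are (i, i + head_dim/2) and `base` is the RoPE base.
pub fn rope_hoisted(x: &mut [f32], n_heads: usize, head_dim: usize, pos: usize, base: f32) {
    let half = head_dim / 2;
    // Computed once per position instead of once per (head, i).
    let sin_cos: Vec<(f32, f32)> = (0..half)
        .map(|i| {
            let inv_freq = base.powf(-2.0 * i as f32 / head_dim as f32);
            (pos as f32 * inv_freq).sin_cos()
        })
        .collect();
    for head in x.chunks_exact_mut(head_dim).take(n_heads) {
        for (i, &(s, c)) in sin_cos.iter().enumerate() {
            let (a, b) = (head[i], head[i + half]);
            head[i] = a * c - b * s;
            head[i + half] = a * s + b * c;
        }
    }
}
```

With 14 query heads sharing one table, powf and sin_cos run once where the unhoisted loop ran them 14 times, which matches the redundancy quoted above.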

Phase 7.J — multi-thread Q8 matmul with rayon

Splits the M (output row) dimension of large Q8 matmuls across rayon worker threads. A per-call work-based threshold (M × K_blocks ≥ 100K) avoids regressions on small matmuls that don't have enough compute per row to outrun rayon's dispatch overhead. Single-thread ~31 → 8-thread ~56 tok/s (1.81×).
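
The threshold and the row split can be sketched without rayon. Both function names are hypothetical, and reading "100K" as exactly 100,000 is an assumption.

```rust
use std::ops::Range;

/// Work cutoff from the text: M × K_blocks ≥ 100K (assumed to mean 100_000).
const PARALLEL_MIN_WORK: usize = 100_000;

/// True when a matmul has enough work to be worth splitting across threads.
pub fn should_parallelize(m: usize, k_blocks: usize) -> bool {
    m * k_blocks >= PARALLEL_MIN_WORK
}

/// Split `m` output rows into at most `n_threads` contiguous ranges.
pub fn row_ranges(m: usize, n_threads: usize) -> Vec<Range<usize>> {
    let chunk = m.div_ceil(n_threads.max(1)).max(1);
    (0..m)
        .step_by(chunk)
        .map(|start| start..(start + chunk).min(m))
        .collect()
}
```

Each range would go to one worker; matmuls below the cutoff stay on the calling thread.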

v0.3 → v0.4 candidate moves

  1. Custom thread pool to replace rayon for tighter per-call overhead (likely closes the 1.81×→2.20× scaling gap).
  2. Q8-native int dot product (vpmaddubsw/vpdpbusd) instead of the dequant→fp32 FMA path. The biggest single-thread move left.
  3. Q8 token embeddings (545MB → 144MB; modest decode impact).
  4. Prefill batch path for long prompts.

Lumen v0.2.0 — Q8-native decode, 4× faster than v0.1.0

16 May 22:49

Five days after v0.1.0, Lumen's single-thread Qwen2.5-0.5B decode runs 4.05× faster while still producing bit-identical tokens to the naive Rust reference.

tg32 (single thread, Windows AVX2)

Path | tok/s | vs llama.cpp
Lumen naive Rust matmul | 2.91 | 14.2× slower
Lumen JIT v0.1.0 | 4.43 | 9.3× slower
Lumen JIT v0.1.0 + 1×N 4-acc decode (Phase 7.C) | 5.08 | 8.1× slower
Lumen JIT v0.2.0 (+ Q8-native fused matmul) | 17.97 | 2.30× slower
llama.cpp (ggml b9174 AVX2) | 41.32 | 1.0×

What landed

Phase 7.C — 1×N 4-accumulator decode kernel (+15%)

The 1×8 AVX2 path used a single ymm accumulator, so each kk step issued one vfmadd231ps into a live dependency chain. A new dispatcher branch routes M=1, N % 32 == 0 shapes (every weight matmul in autoregressive decode) to a kernel that runs 4 independent FMA chains over 32 output columns. Decode matmul throughput rose, but the full forward pass gained only 15% (4.43 → 5.08 tok/s) because matmul wasn't the only bottleneck.

Phase 7.D — Q8-native fused matmul (+254%)

The real win. v0.1.0 dequantized Q8_0 weights to F32 at load time (~640MB → ~2.4GB in memory) and ran the matmul on the fp32 buffer. v0.2.0 keeps weights in their native Q8_0 ggml block layout end-to-end and feeds them into a new fused kernel (emit_quant_matmul_q8_n1_body) that:

  • broadcasts the per-block fp16 scale d
  • vectorizes K-direction over each 32-element block (8-wide × 4 unrolled)
  • horizontally reduces ymm0 to a scalar with vhaddps + vextractf128 + vaddss
  • writes one output row per pass

No load-time dequant pass. No per-call fp32 materialization. transpose_for_jit drops from 1.94s to 700ns because the F32 weights that needed transposing are now Q8 weights that don't.

End-to-end: 5.08 → 17.97 tok/s (3.54×, i.e. +254%).
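
As a scalar reference for what the fused kernel computes, here is a minimal sketch of a matrix-vector product that reads Q8 blocks directly. It is not emit_quant_matmul_q8_n1_body: `BlockQ8` and `matvec_q8` are hypothetical names, the scale is held as f32 (ggml's Q8_0 uses fp16), and a row-major block order is assumed.

```rust
/// Assumption: f32 scale for simplicity; ggml's Q8_0 block stores an fp16 scale.
pub struct BlockQ8 {
    pub d: f32,
    pub qs: [i8; 32],
}

/// y[row] = sum over blocks of d * sum_i(qs[i] * x[32*block + i]).
/// Weights are read in quantized form; no fp32 copy of the matrix is built.
pub fn matvec_q8(w: &[BlockQ8], rows: usize, x: &[f32]) -> Vec<f32> {
    assert_eq!(x.len() % 32, 0);
    let k_blocks = x.len() / 32;
    assert_eq!(w.len(), rows * k_blocks);
    let mut y = vec![0.0f32; rows];
    for (row, out) in y.iter_mut().enumerate() {
        let mut acc = 0.0f32;
        for kb in 0..k_blocks {
            let blk = &w[row * k_blocks + kb];
            let xs = &x[kb * 32..(kb + 1) * 32];
            let mut partial = 0.0f32;
            for (q, v) in blk.qs.iter().zip(xs) {
                partial += *q as f32 * v;
            }
            // One scale multiply per block, as in the broadcast-scale step above.
            acc += blk.d * partial;
        }
        *out = acc;
    }
    y
}
```

The int8-to-f32 conversion happens inside the inner loop, so the only fp32 data in memory are the activations and the output.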

Phase 7.E — release plumbing

Cargo workspace version bumped to 0.2.0. README perf table and roadmap refreshed. New retrospective blog post (KR).

Correctness

qwen_generate_jit_matches_naive_and_speed continues to pass — the JIT path produces the same token sequence as the pure Rust reference on the same prompt ("안녕" → "안녕하세요, 저는").

What's next (toward v0.3)

  • Flash-style attention (matters most on long context)
  • Multi-thread prefill (near-linear on physical cores)
  • AVX-512 dispatch on supporting CPUs
  • Q8 token embeddings (the last big fp32 buffer in the model)

Lumen v0.1.0 — end-to-end LLM inference compiler in Rust

16 May 05:36

First public milestone of Lumen — a from-scratch LLM inference compiler + runtime in Rust.

What's in this release

  • Self-built x86_64 instruction encoder (REX/VEX, ModR/M, SIB, AVX2+F16C+FMA)
  • Register-tile (4×8) matmul: 57 GFLOPS / 19× scalar in standalone bench
  • IR-level auto-synthesis of dequant×matmul fused kernels
  • GGUF v3 reader + GPT-2 byte-level BPE tokenizer
  • Full Qwen2.5-0.5B (24 layers, GQA 14Q/2KV, RoPE, SiLU FFN, KV cache) forward pass
  • Shape-keyed JIT cache (MatmulJitCache) feeding every weight matmul
  • Bit-identical tokens between naive Rust path and JIT path

Production runtime dependencies: thiserror — that's the entire list.

Benchmark (Qwen2.5-0.5B-Instruct-Q8_0, single thread, Windows AVX2)

Path | tg32 tok/s | vs llama.cpp
Lumen naive Rust matmul | 2.91 | 14.2× slower
Lumen JIT (4×8 tile / 1×8 AVX2) | 4.43 | 9.3× slower
llama.cpp (ggml, b9174 AVX2) | 41.32 | 1.0×

Honest gap breakdown and next-10× roadmap in BENCHMARK.md.

What's next (toward v0.2)

  • Cache-blocked matmul (outer block loop around the 4×8 tile)
  • Q8×F32 fused matmul on the model forward path (no dequant pass)
  • Flash-style attention
  • Multi-thread prefill
