Compile your LLMs, don't interpret them.
ForgeLLM is a Rust-native ahead-of-time (AOT) ML compiler for language models (1M-7B parameters). It compiles GGUF models into optimized, self-contained binaries with native Metal GPU acceleration — no runtime interpreter, no Python dependencies, no dynamic dispatch.
Faster than llama.cpp on Apple Silicon — Metal and CPU.
Documentation | Crates.io | forgellm.dev | Blog: How we beat llama.cpp · MMA flash + fp16 KV + dispatch fusion · CPU batched prefill (9.5× on long prompts)
Benchmarks on Apple M5 Pro, 8-bit quantization, 64-token generation.
| Model | ForgeLLM Metal | MLX (8-bit) | llama.cpp (Q8_0) | vs MLX | vs llama.cpp |
|---|---|---|---|---|---|
| SmolLM2-135M | 503 tok/s | 414 tok/s | 481 tok/s | 1.21x | 1.05x |
| SmolLM2-360M | 289 tok/s | 264 tok/s | 267 tok/s | 1.09x | 1.08x |
| Llama-3.2-1B | 178 tok/s | 111 tok/s | 130 tok/s | 1.60x | 1.37x |
| Llama-3.2-3B | 70.4 tok/s | 42.2 tok/s | 67.8 tok/s | 1.67x | 1.04x |
v0.7.3 vectorized the decode-path attention kernel with half4 loads. v0.7.4 restructures the V-weighted sum so that each simdgroup (32 lanes) cooperatively reduces over seq_len for one d4 chunk, fixing a severe underutilization of parallelism at head_dim=64 (the old layout kept only 16 of 256 threads productive; the new one keeps all 256 productive). Short-context decode is matmul-bound and unchanged; long-context decode is attention-bound and speeds up sharply.
| Model | Context | v0.7.2 (tok/s) | v0.7.3 (tok/s) | v0.7.4 (tok/s) | v0.7.2 → v0.7.4 |
|---|---|---|---|---|---|
| SmolLM2-135M | ~900 tok | 174 | 202 | 296 | +70% |
| SmolLM2-135M | ~2250 tok | 84 | 99 | 165 | +96% |
| Llama-3.2-1B | ~2250 tok | 87 | 96 | 124 | +42% |
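To make the v0.7.4 layout change concrete, here is an illustrative Rust sketch of the thread-utilization arithmetic described above. The constants mirror the text; this is not the actual Metal kernel.

```rust
// Illustrative only: counts productive threads under the old (v0.7.3) and
// new (v0.7.4) work layouts described above; not ForgeLLM's Metal kernel.
const THREADGROUP: usize = 256; // threads per threadgroup (from the text)
const SIMD_WIDTH: usize = 32;   // lanes per simdgroup

fn productive_threads(head_dim: usize) -> (usize, usize) {
    let d4 = head_dim / 4; // number of half4 chunks; 16 for head_dim = 64
    // old layout: one thread owns one d4 chunk, the rest of the threadgroup idles
    let old = d4.min(THREADGROUP);
    // new layout: one simdgroup cooperatively reduces over seq_len for one d4
    // chunk, so chunks are striped across all simdgroups in the threadgroup
    let simdgroups = THREADGROUP / SIMD_WIDTH; // 8
    let new = d4.min(simdgroups) * SIMD_WIDTH;
    (old, new)
}

fn main() {
    let (old, new) = productive_threads(64);
    println!("head_dim=64: {old}/256 -> {new}/256 threads productive");
}
```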
Prefill throughput (tok/s) on Apple M5 Pro, Q8_0. The numbers below use the v0.7.0 default attention path: MMA-accelerated flash attention (hardware `simdgroup_matrix<half, 8, 8>` for Q·K^T and P·V) activates automatically when HEAD_DIM ≤ 128 and num_tokens ≥ 8. Set `FORGE_MMA_ATTN=0` to force the legacy kernel.
| Model | ForgeLLM Metal | Legacy (opt-out) | MLX (8-bit) | llama.cpp (Q8_0) |
|---|---|---|---|---|
| SmolLM2-135M (~100 tok) | ~3,100 | ~3,100 | 1,507 | 2,812 |
| SmolLM2-135M (~1250 tok) | ~6,000 | ~6,000 | — | — |
| Llama-3.2-1B (~801 tok) | 2,282 | 1,950 | — | — |
| Llama-3.2-1B (~1501 tok) | 2,081 | 1,531 | — | — |
| Llama-3.2-1B (~2501 tok) | 1,775 | 1,134 | — | — |
| Llama-3.2-1B (~3001 tok) | 1,699 | 995 | — | — |
| Qwen2.5-0.5B (~666 tok) | 4,515 | 3,730 | — | — |
| Qwen2.5-0.5B (~2501 tok) | 3,540 | 2,408 | — | — |
| Phi-3-mini (~1201 tok) | 607 | 465 | — | — |
| Phi-3-mini (~3001 tok) | 480 | 278 | — | — |
Gains grow with prompt length because attention is O(M²) and the MMA path replaces scalar simdgroup reductions with hardware 8×8×8 matrix multiplies. Short-prompt numbers are within noise because matmul dominates there (MMA-flash and legacy share the same Q/K/V/O projection kernels).
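The activation rule stated above is simple enough to show as a sketch. This is illustrative only; the function name is hypothetical and not ForgeLLM's actual dispatch code.

```rust
// Sketch of the MMA-flash dispatch rule described above.
fn use_mma_flash_attention(head_dim: usize, num_tokens: usize) -> bool {
    // FORGE_MMA_ATTN=0 forces the legacy kernel
    let opted_out = std::env::var("FORGE_MMA_ATTN").map(|v| v == "0").unwrap_or(false);
    // 8x8 simdgroup_matrix tiles need head_dim <= 128 and at least 8 tokens
    !opted_out && head_dim <= 128 && num_tokens >= 8
}
```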
CPU prefill on Apple M5 Pro, Llama-3.2-1B-Instruct. v0.8.0 introduced a batched CPU prefill path — QKV/O/gate/up/down projections now use `matmul_mat_q8_0_KxN` / `matmul_mat_q4_0_KxN` kernels that load each weight matrix from RAM once per forward pass instead of M times. v0.8.1 added Q-tiled flash attention (`attention_flash_batch`) that amortizes K/V scans across `Q_TILE=16` queries per block. v0.8.2 extended both to Q4_0.
| Prompt (tokens) | Q8_0 per-token (v0.7.x) | Q8_0 batched (v0.8.1) | Q4_0 per-token | Q4_0 batched (v0.8.2) |
|---|---|---|---|---|
| 352 | 40 tok/s | 191 tok/s (4.7×) | 44 tok/s | 304 tok/s (6.9×) |
| 902 | 31 tok/s | 234 tok/s (7.6×) | 34 tok/s | 266 tok/s (7.9×) |
| 1603 | 24 tok/s | 207 tok/s (8.5×) | 26 tok/s | 238 tok/s (9.2×) |
| 2502 | 19 tok/s | 180 tok/s (9.5×) | — | 195 tok/s |
Long-context CPU prefill is now in the 180–300 tok/s range on Apple Silicon — competitive with llama.cpp's CPU path, from pure Rust with no external BLAS dependency. The batched path dispatches automatically when prompt ≥ PREFILL_BATCH_THRESHOLD=8; short prompts keep the stack-based per-token path for lower fixed cost.
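As a rough illustration of why the batched path wins, here is a minimal scalar f32 sketch of the weight-reuse idea; function names are hypothetical and the real kernels operate on Q8_0/Q4_0 blocks, not f32.

```rust
const PREFILL_BATCH_THRESHOLD: usize = 8; // batched path dispatches at or above this

/// Per-token path: called once per prompt token, so the k*n weight values
/// are streamed from memory M times over an M-token prompt.
fn matvec(w: &[f32], x: &[f32], k: usize, n: usize, out: &mut [f32]) {
    for col in 0..n {
        out[col] = (0..k).map(|row| w[col * k + row] * x[row]).sum();
    }
}

/// Batched KxN path: each weight value is loaded once and reused across all
/// M prompt rows, cutting weight traffic from M*k*n to k*n values.
/// `out` (M x n) must be zero-initialized.
fn matmul_batched(w: &[f32], xs: &[f32], m: usize, k: usize, n: usize, out: &mut [f32]) {
    for col in 0..n {
        for row in 0..k {
            let wv = w[col * k + row];
            for t in 0..m {
                out[t * n + col] += wv * xs[t * k + row];
            }
        }
    }
}

/// Dispatch rule from the text: prompts of at least PREFILL_BATCH_THRESHOLD
/// tokens take the batched path, shorter ones keep the per-token path.
fn use_batched_prefill(prompt_len: usize) -> bool {
    prompt_len >= PREFILL_BATCH_THRESHOLD
}
```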
| Model | Binary | Weights | Total |
|---|---|---|---|
| SmolLM2-135M | 3.7 MB | 244 MB | 248 MB |
| Llama-3.2-1B | 3.7 MB | 2.2 GB | 2.2 GB |
| Llama-3.2-3B | 3.7 MB | 4.6 GB | 4.9 GB |
Binary size is constant across all models. For comparison: llama.cpp's binary is ~15 MB, and MLX requires a ~500 MB Python runtime.
We beat MLX and llama.cpp on generation across all model sizes. On prefill, we lead for short prompts on 135M and catch up at long contexts on 1B (~3x our previous number), using simdgroup_matrix hardware matrix-multiply tiles. MLX's Accelerate BLAS still edges us for mid-length 1B prefill (~325 tokens).
See benchmarks/HISTORY.md and blog/beating-llama-cpp.md for details.
```bash
# From crates.io (recommended)
cargo install forgellm-cli

# Or build from source
git clone https://github.com/sauravpanda/forge-llm.git
cd forge-llm && cargo build --release
```

```bash
# Compile model to Metal binary
forge compile --model model.gguf --output ./my-model --target metal
forge export-weights --model model.gguf --output ./my-model/weights.bin
cp tokenizer.json ./my-model/

# Build and run
cd my-model && cargo build --release
./target/release/my-model weights.bin tokenizer.json "The meaning of life is"
```

```bash
# Start OpenAI-compatible server
./target/release/my-model weights.bin tokenizer.json --serve --port 8080

# Query it
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "stream": true}'
```

```bash
# Compile for CPU with NEON SIMD + Rayon parallelism
forge compile --model model.gguf --output ./my-model --target cpu --run
```

Every existing LLM inference engine — llama.cpp, vLLM, MLX — loads model weights at runtime and executes a generic inference loop. This is like shipping a Python interpreter when you could ship a compiled binary.
ForgeLLM compiles models into hardware-specific code:
| | llama.cpp (interpreter) | ForgeLLM (compiler) |
|---|---|---|
| Dispatch | Runtime graph build + plan + execute | Direct function calls, zero overhead |
| Dimensions | Dynamic (runtime checks) | Baked in at compile time |
| GPU commands | Multiple command encoders per layer | Single encoder for entire forward pass |
| Projections | Separate Q, K, V matmuls | Fused QKV in one dispatch |
| Memory | Runtime allocation | Static, pre-allocated buffers |
| Quantization | Dequant at load time | Native Q8_0/Q4_0 GPU kernels |
| Output | Shared library + runtime | Self-contained binary, deploy with scp |
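To make "baked in at compile time" and "direct function calls" concrete, here is a schematic sketch of the shape of a compiled forward pass. The names, dimensions, and stub bodies are hypothetical illustrations, not ForgeLLM's actual generated code.

```rust
// Schematic sketch only: shows the *shape* of AOT-compiled output
// (constants, static buffers, straight-line calls), not real codegen.
const HIDDEN: usize = 2048;   // hypothetical model dims, baked in at compile time
const N_LAYERS: usize = 16;   // no runtime shape checks anywhere

struct LayerWeights;                          // stand-in for quantized tensors
struct Weights { layers: [LayerWeights; N_LAYERS] }
struct Buffers { hidden: Box<[f32; HIDDEN]> } // pre-allocated once, never resized

fn qkv_fused(_b: &mut Buffers, _w: &LayerWeights) { /* one fused Q|K|V matmul dispatch */ }
fn attention(_b: &mut Buffers, _layer: usize)     { /* flash attention */ }
fn mlp_fused(_b: &mut Buffers, _w: &LayerWeights) { /* gate+up fused, then down */ }

// The whole forward pass is direct function calls: no graph build, no plan,
// no dynamic dispatch, no runtime allocation.
fn forward(bufs: &mut Buffers, weights: &Weights) {
    for (i, lw) in weights.layers.iter().enumerate() {
        qkv_fused(bufs, lw);
        attention(bufs, i);
        mlp_fused(bufs, lw);
    }
}
```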
| Target | Command | Features |
|---|---|---|
| Metal GPU | `--target metal` | Native MSL shaders, simdgroup reductions, Q8_0/Q4_0 kernels, API server |
| CPU | `--target cpu` | NEON sdot inline asm, Rayon parallelism, Apple AMX via Accelerate |
| WASM | `--target wasm` | SIMD128, wasm-bindgen exports, browser-ready |
| wgpu/WGSL | `--target gpu` | Cross-platform GPU via WebGPU |
| Architecture | Models | Interpreter (`forge run`) | AOT Metal/CPU |
|---|---|---|---|
| LlamaForCausalLM | SmolLM2 (135M, 360M, 1.7B), Llama 3.2 (1B, 3B), TinyLlama | ✅ Verified | ✅ Verified |
| Qwen2ForCausalLM | Qwen2.5 (0.5B–7B) | ✅ Verified | ✅ Verified (0.5B Q8_0 on CPU + Metal; fixes #210) |
| Phi3ForCausalLM | Phi-3 Mini | ✅ Verified | ✅ Verified (Phi-3 Mini Q8_0 on Metal; fused QKV/gate+up split at load time in v0.6.7) |
| MistralForCausalLM | Mistral 7B (sliding-window attention) | | |
| GemmaForCausalLM | Gemma-1 2B | ✅ Verified (post-lookup embedding scale, GeLU-approx activation; +1 RMS offset is baked into llama.cpp-converted GGUFs so no runtime adjustment is needed — fixed in v0.7.5) | ✅ Verified (Gemma-1 2B Q8_0 on Metal: GELU kernel + batched embed-scale + large-col matmul routing — byte-identical to interpreter, v0.7.6) |
| GemmaForCausalLM (v2/v3) | Gemma-2, Gemma-3 | ❌ Not supported — needs logit softcap, dual norms per sublayer, sliding-window alternation | ❌ Not yet |
| StableLMForCausalLM | StableLM 1.6B, 3B | | |
Supports GGUF quantization formats: F32, F16, BF16, Q8_0, Q4_0, Q4_1, Q2_K through Q8_K. Also supports SafeTensors and LoRA adapter merging at compile time.
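For reference, Q8_0 is the simplest of these block formats: 32 weights share one f16 scale. The sketch below shows the dequantization rule from general GGUF format knowledge; it assumes the `half` crate and is not ForgeLLM's kernel, which dequantizes on the fly inside the matmul instead of materializing f32.

```rust
use half::f16; // assumed dependency, used only for the 16-bit block scale

// GGUF Q8_0 layout: each 34-byte block is one little-endian f16 scale
// followed by 32 signed 8-bit quants; dequantized value = scale * quant.
const QK8_0: usize = 32;

fn dequant_q8_0_block(block: &[u8; 34]) -> [f32; QK8_0] {
    let scale = f16::from_bits(u16::from_le_bytes([block[0], block[1]])).to_f32();
    let mut out = [0.0f32; QK8_0];
    for i in 0..QK8_0 {
        out[i] = scale * (block[2 + i] as i8) as f32;
    }
    out
}
```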
The Metal backend generates optimized Apple Silicon compute shaders:
- Hardware matrix-multiply prefill — `simdgroup_matrix<float, 8, 8>` MMA tiles dequantize Q8_0 into threadgroup memory and run 8×8×8 `simdgroup_multiply_accumulate` per tile
- Simdgroup cooperative matmul — 32-lane SIMD reductions with shared memory vector caching (fast path for single-token decode)
- Native Q8_0/Q4_0 kernels — Dequantize on-the-fly during matmul, halving memory bandwidth
- Fused projections — QKV and gate+up concatenated into single matmul dispatches (see the sketch after this list)
- Single compute encoder — Entire forward pass in one encoder, zero encoder transitions
- Double-buffered prefill — GPU overlaps with CPU encoding
- `fast::math` — Hardware-accelerated rsqrt/exp in normalization and attention
- OpenAI-compatible API — `--serve` mode with SSE streaming
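A minimal sketch of the fused-projection idea referenced in the list above, using scalar f32 and hypothetical names for clarity; the real path is a quantized Metal dispatch.

```rust
// Illustrative only: Q, K and V weights stored concatenated along the output
// dimension, so one matmul produces all three and the results are slices of
// a single output buffer (the same trick applies to gate+up in the MLP).
fn fused_qkv(w_qkv: &[f32], x: &[f32], hidden: usize, q_dim: usize, kv_dim: usize) -> Vec<f32> {
    let out_dim = q_dim + 2 * kv_dim; // Q rows, then K rows, then V rows
    let mut out = vec![0.0f32; out_dim];
    for o in 0..out_dim {
        out[o] = (0..hidden).map(|i| w_qkv[o * hidden + i] * x[i]).sum();
    }
    // out[..q_dim] is Q, out[q_dim..q_dim + kv_dim] is K, the rest is V
    out
}
```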
```bash
# AOT compile to Metal GPU binary
forge compile --model model.gguf --output ./out --target metal

# AOT compile to CPU binary
forge compile --model model.gguf --output ./out --target cpu --run

# Export weights for compiled binary
forge export-weights --model model.gguf --output ./out/weights.bin

# Run interpreter (no compilation)
forge run --model model.gguf --tokenizer tokenizer.json --prompt "Hello"

# Interactive chat
forge chat --model model.gguf --tokenizer tokenizer.json

# Start API server (interpreter mode)
forge serve --model model.gguf --tokenizer tokenizer.json --port 8080

# Benchmark
forge bench --model model.gguf --tokenizer tokenizer.json --num-tokens 128

# Inspect model
forge info model.gguf

# ONNX export
forge export-onnx --model model.gguf --output model.onnx

# Speculative decoding
forge speculative --draft small.gguf --target-model large.gguf --output ./spec
```

```
GGUF/SafeTensors → Frontend → IR Graph → Optimizer → Codegen → Binary
                    parse      build      fuse        emit      compile
```
8 crates: forgellm-frontend, forgellm-optimizer, forgellm-codegen-cpu, forgellm-codegen-wasm, forgellm-codegen-gpu, forgellm-codegen-metal, forgellm-runtime, forgellm-cli
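A conceptual sketch of how those stages compose; the types and function names here are hypothetical placeholders, not the APIs of the crates listed above.

```rust
// Conceptual pipeline sketch; not the real forgellm-* crate APIs.
struct IrGraph;          // frontend output: parsed GGUF/SafeTensors as an IR graph
struct OptimizedGraph;   // after fusion passes (QKV, gate+up, ...)
struct GeneratedCrate;   // emitted Rust host code plus target-specific kernels

fn parse(_path: &str) -> IrGraph { IrGraph }
fn optimize(_g: IrGraph) -> OptimizedGraph { OptimizedGraph }
fn codegen(_g: OptimizedGraph, _target: &str) -> GeneratedCrate { GeneratedCrate }
fn compile(_c: GeneratedCrate, _out: &str) { /* cargo build --release on the emitted crate */ }

fn main() {
    compile(codegen(optimize(parse("model.gguf")), "metal"), "./my-model");
}
```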
```bash
cargo test --workspace --exclude forgellm-python   # 258+ tests
cargo clippy --workspace -- -D warnings
cargo fmt --all -- --check
```

MIT