High-performance LLM inference engine for NVIDIA Blackwell & Hopper GPUs.
~40k lines of C++20/CUDA — built entirely by Claude Code (Opus 4.6).
imp is a CUDA inference engine written from scratch — every line of code, every kernel, every optimization was generated by Claude Code (Claude Opus 4.6). The goal: see how far an AI coding agent can go on a genuinely hard systems programming task.
This is not a wrapper. imp implements its own GGUF parser, tokenizer, KV cache, attention kernels (scalar, WMMA, CUTLASS FMHA), quantized GEMV (dp4a), MoE routing, speculative decoding, CUDA Graphs, and more — targeting Blackwell (sm_120, sm_100) and Hopper (sm_90a) with CUDA 13.1.
Decode throughput on an RTX 5090 (Blackwell, sm_120, 32 GB GDDR7), CUDA 13.1.1, with NVFP4 decode + FP8 prefill:
| Model | Quant | imp (tok/s) | llama.cpp (tok/s) | Δ |
|---|---|---|---|---|
| Qwen3-4B | Q8_0 | 390 | 244 | +60% |
| Qwen3-8B | Q8_0 | 264 | 157 | +68% |
| Gemma-3-12B | Q8_0 | 139 | 98 | +42% |
| Qwen3-Coder-30B (MoE) | Q6_K | 265 | 251 | +6% |
Prefill: 5.6k–25.8k tok/s. Full results with 12 models: BENCHMARKS.md
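The Δ column follows directly from the two throughput columns; as a quick sanity check:

```python
# Decode throughput from the table above: (imp tok/s, llama.cpp tok/s).
results = {
    "Qwen3-4B (Q8_0)": (390, 244),
    "Qwen3-8B (Q8_0)": (264, 157),
    "Gemma-3-12B (Q8_0)": (139, 98),
    "Qwen3-Coder-30B (Q6_K)": (265, 251),
}

for model, (imp_tps, llama_tps) in results.items():
    delta = (imp_tps / llama_tps - 1) * 100  # percent speedup over llama.cpp
    print(f"{model}: +{delta:.0f}%")
```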
No local CUDA toolkit needed — everything runs in Docker.
```bash
# 1. Clone and enter the repo
git clone https://github.com/kekzl/imp.git && cd imp

# 2. Download a model (any GGUF from Hugging Face)
mkdir -p models
# Example: Qwen3-8B Q8_0 (~8.6 GB)

# 3. Build
docker compose build imp-server

# 4. Run the server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3-8B-Q8_0.gguf

# 5. Chat (OpenAI-compatible API)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'
```

Works with any OpenAI-compatible client:
```python
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
r = client.chat.completions.create(
    model="default", messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64, stream=True)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

Or use the CLI directly:
```bash
# Interactive chat
docker run -it --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf --interactive

# Single prompt
docker run --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf \
  --prompt "Explain quantum computing in 3 sentences."

# Benchmark (compare with llama-bench)
docker run --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf \
  --bench --bench-pp 512 --max-tokens 128 --bench-reps 5
```

With a local (non-Docker) build, the same commands run against `./build/imp-cli`:

```bash
# Single prompt
./build/imp-cli --model model.gguf --prompt "Hello, world!"

# Interactive chat
./build/imp-cli --model model.gguf --interactive

# Vision (Gemma-3)
./build/imp-cli --model gemma-3-12b-it.gguf --mmproj mmproj.gguf \
  --image photo.jpg --prompt "Describe this image"

# NVFP4 decode cache (auto-enabled on Blackwell)
./build/imp-cli --model model.gguf --decode-nvfp4 --interactive

# Benchmark (matches llama-bench methodology)
./build/imp-cli --model model.gguf --bench --bench-pp 512 --max-tokens 128 --bench-reps 5
```

Full CLI options:
```text
Model:
  --model <path>            Path to GGUF or SafeTensors model
  --mmproj <path>           Vision encoder GGUF for multimodal
  --image <path>            Input image (requires --mmproj)
  --device <n>              CUDA device ID (default: 0)
  --gpu-layers <n>          Layers on GPU, -1 = all (default: -1)

Generation:
  --prompt <text>           Input prompt
  --max-tokens <n>          Max tokens to generate (default: 256)
  --interactive             Interactive chat mode
  --stop <str>              Stop sequence (repeatable, up to 4)
  --chat-template <t>       auto|none|chatml|llama2|llama3|nemotron|gemma|deepseek_r1|phi

Sampling:
  --temperature <f>         (default: 0.7)
  --top-p <f>               (default: 0.9)
  --top-k <n>               (default: 40)
  --min-p <f>               (default: 0.0, disabled)
  --typical-p <f>           (default: 1.0, disabled)
  --repeat-penalty <f>      (default: 1.0, disabled)
  --repeat-last-n <n>       Penalty window (default: 0, all tokens)
  --frequency-penalty <f>   (default: 0.0)
  --presence-penalty <f>    (default: 0.0)
  --seed <n>                -1 for random (default: -1)
  --dry-multiplier <f>      DRY penalty scale (default: 0.0, disabled)
  --dry-base <f>            DRY exponential base (default: 1.75)
  --dry-allowed-length <n>  (default: 2)
  --dry-penalty-last-n <n>  (default: 0, all)
  --mirostat <n>            0=off, 2=v2 (default: 0)
  --mirostat-tau <f>        (default: 5.0)
  --mirostat-eta <f>        (default: 0.1)

Performance:
  --kv-fp8                  FP8 E4M3 KV cache
  --kv-int8                 INT8 KV cache
  --prefill-fp8             FP8 weight cache for prefill
  --prefill-chunk-size <n>  Max tokens per prefill chunk (default: 0)
  --decode-nvfp4            NVFP4 decode cache (FP16 prefill + NVFP4 decode)
  --decode-nvfp4-only       NVFP4 decode-only (saves VRAM, slower prefill)
  --no-nvfp4                Disable NVFP4 auto-detection
  --ssm-fp16                FP16 SSM state
  --no-cuda-graphs          Disable CUDA Graphs

Benchmark:
  --bench                   Synthetic benchmark mode (warmup + timed reps)
  --bench-pp <n>            Prompt tokens (default: 512)
  --bench-reps <n>          Repetitions (default: 3)
```
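Many of the sampling flags above have direct counterparts in the OpenAI-style `/v1/chat/completions` request body. Below is a minimal sketch of such a body mirroring the CLI defaults; `temperature`, `top_p`, `seed`, and the penalties are standard OpenAI fields, while `top_k` and `min_p` are engine-specific extensions, and whether the server accepts them under exactly these names is an assumption.

```python
import json

# Request body for POST /v1/chat/completions, mirroring the CLI
# sampling defaults listed above. "top_k" and "min_p" are assumed
# engine-specific extensions, not standard OpenAI parameters.
body = {
    "model": "default",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "seed": -1,   # -1 = random seed, as in the CLI
    "top_k": 40,  # ASSUMED extension field
    "min_p": 0.0, # ASSUMED extension field
}
print(json.dumps(body, indent=2))
```

The same fields can be passed from the `openai` Python client via its `extra_body` argument for the non-standard keys.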
- Architectures: LLaMA, Mistral, Mixtral, DeepSeek, Qwen3, Qwen3-MoE, Phi-4, Gemma-3 (text + vision), Nemotron-H (Mamba2 + Attention + MoE)
- Model formats: GGUF, SafeTensors
- Quantization: Q2_K, Q3_K, Q4_0, Q4_K_M, Q5_K, Q6_K, Q8_0, FP8 E4M3, NVFP4 (FP4 E2M1), INT8
- Vision: Gemma-3 SigLIP encoder (896x896, 256 image tokens) via separate mmproj.gguf
- Attention: scalar Flash Attention 2 → WMMA tensor-core (sm_90+) → WMMA 8-warp (sm_120+); CUTLASS Hopper FMHA for prefill (WGMMA + TMA)
- KV cache: paged blocks (configurable 16/32/64), LRU eviction, prefix caching with block pinning, FP16/FP8/INT8/INT4
- Decode: CUDA Graphs (conditional WHILE loop), PDL, fused RMSNorm+Q8_1, fused QKV/gate+up GEMV, NVFP4 decode cache with prmt register LUT, multi-block argmax
- Prefill: CUTLASS FMHA (WGMMA + TMA), CUTLASS NVFP4 GEMM (sm_120), FP8 cuBLASLt, FP16/FP8 weight cache, batched K/V GEMM
- Sampling: temperature, top-p, top-k, min-p, typical-p, repetition/frequency/presence penalties (windowed), DRY, Mirostat v2
- Runtime: continuous batching, speculative decoding, Green Context SM partitioning, upfront VRAM budget planner
- Agentic: prefix cache block pinning, JSON schema constraining, tool calling (ChatML + Llama3), thinking/reasoning budgets, TTFT metrics
- API: C library, OpenAI-compatible HTTP server (SSE streaming, tool calling, logprobs, JSON mode, concurrent requests, rate limiting)
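The quantized KV-cache modes above (FP16/FP8/INT8/INT4) trade precision for VRAM. A back-of-the-envelope sizing sketch, using assumed Qwen3-8B-like attention shapes (36 layers, 8 KV heads of dimension 128; illustrative numbers, not taken from this README):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Qwen3-8B-like shapes (illustrative only):
layers, kv_heads, head_dim, ctx = 36, 8, 128, 32768

for name, nbytes in [("FP16", 2), ("FP8/INT8", 1), ("INT4", 0.5)]:
    gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx, nbytes) / 2**30
    print(f"{name}: {gib:.2f} GiB for a {ctx}-token context")
```

Under these assumptions an FP8 cache halves the 4.5 GiB FP16 footprint, which is why the lower-precision modes matter on a 32 GB card.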
Tested models: Qwen3-4B, Qwen3-8B, Qwen3-32B (Q4_K_M), Qwen3-Coder-30B (MoE), Gemma-3-12B/27B, DeepSeek-R1-7B/14B, Nemotron-3-Nano-30B (Mamba2+MoE), Phi-4-Mini, Llama 3.1 8B, Mistral 7B, Mixtral 8x7B, Devstral. Other models sharing the same architectures (LLaMA, Mistral, Qwen3, Gemma-3, DeepSeek, Nemotron-H) should work.
| Document | Description |
|---|---|
| Benchmarks | Full benchmark results vs llama.cpp (12 models, RTX 5090) |
| Usage & Reference | Build instructions, server setup, C API, project structure, architecture |
| Technical Comparison | imp vs llama.cpp — architecture, features, performance |
| Memory Management | VRAM/RAM strategies: imp vs llama.cpp vs Ollama vs vLLM |
| GEMV Dispatch | Complete map of quantization dispatch across all decode paths |
| CUDA 13.1 Audit | Feature inventory, performance phases P2-P29, architecture review |
Built by @kekzl with Claude Code (Claude Opus 4.6) as a proof of concept for AI-assisted systems programming.
Stands on the shoulders of llama.cpp — the GGUF format, quantization schemes, and the entire concept of practical local LLM inference were pioneered by Georgi Gerganov and the llama.cpp community.
MIT — see LICENSE.