High-performance LLM inference engine for NVIDIA Blackwell & Hopper GPUs.
~40k lines of C++20/CUDA — built entirely by Claude Code (Opus 4.6).
imp is a CUDA inference engine written from scratch — every line of code, every kernel, every optimization was generated by Claude Code (Claude Opus 4.6). The goal: see how far an AI coding agent can go on a genuinely hard systems programming task.
This is not a wrapper. imp implements its own GGUF parser, tokenizer, KV cache, attention kernels (scalar, WMMA, CUTLASS FMHA), quantized GEMV (dp4a), MoE routing, speculative decoding, CUDA Graphs, and more — targeting Blackwell (sm_120, sm_100) and Hopper (sm_90a) with CUDA 13.1.
Decode throughput on an RTX 5090 (Blackwell, sm_120, 32 GB GDDR7), CUDA 13.1.1, with NVFP4 decode + FP8 prefill:
| Model | Quant | imp (tok/s) | llama.cpp (tok/s) | Δ |
|---|---|---|---|---|
| Qwen3-4B | Q8_0 | 390 | 244 | +60% |
| Qwen3-8B | Q8_0 | 264 | 157 | +68% |
| Gemma-3-12B | Q8_0 | 139 | 98 | +42% |
| Qwen3-Coder-30B (MoE) | Q6_K | 265 | 251 | +6% |
Prefill: 5.6k–25.8k tok/s. Full results with 12 models: BENCHMARKS.md
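The Δ column follows directly from the two throughput columns; as a quick sanity check:

```python
# Decode throughput from the table above: (imp tok/s, llama.cpp tok/s).
results = {
    "Qwen3-4B (Q8_0)": (390, 244),
    "Qwen3-8B (Q8_0)": (264, 157),
    "Gemma-3-12B (Q8_0)": (139, 98),
    "Qwen3-Coder-30B (Q6_K)": (265, 251),
}

for model, (imp_tps, llama_tps) in results.items():
    delta = (imp_tps / llama_tps - 1) * 100  # percent speedup over llama.cpp
    print(f"{model}: +{delta:.0f}%")
```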
No local CUDA toolkit needed — everything runs in Docker.
```bash
# 1. Clone and enter the repo
git clone https://github.com/kekzl/imp.git && cd imp

# 2. Download a model (any GGUF from Hugging Face)
mkdir -p models
# Example: Qwen3-8B Q8_0 (~8.6 GB)

# 3. Build
docker compose build imp-server

# 4. Run the server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3-8B-Q8_0.gguf

# 5. Chat (OpenAI-compatible API)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'
```

Works with any OpenAI-compatible client:
```python
# pip install openai
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
r = client.chat.completions.create(
    model="default", messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64, stream=True)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
```

Or use the CLI directly:
```bash
# Interactive chat
docker run -it --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf --interactive

# Single prompt
docker run --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf \
  --prompt "Explain quantum computing in 3 sentences."

# Benchmark (compare with llama-bench)
docker run --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf \
  --bench --bench-pp 512 --max-tokens 128 --bench-reps 5
```

With a local (non-Docker) build, the same commands run against `./build/imp-cli`:

```bash
# Single prompt
./build/imp-cli --model model.gguf --prompt "Hello, world!"

# Interactive chat
./build/imp-cli --model model.gguf --interactive

# Vision (Gemma-3)
./build/imp-cli --model gemma-3-12b-it.gguf --mmproj mmproj.gguf \
  --image photo.jpg --prompt "Describe this image"

# NVFP4 decode cache (auto-enabled on Blackwell)
./build/imp-cli --model model.gguf --decode-nvfp4 --interactive

# Benchmark (matches llama-bench methodology)
./build/imp-cli --model model.gguf --bench --bench-pp 512 --max-tokens 128 --bench-reps 5
```

Full CLI options:
```text
Model:
  --model <path>            Path to GGUF or SafeTensors model
  --mmproj <path>           Vision encoder GGUF for multimodal
  --image <path>            Input image (requires --mmproj)
  --device <n>              CUDA device ID (default: 0)
  --gpu-layers <n>          Layers on GPU, -1 = all (default: -1)

Generation:
  --prompt <text>           Input prompt
  --max-tokens <n>          Max tokens to generate (default: 256)
  --interactive             Interactive chat mode
  --stop <str>              Stop sequence (repeatable, up to 4)
  --chat-template <t>       auto|none|chatml|llama2|llama3|nemotron|gemma|deepseek_r1|phi

Sampling:
  --temperature <f>         (default: 0.7)
  --top-p <f>               (default: 0.9)
  --top-k <n>               (default: 40)
  --min-p <f>               (default: 0.0, disabled)
  --typical-p <f>           (default: 1.0, disabled)
  --repeat-penalty <f>      (default: 1.0, disabled)
  --repeat-last-n <n>       Penalty window (default: 0, all tokens)
  --frequency-penalty <f>   (default: 0.0)
  --presence-penalty <f>    (default: 0.0)
  --seed <n>                -1 for random (default: -1)
  --dry-multiplier <f>      DRY penalty scale (default: 0.0, disabled)
  --dry-base <f>            DRY exponential base (default: 1.75)
  --dry-allowed-length <n>  (default: 2)
  --dry-penalty-last-n <n>  (default: 0, all)
  --mirostat <n>            0=off, 2=v2 (default: 0)
  --mirostat-tau <f>        (default: 5.0)
  --mirostat-eta <f>        (default: 0.1)

Performance:
  --kv-fp8                  FP8 E4M3 KV cache
  --kv-int8                 INT8 KV cache
  --prefill-fp8             FP8 weight cache for prefill
  --prefill-chunk-size <n>  Max tokens per prefill chunk (default: 0)
  --decode-nvfp4            NVFP4 decode cache (FP16 prefill + NVFP4 decode)
  --decode-nvfp4-only       NVFP4 decode-only (saves VRAM, slower prefill)
  --no-nvfp4                Disable NVFP4 auto-detection
  --ssm-fp16                FP16 SSM state
  --no-cuda-graphs          Disable CUDA Graphs

Benchmark:
  --bench                   Synthetic benchmark mode (warmup + timed reps)
  --bench-pp <n>            Prompt tokens (default: 512)
  --bench-reps <n>          Repetitions (default: 3)
```
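Many of the sampling flags above have direct counterparts in the OpenAI-style `/v1/chat/completions` request body. Below is a minimal sketch of such a body mirroring the CLI defaults; `temperature`, `top_p`, `seed`, and the penalties are standard OpenAI fields, while `top_k` and `min_p` are engine-specific extensions, and whether the server accepts them under exactly these names is an assumption.

```python
import json

# Request body for POST /v1/chat/completions, mirroring the CLI
# sampling defaults listed above. "top_k" and "min_p" are assumed
# engine-specific extensions, not standard OpenAI parameters.
body = {
    "model": "default",
    "messages": [{"role": "user", "content": "Write a haiku about GPUs."}],
    "max_tokens": 64,
    "temperature": 0.7,
    "top_p": 0.9,
    "frequency_penalty": 0.0,
    "presence_penalty": 0.0,
    "seed": -1,   # -1 = random seed, as in the CLI
    "top_k": 40,  # ASSUMED extension field
    "min_p": 0.0, # ASSUMED extension field
}
print(json.dumps(body, indent=2))
```

The same fields can be passed from the `openai` Python client via its `extra_body` argument for the non-standard keys.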
- Architectures: LLaMA, Mistral, Mixtral, DeepSeek, Qwen3, Qwen3-MoE, Phi-4, Gemma-3 (text + vision), Nemotron-H (Mamba2 + Attention + MoE)
- Model formats: GGUF, SafeTensors
- Quantization: Q2_K, Q3_K, Q4_0, Q4_K_M, Q5_K, Q6_K, Q8_0, FP8 E4M3, NVFP4 (FP4 E2M1), INT8
- Vision: Gemma-3 SigLIP encoder (896x896, 256 image tokens) via separate mmproj.gguf
- Attention: scalar Flash Attention 2 → WMMA tensor-core (sm_90+) → WMMA 8-warp (sm_120+); CUTLASS Hopper FMHA for prefill (WGMMA + TMA)
- KV cache: paged blocks (configurable 16/32/64), LRU eviction, prefix caching with block pinning, FP16/FP8/INT8/INT4
- Decode: CUDA Graphs (conditional WHILE loop), PDL, fused RMSNorm+Q8_1, fused QKV/gate+up GEMV, NVFP4 decode cache with prmt register LUT, multi-block argmax
- Prefill: CUTLASS FMHA (WGMMA + TMA), CUTLASS NVFP4 GEMM (sm_120), FP8 cuBLASLt, FP16/FP8 weight cache, batched K/V GEMM
- Sampling: temperature, top-p, top-k, min-p, typical-p, repetition/frequency/presence penalties (windowed), DRY, Mirostat v2
- Runtime: continuous batching, speculative decoding, Green Context SM partitioning, upfront VRAM budget planner
- Agentic: prefix cache block pinning, JSON schema constraining, tool calling (ChatML + Llama3), thinking/reasoning budgets, TTFT metrics
- API: C library, OpenAI-compatible HTTP server (SSE streaming, tool calling, logprobs, JSON mode, concurrent requests, rate limiting)
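The quantized KV-cache modes above (FP16/FP8/INT8/INT4) trade precision for VRAM. A back-of-the-envelope sizing sketch, using assumed Qwen3-8B-like attention shapes (36 layers, 8 KV heads of dimension 128; illustrative numbers, not taken from this README):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Assumed Qwen3-8B-like shapes (illustrative only):
layers, kv_heads, head_dim, ctx = 36, 8, 128, 32768

for name, nbytes in [("FP16", 2), ("FP8/INT8", 1), ("INT4", 0.5)]:
    gib = kv_cache_bytes(layers, kv_heads, head_dim, ctx, nbytes) / 2**30
    print(f"{name}: {gib:.2f} GiB for a {ctx}-token context")
```

Under these assumptions an FP8 cache halves the 4.5 GiB FP16 footprint, which is why the lower-precision modes matter on a 32 GB card.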
Tested models: Qwen3-4B, Qwen3-8B, Qwen3-32B (Q4_K_M), Qwen3-Coder-30B (MoE), Gemma-3-12B/27B, DeepSeek-R1-7B/14B, Nemotron-3-Nano-30B (Mamba2+MoE), Phi-4-Mini, Llama 3.1 8B, Mistral 7B, Mixtral 8x7B, Devstral. Other models sharing the same architectures (LLaMA, Mistral, Qwen3, Gemma-3, DeepSeek, Nemotron-H) should work.
| Document | Description |
|---|---|
| Benchmarks | Full benchmark results vs llama.cpp (12 models, RTX 5090) |
| Usage & Reference | Build instructions, server setup, C API, project structure, architecture |
| Technical Comparison | imp vs llama.cpp — architecture, features, performance |
| Memory Management | VRAM/RAM strategies: imp vs llama.cpp vs Ollama vs vLLM |
| GEMV Dispatch | Complete map of quantization dispatch across all decode paths |
| CUDA 13.1 Audit | Feature inventory, performance phases P2-P29, architecture review |
Built by @kekzl with Claude Code (Claude Opus 4.6) as a proof of concept for AI-assisted systems programming.
Stands on the shoulders of llama.cpp — the GGUF format, quantization schemes, and the entire concept of practical local LLM inference were pioneered by Georgi Gerganov and the llama.cpp community.
MIT — see LICENSE.