imp

High-performance LLM inference engine for NVIDIA Blackwell & Hopper GPUs.

~40k lines of C++20/CUDA — built entirely by Claude Code (Opus 4.6).



What is this?

imp is a CUDA inference engine written from scratch — every line of code, every kernel, every optimization was generated by Claude Code (Claude Opus 4.6). The goal: see how far an AI coding agent can go on a genuinely hard systems programming task.

This is not a wrapper. imp implements its own GGUF parser, tokenizer, KV cache, attention kernels (scalar, WMMA, CUTLASS FMHA), quantized GEMV (dp4a), MoE routing, speculative decoding, CUDA Graphs, and more — targeting Blackwell (sm_120, sm_100) and Hopper (sm_90a) with CUDA 13.1.

Performance

RTX 5090 (Blackwell, sm_120, 32 GB GDDR7) — CUDA 13.1.1, NVFP4 decode + FP8 prefill.

| Model | Quant | imp (tok/s) | llama.cpp (tok/s) | Δ |
|---|---|---|---|---|
| Qwen3-4B | Q8_0 | 390 | 244 | +60% |
| Qwen3-8B | Q8_0 | 264 | 157 | +68% |
| Gemma-3-12B | Q8_0 | 139 | 98 | +42% |
| Qwen3-Coder-30B (MoE) | Q6_K | 265 | 251 | +6% |

Prefill: 5.6k–25.8k tok/s. Full results with 12 models: BENCHMARKS.md

Quickstart

No local CUDA toolkit needed — everything runs in Docker.

# 1. Clone and enter the repo
git clone https://github.com/kekzl/imp.git && cd imp

# 2. Download a model (any GGUF from Hugging Face)
mkdir -p models
# Example: Qwen3-8B Q8_0 (~8.6 GB)

# 3. Build
docker compose build imp-server

# 4. Run the server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3-8B-Q8_0.gguf

# 5. Chat (OpenAI-compatible API)
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'

Works with any OpenAI-compatible client:

# pip install openai
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="none")
r = client.chat.completions.create(
    model="default", messages=[{"role": "user", "content": "Hello!"}],
    max_tokens=64, stream=True)
for chunk in r:
    print(chunk.choices[0].delta.content or "", end="", flush=True)

Or use the CLI directly:

# Interactive chat
docker run -it --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf --interactive

# Single prompt
docker run --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf \
  --prompt "Explain quantum computing in 3 sentences."

# Benchmark (compare with llama-bench)
docker run --gpus all -v ./models:/models \
  imp:latest imp-cli --model /models/Qwen3-8B-Q8_0.gguf \
  --bench --bench-pp 512 --max-tokens 128 --bench-reps 5

CLI

# Single prompt
./build/imp-cli --model model.gguf --prompt "Hello, world!"

# Interactive chat
./build/imp-cli --model model.gguf --interactive

# Vision (Gemma-3)
./build/imp-cli --model gemma-3-12b-it.gguf --mmproj mmproj.gguf \
  --image photo.jpg --prompt "Describe this image"

# NVFP4 decode cache (auto-enabled on Blackwell)
./build/imp-cli --model model.gguf --decode-nvfp4 --interactive

# Benchmark (matches llama-bench methodology)
./build/imp-cli --model model.gguf --bench --bench-pp 512 --max-tokens 128 --bench-reps 5
Full CLI options
Model:
  --model <path>            Path to GGUF or SafeTensors model
  --mmproj <path>           Vision encoder GGUF for multimodal
  --image <path>            Input image (requires --mmproj)
  --device <n>              CUDA device ID (default: 0)
  --gpu-layers <n>          Layers on GPU, -1 = all (default: -1)

Generation:
  --prompt <text>           Input prompt
  --max-tokens <n>          Max tokens to generate (default: 256)
  --interactive             Interactive chat mode
  --stop <str>              Stop sequence (repeatable, up to 4)
  --chat-template <t>       auto|none|chatml|llama2|llama3|nemotron|gemma|deepseek_r1|phi

Sampling:
  --temperature <f>         (default: 0.7)
  --top-p <f>               (default: 0.9)
  --top-k <n>               (default: 40)
  --min-p <f>               (default: 0.0, disabled)
  --typical-p <f>           (default: 1.0, disabled)
  --repeat-penalty <f>      (default: 1.0, disabled)
  --repeat-last-n <n>       Penalty window (default: 0, all tokens)
  --frequency-penalty <f>   (default: 0.0)
  --presence-penalty <f>    (default: 0.0)
  --seed <n>                -1 for random (default: -1)
  --dry-multiplier <f>      DRY penalty scale (default: 0.0, disabled)
  --dry-base <f>            DRY exponential base (default: 1.75)
  --dry-allowed-length <n>  (default: 2)
  --dry-penalty-last-n <n>  (default: 0, all)
  --mirostat <n>            0=off, 2=v2 (default: 0)
  --mirostat-tau <f>        (default: 5.0)
  --mirostat-eta <f>        (default: 0.1)

Performance:
  --kv-fp8                  FP8 E4M3 KV cache
  --kv-int8                 INT8 KV cache
  --prefill-fp8             FP8 weight cache for prefill
  --prefill-chunk-size <n>  Max tokens per prefill chunk (default: 0)
  --decode-nvfp4            NVFP4 decode cache (FP16 prefill + NVFP4 decode)
  --decode-nvfp4-only       NVFP4 decode-only (saves VRAM, slower prefill)
  --no-nvfp4                Disable NVFP4 auto-detection
  --ssm-fp16                FP16 SSM state
  --no-cuda-graphs          Disable CUDA Graphs

Benchmark:
  --bench                   Synthetic benchmark mode (warmup + timed reps)
  --bench-pp <n>            Prompt tokens (default: 512)
  --bench-reps <n>          Repetitions (default: 3)

Features

  • Architectures: LLaMA, Mistral, Mixtral, DeepSeek, Qwen3, Qwen3-MoE, Phi-4, Gemma-3 (text + vision), Nemotron-H (Mamba2 + Attention + MoE)
  • Model formats: GGUF, SafeTensors
  • Quantization: Q2_K, Q3_K, Q4_0, Q4_K_M, Q5_K, Q6_K, Q8_0, FP8 E4M3, NVFP4 (FP4 E2M1), INT8
  • Vision: Gemma-3 SigLIP encoder (896x896, 256 image tokens) via separate mmproj.gguf
  • Attention: scalar Flash Attention 2 → WMMA tensor-core (sm_90+) → WMMA 8-warp (sm_120+); CUTLASS Hopper FMHA for prefill (WGMMA + TMA)
  • KV cache: paged blocks (configurable 16/32/64), LRU eviction, prefix caching with block pinning, FP16/FP8/INT8/INT4
  • Decode: CUDA Graphs (conditional WHILE loop), PDL, fused RMSNorm+Q8_1, fused QKV/gate+up GEMV, NVFP4 decode cache with prmt register LUT, multi-block argmax
  • Prefill: CUTLASS FMHA (WGMMA + TMA), CUTLASS NVFP4 GEMM (sm_120), FP8 cuBLASLt, FP16/FP8 weight cache, batched K/V GEMM
  • Sampling: temperature, top-p, top-k, min-p, typical-p, repetition/frequency/presence penalties (windowed), DRY, Mirostat v2
  • Runtime: continuous batching, speculative decoding, Green Context SM partitioning, upfront VRAM budget planner
  • Agentic: prefix cache block pinning, JSON schema constraining, tool calling (ChatML + Llama3), thinking/reasoning budgets, TTFT metrics
  • API: C library, OpenAI-compatible HTTP server (SSE streaming, tool calling, logprobs, JSON mode, concurrent requests, rate limiting)
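To give a feel for the speculative decoding listed under Runtime: a draft model proposes several tokens cheaply, the target model verifies them in one batched pass, and the longest agreeing prefix is accepted. This is a minimal greedy-acceptance sketch (imp's actual acceptance rule may differ, e.g. probabilistic acceptance):

```python
def accept_draft(draft_tokens, target_argmax):
    # Greedy speculative decoding: accept the longest prefix of the
    # draft that matches the target model's argmax at each position,
    # then append the target's own token at the first mismatch.
    accepted = []
    for d, t in zip(draft_tokens, target_argmax):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # correction token from the target
            break
    return accepted

print(accept_draft([5, 9, 2, 7], [5, 9, 4, 7]))  # → [5, 9, 4]
```

When the draft agrees often, several tokens are committed per target-model pass, which is where the speedup comes from.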

Tested models: Qwen3-4B, Qwen3-8B, Qwen3-32B (Q4_K_M), Qwen3-Coder-30B (MoE), Gemma-3-12B/27B, DeepSeek-R1-7B/14B, Nemotron-3-Nano-30B (Mamba2+MoE), Phi-4-Mini, Llama 3.1 8B, Mistral 7B, Mixtral 8x7B, Devstral. Other models sharing the same architectures (LLaMA, Mistral, Qwen3, Gemma-3, DeepSeek, Nemotron-H) should work.
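For readers unfamiliar with the quantization formats above, Q8_0 is the simplest: weights are grouped into blocks of 32, each block stores one scale plus 32 signed 8-bit quants. A reference sketch in Python (illustrative only; the real kernels operate on packed GGUF byte layouts):

```python
def q8_0_quantize(block):
    # Q8_0 (GGUF): 32 weights per block, one scale d = amax / 127,
    # quants q_i = round(x_i / d) stored as int8.
    assert len(block) == 32
    amax = max(abs(x) for x in block)
    d = amax / 127.0 if amax > 0 else 0.0
    q = [round(x / d) if d else 0 for x in block]
    return d, q

def q8_0_dequantize(d, q):
    return [d * v for v in q]

block = [(-1) ** i * (i / 31.0) for i in range(32)]
d, q = q8_0_quantize(block)
recon = q8_0_dequantize(d, q)
err = max(abs(a - b) for a, b in zip(block, recon))
print(f"max abs reconstruction error: {err:.5f}")
```

The K-quants (Q4_K, Q6_K, …) refine this idea with per-superblock scales and minimums; NVFP4 and FP8 E4M3 are hardware floating-point formats with their own scaling.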

Documentation

| Document | Description |
|---|---|
| Benchmarks | Full benchmark results vs llama.cpp (12 models, RTX 5090) |
| Usage & Reference | Build instructions, server setup, C API, project structure, architecture |
| Technical Comparison | imp vs llama.cpp — architecture, features, performance |
| Memory Management | VRAM/RAM strategies: imp vs llama.cpp vs Ollama vs vLLM |
| GEMV Dispatch | Complete map of quantization dispatch across all decode paths |
| CUDA 13.1 Audit | Feature inventory, performance phases P2-P29, architecture review |

Acknowledgments

Built by @kekzl with Claude Code (Claude Opus 4.6) as a proof of concept for AI-assisted systems programming.

Stands on the shoulders of llama.cpp — the GGUF format, quantization schemes, and the entire concept of practical local LLM inference were pioneered by Georgi Gerganov and the llama.cpp community.

License

MIT — see LICENSE.