imp

Single-GPU LLM inference engine for NVIDIA Blackwell (RTX 5090).

License: MIT · CUDA 13.2+ · C++20 · Status: experimental


What it is

imp is a from-scratch CUDA inference engine targeting one GPU: the NVIDIA RTX 5090 (Blackwell, sm_120f). It implements its own GGUF and SafeTensors loaders, BPE tokenizer, paged KV cache, attention kernels, MoE routing, Gated DeltaNet, and CUDA Graph capture, and it serves an OpenAI-compatible HTTP API. It is not a wrapper around any other engine.

Every line was generated by an AI coding agent (Claude Code). The project is a long-running experiment in how far that approach can be pushed on a serious systems-programming task.

Why it might be interesting

  • Single-GPU, single-architecture focus. No portability layer, no fallbacks. Every kernel is written for sm_120f and uses Blackwell-specific features (PDL, Green Contexts, NVFP4 tensor cores, packed cvt.e4m3x2).
  • Native NVFP4 weight + KV path. SafeTensors NVFP4 prequant (NVIDIA Model Optimizer + llm-compressor) loads directly into NVFP4 tensor-core kernels — no FP16 dequant fallback in the hot path.
  • Gated DeltaNet hybrid models. Qwen3.5 / Qwen3.6 (GDN + attention + MoE) run with fused multi-token GDN scan and register-cached recurrent state.
  • Continuous batching server. imp-server exposes OpenAI /v1/chat/completions and Anthropic /v1/messages (both streaming and non-streaming) with prefix caching, JSON-schema constrained decoding, and tool calling; a minimal request sketch follows this list.
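
For example, a minimal non-streaming request against the Anthropic-style endpoint could look like the sketch below. The body follows the public Anthropic Messages wire format; whether imp also requires the model field (or an anthropic-version header) when only one model is loaded is an assumption, so check docs/usage.md.

curl -s http://localhost:8080/v1/messages \
  -H "Content-Type: application/json" \
  -d '{"model":"your-model","max_tokens":64,"messages":[{"role":"user","content":"Hello!"}]}'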

Status

Experimental. The codebase is single-author / single-target / single-GPU. There are open bugs (see TODO.md), some quantization paths produce coherent output only on specific model families, and prefill numbers vary by up to 2.6× across container restarts because of cuBLAS autotuning. Don't deploy this anywhere it matters.

Quickstart

Everything runs in Docker; no local CUDA toolkit needed.

# 1. Clone
git clone https://github.com/kekzl/imp.git && cd imp

# 2. Drop a GGUF or SafeTensors model into ./models/
mkdir -p models
# (Example: any *.gguf or NVFP4 prequant SafeTensors directory)

# 3. Build the server image
docker compose build imp-server

# 4. Serve it
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/your-model.gguf

# 5. Hit the OpenAI-compatible endpoint
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"max_tokens":64}'

See docs/usage.md for the full CLI reference, server flags, and C-API embedding guide.

Supported hardware

sm_120f only — NVIDIA RTX 5090 (GB202, Blackwell).

There are no architecture fallbacks. The code will fail to compile or fail at startup on any other SM. No support for Hopper, Ada, Ampere, or earlier; no AMD/Apple/CPU paths.
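
To confirm the card in the machine actually reports Blackwell's compute capability (12.0) before building, a quick check with nvidia-smi works on recent drivers; the exact name string varies by driver version.

# Expect something like: NVIDIA GeForce RTX 5090, 12.0
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader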

Supported models

| Family | Variants | Quantizations |
|---|---|---|
| Qwen3 / Qwen3-MoE | dense + MoE | Q4_K_M, Q6_K, Q8_0, NVFP4, MXFP4 |
| Qwen3.5 / Qwen3.6 | GDN + attention (+ MoE) | Q4_K_M, Q8_0, NVFP4 |
| Gemma-4 (26B-A4B MoE) | MoE | Q4_K_M, Q5_K_M, Q8_0, NVFP4 |
| Llama / Mistral / Mixtral / DeepSeek | dense + MoE | GGUF (Q*_K, Q8_0), FP8 |
| Gemma-3 | text + vision (SigLIP) | GGUF |
| Nemotron-H | Mamba2 + attention + MoE | GGUF |

Tested-and-verified models with VRAM and decode tok/s: docs/supported-models.md.
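
GGUF files are served exactly as in the quickstart. For NVFP4 prequant SafeTensors, the quickstart comment implies --model also accepts a model directory; the sketch below assumes that, and the directory name is a made-up placeholder.

# Hypothetical directory name; point --model at the prequant SafeTensors folder
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3.6-NVFP4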

Performance

Headline: Qwen3-8B Q8_0 decode at 255 tok/s on a single RTX 5090 (greedy, 256 output tokens, 3-rep average) — about 1.6× llama.cpp b8445 on the same hardware.

Long-context prefill (pp=8192) is where imp is consistently ahead of llama.cpp: 1.13× to 1.70× across the dense models tested. NVFP4 prequant decode (Qwen3.6, Gemma-4, Qwen3-Coder) ranges from 200–270 tok/s.

Full numbers, methodology, and the tests/perf_baseline.json regression gate: docs/performance.md.

Caveats. Numbers are from one machine, one run series. Prefill (pp512) varies by up to 2.6× across container restarts due to cuBLAS algorithm selection, so the docs use decode (tg256) for any A/B comparison. A different RTX 5090, different driver, different CUDA build, or different llama.cpp commit will produce different numbers.

Building from source

# Inside the dev container, or with CUDA 13.2+ on the host:
cmake -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build -j$(nproc)
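
If configuring on a machine where no GPU is visible (a CI runner, say), pinning the CUDA architecture explicitly may help. Whether imp's CMakeLists honors CMAKE_CUDA_ARCHITECTURES rather than hardcoding sm_120f is an assumption; treat this as a sketch and check docs/usage.md for the supported build options.

# Assumes the build respects CMAKE_CUDA_ARCHITECTURES; 120 = Blackwell sm_120
cmake -B build -DCMAKE_BUILD_TYPE=Release -DCMAKE_CUDA_ARCHITECTURES=120
cmake --build build -j$(nproc)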

Full build options, test commands, and verify-gate setup: docs/usage.md.

Documentation

| Document | Description |
|---|---|
| Usage & reference | Build, server, CLI, C API |
| Supported models | Tested model families with VRAM + tok/s |
| Quantization | GGUF Q*_K, NVFP4, MXFP4, FP8 KV: formats, pipelines, trade-offs |
| Performance | Decode + prefill throughput, methodology |
| imp.conf reference | All runtime configuration keys |
| sm_120 kernels | Kernel optimization notes |
| Roadmap | Open bugs and in-flight performance work |
| Changelog | Per-release notes |

Contributing

See CONTRIBUTING.md for build, test, and PR workflow.

License

MIT — see LICENSE.

Acknowledgements

Built by @kekzl with Claude Code as a long-running experiment.

Stands on the shoulders of llama.cpp — the GGUF format, the GGML quantization schemes, and most of the practical conventions for local LLM inference were established there.

Heavy use of CUTLASS for SM120 FMHA, NVFP4 / MXFP4 GEMM, and grouped MoE kernels. Other references: Flash Attention 2, EAGLE, NVIDIA Model Optimizer, llm-compressor.