Releases: kekzl/imp
v0.7.0 — Long-context correctness + Gemma-4/GDN stabilization
Big correctness + platform release covering 195 commits since v0.6. The long-context dispatch cliff is gone, Gemma-4 and the Qwen 3.5 / 3.6 GDN family now produce clean output on Blackwell, CUDA 13.2.1 with stream priorities and mem-sync domains is live, and the StreamingLLM smart-KV mode is available.
Headline
- FP8 FMHA S_tile smem overlap fix (#33) — `pp > 1024` now coherent across every tested architecture. Previously all attention layers emitted NaN above the cuBLAS dispatch boundary. Up to ×1.70 vs llama.cpp at pp=8192 on Qwen3-4B.
- Qwen 3.5 / 3.6 GDN stabilization (#28, #30) — `gdn_scan_fused_kernel` `__launch_bounds__(HD, 2)` miscompile fixed, partial-RoPE pair-offset fixed, `ssm_state_dtype` never auto-downgraded for GDN (the FP32 scan was overflowing into the next layer's state). Qwen 3.6 tg256: 36 → 57 tok/s.
- Gemma-4 suite — CUDA graphs on the decode fast-path (#11-#14), `rope_freqs` on global layers (#20), SWA long-context (#21), host-resident MoE gate_up split (e879bcd), split-K cp.async chunk loop for `head_dim=512`. Q4_K_M decode: 55 → 183 tok/s (×1.21 vs llama.cpp).
- Platform — CUDA 13.2.1 base images (#16), stream priorities + mem-sync domains + cluster spread (#17), StreamingLLM smart KV cache (#26), weight-storage refactor with `TensorKind` + `StoragePlanner` + `gemm_dispatch` (#27), CUTLASS 3.x NVFP4 Grouped GEMM scaffold (#22), `ModelArch::QWEN36_MOE` scaffold (#23).
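On the stream priorities + mem-sync domains item (#17): the sketch below shows how these are typically wired up with the plain CUDA runtime API (CUDA 12.0+). The stream roles and domain assignment here are illustrative, not imp's actual wiring.

```cpp
// Sketch: a high-priority decode stream plus a background copy stream in a
// separate memory-synchronization domain (illustrative, not imp's code).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int least, greatest;  // "greatest" priority is the numerically lowest value
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    // Latency-critical decode work goes on the highest-priority stream.
    cudaStream_t decode_stream, copy_stream;
    cudaStreamCreateWithPriority(&decode_stream, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&copy_stream,   cudaStreamNonBlocking, least);

    // Put the copy stream in the "remote" mem-sync domain so its memory
    // fences are not ordered against the decode stream's kernels.
    cudaStreamAttrValue attr{};
    attr.memSyncDomain = cudaLaunchMemSyncDomainRemote;
    cudaStreamSetAttribute(copy_stream, cudaLaunchAttributeMemSyncDomain, &attr);

    printf("stream priority range: least=%d greatest=%d\n", least, greatest);
    cudaStreamDestroy(decode_stream);
    cudaStreamDestroy(copy_stream);
    return 0;
}
```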
Long-Context Prefill (new — pp=8192)
Previously broken. Now functional and ahead of llama.cpp on every tested model:
| Model | imp v0.7 (tok/s) | llama.cpp (tok/s) | Speedup |
|---|---|---|---|
| Qwen3-4B Q8_0 | 13,566 | 7,978 | ×1.70 |
| Qwen3-8B Q8_0 | 11,050 | 6,749 | ×1.64 |
| Qwen3.5-4B GDN Q8_0 | 13,090 | — | — |
| Mistral-24B Q6_K | 3,595 | 3,058 | ×1.18 |
| Qwen3-32B Q4_K_M | 2,040 | 1,802 | ×1.13 |
Diagnostic / infra
- `IMP_DEBUG_RAW` meta-flag (#29), `IMP_EXPERT_OVERHEAD_PCT` hint on graph disable (#32)
- `tools/analysis/layer_diff.py` — per-layer tensor diff vs llama.cpp
- `Gemma4GraphsTest` e2e regression
- `FmhaFP8Test.Qwen35LikeHD256_GQA41_SeqMultiTile` — catches the bug class from #33
Known issues (carried from CHANGELOG)
- Qwen3-Coder-30B-A3B NVFP4 still needs `--no-cuda-graphs` (general-MoE D2H routing is graph-incompatible; Gemma-4 excepted via its decode fast-path).
- Prefill throughput has up to 2.6× variance between container restarts due to cuBLAS autotuning — compare decode-only for reliable A/B testing.
- 1024→2048 throughput dip on small dense models (Qwen3-4B: 27k → 19k tok/s at the dispatch boundary). Output correct; smoothing is future work.
- MXFP4 GGUFs use imp-proprietary tensor-type 31 — llama.cpp reads it as the removed `Q4_0_4_4`, so cross-tool PPL comparison is not possible without a standard-format export.
Full changelog
See CHANGELOG.md for the complete Keep-a-Changelog entry.
v0.6 — Qwen3.5, MXFP4, Jinja2 Macros, HuggingFace Hub
Highlights
Qwen3.5 (Gated DeltaNet) now works correctly. The root cause was missing Jinja2 `{% macro %}` support — Qwen3.5's chat template uses macros for multimodal content handling. Without macro support, user prompts rendered as "None" and the model ignored all input. Fixed with full Jinja2 macro support (`MacroNode`, `parse_macro`, `call_macro` with positional args, kwargs, and defaults).
Native MXFP4 GGUF weight format. Tensor-core-native 4-bit weights (FP4 E2M1 + UE8M0 block scales) feed directly into Blackwell's CUTLASS block-scaled GEMM — zero dequant overhead. Includes a Python converter (tools/convert_mxfp4.py) and full runtime integration with FP16 decode fallback.
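For reference on the format itself: MXFP4 packs 32 FP4 E2M1 values per block with one shared UE8M0 (power-of-two) scale, which is where the 4.25 bits/weight figure comes from (32×4 + 8 bits over 32 weights). Below is a minimal host-side dequantization sketch under that standard MX layout; the nibble packing order is an assumption, not necessarily what `tools/convert_mxfp4.py` emits.

```cpp
#include <cstdint>
#include <cmath>

// FP4 E2M1 magnitudes for codes 0..7; bit 3 of each nibble is the sign.
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Dequantize one MXFP4 block: 16 bytes of packed FP4 values (two per byte)
// plus one UE8M0 scale byte, interpreted as scale = 2^(e - 127).
void dequant_mxfp4_block(const uint8_t packed[16], uint8_t scale_e8m0,
                         float out[32]) {
    const float scale = std::ldexp(1.0f, int(scale_e8m0) - 127);
    for (int i = 0; i < 16; ++i) {
        const uint8_t lo = packed[i] & 0x0F;         // first element of the pair
        const uint8_t hi = (packed[i] >> 4) & 0x0F;  // second element (assumed order)
        out[2 * i + 0] = (lo & 0x8 ? -1.0f : 1.0f) * kE2M1[lo & 0x7] * scale;
        out[2 * i + 1] = (hi & 0x8 ? -1.0f : 1.0f) * kE2M1[hi & 0x7] * scale;
    }
}
```

On Blackwell the block-scaled CUTLASS GEMM consumes the packed nibbles and scales directly, so this dequant only runs on the FP16 decode fallback path.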
HuggingFace Hub integration. Load models directly from HuggingFace by repo ID instead of local paths. GPTQ SafeTensors dequant and tokenizer.json loader included.
Performance (RTX 5090, CUDA 13.2)
| Model | Quant | Decode (tok/s) | Prefill (tok/s) |
|---|---|---|---|
| Qwen3-4B | Q8_0 | 377 | 27,201 |
| Qwen3-8B | Q8_0 | 255 | 17,636 |
| Qwen3.5-4B (GDN) | Q8_0 | 306 | 14,823 |
| Qwen3.5-9B (GDN) | Q8_0 | 134 | 8,520 |
| Llama-3.2-3B | Q8_0 | 208 | 22,544 |
What's New
Features
- Jinja2 macro support — `{% macro name(args) %}...{% endmacro %}` with positional/keyword args and defaults
- Native MXFP4 GGUF — `GGML_TYPE_MXFP4` (type 31), 4.25 bits/weight, CUTLASS tensor-core GEMM
- MXFP4 converter — `tools/convert_mxfp4.py` (HuggingFace BF16/FP16 → MXFP4 GGUF)
- HuggingFace Hub — load models by repo ID (`--model Qwen/Qwen3-8B`)
- GPTQ dequant — SafeTensors GPTQ models load with on-the-fly dequantization
- tokenizer.json loader — HuggingFace tokenizer format support
- N-gram speculative decoding — `--ngram-spec` CLI/server flag, multi-sequence decode verify
- Jinja2 engine improvements — slice, is-tests (string/iterable/mapping/number), strip(chars), tojson filter
- `--min-kv-tokens` — guaranteed KV cache capacity before weight cache allocation
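The n-gram speculative decoding item above is, at its core, prompt-lookup drafting: reuse tokens that already appeared in the context as the draft sequence. Below is a generic sketch of the lookup half, assuming a simple "most recent exact suffix match" policy; imp's actual heuristics and the multi-sequence verify step are not shown.

```cpp
#include <cstdint>
#include <vector>

// Propose up to `max_draft` tokens by finding the most recent earlier
// occurrence of the last `n` context tokens and copying what followed it.
std::vector<int32_t> ngram_draft(const std::vector<int32_t>& ctx,
                                 size_t n, size_t max_draft) {
    if (ctx.size() <= n) return {};
    const size_t suffix_start = ctx.size() - n;
    // Walk backwards so the most recent match wins.
    for (size_t pos = suffix_start; pos-- > 0; ) {
        bool match = true;
        for (size_t j = 0; j < n; ++j)
            if (ctx[pos + j] != ctx[suffix_start + j]) { match = false; break; }
        if (!match) continue;
        // Draft the tokens that followed this earlier occurrence.
        std::vector<int32_t> draft;
        for (size_t k = pos + n; k < ctx.size() && draft.size() < max_draft; ++k)
            draft.push_back(ctx[k]);
        return draft;
    }
    return {};  // no match: fall back to normal decoding
}
```

The drafted tokens are then checked against the model's own predictions in a batched decode step, and only the matching prefix is accepted.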
Bug Fixes
- Qwen3.5 chat template — Jinja2 macro support fixes "None" content rendering
- N-gram spec verify — replaced pseudo-prefill (KV divergence) with multi-sequence decode verify
- Gemma-3 multi-turn — three root causes fixed (cuBLAS cache, softcap, banned tokens)
- CMake sm_120/120f conflict — skip 120f gencode when 120 already in CMAKE_CUDA_ARCHITECTURES
- GDN L2 norm epsilon — fused kernel (1e-12) now matches decode kernel (1e-6)
- Think token banning — token_type metadata from GGUF for correct `<think>` handling
- Server defaults — default context length, strip banned tokens from output
Infrastructure
- CUDA 13.2, CUTLASS v4.4.2, GoogleTest v1.17.0, cpp-httplib v0.40.0, nlohmann/json v3.12.0
- Dead EAGLE-3 code removed
- TODO.md refreshed with current status
Breaking Changes
None. GGUF models from v0.5.1 continue to work unchanged.
Tested Models
Qwen3-4B, Qwen3-8B, Qwen3-32B, Qwen3-Coder-30B (MoE), Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B (GDN), Gemma-3-12B/27B, DeepSeek-R1-7B/14B, Nemotron-3-Nano-30B, Llama-3.2-3B, Llama 3.1 8B, Mistral 7B, Mixtral 8x7B, Devstral
Quickstart
```bash
docker compose build imp-server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3-8B-Q8_0.gguf
```

v0.5.1: Fix GDN multi-turn chat
What's Fixed
GDN (Qwen3.5) multi-turn chat — conversations with 2+ turns produced degenerate output (repeated tokens, garbage). llama.cpp worked correctly. Root cause: FP8 E4M3 weight quantization (3-bit mantissa) causes precision errors that accumulate through the GDN delta rule scan when processing repeated chat template special tokens.
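To make the precision argument concrete: FP8 E4M3 keeps only a 3-bit mantissa, so a single quantize/dequantize round trip can already cost a few percent of relative error, and the delta-rule scan then compounds that error token by token. A tiny round-trip check using the stock `cuda_fp8.h` conversions (illustrative values; compile with nvcc):

```cpp
#include <cuda_fp8.h>
#include <cstdio>
#include <cmath>

int main() {
    // Round-trip a few values through FP8 E4M3 and print the relative error.
    const float vals[] = {0.07f, 0.9f, 1.23f, 17.0f, 150.0f};
    for (float v : vals) {
        float back = float(__nv_fp8_e4m3(v));   // quantize, then dequantize
        printf("%8.3f -> %8.3f  (rel err %.2f%%)\n",
               v, back, 100.0f * std::fabs(back - v) / v);
    }
    return 0;
}
```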
Changes
- FP16 prefill weights for GDN: Auto-detected, ~8% prefill throughput reduction vs FP8, but correct multi-turn output
- Chunked prefill state carry-forward: Recurrent state no longer reset between prefill chunks
- Conv1d chunk boundary fix: Reads previous chunk context instead of zero-padding
- Prefix caching guard: Disabled for recurrent models (token skipping breaks sequential state)
Benchmarks (RTX 5090)
| Model | Decode (tok/s) | Prefill (tok/s) |
|---|---|---|
| Qwen3-4B | 375 | 24,055 |
| Qwen3-8B | 255 | 17,746 |
| Qwen3.5-4B (GDN) | 308 | 14,687 |
| Qwen3.5-9B (GDN) | 134 | 8,418 |
| Gemma-3-12B | 129 | 6,998 |
Multi-Turn Quality
| Scenario | v0.5 | v0.5.1 |
|---|---|---|
| 5-turn chat (4B) | ❌ garbage | ✅ correct |
| 7-turn chat (9B) | ❌ garbage | ✅ correct |
Full Changelog: v0.5...v0.5.1
v0.4.1 — Qwen3.5 9B fix + cuBLASLt robustness
Bug Fixes
Qwen3.5 9B+ model quality fix:
- NVFP4 (4-bit) decode cache auto-disabled for GDN (Gated DeltaNet) models. The delta rule scan accumulates quantization error in the recurrent state across tokens — NVFP4 caused repeated `<|im_start|>` tokens on Qwen3.5-9B and garbage output on 27B. FP8 prefill + dp4a Q8_0 decode preserves enough precision.
- Qwen3.5-4B was unaffected (smaller weight matrices tolerate 4-bit), but the fix applies globally to all GDN models for safety.
cuBLASLt crash-to-fallback:
`cublasLtMatmul` failures (CUDA 13.2 status 7 on sm_120 for certain M/K/N) now fall back to `cublasGemmEx` instead of silently continuing with corrupted output. Affects all three cuBLASLt paths (generic GEMM, INT compute, FP8-scaled).
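A sketch of the crash-to-fallback shape described above, with all descriptor/layout setup elided and passed in by the caller — the function name and the exact fallback parameters are illustrative, not imp's real code path:

```cpp
#include <cublasLt.h>
#include <cublas_v2.h>
#include <cstdio>

// Try cublasLtMatmul first; on any non-success status (e.g. the status-7
// cases noted above) fall back to plain cublasGemmEx instead of continuing
// with an undefined output buffer.
cublasStatus_t gemm_with_fallback(
    cublasLtHandle_t lt, cublasHandle_t legacy, cudaStream_t stream,
    cublasLtMatmulDesc_t op, const float* alpha, const float* beta,
    const void* A, cublasLtMatrixLayout_t Ad,
    const void* B, cublasLtMatrixLayout_t Bd,
    void* C, cublasLtMatrixLayout_t Cd,
    void* workspace, size_t workspace_size,
    // shape/type info needed only by the legacy fallback path:
    int m, int n, int k, int lda, int ldb, int ldc,
    cudaDataType ab_type, cudaDataType c_type) {
    cublasStatus_t st = cublasLtMatmul(
        lt, op, alpha, A, Ad, B, Bd, beta, C, Cd, C, Cd,
        /*algo=*/nullptr, workspace, workspace_size, stream);
    if (st == CUBLAS_STATUS_SUCCESS) return st;

    fprintf(stderr, "cublasLtMatmul failed (status %d), using cublasGemmEx\n", (int)st);
    cublasSetStream(legacy, stream);
    return cublasGemmEx(legacy, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        alpha, A, ab_type, lda,
                               B, ab_type, ldb,
                        beta,  C, c_type,  ldc,
                        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```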
Performance
Qwen3.5 decode (RTX 5090, Q8_0):
| Model | v0.4 | v0.4.1 | Note |
|---|---|---|---|
| Qwen3.5-4B | 327 tok/s | 253 tok/s | No NVFP4 → dp4a path (correct before and after) |
| Qwen3.5-9B | ❌ broken | 136 tok/s | Was producing garbage, now works |
| Qwen3.5-27B | ~12 tok/s | 31 tok/s | VRAM-limited (27 GB model on 32 GB card) |
Non-GDN models unaffected — Qwen3, Gemma-3, LLaMA, etc. continue using NVFP4 at full speed.
Other
- Async sampling with pinned host memory (truly async `cudaMemcpyAsync`)
- Batched logprobs D2H: single `cudaStreamSynchronize` for N sequences instead of N syncs
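A rough illustration of the two items above (buffer handling simplified, names made up): stage every sequence's logprob row into one pinned buffer with async copies on a single stream, then synchronize once.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Gather per-sequence logprob rows from device to host with one sync.
// `device_rows[i]` points at n_vocab floats on the GPU for sequence i.
void gather_logprobs(const std::vector<const float*>& device_rows,
                     size_t n_vocab, cudaStream_t stream,
                     std::vector<float>& host_out) {
    const size_t n_seq = device_rows.size();
    float* pinned = nullptr;
    // Pinned (page-locked) memory is required for cudaMemcpyAsync to be
    // truly asynchronous with respect to the host.
    cudaHostAlloc(&pinned, n_seq * n_vocab * sizeof(float), cudaHostAllocDefault);

    for (size_t i = 0; i < n_seq; ++i)
        cudaMemcpyAsync(pinned + i * n_vocab, device_rows[i],
                        n_vocab * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // one sync for all N sequences

    host_out.assign(pinned, pinned + n_seq * n_vocab);
    cudaFreeHost(pinned);  // a real implementation would reuse one persistent buffer
}
```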
v0.4 — Qwen3.5 Gated DeltaNet support
What's New
Qwen3.5 (Gated DeltaNet) architecture support — 4B, 27B, 35B-A3B MoE models now work with correct, coherent output. Fused CUDA kernels make imp significantly faster than llama.cpp on this architecture.
Benchmarks (Qwen3.5-4B Q8_0, RTX 5090)
| Metric | imp v0.4 | llama.cpp b8445 | Speedup |
|---|---|---|---|
| Decode (tg128) | 327 tok/s | 180 tok/s | +82% |
| Prefill (pp512) | 16,017 tok/s | 11,149 tok/s | +44% |
| Prefill (pp128) | 5,799 tok/s | 6,136 tok/s | ~1x |
Key Changes
Correctness fixes:
- RoPE frequency base for partial RoPE (`rope_dim < head_dim`) — affected all attention layers in Qwen3.5
- `post_attn_norm` misplacement — was applied inside attention output instead of as pre-FFN norm
- Conv1d decode buffer aliasing (FP16/FP32 shared buffer)
- BOS token default for GPT2 tokenizers
Performance optimizations:
- Fused multi-token GDN scan kernel with register-cached recurrent state (125x less memory traffic)
- Fused RMSNormGated+SiLU kernel (8192 → 1 kernel launch for pp128)
- Fused conv1d+SiLU+FP32 prefill kernel
- CUDA graphs enabled for GDN decode
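To illustrate what "register-cached recurrent state" buys: in a recurrent scan the per-head state matrix is the hot data, and keeping each thread's slice of it in registers across the whole token loop avoids re-reading it from global memory every token. The kernel below is a deliberately simplified linear-attention-style toy (plain decayed outer-product update), not imp's gated delta rule.

```cuda
#include <cuda_runtime.h>

// Launch with one block per (sequence, head) and blockDim.x == HEAD_DIM,
// e.g. toy_recurrent_scan<64><<<n_heads, 64, 0, stream>>>(...).
// Each thread owns one row of the [HEAD_DIM x HEAD_DIM] recurrent state S
// and keeps it in registers for the whole token loop (small HEAD_DIM only).
template <int HEAD_DIM>
__global__ void toy_recurrent_scan(const float* __restrict__ k,    // [T, HEAD_DIM]
                                   const float* __restrict__ v,    // [T, HEAD_DIM]
                                   const float* __restrict__ q,    // [T, HEAD_DIM]
                                   const float* __restrict__ gate, // [T]
                                   float* __restrict__ out,        // [T, HEAD_DIM]
                                   int T) {
    const int row = threadIdx.x;     // which state row this thread owns
    float S[HEAD_DIM] = {0.0f};      // register-resident state row

    __shared__ float k_s[HEAD_DIM], q_s[HEAD_DIM];

    for (int t = 0; t < T; ++t) {
        k_s[row] = k[t * HEAD_DIM + row];
        q_s[row] = q[t * HEAD_DIM + row];
        __syncthreads();

        const float g  = gate[t];
        const float vt = v[t * HEAD_DIM + row];

        // S = g * S + v_t k_t^T   (decayed outer-product state update)
        // o_t = S q_t             (read the output back out of the state)
        float o = 0.0f;
        #pragma unroll
        for (int c = 0; c < HEAD_DIM; ++c) {
            S[c] = g * S[c] + vt * k_s[c];
            o += S[c] * q_s[c];
        }
        out[t * HEAD_DIM + row] = o;
        __syncthreads();  // keep k_s/q_s stable until everyone is done
    }
}
```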
No regressions on existing models (Qwen3, Qwen3-MoE, Gemma-3, DeepSeek, Nemotron-H).
Supported Qwen3.5 Models
- `Qwen3.5-4B` (Q8_0, Q6_K, Q4_K_M)
- `Qwen3.5-27B` (Q8_0, Q6_K, Q4_K_M)
- `Qwen3.5-35B-A3B` MoE (Q6_K)
Requirements
- NVIDIA GPU with sm_90+ (Hopper/Blackwell)
- CUDA Toolkit 13.2+
- Docker with NVIDIA Container Toolkit
v0.2
Full Changelog: https://github.com/kekzl/imp/commits/v0.2