
Releases: kekzl/imp

v0.7.0 — Long-context correctness + Gemma-4/GDN stabilization

23 Apr 11:56
9e68101


Big correctness + platform release covering 195 commits since v0.6. The long-context dispatch cliff is gone, Gemma-4 and the Qwen 3.5 / 3.6 GDN family now produce clean output on Blackwell, CUDA 13.2.1 with stream priorities and mem-sync domains is live, and the StreamingLLM smart-KV mode is available.

Headline

  • FP8 FMHA S_tile smem overlap fix (#33) — pp > 1024 now coherent across every tested architecture. Previously all attention layers emitted NaN above the cuBLAS dispatch boundary. Up to ×1.70 vs llama.cpp at pp=8192 on Qwen3-4B.
  • Qwen 3.5 / 3.6 GDN stabilization (#28, #30) — gdn_scan_fused_kernel __launch_bounds__(HD, 2) miscompile fixed, partial-RoPE pair-offset fixed, ssm_state_dtype never auto-downgraded for GDN (the FP32 scan was overflowing into the next layer's state). Qwen 3.6 tg256 36 → 57 tok/s.
  • Gemma-4 suite — CUDA graphs on the decode fast-path (#11-#14), rope_freqs on global layers (#20), SWA long-context (#21), host-resident MoE gate_up split (e879bcd), split-K cp.async chunk loop for head_dim=512. Q4_K_M decode 55 → 183 tok/s (×1.21 vs llama.cpp).
  • Platform — CUDA 13.2.1 base images (#16), stream priorities + mem-sync domains + cluster spread (#17), StreamingLLM smart KV cache (#26; a sketch of the retention policy follows this list), weight-storage refactor with TensorKind + StoragePlanner + gemm_dispatch (#27), CUTLASS 3.x NVFP4 Grouped GEMM scaffold (#22), ModelArch::QWEN36_MOE scaffold (#23).
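
The smart KV cache builds on the StreamingLLM observation that attention concentrates on a few initial "sink" tokens plus a recent window. A minimal Python sketch of that retention policy (the function name and defaults are illustrative, not imp's actual heuristics):

```python
def streaming_kv_keep(n_tokens: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Token indices a StreamingLLM-style cache retains: the first `n_sink`
    attention-sink tokens plus the `window` most recent tokens; everything
    in between is evicted. Both parameters are illustrative defaults."""
    if n_tokens <= n_sink + window:
        return list(range(n_tokens))      # cache not full yet; keep everything
    return list(range(n_sink)) + list(range(n_tokens - window, n_tokens))
```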

Long-Context Prefill (new — pp=8192)

Previously broken. Now functional and ahead of llama.cpp on every tested model:

Model | imp v0.7 (tok/s) | llama.cpp (tok/s) | Speedup
--- | --- | --- | ---
Qwen3-4B Q8_0 | 13,566 | 7,978 | ×1.70
Qwen3-8B Q8_0 | 11,050 | 6,749 | ×1.64
Qwen3.5-4B GDN Q8_0 | 13,090 | — | —
Mistral-24B Q6_K | 3,595 | 3,058 | ×1.18
Qwen3-32B Q4_K_M | 2,040 | 1,802 | ×1.13

Diagnostic / infra

  • IMP_DEBUG_RAW meta-flag (#29), IMP_EXPERT_OVERHEAD_PCT hint on graph disable (#32)
  • tools/analysis/layer_diff.py — per-layer tensor diff vs llama.cpp
  • Gemma4GraphsTest e2e regression
  • FmhaFP8Test.Qwen35LikeHD256_GQA41_SeqMultiTile — catches the bug class from #33

Known issues (carried from CHANGELOG)

  • Qwen3-Coder-30B-A3B NVFP4 still needs --no-cuda-graphs (general-MoE D2H routing is graph-incompatible; Gemma-4 excepted via its decode fast-path).
  • Prefill throughput has up to 2.6× variance between container restarts due to cuBLAS autotuning — compare decode-only for reliable A/B testing.
  • 1024→2048 throughput dip on small dense models (Qwen3-4B: 27k → 19k tok/s at the dispatch boundary). Output correct; smoothing is future work.
  • MXFP4 GGUFs use imp-proprietary tensor-type 31 — llama.cpp reads it as the removed Q4_0_4_4, so cross-tool PPL comparison is not possible without a standard-format export.

Full changelog

See CHANGELOG.md for the complete Keep-a-Changelog entry.


v0.6 — Qwen3.5, MXFP4, Jinja2 Macros, HuggingFace Hub

02 Apr 19:27


Highlights

Qwen3.5 (Gated DeltaNet) now works correctly. The root cause was a missing Jinja2 {% macro %} feature — Qwen3.5's chat template uses macros for multimodal content handling. Without macro support, user prompts rendered as "None" and the model ignored all input. Fixed with full Jinja2 macro support (MacroNode, parse_macro, call_macro with positional args, kwargs, and defaults).
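
To see why missing macros break rendering, here is the failure shape reproduced with the reference Python jinja2 library (the template is illustrative, not Qwen3.5's actual one): a macro normalizes `content` whether it is a plain string or a list of multimodal parts, and an engine that does not register macro definitions renders the call site as "None".

```python
from jinja2 import Environment

template = Environment().from_string(
    "{% macro render_content(content) %}"
    "{%- if content is string -%}{{ content }}"
    "{%- else -%}{% for part in content %}{{ part['text'] }}{% endfor %}"
    "{%- endif -%}"
    "{% endmacro %}"
    "<|im_start|>user\n{{ render_content(message) }}<|im_end|>"
)

print(template.render(message="hello"))                          # plain string
print(template.render(message=[{"text": "hi"}, {"text": "!"}]))  # multimodal parts
```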

Native MXFP4 GGUF weight format. Tensor-core-native 4-bit weights (FP4 E2M1 + UE8M0 block scales) feed directly into Blackwell's CUTLASS block-scaled GEMM — zero dequant overhead. Includes a Python converter (tools/convert_mxfp4.py) and full runtime integration with FP16 decode fallback.
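
For concreteness, here is how one MX block decodes under the OCP Microscaling layout this format is built on: 32 FP4 E2M1 codes share a single UE8M0 power-of-two scale, which is where the 4.25 bits/weight figure comes from ((32·4 + 8) / 32). The numpy sketch below assumes codes are already unpacked one per byte; imp's actual nibble packing inside type-31 tensors may differ:

```python
import numpy as np

# FP4 E2M1 magnitudes for codes 0..7; the sign sits in bit 3.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)

def decode_mx_block(codes: np.ndarray, scale_byte: int) -> np.ndarray:
    """Decode one 32-element MX block from 4-bit codes plus a shared scale."""
    sign = np.where(codes & 0x8, np.float32(-1.0), np.float32(1.0))
    magnitude = E2M1[codes & 0x7]
    scale = np.float32(2.0) ** (int(scale_byte) - 127)   # UE8M0: biased power of two
    return sign * magnitude * scale
```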

HuggingFace Hub integration. Load models directly from HuggingFace by repo ID instead of local paths. GPTQ SafeTensors dequant and tokenizer.json loader included.

Performance (RTX 5090, CUDA 13.2)

Model | Quant | Decode (tok/s) | Prefill (tok/s)
--- | --- | --- | ---
Qwen3-4B | Q8_0 | 377 | 27,201
Qwen3-8B | Q8_0 | 255 | 17,636
Qwen3.5-4B (GDN) | Q8_0 | 306 | 14,823
Qwen3.5-9B (GDN) | Q8_0 | 134 | 8,520
Llama-3.2-3B | Q8_0 | 208 | 22,544

What's New

Features

  • Jinja2 macro support — {% macro name(args) %}...{% endmacro %} with positional/keyword args and defaults
  • Native MXFP4 GGUF — GGML_TYPE_MXFP4 (type 31), 4.25 bits/weight, CUTLASS tensor-core GEMM
  • MXFP4 converter — tools/convert_mxfp4.py (HuggingFace BF16/FP16 → MXFP4 GGUF)
  • HuggingFace Hub — load models by repo ID (--model Qwen/Qwen3-8B)
  • GPTQ dequant — SafeTensors GPTQ models load with on-the-fly dequantization
  • tokenizer.json loader — HuggingFace tokenizer format support
  • N-gram speculative decoding — --ngram-spec CLI/server flag, multi-sequence decode verify (see the sketch after this list)
  • Jinja2 engine improvements — slice, is-tests (string/iterable/mapping/number), strip(chars), tojson filter
  • --min-kv-tokens — guaranteed KV cache capacity before weight cache allocation
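
The n-gram speculative decoding flag above drafts tokens without a draft model. A sketch of the drafting half, assuming it works like standard prompt-lookup decoding (the multi-sequence decode verify is not shown, and `ngram_draft` is an invented name):

```python
def ngram_draft(tokens: list[int], n: int = 3, max_draft: int = 8) -> list[int]:
    """Propose draft tokens by locating the most recent earlier occurrence of
    the trailing n-gram and copying what followed it. Returns [] on no match;
    the target model then verifies (and truncates) the draft in one pass."""
    if len(tokens) <= n:
        return []
    suffix = tokens[-n:]
    for i in range(len(tokens) - n - 1, -1, -1):   # right-to-left, skip the suffix itself
        if tokens[i:i + n] == suffix:
            return tokens[i + n:i + n + max_draft]
    return []
```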

Bug Fixes

  • Qwen3.5 chat template — Jinja2 macro support fixes "None" content rendering
  • N-gram spec verify — replaced pseudo-prefill (KV divergence) with multi-sequence decode verify
  • Gemma-3 multi-turn — three root causes fixed (cuBLAS cache, softcap, banned tokens)
  • CMake sm_120/120f conflict — skip 120f gencode when 120 already in CMAKE_CUDA_ARCHITECTURES
  • GDN L2 norm epsilon — fused kernel (1e-12) now matches decode kernel (1e-6)
  • Think token banning — token_type metadata from GGUF for correct <think> handling
  • Server defaults — default context length, strip banned tokens from output

Infrastructure

  • CUDA 13.2, CUTLASS v4.4.2, GoogleTest v1.17.0, cpp-httplib v0.40.0, nlohmann/json v3.12.0
  • Dead EAGLE-3 code removed
  • TODO.md refreshed with current status

Breaking Changes

None. GGUF models from v0.5.1 continue to work unchanged.

Tested Models

Qwen3-4B, Qwen3-8B, Qwen3-32B, Qwen3-Coder-30B (MoE), Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B (GDN), Gemma-3-12B/27B, DeepSeek-R1-7B/14B, Nemotron-3-Nano-30B, Llama-3.2-3B, Llama 3.1 8B, Mistral 7B, Mixtral 8x7B, Devstral

Quickstart

docker compose build imp-server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3-8B-Q8_0.gguf

v0.5.1: Fix GDN multi-turn chat

28 Mar 01:32


What's Fixed

GDN (Qwen3.5) multi-turn chat — conversations with 2+ turns produced degenerate output (repeated tokens, garbage). llama.cpp worked correctly. Root cause: FP8 E4M3 weight quantization (3-bit mantissa) causes precision errors that accumulate through the GDN delta rule scan when processing repeated chat template special tokens.
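
A toy numpy illustration of that failure mode, with all sizes and names invented and a plain decaying recurrence standing in for the delta-rule scan: quantizing the weights to an E4M3-like 3-bit mantissa injects a few-percent error into every state update, and the recurrent state carries it forward instead of averaging it out.

```python
import numpy as np

def round_e4m3(x: np.ndarray) -> np.ndarray:
    """Crude E4M3 simulation: keep 3 mantissa bits, clamp to the +/-448 range.
    (Ignores subnormals and NaN encoding; enough to show the effect.)"""
    m, e = np.frexp(x)                # x = m * 2**e with |m| in [0.5, 1)
    m = np.round(m * 16.0) / 16.0     # 1 implicit + 3 explicit mantissa bits
    return np.clip(m * np.exp2(e), -448.0, 448.0).astype(np.float32)

rng = np.random.default_rng(0)
W = (0.05 * rng.standard_normal((64, 64))).astype(np.float32)
Wq = round_e4m3(W)

s_ref = np.zeros(64, dtype=np.float32)
s_fp8 = np.zeros(64, dtype=np.float32)
for t in range(1, 513):
    x = rng.standard_normal(64).astype(np.float32)
    s_ref = 0.95 * s_ref + W @ x      # full-precision scan
    s_fp8 = 0.95 * s_fp8 + Wq @ x     # same scan with quantized weights
    if t in (1, 64, 512):
        err = np.linalg.norm(s_fp8 - s_ref) / np.linalg.norm(s_ref)
        print(f"t={t:4d}  relative state error {err:.2e}")
```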

Changes

  • FP16 prefill weights for GDN: Auto-detected, ~8% prefill throughput reduction vs FP8, but correct multi-turn output
  • Chunked prefill state carry-forward: Recurrent state no longer reset between prefill chunks
  • Conv1d chunk boundary fix: Reads previous chunk context instead of zero-padding (see the sketch after this list)
  • Prefix caching guard: Disabled for recurrent models (token skipping breaks sequential state)
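
A numpy sketch of the chunk-boundary behavior (function names invented for illustration): a causal convolution over chunks must prepend the last K−1 inputs of the previous chunk rather than zero-padding, otherwise the first K−1 outputs of every chunk after the first are wrong.

```python
import numpy as np

def causal_conv_chunk(chunk, kernel, carry):
    """Causal 1-D convolution over one prefill chunk. `carry` holds the last
    K-1 inputs of the previous chunk (zeros only for the very first chunk)."""
    K = len(kernel)
    padded = np.concatenate([carry, chunk])
    out = np.array([padded[t:t + K] @ kernel for t in range(len(chunk))])
    return out, padded[-(K - 1):]            # new carry for the next chunk

x = np.random.default_rng(1).standard_normal(1024)
w = np.array([0.1, 0.2, 0.3, 0.4])

carry = np.zeros(len(w) - 1)
outs = []
for chunk in np.split(x, 4):                 # four prefill chunks
    y, carry = causal_conv_chunk(chunk, w, carry)
    outs.append(y)

full, _ = causal_conv_chunk(x, w, np.zeros(len(w) - 1))
assert np.allclose(np.concatenate(outs), full)   # chunked == full-sequence
```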

Benchmarks (RTX 5090)

Model | Decode (tok/s) | Prefill (tok/s)
--- | --- | ---
Qwen3-4B | 375 | 24,055
Qwen3-8B | 255 | 17,746
Qwen3.5-4B (GDN) | 308 | 14,687
Qwen3.5-9B (GDN) | 134 | 8,418
Gemma-3-12B | 129 | 6,998

Multi-Turn Quality

Scenario | v0.5 | v0.5.1
--- | --- | ---
5-turn chat (4B) | ❌ garbage | ✅ correct
7-turn chat (9B) | ❌ garbage | ✅ correct

Full Changelog: v0.5...v0.5.1

v0.4.1 — Qwen3.5 9B fix + cuBLASLt robustness

21 Mar 17:41


Bug Fixes

Qwen3.5 9B+ model quality fix:

  • NVFP4 (4-bit) decode cache auto-disabled for GDN (Gated DeltaNet) models. The delta rule scan accumulates quantization error in the recurrent state across tokens — NVFP4 caused repeated <|im_start|> tokens on Qwen3.5-9B and garbage output on 27B. FP8 prefill + dp4a Q8_0 decode preserves enough precision.
  • Qwen3.5-4B was unaffected (smaller weight matrices tolerate 4-bit), but the fix applies globally to all GDN models for safety.

cuBLASLt crash-to-fallback:

  • cublasLtMatmul failures (CUDA 13.2 status 7 on sm_120 for certain M/K/N) now fall back to cublasGemmEx instead of silently continuing with corrupted output. Affects all three cuBLASLt paths (generic GEMM, INT compute, FP8-scaled).

Performance

Qwen3.5 decode (RTX 5090, Q8_0):

Model | v0.4 | v0.4.1 | Note
--- | --- | --- | ---
Qwen3.5-4B | 327 tok/s | 253 tok/s | No NVFP4 → dp4a path (correct before and after)
Qwen3.5-9B | ❌ broken | 136 tok/s | Was producing garbage, now works
Qwen3.5-27B | ~12 tok/s | 31 tok/s | VRAM-limited (27 GB model on 32 GB card)

Non-GDN models unaffected — Qwen3, Gemma-3, LLaMA, etc. continue using NVFP4 at full speed.

Other

  • Async sampling with pinned host memory (truly async cudaMemcpyAsync)
  • Batched logprobs D2H: single cudaStreamSynchronize for N sequences instead of N syncs

v0.4 — Qwen3.5 Gated DeltaNet support

21 Mar 12:42


What's New

Qwen3.5 (Gated DeltaNet) architecture support — 4B, 27B, 35B-A3B MoE models now work with correct, coherent output. Fused CUDA kernels make imp significantly faster than llama.cpp on this architecture.

Benchmarks (Qwen3.5-4B Q8_0, RTX 5090)

Metric | imp v0.4 | llama.cpp b8445 | Speedup
--- | --- | --- | ---
Decode (tg128) | 327 tok/s | 180 tok/s | +82%
Prefill (pp512) | 16,017 tok/s | 11,149 tok/s | +44%
Prefill (pp128) | 5,799 tok/s | 6,136 tok/s | ~1×

Key Changes

Correctness fixes:

  • RoPE frequency base for partial RoPE (rope_dim < head_dim) — affected all attention layers in Qwen3.5 (see the sketch after this list)
  • post_attn_norm misplacement — was applied inside attention output instead of as pre-FFN norm
  • Conv1d decode buffer aliasing (FP16/FP32 shared buffer)
  • BOS token default for GPT2 tokenizers
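
To pin down the first fix, a numpy sketch of partial RoPE (NeoX-style half-split pairing assumed; v0.7 later corrected a pair-offset detail): the frequency exponent must run over rope_dim, not head_dim, and dimensions past rope_dim pass through unrotated.

```python
import numpy as np

def partial_rope(x: np.ndarray, pos: int, rope_dim: int, base: float = 10000.0):
    """Rotate only the first `rope_dim` components of a head_dim-sized vector."""
    half = rope_dim // 2
    # base**(-2i/rope_dim): the exponent runs over rope_dim (the bug used head_dim)
    inv_freq = base ** (-np.arange(half) / half)
    theta = pos * inv_freq
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[:half], x[half:rope_dim]      # NeoX-style half-split pairing
    out = x.copy()
    out[:half] = x1 * cos - x2 * sin
    out[half:rope_dim] = x1 * sin + x2 * cos
    return out                               # x[rope_dim:] is left unrotated
```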

Performance optimizations:

  • Fused multi-token GDN scan kernel with register-cached recurrent state (125× less memory traffic; a reference scan follows this list)
  • Fused RMSNormGated+SiLU kernel (8192 → 1 kernel launch for pp128)
  • Fused conv1d+SiLU+FP32 prefill kernel
  • CUDA graphs enabled for GDN decode
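
For reference, a naive per-token scan, assuming the published Gated DeltaNet recurrence S_t = a_t · S_{t-1}(I − b_t k_t k_tᵀ) + b_t v_t k_tᵀ with output o_t = S_t q_t. The fused kernel's memory-traffic win comes from keeping S in registers across tokens instead of streaming it through HBM each step; this numpy version only pins down the math:

```python
import numpy as np

def gdn_scan_reference(q, k, v, alpha, beta):
    """Naive gated delta-rule scan. q, k: (T, d_k); v: (T, d_v);
    alpha, beta: (T,). Returns per-token outputs of shape (T, d_v)."""
    T, d_k = k.shape
    d_v = v.shape[1]
    S = np.zeros((d_v, d_k))                 # recurrent state, one per head
    out = np.empty((T, d_v))
    for t in range(T):
        # Decay, erase the value stored under k_t, then write the new one.
        S = alpha[t] * (S - beta[t] * np.outer(S @ k[t], k[t])) \
            + beta[t] * np.outer(v[t], k[t])
        out[t] = S @ q[t]
    return out
```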

No regressions on existing models (Qwen3, Qwen3-MoE, Gemma-3, DeepSeek, Nemotron-H).

Supported Qwen3.5 Models

  • Qwen3.5-4B (Q8_0, Q6_K, Q4_K_M)
  • Qwen3.5-27B (Q8_0, Q6_K, Q4_K_M)
  • Qwen3.5-35B-A3B MoE (Q6_K)

Requirements

  • NVIDIA GPU with sm_90+ (Hopper/Blackwell)
  • CUDA Toolkit 13.2+
  • Docker with NVIDIA Container Toolkit

v0.2

14 Mar 09:09
