Releases: kekzl/imp
v0.7.0 — Long-context correctness + Gemma-4/GDN stabilization
Big correctness + platform release covering 195 commits since v0.6. The long-context dispatch cliff is gone, Gemma-4 and the Qwen 3.5 / 3.6 GDN family now produce clean output on Blackwell, CUDA 13.2.1 with stream priorities and mem-sync domains is live, and the StreamingLLM smart-KV mode is available.
Headline
- FP8 FMHA S_tile smem overlap fix (#33) — `pp > 1024` now coherent across every tested architecture. Previously all attention layers emitted NaN above the cuBLAS dispatch boundary. Up to ×1.70 vs llama.cpp at pp=8192 on Qwen3-4B.
- Qwen 3.5 / 3.6 GDN stabilization (#28, #30) — `gdn_scan_fused_kernel` `__launch_bounds__(HD, 2)` miscompile fixed, partial-RoPE pair-offset fixed, `ssm_state_dtype` never auto-downgraded for GDN (the FP32 scan was overflowing into the next layer's state). Qwen 3.6 tg256: 36 → 57 tok/s.
- Gemma-4 suite — CUDA graphs on the decode fast-path (#11-#14), `rope_freqs` on global layers (#20), SWA long-context (#21), host-resident MoE gate_up split (e879bcd), split-K cp.async chunk loop for `head_dim=512`. Q4_K_M decode: 55 → 183 tok/s (×1.21 vs llama.cpp).
- Platform — CUDA 13.2.1 base images (#16), stream priorities + mem-sync domains + cluster spread (#17), StreamingLLM smart KV cache (#26), weight-storage refactor with `TensorKind` + `StoragePlanner` + `gemm_dispatch` (#27), CUTLASS 3.x NVFP4 Grouped GEMM scaffold (#22), `ModelArch::QWEN36_MOE` scaffold (#23).
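On the stream priorities + mem-sync domains item (#17): the sketch below shows how these are typically wired up with the plain CUDA runtime API (CUDA 12.0+). The stream roles and domain assignment here are illustrative, not imp's actual wiring.

```cpp
// Sketch: a high-priority decode stream plus a background copy stream in a
// separate memory-synchronization domain (illustrative, not imp's code).
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int least, greatest;  // "greatest" priority is the numerically lowest value
    cudaDeviceGetStreamPriorityRange(&least, &greatest);

    // Latency-critical decode work goes on the highest-priority stream.
    cudaStream_t decode_stream, copy_stream;
    cudaStreamCreateWithPriority(&decode_stream, cudaStreamNonBlocking, greatest);
    cudaStreamCreateWithPriority(&copy_stream,   cudaStreamNonBlocking, least);

    // Put the copy stream in the "remote" mem-sync domain so its memory
    // fences are not ordered against the decode stream's kernels.
    cudaStreamAttrValue attr{};
    attr.memSyncDomain = cudaLaunchMemSyncDomainRemote;
    cudaStreamSetAttribute(copy_stream, cudaLaunchAttributeMemSyncDomain, &attr);

    printf("stream priority range: least=%d greatest=%d\n", least, greatest);
    cudaStreamDestroy(decode_stream);
    cudaStreamDestroy(copy_stream);
    return 0;
}
```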
Long-Context Prefill (new — pp=8192)
Previously broken. Now functional and ahead of llama.cpp on every tested model:
| Model | imp v0.7 (tok/s) | llama.cpp (tok/s) | Speedup |
|---|---|---|---|
| Qwen3-4B Q8_0 | 13,566 | 7,978 | ×1.70 |
| Qwen3-8B Q8_0 | 11,050 | 6,749 | ×1.64 |
| Qwen3.5-4B GDN Q8_0 | 13,090 | — | — |
| Mistral-24B Q6_K | 3,595 | 3,058 | ×1.18 |
| Qwen3-32B Q4_K_M | 2,040 | 1,802 | ×1.13 |
Diagnostic / infra
- `IMP_DEBUG_RAW` meta-flag (#29), `IMP_EXPERT_OVERHEAD_PCT` hint on graph disable (#32)
- `tools/analysis/layer_diff.py` — per-layer tensor diff vs llama.cpp
- `Gemma4GraphsTest` e2e regression
- `FmhaFP8Test.Qwen35LikeHD256_GQA41_SeqMultiTile` — catches the bug class from #33
Known issues (carried from CHANGELOG)
- Qwen3-Coder-30B-A3B NVFP4 still needs `--no-cuda-graphs` (general-MoE D2H routing is graph-incompatible; Gemma-4 excepted via its decode fast-path).
- Prefill throughput has up to 2.6× variance between container restarts due to cuBLAS autotuning — compare decode-only for reliable A/B testing.
- 1024→2048 throughput dip on small dense models (Qwen3-4B: 27k → 19k tok/s at the dispatch boundary). Output correct; smoothing is future work.
- MXFP4 GGUFs use imp-proprietary tensor-type 31 — llama.cpp reads it as the removed `Q4_0_4_4`, so cross-tool PPL comparison is not possible without a standard-format export.
Full changelog
See CHANGELOG.md for the complete Keep-a-Changelog entry.
v0.6 — Qwen3.5, MXFP4, Jinja2 Macros, HuggingFace Hub
Highlights
Qwen3.5 (Gated DeltaNet) now works correctly. The root cause was missing Jinja2 `{% macro %}` support — Qwen3.5's chat template uses macros for multimodal content handling. Without macro support, user prompts rendered as "None" and the model ignored all input. Fixed with full Jinja2 macro support (`MacroNode`, `parse_macro`, `call_macro` with positional args, kwargs, and defaults).
Native MXFP4 GGUF weight format. Tensor-core-native 4-bit weights (FP4 E2M1 + UE8M0 block scales) feed directly into Blackwell's CUTLASS block-scaled GEMM — zero dequant overhead. Includes a Python converter (tools/convert_mxfp4.py) and full runtime integration with FP16 decode fallback.
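For reference on the format itself: MXFP4 packs 32 FP4 E2M1 values per block with one shared UE8M0 (power-of-two) scale, which is where the 4.25 bits/weight figure comes from (32×4 + 8 bits over 32 weights). Below is a minimal host-side dequantization sketch under that standard MX layout; the nibble packing order is an assumption, not necessarily what `tools/convert_mxfp4.py` emits.

```cpp
#include <cstdint>
#include <cmath>

// FP4 E2M1 magnitudes for codes 0..7; bit 3 of each nibble is the sign.
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Dequantize one MXFP4 block: 16 bytes of packed FP4 values (two per byte)
// plus one UE8M0 scale byte, interpreted as scale = 2^(e - 127).
void dequant_mxfp4_block(const uint8_t packed[16], uint8_t scale_e8m0,
                         float out[32]) {
    const float scale = std::ldexp(1.0f, int(scale_e8m0) - 127);
    for (int i = 0; i < 16; ++i) {
        const uint8_t lo = packed[i] & 0x0F;         // first element of the pair
        const uint8_t hi = (packed[i] >> 4) & 0x0F;  // second element (assumed order)
        out[2 * i + 0] = (lo & 0x8 ? -1.0f : 1.0f) * kE2M1[lo & 0x7] * scale;
        out[2 * i + 1] = (hi & 0x8 ? -1.0f : 1.0f) * kE2M1[hi & 0x7] * scale;
    }
}
```

On Blackwell the block-scaled CUTLASS GEMM consumes the packed nibbles and scales directly, so this dequant only runs on the FP16 decode fallback path.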
HuggingFace Hub integration. Load models directly from HuggingFace by repo ID instead of local paths. GPTQ SafeTensors dequant and tokenizer.json loader included.
Performance (RTX 5090, CUDA 13.2)
| Model | Quant | Decode (tok/s) | Prefill (tok/s) |
|---|---|---|---|
| Qwen3-4B | Q8_0 | 377 | 27,201 |
| Qwen3-8B | Q8_0 | 255 | 17,636 |
| Qwen3.5-4B (GDN) | Q8_0 | 306 | 14,823 |
| Qwen3.5-9B (GDN) | Q8_0 | 134 | 8,520 |
| Llama-3.2-3B | Q8_0 | 208 | 22,544 |
What's New
Features
- Jinja2 macro support — `{% macro name(args) %}...{% endmacro %}` with positional/keyword args and defaults
- Native MXFP4 GGUF — `GGML_TYPE_MXFP4` (type 31), 4.25 bits/weight, CUTLASS tensor-core GEMM
- MXFP4 converter — `tools/convert_mxfp4.py` (HuggingFace BF16/FP16 → MXFP4 GGUF)
- HuggingFace Hub — load models by repo ID (`--model Qwen/Qwen3-8B`)
- GPTQ dequant — SafeTensors GPTQ models load with on-the-fly dequantization
- tokenizer.json loader — HuggingFace tokenizer format support
- N-gram speculative decoding — `--ngram-spec` CLI/server flag, multi-sequence decode verify
- Jinja2 engine improvements — slice, is-tests (string/iterable/mapping/number), strip(chars), tojson filter
- `--min-kv-tokens` — guaranteed KV cache capacity before weight cache allocation
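The n-gram speculative decoding item above is, at its core, prompt-lookup drafting: reuse tokens that already appeared in the context as the draft sequence. Below is a generic sketch of the lookup half, assuming a simple "most recent exact suffix match" policy; imp's actual heuristics and the multi-sequence verify step are not shown.

```cpp
#include <cstdint>
#include <vector>

// Propose up to `max_draft` tokens by finding the most recent earlier
// occurrence of the last `n` context tokens and copying what followed it.
std::vector<int32_t> ngram_draft(const std::vector<int32_t>& ctx,
                                 size_t n, size_t max_draft) {
    if (ctx.size() <= n) return {};
    const size_t suffix_start = ctx.size() - n;
    // Walk backwards so the most recent match wins.
    for (size_t pos = suffix_start; pos-- > 0; ) {
        bool match = true;
        for (size_t j = 0; j < n; ++j)
            if (ctx[pos + j] != ctx[suffix_start + j]) { match = false; break; }
        if (!match) continue;
        // Draft the tokens that followed this earlier occurrence.
        std::vector<int32_t> draft;
        for (size_t k = pos + n; k < ctx.size() && draft.size() < max_draft; ++k)
            draft.push_back(ctx[k]);
        return draft;
    }
    return {};  // no match: fall back to normal decoding
}
```

The drafted tokens are then checked against the model's own predictions in a batched decode step, and only the matching prefix is accepted.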
Bug Fixes
- Qwen3.5 chat template — Jinja2 macro support fixes "None" content rendering
- N-gram spec verify — replaced pseudo-prefill (KV divergence) with multi-sequence decode verify
- Gemma-3 multi-turn — three root causes fixed (cuBLAS cache, softcap, banned tokens)
- CMake sm_120/120f conflict — skip 120f gencode when 120 already in CMAKE_CUDA_ARCHITECTURES
- GDN L2 norm epsilon — fused kernel (1e-12) now matches decode kernel (1e-6)
- Think token banning — token_type metadata from GGUF for correct `<think>` handling
- Server defaults — default context length, strip banned tokens from output
Infrastructure
- CUDA 13.2, CUTLASS v4.4.2, GoogleTest v1.17.0, cpp-httplib v0.40.0, nlohmann/json v3.12.0
- Dead EAGLE-3 code removed
- TODO.md refreshed with current status
Breaking Changes
None. GGUF models from v0.5.1 continue to work unchanged.
Tested Models
Qwen3-4B, Qwen3-8B, Qwen3-32B, Qwen3-Coder-30B (MoE), Qwen3.5-4B, Qwen3.5-9B, Qwen3.5-27B, Qwen3.5-35B (GDN), Gemma-3-12B/27B, DeepSeek-R1-7B/14B, Nemotron-3-Nano-30B, Llama-3.2-3B, Llama 3.1 8B, Mistral 7B, Mixtral 8x7B, Devstral
Quickstart
```bash
docker compose build imp-server
docker run --gpus all -v ./models:/models -p 8080:8080 \
  imp:latest --model /models/Qwen3-8B-Q8_0.gguf
```

v0.5.1: Fix GDN multi-turn chat
What's Fixed
GDN (Qwen3.5) multi-turn chat — conversations with 2+ turns produced degenerate output (repeated tokens, garbage). llama.cpp worked correctly. Root cause: FP8 E4M3 weight quantization (3-bit mantissa) causes precision errors that accumulate through the GDN delta rule scan when processing repeated chat template special tokens.
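To make the precision argument concrete: FP8 E4M3 keeps only a 3-bit mantissa, so a single quantize/dequantize round trip can already cost a few percent of relative error, and the delta-rule scan then compounds that error token by token. A tiny round-trip check using the stock `cuda_fp8.h` conversions (illustrative values; compile with nvcc):

```cpp
#include <cuda_fp8.h>
#include <cstdio>
#include <cmath>

int main() {
    // Round-trip a few values through FP8 E4M3 and print the relative error.
    const float vals[] = {0.07f, 0.9f, 1.23f, 17.0f, 150.0f};
    for (float v : vals) {
        float back = float(__nv_fp8_e4m3(v));   // quantize, then dequantize
        printf("%8.3f -> %8.3f  (rel err %.2f%%)\n",
               v, back, 100.0f * std::fabs(back - v) / v);
    }
    return 0;
}
```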
Changes
- FP16 prefill weights for GDN: Auto-detected, ~8% prefill throughput reduction vs FP8, but correct multi-turn output
- Chunked prefill state carry-forward: Recurrent state no longer reset between prefill chunks
- Conv1d chunk boundary fix: Reads previous chunk context instead of zero-padding
- Prefix caching guard: Disabled for recurrent models (token skipping breaks sequential state)
Benchmarks (RTX 5090)
| Model | Decode (tok/s) | Prefill (tok/s) |
|---|---|---|
| Qwen3-4B | 375 | 24,055 |
| Qwen3-8B | 255 | 17,746 |
| Qwen3.5-4B (GDN) | 308 | 14,687 |
| Qwen3.5-9B (GDN) | 134 | 8,418 |
| Gemma-3-12B | 129 | 6,998 |
Multi-Turn Quality
| Scenario | v0.5 | v0.5.1 |
|---|---|---|
| 5-turn chat (4B) | ❌ garbage | ✅ correct |
| 7-turn chat (9B) | ❌ garbage | ✅ correct |
Full Changelog: v0.5...v0.5.1
v0.4.1 — Qwen3.5 9B fix + cuBLASLt robustness
Bug Fixes
Qwen3.5 9B+ model quality fix:
- NVFP4 (4-bit) decode cache auto-disabled for GDN (Gated DeltaNet) models. The delta rule scan accumulates quantization error in the recurrent state across tokens — NVFP4 caused repeated `<|im_start|>` tokens on Qwen3.5-9B and garbage output on 27B. FP8 prefill + dp4a Q8_0 decode preserves enough precision.
- Qwen3.5-4B was unaffected (smaller weight matrices tolerate 4-bit), but the fix applies globally to all GDN models for safety.
cuBLASLt crash-to-fallback:
`cublasLtMatmul` failures (CUDA 13.2 status 7 on sm_120 for certain M/K/N) now fall back to `cublasGemmEx` instead of silently continuing with corrupted output. Affects all three cuBLASLt paths (generic GEMM, INT compute, FP8-scaled).
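A sketch of the crash-to-fallback shape described above, with all descriptor/layout setup elided and passed in by the caller — the function name and the exact fallback parameters are illustrative, not imp's real code path:

```cpp
#include <cublasLt.h>
#include <cublas_v2.h>
#include <cstdio>

// Try cublasLtMatmul first; on any non-success status (e.g. the status-7
// cases noted above) fall back to plain cublasGemmEx instead of continuing
// with an undefined output buffer.
cublasStatus_t gemm_with_fallback(
    cublasLtHandle_t lt, cublasHandle_t legacy, cudaStream_t stream,
    cublasLtMatmulDesc_t op, const float* alpha, const float* beta,
    const void* A, cublasLtMatrixLayout_t Ad,
    const void* B, cublasLtMatrixLayout_t Bd,
    void* C, cublasLtMatrixLayout_t Cd,
    void* workspace, size_t workspace_size,
    // shape/type info needed only by the legacy fallback path:
    int m, int n, int k, int lda, int ldb, int ldc,
    cudaDataType ab_type, cudaDataType c_type) {
    cublasStatus_t st = cublasLtMatmul(
        lt, op, alpha, A, Ad, B, Bd, beta, C, Cd, C, Cd,
        /*algo=*/nullptr, workspace, workspace_size, stream);
    if (st == CUBLAS_STATUS_SUCCESS) return st;

    fprintf(stderr, "cublasLtMatmul failed (status %d), using cublasGemmEx\n", (int)st);
    cublasSetStream(legacy, stream);
    return cublasGemmEx(legacy, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                        alpha, A, ab_type, lda,
                               B, ab_type, ldb,
                        beta,  C, c_type,  ldc,
                        CUBLAS_COMPUTE_32F, CUBLAS_GEMM_DEFAULT);
}
```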
Performance
Qwen3.5 decode (RTX 5090, Q8_0):
| Model | v0.4 | v0.4.1 | Note |
|---|---|---|---|
| Qwen3.5-4B | 327 tok/s | 253 tok/s | No NVFP4 → dp4a path (correct before and after) |
| Qwen3.5-9B | ❌ broken | 136 tok/s | Was producing garbage, now works |
| Qwen3.5-27B | ~12 tok/s | 31 tok/s | VRAM-limited (27 GB model on 32 GB card) |
Non-GDN models unaffected — Qwen3, Gemma-3, LLaMA, etc. continue using NVFP4 at full speed.
Other
- Async sampling with pinned host memory (truly async `cudaMemcpyAsync`)
- Batched logprobs D2H: single `cudaStreamSynchronize` for N sequences instead of N syncs
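A rough illustration of the two items above (buffer handling simplified, names made up): stage every sequence's logprob row into one pinned buffer with async copies on a single stream, then synchronize once.

```cpp
#include <cuda_runtime.h>
#include <vector>

// Gather per-sequence logprob rows from device to host with one sync.
// `device_rows[i]` points at n_vocab floats on the GPU for sequence i.
void gather_logprobs(const std::vector<const float*>& device_rows,
                     size_t n_vocab, cudaStream_t stream,
                     std::vector<float>& host_out) {
    const size_t n_seq = device_rows.size();
    float* pinned = nullptr;
    // Pinned (page-locked) memory is required for cudaMemcpyAsync to be
    // truly asynchronous with respect to the host.
    cudaHostAlloc(&pinned, n_seq * n_vocab * sizeof(float), cudaHostAllocDefault);

    for (size_t i = 0; i < n_seq; ++i)
        cudaMemcpyAsync(pinned + i * n_vocab, device_rows[i],
                        n_vocab * sizeof(float), cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);  // one sync for all N sequences

    host_out.assign(pinned, pinned + n_seq * n_vocab);
    cudaFreeHost(pinned);  // a real implementation would reuse one persistent buffer
}
```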
v0.4 — Qwen3.5 Gated DeltaNet support
What's New
Qwen3.5 (Gated DeltaNet) architecture support — 4B, 27B, 35B-A3B MoE models now work with correct, coherent output. Fused CUDA kernels make imp significantly faster than llama.cpp on this architecture.
Benchmarks (Qwen3.5-4B Q8_0, RTX 5090)
| Metric | imp v0.4 | llama.cpp b8445 | Speedup |
|---|---|---|---|
| Decode (tg128) | 327 tok/s | 180 tok/s | +82% |
| Prefill (pp512) | 16,017 tok/s | 11,149 tok/s | +44% |
| Prefill (pp128) | 5,799 tok/s | 6,136 tok/s | ~1x |
Key Changes
Correctness fixes:
- RoPE frequency base for partial RoPE (`rope_dim < head_dim`) — affected all attention layers in Qwen3.5
- `post_attn_norm` misplacement — was applied inside attention output instead of as pre-FFN norm
- Conv1d decode buffer aliasing (FP16/FP32 shared buffer)
- BOS token default for GPT2 tokenizers
Performance optimizations:
- Fused multi-token GDN scan kernel with register-cached recurrent state (125x less memory traffic)
- Fused RMSNormGated+SiLU kernel (8192 → 1 kernel launch for pp128)
- Fused conv1d+SiLU+FP32 prefill kernel
- CUDA graphs enabled for GDN decode
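To illustrate what "register-cached recurrent state" buys: in a recurrent scan the per-head state matrix is the hot data, and keeping each thread's slice of it in registers across the whole token loop avoids re-reading it from global memory every token. The kernel below is a deliberately simplified linear-attention-style toy (plain decayed outer-product update), not imp's gated delta rule.

```cuda
#include <cuda_runtime.h>

// Launch with one block per (sequence, head) and blockDim.x == HEAD_DIM,
// e.g. toy_recurrent_scan<64><<<n_heads, 64, 0, stream>>>(...).
// Each thread owns one row of the [HEAD_DIM x HEAD_DIM] recurrent state S
// and keeps it in registers for the whole token loop (small HEAD_DIM only).
template <int HEAD_DIM>
__global__ void toy_recurrent_scan(const float* __restrict__ k,    // [T, HEAD_DIM]
                                   const float* __restrict__ v,    // [T, HEAD_DIM]
                                   const float* __restrict__ q,    // [T, HEAD_DIM]
                                   const float* __restrict__ gate, // [T]
                                   float* __restrict__ out,        // [T, HEAD_DIM]
                                   int T) {
    const int row = threadIdx.x;     // which state row this thread owns
    float S[HEAD_DIM] = {0.0f};      // register-resident state row

    __shared__ float k_s[HEAD_DIM], q_s[HEAD_DIM];

    for (int t = 0; t < T; ++t) {
        k_s[row] = k[t * HEAD_DIM + row];
        q_s[row] = q[t * HEAD_DIM + row];
        __syncthreads();

        const float g  = gate[t];
        const float vt = v[t * HEAD_DIM + row];

        // S = g * S + v_t k_t^T   (decayed outer-product state update)
        // o_t = S q_t             (read the output back out of the state)
        float o = 0.0f;
        #pragma unroll
        for (int c = 0; c < HEAD_DIM; ++c) {
            S[c] = g * S[c] + vt * k_s[c];
            o += S[c] * q_s[c];
        }
        out[t * HEAD_DIM + row] = o;
        __syncthreads();  // keep k_s/q_s stable until everyone is done
    }
}
```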
No regressions on existing models (Qwen3, Qwen3-MoE, Gemma-3, DeepSeek, Nemotron-H).
Supported Qwen3.5 Models
- `Qwen3.5-4B` (Q8_0, Q6_K, Q4_K_M)
- `Qwen3.5-27B` (Q8_0, Q6_K, Q4_K_M)
- `Qwen3.5-35B-A3B` MoE (Q6_K)
Requirements
- NVIDIA GPU with sm_90+ (Hopper/Blackwell)
- CUDA Toolkit 13.2+
- Docker with NVIDIA Container Toolkit
v0.2
Full Changelog: https://github.com/kekzl/imp/commits/v0.2