Releases · EricLBuehler/mistral.rs

02 Apr 18:20

EricLBuehler

v0.8.0

962112d

v0.8.0 Latest

What's Changed

Tweaks to docs and readme by @EricLBuehler in #1854
Upgrade Metal standard from 3.0 to 3.1 by @lizzzcai in #1861
fix stable diffusion readme by @setoelkahfi in #1857
Use cudaforge for kernel build by @guoqingbao in #1856
Bump bytes from 1.11.0 to 1.11.1 by @dependabot[bot] in #1865
Fix accuracy of fused glu metal and cuda impls by @EricLBuehler in #1867
Bump time from 0.3.45 to 0.3.47 by @dependabot[bot] in #1868
Fix for ViT + flash attn case by @EricLBuehler in #1869
Parallel + I/O pipelined ISQ by @EricLBuehler in #1870
Fix gptoss sliding window case with prefix caching by @EricLBuehler in #1871
Change gguf files delimiter to ';' by @synek317 in #1873
GPT-OSS paged attention with sinks support, MoE prefill kernels across CUDA, Metal, and CPU by @EricLBuehler in #1872
Fix streaming sse hang on error event by @EricLBuehler in #1875
Support Qwen 3 Next by @EricLBuehler in #1864
Fix completions ignoring logprobs by @EricLBuehler in #1877
Fixes for Qwen 3 VL family by @EricLBuehler in #1878
Add new quant method: F8Q8 by @EricLBuehler in #1883
fix(docker): install git in CUDA builders for flash-attn-v3 CUTLASS fetch by @glaziermag in #1885
Bump to 0.7.1-alpha.1 by @EricLBuehler in #1880
fix(core): use unix seconds for streaming chunk created timestamp by @glaziermag in #1887
feat: tvos metal support by @setoelkahfi in #1891
Rewrite paged attention for block-level prefix caching with KV gather kernels by @EricLBuehler in #1890
Fix contiguous error with phi3 gguf by @EricLBuehler in #1892
fix(core): handle missing BOS token in calibration path by @glaziermag in #1895
feat: add optional save_file for url image generation response format by @setoelkahfi in #1893
fix(metal): load metallib from memory instead of temp file for sandbox compatibility by @EricLBuehler in #1898
fix(build): enable vendored Swagger UI for offline compilation by @EricLBuehler in #1899
fix(cuda): account for tensor storage offset in GDN kernel launches by @EricLBuehler in #1900
fix(cuda): account for tensor storage offset in moe kernel launches by @EricLBuehler in #1901
Implement GGUF for Mistral3 by @Cooksey99 in #1771
feat(rust sdk): deferred media prefixing, typed errors, and API cleanup, restructure examples by @EricLBuehler in #1904
feat(models): add Voxtral Mini 4B real-time speech recognition model by @EricLBuehler in #1905
fix(ci): add Metal and CUDA+NCCL compile checks by @EricLBuehler in #1907
fix(device_map): pre-allocate masks per device to reduce OOM pressure by @EricLBuehler in #1908
feat(pyo3): release GIL around blocking Runner operations to improve Python SDK by @EricLBuehler in #1909
feat(server-core): make utoipa-swagger-ui an optional feature by @EricLBuehler in #1910
fix(server-core): terminate SSE streams when response channel closes by @EricLBuehler in #1943
ci: disable docs deployment on forks by @haricot in #1942
fix: memory limit constants for 32-bit targets in attention and ISQ by @setoelkahfi in #1933
fix(gguf): verify_arch_any used AND logic instead of OR by @n-engine in #1916
fix(#1934): emulate negative step range in chat templates by @haricot in #1941
fix(vision): correct Qwen VL multi-turn image processing and thinking model token decoding by @EricLBuehler in #1950
Update MCP client documentation link in README by @naufraghi in #1935
feat(models): support Qwen 3.5 model family by @EricLBuehler in #1993
feat(cli): add --uqff-base-model and --uqff-repo-id flags to quantize command by @EricLBuehler in #1994
fix(cli): ensure readme matches older versions by @EricLBuehler in #1995
fix(isq): bits standardize format for numerical isq setting by @EricLBuehler in #1997
fix(metal): upgrade paged-attn to Metal 3.1 for native bfloat16 support by @ljchang in #2010
Small fix for Voxtral: load params.json before config.json if present by @jam10o-new in #1979
Fix UQFF loading for MoE models in Qwen2Loader by @glaziermag in #1977
fix(metal): auto-retry on iOS Metal background GPU permission error by @EricLBuehler in #2015
fix(ring): support Ring backend properly in more models by @EricLBuehler in #2016
fix(cache): set hybrid recurrent state_indices during prompt cache reset by @EricLBuehler in #2017
feat(quant): add MXFP4 ISQ with optimized decode kernels by @EricLBuehler in #2018
refactor(wrapper-crates): reduce duplicated builder and request glue by @EricLBuehler in #2019
fix(docs): duplicate entry in SUMMARY.md breaks docs build by @EricLBuehler in #2020
Implement the Gemma 4 model by @EricLBuehler in #2046
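
The verify_arch_any fix listed above (#1916) comes down to the difference between "matches any" and "matches all". The following is a minimal sketch of that distinction only. The function names and the plain string comparison are assumptions made for illustration, not the project's actual code.

```rust
/// Hypothetical helper (not the real mistral.rs function): returns true if
/// `actual` equals at least one of the expected architectures (OR logic).
fn matches_any_arch(actual: &str, expected: &[&str]) -> bool {
    expected.iter().any(|arch| *arch == actual)
}

/// The buggy shape for comparison: requires `actual` to equal every
/// candidate (AND logic), so it fails as soon as two distinct
/// candidates are listed.
fn matches_all_arch(actual: &str, expected: &[&str]) -> bool {
    expected.iter().all(|arch| *arch == actual)
}
```

With candidates ["qwen2", "qwen3"] and an actual architecture of "qwen3", the OR version accepts the model while the AND version rejects it.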

New Contributors

@lizzzcai made their first contribution in #1861
@setoelkahfi made their first contribution in #1857
@synek317 made their first contribution in #1873
@glaziermag made their first contribution in #1885
@n-engine made their first contribution in #1916
@naufraghi made their first contribution in #1935
@ljchang made their first contribution in #2010
@jam10o-new made their first contribution in #1979

Full Changelog: v0.7.0...v0.8.0

Contributors

naufraghi, haricot, and 11 other contributors

28 Jan 06:10

EricLBuehler

v0.7.0

b5af260

v0.7.0

Highlights

New CLI: mistralrs-cli
Prefix Caching: We have implemented Prefix Caching for PagedAttention (#1750). This significantly accelerates multi-turn conversations and RAG workflows by reusing KV cache for shared prompt prefixes.
Major model expansion: Support for Embedding Gemma, Qwen 3 Embedding, Gemma 3n, GLM-4, Granite Hybrid MoE, GLM-4 MoE, and GLM-4 MoE Lite
Dynamic model loading: The server now supports loading and unloading models at runtime (#1828)
Performance: Added support for CUDA 13.0/13.1 (#1767) and introduced highly optimized fused kernels (GEMV, GLU) and blockwise FP8 kernels for significant speedups on NVIDIA GPUs.
candle 0.9.2: We have migrated to the official crates.io release of candle 0.9.2, stabilizing our backend dependencies!
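
To illustrate the prefix caching highlight above, here is a minimal sketch of how a cache can detect a shared prompt prefix in whole blocks so that only the new suffix needs to be processed. The `PrefixCache` type, the `BLOCK_SIZE` value, and the method names are all hypothetical. This is a toy model of the idea, not the mistral.rs implementation, and it stores block ids instead of real KV tensors.

```rust
use std::collections::HashMap;

/// Tokens per KV block (hypothetical value chosen for illustration).
const BLOCK_SIZE: usize = 4;

/// Toy prefix cache: maps a token prefix (a whole number of blocks)
/// to the id of the block that ends that prefix.
#[derive(Default)]
struct PrefixCache {
    blocks: HashMap<Vec<u32>, usize>,
    next_id: usize,
}

impl PrefixCache {
    /// Returns how many leading tokens of `tokens` are already cached.
    /// The result is always a multiple of BLOCK_SIZE.
    fn cached_prefix_len(&self, tokens: &[u32]) -> usize {
        let mut len = 0;
        while len + BLOCK_SIZE <= tokens.len()
            && self.blocks.contains_key(&tokens[..len + BLOCK_SIZE])
        {
            len += BLOCK_SIZE;
        }
        len
    }

    /// Records every full block of `tokens` as cached.
    /// A trailing partial block is not stored.
    fn insert(&mut self, tokens: &[u32]) {
        let mut end = BLOCK_SIZE;
        while end <= tokens.len() {
            if !self.blocks.contains_key(&tokens[..end]) {
                self.blocks.insert(tokens[..end].to_vec(), self.next_id);
                self.next_id += 1;
            }
            end += BLOCK_SIZE;
        }
    }
}
```

In a multi-turn conversation, the second request repeats the first turn's tokens and appends new ones. After the first turn is inserted, `cached_prefix_len` on the second request reports the number of leading tokens whose blocks can be reused.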

New Models & Architectures

Embedding models: Qwen 3 Embedding, Embedding Gemma
Text models: GLM-4, GLM-4.7 Flash, Granite Hybrid, GPT-OSS
Vision models: Gemma 3n, Qwen 3 VL & Qwen 3 VL MoE

What's Changed

Improve automatic tool call by @EricLBuehler in #1460
chore: Dockerfile.cuda-all configurable threads by @polarathene in #1458
chore: Dockerfile.cuda-all - Merge RUN for apt-get install by @polarathene in #1459
Add fallback definition for metal::isnan by @EricLBuehler in #1463
chore: Dockerfile - Drop runtime rayon thread ENV by @polarathene in #1465
Remove duplicate calls for api_dir_list by @guoqingbao in #1474
Fix transient pyo3 dep in mistralrs_mcp by @EricLBuehler in #1478
Fix objc dep when non macos by @EricLBuehler in #1480
Fix phi4 mini + nccl cache issue by @EricLBuehler in #1481
Fix phi3.5 moe (#1447) by @EricLBuehler in #1482
Support GLM4 model by @guoqingbao in #1437
Refactor distributed backend by @EricLBuehler in #1484
Cap metal paged attn kv allocation by @EricLBuehler in #1485
Better paged attn metal cap by @EricLBuehler in #1486
Server core: consolidate and unify route handlers and API surface by @matthewhaynesonline in #1423
Support qwen3 gguf by @EricLBuehler in #1488
Make bos/eos token IDs optional by @EricLBuehler in #1493
Remove python deps from CUDA dockerfiles by @EricLBuehler in #1487
Handle noncontiguous v in naive_sdpa by @EricLBuehler in #1499
Server Core: refactor Paged Attention configuration (breaking change) by @matthewhaynesonline in #1500
Use StorageModePrivate for Metal PA kv cache by @EricLBuehler in #1506
fix(stream): emit field in tool-call deltas for schema compliance by @Sbargaoui in #1507
PagedAttention kv-cache quantization (F8E4M3) by @EricLBuehler in #1400
Validate model name in OpenAI API by @EricLBuehler in #1509
Fix mcp import in doc string by @GaetanLepage in #1510
Add multi-model support by @EricLBuehler in #1512
Add stars label to readme by @EricLBuehler in #1513
Handle base_model.model case in lora by @EricLBuehler in #1514
Add thread_local! for engine-specific const/static by @EricLBuehler in #1517
Fix MCP doc test by @GaetanLepage in #1511
Allow disabling metal precompilation by @EricLBuehler in #1518
Rust 1.88 clippy by @EricLBuehler in #1522
Fix cuda warnings by @EricLBuehler in #1526
Fix panic on error decoding tokens by @EricLBuehler in #1527
Split Marlin and Paged Attention kernels for faster build by @guoqingbao in #1525
chore: update llguidance by @ammar-elsabe in #1535
Add the SmolLM3 model by @EricLBuehler in #1501
Add Gemma 3n support by @EricLBuehler in #1519
Fix sequence length check by @EricLBuehler in #1546
update candle version by @AlpineVibrations in #1545
Add the capability to build for ios by @rubiktubik in #1548
chore: Dockerfile - Remove redundant symlink creation + ENV by @polarathene in #1504
Apply changes from Gemma 3n weight & config reupload by @EricLBuehler in #1553
Faster multimodal merging for gemma3n by @EricLBuehler in #1558
Improved Rust UQFF api by @EricLBuehler in #1563
Uqff minor tweaks and optimizations by @EricLBuehler in #1565
Update Candle backend, support new dtypes, CUDA 12.9 by @EricLBuehler in #1566
Fix nccl regression by @EricLBuehler in #1569
Fix metal regression by @EricLBuehler in #1575
Initial support for OpenAI Responses API by @EricLBuehler in #1580
Sanitize returned server errors by @EricLBuehler in #1581
Fix invalid HTML warning by @szepeviktor in #1583
Make typos configuration stricter by @szepeviktor in #1582
fix: use try_init when initialize tracing by @christer-eriksson in #1588
Add blockwise fp8 quantize kernels by @EricLBuehler in #1586
Support old gemma3n intermediate size config by @EricLBuehler in #1594
Handle when there are invalid vision tower weights by @EricLBuehler in #1595
Fix needless bool warning by @jncraton in #1599
Use smaller model in streaming example by @jncraton in #1598
Simplify streaming example output by @jncraton in #1597
Add vector 1xK fp8 kernels by @EricLBuehler in #1600
Target generic CPUs when building Python package by @jncraton in #1596
Fix tests in CI by @EricLBuehler in #1603
Add tiktoken -> Tokenizer conversion utilities by @EricLBuehler in #1604
Reworked attention chunking by @EricLBuehler in #1591
Fix cuda ring compilation by @EricLBuehler in #1608
mistralrs-quant: Fix build when feature=+cuda,-ring. by @ryanli in #1611
chore: Dockerfile - Add SHELL instruction by @polarathene in #1612
Send the mcp server an initialization notification to let it know the client is done initializing plus finish off making all 3 supported mcp transports properly increment their request ids by @sonicrules1234 in #1614
Add Claude Code GitHub Workflow by @EricLBuehler in #1616
Add MXFP4 gather gemm support by @EricLBuehler in #1615
Update ISQ Python example by @jncraton in #1617
Rust 1.89 clippy by @EricLBuehler in #1621
Support Qwen3 MoE GGUF model with fast MoE kernel by @guoqingbao in #1622
Disable CUDA event tracking, use cudarc 0.17 by @EricLBuehler in #1623
Enforce workspace msrv by @EricLBuehler in #1631
Fix bench async crash by @EricLBuehler in #1632
Bump tracing-subscriber from 0.3.19 to 0.3.20 by @dependabot[bot] in #1633
Update MCP server guide source by @emmanuel-ferdman i...

Contributors

vigsterkr, jncraton, and 24 other contributors

10 Jun 23:28

EricLBuehler

v0.6.0

3410183

v0.6.0

Dockerfiles (CUDA, CPU): https://github.com/EricLBuehler/mistral.rs/pkgs/container/mistral.rs
PyPi packages (no features, cuda, mkl, metal, accelerate)

🔥 Highlights from v0.6.0

🚀 Major Features

Llama 4 support and Qwen 3 / MoE / VL models, including DeepSeek and DeepCoder integrations
Multimodal prefix caching, paged attention scheduler improvements, and faster Metal/CUDA backends
Web chat app with chat history, file uploads, speech generation, and revamped tool-calling/search
Fast sampler and CPU FlashAttention with improved performance and accuracy
Metal and CUDA: major improvements in quantization (AFQ, ISQ), UQFF handling, and memory optimizations
MCP (Model Context Protocol): new server endpoints, docs, and integrated client
Vision and audio expansion: support for SIGLIP, Dia 1.6b TTS, conformer backbone (Phi-4MM), auto loaders, and vision tool prefixes

🧠 Inference Optimizations

Lightning-fast AFQ on CPU, optimized Qwen 3 MoE on Metal, and paged attention fixes
Unified FlashAttention backend and automatic method selection for ISQ
Metal precompilation support and reduced autorelease thrashing

🧰 Dev Improvements

Refactored engine architecture, KV cache, attention backends, and device mapping logic
Centralized dependency management and cleaner internal abstractions
Streamlined and faster LoRA support

🎉 Other

Revamped README, AGENTS.md, and new benchmarking scripts
Interactive mode now shows throughput, supports Gumbel sampling, and offers better runtime sampling controls
Expanded quant and GGUF support: AWQ, Qwen3 GGUF, and prequantized MLX compatibility

⸻

What's Changed

Fix handling of Metal fused attn head dims by @EricLBuehler in #1234
Support paged attn for vision model rust api by @EricLBuehler in #1235
[Breaking] Support setting HF cache path by @EricLBuehler in #1237
Support tool calling for DeepSeek models by @EricLBuehler in #1239
Server image processing refactor by @EricLBuehler in #1244
Optimized CUDA RoPE kernels by @EricLBuehler in #1247
Typo fix (add_speial_tokens to add_special_tokens) by @edwko in #1246
Fixes for UQFF + distributed layers by @EricLBuehler in #1250
Automatic agentic search integration (web_search_options) by @EricLBuehler in #1243
Format kernels by @EricLBuehler in #1251
Add quantize guards for UQFF deserialize by @EricLBuehler in #1252
Refactor cuBLASlt-related code by @EricLBuehler in #1253
Update deps, bump pyo3 version by @EricLBuehler in #1259
Faster cuda FP8 performance by @EricLBuehler in #1257
Rust 1.86 clippy by @EricLBuehler in #1260
Refactor engine arch by @EricLBuehler in #1262
Revamped LoRA support - removing the Ordering system by @EricLBuehler in #1263
Fast Metal-specific quantization method: AFQ by @EricLBuehler in #1264
Support prequantized models from MLX by @EricLBuehler in #1265
Automatic ISQ to select fastest & most accurate method by @EricLBuehler in #1266
Improved usage metrics by @EricLBuehler in #1267
Bump tokio from 1.44.1 to 1.44.2 by @dependabot in #1270
Gather MM ops in mistralrs-quant by @EricLBuehler in #1272
Improve performance of deepseek models by @guoqingbao in #1274
Implement Llama 4 by @EricLBuehler in #1268
Fixes for Llama 4 UQFF loading by @EricLBuehler in #1275
Support sharding for UQFF by @EricLBuehler in #1276
Fix bug for group-topk (group_limited_greedy) in deepseek models by @guoqingbao in #1278
Support the DeepCoder model by @EricLBuehler in #1279
Improved PagedAttn scheduling accuracy by @EricLBuehler in #1282
Fixes for scheduling image seqs with pagedattn by @EricLBuehler in #1283
update to llguidance 0.7.16 by @mmoskal in #1284
Update dependencies by @EricLBuehler in #1286
Much faster image inputs processing by @EricLBuehler in #1289
Add more SDPA head dims for much faster SIGLIP by @EricLBuehler in #1290
Show throughput in interactive mode by @EricLBuehler in #1291
Unify bitwise operations by @EricLBuehler in #1288
Multimodal prefix caching support! by @EricLBuehler in #1298
Interactive mode improvements by @EricLBuehler in #1299
Add the Qwen 3 and Qwen 3 MoE models by @EricLBuehler in #1285
Revamped and streaming web search support by @EricLBuehler in #1301
Handle vision messages or different tool call prefixes by @EricLBuehler in #1302
Simplify prefix cacher by @EricLBuehler in #1305
Use rustyline to handle non-ascii in interactive mode by @beeender in #1306
Add more tools for automatic search by @EricLBuehler in #1307
Fix CPU hogging in interactive mode by @beeender in #1309
Add Metal precompilation support by @EricLBuehler in #1311
Reduce thrashing of Metal autorelease by @EricLBuehler in #1313
make AdapterPaths and LoraAdapterPaths public by @Slowki in #1314
Refactor KV cache manager by @EricLBuehler in #1315
Add Audio and Speech model categories by @Slowki in #1317
Remove has_conv2d from vision model API by @EricLBuehler in #1318
Unified/automatic flash attention enabler by @EricLBuehler in #1319
Fix cublaslt 4d mask by @EricLBuehler in #1320
Qwen VL models fixes by @EricLBuehler in #1322
Fixes for all vision models by @EricLBuehler in #1323
Improved+faster LRU prefix cacher and sampler! by @EricLBuehler in #1321
Inplace ISQ support and default to mmap by @EricLBuehler in #1277
Fix typos by @omahs in #1329
Fix Idefics 3 arch chat templating by @EricLBuehler in #1330
Remove two spaces from PR comment by @szepeviktor in #1331
Add automatic vision loader type by @EricLBuehler in #1332
Add the Dia 1.6b TTS model! by @EricLBuehler in #1304
update llguidance to 0.7.20 by @Slowki in #1334
Add model category <> messages check by @EricLBuehler in #1335
Improve normalization integration test by @EricLBuehler in #1340
Fix streaming example print statement by @EricLBuehler in #1339
Fix normalization formula in comment by @EricLBuehler in #1338
Fix image_to_pixels for non-RGB images by @EricLBuehler in #1337
Fix typo in expect messages by @EricLBuehler in #1342
Don't use mmap on cuda by @EricLBuehler in #1336
Support AWQ format models by @guoqingbao in #1350
Fix uqff dummy layer ISQ application by @EricLBuehler in #1351
Disable immediate isq if writ...

Contributors

beeender, szepeviktor, and 9 other contributors

24 Mar 04:16

EricLBuehler

v0.5.0

7c086a9

v0.5.0

Highlights

Blog post: https://huggingface.co/blog/EricB/mistralrs-v0-5-0

Thank you to all contributors for this release! In addition to the highlights below, this release includes countless improvements, fixes, and optimizations.

Support for many more models:
- Gemma 3
- Qwen 2.5 VL
- Mistral Small 3.1
- Phi 4 Multimodal (image only)
Native tool calling support for:
- Llama 3.1/3.2/3.3
- Mistral Small 3
- Mistral Nemo
- Hermes 2 Pro
- Hermes 3
Tensor Parallelism support (NCCL)!
FlashAttention V3 support and integration in PagedAttention
30x reduction in ISQ times on Metal!
Revamped prefix cacher system

What's Changed

Allow using library in CurrentThread runtime by @sgrebnov in #1082
Improve accuracy of uqff auto device map by @EricLBuehler in #1084
DeepSeekV3 sigmoid support by @EricLBuehler in #1092
GPU-accelerated sampling (+5% decode perf) by @EricLBuehler in #1094
Fix missing perceiver_config in qwen2vl by @EricLBuehler in #1096
More topk methods for deepseek 2/3 by @EricLBuehler in #1097
More accurate layer size computation for deepseek 2/3 by @EricLBuehler in #1098
Improve streaming UX by @EricLBuehler in #1102
Faster fp8 blockwise dequant by @EricLBuehler in #1100
DS2/3 paged attn by @EricLBuehler in #1103
Faster bincount by @EricLBuehler in #1104
PagedAttention prompt chunking support by @EricLBuehler in #1105
Refactor server SSE by @EricLBuehler in #1107
PagedAttention + FlashAttention (and FlashAttention V3) by @EricLBuehler in #1109
Take KEEP_ALIVE_INTERVAL into account by @EricLBuehler in #1111
Refactor enable of flash attn by @EricLBuehler in #1110
Fix imatrix isq quantize_onto by @EricLBuehler in #1112
Tensor parallelism and pipeline parallelism by @EricLBuehler in #1113
Bump openssl from 0.10.69 to 0.10.70 by @dependabot in #1121
Allow chat streaming to use tools by @Jeadie in #1088
New file format for imatrix: .cimatrix by @EricLBuehler in #1004
Fix isq with bias for column parallel by @EricLBuehler in #1128
Multi-node support for tensor parallelism by @EricLBuehler in #1125
Add an NCCL feature flag by @EricLBuehler in #1129
Fix mistral 2501 gguf by @EricLBuehler in #1131
Add jinja strftime_now function by @EricLBuehler in #1132
Multiple models multi node by @EricLBuehler in #1136
Remove unexpected cp behavior by @jncraton in #1141
Revamp speculative decoding! by @EricLBuehler in #1027
Fuse MLP mul-and-act by @EricLBuehler in #1142
Short-circuit dry sampling: +6% T/s by @EricLBuehler in #1143
Integrate fused MLP mul-act for more models! by @EricLBuehler in #1144
Use cudarc 0.13.5 by @EricLBuehler in #1145
Handle HF_HUB_CACHE env var by @EricLBuehler in #1146
FlashAttention V2/V3 metadata with support for device location by @EricLBuehler in #1148
FP8 blockwise dequant cuda kernel by @EricLBuehler in #1149
Blockwise FP8 CUDA for cc < 800 by @EricLBuehler in #1150
Fix chat sampling response by @EricLBuehler in #1154
Multiple processes for TP by @EricLBuehler in #1152
Ensure we do not bind the port for daemon processes by @EricLBuehler in #1158
Handle CUDA_NVCC_FLAGS in flash attn v3 by @EricLBuehler in #1160
build fix for arm. by @jamesvren in #1164
Working PrefixCacherV2! by @EricLBuehler in #1168
Implement Phi-4 Multimodal! by @EricLBuehler in #1163
No extra split/cat pair in rope by @EricLBuehler in #1169
Remove gpu<>cpu sync for faster long-context by @EricLBuehler in #1170
Refactor NCCL device mappers by @EricLBuehler in #1172
Bump ring from 0.17.11 to 0.17.13 by @dependabot in #1179
DSV3/R1 fixes by @EricLBuehler in #1173
Fix diffusion device mapping by @EricLBuehler in #1187
Internal abstraction for distributed op by @EricLBuehler in #1188
Make Sequence::set_toks more safe by @EricLBuehler in #1190
Fix CI tests out of storage? by @EricLBuehler in #1191
Internal abstraction for distributed op by @EricLBuehler in #1189
Fix build_cuda_all.yaml CI by @EricLBuehler in #1193
Support tensor parallelism for vision models! by @EricLBuehler in #1194
Always pass _USE_MATH_DEFINES for CUDA by @EricLBuehler in #1195
Remove matmul via f16 framework by @EricLBuehler in #1196
Remove API for matmul_via_f16 by @EricLBuehler in #1197
Add UQFF text/vision model API by @EricLBuehler in #1198
Complete qwen2_5_vl, and some fixes by @brrr in #1184
Implement Gemma 3 by @EricLBuehler in #1201
Add Gemma 3 vision support! by @EricLBuehler in #1202
Manually fixup sentencepiece detok by @EricLBuehler in #1204
More vision models with TP by @EricLBuehler in #1200
Fix topology link in the docs by @etiennebalit in #1205
Gemma3 1b support and optimized rotating cache by @EricLBuehler in #1206
Improve rotating kv cache, prefix cacher system by @EricLBuehler in #1207
Better handling for kvcache set_len by @EricLBuehler in #1208
Update deps and use rand 0.9 by @EricLBuehler in #1210
Update hf hub dep, add initial blockwise fp8 GEMM tests by @EricLBuehler in #1212
Growable RotatingKvCache and fixes for Phi-4 mini by @EricLBuehler in #1215
Gemma 3 cuda fixes by @EricLBuehler in #1217
Add pydantic schema examples! by @EricLBuehler in #1219
Sliding window attention fixes by @EricLBuehler in #1220
adapt to rig crate as client by @benliao in #1214
Implement Mistral 3! by @EricLBuehler in #1221
Metal SDPA with masking by @EricLBuehler in #1225
Send [DONE] SSE chunk per openai spec by @EricLBuehler in #1226
Fix handling of device when compiled for but disabled nccl by @EricLBuehler in #1227
Fix nccl blocking case by @EricLBuehler in #1228
Native Llama, Mistral Small 3.1, Mistral Nemo, Hermes 2 Pro, Hermes 3 tool calling! by @EricLBuehler in #1229
OpenAI API compatibility fixes by @EricLBuehler in #1230
[Breaking] Automatic server logging by @EricLBuehler in #1231
Use default stream for flash attn by @EricLBuehler in https://gi...

Contributors

jncraton, brrr, and 7 other contributors

22 Jan 19:39

EricLBuehler

v0.4.0

f1a56f6

v0.4.0

New features

🔥 New models!
- DeepSeek V2
- DeepSeek V3 and R1
- MiniCpm-O 2.6
🧮 Imatrix quantization
⚙️ Automatic device mapping
BNB quantization
Support blockwise FP8 dequantization and FP8 on Metal
Integrate the llguidance library (@mmoskal)
Metal PagedAttention
Many fixes and improvements from contributors!

Breaking changes

The Rust device mapping API has changed.

MSRV

The MSRV of this release is 1.83.0.

What's Changed

Use CUDA_COMPUTE_CAP if nvidia-smi not found by @EricLBuehler in #944
fix(docs): fix broken link by @sammcj in #945
Better diffusion interactive mode by @EricLBuehler in #948
Implement Imatrix for ISQ by @EricLBuehler in #949
Support imatrix quantization for vision models by @EricLBuehler in #950
Perplexity calculations with imatrix by @EricLBuehler in #952
set minimum rustc version to 1.82 by @mmoskal in #957
Fix append_sliding_window by @EricLBuehler in #958
Fix completion api behavior of best_of by @EricLBuehler in #959
Ensure support for cuda cc 5.3 by @EricLBuehler in #960
Improve test speeds on Windows by @EricLBuehler in #961
use llguidance library for constraints (including json schemas) by @mmoskal in #899
Fix metal fp8 quantization by @EricLBuehler in #962
Fix example gguf_locally to match chat template requirements by @msk in #966
Bitsandbytes quantization: loading and kernels by @EricLBuehler in #967
updated the tokenizers dependency of core to 0.21 by @vkomenda in #975
Remove outdated binaries mention in the readme by @BafS in #973
Improve error handling by @cdoko in #974
Add None check to prevent panic in evict_all_to_cpu in prefix_cacher.rs by @cdoko in #979
Include start offset for metal bitwise ops by @EricLBuehler in #978
Fail fast on TcpListener bind errors by @cdoko in #982
Inplace softmax long-seqlen attention optimizations by @EricLBuehler in #984
Fix cuda cublaslt when using vllama mask by @EricLBuehler in #985
Add cross attn quantization for mllama by @EricLBuehler in #987
fix mistralrs-server ignoring interactive_mode arg by @haricot in #990
Adding streaming function to mistralrs server. by @Narsil in #986
Fixes for bnb and more apis in mistralrs-quant by @EricLBuehler in #972
Support send + sync in loader by @EricLBuehler in #991
More vllama optimizations by @EricLBuehler in #992
Update docs by @EricLBuehler in #993
Use metal autorelease to optimize memory usage by @EricLBuehler in #996
Partial Fix for Sliding Window Attention by @cdoko in #994
Only dep on objc when building on metal by @EricLBuehler in #998
Prefix cacher v2 by @EricLBuehler in #1000
Add --cpu flag to mistralrs-server by @cdoko in #997
Metal PagedAttention support by @EricLBuehler in #1001
Fix cross attention + prefix cacher v2 support by @EricLBuehler in #1006
Support for normal cache for mllama, phi3v, qwen2vl by @EricLBuehler in #1007
Cleaner creation of dummy pa input metadata by @EricLBuehler in #1014
Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models by @guoqingbao in #1009
Support device mapping for Paged Attention by @cdoko in #1011
Prefix cacher fixes by @EricLBuehler in #1018
More fixes for the prefix cacher by @EricLBuehler in #1019
Support uqff for idefics3 by @EricLBuehler in #1020
Prepare for v0.3.5 by @EricLBuehler in #1021
Cleaner pipeline no prefix cache setting by @EricLBuehler in #1022
Support uqff load/save for idefics3 by @EricLBuehler in #1023
Update license for 2025 by @EricLBuehler in #1024
Implement DeepSeekV2 by @EricLBuehler in #1010
Use cudarc fork to fix CUDA build on Windows by @EricLBuehler in #1032
Fix metal paged attn phi3 by @EricLBuehler in #1033
Use float8 mistralrs_cudarc_fork feature by @EricLBuehler in #1034
Patch prefix caching to fix incorrect outputs by @EricLBuehler in #1035
Allocate paged attn cache as empty instead of zeros by @EricLBuehler in #1036
Remove ug and cudarc transient dep by @EricLBuehler in #1037
Rename MemoryGpuConfig::Amount->MbAmount by @EricLBuehler in #1038
CUDA dequant kernels conditional compilation by @EricLBuehler in #1039
F16 support for mllama, introduce FloatInfo by @EricLBuehler in #1041
Automatic device mapping support by @EricLBuehler in #1042
Support automatic device mapping for gguf models by @EricLBuehler in #1044
Support loading models without ISQ using device map by @EricLBuehler in #1045
Fix GGUF auto device mapping by @EricLBuehler in #1047
More efficient loading of safetensors when casting by @EricLBuehler in #1048
Fix Loading and Running on CPU by @cdoko in #1052
Work on better device mapping for mllama by @EricLBuehler in #1049
Mention interactive mode or server port in readme for gguf by @EricLBuehler in #1055
Fix panic in mistralrs-server by @cdoko in #981
Include device memory avail in device map err by @EricLBuehler in #1060
Fix --cpu on cuda by @cdoko in #1056
Improve pagedattn support in mistralrs bench by @EricLBuehler in #1063
Paged attention support for multi gpu by @EricLBuehler in #1059
Ergonomic automatic device mapping support by @EricLBuehler in #1054
Examples for automatic device mapping by @EricLBuehler in #1065
Fix metal pagedattn half8 vec impl by @EricLBuehler in #1067
Improve support for GGUF auto device map by @EricLBuehler in #1069
Fix missing field in idefics3 during loading by @EricLBuehler in #1070
Fix missing field in idefics3 during loading by @EricLBuehler in #1072
Fix paged attention for vision models on multiple devices by @cdoko in #1071
Fixes for idefics3 and idefics2 by @EricLBuehler in #1073
Improve automatic device map by @EricLBuehler in #1076
Implement the DeepSeekV3 model (support full DeepSeek R1) by @EricLBuehler in #1077
Don't print GGUF model metadata when silent=true by @Jeadie in #1079
Allow ChatCompletionChunkResponse (and therefore streaming) to have Usage. by @Jeadie in #1078
Support loading blockwise...

Contributors

msk, Narsil, and 9 other contributors

28 Nov 19:27

EricLBuehler

v0.3.4

68c078f

v0.3.4

New features

Qwen2-VL support
Idefics 3/SmolVLM support
🔥 6x prompt performance boost (all benchmarks faster than or comparable to MLX, llama.cpp)!
🗂️ More efficient non-PagedAttention KV cache implementation!
Public tokenization API

Python wheels

The wheels now include support for Windows, Linux, and Mac with x86_64 and aarch64.

MSRV

1.79.0

What's Changed

Update Dockerfile by @Reckon-11 in #895
Add the Qwen2-VL model by @EricLBuehler in #894
ISQ for mistralrs-bench by @EricLBuehler in #902
Use tokenizers v0.20 by @EricLBuehler in #904
Fix metal sdpa for v stride by @EricLBuehler in #905
Better parsing of the image path by @EricLBuehler in #906
Add some Metal kernels for HQQ dequant by @EricLBuehler in #907
Handle assistant messages with 'tool_calls' by @Jeadie in #824
Attention-fused softmax for Metal by @EricLBuehler in #908
Metal qmatmul mat-mat product (5.4x performance increase) by @EricLBuehler in #909
Support --dtype in mistralrs bench by @EricLBuehler in #911
Metal: Use mtl resource shared to avoid one copy by @EricLBuehler in #914
Preallocated KV cache by @EricLBuehler in #916
Fixes for kv cache grow by @EricLBuehler in #917
Dont always compile with fp8, bf16 for cuda by @EricLBuehler in #920
Expand attnmask on cuda by @EricLBuehler in #923
Faster CUDA prompt speeds by @EricLBuehler in #925
Paged Attention alibi support by @EricLBuehler in #926
Default to SDPA for faster VLlama PP T/s by @EricLBuehler in #927
VLlama vision model ISQ support by @EricLBuehler in #928
Support fp8 on Metal by @EricLBuehler in #930
Bump rustls from 0.23.15 to 0.23.18 by @dependabot in #932
Calculate perplexity of ISQ models by @EricLBuehler in #931
Integrate fast MLX kernel for SDPA with long seqlen by @EricLBuehler in #933
Always cast image to rgb8 for qwenvl2 by @EricLBuehler in #936
Fix etag missing in hf hub by @EricLBuehler in #934
Fix some examples for vllama 3.2 by @EricLBuehler in #937
Improve memory efficiency of vllama by @EricLBuehler in #938
Implement the Idefics 3 models (Idefics 3, SmolVLM-Instruct) by @EricLBuehler in #939
Expose a public tokenization API by @EricLBuehler in #940
Prepare for v0.3.4 by @EricLBuehler in #942

New Contributors

@Reckon-11 made their first contribution in #895

Full Changelog: v0.3.2...v0.3.4

Contributors

Jeadie, dependabot, and 2 other contributors

28 Oct 15:44

EricLBuehler

v0.3.2

57a8b03

v0.3.2

Key changes

General improvements and fixes
ISQ FP8
GPTQ Marlin
26% performance boost on Metal
Python package wheels are available. See below and the various PyPI packages.

What's Changed

Update docs and deps by @EricLBuehler in #804
Support Qwen 2.5 by @EricLBuehler in #805
Update docs with clarifications and notes by @EricLBuehler in #806
Improved inverting for Attention Mask by @EricLBuehler in #811
Fix repeat_interleave by @EricLBuehler in #812
Use f32 for neg inf in cross attn mask by @EricLBuehler in #814
Improve UQFF memory efficiency by @EricLBuehler in #813
Update Metal, CUDA Candle impls and ISQ by @EricLBuehler in #816
chore: update pagedattention.cu by @eltociear in #822
MLlama - if f16, load vision model in f32 by @EricLBuehler in #820
ci: Upgrade actions by @polarathene in #823
docs: added a top button because of readme length by @bhargavshirin in #833
Typo in error of model architecture enum by @nikolaydubina in #835
Expose config for Rust api, tweak modekind by @EricLBuehler in #841
Add ISQ FP8 by @EricLBuehler in #832
Fix Metal F8 build errors by @EricLBuehler in #846
Bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #854
Generate standalone UQFF models by @EricLBuehler in #849
Update README.MD by @kaleaditya779 in #848
Add GPTQ Marlin support for 4 and 8 bit by @EricLBuehler in #856
Adds wrap_help feature to clap by @DaveTJones in #858
Patch UQFF metal generation by @EricLBuehler in #857
Add GGUF Qwen 2 by @EricLBuehler in #860
Avoid duplicate Metal command buffer encodings during ISQ by @EricLBuehler in #861
Fix for isnanf by @EricLBuehler in #859
Fix some metal warnings by @EricLBuehler in #862
Support interactive mode markdown bold/italics via ANSI codes by @EricLBuehler in #879
Even better V-Llama accuracy by @EricLBuehler in #881
Trim whitespace (such as carriage returns) from nvidia-smi output. by @asaddi in #880
MODEL_ID not "MODEL_ID" by @simonw in #863
Sync ggml metal kernels by @EricLBuehler in #885
Increase Metal decoding T/s by 26% by @EricLBuehler in #887
Remove pretty-printer by @EricLBuehler in #889
Fix typo in documentation by @msk in #888
fix Half-Quadratic Quantization and Dequantization on CPU by @haricot in #873
Prepare for v0.3.2 by @EricLBuehler in #891

New Contributors

@bhargavshirin made their first contribution in #833
@nikolaydubina made their first contribution in #835
@kaleaditya779 made their first contribution in #848
@DaveTJones made their first contribution in #858
@asaddi made their first contribution in #880
@simonw made their first contribution in #863
@msk made their first contribution in #888
@haricot made their first contribution in #873

Full Changelog: v0.3.1...v0.3.2

Contributors

simonw, msk, and 10 other contributors

Assets 23

29 Sep 15:39

EricLBuehler

v0.3.1

1caf83a

v0.3.1

Highlights

UQFF
FLUX model
Llama 3.2 Vision model

MSRV

The MSRV of this release is 1.79.0.

What's Changed

Enable automatic determination of normal loader type by @EricLBuehler in #742
Add the ForwardInputsResult api by @EricLBuehler in #745
Implement Mixture of Quantized Experts (MoQE) by @EricLBuehler in #747
Bump quinn-proto from 0.11.6 to 0.11.8 by @dependabot in #748
Fix f64-f32 type mismatch for Metal/Accelerate by @EricLBuehler in #752
Nicer error when misconfigured PagedAttention input metadata by @EricLBuehler in #753
Update deps, support CUDA 12.6 by @EricLBuehler in #755
Patch bug when not using PagedAttention by @EricLBuehler in #759
Fix MistralRs Drop impl in tokio runtime by @EricLBuehler in #762
Use nicer Candle Error APIs by @EricLBuehler in #767
Support setting seed by @EricLBuehler in #766
Fix Metal build error with seed by @EricLBuehler in #771
Fix and add checks for no kv cache by @EricLBuehler in #776
UQFF: The uniquely powerful quantized file format. by @EricLBuehler in #770
Add Scheduler::running_len by @EricLBuehler in #780
Deduplicate RoPE caches by @EricLBuehler in #787
Easier and simpler Rust-side API by @EricLBuehler in #785
Add some examples for AnyMoE by @EricLBuehler in #788
Rust API for sampling by @EricLBuehler in #790
Our first Diffusion model: FLUX by @EricLBuehler in #758
Fix build bugs with metal, NSUInteger by @EricLBuehler in #792
Support weight tying in Llama 3.2 GGUF models by @EricLBuehler in #801
Implement the Llama 3.2 vision models by @EricLBuehler in #796

Full Changelog: v0.3.0...v0.3.1

Contributors

dependabot and EricLBuehler

Assets 2

02 Sep 17:27

EricLBuehler

v0.3.0

ae71578

v0.3.0

Highlights

New model topology feature: ISQ and device mapping
🔥Faster FlashAttention support when batching
Removed plotly and associated JS dependencies
φ³ Support Phi 3.5, Phi 3.5 vision, Phi 3.5 MoE
Improved Rust API ergonomics
Support multiple (sharded) GGUF files

MSRV

The Rust MSRV of this version is 1.79.0.

What's Changed

Fixes for auto dtype selection with RUST_BACKTRACE=1 by @EricLBuehler in #690
Add support multiple GGUF files by @EricLBuehler in #692
Refactor normal and vision loaders by @EricLBuehler in #693
Fix split.count GGUF duplication handling by @EricLBuehler in #695
Batching example by @EricLBuehler in #694
Some fixes by @EricLBuehler in #697
Improve vision rust examples by @EricLBuehler in #698
Add ISQ topology by @EricLBuehler in #701
Add custom logits processor API by @EricLBuehler in #702
Add Gemma 2 PagedAttention support by @EricLBuehler in #704
Faster RmsNorm in Gemma/Gemma2 by @EricLBuehler in #703
Fix bug in Metal ISQ by @EricLBuehler in #706
Support GGUF BF16 tensors by @EricLBuehler in #691
Better support for FlashAttention: real batching + sliding window + softcap by @EricLBuehler in #707
Remove some usages of pub in models by @EricLBuehler in #708
Support the Phi 3.5 V model by @EricLBuehler in #710
Implement the Phi 3.5 MoE model by @EricLBuehler in #709
Device map topology by @EricLBuehler in #717
Implement DRY penalty by @EricLBuehler in #637
Remove plotly and just output CSV loss file by @EricLBuehler in #700
Using once_cell to reduce MSRV by @EricLBuehler in #724
Fixes for Windows build by @EricLBuehler in #729
Even more phi3.5moe fix attempts by @EricLBuehler in #731
Add example for Phi 3.5 MoE by @EricLBuehler in #733
Add Phi 3.5 chat template by @EricLBuehler in #734
Patch ISQ for Mixtral by @EricLBuehler in #730
Gracefully handle Engine Drop with termination request by @EricLBuehler in #735
feat(vision): add support for proper file and data image URLs by @Schuwi in #727
Add new parsing to Python API by @EricLBuehler in #737
Remove test and add custom error type to Python API by @EricLBuehler in #738
Update kernels for metal bf16 by @EricLBuehler in #719
Better Response Result API by @EricLBuehler in #739
More Metal quantized kernel fixes by @EricLBuehler in #740
[Breaking] Bump version to v0.3.0 by @EricLBuehler in #736
Final changes for v0.3.0 by @EricLBuehler in #741

New Contributors

@Schuwi made their first contribution in #727

Full Changelog: v0.2.5...v0.3.0

Contributors

Schuwi and EricLBuehler

Assets 2

16 Aug 01:10

github-actions

v0.2.5

e64a71a

v0.2.5

What's Changed

Refactor ISQ quant parsing by @EricLBuehler in #664
Refactor server examples to use OpenAI Python client by @EricLBuehler in #665
Implement prompt chunking by @EricLBuehler in #623
Python example and server example cleanup by @EricLBuehler in #668
Implement GPTQ quantization by @EricLBuehler in #467
Update deps by @EricLBuehler in #672
Rework the automatic dtype selection feature by @EricLBuehler in #676
Fix backend Candle fork Metal, flash attn, also Llama linear by @EricLBuehler in #681
Use converted tokenizer.json in tests by @EricLBuehler in #682
Refactor ISQ and mistralrs-quant by @EricLBuehler in #683
Fix metal build for isq by @EricLBuehler in #686
Add missing error case in automatic dtype selection feature by @ac3xx in #685
fix null in tool type response by @wseaton in #687
Implement HQQ quantization by @EricLBuehler in #677
Bump version to 0.2.5 by @EricLBuehler in #688

New Contributors

@ac3xx made their first contribution in #685
@wseaton made their first contribution in #687

Full Changelog: v0.2.4...v0.2.5

Install mistralrs-server 0.2.5

Install prebuilt binaries via shell script

curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.5/mistralrs-server-installer.sh | sh

Download mistralrs-server 0.2.5

File	Platform	Checksum
mistralrs-server-aarch64-apple-darwin.tar.xz	Apple Silicon macOS	checksum
mistralrs-server-x86_64-apple-darwin.tar.xz	Intel macOS	checksum
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz	x64 Linux	checksum

Contributors

ac3xx, wseaton, and EricLBuehler

Assets 15
