This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
mistral.rs is a blazing-fast LLM inference engine written in Rust. It supports text, multimodal, image generation, and speech models with Rust and Python SDKs, plus OpenAI HTTP and MCP APIs.
```bash
# Basic release build
cargo build --release
# With CUDA support (Linux)
cargo build --release --features "cuda flash-attn cudnn"
# With Metal support (macOS)
cargo build --release --features metal
# Install CLI binary
cargo install --path mistralrs-cli --features <features>
```

```bash
# Run core tests
cargo test -p mistralrs-core -p mistralrs-quant -p mistralrs-vision
# Format code (uses rustfmt, ruff, clang-format)
make fmt
# Check formatting
cargo fmt --all -- --check
# Run clippy
cargo clippy --workspace --tests --examples -- -D warnings
```

```bash
# Run interactive mode (model type auto-detected)
mistralrs run -m <model_id>
# Run with GGUF quantized model
mistralrs run --format gguf -m <repo> -f <file>
# Run server
mistralrs serve -p 1234 -m <model_id>
# Run server with web UI
mistralrs serve --ui -m <model_id>
# Run benchmarks
mistralrs bench -m <model_id>
```

When integrating a new model, make sure it respects all of the VarBuilder `.pp` calls. In Candle, a `VarBuilder` maintains an internal path vector that acts like a "current working directory" for model weights: every call to `pp("sub")` (an alias for `push_prefix`) clones the builder and appends `sub`, so successive calls accumulate a dotted prefix such as `transformer.h.0` while leaving the original builder untouched. When you eventually call `get(...)`, Candle joins that prefix with the tensor name (`prefix + "." + name`) and looks it up in the checkpoint backend, producing keys that exactly match the dot-separated names emitted by PyTorch's `state_dict`/`named_parameters`, which means PyTorch-trained weights can be loaded without any renaming. This lets you recreate the PyTorch module tree in Rust by "walking" it: `vb.pp("word_embeddings")` targets keys under `word_embeddings.`, while a chain like `vb.pp("encoder").pp("layers").pp(i.to_string())` targets keys such as `encoder.layers.0.`. The prefix system lets you "cd" around the parameter hierarchy, giving a lightweight namespace mechanism that keeps Candle fully compatible with PyTorch naming conventions while remaining ergonomic to use.
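As a sketch of that prefix mechanism, the hypothetical `Vb` type below mimics how `pp` clones the builder and extends a dotted path, and how `get` joins `prefix + "." + name` into a PyTorch-style key. The type and signatures are illustrative stand-ins, not Candle's real API (`candle_nn::VarBuilder`):

```rust
use std::collections::HashMap;

// Illustrative stand-in for Candle's VarBuilder (not the real type):
// `pp` clones and extends the path; `get` joins prefix + "." + name.
#[derive(Clone)]
struct Vb<'a> {
    path: Vec<String>,
    weights: &'a HashMap<String, Vec<f32>>,
}

impl<'a> Vb<'a> {
    fn pp(&self, sub: impl ToString) -> Self {
        let mut next = self.clone(); // original builder stays untouched
        next.path.push(sub.to_string());
        next
    }

    fn get(&self, name: &str) -> Option<&'a Vec<f32>> {
        let key = if self.path.is_empty() {
            name.to_string()
        } else {
            format!("{}.{}", self.path.join("."), name)
        };
        self.weights.get(key.as_str())
    }
}

fn main() {
    let mut weights = HashMap::new();
    // Keys exactly match PyTorch state_dict naming.
    weights.insert("encoder.layers.0.weight".to_string(), vec![1.0, 2.0]);

    let vb = Vb { path: Vec::new(), weights: &weights };
    let layer0 = vb.pp("encoder").pp("layers").pp(0.to_string());
    assert!(layer0.get("weight").is_some()); // "encoder.layers.0.weight"
    assert!(vb.get("encoder.layers.0.weight").is_some()); // vb unchanged
    println!("ok");
}
```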
You should also look for a `model.safetensors.index.json` file for the model at hand to verify the correct weight structure.
- `mistralrs-core/` - Core inference engine, model implementations, pipelines
- `mistralrs-cli/` - Unified CLI binary (commands: run, serve, bench, from-config)
- `mistralrs-server-core/` - HTTP server routing, OpenAI API implementation
- `mistralrs-pyo3/` - Python SDK (PyO3 bindings)
- `mistralrs/` - Rust SDK (high-level crate)
- `mistralrs-vision/` - Image processing utilities
- `mistralrs-quant/` - Quantization implementations (ISQ, GGUF, GPTQ, etc.)
- `mistralrs-paged-attn/` - PagedAttention implementation
- `mistralrs-audio/` - Audio processing
- `mistralrs-mcp/` - Model Context Protocol client
- `mistralrs-bench/` - (Deprecated) Use `mistralrs bench` instead
- **Pipeline Architecture**: All models implement the `Pipeline` trait in `mistralrs-core/src/pipeline/mod.rs`. Different model types (Plain, GGUF, GGML, Multimodal) have their own pipeline implementations.
- **Model Loading**: Models are loaded through `Loader` traits that handle different formats and quantizations. See `mistralrs-core/src/loader.rs`.
- **Request Handling**: The server uses message passing, with the `MistralRs` struct managing a background thread pool. Requests flow through `mistralrs-core/src/engine/mod.rs`.
- **Device Management**: Automatic and manual device mapping for multi-GPU setups is handled in `mistralrs-core/src/device_map.rs`.
When adding new model architectures:
- Implement the model in `mistralrs-core/src/models/`
- Add pipeline support in `mistralrs-core/src/pipeline/`
- Update model detection in `mistralrs-core/src/pipeline/normal.rs`
- Add an architecture enum variant in `mistralrs-core/src/lib.rs`
- Update CLI args in `mistralrs-cli/src/main.rs`
When adding new quantization methods:
- Implement in `mistralrs-quant/src/`
- Add to the quantization loading logic in pipelines
- Update documentation in `docs/QUANTIZATION.md`
- `mistralrs-core/src/engine/mod.rs` - Main engine orchestration
- `mistralrs-core/src/pipeline/mod.rs` - Pipeline trait and common logic
- `mistralrs-server-core/src/routes.rs` - HTTP API endpoints
- `mistralrs-pyo3/src/lib.rs` - Python SDK entry point
- `mistralrs/examples/` - Usage examples for Rust SDK
Never include a "Test plan" section in PR descriptions.
You should always run `cargo check` (alias: `cargo c`) before returning, to make sure the code compiles. If the code does not compile, keep making edits until it does.
Avoid returning TODOs.
- Unit tests are colocated with source files
- Integration tests live in `tests/` directories
- Use `cargo test -p <crate>` to test specific components
- Python tests require building and installing the package first
- **Feature Flags**: Many features are gated behind Cargo features. Always check which features are needed for your use case.
- **Device Indices**: CUDA device selection uses 0-based indexing.
- **Chat Templates**: Models may need specific chat templates; check the `chat_templates/` directory.
- **Quantization**: Different quantization methods have different hardware requirements.
- **Never use `Tensor::{from_vec, arange}` in hot loops**: calling `Tensor::{from_vec, arange}` with a GPU device causes a CPU-to-GPU sync. If you need a small tensor on the GPU during forward, precompute it either at model init or at the start of the forward pass.
- **Vision encoder attention must be bidirectional (non-causal)**: `Sdpa.run_attention` with `flash_params: None` defaults to `causal = seq_len > 1` on the CUDA flash-attn path, which silently breaks vision/audio encoders. Always pass `FlashParams { causal: false, cumulative_seqlens_q: HashMap::new(), cumulative_seqlens_k: HashMap::new(), max_q: 0, max_k: 0 }` with `Some(&flash_params)` for any encoder that needs bidirectional attention. The empty `cumulative_seqlens` maps cause the flash backend to use the non-varlen kernel path, avoiding any tensor allocation in the forward pass.
- **`torch.bucketize(right=True)` requires `Ok(i) => i + 1`**: Rust's `binary_search_by` returns `Ok(i)` at the found position (`bisect_left` semantics). For `right=True` (`bisect_right` semantics), you must map `Ok(i) => i + 1` to insert after equal elements; `Err(i) => i` is correct for both.
- **Mistral `consolidated.safetensors` stores Q/K weights with interleaved head dimensions**: When loading from Mistral-native `consolidated.safetensors` (as opposed to HF-converted `model.safetensors`), the Q and K projection weights use an interleaved layout within each head: `[x0, x_{d/2}, x1, x_{d/2+1}, ...]` instead of the sequential HF layout `[x0, x1, ..., x_{d/2-1}, x_{d/2}, ...]`. This means you must use `is_gptx = false` (GPT-J/adjacent-pair style) for `RotaryEmbedding`, NOT `is_gptx = true` (GPT-NeoX/half-split style). Using the wrong RoPE style produces completely wrong attention outputs (cosine similarity ~0.02 with the reference). To diagnose: compare a Q or K weight tensor between `consolidated.safetensors` and `model.safetensors`; if they differ (cosine ~0.02), apply the un-interleave `reshape(n_heads, head_dim/2, 2, dim).permute(0, 2, 1, 3)` and verify cosine ~1.0.
- **Causal Conv1d padding formula**: For causal convolution (left-pad only, no right-pad), the correct left padding is `effective_kernel_size - stride`, NOT `(kernel_size - 1) * dilation` (which is the total padding for the non-causal case). For example, with `kernel_size = 3`, `stride = 2`, `dilation = 1`: `left_pad = 3 - 2 = 1`, not 2. Verify against the HF model's `VoxtralRealtimeCausalConv1d` or equivalent source.
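The bucketize mapping in the gotcha above can be sketched as a small helper (the name `bucketize_right` is hypothetical; this assumes strictly increasing boundaries, which `torch.bucketize` requires to be sorted anyway):

```rust
/// bisect_right / torch.bucketize(right=True) semantics on a strictly
/// increasing boundary list: equal elements insert AFTER the match.
fn bucketize_right(boundaries: &[i64], value: i64) -> usize {
    match boundaries.binary_search(&value) {
        Ok(i) => i + 1, // found: right=True inserts after the equal element
        Err(i) => i,    // not found: insertion point, correct for both sides
    }
}

fn main() {
    let b = [1, 3, 5, 7];
    assert_eq!(bucketize_right(&b, 3), 2); // equal to b[1] -> after it
    assert_eq!(bucketize_right(&b, 4), 2); // strictly between 3 and 5
    assert_eq!(bucketize_right(&b, 0), 0); // before everything
    assert_eq!(bucketize_right(&b, 9), 4); // after everything
    println!("ok");
}
```

With `Ok(i) => i` instead, value 3 would land in bucket 1 (`bisect_left` semantics), which is the off-by-one this gotcha warns about.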
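The un-interleave transform for Mistral-native Q/K weights can be sketched on a plain row-major slice (hypothetical helper; real code applies the equivalent reshape/permute on Candle tensors):

```rust
/// Un-interleave Mistral-native Q/K projection rows into the sequential
/// HF layout. `w` is row-major with shape (n_heads * head_dim, dim).
/// Within each head, interleaved row 2k holds sequential row k and
/// interleaved row 2k+1 holds sequential row head_dim/2 + k.
fn uninterleave_qk(w: &[f32], n_heads: usize, head_dim: usize, dim: usize) -> Vec<f32> {
    let half = head_dim / 2;
    let mut out = vec![0.0; w.len()];
    for h in 0..n_heads {
        for k in 0..half {
            for pair in 0..2 {
                // Source: interleaved row 2k + pair within head h.
                let src = (h * head_dim + 2 * k + pair) * dim;
                // Destination: sequential row pair * half + k within head h.
                let dst = (h * head_dim + pair * half + k) * dim;
                out[dst..dst + dim].copy_from_slice(&w[src..src + dim]);
            }
        }
    }
    out
}

fn main() {
    // 1 head, head_dim = 4, dim = 1; interleaved row order is [x0, x2, x1, x3].
    let interleaved = [0.0_f32, 2.0, 1.0, 3.0];
    let sequential = uninterleave_qk(&interleaved, 1, 4, 1);
    assert_eq!(sequential, vec![0.0, 1.0, 2.0, 3.0]);
    println!("ok");
}
```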
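The causal padding formula can be checked with a tiny helper (assuming the standard effective kernel size `dilation * (kernel_size - 1) + 1`; the function name is illustrative):

```rust
/// Left padding for a causal 1-D convolution (no right padding), so that
/// each output frame depends only on current and past inputs.
fn causal_left_pad(kernel_size: usize, stride: usize, dilation: usize) -> usize {
    let effective_kernel = dilation * (kernel_size - 1) + 1;
    effective_kernel - stride
}

fn main() {
    assert_eq!(causal_left_pad(3, 2, 1), 1); // NOT (3 - 1) * 1 = 2
    assert_eq!(causal_left_pad(3, 1, 1), 2); // stride 1: coincides with (k-1)*d
    assert_eq!(causal_left_pad(5, 1, 2), 8); // effective kernel 9, stride 1
    println!("ok");
}
```

Note that for `stride = 1` the two formulas coincide, which is why the bug only shows up for strided causal convolutions.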