High-performance Rust inference engine for 1-bit BitNet LLMs.
Warning
Pre-alpha (v0.2.1-dev). QK256 uses scalar kernels (~0.1 tok/s on 2B models). GPU backends are scaffolded but not validated. Significant correctness, performance, and validation work remains. Do not use in production.
# Build (always specify features — defaults are empty)
cargo build --locked --no-default-features --features cpu
# Download a model
cargo run --locked -p xtask -- download-model --id microsoft/bitnet-b1.58-2B-4T-gguf
# Run inference
RUST_LOG=warn cargo run --locked -p bitnet-cli --no-default-features --features cpu,full-cli -- run \
--model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
--tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.json \
--prompt "What is 2+2?" --max-tokens 8
# Interactive chat (auto-detects prompt template)
RUST_LOG=warn cargo run --locked -p bitnet-cli --no-default-features --features cpu,full-cli -- chat \
--model models/microsoft-bitnet-b1.58-2B-4T-gguf/ggml-model-i2_s.gguf \
--tokenizer models/microsoft-bitnet-b1.58-2B-4T-gguf/tokenizer.jsonSee docs/quickstart.md for the full getting-started guide.
| Feature | State | Notes |
|---|---|---|
| CPU inference (I2_S BitNet32) | Working | Production path; SIMD-optimised |
| CPU inference (I2_S QK256) | Working | Scalar only (~0.1 tok/s on 2B); AVX2 foundation merged |
| Cross-validation vs C++ | Working | Per-token logits comparison framework |
| Honest-compute receipts | Working | Schema v1.0.0, 8 validation gates |
| Interactive chat (REPL) | Working | Auto-template detection, /help, /clear, /metrics |
| SafeTensors to GGUF export | Working | bitnet-st2gguf with F16 LayerNorm preservation |
| 59+ prompt templates | Working | LLaMA-3, Phi-4, Qwen, Gemma, Mistral, DeepSeek, etc. |
| GPU inference (CUDA) | Scaffold | Feature-gated; receipt validation pending |
| GPU backends (Metal/Vulkan/OpenCL/ROCm) | Scaffold | Stubs only; not validated |
| Server / HTTP API | Incomplete | Health endpoints wired; inference endpoints have TODOs |
bitnet-tokenizers ──────────────────────────────────────┐
│
bitnet-models (GGUF loader, dual I2_S flavor detection) │
└── bitnet-quantization (I2_S / TL1 / TL2 / IQ2_S) │
└── bitnet-kernels (AVX2 / AVX-512 / NEON / CUDA)│
▼
bitnet-inference (autoregressive engine)
├── bitnet-logits / bitnet-sampling / bitnet-generation
├── bitnet-prompt-templates (59+ variants)
└── bitnet-receipts (honest-compute schema)
│
┌──────────────┴──────────────┐
bitnet-cli bitnet-server
The workspace contains ~200 crates. See docs/architecture-overview.md for details.
cargo build --locked --no-default-features --features cpu # CPU (development)
cargo build --locked --no-default-features --features gpu # GPU (requires CUDA 12.x)
RUSTFLAGS="-C target-cpu=native -C opt-level=3 -C lto=thin" \
cargo build --locked --release --no-default-features --features cpu,full-cli # optimised release| Flag | Purpose |
|---|---|
cpu |
SIMD-optimised CPU inference (AVX2/AVX-512/NEON) |
gpu |
GPU umbrella — CUDA backend |
full-cli |
Enable all CLI subcommands |
ffi |
C++ FFI bridge for cross-validation |
fixtures |
GGUF fixture-based integration tests (test-only) |
Nix: nix develop && nix build .#bitnet-cli && nix flake check — see Nix guide.
# Run all enabled tests (recommended)
cargo nextest run --locked --workspace --no-default-features --features cpu
# CI profile (4 threads, no retries)
cargo nextest run --locked --profile ci
# Lint
cargo fmt --all && cargo clippy --locked --all-targets --no-default-features --features cpu -- -D warnings~58,700 test annotations spanning unit, property-based (proptest), snapshot (insta), fixture, fuzz (109 targets), and BDD categories. ~2,800 tests are intentionally #[ignore]-d with justification strings. See docs/development/test-suite.md.
Organised by Diataxis:
| Section | Contents |
|---|---|
| Tutorials | Getting started, first inference, tokenizer discovery |
| How-to | Install, run inference, export GGUF, cross-validate, validate models |
| Explanation | Architecture, quantization formats, feature flags |
| Reference | Inference CLI, Cross-validation CLI, environment variables, quantization |
See CONTRIBUTING.md. Before opening a PR:
./ci/local.sh # or: cargo fmt --all && cargo clippy --locked ... && cargo nextest run --locked ...Developer workflow boundary: new internal maintenance commands go in xtask; bitnet-task exists only to preserve legacy scripts/*.sh entrypoints while the migration is in flight.
See ROADMAP.md for project direction.
Dual-licensed under MIT and Apache 2.0.