tessera-rs is a Rust-first LLM inference serving framework inspired by vLLM, with a phased roadmap that starts from a focused single-GPU serving core and grows toward multi-GPU, multi-node, and adaptive serving at scale.
- Serve real decoder-only workloads with a pragmatic systems-first design.
- Reuse high-performance kernels and distributed communication backends rather than rewriting everything from scratch.
- Build in public as a proper OSS project with visible milestones, issues, CI, Docker, and Kubernetes support.
- Differentiate with TAC (Tessera Adaptive Controller), an adaptive control-loop initiative focused on goodput and latency SLOs.
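
To make "goodput and latency SLOs" concrete, here is a minimal sketch of one common way to define goodput: completed requests per second, counting only requests that met every latency target. The `LatencySlo` and `RequestSample` types and the `goodput_rps` function are hypothetical names for illustration. They are not part of tessera-rs and do not describe how TAC works.

```rust
use std::time::Duration;

/// Latency targets a request must meet to count toward goodput.
/// Hypothetical type for illustration; not part of tessera-rs.
#[derive(Debug, Clone, Copy)]
pub struct LatencySlo {
    pub max_ttft: Duration,
    pub max_mean_itl: Duration,
}

/// Observed latencies for one finished request (hypothetical shape).
#[derive(Debug, Clone, Copy)]
pub struct RequestSample {
    pub ttft: Duration,
    pub mean_itl: Duration,
}

/// Goodput in requests per second over `window`.
/// Only requests that met both latency targets are counted.
pub fn goodput_rps(samples: &[RequestSample], slo: LatencySlo, window: Duration) -> f64 {
    if window.is_zero() {
        return 0.0;
    }
    let good = samples
        .iter()
        .filter(|s| s.ttft <= slo.max_ttft && s.mean_itl <= slo.max_mean_itl)
        .count();
    good as f64 / window.as_secs_f64()
}
```

Under this definition a server can have high raw throughput and low goodput at the same time, which is the gap an adaptive controller would try to close.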
The repository is currently in the early stage of M1 (Single GPU Serving Core).
The M0 OSS Foundation work is in place, and the codebase now also has a real first serving slice:
- request lifecycle and scheduler primitives
- paged KV metadata plus Tessera-owned backend KV storage for the current Llama path
- a concrete single-GPU decoder-only serving loop
- OpenAI-compatible completions with streaming
- TTFT, ITL, throughput, and request outcome metrics
- Docker, CI, and repository governance scaffolding from M0
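
For readers unfamiliar with the latency metrics listed above, the sketch below shows how TTFT (time to first token) and ITL (inter-token latency) are conventionally derived from per-token emission timestamps. The function names are hypothetical and chosen for illustration; this is not the metrics code in tessera-rs.

```rust
use std::time::{Duration, Instant};

/// Time to first token: delay between request arrival and the first emitted token.
/// Returns None when no token was emitted.
pub fn ttft(arrival: Instant, token_times: &[Instant]) -> Option<Duration> {
    token_times.first().map(|first| first.duration_since(arrival))
}

/// Inter-token latencies: gaps between consecutive emitted tokens.
pub fn inter_token_latencies(token_times: &[Instant]) -> Vec<Duration> {
    token_times
        .windows(2)
        .map(|pair| pair[1].duration_since(pair[0]))
        .collect()
}

/// Mean ITL, or None when fewer than two tokens were emitted.
pub fn mean_itl(token_times: &[Instant]) -> Option<Duration> {
    let gaps = inter_token_latencies(token_times);
    if gaps.is_empty() {
        return None;
    }
    let total: Duration = gaps.iter().sum();
    Some(total / gaps.len() as u32)
}
```

TTFT is dominated by queueing and prefill, while ITL reflects steady-state decode speed, so the two are usually reported separately.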
- M0: OSS Foundation
- M1: Single GPU Serving Core
- M2: Latency and Throughput Optimizations
- M3: Single-Node Multi-GPU
- M4: Static Multi-Node Serving
- M5: Elastic Multi-Node Scaling
- M6: Disaggregated Prefill and Decode
- M7: Speculative Decoding
- M8: Hybrid Attention and Cache Layouts
- M9: Mixture of Experts
- M10: General Model Support
- M11: Guided Decoding
- TAC: Tessera Adaptive Controller
See ROADMAP.md for the detailed milestone view and seeded issues.
cargo fmt --all
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
cargo run -p tessera-server

CUDA builds are opt-in. In a CUDA-capable environment, use:
cargo check -p tessera-engine --features cuda
TESSERA_DEVICE=cuda:0 cargo run -p tessera-server --features cuda

You can also use the root Makefile for the common local flow:
make download-llama
make server
make test-normal-completion
make test-stream-completion
make metrics-json

For CUDA via make:
make server CARGO_FEATURES=cuda DEVICE=cuda:0
make check-cuda

The bootstrap service listens on http://127.0.0.1:8080.
Useful endpoints:
- GET /
- GET /healthz
- GET /readyz
- GET /version
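
As a quick way to check these endpoints from Rust without extra dependencies, the following std-only sketch sends a plain HTTP/1.1 GET and reads back the status code. The helper names (`build_get_request`, `parse_status_code`, `probe`) are hypothetical, and the README does not specify what each endpoint returns, so the sketch reports only the status code.

```rust
use std::io::{self, Read, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Build a minimal HTTP/1.1 GET request that asks the server to close the connection.
pub fn build_get_request(host: &str, path: &str) -> String {
    format!(
        "GET {} HTTP/1.1\r\nHost: {}\r\nConnection: close\r\n\r\n",
        path, host
    )
}

/// Extract the numeric status code from a response such as "HTTP/1.1 200 OK\r\n...".
pub fn parse_status_code(response: &str) -> Option<u16> {
    let status_line = response.lines().next()?;
    status_line.split_whitespace().nth(1)?.parse().ok()
}

/// Send one GET to `addr` (for example "127.0.0.1:8080") and return the status code, if any.
pub fn probe(addr: &str, path: &str) -> io::Result<Option<u16>> {
    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;
    stream.write_all(build_get_request(addr, path).as_bytes())?;
    // "Connection: close" lets us read until the server closes the socket.
    let mut raw = Vec::new();
    stream.read_to_end(&mut raw)?;
    Ok(parse_status_code(&String::from_utf8_lossy(&raw)))
}
```

With the server running locally, `probe("127.0.0.1:8080", "/healthz")` returns the status code of the health endpoint, or an I/O error if nothing is listening.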
docker build -t tessera-rs:dev .
docker run --rm -p 8080:8080 tessera-rs:dev

helm install tessera deploy/helm/tessera

- crates/tessera-engine-core - serving control-plane primitives such as request lifecycle, scheduler, sampler, and KV metadata
- crates/tessera-engine - concrete runtime integration layer
  - runtime/ owns session orchestration and output emission
  - backend/ owns backend traits and selection/loading
  - backends/llama/ owns the current Llama-family implementation
- crates/tessera-server - HTTP/OpenAI-compatible serving surface, metrics endpoints, server bootstrap, and tracing setup
- docs/architecture - current implementation notes such as the end-to-end request flow
- docs/rfcs - accepted and proposed design documents
- .github - CI, issue templates, pull request template, and roadmap bootstrap assets
- deploy/helm/tessera - Helm chart skeleton
- scripts/ - GitHub bootstrap and workflow automation
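
The "KV metadata" listed for tessera-engine-core can be pictured as a block table in the style of paged attention: each sequence owns a list of fixed-size physical blocks, and a token position maps to a (block, offset) pair. The sketch below is a generic illustration of that idea. `BlockTable`, its fields, and its methods are hypothetical and are not taken from the tessera-rs codebase.

```rust
use std::collections::HashMap;

/// Hypothetical paged KV metadata: tracks which fixed-size blocks each sequence owns.
pub struct BlockTable {
    block_size: usize,
    free_blocks: Vec<usize>,
    seq_blocks: HashMap<u64, Vec<usize>>,
    seq_lens: HashMap<u64, usize>,
}

impl BlockTable {
    pub fn new(num_blocks: usize, block_size: usize) -> Self {
        assert!(block_size > 0, "block_size must be non-zero");
        Self {
            block_size,
            // Reversed so that pop() hands out block 0 first.
            free_blocks: (0..num_blocks).rev().collect(),
            seq_blocks: HashMap::new(),
            seq_lens: HashMap::new(),
        }
    }

    /// Reserve room for one more token of `seq_id`.
    /// Returns (physical block, offset in block), or None when a new block
    /// is needed and the pool is exhausted.
    pub fn append_token(&mut self, seq_id: u64) -> Option<(usize, usize)> {
        let len = self.seq_lens.get(&seq_id).copied().unwrap_or(0);
        let offset = len % self.block_size;
        if offset == 0 {
            // The current block is full (or the sequence is new): take a fresh block.
            let block = self.free_blocks.pop()?;
            self.seq_blocks.entry(seq_id).or_default().push(block);
        }
        let block = *self.seq_blocks.get(&seq_id)?.last()?;
        self.seq_lens.insert(seq_id, len + 1);
        Some((block, offset))
    }

    /// Return all blocks of a finished sequence to the free pool.
    pub fn release(&mut self, seq_id: u64) {
        if let Some(blocks) = self.seq_blocks.remove(&seq_id) {
            self.free_blocks.extend(blocks);
        }
        self.seq_lens.remove(&seq_id);
    }

    /// Number of blocks currently available for allocation.
    pub fn free_block_count(&self) -> usize {
        self.free_blocks.len()
    }
}
```

Keeping this bookkeeping separate from the backend's actual tensor storage is what lets a scheduler reason about memory pressure without touching GPU buffers.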
This is the intended end-state structure for the full project. Not every directory or crate needs to exist immediately; add them when the roadmap actually demands them.
tessera-rs/
├── .github/
│ ├── bootstrap/ # roadmap + project board source of truth
│ ├── workflows/ # CI/CD workflows
│ └── ISSUE_TEMPLATE/ # issue forms and templates
├── crates/
│ ├── tessera-server/ # HTTP/OpenAI-compatible server entrypoint
│ ├── tessera-cli/ # admin, dev, and utility CLI
│ ├── tessera-core/ # shared config, errors, common types
│ ├── tessera-protocol/ # API and control-plane message types
│ ├── tessera-engine/ # public engine facade used by server/CLI
│ ├── tessera-engine-core/ # request lifecycle, scheduling, token budgets
│ ├── tessera-kv-cache/ # paged KV cache, allocators, prefix cache
│ ├── tessera-executor/ # executor traits and worker abstractions
│ ├── tessera-backend-cuda/ # CUDA/NCCL/FlashAttention backend bindings
│ ├── tessera-sampler/ # sampling, stop conditions, specdec, guided decoding
│ ├── tessera-models/ # model family adapters and capability registry
│ ├── tessera-control-plane/ # multi-node routing, admission, coordination
│ ├── tessera-tac/ # Tessera Adaptive Controller
│ └── tessera-bench/ # benchmarking and load-generation tooling
├── configs/
│ ├── local/ # local and development configs
│ ├── single-node/ # single-node serving profiles
│ └── multi-node/ # distributed serving profiles
├── deploy/
│ ├── helm/tessera/ # Helm chart
│ ├── compose/ # local multi-service orchestration
│ └── k8s/examples/ # concrete Kubernetes examples
├── docs/
│ ├── architecture/ # subsystem-specific architecture docs
│ ├── benchmarks/ # benchmark methodology and results
│ ├── operations/ # deployment and maintainer workflows
│ └── rfcs/ # design proposals and accepted RFCs
├── examples/
│ ├── single-gpu/ # minimal local serving examples
│ ├── multi-gpu/ # tensor-parallel examples
│ ├── multi-node/ # API + worker examples
│ ├── structured-output/ # guided decoding examples
│ └── speculative-decoding/ # speculative decoding examples
├── scripts/ # GitHub/bootstrap/dev automation
├── tests/
│ ├── integration/ # crate-level integration tests
│ ├── distributed/ # multi-process and multi-node tests
│ ├── e2e/ # HTTP/API end-to-end tests
│ └── fixtures/ # prompts, configs, and test data
├── ARCHITECTURE.md
├── CONTRIBUTING.md
├── ROADMAP.md
├── Cargo.toml
└── README.md
- Binary entrypoints should stay in focused crates like tessera-server, tessera-cli, and tessera-bench.
- Engine internals should be split by responsibility: engine core, KV-cache, executor, sampler, models, and control-plane.
- tessera-tac should remain its own crate so the adaptive controller can evolve independently from the serving core.
- Deployment, operational docs, benchmark methodology, and GitHub automation should live outside the Rust crates.
- Avoid creating empty crates just to match the final tree; the structure is a guide for evolution, not a requirement to scaffold everything immediately.
The repository includes a checked-in source of truth for milestones, labels, seeded issues, epic tasklists, and the GitHub Project board:
- .github/bootstrap/roadmap.json
- .github/bootstrap/project.json
- scripts/bootstrap_github.sh
- scripts/bootstrap_project.sh
- scripts/sync_epic_tasklists.sh
- scripts/sync_project_fields.sh
With GitHub CLI authenticated, you can bootstrap the roadmap like this:
./scripts/bootstrap_github.sh
./scripts/bootstrap_project.sh

Start with:
- CONTRIBUTING.md
- ROADMAP.md
- ARCHITECTURE.md
- docs/operations/issue-workflow.md
Apache-2.0. See LICENSE.