Ayobami-00/tessera-rs

tessera-rs

tessera-rs is a Rust-first LLM inference serving framework inspired by vLLM, with a phased roadmap that starts from a focused single-GPU serving core and grows toward multi-GPU, multi-node, and adaptive serving at scale.

Project goals

  • Serve real decoder-only workloads with a pragmatic systems-first design.
  • Reuse high-performance kernels and distributed communication backends rather than rewriting everything from scratch.
  • Build in public as a proper OSS project with visible milestones, issues, CI, Docker, and Kubernetes support.
  • Differentiate with TAC (Tessera Adaptive Controller), an adaptive control-loop initiative focused on goodput and latency SLOs.

Current phase

The repository is currently in the early stage of M1 (Single GPU Serving Core).

The M0 OSS Foundation work is in place, and the codebase now also has a real first serving slice:

  • request lifecycle and scheduler primitives
  • paged KV metadata plus Tessera-owned backend KV storage for the current Llama path
  • a concrete single-GPU decoder-only serving loop
  • OpenAI-compatible completions with streaming
  • TTFT, ITL, throughput, and request outcome metrics
  • Docker, CI, and repository governance scaffolding from M0
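To make the latency metrics in the list above concrete, here is a minimal sketch of how TTFT, ITL, and throughput can be derived from per-token timestamps. The `RequestLatency` type and `summarize` function are hypothetical names used for illustration only; they are not taken from the tessera-rs codebase, and throughput is computed here as output tokens divided by the time to the last token, which is only one of several reasonable definitions.

```rust
use std::time::Duration;

/// Latency summary for one streamed request (illustrative type, not from tessera-rs).
#[derive(Debug, PartialEq)]
pub struct RequestLatency {
    /// Time to first token: request arrival until the first output token.
    pub ttft: Duration,
    /// Mean inter-token latency over the gaps between consecutive tokens.
    /// `None` when fewer than two tokens were emitted.
    pub mean_itl: Option<Duration>,
    /// Output tokens per second, measured up to the last token.
    pub tokens_per_sec: f64,
}

/// `token_times` holds each token's offset from request arrival, in emission
/// order. Returns `None` when the request produced no tokens.
pub fn summarize(token_times: &[Duration]) -> Option<RequestLatency> {
    let first = *token_times.first()?;
    let last = *token_times.last()?;

    // N tokens have N - 1 gaps between them.
    let gaps = token_times.len() - 1;
    let mean_itl = if gaps > 0 {
        Some(last.saturating_sub(first) / gaps as u32)
    } else {
        None
    };

    let total_secs = last.as_secs_f64();
    let tokens_per_sec = if total_secs > 0.0 {
        token_times.len() as f64 / total_secs
    } else {
        0.0
    };

    Some(RequestLatency { ttft: first, mean_itl, tokens_per_sec })
}
```

For a request whose tokens arrive at 100 ms, 150 ms, 200 ms, and 250 ms, this yields a TTFT of 100 ms, a mean ITL of 50 ms, and 16 tokens per second.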

Roadmap highlights

  • M0 OSS Foundation
  • M1 Single GPU Serving Core
  • M2 Latency and Throughput Optimizations
  • M3 Single-Node Multi-GPU
  • M4 Static Multi-Node Serving
  • M5 Elastic Multi-Node Scaling
  • M6 Disaggregated Prefill and Decode
  • M7 Speculative Decoding
  • M8 Hybrid Attention and Cache Layouts
  • M9 Mixture of Experts
  • M10 General Model Support
  • M11 Guided Decoding
  • TAC Tessera Adaptive Controller

See ROADMAP.md for the detailed milestone view and seeded issues.

Quickstart

Local Rust workflow

cargo fmt --all
cargo clippy --workspace --all-targets -- -D warnings
cargo test --workspace
cargo run -p tessera-server

CUDA builds are opt-in. In a CUDA-capable environment, use:

cargo check -p tessera-engine --features cuda
TESSERA_DEVICE=cuda:0 cargo run -p tessera-server --features cuda
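
The `TESSERA_DEVICE=cuda:0` value above follows the common `kind:ordinal` convention for device strings. As a rough illustration of how such a value might be interpreted, the sketch below parses it into an enum. The `Device` type, both function names, and the fallback to CPU when the variable is unset are assumptions made for this example; the actual parsing inside tessera-rs may differ.

```rust
use std::env;

/// Target device for inference (hypothetical type for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Device {
    Cpu,
    /// CUDA device with the given ordinal, e.g. `cuda:0`.
    Cuda(usize),
}

/// Parses strings such as `cpu` or `cuda:0`.
pub fn parse_device(spec: &str) -> Result<Device, String> {
    let spec = spec.trim();
    if spec.eq_ignore_ascii_case("cpu") {
        return Ok(Device::Cpu);
    }
    match spec.split_once(':') {
        Some((kind, index)) if kind.eq_ignore_ascii_case("cuda") => index
            .parse::<usize>()
            .map(Device::Cuda)
            .map_err(|e| format!("invalid CUDA ordinal {index:?}: {e}")),
        _ => Err(format!("unrecognized device spec {spec:?}")),
    }
}

/// Reads `TESSERA_DEVICE`; falling back to CPU when it is unset is an
/// assumption of this sketch, not documented project behavior.
pub fn device_from_env() -> Result<Device, String> {
    match env::var("TESSERA_DEVICE") {
        Ok(spec) => parse_device(&spec),
        Err(_) => Ok(Device::Cpu),
    }
}
```

Rejecting unknown strings with an error, rather than silently falling back, makes a typo such as `cuda0` fail at startup instead of serving on the wrong device.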

You can also use the root Makefile for the common local flow:

make download-llama
make server
make test-normal-completion
make test-stream-completion
make metrics-json

For CUDA via make:

make server CARGO_FEATURES=cuda DEVICE=cuda:0
make check-cuda

The bootstrap service listens on http://127.0.0.1:8080.

Useful endpoints:

  • GET /
  • GET /healthz
  • GET /readyz
  • GET /version
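
A quick way to check these endpoints from Rust without extra dependencies is a bare HTTP/1.1 request over a TCP socket. The sketch below uses only the standard library; `probe` and `parse_status` are hypothetical helper names, and the code assumes nothing about the response beyond a standard HTTP status line.

```rust
use std::io::{self, BufRead, BufReader, Write};
use std::net::TcpStream;
use std::time::Duration;

/// Extracts the numeric status code from an HTTP status line
/// such as `HTTP/1.1 200 OK`.
pub fn parse_status(status_line: &str) -> Option<u16> {
    status_line.split_whitespace().nth(1)?.parse().ok()
}

/// Sends `GET <path>` to `addr` (for example `127.0.0.1:8080`) and returns
/// the status code of the response.
pub fn probe(addr: &str, path: &str) -> io::Result<u16> {
    let mut stream = TcpStream::connect(addr)?;
    stream.set_read_timeout(Some(Duration::from_secs(2)))?;

    // Minimal request; `Connection: close` asks the server not to keep
    // the socket open after responding.
    let request = format!("GET {path} HTTP/1.1\r\nHost: {addr}\r\nConnection: close\r\n\r\n");
    stream.write_all(request.as_bytes())?;

    // Only the first line of the response is needed for the status code.
    let mut status_line = String::new();
    BufReader::new(stream).read_line(&mut status_line)?;

    parse_status(&status_line)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "malformed status line"))
}
```

With the server running locally, `probe("127.0.0.1:8080", "/healthz")` returns the status code reported by the health endpoint, and the same call works for `/readyz` and `/version`.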

Docker

docker build -t tessera-rs:dev .
docker run --rm -p 8080:8080 tessera-rs:dev

Helm

helm install tessera deploy/helm/tessera

Repository structure

Current working layout

  • crates/tessera-engine-core
    • serving control-plane primitives such as request lifecycle, scheduler, sampler, and KV metadata
  • crates/tessera-engine
    • concrete runtime integration layer
    • runtime/ owns session orchestration and output emission
    • backend/ owns backend traits and selection/loading
    • backends/llama/ owns the current Llama-family implementation
  • crates/tessera-server
    • HTTP/OpenAI-compatible serving surface, metrics endpoints, server bootstrap, and tracing setup
  • docs/architecture
    • current implementation notes such as the end-to-end request flow
  • docs/rfcs
    • accepted and proposed design documents
  • .github
    • CI, issue templates, pull request template, and roadmap bootstrap assets
  • deploy/helm/tessera
    • Helm chart skeleton
  • scripts/
    • GitHub bootstrap and workflow automation
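
The "KV metadata" owned by tessera-engine-core refers to paged KV-cache bookkeeping. The general technique can be sketched as a free list of fixed-size physical blocks plus a per-request block table that maps token positions to `(block, offset)` pairs. Everything below, including the `BLOCK_SIZE` value and the `BlockAllocator` and `BlockTable` names, is a hypothetical illustration of that technique rather than the crate's actual types.

```rust
/// Number of token slots per physical KV block (illustrative value).
pub const BLOCK_SIZE: usize = 16;

/// Free list over a fixed pool of physical block ids.
pub struct BlockAllocator {
    free: Vec<usize>,
}

impl BlockAllocator {
    pub fn new(num_blocks: usize) -> Self {
        // Reversed so that `pop` hands out block 0 first.
        Self { free: (0..num_blocks).rev().collect() }
    }

    pub fn allocate(&mut self) -> Option<usize> {
        self.free.pop()
    }

    pub fn release(&mut self, block: usize) {
        self.free.push(block);
    }

    pub fn free_blocks(&self) -> usize {
        self.free.len()
    }
}

/// Per-request mapping from logical token position to physical storage.
#[derive(Default)]
pub struct BlockTable {
    blocks: Vec<usize>,
    len: usize,
}

impl BlockTable {
    /// Reserves a slot for one more token, taking a new block only when the
    /// current one is full. Returns `None` if the pool is exhausted, in
    /// which case the table is left unchanged.
    pub fn append_token(&mut self, alloc: &mut BlockAllocator) -> Option<(usize, usize)> {
        let offset = self.len % BLOCK_SIZE;
        if offset == 0 {
            self.blocks.push(alloc.allocate()?);
        }
        self.len += 1;
        Some((*self.blocks.last()?, offset))
    }

    /// Looks up where the token at `pos` is stored.
    pub fn locate(&self, pos: usize) -> Option<(usize, usize)> {
        if pos >= self.len {
            return None;
        }
        Some((self.blocks[pos / BLOCK_SIZE], pos % BLOCK_SIZE))
    }

    /// Returns every block to the pool when the request finishes.
    pub fn free(&mut self, alloc: &mut BlockAllocator) {
        for block in self.blocks.drain(..) {
            alloc.release(block);
        }
        self.len = 0;
    }
}
```

Because blocks are allocated on demand, a request only holds memory proportional to the tokens it has actually produced, and a scheduler can treat a failed `append_token` as a signal to delay or preempt work.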

Target long-term layout

This is the intended end-state structure for the full project. Not every directory or crate needs to exist immediately; add them when the roadmap actually demands them.

tessera-rs/
├── .github/
│   ├── bootstrap/                     # roadmap + project board source of truth
│   ├── workflows/                     # CI/CD workflows
│   └── ISSUE_TEMPLATE/                # issue forms and templates
├── crates/
│   ├── tessera-server/                # HTTP/OpenAI-compatible server entrypoint
│   ├── tessera-cli/                   # admin, dev, and utility CLI
│   ├── tessera-core/                  # shared config, errors, common types
│   ├── tessera-protocol/              # API and control-plane message types
│   ├── tessera-engine/                # public engine facade used by server/CLI
│   ├── tessera-engine-core/           # request lifecycle, scheduling, token budgets
│   ├── tessera-kv-cache/              # paged KV cache, allocators, prefix cache
│   ├── tessera-executor/              # executor traits and worker abstractions
│   ├── tessera-backend-cuda/          # CUDA/NCCL/FlashAttention backend bindings
│   ├── tessera-sampler/               # sampling, stop conditions, specdec, guided decoding
│   ├── tessera-models/                # model family adapters and capability registry
│   ├── tessera-control-plane/         # multi-node routing, admission, coordination
│   ├── tessera-tac/                   # Tessera Adaptive Controller
│   └── tessera-bench/                 # benchmarking and load-generation tooling
├── configs/
│   ├── local/                         # local and development configs
│   ├── single-node/                   # single-node serving profiles
│   └── multi-node/                    # distributed serving profiles
├── deploy/
│   ├── helm/tessera/                  # Helm chart
│   ├── compose/                       # local multi-service orchestration
│   └── k8s/examples/                  # concrete Kubernetes examples
├── docs/
│   ├── architecture/                  # subsystem-specific architecture docs
│   ├── benchmarks/                    # benchmark methodology and results
│   ├── operations/                    # deployment and maintainer workflows
│   └── rfcs/                          # design proposals and accepted RFCs
├── examples/
│   ├── single-gpu/                    # minimal local serving examples
│   ├── multi-gpu/                     # tensor-parallel examples
│   ├── multi-node/                    # API + worker examples
│   ├── structured-output/             # guided decoding examples
│   └── speculative-decoding/          # speculative decoding examples
├── scripts/                           # GitHub/bootstrap/dev automation
├── tests/
│   ├── integration/                   # crate-level integration tests
│   ├── distributed/                   # multi-process and multi-node tests
│   ├── e2e/                           # HTTP/API end-to-end tests
│   └── fixtures/                      # prompts, configs, and test data
├── ARCHITECTURE.md
├── CONTRIBUTING.md
├── ROADMAP.md
├── Cargo.toml
└── README.md

Structure conventions

  • Binary entrypoints should stay in focused crates like tessera-server, tessera-cli, and tessera-bench.
  • Engine internals should be split by responsibility: engine core, KV-cache, executor, sampler, models, and control-plane.
  • tessera-tac should remain its own crate so the adaptive controller can evolve independently from the serving core.
  • Deployment, operational docs, benchmark methodology, and GitHub automation should live outside the Rust crates.
  • Avoid creating empty crates just to match the final tree; the structure is a guide for evolution, not a requirement to scaffold everything immediately.

GitHub roadmap bootstrap

The repository includes a checked-in source of truth for milestones, labels, seeded issues, epic tasklists, and the GitHub Project board:

  • .github/bootstrap/roadmap.json
  • .github/bootstrap/project.json
  • scripts/bootstrap_github.sh
  • scripts/bootstrap_project.sh
  • scripts/sync_epic_tasklists.sh
  • scripts/sync_project_fields.sh

With GitHub CLI authenticated, you can bootstrap the roadmap like this:

./scripts/bootstrap_github.sh
./scripts/bootstrap_project.sh

Contributing

Start with:

  • CONTRIBUTING.md
  • ROADMAP.md
  • ARCHITECTURE.md
  • docs/operations/issue-workflow.md

License

Apache-2.0. See LICENSE.

About

A minimal high-throughput Rust LLM serving engine inspired by vLLM
