One endpoint to rule all LLMs. Point Void (or any OpenAI-compatible client) at a local proxy server, and the router automatically picks the best available model, retries on failures, and falls back across providers — all transparently.
- Single endpoint — one
r.ask("...")call or one HTTP endpoint for everything - 10 providers out of the box — Google, Cohere, Mistral, OpenRouter, Cerebras, Groq, Hugging Face, Kilo Code, llm7.io, NVIDIA NIM
- Automatic routing — tier-based model selection, direct providers preferred over meta-providers
- Retry & fallback — exponential backoff, cascades to next model on failure
- Key rotation — multiple keys per provider with automatic cooldown on rate limits
- Quality modes —
quality,fast,cheap,code - OpenAI-compatible server — plug straight into Void, Continue, or any editor that supports custom endpoints
- SQLite logging — every request and response logged locally
git clone https://github.com/aravv27/llm-router
cd llm-router
pip install -e .Copy the example config and fill in your API keys:
cp router.yaml.example router.yamlThen edit router.yaml — add your keys and whitelist the models you want to use:
keys:
google: [YOUR_GOOGLE_API_KEY]
openrouter: [YOUR_OPENROUTER_KEY]
groq: [YOUR_GROQ_KEY]
# ... etc
models:
- id: gemini-2.5-flash
provider: google
tier: 1
capabilities: [chat, code, vision, tool_use]
context_window: 1048576
- id: openai/gpt-oss-120b:free
provider: openrouter
tier: 1
capabilities: [chat, code]
context_window: 32000
- id: llama-3.3-70b-versatile
provider: groq
tier: 2
capabilities: [chat, code]
context_window: 131072You control quality. Only models you list will ever be used. The router never picks a model outside your whitelist.
from router import Router
r = Router.from_config("router.yaml")
# Simple call — router picks best available model
answer = r.ask("Explain quantum entanglement in 3 sentences")
print(answer.text)
print(f" → used {answer.model} via {answer.provider} in {answer.latency:.2f}s")
# Streaming
for chunk in r.stream("Write a Python web scraper"):
print(chunk.delta, end="", flush=True)
# Mode selection
answer = r.ask("Fix this bug", mode="code") # prefer code-capable models
answer = r.ask("Summarize this", mode="fast") # prefer lowest-latency providers
answer = r.ask("Translate this", mode="cheap") # prefer lowest-cost models
# Pin to a specific model
answer = r.ask("Hello", model="gemini-2.5-flash")
# With a system prompt
answer = r.ask(
"Review this code",
system="You are a senior Python engineer.",
temperature=0.2,
)python -m router.server --config router.yaml
# ➜ Listening on http://127.0.0.1:8787Then in Void's settings, set:
API Base URL: http://127.0.0.1:8787/v1
API Key: not-needed
Model: auto
The server exposes a fully OpenAI-compatible API — Void thinks it's talking to OpenAI.
| Provider | Type | Notes |
|---|---|---|
| Direct | Gemini 2.5 Pro/Flash | |
| Cohere | Direct | Command A, Command R+ |
| Mistral | Direct | Mistral Large, Codestral |
| Cerebras | Direct | Llama, Qwen — ultra-fast inference |
| Groq | Direct | Llama, Qwen — ultra-fast inference |
| Hugging Face | Direct | Open-source models via Inference API |
| NVIDIA NIM | Direct | GLM, Mistral Nemotron, MiniMax |
| llm7.io | Direct | Free-tier models |
| OpenRouter | Meta | Gateway to 200+ models (GPT-4.1, Claude, etc.) |
| Kilo Code | Meta | Gateway with free-tier model access |
Meta-providers (OpenRouter, Kilo) are used as fallback — direct providers are always tried first.
1. Filter by capability → vision? tool_use? json_mode?
2. Filter by context → prompt fits in the model's window?
3. Sort by tier → quality: 1→2→3 | cheap/fast: 3→2→1
4. Prefer direct → direct providers before OpenRouter/Kilo
5. Round-robin → spread load across models in the same tier
6. On failure → retry once with backoff, then cascade to next
keys:
google: [key1, key2] # Multiple keys → automatic rotation
openrouter: [key1]
models:
- id: gemini-2.5-flash # Provider's native model ID
provider: google
tier: 1 # 1 = best quality, 2 = good, 3 = fast/cheap
capabilities: # Used to match models to request requirements
- chat
- code
- vision
- tool_use
- json_mode
context_window: 1048576
settings:
default_mode: quality # quality | fast | cheap | code
max_attempts: 5 # Total attempts before giving up
timeout_seconds: 120
retry_backoff_base: 1.0 # Seconds (doubles each retry)
key_cooldown_seconds: 60 # How long to bench a rate-limited key
server:
host: 127.0.0.1
port: 8787
logging:
database: router.db # SQLite file for request history
log_level: info # debug | info | warning | errorresponse = r.ask("Hello")
response.text # The answer
response.model # Which model answered (e.g. "gemini-2.5-flash")
response.provider # Which provider (e.g. "google")
response.latency # Seconds
response.usage # .prompt_tokens, .completion_tokens, .total_tokens
response.attempts # Full trace of every attempt (model, error, latency)| Endpoint | Description |
|---|---|
POST /v1/chat/completions |
Chat completions — regular and streaming |
GET /v1/models |
List all registered models |
GET /health |
Health check |
src/router/
├── core.py # Router class — ask() and stream()
├── config.py # YAML loading and validation
├── routing.py # Model candidate selection
├── retry.py # Retry engine with backoff and fallback
├── keys.py # Key rotation and cooldown
├── models.py # Model registry
├── types.py # Response, Chunk, Attempt, Usage types
├── database.py # SQLite request logging
├── server.py # FastAPI OpenAI-compatible server
└── providers/
├── base.py # Abstract Provider interface
├── openai_compat.py # Generic adapter (9 providers)
└── google.py # Google Gemini adapter
pip install -e ".[dev]"
pytest tests/ -vMIT