LLM Router

One endpoint to rule all LLMs. Point Void (or any OpenAI-compatible client) at a local proxy server, and the router automatically picks the best available model, retries on failures, and falls back across providers — all transparently.

Features

Single endpoint — one r.ask("...") call or one HTTP endpoint for everything
10 providers out of the box — Google, Cohere, Mistral, OpenRouter, Cerebras, Groq, Hugging Face, Kilo Code, llm7.io, NVIDIA NIM
Automatic routing — tier-based model selection, direct providers preferred over meta-providers
Retry & fallback — exponential backoff, cascades to next model on failure
Key rotation — multiple keys per provider with automatic cooldown on rate limits
Quality modes — quality, fast, cheap, code
OpenAI-compatible server — plug straight into Void, Continue, or any editor that supports custom endpoints
SQLite logging — every request and response logged locally

Quick Start

1. Install

git clone https://github.com/aravv27/llm-router 
cd llm-router
pip install -e .

2. Configure

Copy the example config and fill in your API keys:

cp router.yaml.example router.yaml

Then edit router.yaml — add your keys and whitelist the models you want to use:

keys:
  google: [YOUR_GOOGLE_API_KEY]
  openrouter: [YOUR_OPENROUTER_KEY]
  groq: [YOUR_GROQ_KEY]
  # ... etc

models:
  - id: gemini-2.5-flash
    provider: google
    tier: 1
    capabilities: [chat, code, vision, tool_use]
    context_window: 1048576

  - id: openai/gpt-oss-120b:free
    provider: openrouter
    tier: 1
    capabilities: [chat, code]
    context_window: 32000

  - id: llama-3.3-70b-versatile
    provider: groq
    tier: 2
    capabilities: [chat, code]
    context_window: 131072

You control quality. Only models you list will ever be used. The router never picks a model outside your whitelist.

3. Use as a Python library

from router import Router

r = Router.from_config("router.yaml")

# Simple call — router picks best available model
answer = r.ask("Explain quantum entanglement in 3 sentences")
print(answer.text)
print(f"  → used {answer.model} via {answer.provider} in {answer.latency:.2f}s")

# Streaming
for chunk in r.stream("Write a Python web scraper"):
    print(chunk.delta, end="", flush=True)

# Mode selection
answer = r.ask("Fix this bug", mode="code")      # prefer code-capable models
answer = r.ask("Summarize this", mode="fast")    # prefer lowest-latency providers
answer = r.ask("Translate this", mode="cheap")   # prefer lowest-cost models

# Pin to a specific model
answer = r.ask("Hello", model="gemini-2.5-flash")

# With a system prompt
answer = r.ask(
    "Review this code",
    system="You are a senior Python engineer.",
    temperature=0.2,
)

4. Use as a local server (for Void / any editor)

python -m router.server --config router.yaml
# ➜ Listening on http://127.0.0.1:8787

Then in Void's settings, set:

API Base URL:  http://127.0.0.1:8787/v1
API Key:       not-needed
Model:         auto

The server exposes a fully OpenAI-compatible API — Void thinks it's talking to OpenAI.

Supported Providers

Provider	Type	Notes
Google	Direct	Gemini 2.5 Pro/Flash
Cohere	Direct	Command A, Command R+
Mistral	Direct	Mistral Large, Codestral
Cerebras	Direct	Llama, Qwen — ultra-fast inference
Groq	Direct	Llama, Qwen — ultra-fast inference
Hugging Face	Direct	Open-source models via Inference API
NVIDIA NIM	Direct	GLM, Mistral Nemotron, MiniMax
llm7.io	Direct	Free-tier models
OpenRouter	Meta	Gateway to 200+ models (GPT-4.1, Claude, etc.)
Kilo Code	Meta	Gateway with free-tier model access

Meta-providers (OpenRouter, Kilo) are used as fallback — direct providers are always tried first.

Routing Logic

1. Filter by capability   → vision? tool_use? json_mode?
2. Filter by context      → prompt fits in the model's window?
3. Sort by tier           → quality: 1→2→3 | cheap/fast: 3→2→1
4. Prefer direct          → direct providers before OpenRouter/Kilo
5. Round-robin            → spread load across models in the same tier
6. On failure             → retry once with backoff, then cascade to next

Configuration Reference

keys:
  google: [key1, key2]        # Multiple keys → automatic rotation
  openrouter: [key1]

models:
  - id: gemini-2.5-flash      # Provider's native model ID
    provider: google
    tier: 1                   # 1 = best quality, 2 = good, 3 = fast/cheap
    capabilities:             # Used to match models to request requirements
      - chat
      - code
      - vision
      - tool_use
      - json_mode
    context_window: 1048576

settings:
  default_mode: quality       # quality | fast | cheap | code
  max_attempts: 5             # Total attempts before giving up
  timeout_seconds: 120
  retry_backoff_base: 1.0     # Seconds (doubles each retry)
  key_cooldown_seconds: 60    # How long to bench a rate-limited key

server:
  host: 127.0.0.1
  port: 8787

logging:
  database: router.db         # SQLite file for request history
  log_level: info             # debug | info | warning | error

Response Object

response = r.ask("Hello")

response.text          # The answer
response.model         # Which model answered (e.g. "gemini-2.5-flash")
response.provider      # Which provider (e.g. "google")
response.latency       # Seconds
response.usage         # .prompt_tokens, .completion_tokens, .total_tokens
response.attempts      # Full trace of every attempt (model, error, latency)

Server Endpoints

Endpoint	Description
`POST /v1/chat/completions`	Chat completions — regular and streaming
`GET /v1/models`	List all registered models
`GET /health`	Health check

Project Structure

src/router/
├── core.py           # Router class — ask() and stream()
├── config.py         # YAML loading and validation
├── routing.py        # Model candidate selection
├── retry.py          # Retry engine with backoff and fallback
├── keys.py           # Key rotation and cooldown
├── models.py         # Model registry
├── types.py          # Response, Chunk, Attempt, Usage types
├── database.py       # SQLite request logging
├── server.py         # FastAPI OpenAI-compatible server
└── providers/
    ├── base.py           # Abstract Provider interface
    ├── openai_compat.py  # Generic adapter (9 providers)
    └── google.py         # Google Gemini adapter

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
src/router		src/router
tests		tests
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
pyproject.toml		pyproject.toml
router.yaml.example		router.yaml.example

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LLM Router

Features

Quick Start

1. Install

2. Configure

3. Use as a Python library

4. Use as a local server (for Void / any editor)

Supported Providers

Routing Logic

Configuration Reference

Response Object

Server Endpoints

Project Structure

Running Tests

License

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LLM Router

Features

Quick Start

1. Install

2. Configure

3. Use as a Python library

4. Use as a local server (for Void / any editor)

Supported Providers

Routing Logic

Configuration Reference

Response Object

Server Endpoints

Project Structure

Running Tests

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages