Skip to content

aravv27/llm-router

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

2 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LLM Router

One endpoint to rule all LLMs. Point Void (or any OpenAI-compatible client) at a local proxy server, and the router automatically picks the best available model, retries on failures, and falls back across providers — all transparently.

Features

  • Single endpoint — one r.ask("...") call or one HTTP endpoint for everything
  • 10 providers out of the box — Google, Cohere, Mistral, OpenRouter, Cerebras, Groq, Hugging Face, Kilo Code, llm7.io, NVIDIA NIM
  • Automatic routing — tier-based model selection, direct providers preferred over meta-providers
  • Retry & fallback — exponential backoff, cascades to next model on failure
  • Key rotation — multiple keys per provider with automatic cooldown on rate limits
  • Quality modesquality, fast, cheap, code
  • OpenAI-compatible server — plug straight into Void, Continue, or any editor that supports custom endpoints
  • SQLite logging — every request and response logged locally

Quick Start

1. Install

git clone https://github.com/aravv27/llm-router 
cd llm-router
pip install -e .

2. Configure

Copy the example config and fill in your API keys:

cp router.yaml.example router.yaml

Then edit router.yaml — add your keys and whitelist the models you want to use:

keys:
  google: [YOUR_GOOGLE_API_KEY]
  openrouter: [YOUR_OPENROUTER_KEY]
  groq: [YOUR_GROQ_KEY]
  # ... etc

models:
  - id: gemini-2.5-flash
    provider: google
    tier: 1
    capabilities: [chat, code, vision, tool_use]
    context_window: 1048576

  - id: openai/gpt-oss-120b:free
    provider: openrouter
    tier: 1
    capabilities: [chat, code]
    context_window: 32000

  - id: llama-3.3-70b-versatile
    provider: groq
    tier: 2
    capabilities: [chat, code]
    context_window: 131072

You control quality. Only models you list will ever be used. The router never picks a model outside your whitelist.

3. Use as a Python library

from router import Router

r = Router.from_config("router.yaml")

# Simple call — router picks best available model
answer = r.ask("Explain quantum entanglement in 3 sentences")
print(answer.text)
print(f"  → used {answer.model} via {answer.provider} in {answer.latency:.2f}s")

# Streaming
for chunk in r.stream("Write a Python web scraper"):
    print(chunk.delta, end="", flush=True)

# Mode selection
answer = r.ask("Fix this bug", mode="code")      # prefer code-capable models
answer = r.ask("Summarize this", mode="fast")    # prefer lowest-latency providers
answer = r.ask("Translate this", mode="cheap")   # prefer lowest-cost models

# Pin to a specific model
answer = r.ask("Hello", model="gemini-2.5-flash")

# With a system prompt
answer = r.ask(
    "Review this code",
    system="You are a senior Python engineer.",
    temperature=0.2,
)

4. Use as a local server (for Void / any editor)

python -m router.server --config router.yaml
# ➜ Listening on http://127.0.0.1:8787

Then in Void's settings, set:

API Base URL:  http://127.0.0.1:8787/v1
API Key:       not-needed
Model:         auto

The server exposes a fully OpenAI-compatible API — Void thinks it's talking to OpenAI.


Supported Providers

Provider Type Notes
Google Direct Gemini 2.5 Pro/Flash
Cohere Direct Command A, Command R+
Mistral Direct Mistral Large, Codestral
Cerebras Direct Llama, Qwen — ultra-fast inference
Groq Direct Llama, Qwen — ultra-fast inference
Hugging Face Direct Open-source models via Inference API
NVIDIA NIM Direct GLM, Mistral Nemotron, MiniMax
llm7.io Direct Free-tier models
OpenRouter Meta Gateway to 200+ models (GPT-4.1, Claude, etc.)
Kilo Code Meta Gateway with free-tier model access

Meta-providers (OpenRouter, Kilo) are used as fallback — direct providers are always tried first.


Routing Logic

1. Filter by capability   → vision? tool_use? json_mode?
2. Filter by context      → prompt fits in the model's window?
3. Sort by tier           → quality: 1→2→3 | cheap/fast: 3→2→1
4. Prefer direct          → direct providers before OpenRouter/Kilo
5. Round-robin            → spread load across models in the same tier
6. On failure             → retry once with backoff, then cascade to next

Configuration Reference

keys:
  google: [key1, key2]        # Multiple keys → automatic rotation
  openrouter: [key1]

models:
  - id: gemini-2.5-flash      # Provider's native model ID
    provider: google
    tier: 1                   # 1 = best quality, 2 = good, 3 = fast/cheap
    capabilities:             # Used to match models to request requirements
      - chat
      - code
      - vision
      - tool_use
      - json_mode
    context_window: 1048576

settings:
  default_mode: quality       # quality | fast | cheap | code
  max_attempts: 5             # Total attempts before giving up
  timeout_seconds: 120
  retry_backoff_base: 1.0     # Seconds (doubles each retry)
  key_cooldown_seconds: 60    # How long to bench a rate-limited key

server:
  host: 127.0.0.1
  port: 8787

logging:
  database: router.db         # SQLite file for request history
  log_level: info             # debug | info | warning | error

Response Object

response = r.ask("Hello")

response.text          # The answer
response.model         # Which model answered (e.g. "gemini-2.5-flash")
response.provider      # Which provider (e.g. "google")
response.latency       # Seconds
response.usage         # .prompt_tokens, .completion_tokens, .total_tokens
response.attempts      # Full trace of every attempt (model, error, latency)

Server Endpoints

Endpoint Description
POST /v1/chat/completions Chat completions — regular and streaming
GET /v1/models List all registered models
GET /health Health check

Project Structure

src/router/
├── core.py           # Router class — ask() and stream()
├── config.py         # YAML loading and validation
├── routing.py        # Model candidate selection
├── retry.py          # Retry engine with backoff and fallback
├── keys.py           # Key rotation and cooldown
├── models.py         # Model registry
├── types.py          # Response, Chunk, Attempt, Usage types
├── database.py       # SQLite request logging
├── server.py         # FastAPI OpenAI-compatible server
└── providers/
    ├── base.py           # Abstract Provider interface
    ├── openai_compat.py  # Generic adapter (9 providers)
    └── google.py         # Google Gemini adapter

Running Tests

pip install -e ".[dev]"
pytest tests/ -v

License

MIT

About

its a LLM router which gives you 1 endpoint which you can use all the places, no need to change api keys, codebases nothing, everything in yaml

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages