mimir

LLM Semantic Cache

mimir is a drop-in proxy that caches LLM API responses using semantic similarity, reducing costs and latency for repeated or similar queries.

Features

  • Semantic Caching - Cache hits for semantically similar prompts, not just exact matches
  • Free Local Embeddings - Use Ollama for embeddings with zero API costs
  • OpenAI-Compatible - Drop-in replacement proxy for OpenAI API
  • Configurable Threshold - Tune similarity sensitivity (0.0-1.0)
  • TTL Support - Time-based cache expiration
  • Zero Dependencies - Single binary, no external database required
  • Docker Ready - Simple containerized deployment

How It Works

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│    mimir    │────▶│  LLM API    │
│  (app/pod)  │◀────│   (proxy)   │◀────│ (OpenAI/..) │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────▼──────┐
                    │ Vector Store│
                    │ (embeddings)│
                    └─────────────┘
  1. Incoming request is converted to an embedding
  2. Cache is searched for semantically similar previous requests
  3. If similarity exceeds threshold → return cached response
  4. Otherwise → forward to upstream, cache response
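
The four steps above amount to a nearest-neighbour lookup over cached embeddings. The sketch below illustrates that logic in Python for clarity only; mimir itself ships as a single binary, and embed(), forward(), and the in-memory cache list here are hypothetical stand-ins for its real embedding provider and vector store.

import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def lookup_or_forward(prompt, cache, embed, forward, threshold=0.95):
    """Return a cached response for a semantically similar prompt, or forward upstream."""
    query_vec = embed(prompt)                                # 1. embed the incoming request
    best_entry, best_sim = None, 0.0
    for entry in cache:                                      # 2. search cached embeddings
        sim = cosine_similarity(query_vec, entry["embedding"])
        if sim > best_sim:
            best_entry, best_sim = entry, sim
    if best_entry is not None and best_sim >= threshold:     # 3. similar enough: cache hit
        return best_entry["response"], {"X-Mimir-Cache": "HIT",
                                        "X-Mimir-Similarity": f"{best_sim:.4f}"}
    response = forward(prompt)                               # 4. miss: call the upstream API
    cache.append({"embedding": query_vec, "response": response})
    return response, {"X-Mimir-Cache": "MISS"}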

Quick Start

Option 1: Local Embeddings with Ollama (Free)

# Install Ollama (if not already installed)
brew install ollama  # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh  # Linux

# Start Ollama and pull embedding model
ollama serve &
ollama pull nomic-embed-text

# Clone and run mimir
git clone https://github.com/aqstack/mimir.git
cd mimir
make build
./bin/mimir

Option 2: OpenAI Embeddings

# Clone and build
git clone https://github.com/aqstack/mimir.git
cd mimir
make build

# Run with OpenAI
export OPENAI_API_KEY=sk-...
./bin/mimir

Using Docker

# With Ollama (requires Ollama running on host)
docker run -p 8080:8080 -e OLLAMA_BASE_URL=http://host.docker.internal:11434 ghcr.io/aqstack/mimir:latest

# With OpenAI
docker run -p 8080:8080 -e OPENAI_API_KEY=$OPENAI_API_KEY ghcr.io/aqstack/mimir:latest
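
The Ollama variant can also be expressed as a docker-compose sketch; the image tag mirrors the commands above, and the host-gateway mapping is only needed on Linux so the container can reach Ollama on the host. Adjust for your setup:

services:
  mimir:
    image: ghcr.io/aqstack/mimir:latest
    ports:
      - "8080:8080"
    environment:
      OLLAMA_BASE_URL: http://host.docker.internal:11434
    extra_hosts:
      - "host.docker.internal:host-gateway"  # Linux: expose the host to the container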

Usage

Point your OpenAI client to mimir instead of the OpenAI API:

from openai import OpenAI

# Point to mimir proxy
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key"  # or use OPENAI_API_KEY env var
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)

# Check cache status in response headers
# X-Mimir-Cache: HIT or MISS
# X-Mimir-Similarity: 0.9823 (if HIT)
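
To read these headers from Python, newer versions of the official openai client expose the raw HTTP response via with_raw_response; a sketch, assuming an openai-python 1.x client (the accessor may differ in other versions):

import os
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key=os.environ.get("OPENAI_API_KEY", "your-api-key"),
)

# with_raw_response exposes the underlying HTTP response, including headers
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)
print(raw.headers.get("X-Mimir-Cache"))        # HIT or MISS
print(raw.headers.get("X-Mimir-Similarity"))   # e.g. 0.9823 on a HIT
completion = raw.parse()                       # the usual ChatCompletion object
print(completion.choices[0].message.content)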

Configuration

Environment Variable         Default                     Description
MIMIR_EMBEDDING_PROVIDER     ollama                      Embedding provider: ollama or openai
MIMIR_EMBEDDING_MODEL        nomic-embed-text            Embedding model name
OLLAMA_BASE_URL              http://localhost:11434      Ollama server URL
OPENAI_API_KEY               -                           OpenAI API key (auto-switches provider if set)
OPENAI_BASE_URL              https://api.openai.com/v1   Upstream API URL
MIMIR_PORT                   8080                        Server port
MIMIR_HOST                   0.0.0.0                     Server host
MIMIR_SIMILARITY_THRESHOLD   0.95                        Minimum similarity for cache hit (0.0-1.0)
MIMIR_CACHE_TTL              24h                         Cache entry time-to-live
MIMIR_MAX_CACHE_SIZE         10000                       Maximum cache entries
MIMIR_LOG_JSON               false                       JSON log format
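
For example, to run against local Ollama with a slightly looser threshold and a shorter TTL (illustrative values, not recommendations):

export MIMIR_EMBEDDING_PROVIDER=ollama
export MIMIR_EMBEDDING_MODEL=nomic-embed-text
export MIMIR_SIMILARITY_THRESHOLD=0.92
export MIMIR_CACHE_TTL=12h
./bin/mimir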

Embedding Models

Ollama (free, local):

  • nomic-embed-text (768 dims, recommended)
  • mxbai-embed-large (1024 dims)
  • all-minilm (384 dims, fastest)

OpenAI (paid):

  • text-embedding-3-small (1536 dims, recommended)
  • text-embedding-3-large (3072 dims)
  • text-embedding-ada-002 (1536 dims)
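
To switch Ollama models, pull the model and point mimir at it, as below. Note that embeddings from different models (and dimensions) are not comparable, so expect previously cached entries to stop matching after a switch:

ollama pull mxbai-embed-large
export MIMIR_EMBEDDING_MODEL=mxbai-embed-large
./bin/mimir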

API Endpoints

Endpoint                     Description
POST /v1/chat/completions    Chat completions (cached)
GET /health                  Health check
GET /stats                   Cache statistics
* /v1/*                      Other OpenAI endpoints (passthrough)
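
For example, a quick liveness check:

curl http://localhost:8080/health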

Cache Statistics

curl http://localhost:8080/stats
{
  "total_entries": 150,
  "total_hits": 1234,
  "total_misses": 567,
  "hit_rate": 0.685,
  "estimated_saved_usd": 1.234
}

Tuning the Similarity Threshold

The MIMIR_SIMILARITY_THRESHOLD controls how similar a query must be to trigger a cache hit:

Threshold   Behavior
0.99        Nearly exact matches only
0.95        Very similar queries (recommended)
0.90        Moderate similarity
0.85        Loose matching (may return less relevant responses)
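
One practical way to pick a value is to send a prompt and then a paraphrase of it, and compare the X-Mimir-Cache and X-Mimir-Similarity response headers, for example with curl (the Authorization header is forwarded upstream; whether it is required depends on your setup):

# First request populates the cache (headers printed with -D -, body discarded)
curl -s -D - -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"What is the capital of France?"}]}'

# Paraphrased request: look for X-Mimir-Cache: HIT and the reported X-Mimir-Similarity
curl -s -D - -o /dev/null http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -d '{"model":"gpt-4","messages":[{"role":"user","content":"Which city is the capital of France?"}]}'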

Roadmap

  • Local embeddings with Ollama
  • Redis/Qdrant backend for persistence
  • Prometheus metrics
  • Cache warming
  • Support for Anthropic, Gemini APIs

Contributing

Contributions are welcome! Please open an issue or submit a pull request.

License

MIT License - see LICENSE for details.
