LLM Semantic Cache
mimir is a drop-in proxy that caches LLM API responses using semantic similarity, reducing costs and latency for repeated or similar queries.
- Semantic Caching - Cache hits for semantically similar prompts, not just exact matches
- Free Local Embeddings - Use Ollama for embeddings with zero API costs
- OpenAI-Compatible - Drop-in replacement proxy for OpenAI API
- Configurable Threshold - Tune similarity sensitivity (0.0-1.0)
- TTL Support - Time-based cache expiration
- Zero Dependencies - Single binary, no external database required
- Docker Ready - Simple containerized deployment
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Client    │────▶│    mimir    │────▶│   LLM API   │
│  (app/pod)  │◀────│   (proxy)   │◀────│ (OpenAI/..) │
└─────────────┘     └──────┬──────┘     └─────────────┘
                           │
                    ┌──────▼──────┐
                    │ Vector Store│
                    │ (embeddings)│
                    └─────────────┘
```
- Incoming request is converted to an embedding
- Cache is searched for semantically similar previous requests
- If similarity exceeds threshold → return cached response
- Otherwise → forward to upstream and cache the response (see the sketch below)
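A minimal sketch of that lookup step, assuming cosine similarity over stored embeddings; `cache` and `lookup` here are illustrative names, not mimir's internal API, and mimir's actual vector store and metric may differ:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def lookup(cache, query_embedding, threshold=0.95):
    """Return the cached response most similar to the query embedding,
    or None if nothing clears the threshold (illustrative only)."""
    best_score, best_response = -1.0, None
    for embedding, response in cache:  # cache: list of (embedding, response)
        score = cosine_similarity(embedding, query_embedding)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None
```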
```bash
# Install Ollama (if not already installed)
brew install ollama  # macOS
# or: curl -fsSL https://ollama.com/install.sh | sh  # Linux

# Start Ollama and pull embedding model
ollama serve &
ollama pull nomic-embed-text

# Clone and run mimir
git clone https://github.com/aqstack/mimir.git
cd mimir
make build
./bin/mimir
```

```bash
# Clone and build
git clone https://github.com/aqstack/mimir.git
cd mimir
make build

# Run with OpenAI
export OPENAI_API_KEY=sk-...
./bin/mimir
```
```bash
# With Ollama (requires Ollama running on host)
docker run -p 8080:8080 -e OLLAMA_BASE_URL=http://host.docker.internal:11434 ghcr.io/aqstack/mimir:latest

# With OpenAI
docker run -p 8080:8080 -e OPENAI_API_KEY=$OPENAI_API_KEY ghcr.io/aqstack/mimir:latest
```

Point your OpenAI client to mimir instead of the OpenAI API:
```python
from openai import OpenAI

# Point to mimir proxy
client = OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="your-api-key"  # or use OPENAI_API_KEY env var
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)

# Check cache status in response headers
# X-Mimir-Cache: HIT or MISS
# X-Mimir-Similarity: 0.9823 (if HIT)
```
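To read those headers programmatically, the OpenAI Python SDK exposes them through its raw-response wrapper; a short sketch reusing the `client` from above (header names as documented here, SDK details may vary by version):

```python
# Fetch the raw HTTP response so mimir's cache headers are visible.
raw = client.chat.completions.with_raw_response.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "What is the capital of France?"}]
)
print(raw.headers.get("X-Mimir-Cache"))       # "HIT" or "MISS"
print(raw.headers.get("X-Mimir-Similarity"))  # e.g. "0.9823" on a HIT
response = raw.parse()  # the usual ChatCompletion object
```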
| Environment Variable | Default | Description |
|---|---|---|
| `MIMIR_EMBEDDING_PROVIDER` | `ollama` | Embedding provider: `ollama` or `openai` |
| `MIMIR_EMBEDDING_MODEL` | `nomic-embed-text` | Embedding model name |
| `OLLAMA_BASE_URL` | `http://localhost:11434` | Ollama server URL |
| `OPENAI_API_KEY` | - | OpenAI API key (auto-switches provider if set) |
| `OPENAI_BASE_URL` | `https://api.openai.com/v1` | Upstream API URL |
| `MIMIR_PORT` | `8080` | Server port |
| `MIMIR_HOST` | `0.0.0.0` | Server host |
| `MIMIR_SIMILARITY_THRESHOLD` | `0.95` | Minimum similarity for cache hit (0.0-1.0) |
| `MIMIR_CACHE_TTL` | `24h` | Cache entry time-to-live |
| `MIMIR_MAX_CACHE_SIZE` | `10000` | Maximum cache entries |
| `MIMIR_LOG_JSON` | `false` | JSON log format |
Ollama (free, local):
- `nomic-embed-text` (768 dims, recommended)
- `mxbai-embed-large` (1024 dims)
- `all-minilm` (384 dims, fastest)

OpenAI (paid):
- `text-embedding-3-small` (1536 dims, recommended)
- `text-embedding-3-large` (3072 dims)
- `text-embedding-ada-002` (1536 dims)
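To sanity-check a model's dimensionality before pointing mimir at it, you can query Ollama's embeddings endpoint directly; a small stdlib-only sketch (request shape per Ollama's public API, so verify against your Ollama version):

```python
import json
import urllib.request

# Ask a local Ollama server for an embedding and report its dimensionality.
req = urllib.request.Request(
    "http://localhost:11434/api/embeddings",
    data=json.dumps({
        "model": "nomic-embed-text",
        "prompt": "What is the capital of France?",
    }).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    embedding = json.load(resp)["embedding"]
print(len(embedding))  # expect 768 for nomic-embed-text
```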
| Endpoint | Description |
|---|---|
| `POST /v1/chat/completions` | Chat completions (cached) |
| `GET /health` | Health check |
| `GET /stats` | Cache statistics |
| `* /v1/*` | Other OpenAI endpoints (passthrough) |
```bash
curl http://localhost:8080/stats
```

```json
{
  "total_entries": 150,
  "total_hits": 1234,
  "total_misses": 567,
  "hit_rate": 0.685,
  "estimated_saved_usd": 1.234
}
```
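A few lines of Python turn that endpoint into a quick health report (field names as shown above):

```python
import json
import urllib.request

# Pull mimir's cache statistics and summarize them.
with urllib.request.urlopen("http://localhost:8080/stats") as resp:
    stats = json.load(resp)

total = stats["total_hits"] + stats["total_misses"]
print(f"hit rate: {stats['hit_rate']:.1%} over {total} requests")
print(f"estimated savings: ${stats['estimated_saved_usd']:.2f}")
```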
The `MIMIR_SIMILARITY_THRESHOLD` setting controls how similar a query must be to trigger a cache hit:

| Threshold | Behavior |
|---|---|
| 0.99 | Nearly exact matches only |
| 0.95 | Very similar queries (recommended) |
| 0.90 | Moderate similarity |
| 0.85 | Loose matching (may return less relevant responses) |
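One practical way to pick a value is to probe the running proxy with paraphrases and watch the similarity header; a sketch reusing the `client` from the usage example above:

```python
# Send a prompt and a close paraphrase, then compare cache behavior.
prompts = [
    "What is the capital of France?",
    "What's the capital city of France?",
]
for prompt in prompts:
    raw = client.chat.completions.with_raw_response.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"{prompt!r}: cache={raw.headers.get('X-Mimir-Cache')}, "
          f"similarity={raw.headers.get('X-Mimir-Similarity')}")
```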
- Local embeddings with Ollama
- Redis/Qdrant backend for persistence
- Prometheus metrics
- Cache warming
- Support for Anthropic, Gemini APIs
Contributions are welcome! Please open an issue or submit a pull request.
MIT License - see LICENSE for details.