Production deployment, Redis migration, troubleshooting, and optimization.
- Redis Index Migration
- Production Deployment
- Security Best Practices
- Performance Optimization
- Troubleshooting
Redis vector indexes have fixed dimensions. When you:
- Switch embedding models (Gemma → Nomic)
- Change Matryoshka dimensions (768 → 256)
- Switch providers (Ollama → HuggingFace)
The old index will cause dimension mismatch errors.
Using Make (Easiest):

```bash
make cache-clear
```

Using Redis CLI:

```bash
redis-cli FT.DROPINDEX semantic_cache DD
```

Using Python:

```python
from semantic_cache.config import get_redis_client

client = get_redis_client()
client.execute_command("FT.DROPINDEX", "semantic_cache", "DD")
```

Example: Switch from 768 → 256 dimensions:
```bash
# 1. Stop the API (Ctrl+C)

# 2. Update dependencies.py
#    Change: output_dimension=768 → output_dimension=256

# 3. Clear the Redis index
make cache-clear

# 4. Restart the API
make dev

# 5. Verify the logs
#    Look for: "Created new index: semantic_cache"
```

| Command | Index | Data |
|---|---|---|
| `FT.DROPINDEX semantic_cache` | ✅ Deleted | ❌ Kept (unsearchable) |
| `FT.DROPINDEX semantic_cache DD` | ✅ Deleted | ✅ Deleted |

`make cache-clear` uses the `DD` flag, so it deletes both the index and the stored data.
```bash
redis-cli FT.INFO semantic_cache
```

Look for the `VECTOR` field in the output to see the index's current dimensions.
docker-compose.yml:
```yaml
version: '3.8'

services:
  # Ollama (if using)
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    restart: unless-stopped

  # Semantic Cache API
  api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - EMBEDDING_MODEL=embeddinggemma
      # For HuggingFace (if using):
      # - HF_TOKEN=${HF_TOKEN}
      # - EMBEDDING_OUTPUT_DIMENSION=768
    depends_on:
      - redis
      - ollama  # Remove if using HuggingFace
    restart: unless-stopped

  # Redis Stack
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    restart: unless-stopped

volumes:
  ollama_data:
  redis_data:
```

Pull the Ollama model:

```bash
docker-compose exec ollama ollama pull embeddinggemma
```

Start the services:

```bash
docker-compose up -d
```

ollama-deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
        - name: ollama
          image: ollama/ollama:latest
          ports:
            - containerPort: 11434
          volumeMounts:
            - name: models
              mountPath: /root/.ollama
      volumes:
        - name: models
          persistentVolumeClaim:
            claimName: ollama-models
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
spec:
  ports:
    - port: 11434
  selector:
    app: ollama
```

api-deployment.yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: semantic-cache
spec:
  replicas: 3
  selector:
    matchLabels:
      app: semantic-cache
  template:
    metadata:
      labels:
        app: semantic-cache
    spec:
      containers:
        - name: api
          image: your-registry/semantic-cache:latest
          ports:
            - containerPort: 8000
          env:
            - name: REDIS_URL
              value: "redis://redis:6379"
            - name: EMBEDDING_MODEL
              value: "embeddinggemma"
            # For HuggingFace:
            # - name: HF_TOKEN
            #   valueFrom:
            #     secretKeyRef:
            #       name: huggingface-token
            #       key: HF_TOKEN
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: semantic-cache
spec:
  type: LoadBalancer
  ports:
    - port: 8000
  selector:
    app: semantic-cache
```

secret.yaml (for HuggingFace):
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: huggingface-token
type: Opaque
stringData:
  HF_TOKEN: hf_your_token_here
```

/etc/systemd/system/semantic-cache.service:
```ini
[Unit]
Description=Semantic Cache Service
After=network.target redis.service

[Service]
Type=simple
User=semantic-cache
WorkingDirectory=/opt/semantic-cache
Environment="REDIS_URL=redis://localhost:6379"
Environment="EMBEDDING_MODEL=embeddinggemma"
# For HuggingFace:
# Environment="HF_TOKEN=hf_your_token_here"
# Environment="EMBEDDING_OUTPUT_DIMENSION=768"
ExecStart=/opt/semantic-cache/.venv/bin/uvicorn semantic_cache.api.app:app --host 0.0.0.0 --port 8000
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target
```

Start the service:

```bash
sudo systemctl enable semantic-cache
sudo systemctl start semantic-cache
sudo systemctl status semantic-cache
```

DO ✅:
- Store secrets in `.env` (and add it to `.gitignore`)
- Use read-only HuggingFace tokens
- Use secret managers in production (AWS Secrets Manager, Vault)
- Rotate tokens periodically (every 6-12 months)
- Use separate tokens per environment (dev/staging/prod)
DON'T ❌:
- Commit tokens to git
- Share tokens in chat/email
- Use personal tokens for production
- Hardcode tokens in code
- Give write access unless needed
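Several of the rules above can be enforced mechanically. Below is a sketch that scans text (for example, `git log -p` output) for strings shaped like HuggingFace tokens; the `hf_` prefix is real, but the minimum-length cutoff of 20 characters is an assumption, so tune it to your needs:

```python
import re

# HuggingFace user access tokens start with 'hf_' followed by alphanumerics.
# The 20-character minimum is an assumed heuristic to reduce false positives.
TOKEN_PATTERN = re.compile(r"\bhf_[A-Za-z0-9]{20,}\b")

def find_exposed_tokens(text: str) -> list[str]:
    """Return strings in `text` that look like HuggingFace tokens."""
    return TOKEN_PATTERN.findall(text)

diff = "+HF_TOKEN=hf_abcdefghijklmnopqrstuv\n-# no secrets here"
print(find_exposed_tokens(diff))  # ['hf_abcdefghijklmnopqrstuv']
```

A check like this fits naturally into a pre-commit hook, catching tokens before they ever reach git history.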
Check if a token is exposed:

```bash
git log -p | grep -i "hf_"
```

If exposed:

- Go to https://huggingface.co/settings/tokens
- Delete the token immediately
- Create a new token
- Update `.env`
- Restart services
Enable authentication:

```yaml
# docker-compose.yml
redis:
  image: redis/redis-stack:latest
  command: redis-server --requirepass your_secure_password
  environment:
    - REDIS_PASSWORD=your_secure_password
```

Update the connection string:

```bash
# .env
REDIS_URL=redis://:your_secure_password@localhost:6379
```

Production checklist:
- Use HTTPS/TLS for API
- Firewall Redis (only internal access)
- Rate limiting on API endpoints
- API authentication (JWT, API keys)
- Monitor for unusual traffic patterns
Threshold Tuning:

```bash
# Lower = stricter matching (fewer hits, higher precision)
CACHE_DISTANCE_THRESHOLD=0.10

# Medium = balanced (recommended)
CACHE_DISTANCE_THRESHOLD=0.15

# Higher = looser matching (more hits, risk of false positives)
CACHE_DISTANCE_THRESHOLD=0.25
```

Monitor the hit rate:

```bash
curl http://localhost:8000/cache/stats
```

Target: 60-70% hit rate for optimal cost savings.
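To see what a given hit rate is worth, here is a minimal back-of-envelope calculator; the per-call LLM price used in the example is an illustrative figure, not a real quote:

```python
def llm_cost_savings(requests: int, hit_rate: float, cost_per_call: float) -> float:
    """Cost avoided by serving `hit_rate` of requests from the cache
    instead of calling the LLM."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return requests * hit_rate * cost_per_call

# At a 65% hit rate, 100k requests at $0.002/call avoid roughly $130 of spend.
saved = llm_cost_savings(100_000, 0.65, 0.002)
```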
```bash
# Short TTL for fast-changing data
CACHE_TTL=3600        # 1 hour

# Medium TTL for typical use
CACHE_TTL=604800      # 7 days (default)

# Long TTL for stable data
CACHE_TTL=2592000     # 30 days
```

redis.conf settings:
```conf
# Memory limit
maxmemory 2gb
maxmemory-policy allkeys-lru

# Persistence (for durability)
save 900 1
save 300 10
save 60 10000

# Snapshotting
rdbcompression yes
rdbchecksum yes
```
Storage vs Quality (storage relative to a 384-dim baseline):
- 768 dims: best quality, 2x storage
- 512 dims: ~95% quality, 1.33x storage
- 256 dims: ~90% quality, 0.67x storage
- 128 dims: ~85% quality, 0.33x storage
Recommendation:
- Small scale (<10K entries): Use 768
- Medium scale (10K-100K): Use 512
- Large scale (>100K): Use 256
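The storage ratios above follow directly from vector size. A rough estimator, assuming float32 vectors and ignoring Redis key overhead and the index's own graph structures (so real usage will be higher):

```python
def index_storage_bytes(entries: int, dims: int, bytes_per_float: int = 4) -> int:
    """Approximate raw vector storage for a cache index (float32 by default)."""
    return entries * dims * bytes_per_float

# 100k entries at 256 dims: 102,400,000 bytes, i.e. roughly 100 MB of raw vectors.
print(index_storage_bytes(100_000, 256))  # 102400000
```

At 768 dims the same 100k entries need about 300 MB, which is why the larger scales favor smaller dimensions.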
```python
# Process multiple queries in one embedding call
embeddings = provider.encode_batch([
    "Query 1",
    "Query 2",
    "Query 3",
])
```

Benefits: ~50% faster than individual calls.
Error:

```
redis.exceptions.ResponseError: Dimension mismatch: expected 384, got 768
```

Solution:

```bash
make cache-clear
make dev
```

Error:

```
requests.exceptions.ConnectionError: Connection refused
```

Solution:

```bash
# Check if Ollama is running
curl http://localhost:11434/api/version

# If not, start it
ollama serve
```

Error:

```
HTTPError: 401 Client Error: Unauthorized
```

Solution:

```bash
# Authenticate
huggingface-cli login

# Accept the model license:
# Visit: https://huggingface.co/google/embeddinggemma-300m
# Click: "Agree and access repository"
```

Error:

```
redis.exceptions.ConnectionError: Error connecting to Redis
```

Solution:

```bash
# Check Redis is running
docker compose ps

# Start Redis if needed
docker compose up -d redis

# Check Redis health
redis-cli ping
# Should return: PONG
```

Problem: Embeddings take >200ms
Solutions:
- Check model loading: the first request is always slower (model load time)
- Use batch processing: process multiple queries together
- Switch to a lighter model: try `all-minilm` (384 dims, faster)
- Check CPU load: embeddings are CPU-intensive
- Consider GPU: for production scale
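Before applying any of the fixes above, it helps to confirm that embedding calls really exceed the 200 ms budget. A stdlib-only timing harness might look like this; `fn` is a stand-in for your provider's encode call:

```python
import time
from statistics import quantiles

def time_calls(fn, runs: int = 100) -> dict[str, float]:
    """Measure per-call latency of `fn` in milliseconds and report p50/p95."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000)
    cuts = quantiles(samples, n=100)  # 99 cut points: cuts[49]=p50, cuts[94]=p95
    return {"p50_ms": cuts[49], "p95_ms": cuts[94]}

# Substitute a real call, e.g. lambda: provider.encode("warm query")
stats = time_calls(lambda: sum(range(1000)))
```

Run it twice in a row: if the second run is dramatically faster, the slowness was model loading, not steady-state latency.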
Problem: High memory usage

Solutions:
- Use smaller dimensions: 256 or 128 instead of 768
- Limit Redis memory: set `maxmemory` in redis.conf
- Enable eviction: use the `allkeys-lru` policy
- Monitor with: `redis-cli info memory`
Error:

```
redis.exceptions.ResponseError: Index already exists
```

Solution:

```bash
# Force drop and recreate
redis-cli FT.DROPINDEX semantic_cache DD
make dev
```

```bash
# API health
curl http://localhost:8000/health

# Redis health
redis-cli ping

# Ollama health (if using)
curl http://localhost:11434/api/version
```

Cache Performance:
- Hit rate (target: 60-70%)
- Average lookup time (target: <5ms)
- Total entries
- Storage size
API Performance:
- Request latency (p50, p95, p99)
- Error rate
- Requests per second
Redis Metrics:
```bash
redis-cli info stats
redis-cli info memory
redis-cli FT.INFO semantic_cache
```

Application logs:
```bash
# View logs
docker-compose logs -f api

# Or for systemd
journalctl -u semantic-cache -f
```

Redis logs:

```bash
docker-compose logs -f redis
```

```bash
# Manual backup
redis-cli BGSAVE

# Copy RDB file
cp /var/lib/redis/dump.rdb /backup/dump-$(date +%Y%m%d).rdb
```

```bash
# Stop Redis
docker-compose stop redis

# Replace RDB file
cp /backup/dump-20260205.rdb /var/lib/redis/dump.rdb

# Start Redis
docker-compose start redis
```

Cron job:
```bash
#!/bin/bash
# /etc/cron.daily/redis-backup
redis-cli BGSAVE
sleep 10
cp /var/lib/redis/dump.rdb /backup/dump-$(date +%Y%m%d).rdb
find /backup -name "dump-*.rdb" -mtime +7 -delete
```

Use different index names per environment:
.env.dev:

```bash
CACHE_INDEX_NAME=semantic_cache_dev
```

.env.staging:

```bash
CACHE_INDEX_NAME=semantic_cache_staging
```

.env.prod:

```bash
CACHE_INDEX_NAME=semantic_cache_prod
```

This allows testing different models/configs without affecting production.
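One way the application side might resolve the per-environment name (a sketch; the project's actual config module may already handle this, and the default shown is an assumption):

```python
import os

def cache_index_name(default: str = "semantic_cache") -> str:
    """Resolve the index name from the environment, falling back to a default.
    CACHE_INDEX_NAME is the variable set in the per-environment .env files."""
    return os.environ.get("CACHE_INDEX_NAME", default)

os.environ["CACHE_INDEX_NAME"] = "semantic_cache_dev"  # e.g. loaded from .env.dev
print(cache_index_name())  # semantic_cache_dev
```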
- ✅ Understand production considerations
- Deploy to your environment
- Set up monitoring
- Configure backups
- Tune performance based on metrics