Local Docker Compose stack wiring LiteLLM to a lemonade-server OpenAI-compatible backend, with Redis Semantic cache and Milvus standalone for RAG.
- `docker-compose.yml`: eleven services on the `litellm-net` bridge (`litellm`, `redis`, `etcd`, `minio`, `milvus`, `db`, `prometheus`, `alertmanager`, `grafana`, `redis-exporter`, `postgres-exporter`).
- `config/config.yaml`: proxy config covering the embedding model, the Redis Semantic cache, and the Milvus vector store.
- `monitoring/`: Prometheus scrape config and rules, Grafana provisioning and dashboards, Alertmanager config, secret files.
- `.env`: secrets (gitignored); copy from `.env.example`.
- `data/`: runtime volume mount root.
If scripts/up.sh is unavailable (e.g. partial clone), the raw docker compose commands below still work. See ## Scripts below for the happy path.
cp .env.example .env
# edit .env: set LITELLM_MASTER_KEY (e.g. `openssl rand -hex 32`)
docker compose up -d
docker compose ps

Wait until `litellm-proxy` shows healthy (~30–60s the first time, while Milvus initialises). The monitoring stack (prometheus, alertmanager, grafana, exporters) comes up alongside the core stack.
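The wait described above can be scripted rather than watched by hand. The following is a minimal sketch, not part of the repository: `wait_for` is a hypothetical helper name, and the attempt count and delay are arbitrary choices. It retries any command until it succeeds or the attempt budget is exhausted.

```shell
# Hypothetical helper (not from scripts/up.sh): retry a command until it
# succeeds or the attempt budget runs out.
# usage: wait_for MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
wait_for() {
  local max_attempts="$1"
  local delay="$2"
  shift 2
  local attempt=1
  # Run the command quietly; loop only while it keeps failing.
  while ! "$@" >/dev/null 2>&1; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      return 1  # budget exhausted, command never succeeded
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
  return 0
}
```

For example, `wait_for 60 2 curl -fsS http://127.0.0.1:4000/health/readiness` polls the readiness endpoint every 2 seconds for up to 60 attempts; `-f` makes curl exit non-zero on HTTP error responses, so the loop keeps going until the proxy answers successfully.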
# 1. Proxy health (no auth)
curl -s http://127.0.0.1:4000/health/readiness | jq .
# 2. Embed with master key
KEY=$(grep '^LITELLM_MASTER_KEY=' .env | cut -d= -f2-)
curl -s -X POST http://127.0.0.1:4000/v1/embeddings \
-H "Authorization: Bearer $KEY" \
-H "Content-Type: application/json" \
-d '{"model":"harrier-oss-v1-0.6b","input":"hello world"}' | jq .
# 3. Embed again — verify Redis Semantic cache hit
# NOTE: As of 2026-06-30, the Redis Semantic cache's read-side semantic lookup
# is NOT triggered for /v1/embeddings (LiteLLM upstream limitation; the proxy
# route's hash-key lookup short-circuits before the semantic branch runs).
# Write side works — vectors are stored under litellm_semantic_cache: in Redis.
# /v1/chat/completions is unaffected. Verify write side only:
docker compose exec redis redis-cli KEYS 'litellm_semantic_cache:*' | head
# 4. UI login
open http://127.0.0.1:4000/ui # master_key as password
# 5. Milvus reachable from litellm container
# Note: Milvus v2.4 serves gRPC on 19530; its HTTP frontend returns 404 for
# /health-style paths, so a curl probe will see 404 even when Milvus is fully
# reachable. Verify reachability via TCP and the proxy's ability to talk to it:
docker compose exec litellm python3 -c "import socket; s=socket.create_connection(('milvus',19530),timeout=5); s.close(); print('milvus:19530 reachable')"
# Or check `docker compose logs milvus --tail 50` for [GIN] entries from the litellm container IP.

## Scripts

./scripts/up.sh # first-time bootstrap (refuses if .env exists)
./scripts/up.sh --reset # full wipe + re-bootstrap
./scripts/up.sh --dry-run # print plan, don't execute
./scripts/smoke.sh # re-run the smoke suite anytime

`up.sh` is the happy path for bringing the stack up. It enforces a strict clean slate (refuses to overwrite `.env` unless `--reset` is passed), generates secrets, brings the compose stack up, waits for services to be healthy, and runs the smoke suite at the end. `smoke.sh` can be called independently to re-verify a stack that is already up.
See docs/superpowers/specs/2026-07-01-local-deploy-and-test-design.md for the full design.
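The clean-slate rule described for `up.sh` (refuse when `.env` exists unless `--reset` is passed) could look roughly like the guard below. This is an illustrative sketch only: `check_clean_slate` is a hypothetical name, and the real script may implement the check differently.

```shell
# Hypothetical guard; not copied from scripts/up.sh.
# Returns 0 when bootstrap may proceed, 1 when it must refuse.
# usage: check_clean_slate ENV_FILE [--reset]
check_clean_slate() {
  local env_file="$1"
  local flag="${2:-}"
  # An existing env file blocks the bootstrap unless --reset was given.
  if [ -e "$env_file" ] && [ "$flag" != "--reset" ]; then
    echo "refusing: $env_file already exists (pass --reset to wipe and re-bootstrap)" >&2
    return 1
  fi
  return 0
}
```

Called as `check_clean_slate .env "$1" || exit 1` at the top of a bootstrap script, it stops before any secrets are regenerated over an existing deployment.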
- Grafana: http://127.0.0.1:3030 (loopback only). Login: `admin` + the password from `monitoring/secrets/grafana_admin_password` (mode 600, owned by the host uid so the container can read it). 3 dashboards are auto-provisioned under the `litellm-stack` folder.
- Prometheus: internal only (`:9090` on the `litellm-net` bridge). All 7 scrape targets come up cleanly post-startup (`MINIO_PROMETHEUS_AUTH_TYPE=public` makes the MinIO cluster metrics endpoint readable without auth inside the bridge network).
- Alertmanager: internal only (`:9093` on the `litellm-net` bridge). Fires the 3 starter alerts to stdout.
- LiteLLM `/metrics` is scraped using `LITELLM_MASTER_KEY` as the Bearer token. The Prometheus `litellm` job reads `monitoring/secrets/litellm_master_key` (mode 644, gitignored), which is the master key mirrored out of `.env`. There is no separate metrics token, so keep `LITELLM_MASTER_KEY` in `.env` in sync with the contents of `monitoring/secrets/litellm_master_key`.
- Chat completions depend on the lemonade backend. `harrier-oss-v1-0.6b` is currently embedding-only on the lemonade server at `${LEMONADE_HOST_IP:-192.168.31.246}:13305`, so `t_chat_round_trip` and `t_semantic_cache_hit` in the smoke suite will hard-fail until lemonade exposes `/v1/chat/completions` for this model. Embedding tests (`t_embed`, `t_redis_cache_write`) work today.
- UI login is browser-only. `ghcr.io/berriai/litellm:latest` serves a Swagger UI at `/ui` that requires DB-backed auth (Prisma init). Programmatic `POST /ui/login` returns 405; `POST /login` returns 400 with "Not connected to DB" until Prisma is initialized against the configured Postgres. The smoke suite asserts `GET /ui` returns 200/302/307 (UI reachable); bearer-token auth is validated separately by `t_embed`.
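Because the Prometheus job reads a mirrored copy of the master key, drift between the two files shows up as a failing `litellm` scrape. A quick consistency check might look like the sketch below. `master_key_in_sync` is a hypothetical helper, and it assumes the value in `.env` is written unquoted on a single `LITELLM_MASTER_KEY=...` line.

```shell
# Hypothetical helper: succeed only if the master key in the env file
# matches the contents of the mirrored secret file.
# Assumes an unquoted KEY=value line in the env file.
# usage: master_key_in_sync ENV_FILE SECRET_FILE
master_key_in_sync() {
  local env_file="$1"
  local secret_file="$2"
  local env_key secret_key
  # Take everything after the first "=" so keys containing "=" survive.
  env_key=$(grep '^LITELLM_MASTER_KEY=' "$env_file" | head -n 1 | cut -d= -f2-)
  # Command substitution strips any trailing newline in the secret file.
  secret_key=$(cat "$secret_file")
  # An empty or missing key counts as out of sync.
  [ -n "$env_key" ] && [ "$env_key" = "$secret_key" ]
}
```

Running `master_key_in_sync .env monitoring/secrets/litellm_master_key || echo "master key out of sync"` after editing `.env` flags the mismatch before Prometheus starts getting rejected.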
docker compose down # stop + remove containers, keep volumes
docker compose down -v # also delete named volumes (wipes Redis/Milvus state)

When lemonade exposes `/v1/chat/completions`, append to `model_list` in `config/config.yaml`:
model_list:
  - model_name: harrier-oss-v1-0.6b
    litellm_params:
      model: openai/harrier-oss-v1-0.6b
      api_base: http://host.docker.internal:13305/v1
      api_key: dummy-not-used

Then run `docker compose restart litellm`.