# LiteLLM Monitoring Setup - Quick Reference

## Overview

Complete monitoring infrastructure for the LiteLLM multi-provider gateway, built on Prometheus, Grafana, and AlertManager.

## ✅ Completed Setup

### 1. Monitoring Stack Deployment
```
# Services deployed:
Prometheus:        http://localhost:9090
Grafana:           http://localhost:3001  (user: admin, pass: admin)
AlertManager:      http://localhost:9093
Node Exporter:     http://localhost:9100
Postgres Exporter: http://localhost:9187
Redis Exporter:    http://localhost:9121
Docker Exporter:   http://localhost:9417
```

### 2. Grafana Dashboard
**Location:** `/config/grafana/dashboards/litellm_dashboard.json`

**Panels:**
- LiteLLM Status (UP/DOWN indicator)
- Request Rate (total requests vs. errors)
- P95 Latency (5-minute window)
- Requests by Model (stacked bar chart)
- Requests by Provider (time series)
- Total Tokens Used (cumulative counter)
- Total Cost USD (cumulative cost)
- Error Rate (percentage)

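Assuming the expected native metric names (listed under Key Metrics below) become available, the rate, latency, and error-rate panels can be driven by PromQL along these lines. This is illustrative, not the exact dashboard JSON:

```
# Request rate (req/s, 5m window)
rate(litellm_requests_total[5m])

# P95 latency (assumes the histogram exposes a _bucket series)
histogram_quantile(0.95, rate(litellm_request_duration_seconds_bucket[5m]))

# Error rate as a percentage
100 * rate(litellm_request_errors_total[5m]) / rate(litellm_requests_total[5m])
```
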
**Access:**
```bash
# Local access (port 3001 avoids a conflict with Lago)
http://localhost:3001

# Via Traefik (production)
https://grafana.kubeworkz.io
```

**Default Credentials:**
- Username: `admin`
- Password: `admin` (change on first login)

### 3. Prometheus Configuration
**Location:** `/config/prometheus/prometheus.yml`

**LiteLLM Scrape Jobs:**
```yaml
# LiteLLM proxy metrics (native Prometheus export)
- job_name: 'litellm'
  scrape_interval: 10s
  static_configs:
    - targets: ['unicorn-litellm-wilmer:4000']
  metrics_path: /metrics

# LiteLLM usage metrics (custom backend endpoint)
- job_name: 'litellm-usage'
  scrape_interval: 30s
  static_configs:
    - targets: ['ops-center-direct:8084']
  metrics_path: /api/v1/llm/metrics
```

### 4. Alert Rules
**Location:** `/config/prometheus/rules/litellm_alerts.yml`

**Alert Groups (6 groups, 15 alerts):**

#### A. Availability Alerts
- **LiteLLMDown** (critical): proxy down for 2+ minutes
- **LiteLLMUnhealthy** (warning): health check failing for 5+ minutes

#### B. Performance Alerts
- **LiteLLMHighLatency** (warning): P95 latency > 10s for 10 minutes
- **LiteLLMHighErrorRate** (warning): error rate > 5% for 5 minutes
- **LiteLLMCriticalErrorRate** (critical): error rate > 20% for 2 minutes

#### C. Usage Alerts
- **LiteLLMHighRequestRate** (warning): > 100 req/s for 10 minutes (abuse detection)
- **LiteLLMTokenLimitApproaching** (info): > 1M tokens used in 1 hour
- **LiteLLMModelFailures** (warning): per-model error rate > 10% for 5 minutes

#### D. Cost Alerts
- **LiteLLMHighDailyCost** (warning): daily cost > $100
- **LiteLLMCostSpike** (warning): cost running at 2x the previous hour for 15 minutes

#### E. Database Alerts
- **LiteLLMDatabasePoolExhausted** (critical): all DB connections active for 5 minutes
- **LiteLLMSlowDatabaseQueries** (warning): P95 query time > 1s for 10 minutes

#### F. Provider Alerts
- **LiteLLMGroqProviderDown** (critical): Groq failure rate > 90% for 5 minutes
- **LiteLLMProviderRateLimited** (warning): > 10 rate-limit errors in 5 minutes

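As an illustration of the rule syntax, an availability alert like **LiteLLMDown** can be expressed as follows. This is a sketch; the actual rules live in `litellm_alerts.yml` and may differ in detail:

```yaml
groups:
  - name: litellm_availability
    rules:
      - alert: LiteLLMDown
        expr: up{job="litellm"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "LiteLLM proxy is down"
          description: "The litellm scrape target has been unreachable for 2+ minutes."
```
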
### 5. Monitoring Script
**Location:** `/scripts/monitor_litellm.sh`

**Features:**
- Container status check
- Health endpoint validation
- Model availability verification
- Database connectivity test
- Performance metrics (CPU, memory, network)
- Live inference test
- Active alerts summary

**Usage:**
```bash
# Run once
./scripts/monitor_litellm.sh

# Run continuously (every 30s)
watch -n 30 ./scripts/monitor_litellm.sh
```

## 📊 Key Metrics

### LiteLLM Native Metrics (expected but NOT YET AVAILABLE)
```
# Request metrics
litellm_requests_total             # Total requests counter
litellm_request_errors_total       # Total errors counter
litellm_request_duration_seconds   # Latency histogram

# Model metrics
litellm_model_requests_total{model="..."}   # Requests per model
litellm_model_errors_total{model="..."}     # Errors per model

# Token metrics
litellm_tokens_used_total          # Total tokens consumed
litellm_prompt_tokens_total        # Prompt tokens
litellm_completion_tokens_total    # Completion tokens

# Cost metrics
litellm_total_cost_usd             # Total cost in USD

# Provider metrics
litellm_provider_requests_total{provider="groq"}   # Requests per provider
litellm_provider_errors_total{provider="groq"}     # Errors per provider
```

### Backend Custom Metrics (TO BE IMPLEMENTED)
```
# Usage analytics from the database
llm_credits_used_total{tenant_id="..."}   # Credits consumed per tenant
llm_requests_by_model{model="..."}        # Historical request counts
llm_cost_by_tenant{tenant_id="..."}       # Cost tracking per tenant
llm_response_time_seconds{model="..."}    # Response-time histograms
```

## 🚨 Monitoring Stack Management

### Start Monitoring
```bash
docker-compose -f docker-compose.monitoring.yml up -d
```

### Stop Monitoring
```bash
docker-compose -f docker-compose.monitoring.yml down
```

### View Logs
```bash
# Prometheus
docker logs ops-center-prometheus -f

# Grafana
docker logs ops-center-grafana -f

# AlertManager
docker logs ops-center-alertmanager -f
```

### Restart Services
```bash
# Restart the full monitoring stack
docker-compose -f docker-compose.monitoring.yml restart

# Restart a specific service
docker restart ops-center-prometheus
docker restart ops-center-grafana
```

## 🔧 Configuration Updates

### Reload Prometheus Config (no restart needed)
Requires Prometheus to be running with the `--web.enable-lifecycle` flag:
```bash
curl -X POST http://localhost:9090/-/reload
```

### Add New Alert Rules
1. Edit `/config/prometheus/rules/litellm_alerts.yml`
2. Reload Prometheus:
```bash
curl -X POST http://localhost:9090/-/reload
```
3. Verify the rules loaded:
```bash
curl -s http://localhost:9090/api/v1/rules | jq '.data.groups[] | .name'
```

### Import New Grafana Dashboard
1. Place the JSON file in `/config/grafana/dashboards/`
2. Restart Grafana:
```bash
docker restart ops-center-grafana
```

## 🎯 Current Status

✅ **Operational:**
- Prometheus scraping 7 exporters
- Grafana dashboard created
- 15 alert rules configured and loaded
- Monitoring script functional
- All monitoring containers running

⚠️ **Pending:**
- **LiteLLM native metrics not exposed** - `/metrics` endpoint returns 404
  * Need to enable the Prometheus exporter in the LiteLLM config
  * May require setting the `LITELLM_ENABLE_PROMETHEUS=true` environment variable
  * Alternative: use LiteLLM database logging and export metrics from the backend

- **Backend metrics endpoint not implemented** - `/api/v1/llm/metrics` doesn't exist yet
  * Need to create a FastAPI route in the backend
  * Query `litellm_db` for usage statistics
  * Export in Prometheus format (counter/gauge/histogram)

- **AlertManager configuration needed** - container currently restart-looping
  * Need to configure `/config/alertmanager/alertmanager.yml`
  * Set up notification channels (email, Slack, webhook)
  * Define routing rules by alert severity

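For the first pending item, LiteLLM's documented route to Prometheus export is via callbacks in the proxy config rather than a dedicated environment variable; worth verifying against the deployed LiteLLM version before relying on it:

```yaml
# litellm config.yaml (sketch; confirm against the installed LiteLLM version)
litellm_settings:
  success_callback: ["prometheus"]
  failure_callback: ["prometheus"]
```
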
## 🔍 Troubleshooting

### Check Scrape Targets
```bash
# View all targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {job: .labels.job, health: .health, error: .lastError}'

# View only LiteLLM targets
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job | contains("litellm"))'
```

### Test Metrics Endpoints
```bash
# LiteLLM native metrics (currently 404)
curl http://localhost:4000/metrics

# Backend custom metrics (not yet implemented)
curl http://localhost:8084/api/v1/llm/metrics
```

### Query Prometheus
```bash
# Check whether any litellm metrics exist
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq '.data[] | select(contains("litellm"))'

# Query a specific metric
curl -s 'http://localhost:9090/api/v1/query?query=up{job="litellm"}' | jq .
```

### View Active Alerts
```bash
# All active alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | {alert: .labels.alertname, state: .state, value: .value}'

# Only firing alerts
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
```

## 📝 Next Steps

### Priority 1: Enable LiteLLM Metrics
1. Check the LiteLLM documentation for Prometheus export
2. Update environment variables or config to enable metrics
3. Verify that `/metrics` responds in Prometheus format
4. Confirm Prometheus can scrape the endpoint

### Priority 2: Implement Backend Metrics Endpoint
1. Create the `/api/v1/llm/metrics` route in the backend
2. Query `litellm_db` for:
   - Total requests (from `llm_transactions`)
   - Credits used (from `llm_credits`)
   - Response times (from `llm_transactions`)
   - Error counts (from `llm_transactions` WHERE status='error')
3. Format as Prometheus metrics:
```python
from prometheus_client import Counter, Histogram, generate_latest

# Example metric definitions
llm_requests = Counter('llm_requests_total', 'Total LLM requests', ['model', 'tenant'])
llm_credits = Counter('llm_credits_used_total', 'Total credits used', ['tenant'])
llm_latency = Histogram('llm_response_time_seconds', 'LLM response time', ['model'])
```

300+
### Priority 3: Configure AlertManager
301+
1. Create `/config/alertmanager/alertmanager.yml`
302+
2. Configure notification receivers:
303+
- Email for critical alerts
304+
- Slack for warnings
305+
- Webhook for custom integrations
306+
3. Set up routing based on severity and alert group
307+
4. Test alert delivery
308+
309+
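A starting-point sketch for step 1. The receiver names, addresses, and the Slack webhook URL are placeholders to be replaced with real values:

```yaml
route:
  receiver: slack-warnings            # default receiver
  group_by: ['alertname', 'severity']
  routes:
    - match:
        severity: critical
      receiver: email-critical

receivers:
  - name: email-critical
    email_configs:
      - to: 'ops@example.com'
        from: 'alertmanager@example.com'
        smarthost: 'smtp.example.com:587'
  - name: slack-warnings
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/REPLACE/ME'
        channel: '#alerts'
```
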
### Priority 4: Additional Dashboards
1. Create provider-specific dashboards (Groq, HuggingFace, local)
2. Add a cost analysis dashboard
3. Create a tenant usage dashboard
4. Build a capacity planning dashboard

## 📚 Resources

- Prometheus Docs: https://prometheus.io/docs/
- Grafana Docs: https://grafana.com/docs/
- LiteLLM Docs: https://docs.litellm.ai/
- Prometheus Query Examples: https://prometheus.io/docs/prometheus/latest/querying/examples/

## 🔐 Security Notes

1. **Change the Grafana default password immediately**
```bash
# Browse to http://localhost:3001
# Log in with admin/admin
# You'll be prompted to change the password
```

2. **Restrict Prometheus access** - currently open on port 9090
   - Use Traefik auth middleware in production
   - Or restrict it to the internal network only

3. **Protect metrics endpoints** - ensure LiteLLM metrics require authentication
   - Require the `LITELLM_MASTER_KEY` for the metrics endpoint
   - Use network policies to restrict scraper access

4. **Secure AlertManager webhooks** - use signed payloads
   - Configure HMAC signatures
   - Validate webhook sources

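For note 2, a Traefik basic-auth middleware can be attached via container labels along these lines. The hostname, router name, and hashed credentials are placeholders:

```yaml
# docker-compose labels on the Prometheus service (illustrative sketch)
labels:
  - "traefik.enable=true"
  - "traefik.http.routers.prometheus.rule=Host(`prometheus.kubeworkz.io`)"
  - "traefik.http.routers.prometheus.middlewares=prom-auth"
  # htpasswd-style user:hash pair; generate one with `htpasswd -nb admin <password>`
  - "traefik.http.middlewares.prom-auth.basicauth.users=admin:$$apr1$$REPLACE$$HASH"
```
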
## 📞 Support

For monitoring issues:
1. Check the monitoring script output: `./scripts/monitor_litellm.sh`
2. Review Prometheus targets: http://localhost:9090/targets
3. Check container logs: `docker logs ops-center-prometheus`
4. Verify alert rules: http://localhost:9090/rules

---

**Last Updated:** 2026-02-13
**Version:** 1.0
**Status:** Monitoring stack deployed; metrics endpoints pending implementation
