Performance Monitoring
Track backend TTFB, streaming duration, and request stage breakdown
The performance tracking system breaks each request into three timed stages:
- Frontend Processing (< 3ms) - Request parsing, auth, preparation
- Backend API Response (~1.7s) - Time to first byte (TTFB)
- Stream Processing (~1.3s) - Streaming response to client
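As a rough illustration of how these stages relate, the sketch below converts the approximate durations listed above into shares of total request time. The dictionary keys mirror the stage names used by the stage_percentage gauge; the helper name `stage_percentages` is hypothetical and not part of the project.

```python
def stage_percentages(durations):
    """Return each stage's share of the total duration, in percent."""
    total = sum(durations.values())
    if total <= 0:
        # Nothing was measured, so every share is zero.
        return {stage: 0.0 for stage in durations}
    return {stage: 100.0 * seconds / total for stage, seconds in durations.items()}


# Approximate figures from the breakdown above, in seconds.
typical = {
    "frontend_processing": 0.003,
    "backend_response": 1.7,
    "stream_processing": 1.3,
}

shares = stage_percentages(typical)
for stage, pct in shares.items():
    print(f"{stage}: {pct:.1f}%")
```

With these figures the backend wait accounts for roughly 57% of the request and streaming for roughly 43%, while frontend processing is about 0.1%.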
Stack: Prometheus + Grafana + PerformanceTracker
Already included in requirements.txt:
pip install prometheus-client
Metrics are automatically exposed at the /metrics endpoint.
from src.utils.performance_tracker import PerformanceTracker
- backend_ttfb_seconds - Histogram of backend API TTFB
  - Labels: provider, model, endpoint
  - Buckets: 0.1s, 0.5s, 1.0s, 1.5s, 2.0s, 2.5s, 3.0s, 5.0s, 10.0s
- streaming_duration_seconds - Histogram of streaming response duration
  - Labels: provider, model, endpoint
  - Buckets: 0.1s, 0.5s, 1.0s, 1.5s, 2.0s, 2.5s, 3.0s, 5.0s, 10.0s
- frontend_processing_seconds - Histogram of frontend processing time
  - Labels: endpoint
  - Buckets: 0.001s, 0.005s, 0.01s, 0.025s, 0.05s, 0.1s, 0.25s, 0.5s
- request_stage_duration_seconds - Histogram per stage
  - Labels: stage, endpoint
  - Stages: request_parsing, auth_validation, request_preparation, backend_fetch, stream_processing
- stage_percentage - Gauge of percentage per stage
  - Labels: stage, endpoint
  - Stages: frontend_processing, backend_response, stream_processing
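To make the per-stage metrics above more concrete, here is a minimal sketch of how stage durations could be collected with a context manager. `MiniTracker` is a hypothetical stand-in written for illustration only; it is not the project's PerformanceTracker and it keeps timings in memory instead of sending them to Prometheus.

```python
import time
from contextlib import contextmanager


class MiniTracker:
    """Hypothetical stand-in that collects per-stage durations in memory."""

    def __init__(self, endpoint, clock=time.perf_counter):
        self.endpoint = endpoint
        self.durations = {}   # stage name -> accumulated seconds
        self._clock = clock   # injectable so the timer can be tested

    @contextmanager
    def stage(self, name):
        start = self._clock()
        try:
            yield
        finally:
            # Record the elapsed time even if the wrapped block raises.
            elapsed = self._clock() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed


tracker = MiniTracker(endpoint="/v1/chat/completions")
with tracker.stage("request_parsing"):
    sum(range(1000))        # placeholder work
with tracker.stage("backend_fetch"):
    time.sleep(0.01)        # placeholder for the backend call

for name, seconds in tracker.durations.items():
    print(f"{name}: {seconds * 1000:.2f} ms")
```

A production tracker would typically pass each elapsed value to a histogram's observe() call with the stage and endpoint labels rather than storing it in a dictionary.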
# Assumes FastAPI, which the router/Depends usage suggests
from fastapi.responses import StreamingResponse

from src.utils.performance_tracker import PerformanceTracker

@router.post("/v1/chat/completions")
async def chat_completions(req: Request, api_key: str = Depends(get_api_key)):
    tracker = PerformanceTracker(endpoint="/v1/chat/completions")

    # Track request parsing
    with tracker.stage("request_parsing"):
        messages = req.messages
        model = req.model

    # Track auth validation
    with tracker.stage("auth_validation"):
        user = await validate_api_key(api_key)

    # Track request preparation
    with tracker.stage("request_preparation"):
        headers = prepare_headers(api_key)

    # Track backend request (TTFB)
    with tracker.backend_request(provider="openrouter", model=model):
        response = await make_backend_request(messages, model, headers)

    # Track streaming (if applicable)
    if req.stream:
        # A handler cannot both yield chunks and return a value, so the
        # streaming branch is wrapped in its own async generator.
        async def tracked_stream():
            with tracker.streaming():
                async for chunk in stream_response(response):
                    yield chunk
            # Record percentages once the stream has finished
            tracker.record_percentages()

        return StreamingResponse(tracked_stream())

    # Record percentages before returning the non-streaming response
    tracker.record_percentages()
    return response

from src.utils.performance_tracker import track_request_stages
@router.post("/v1/responses")
async def unified_responses(req: Request):
    with track_request_stages("/v1/responses") as tracker:
        with tracker.stage("request_parsing"):
            # parse request
            pass
        with tracker.backend_request(provider="portkey", model=req.model):
            response = await make_request(...)
    # Percentages automatically recorded on exit

# p95 backend TTFB
histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le))
# p95 streaming duration
histogram_quantile(0.95, sum(rate(streaming_duration_seconds_bucket[5m])) by (le))
# p95 backend TTFB per provider
histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le, provider))
# Average duration per stage
sum(rate(request_stage_duration_seconds_sum[5m])) by (stage) /
sum(rate(request_stage_duration_seconds_count[5m])) by (stage)
# Average share of request time spent waiting on the backend
avg(stage_percentage{stage="backend_response"})
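The percentile queries above do not read exact latencies; histogram_quantile estimates them from cumulative bucket counts. The sketch below is a simplified version of that estimate, using linear interpolation inside the bucket that contains the requested rank. The function name `estimate_quantile` and the sample counts are invented for illustration, and the sketch assumes at least one finite bucket plus a final +Inf bucket.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with a +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The rank falls in the overflow bucket: report the
                # highest finite bound instead of infinity.
                return buckets[-2][0]
            if count == prev_count:
                return prev_bound
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-2][0]


# Bucket bounds match backend_ttfb_seconds; the counts are made up.
bounds = [0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 10.0, math.inf]
counts = [0, 2, 10, 30, 70, 90, 96, 99, 100, 100]
sample = list(zip(bounds, counts))

print(f"p50 ~ {estimate_quantile(0.50, sample):.2f}s")
print(f"p95 ~ {estimate_quantile(0.95, sample):.2f}s")
```

Because the result is interpolated, its precision depends on the bucket layout: with the 2.5s and 3.0s bounds used here, a p95 near the 3.0s alert threshold can only be resolved to within that half-second bucket.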
- Open Grafana → Dashboards → Import
- Upload dashboards/performance-monitoring.json
- Select Prometheus datasource
- Click Import
- Backend API TTFB by Provider/Model (p50, p95, p99)
- Streaming Duration by Provider/Model (p50, p95, p99)
- Frontend Processing Time (p95, should be < 3ms)
- Request Stage Breakdown (Stacked)
- Time Distribution by Stage (%)
- Top 10 Slowest Backend TTFB
- Real-time Gauge Panels
Add to prometheus-alerts.yml:
groups:
  - name: performance_alerts
    rules:
      # Backend TTFB Alerts
      - alert: HighBackendTTFB
        expr: histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le)) > 2.0
        for: 5m
        annotations:
          summary: "High backend TTFB detected"
          description: "p95 TTFB > 2.0s for 5 minutes"
      - alert: CriticalBackendTTFB
        expr: histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le)) > 3.0
        for: 2m
        annotations:
          summary: "Critical backend TTFB"
          description: "p95 TTFB > 3.0s for 2 minutes"
      # Streaming Duration Alerts
      - alert: HighStreamingDuration
        expr: histogram_quantile(0.95, sum(rate(streaming_duration_seconds_bucket[5m])) by (le)) > 1.5
        for: 5m
        annotations:
          summary: "High streaming duration"
      # Frontend Processing Alerts
      - alert: HighFrontendProcessing
        expr: histogram_quantile(0.95, sum(rate(frontend_processing_seconds_bucket[5m])) by (le)) > 0.01
        for: 5m
        annotations:
          summary: "Frontend processing time excessive"

- Monitor cold starts - Track model initialization time
- Scale infrastructure - Add backend capacity as needed
- Implement caching - Cache similar requests
- Route smartly - Use faster providers when possible
- Check latency - Monitor network latency to backend
- Use CDN/Edge - Deploy closer to users
- Optimize chunks - Tune chunk sizes for streaming
- Connection pooling - Maintain persistent connections
- Frontend processing - Already minimal (< 3ms), no optimization needed
- Set Baseline Metrics - Establish normal performance during low load
- Monitor Trends - Watch for gradual degradation
- Alert on Anomalies - Set up alerts for spikes
- Regular Review - Weekly performance reviews
- Provider Comparison - Compare performance across providers
- Model Comparison - Track performance by model
Symptom: No metrics in Grafana
Solution:
- Check instrumentation - Ensure routes use PerformanceTracker
- Check Prometheus scraping - Verify target is "UP"
- Verify labels - Ensure provider/model labels are correct
- Check logs - Look for errors
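The first two checks above can be partly automated by inspecting the text served at /metrics. The sketch below parses Prometheus text exposition output and reports which of the documented metrics are absent. The helper names and the sample payload (including its HELP text and the model label "m") are hypothetical; only the metric names come from this page.

```python
EXPECTED_METRICS = (
    "backend_ttfb_seconds",
    "streaming_duration_seconds",
    "frontend_processing_seconds",
    "request_stage_duration_seconds",
    "stage_percentage",
)


def exposed_metric_names(exposition_text):
    """Collect sample names from Prometheus text exposition output."""
    names = set()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # A sample line looks like `name{labels} value` or `name value`.
        names.add(line.split("{", 1)[0].split(" ", 1)[0])
    return names


def missing_metrics(exposition_text, expected=EXPECTED_METRICS):
    """Return the expected metrics that have no samples in the output."""
    names = exposed_metric_names(exposition_text)
    missing = []
    for metric in expected:
        # Histograms appear as <metric>_bucket, <metric>_sum, <metric>_count.
        if not any(n == metric or n.startswith(metric + "_") for n in names):
            missing.append(metric)
    return missing


sample = """\
# HELP backend_ttfb_seconds Backend API TTFB
# TYPE backend_ttfb_seconds histogram
backend_ttfb_seconds_bucket{provider="openrouter",model="m",endpoint="/v1/chat/completions",le="0.5"} 3
backend_ttfb_seconds_count{provider="openrouter",model="m",endpoint="/v1/chat/completions"} 3
stage_percentage{stage="backend_response",endpoint="/v1/chat/completions"} 56.6
"""

print(missing_metrics(sample))
```

A metric that is missing here has usually never been observed, which points at routes that are not instrumented rather than at a scraping problem.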
Symptom: Backend TTFB > 3 seconds
Solution:
- Check backend health - Verify API is responsive
- Test network - Check latency to backend
- Compare providers - Some may be slower
- Check model - Some have longer cold starts
Symptom: Streaming takes > 2.5 seconds
Solution:
- Test network bandwidth
- Reduce chunk size if too large
- Check model generation speed
- Verify client processing isn't bottleneck
Performance monitoring complements:
- Sentry - Error tracking and debugging
- Prometheus - General metrics collection
- OpenTelemetry - Distributed tracing
All systems can coexist.
Add custom performance tracking:
from src.services.prometheus_metrics import track_duration

with track_duration("custom_operation"):
    # your operation
    pass

- Prometheus Setup - Set up Prometheus
- Error Monitoring - Sentry integration
- Caching System - Performance optimization
Last Updated: December 2024
Status: Production Ready
- Monitoring-System — Full monitoring architecture
- Prometheus-Setup — Metric collection infrastructure
- Error-Monitoring — Error-specific monitoring