Performance Monitoring

arminrad edited this page Mar 16, 2026 · 2 revisions

Track backend TTFB, streaming duration, and request stage breakdown


Overview

The performance tracking system splits each request into three phases and records how long each one takes (approximate timings in parentheses):

  • Frontend Processing (< 3ms) - Request parsing, auth, preparation
  • Backend API Response (~1.7s) - Time to first byte (TTFB)
  • Stream Processing (~1.3s) - Streaming response to client

Stack: Prometheus + Grafana + PerformanceTracker


Quick Setup

1. Install Dependencies

prometheus-client is already listed in requirements.txt. To install it on its own:

pip install prometheus-client

2. Enable Metrics Endpoint

Metrics are exposed automatically at the /metrics endpoint.

3. Import PerformanceTracker

from src.utils.performance_tracker import PerformanceTracker

Metrics Collected

Backend TTFB Metrics

  • backend_ttfb_seconds - Histogram of backend API TTFB
    • Labels: provider, model, endpoint
    • Buckets: 0.1s, 0.5s, 1.0s, 1.5s, 2.0s, 2.5s, 3.0s, 5.0s, 10.0s

Streaming Duration Metrics

  • streaming_duration_seconds - Histogram of streaming response duration
    • Labels: provider, model, endpoint
    • Buckets: 0.1s, 0.5s, 1.0s, 1.5s, 2.0s, 2.5s, 3.0s, 5.0s, 10.0s

Frontend Processing Metrics

  • frontend_processing_seconds - Histogram of frontend processing time
    • Labels: endpoint
    • Buckets: 0.001s, 0.005s, 0.01s, 0.025s, 0.05s, 0.1s, 0.25s, 0.5s
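
For reference, histograms with the names, labels, and buckets listed above could be declared with prometheus-client roughly as follows. This is a sketch, not the project's actual definitions: the variable names, help strings, the "example-model" label value, and the separate registry are assumptions made for illustration.

```python
from prometheus_client import CollectorRegistry, Histogram, generate_latest

# Separate registry so this sketch cannot clash with the app's real metrics.
registry = CollectorRegistry()

# Bucket boundaries in seconds, copied from the lists above.
LATENCY_BUCKETS = (0.1, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 5.0, 10.0)
FRONTEND_BUCKETS = (0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5)

backend_ttfb = Histogram(
    "backend_ttfb_seconds",
    "Backend API time to first byte (seconds)",
    labelnames=("provider", "model", "endpoint"),
    buckets=LATENCY_BUCKETS,
    registry=registry,
)
streaming_duration = Histogram(
    "streaming_duration_seconds",
    "Streaming response duration (seconds)",
    labelnames=("provider", "model", "endpoint"),
    buckets=LATENCY_BUCKETS,
    registry=registry,
)
frontend_processing = Histogram(
    "frontend_processing_seconds",
    "Frontend processing time (seconds)",
    labelnames=("endpoint",),
    buckets=FRONTEND_BUCKETS,
    registry=registry,
)

# Record one observation per labelled series ("example-model" is a placeholder).
backend_ttfb.labels(
    provider="openrouter", model="example-model", endpoint="/v1/chat/completions"
).observe(1.7)
frontend_processing.labels(endpoint="/v1/chat/completions").observe(0.002)

# Text in the Prometheus exposition format, as a /metrics endpoint would serve it.
exposition = generate_latest(registry).decode("utf-8")
```

Each histogram is exported as _bucket, _sum, and _count series, which is why the queries later on this page reference names such as backend_ttfb_seconds_bucket.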

Stage Breakdown Metrics

  • request_stage_duration_seconds - Histogram per stage
    • Labels: stage, endpoint
    • Stages: request_parsing, auth_validation, request_preparation, backend_fetch, stream_processing
  • stage_percentage - Gauge of percentage per stage
    • Labels: stage, endpoint
    • Stages: frontend_processing, backend_response, stream_processing

Usage

Basic Instrumentation

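# Assumes a FastAPI app. `router`, `Request` (a request model with
# `messages`, `model`, and `stream` fields), `get_api_key`, and the helper
# functions are expected to be defined elsewhere in the application.
from fastapi import Depends
from fastapi.responses import StreamingResponse

from src.utils.performance_tracker import PerformanceTracker

@router.post("/v1/chat/completions")
async def chat_completions(req: Request, api_key: str = Depends(get_api_key)):
    tracker = PerformanceTracker(endpoint="/v1/chat/completions")

    # Track request parsing
    with tracker.stage("request_parsing"):
        messages = req.messages
        model = req.model

    # Track auth validation
    with tracker.stage("auth_validation"):
        user = await validate_api_key(api_key)

    # Track request preparation
    with tracker.stage("request_preparation"):
        headers = prepare_headers(api_key)

    # Track backend request (TTFB)
    with tracker.backend_request(provider="openrouter", model=model):
        response = await make_backend_request(messages, model, headers)

    if not req.stream:
        # Non-streaming: record percentages before returning
        tracker.record_percentages()
        return response

    # Streaming: a function cannot both `yield` chunks and `return` a value,
    # so the chunks are produced by an inner async generator.
    async def tracked_stream():
        with tracker.streaming():
            async for chunk in stream_response(response):
                yield chunk
        # Record percentages once the stream is exhausted
        tracker.record_percentages()

    return StreamingResponse(tracked_stream())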

Simplified Context Manager

from src.utils.performance_tracker import track_request_stages

@router.post("/v1/responses")
async def unified_responses(req: Request):
    with track_request_stages("/v1/responses") as tracker:
        with tracker.stage("request_parsing"):
            # parse request
            pass

        with tracker.backend_request(provider="portkey", model=req.model):
            response = await make_request(...)

        # Percentages are recorded automatically when the outer block exits
        return response
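
Recording on exit is a standard context-manager pattern. The sketch below shows one way such a helper could be built using only the standard library. SketchTracker and track_stages_sketch are hypothetical names; they do not reproduce the real PerformanceTracker, which also publishes to Prometheus.

```python
import time
from contextlib import contextmanager


class SketchTracker:
    """Hypothetical stand-in that only accumulates stage durations in memory."""

    def __init__(self, endpoint: str) -> None:
        self.endpoint = endpoint
        self.durations: dict[str, float] = {}
        self.percentages: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Runs even if the stage raises, so failed stages are still timed.
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def record_percentages(self) -> None:
        total = sum(self.durations.values())
        if total > 0:
            self.percentages = {
                name: 100.0 * seconds / total
                for name, seconds in self.durations.items()
            }


@contextmanager
def track_stages_sketch(endpoint: str):
    tracker = SketchTracker(endpoint)
    try:
        yield tracker
    finally:
        # Percentages are computed when the `with` block exits.
        tracker.record_percentages()


with track_stages_sketch("/v1/responses") as tracker:
    with tracker.stage("request_parsing"):
        time.sleep(0.01)  # placeholder for real parsing work
    with tracker.stage("backend_fetch"):
        time.sleep(0.03)  # placeholder for the backend call
```

Putting the bookkeeping in a finally clause means the percentages are still recorded when a stage raises, which keeps failed requests visible in the breakdown.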

Prometheus Queries

Backend TTFB (p95)

histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le))

Streaming Duration (p95)

histogram_quantile(0.95, sum(rate(streaming_duration_seconds_bucket[5m])) by (le))

Backend TTFB by Provider

histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le, provider))
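
histogram_quantile estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank, so the result is only as precise as the bucket boundaries. The following Python sketch mirrors that interpolation for a single classic histogram series. It is a simplification for illustration (it skips edge cases that Prometheus handles), and the sample counts are invented.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate quantile q from (upper_bound, cumulative_count) pairs.

    The last bucket must have upper_bound == +Inf, as in Prometheus.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only supports 0 < q <= 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last one being +Inf")

    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total

    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, cum) in enumerate(buckets) if cum >= rank)
    if index == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]

    upper, cum = buckets[index]
    lower = buckets[index - 1][0] if index > 0 else 0.0
    prev = buckets[index - 1][1] if index > 0 else 0.0
    # Linear interpolation inside the bucket.
    return lower + (upper - lower) * (rank - prev) / (cum - prev)


# Invented cumulative counts over the backend_ttfb_seconds bucket bounds.
sample = [
    (0.1, 0), (0.5, 2), (1.0, 10), (1.5, 40), (2.0, 80),
    (2.5, 94), (3.0, 98), (5.0, 100), (10.0, 100), (math.inf, 100),
]
p95 = histogram_quantile(0.95, sample)
print(f"p95 TTFB estimate: {p95:.3f}s")
```

With these counts the 95th observation falls in the 2.5s to 3.0s bucket, so the estimate lands between those two bounds regardless of where the real values sit inside the bucket.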

Stage Breakdown (average duration per stage)

sum(rate(request_stage_duration_seconds_sum[5m])) by (stage) /
sum(rate(request_stage_duration_seconds_count[5m])) by (stage)

Percentage of Time in Backend

avg(stage_percentage{stage="backend_response"})

Grafana Dashboard

Importing Dashboard

  1. Open Grafana → Dashboards → Import
  2. Upload dashboards/performance-monitoring.json
  3. Select Prometheus datasource
  4. Click Import

Dashboard Panels

  1. Backend API TTFB by Provider/Model (p50, p95, p99)
  2. Streaming Duration by Provider/Model (p50, p95, p99)
  3. Frontend Processing Time (p95, should be < 3ms)
  4. Request Stage Breakdown (Stacked)
  5. Time Distribution by Stage (%)
  6. Top 10 Slowest Backend TTFB
  7. Real-time Gauge Panels

Alerting

Prometheus Alert Rules

Add to prometheus-alerts.yml:

groups:
  - name: performance_alerts
    rules:
      # Backend TTFB Alerts
      - alert: HighBackendTTFB
        expr: histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le)) > 2.0
        for: 5m
        annotations:
          summary: "High backend TTFB detected"
          description: "p95 TTFB > 2.0s for 5 minutes"

      - alert: CriticalBackendTTFB
        expr: histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le)) > 3.0
        for: 2m
        annotations:
          summary: "Critical backend TTFB"
          description: "p95 TTFB > 3.0s for 2 minutes"

      # Streaming Duration Alerts
      - alert: HighStreamingDuration
        expr: histogram_quantile(0.95, sum(rate(streaming_duration_seconds_bucket[5m])) by (le)) > 1.5
        for: 5m
        annotations:
          summary: "High streaming duration"

      # Frontend Processing Alerts
      - alert: HighFrontendProcessing
        expr: histogram_quantile(0.95, sum(rate(frontend_processing_seconds_bucket[5m])) by (le)) > 0.01
        for: 5m
        annotations:
          summary: "Frontend processing time excessive"

Performance Optimization

Backend Optimization (56% of time)

  1. Monitor cold starts - Track model initialization time
  2. Scale infrastructure - Add backend capacity as needed
  3. Implement caching - Cache similar requests
  4. Route smartly - Use faster providers when possible

Network Optimization (43% of time)

  1. Check latency - Monitor network latency to backend
  2. Use CDN/Edge - Deploy closer to users
  3. Optimize chunks - Tune chunk sizes for streaming
  4. Connection pooling - Maintain persistent connections

Frontend Optimization (< 0.1%)

  • Already minimal, no optimization needed
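
The percentages in the headings above follow from the approximate timings quoted in the Overview (about 3 ms frontend, 1.7 s backend TTFB, 1.3 s streaming). A minimal sketch of that arithmetic; the stage_shares helper is illustrative and not part of PerformanceTracker:

```python
def stage_shares(durations: dict[str, float]) -> dict[str, float]:
    """Return each stage's share of the total request time, in percent."""
    total = sum(durations.values())
    if total <= 0:
        return {stage: 0.0 for stage in durations}
    return {stage: 100.0 * seconds / total for stage, seconds in durations.items()}


# Approximate per-request timings from the Overview, in seconds.
typical = {
    "frontend_processing": 0.003,
    "backend_response": 1.7,
    "stream_processing": 1.3,
}

shares = stage_shares(typical)
for stage, pct in shares.items():
    print(f"{stage}: {pct:.1f}%")
```

Because the frontend share is so small, even halving frontend time would not change the overall latency noticeably, which is why the optimization effort is directed at the backend and streaming phases.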

Monitoring Best Practices

  1. Set Baseline Metrics - Establish normal performance during low load
  2. Monitor Trends - Watch for gradual degradation
  3. Alert on Anomalies - Set up alerts for spikes
  4. Regular Review - Weekly performance reviews
  5. Provider Comparison - Compare performance across providers
  6. Model Comparison - Track performance by model

Troubleshooting

Metrics Not Appearing

Symptom: No metrics in Grafana

Solution:

  1. Check instrumentation - Ensure routes use PerformanceTracker
  2. Check Prometheus scraping - Verify target is "UP"
  3. Verify labels - Ensure provider/model labels are correct
  4. Check logs - Look for errors
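
The first two checks can be partly automated by fetching the metrics endpoint and confirming that the expected metrics have samples. This is a rough sketch: the URL (http://localhost:8000/metrics) is an assumption to adjust for your deployment, and the line parsing is deliberately crude.

```python
from urllib.request import urlopen

EXPECTED_METRICS = (
    "backend_ttfb_seconds",
    "streaming_duration_seconds",
    "frontend_processing_seconds",
    "request_stage_duration_seconds",
    "stage_percentage",
)


def missing_metrics(exposition: str, expected=EXPECTED_METRICS) -> list[str]:
    """Return the expected metric names that have no sample line in the text."""
    present = set()
    for line in exposition.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Sample lines look like `name{labels} value` or `name value`.
        sample_name = line.split("{", 1)[0].split(" ", 1)[0]
        for name in expected:
            # Histograms appear as name_bucket, name_sum, and name_count.
            if sample_name == name or sample_name.startswith(name + "_"):
                present.add(name)
    return [name for name in expected if name not in present]


def check_endpoint(url: str = "http://localhost:8000/metrics") -> list[str]:
    # The default URL is a placeholder, not a documented address.
    with urlopen(url, timeout=5) as resp:
        return missing_metrics(resp.read().decode("utf-8"))
```

Calling check_endpoint() against a running instance returns the names that are still missing. With prometheus-client, a labelled metric emits no sample lines until it has been observed at least once, so a missing name usually means that no instrumented route has handled a request yet.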

High TTFB

Symptom: Backend TTFB > 3 seconds

Solution:

  1. Check backend health - Verify API is responsive
  2. Test network - Check latency to backend
  3. Compare providers - Some may be slower
  4. Check model - Some have longer cold starts

High Streaming Duration

Symptom: Streaming takes > 2.5 seconds

Solution:

  1. Test network bandwidth
  2. Reduce chunk size if too large
  3. Check model generation speed
  4. Verify client processing isn't the bottleneck
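
When reproducing either symptom, it can help to time the two phases separately to tell a slow first byte from a slow stream. The sketch below does this for any async chunk iterator using only the standard library; time_stream and fake_backend are hypothetical helpers, and the sleeps stand in for real backend latency.

```python
import asyncio
import time
from typing import AsyncIterator, Tuple


async def time_stream(chunks: AsyncIterator[bytes]) -> Tuple[float, float, int]:
    """Return (ttfb_seconds, streaming_seconds, chunk_count) for a stream."""
    start = time.perf_counter()
    first = None
    count = 0
    async for _ in chunks:
        if first is None:
            first = time.perf_counter()  # first chunk arrived
        count += 1
    end = time.perf_counter()
    if first is None:
        # Empty stream: all elapsed time counts as waiting for a first byte.
        return end - start, 0.0, 0
    return first - start, end - first, count


async def fake_backend() -> AsyncIterator[bytes]:
    await asyncio.sleep(0.05)  # simulated time to first byte
    for _ in range(3):
        yield b"chunk"
        await asyncio.sleep(0.01)  # simulated gap between chunks


ttfb, streaming, n = asyncio.run(time_stream(fake_backend()))
print(f"TTFB {ttfb:.3f}s, streaming {streaming:.3f}s over {n} chunks")
```

A high first number points at the backend or the network path to it, while a high second number points at generation speed, chunking, or the consumer.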

Integration

With Existing Monitoring

Performance monitoring complements:

  • Sentry - Error tracking and debugging
  • Prometheus - General metrics collection
  • OpenTelemetry - Distributed tracing

All systems can coexist.

Custom Metrics

Add custom performance tracking:

from src.services.prometheus_metrics import track_duration

with track_duration("custom_operation"):
    # your operation
    pass

Last Updated: December 2024
Status: Production Ready