Performance Monitoring

Track backend TTFB, streaming duration, and request stage breakdown

Overview

Comprehensive performance tracking system providing detailed insights into request processing:

Frontend Processing (< 3ms) - Request parsing, auth, preparation
Backend API Response (~1.7s) - Time to first byte (TTFB)
Stream Processing (~1.3s) - Streaming response to client

Stack: Prometheus + Grafana + PerformanceTracker

Quick Setup

1. Install Dependencies

Already included in requirements.txt:

pip install prometheus-client

2. Enable Metrics Endpoint

Metrics automatically exposed at /metrics endpoint.

3. Import PerformanceTracker

from src.utils.performance_tracker import PerformanceTracker

Metrics Collected

Backend TTFB Metrics

backend_ttfb_seconds - Histogram of backend API TTFB
- Labels: provider, model, endpoint
- Buckets: 0.1s, 0.5s, 1.0s, 1.5s, 2.0s, 2.5s, 3.0s, 5.0s, 10.0s

Streaming Duration Metrics

streaming_duration_seconds - Histogram of streaming response duration
- Labels: provider, model, endpoint
- Buckets: 0.1s, 0.5s, 1.0s, 1.5s, 2.0s, 2.5s, 3.0s, 5.0s, 10.0s

Frontend Processing Metrics

frontend_processing_seconds - Histogram of frontend processing time
- Labels: endpoint
- Buckets: 0.001s, 0.005s, 0.01s, 0.025s, 0.05s, 0.1s, 0.25s, 0.5s

Stage Breakdown Metrics

request_stage_duration_seconds - Histogram per stage
- Labels: stage, endpoint
- Stages: request_parsing, auth_validation, request_preparation, backend_fetch, stream_processing
stage_percentage - Gauge of percentage per stage
- Labels: stage, endpoint
- Stages: frontend_processing, backend_response, stream_processing

Usage

Basic Instrumentation

from src.utils.performance_tracker import PerformanceTracker

@router.post("/v1/chat/completions")
async def chat_completions(req: Request, api_key: str = Depends(get_api_key)):
    tracker = PerformanceTracker(endpoint="/v1/chat/completions")

    # Track request parsing
    with tracker.stage("request_parsing"):
        messages = req.messages
        model = req.model

    # Track auth validation
    with tracker.stage("auth_validation"):
        user = await validate_api_key(api_key)

    # Track request preparation
    with tracker.stage("request_preparation"):
        headers = prepare_headers(api_key)

    # Track backend request (TTFB)
    with tracker.backend_request(provider="openrouter", model=model):
        response = await make_backend_request(messages, model, headers)

    # Track streaming (if applicable)
    if req.stream:
        with tracker.streaming():
            async for chunk in stream_response(response):
                yield chunk
    else:
        return response

    # Record percentages
    tracker.record_percentages()

Simplified Context Manager

from src.utils.performance_tracker import track_request_stages

@router.post("/v1/responses")
async def unified_responses(req: Request):
    with track_request_stages("/v1/responses") as tracker:
        with tracker.stage("request_parsing"):
            # parse request
            pass

        with tracker.backend_request(provider="portkey", model=req.model):
            response = await make_request(...)

        # Percentages automatically recorded on exit

Prometheus Queries

Average Backend TTFB (p95)

histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le))

Average Streaming Duration (p95)

histogram_quantile(0.95, sum(rate(streaming_duration_seconds_bucket[5m])) by (le))

Backend TTFB by Provider

histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le, provider))

Stage Breakdown

sum(rate(request_stage_duration_seconds_sum[5m])) by (stage) /
sum(rate(request_stage_duration_seconds_count[5m])) by (stage)

Percentage of Time in Backend

avg(stage_percentage{stage="backend_response"})

Grafana Dashboard

Importing Dashboard

Open Grafana → Dashboards → Import
Upload dashboards/performance-monitoring.json
Select Prometheus datasource
Click Import

Dashboard Panels

Backend API TTFB by Provider/Model (p50, p95, p99)
Streaming Duration by Provider/Model (p50, p95, p99)
Frontend Processing Time (p95, should be < 3ms)
Request Stage Breakdown (Stacked)
Time Distribution by Stage (%)
Top 10 Slowest Backend TTFB
Real-time Gauge Panels

Alerting

Prometheus Alert Rules

Add to prometheus-alerts.yml:

groups:
  - name: performance_alerts
    rules:
      # Backend TTFB Alerts
      - alert: HighBackendTTFB
        expr: histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le)) > 2.0
        for: 5m
        annotations:
          summary: "High backend TTFB detected"
          description: "p95 TTFB > 2.0s for 5 minutes"

      - alert: CriticalBackendTTFB
        expr: histogram_quantile(0.95, sum(rate(backend_ttfb_seconds_bucket[5m])) by (le)) > 3.0
        for: 2m
        annotations:
          summary: "Critical backend TTFB"
          description: "p95 TTFB > 3.0s for 2 minutes"

      # Streaming Duration Alerts
      - alert: HighStreamingDuration
        expr: histogram_quantile(0.95, sum(rate(streaming_duration_seconds_bucket[5m])) by (le)) > 1.5
        for: 5m
        annotations:
          summary: "High streaming duration"

      # Frontend Processing Alerts
      - alert: HighFrontendProcessing
        expr: histogram_quantile(0.95, sum(rate(frontend_processing_seconds_bucket[5m])) by (le)) > 0.01
        for: 5m
        annotations:
          summary: "Frontend processing time excessive"

Performance Optimization

Backend Optimization (56% of time)

Monitor cold starts - Track model initialization time
Scale infrastructure - Add backend capacity as needed
Implement caching - Cache similar requests
Route smartly - Use faster providers when possible

Network Optimization (43% of time)

Check latency - Monitor network latency to backend
Use CDN/Edge - Deploy closer to users
Optimize chunks - Tune chunk sizes for streaming
Connection pooling - Maintain persistent connections

Frontend Optimization (< 0.1%)

Already minimal, no optimization needed

Monitoring Best Practices

Set Baseline Metrics - Establish normal performance during low load
Monitor Trends - Watch for gradual degradation
Alert on Anomalies - Set up alerts for spikes
Regular Review - Weekly performance reviews
Provider Comparison - Compare performance across providers
Model Comparison - Track performance by model

Troubleshooting

Metrics Not Appearing

Symptom: No metrics in Grafana

Solution:

Check instrumentation - Ensure routes use PerformanceTracker
Check Prometheus scraping - Verify target is "UP"
Verify labels - Ensure provider/model labels are correct
Check logs - Look for errors

High TTFB

Symptom: Backend TTFB > 3 seconds

Solution:

Check backend health - Verify API is responsive
Test network - Check latency to backend
Compare providers - Some may be slower
Check model - Some have longer cold starts

High Streaming Duration

Symptom: Streaming takes > 2.5 seconds

Solution:

Test network bandwidth
Reduce chunk size if too large
Check model generation speed
Verify client processing isn't bottleneck

Integration

With Existing Monitoring

Performance monitoring complements:

Sentry - Error tracking and debugging
Prometheus - General metrics collection
OpenTelemetry - Distributed tracing

All systems can coexist.

Custom Metrics

Add custom performance tracking:

from src.services.prometheus_metrics import track_duration

with track_duration("custom_operation"):
    # your operation
    pass

Performance Monitoring

Performance Monitoring

Overview

Quick Setup

1. Install Dependencies

2. Enable Metrics Endpoint

3. Import PerformanceTracker

Metrics Collected

Backend TTFB Metrics

Streaming Duration Metrics

Frontend Processing Metrics

Stage Breakdown Metrics

Usage

Basic Instrumentation

Simplified Context Manager

Prometheus Queries

Average Backend TTFB (p95)

Average Streaming Duration (p95)

Backend TTFB by Provider

Stage Breakdown

Percentage of Time in Backend

Grafana Dashboard

Importing Dashboard

Dashboard Panels

Alerting

Prometheus Alert Rules

Performance Optimization

Backend Optimization (56% of time)

Network Optimization (43% of time)

Frontend Optimization (< 0.1%)

Monitoring Best Practices

Troubleshooting

Metrics Not Appearing

High TTFB

High Streaming Duration

Integration

With Existing Monitoring

Custom Metrics

Related Documentation

Related

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Clone this wiki locally