arminrad edited this page Mar 16, 2026 · 2 revisions

Monitoring System Documentation

Overview

The Gatewayz backend features a comprehensive, multi-layered monitoring system that tracks the health, performance, and usage of 10,000+ models across 16+ providers. The monitoring system includes real-time metrics, long-term analytics, anomaly detection, circuit breakers, and public status pages.

Architecture

Multi-Layered Monitoring Stack

┌─────────────────────────────────────────────────────────┐
│                    Client Applications                   │
│           (Admin Panel, Status Page, Dashboards)         │
└──────────────────┬──────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────┐
│                    API Layer                             │
│  - Monitoring Routes (/api/monitoring/*)                │
│  - Health Routes (/health/*)                            │
│  - Model Health Routes (/v1/model-health/*)             │
│  - Status Page Routes (/v1/status/*)                    │
│  - Analytics Routes (/v1/analytics/*)                   │
└──────────────────┬──────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────┐
│                 Service Layer                            │
│  - Model Health Monitor (Background Service)            │
│  - Metrics Aggregator (Hourly)                          │
│  - Prometheus Metrics Service                           │
│  - Redis Metrics Service                                │
│  - Analytics Service (Anomaly Detection)                │
└──────────────────┬──────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────┐
│              Data Storage Layer                          │
│  ┌──────────────┐  ┌──────────────┐  ┌───────────────┐ │
│  │    Redis     │  │  PostgreSQL  │  │  Prometheus   │ │
│  │ (Real-time)  │  │ (Long-term)  │  │  (Metrics)    │ │
│  └──────────────┘  └──────────────┘  └───────────────┘ │
└─────────────────────────────────────────────────────────┘
                   │
┌──────────────────▼──────────────────────────────────────┐
│             External Services                            │
│  - Sentry (Error Monitoring)                            │
│  - Statsig (Feature Flags & Analytics)                  │
│  - PostHog (Product Analytics)                          │
│  - Braintrust (AI Observability)                        │
└─────────────────────────────────────────────────────────┘

Data Flow

  1. Request Processing: Every API request is tracked via middleware
  2. Real-time Metrics: Metrics are stored in Redis for fast access (1-24 hours)
  3. Hourly Aggregation: Redis metrics are aggregated into PostgreSQL every hour
  4. Health Monitoring: Background service checks model health based on tiered schedules
  5. Anomaly Detection: System analyzes metrics for cost/latency spikes and high error rates
  6. Status Pages: Public and private dashboards pull from materialized views

Monitoring Endpoints

1. Monitoring API (/api/monitoring/*)

Authentication: API Key Required

| Endpoint | Method | Description | Query Parameters |
|---|---|---|---|
| /api/monitoring/health | GET | All provider health scores (0-100) | None |
| /api/monitoring/health/{provider} | GET | Specific provider health score | None |
| /api/monitoring/errors/{provider} | GET | Recent errors for a provider | limit (1-1000, default: 100) |
| /api/monitoring/stats/realtime | GET | Real-time statistics from Redis | hours (1-24, default: 1) |
| /api/monitoring/stats/hourly/{provider} | GET | Hourly statistics for a provider | hours (1-168, default: 24) |
| /api/monitoring/circuit-breakers | GET | All circuit breaker states | None |
| /api/monitoring/circuit-breakers/{provider} | GET | Provider-specific circuit breaker states | None |
| /api/monitoring/providers/comparison | GET | Compare all providers across key metrics | None |
| /api/monitoring/latency/{provider}/{model} | GET | Latency percentiles (p50, p95, p99) | None |
| /api/monitoring/anomalies | GET | Detected anomalies | None |
| /api/monitoring/trial-analytics | GET | Trial funnel metrics | None |
| /api/monitoring/cost-analysis | GET | Cost breakdown by provider | days (1-90, default: 7) |
| /api/monitoring/latency-trends/{provider} | GET | Latency trends over time | hours (1-168, default: 24) |
| /api/monitoring/error-rates | GET | Error rates by model | hours (1-168, default: 24) |
| /api/monitoring/token-efficiency/{provider}/{model} | GET | Token efficiency metrics | None |

Example Request:

curl -X GET "https://api.gatewayz.io/api/monitoring/stats/realtime?hours=24" \
  -H "Authorization: Bearer YOUR_API_KEY"

Example Response (/api/monitoring/stats/realtime):

{
  "timeframe": "Last 24 hours",
  "providers": [
    {
      "provider": "openai",
      "total_requests": 15420,
      "successful_requests": 15234,
      "failed_requests": 186,
      "total_cost_credits": 1842.56,
      "total_tokens": 2456789,
      "avg_latency_ms": 342.5,
      "error_rate": 0.012
    }
  ]
}
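
A response of this shape can be post-processed on the client to derive per-provider ratios. The following is a minimal sketch that assumes the field names shown in the example response; `summarize_provider` is a hypothetical helper, not part of the Gatewayz codebase.

```python
import json

# Sample payload copied from the example response above.
SAMPLE = """
{
  "timeframe": "Last 24 hours",
  "providers": [
    {
      "provider": "openai",
      "total_requests": 15420,
      "successful_requests": 15234,
      "failed_requests": 186,
      "total_cost_credits": 1842.56,
      "total_tokens": 2456789,
      "avg_latency_ms": 342.5,
      "error_rate": 0.012
    }
  ]
}
"""


def summarize_provider(entry: dict) -> dict:
    """Derive success rate and cost per request (hypothetical helper)."""
    total = entry["total_requests"]
    if total == 0:
        # Avoid dividing by zero for providers with no traffic in the window.
        return {"provider": entry["provider"], "success_rate": None,
                "credits_per_request": None}
    return {
        "provider": entry["provider"],
        "success_rate": entry["successful_requests"] / total,
        "credits_per_request": entry["total_cost_credits"] / total,
    }


summaries = [summarize_provider(p) for p in json.loads(SAMPLE)["providers"]]
```

Guarding the zero-traffic case matters for dashboards, since a provider with no requests in the selected window would otherwise raise a division error.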

2. Health API (/health/*)

Authentication: Mixed (see table)

| Endpoint | Method | Description | Auth Required |
|---|---|---|---|
| /health | GET | Simple health check (always returns 200) | No |
| /health/system | GET | Overall system health metrics | Yes |
| /health/providers | GET | Health metrics for all providers | Yes |
| /health/models | GET | Health metrics for all models | Yes |
| /health/model/{model_id} | GET | Health metrics for specific model | Yes |
| /health/provider/{provider} | GET | Health metrics for specific provider | Yes |
| /health/summary | GET | Comprehensive health summary | Yes |
| /health/check | POST | Perform immediate health check (background) | Yes |
| /health/check/now | POST | Perform immediate health check (sync) | Yes |
| /health/uptime | GET | Uptime metrics for frontend integration | Yes |
| /health/dashboard | GET | Complete health dashboard data | Yes |
| /health/status | GET | Simple health status | Yes |
| /health/monitoring/status | GET | Monitoring service status | Yes |
| /health/monitoring/start | POST | Start health monitoring service | Yes |
| /health/monitoring/stop | POST | Stop health monitoring service | Yes |
| /health/google-vertex | GET | Google Vertex AI provider health | No |
| /health/database | GET | Database connectivity | No |

Example Request:

curl -X GET "https://api.gatewayz.io/health/dashboard" \
  -H "Authorization: Bearer YOUR_API_KEY"

Example Response (/health/dashboard):

{
  "system": {
    "status": "healthy",
    "uptime_seconds": 864532,
    "total_providers": 16,
    "healthy_providers": 15,
    "total_models": 10234,
    "healthy_models": 9856
  },
  "providers": [
    {
      "provider": "openai",
      "status": "healthy",
      "models_count": 45,
      "healthy_models": 44,
      "avg_response_time_ms": 342.5,
      "uptime_24h": 99.8,
      "circuit_breaker_state": "closed"
    }
  ],
  "recent_incidents": []
}

3. Model Health API (/v1/model-health/*)

Authentication: Optional (Enhanced features with API key)

| Endpoint | Method | Description | Query Parameters |
|---|---|---|---|
| /v1/model-health | GET | All model health metrics | provider, min_calls, status, limit, offset |
| /v1/model-health/{provider}/{model} | GET | Specific model health metrics | None |
| /v1/model-health/unhealthy | GET | Models with high error rates | threshold (default: 0.1) |
| /v1/model-health/stats | GET | Aggregate model health statistics | None |
| /v1/model-health/provider/{provider}/summary | GET | Provider health summary | None |
| /v1/model-health/providers | GET | List all providers with health data | None |

Example Request:

curl -X GET "https://api.gatewayz.io/v1/model-health/openai/gpt-4"

Example Response:

{
  "provider": "openai",
  "model": "gpt-4",
  "status": "healthy",
  "last_response_time_ms": 356,
  "average_response_time_ms": 342,
  "call_count": 45678,
  "success_count": 45234,
  "error_count": 444,
  "error_rate": 0.0097,
  "uptime_24h": 99.8,
  "uptime_7d": 99.6,
  "uptime_30d": 99.5,
  "circuit_breaker_state": "closed",
  "last_called_at": "2025-12-02T08:45:23Z",
  "last_success_at": "2025-12-02T08:45:23Z"
}
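
The `error_rate` field is consistent with `error_count / call_count` (444 / 45678 ≈ 0.0097). As a rough illustration of how a threshold such as the 0.1 default on `/v1/model-health/unhealthy` could be applied, here is a sketch. The strict `>` comparison, the `min_calls` handling, and the function name are assumptions, not the service's confirmed behavior.

```python
from typing import Iterable


def unhealthy_models(records: Iterable[dict], threshold: float = 0.1,
                     min_calls: int = 1) -> list:
    """Return records whose error rate exceeds the threshold (assumed logic)."""
    flagged = []
    for rec in records:
        calls = rec["call_count"]
        if calls < min_calls:
            continue  # too little traffic to judge
        if rec["error_count"] / calls > threshold:
            flagged.append(rec)
    return flagged


records = [
    # Values from the example response above.
    {"provider": "openai", "model": "gpt-4", "call_count": 45678, "error_count": 444},
    # Hypothetical failing model, for illustration only.
    {"provider": "example", "model": "flaky-model", "call_count": 100, "error_count": 25},
]
flagged = unhealthy_models(records)
```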

4. Public Status Page API (/v1/status/*)

Authentication: None (Public)

| Endpoint | Method | Description | Query Parameters |
|---|---|---|---|
| /v1/status/ | GET | Overall system status | None |
| /v1/status/providers | GET | Status for all providers | None |
| /v1/status/models | GET | Status for models | provider, status, limit, offset |
| /v1/status/models/{provider}/{model_id} | GET | Status for specific model | None |
| /v1/status/incidents | GET | Recent incidents | provider, severity, limit |
| /v1/status/uptime/{provider}/{model_id} | GET | Uptime history (24h, 7d, 30d) | None |
| /v1/status/search | GET | Search models | q (query string) |
| /v1/status/stats | GET | Overall statistics | None |

Example Request:

curl -X GET "https://api.gatewayz.io/v1/status/providers"

Example Response:

{
  "providers": [
    {
      "provider": "openai",
      "status": "operational",
      "total_models": 45,
      "healthy_models": 44,
      "degraded_models": 1,
      "offline_models": 0,
      "uptime_24h": 99.8,
      "avg_response_time_ms": 342.5
    }
  ],
  "last_updated": "2025-12-02T08:45:23Z"
}

5. Analytics API (/v1/analytics/*)

Authentication: User Authentication Required

| Endpoint | Method | Description | Body |
|---|---|---|---|
| /v1/analytics/events | POST | Log analytics event | event_name, properties |
| /v1/analytics/batch | POST | Log multiple analytics events | Array of events |
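
The request bodies in the table above can be assembled as plain JSON. This sketch only builds the payloads and assumes that the batch body is a bare JSON array of `{event_name, properties}` objects; the helper names and the sample event names are hypothetical.

```python
import json
from typing import Any, Optional


def build_event(event_name: str, properties: Optional[dict] = None) -> dict:
    """Build one event object with the two documented fields."""
    if not event_name:
        raise ValueError("event_name must be non-empty")
    return {"event_name": event_name, "properties": properties or {}}


def build_batch_body(events: list) -> str:
    """Serialize a list of events for POST /v1/analytics/batch (assumed shape)."""
    return json.dumps(events)


body = build_batch_body([
    build_event("dashboard_viewed", {"page": "monitoring"}),
    build_event("provider_selected", {"provider": "openai"}),
])
```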

6. Prometheus Metrics (/metrics)

Authentication: None

Exposes Prometheus-compatible metrics for scraping.

Metrics Categories:

  • HTTP request metrics (count, duration, size)
  • Model inference metrics (requests, duration, tokens, credits)
  • Database metrics (queries, duration)
  • Cache metrics (hits, misses, size)
  • Provider health metrics (availability, error rate, response time)
  • Performance stage metrics (TTFB, streaming, frontend processing)
  • Business metrics (credit balance, trials, subscriptions, rate limits)

Example Prometheus Query:

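# Average inference latency per provider over the last 5 minutes
# (model_inference_duration_seconds is a histogram, so use its _sum and _count series)
sum by (provider) (rate(model_inference_duration_seconds_sum[5m]))
  / sum by (provider) (rate(model_inference_duration_seconds_count[5m]))

# Errors per second by model
sum by (model) (rate(model_inference_requests_total{status="error"}[5m]))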

Data Collection

Real-time Metrics (Redis)

TTL: 24 hours

Service: src/services/redis_metrics.py

Data Structures:

  1. Request Metrics: Individual request tracking

    • Provider, model, latency, success/failure, cost, tokens
    • Key pattern: metrics:request:{timestamp}
  2. Hourly Aggregates: Rolling hourly statistics

    • Total requests, costs, tokens per provider/hour
    • Key pattern: metrics:hourly:{provider}:{hour}
  3. Latency Tracking: Sorted sets for percentile calculations

    • Last hour of latency measurements
    • Key pattern: metrics:latency:{provider}:{model}
  4. Error Tracking: Recent error messages

    • Last 100 errors per provider with timestamps
    • Key pattern: metrics:errors:{provider}
  5. Provider Health Scores: 0-100 scale

    • Adjusted by success/failure rates
    • Key pattern: metrics:health:{provider}
  6. Circuit Breaker States: CLOSED, OPEN, HALF_OPEN

    • Fault tolerance state per provider/model
    • Key pattern: circuit_breaker:{provider}:{model}
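
The key patterns above can be captured in small builder functions so that readers and writers agree on naming. This is a sketch: the patterns come from the list above, but the timestamp format (epoch seconds) and the hour format (`YYYY-MM-DDTHH`, UTC) are assumptions, and the function names are hypothetical.

```python
from datetime import datetime, timezone


def request_key(epoch_seconds: int) -> str:
    # Assumption: the {timestamp} placeholder is an integer epoch timestamp.
    return f"metrics:request:{epoch_seconds}"


def hourly_key(provider: str, when: datetime) -> str:
    # Assumption: the {hour} placeholder is a UTC hour bucket.
    hour = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")
    return f"metrics:hourly:{provider}:{hour}"


def latency_key(provider: str, model: str) -> str:
    return f"metrics:latency:{provider}:{model}"


def errors_key(provider: str) -> str:
    return f"metrics:errors:{provider}"


def health_key(provider: str) -> str:
    return f"metrics:health:{provider}"


def circuit_breaker_key(provider: str, model: str) -> str:
    return f"circuit_breaker:{provider}:{model}"
```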

Metrics Collected:

  • Total requests (per provider/hour)
  • Successful/failed requests
  • Input/output tokens
  • Total cost (credits/USD)
  • Latency percentiles (p50, p95, p99)
  • Error messages and counts
  • Health scores (0-100)
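
For readers unfamiliar with latency percentiles, the sketch below computes p50, p95, and p99 from a list of measurements using the nearest-rank method. The actual service may use a different method (for example, interpolation over a Redis sorted set), so treat this as an illustration of the concept only.

```python
import math


def percentile(values: list, pct: int) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(latencies_ms: list) -> dict:
    return {
        "p50": percentile(latencies_ms, 50),
        "p95": percentile(latencies_ms, 95),
        "p99": percentile(latencies_ms, 99),
    }
```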

Long-term Storage (PostgreSQL)

Tables:

1. model_health_tracking

Primary table for model health metrics.

Columns:

  • provider, model (composite primary key)
  • last_response_time_ms, average_response_time_ms
  • last_status, call_count, success_count, error_count
  • last_called_at, last_success_at, last_failure_at
  • last_error_message
  • gateway, monitoring_tier (critical/popular/standard/on_demand)
  • uptime_percentage_24h, uptime_percentage_7d, uptime_percentage_30d
  • consecutive_failures, consecutive_successes
  • circuit_breaker_state (closed/open/half_open)
  • priority_score, usage_count_24h, usage_count_7d, usage_count_30d
  • next_check_at, check_interval_seconds
  • is_enabled, metadata (JSONB)

Indexes:

  • idx_model_health_provider on (provider)
  • idx_model_health_status on (last_status)
  • idx_model_health_tier on (monitoring_tier)
  • idx_model_health_next_check on (next_check_at)
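
The `monitoring_tier`, `check_interval_seconds`, and `next_check_at` columns suggest a simple scheduling loop: each check pushes `next_check_at` forward by the tier's interval, and the monitor selects enabled rows that are due. The sketch below uses the tier intervals listed under Health Monitoring Service Configuration; the function names and the row shape (plain dicts) are assumptions.

```python
from datetime import datetime, timedelta, timezone

# Check intervals in seconds, per monitoring tier.
TIER_INTERVALS = {
    "critical": 300,     # 5 minutes
    "popular": 1800,     # 30 minutes
    "standard": 7200,    # 2 hours
    "on_demand": 14400,  # 4 hours
}


def next_check_at(last_checked: datetime, tier: str) -> datetime:
    """Schedule the next check one tier interval after the last one."""
    return last_checked + timedelta(seconds=TIER_INTERVALS[tier])


def due_models(rows: list, now: datetime) -> list:
    """Select enabled rows whose next_check_at has passed, oldest first."""
    due = [r for r in rows if r["is_enabled"] and r["next_check_at"] <= now]
    return sorted(due, key=lambda r: r["next_check_at"])
```

An index on `next_check_at`, as listed above, is what would make the equivalent database query cheap when only a small share of models is due at any moment.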

2. metrics_hourly_aggregates

Hourly aggregated metrics for historical analysis.

Columns:

  • hour, provider, model (unique constraint)
  • total_requests, successful_requests, failed_requests
  • total_tokens_input, total_tokens_output
  • total_cost_credits
  • avg_latency_ms, p50_latency_ms, p95_latency_ms, p99_latency_ms
  • min_latency_ms, max_latency_ms
  • error_rate

Indexes:

  • idx_hourly_aggregates_hour on (hour)
  • idx_hourly_aggregates_provider on (provider)
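
To make the shape of an aggregate row concrete, the following sketch groups per-request records by `(hour, provider, model)` and computes most of the columns listed above. It is an in-memory illustration, not the aggregator's actual code; the input record fields (`latency_ms`, `success`, `tokens_input`, `tokens_output`, `cost_credits`) are assumed, and the percentile columns are omitted for brevity.

```python
from collections import defaultdict


def aggregate_hourly(requests: list) -> list:
    """Group request records by (hour, provider, model) and summarize each group."""
    groups = defaultdict(list)
    for req in requests:
        groups[(req["hour"], req["provider"], req["model"])].append(req)

    rows = []
    for (hour, provider, model), items in sorted(groups.items()):
        latencies = [item["latency_ms"] for item in items]
        total = len(items)
        successes = sum(1 for item in items if item["success"])
        rows.append({
            "hour": hour,
            "provider": provider,
            "model": model,
            "total_requests": total,
            "successful_requests": successes,
            "failed_requests": total - successes,
            "total_tokens_input": sum(i.get("tokens_input", 0) for i in items),
            "total_tokens_output": sum(i.get("tokens_output", 0) for i in items),
            "total_cost_credits": sum(i.get("cost_credits", 0.0) for i in items),
            "avg_latency_ms": sum(latencies) / total,
            "min_latency_ms": min(latencies),
            "max_latency_ms": max(latencies),
            "error_rate": (total - successes) / total,
        })
    return rows
```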

3. model_health_incidents

Incident tracking and resolution.

Columns:

  • id (UUID, primary key)
  • provider, model, gateway
  • incident_type (outage/degradation/timeout/rate_limit)
  • severity (critical/high/medium/low)
  • started_at, resolved_at, duration_seconds
  • error_message, error_count, affected_requests
  • status (active/resolved/acknowledged)
  • resolution_notes, metadata (JSONB)

Indexes:

  • idx_incidents_provider_model on (provider, model)
  • idx_incidents_status on (status)
  • idx_incidents_started_at on (started_at)

4. model_health_history

Time-series data for trend analysis.

Columns:

  • id (UUID, primary key)
  • provider, model, gateway
  • checked_at, status, response_time_ms
  • error_message, http_status_code
  • circuit_breaker_state, metadata (JSONB)

Indexes:

  • idx_history_provider_model on (provider, model)
  • idx_history_checked_at on (checked_at)

Partitioning: Partitioned by month for performance

5. model_health_aggregates

Pre-computed statistics for fast queries.

Columns:

  • provider, model, gateway
  • aggregation_period (hour/day/week/month)
  • period_start, period_end
  • total_checks, successful_checks, failed_checks
  • avg_response_time_ms, min_response_time_ms, max_response_time_ms
  • p50_response_time_ms, p95_response_time_ms, p99_response_time_ms
  • uptime_percentage, incident_count

Indexes:

  • idx_aggregates_provider_model_period on (provider, model, aggregation_period)
  • idx_aggregates_period_start on (period_start)

Materialized Views

Refreshed every 5 minutes for fast queries.

provider_stats_24h

Fast provider comparison (last 24 hours).

Columns:

  • provider, total_requests, successful_requests, failed_requests
  • avg_latency_ms, total_cost_credits, total_tokens
  • avg_error_rate, unique_models_count

model_status_current

Current model status for public status page.

Columns:

  • provider, model, gateway
  • status (operational/degraded/partial_outage/major_outage)
  • uptime_24h, uptime_7d, uptime_30d
  • last_response_time_ms, circuit_breaker_state
  • active_incidents_count, last_checked_at

provider_health_current

Provider-level health aggregation.

Columns:

  • provider, total_models, healthy_models, degraded_models, offline_models
  • avg_uptime_24h, avg_response_time_ms
  • status (operational/degraded/major_outage)
  • last_updated_at

Prometheus Metrics

Scrape Interval: 10 seconds

Service: src/services/prometheus_metrics.py

Metric Types:

  1. Counters (monotonically increasing):

    • fastapi_requests_total
    • model_inference_requests_total
    • tokens_used_total
    • credits_used_total
    • database_queries_total
    • cache_hits_total, cache_misses_total
    • fastapi_exceptions_total
    • rate_limited_requests_total
  2. Histograms (distributions):

    • fastapi_requests_duration_seconds
    • model_inference_duration_seconds
    • database_query_duration_seconds
    • provider_response_time_seconds
    • backend_ttfb_seconds
    • streaming_duration_seconds
    • frontend_processing_seconds
    • request_stage_duration_seconds
  3. Gauges (point-in-time values):

    • fastapi_requests_in_progress
    • fastapi_request_size_bytes
    • fastapi_response_size_bytes
    • cache_size_bytes
    • provider_availability
    • provider_error_rate
    • user_credit_balance
    • trial_active
    • subscription_count
    • stage_percentage

Authentication & Authorization

API Key Authentication

Dependency: get_api_key from src/security/deps.py

Used by:

  • All /api/monitoring/* endpoints
  • Most /health/* endpoints

Validation Checks:

  1. API key exists in database
  2. API key is active (not revoked)
  3. API key is not expired
  4. Request limits not exceeded (rate limiting)
  5. IP address in allowlist (if configured)
  6. Domain restrictions (if configured)

Header Format:

Authorization: Bearer YOUR_API_KEY

Error Codes:

  • 401 Unauthorized: Invalid or missing API key
  • 403 Forbidden: API key revoked or access denied
  • 429 Too Many Requests: Rate limit exceeded
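
The validation checks and error codes above can be read as an ordered decision function. The sketch below is not the code in `src/security/deps.py`; the record fields are hypothetical, domain restrictions are left out, and the mapping of an expired key to 401 and an IP allowlist failure to 403 is an assumption based on the error code descriptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ApiKeyRecord:
    """Hypothetical view of a stored API key."""
    active: bool
    expired: bool
    requests_remaining: int
    ip_allowlist: Optional[frozenset] = None  # None means no allowlist configured


def check_api_key(record: Optional[ApiKeyRecord], client_ip: str) -> int:
    """Return the HTTP status the request would receive (200 if all checks pass)."""
    if record is None:
        return 401  # key missing or not found in the database
    if not record.active:
        return 403  # key revoked
    if record.expired:
        return 401  # assumption: expired keys are treated as invalid
    if record.requests_remaining <= 0:
        return 429  # rate limit exceeded
    if record.ip_allowlist is not None and client_ip not in record.ip_allowlist:
        return 403  # assumption: allowlist failure is "access denied"
    return 200
```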

User Authentication

Dependency: get_current_user from src/security/deps.py

Used by:

  • /v1/analytics/events
  • /v1/analytics/batch

Authentication Methods:

  • JWT token (from login)
  • Session cookie

Optional Authentication

Dependency: get_optional_user from src/security/deps.py

Used by:

  • /v1/model-health/* endpoints

Behavior:

  • Allows anonymous access
  • Enhanced features for authenticated users
  • User tracking if authenticated

Public Endpoints

No authentication required:

  • /health - Simple health check
  • /metrics - Prometheus metrics
  • /v1/status/* - Public status page endpoints
  • /health/google-vertex - Provider diagnostics
  • /health/database - Database health

Row Level Security (RLS)

All monitoring tables have RLS enabled.

Policies:

  1. Service Role: Full access to all tables (for background services)
  2. Authenticated Users: Read access to metrics, incidents, history, aggregates
  3. Anonymous Users: Read access to model_status_current and provider_health_current views only

Integration Guide

Frontend Integration

1. Setting up API Client

// src/lib/monitoring-client.ts
export class MonitoringClient {
  private baseUrl: string;
  private apiKey: string;

  constructor(baseUrl: string, apiKey: string) {
    this.baseUrl = baseUrl;
    this.apiKey = apiKey;
  }

  private async request<T>(endpoint: string, options?: RequestInit): Promise<T> {
    const response = await fetch(`${this.baseUrl}${endpoint}`, {
      ...options,
      headers: {
        'Authorization': `Bearer ${this.apiKey}`,
        'Content-Type': 'application/json',
        ...options?.headers,
      },
    });

    if (!response.ok) {
      throw new Error(`API request failed: ${response.statusText}`);
    }

    return response.json();
  }

  // Real-time Stats
  async getRealtimeStats(hours: number = 1) {
    return this.request(`/api/monitoring/stats/realtime?hours=${hours}`);
  }

  // Provider Health
  async getProviderHealth(provider?: string) {
    const endpoint = provider
      ? `/api/monitoring/health/${provider}`
      : `/api/monitoring/health`;
    return this.request(endpoint);
  }

  // Circuit Breakers
  async getCircuitBreakers(provider?: string) {
    const endpoint = provider
      ? `/api/monitoring/circuit-breakers/${provider}`
      : `/api/monitoring/circuit-breakers`;
    return this.request(endpoint);
  }

  // Anomalies
  async getAnomalies() {
    return this.request('/api/monitoring/anomalies');
  }

  // Provider Comparison
  async getProviderComparison() {
    return this.request('/api/monitoring/providers/comparison');
  }

  // Cost Analysis
  async getCostAnalysis(days: number = 7) {
    return this.request(`/api/monitoring/cost-analysis?days=${days}`);
  }

  // Error Rates
  async getErrorRates(hours: number = 24) {
    return this.request(`/api/monitoring/error-rates?hours=${hours}`);
  }

  // Latency Trends
  async getLatencyTrends(provider: string, hours: number = 24) {
    return this.request(`/api/monitoring/latency-trends/${provider}?hours=${hours}`);
  }

  // Model Health
  async getModelHealth(provider: string, model: string) {
    return this.request(`/v1/model-health/${provider}/${model}`);
  }

  // Recent Errors
  async getRecentErrors(provider: string, limit: number = 100) {
    return this.request(`/api/monitoring/errors/${provider}?limit=${limit}`);
  }
}

2. React Hook for Real-time Monitoring

// src/hooks/useMonitoring.ts
import { useQuery } from '@tanstack/react-query';
import { MonitoringClient } from '@/lib/monitoring-client';

// NOTE: NEXT_PUBLIC_* variables are inlined into the browser bundle, so a key
// passed this way is visible to end users. For production, proxy these calls
// through a server-side route that holds the key (see Best Practices > Security).
// One shared client avoids constructing a new instance on every render.
const client = new MonitoringClient(
  process.env.NEXT_PUBLIC_API_URL ?? '',
  process.env.NEXT_PUBLIC_API_KEY ?? ''
);

export function useRealtimeStats(hours: number = 1) {
  return useQuery({
    queryKey: ['monitoring', 'realtime', hours],
    queryFn: () => client.getRealtimeStats(hours),
    refetchInterval: 30000, // Refresh every 30 seconds
  });
}

export function useProviderHealth(provider?: string) {
  return useQuery({
    queryKey: ['monitoring', 'health', provider],
    queryFn: () => client.getProviderHealth(provider),
    refetchInterval: 60000, // Refresh every minute
  });
}

export function useAnomalies() {
  return useQuery({
    queryKey: ['monitoring', 'anomalies'],
    queryFn: () => client.getAnomalies(),
    refetchInterval: 120000, // Refresh every 2 minutes
  });
}

3. Example Dashboard Component

// src/components/monitoring-dashboard.tsx
'use client';

import { useRealtimeStats, useProviderHealth, useAnomalies } from '@/hooks/useMonitoring';
import { Card, CardHeader, CardTitle, CardContent } from '@/components/ui/card';

export function MonitoringDashboard() {
  const { data: realtimeStats, isLoading: statsLoading } = useRealtimeStats(24);
  const { data: providerHealth, isLoading: healthLoading } = useProviderHealth();
  const { data: anomalies, isLoading: anomaliesLoading } = useAnomalies();

  if (statsLoading || healthLoading || anomaliesLoading) {
    return <div>Loading...</div>;
  }

  return (
    <div className="grid grid-cols-1 md:grid-cols-2 lg:grid-cols-3 gap-4">
      {/* Real-time Stats */}
      <Card>
        <CardHeader>
          <CardTitle>Real-time Stats (24h)</CardTitle>
        </CardHeader>
        <CardContent>
          {realtimeStats?.providers.map((provider) => (
            <div key={provider.provider} className="mb-4">
              <h3 className="font-semibold">{provider.provider}</h3>
              <p>Requests: {provider.total_requests.toLocaleString()}</p>
              <p>Success Rate: {((provider.successful_requests / provider.total_requests) * 100).toFixed(2)}%</p>
              <p>Avg Latency: {provider.avg_latency_ms}ms</p>
              <p>Cost: ${provider.total_cost_credits.toFixed(2)}</p>
            </div>
          ))}
        </CardContent>
      </Card>

      {/* Provider Health */}
      <Card>
        <CardHeader>
          <CardTitle>Provider Health</CardTitle>
        </CardHeader>
        <CardContent>
          {providerHealth?.providers.map((provider) => (
            <div key={provider.provider} className="mb-4">
              <div className="flex justify-between items-center">
                <h3 className="font-semibold">{provider.provider}</h3>
                <span className={`px-2 py-1 rounded ${
                  provider.health_score >= 90 ? 'bg-green-500' :
                  provider.health_score >= 70 ? 'bg-yellow-500' :
                  'bg-red-500'
                } text-white text-sm`}>
                  {provider.health_score}
                </span>
              </div>
              <p className="text-sm text-gray-600">{provider.status}</p>
            </div>
          ))}
        </CardContent>
      </Card>

      {/* Anomalies */}
      <Card>
        <CardHeader>
          <CardTitle>Detected Anomalies</CardTitle>
        </CardHeader>
        <CardContent>
          {anomalies?.anomalies.length === 0 ? (
            <p className="text-green-600">No anomalies detected</p>
          ) : (
            anomalies?.anomalies.map((anomaly, idx) => (
              <div key={idx} className={`mb-4 p-3 rounded ${
                anomaly.severity === 'critical' ? 'bg-red-100' :
                anomaly.severity === 'warning' ? 'bg-yellow-100' :
                'bg-blue-100'
              }`}>
                <h4 className="font-semibold">{anomaly.type}</h4>
                <p className="text-sm">{anomaly.provider}</p>
                <p className="text-sm">{anomaly.message}</p>
              </div>
            ))
          )}
        </CardContent>
      </Card>
    </div>
  );
}

Backend Integration (Python)

Recording Metrics in Your Code

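import time

from src.services.redis_metrics import RedisMetricsService
from src.db.model_health import record_model_call

# api_client and calculate_cost are placeholders for your own client and pricing logic.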

async def make_api_call(provider: str, model: str):
    start_time = time.time()

    try:
        # Make your API call
        response = await api_client.complete(...)

        # Calculate metrics
        latency_ms = (time.time() - start_time) * 1000
        tokens_used = response.usage.total_tokens
        cost = calculate_cost(tokens_used, model)

        # Record success in Redis (real-time)
        await RedisMetricsService.record_request(
            provider=provider,
            model=model,
            latency_ms=latency_ms,
            success=True,
            cost_credits=cost,
            tokens_input=response.usage.prompt_tokens,
            tokens_output=response.usage.completion_tokens
        )

        # Update model health (database)
        await record_model_call(
            provider=provider,
            model=model,
            response_time_ms=latency_ms,
            success=True,
            error_message=None
        )

        return response

    except Exception as e:
        # Record failure
        latency_ms = (time.time() - start_time) * 1000

        await RedisMetricsService.record_request(
            provider=provider,
            model=model,
            latency_ms=latency_ms,
            success=False,
            error_message=str(e)
        )

        await record_model_call(
            provider=provider,
            model=model,
            response_time_ms=latency_ms,
            success=False,
            error_message=str(e)
        )

        raise

Configuration

Environment Variables

Add to .env:

# Monitoring Services
PROMETHEUS_ENABLED=true
PROMETHEUS_SCRAPE_ENABLED=true

# Error Monitoring
SENTRY_DSN=https://your-sentry-dsn@sentry.io/project-id
SENTRY_ENABLED=true
SENTRY_ENVIRONMENT=production
SENTRY_TRACES_SAMPLE_RATE=1.0

# Analytics
STATSIG_SERVER_SECRET_KEY=your-statsig-key
POSTHOG_API_KEY=your-posthog-key
BRAINTRUST_API_KEY=your-braintrust-key

# Observability
TEMPO_ENABLED=true  # Distributed tracing
LOKI_ENABLED=true   # Structured logging

# Redis (optional, graceful degradation)
REDIS_URL=redis://localhost:6379

Prometheus Configuration

File: prometheus.yml

global:
  scrape_interval: 10s
  evaluation_interval: 10s
  external_labels:
    monitor: 'gatewayz-monitor'
    environment: 'production'

scrape_configs:
  - job_name: 'gatewayz-api'
    static_configs:
      - targets: ['api.railway.internal:8000']
    metrics_path: '/metrics'

Health Monitoring Service Configuration

The background health monitoring service runs continuously with these settings:

File: src/services/model_health_monitor.py

# Monitoring Tiers (check intervals)
CRITICAL_TIER = 300      # 5 minutes
POPULAR_TIER = 1800      # 30 minutes
STANDARD_TIER = 7200     # 2 hours
ON_DEMAND_TIER = 14400   # 4 hours

# Circuit Breaker Thresholds
FAILURE_THRESHOLD = 5              # Consecutive failures to open circuit
HALF_OPEN_SUCCESS_THRESHOLD = 3   # Consecutive successes to close circuit
CIRCUIT_OPEN_DURATION = 300       # 5 minutes before attempting half-open

# Batch Processing
BATCH_SIZE = 10                   # Models to check per batch
BATCH_INTERVAL = 1                # Seconds between batches
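
The circuit breaker thresholds above imply a three-state machine (closed, open, half-open). The sketch below shows one way those constants could drive the transitions. It is not the code in `src/services/model_health_monitor.py`; in particular, re-opening on any failure while half-open is an assumption, and the class and method names are hypothetical.

```python
import time

FAILURE_THRESHOLD = 5             # consecutive failures to open the circuit
HALF_OPEN_SUCCESS_THRESHOLD = 3   # consecutive successes to close the circuit
CIRCUIT_OPEN_DURATION = 300       # seconds before attempting half-open


class CircuitBreaker:
    def __init__(self, clock=time.monotonic):
        self.state = "closed"
        self.consecutive_failures = 0
        self.consecutive_successes = 0
        self.opened_at = None
        self._clock = clock  # injectable so the timing can be tested

    def allow_request(self) -> bool:
        """Return True if a request may be sent; move open -> half_open after the timeout."""
        if self.state == "open":
            if self._clock() - self.opened_at >= CIRCUIT_OPEN_DURATION:
                self.state = "half_open"
                self.consecutive_successes = 0
                return True
            return False
        return True

    def record_success(self) -> None:
        self.consecutive_failures = 0
        if self.state == "half_open":
            self.consecutive_successes += 1
            if self.consecutive_successes >= HALF_OPEN_SUCCESS_THRESHOLD:
                self.state = "closed"

    def record_failure(self) -> None:
        self.consecutive_successes = 0
        if self.state == "half_open":
            self._open()  # assumption: one failed probe re-opens the circuit
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= FAILURE_THRESHOLD:
            self._open()

    def _open(self) -> None:
        self.state = "open"
        self.opened_at = self._clock()
        self.consecutive_failures = 0
```

A caller would check `allow_request()` before contacting a provider and then report the outcome with `record_success()` or `record_failure()`.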

Best Practices

1. Monitoring Dashboard Design

  • Refresh Intervals:

    • Real-time metrics: 30-60 seconds
    • Health checks: 1-2 minutes
    • Anomalies: 2-5 minutes
    • Historical data: 5-10 minutes
  • Error Handling:

    • Always handle API errors gracefully
    • Show cached data during outages
    • Display loading states clearly
  • Performance:

    • Use pagination for large datasets
    • Implement virtual scrolling for long lists
    • Cache responses when appropriate

2. Alert Configuration

Recommended Alerts:

  1. Critical:

    • Provider down (uptime < 95%)
    • High error rate (> 25%)
    • Circuit breaker opened
    • Database connection lost
  2. Warning:

    • Degraded performance (uptime < 98%)
    • Elevated error rate (> 10%)
    • Anomaly detected (cost/latency spike)
  3. Info:

    • New incident created
    • Incident resolved
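
The critical and warning thresholds above can be encoded as a single classification function. This is a minimal sketch, assuming uptime is given as a percentage and error rate as a fraction between 0 and 1; the function name and the "ok" level are hypothetical, and the anomaly and incident alerts are left out.

```python
def classify_alert(uptime_pct: float, error_rate: float,
                   circuit_open: bool = False, db_connected: bool = True) -> str:
    """Map current metrics to an alert level using the recommended thresholds."""
    if not db_connected or circuit_open:
        return "critical"
    if uptime_pct < 95 or error_rate > 0.25:
        return "critical"
    if uptime_pct < 98 or error_rate > 0.10:
        return "warning"
    return "ok"
```

Checking the critical conditions first ensures that a provider meeting both a warning and a critical threshold is reported at the higher severity.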

3. Capacity Planning

Monitor these metrics for capacity planning:

  • Request rates (per provider/model)
  • Token consumption trends
  • Cost trends
  • Latency percentiles (p95, p99)
  • Error rates by time of day

4. Security

  • Never expose API keys in frontend code
  • Use environment variables for sensitive data
  • Implement rate limiting on monitoring endpoints
  • Enable CORS only for trusted domains
  • Use HTTPS for all API calls
  • Regularly rotate API keys

5. Performance Optimization

Database:

  • Use materialized views for frequently accessed data
  • Partition large tables (e.g., model_health_history)
  • Create appropriate indexes
  • Archive old data regularly

Redis:

  • Set appropriate TTLs (24 hours for metrics)
  • Use pipelining for bulk operations
  • Monitor memory usage

API:

  • Implement response caching
  • Use pagination for large results
  • Enable compression (gzip)
  • Monitor query performance

Troubleshooting

Issue: Monitoring data not updating

Possible causes:

  1. Redis connection lost
  2. Background monitoring service stopped
  3. Database connection issues
  4. Rate limiting

Solutions:

# Check database connectivity
curl https://api.gatewayz.io/health/database

# Check monitoring service status
curl -H "Authorization: Bearer YOUR_API_KEY" \
  https://api.gatewayz.io/health/monitoring/status

# Start the monitoring service if it has stopped
curl -X POST -H "Authorization: Bearer YOUR_API_KEY" \
  https://api.gatewayz.io/health/monitoring/start

Issue: High latency on monitoring endpoints

Possible causes:

  1. Large result sets
  2. Missing indexes
  3. Materialized views not refreshed
  4. Redis cache miss

Solutions:

  • Use pagination (limit and offset)
  • Reduce time ranges (e.g., 24h instead of 7d)
  • Check database indexes
  • Refresh materialized views manually
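
Pagination with `limit` and `offset` can be wrapped in a generator so that callers never request one large result set. The sketch below takes the page-fetching function as a parameter, so it makes no assumption about the HTTP client. It does assume that each page is returned as a list and that a short page marks the end of the data.

```python
from typing import Callable, Iterator


def paginate(fetch_page: Callable[[int, int], list], limit: int = 100) -> Iterator:
    """Yield items page by page; fetch_page(limit, offset) returns one page as a list."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        yield from page
        if len(page) < limit:
            break  # a short (or empty) page means there is nothing left
        offset += limit
```

For example, `fetch_page` could call `/v1/model-health?limit=...&offset=...` and return the decoded list of models.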

Issue: Missing metrics for a provider

Possible causes:

  1. Provider not enabled
  2. No recent API calls
  3. Monitoring tier too low
  4. Circuit breaker open

Solutions:

# Enable provider monitoring
await record_model_call(
    provider="new-provider",
    model="new-model",
    response_time_ms=0,
    success=True,
    monitoring_tier="critical"  # Set appropriate tier
)

API Reference Summary

| Category | Endpoint Prefix | Authentication | Use Case |
|---|---|---|---|
| Monitoring | /api/monitoring/* | API Key | Admin dashboards, internal tools |
| Health | /health/* | Mixed | System health, diagnostics |
| Model Health | /v1/model-health/* | Optional | Public model status, provider health |
| Status Page | /v1/status/* | None | Public status pages |
| Analytics | /v1/analytics/* | User Auth | Event tracking |
| Metrics | /metrics | None | Prometheus scraping |
