
feat: Add comprehensive monitoring and alerting for production observability #15

@doughayden

Description


Feature Request

Implement comprehensive monitoring and alerting to improve production observability and proactive issue detection.

Level of Effort: 🔥 Large (5-7 days)

  • Metrics implementation: 2-3 days for application-level metrics
  • Dashboard creation: 1-2 days for monitoring dashboards
  • Alerting setup: 1-2 days for alert rules and notification channels
  • Documentation: 1 day for runbooks and monitoring guides

Current Monitoring State

What we have:

  • Basic Cloud Run monitoring (CPU, memory, request count)
  • Default Google Cloud logging
  • Basic health check endpoint (/healthz)

What's missing:

  • Application-level metrics (query success rates, response times, user patterns)
  • Custom dashboards for business metrics
  • Proactive alerting for issues
  • Performance trend analysis
  • Error rate monitoring and alerting

Recommended Implementation

1. Application Metrics

Add custom metrics throughout the application:

# src/answer_app/metrics.py
import time

from google.cloud import monitoring_v3

class MetricsCollector:
    def __init__(self, project_id: str):
        self.client = monitoring_v3.MetricServiceClient()
        self.project_name = f"projects/{project_id}"

    def track_query_duration(self, duration: float, success: bool) -> None:
        """Track query processing time and success rate."""
        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/answer_app/query_duration"
        series.metric.labels["success"] = str(success).lower()
        series.resource.type = "global"
        # Build a single point stamped with the current time.
        now = time.time()
        seconds = int(now)
        interval = monitoring_v3.TimeInterval(
            {"end_time": {"seconds": seconds, "nanos": int((now - seconds) * 1e9)}}
        )
        point = monitoring_v3.Point(
            {"interval": interval, "value": {"double_value": duration}}
        )
        series.points = [point]
        self.client.create_time_series(name=self.project_name, time_series=[series])

    def track_user_session(self, user_email: str, action: str) -> None:
        """Track user session activities (e.g. login, query, feedback)."""

    def track_discovery_engine_performance(self, response_time: float, token_count: int) -> None:
        """Track Discovery Engine API response time and token usage."""
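The collector could be wired into query handlers with a small timing decorator. This is a sketch; `timed_query` and the `collector` argument are illustrative names, not part of the existing codebase:

```python
import time
from functools import wraps

def timed_query(collector):
    """Decorator that reports call duration and success to a collector."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            success = True
            try:
                return func(*args, **kwargs)
            except Exception:
                success = False
                raise
            finally:
                # Report elapsed seconds and whether the call raised.
                collector.track_query_duration(time.perf_counter() - start, success)
        return wrapper
    return decorator
```

Any object exposing `track_query_duration(duration, success)` works here, which also makes the decorator easy to unit-test with a fake collector.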

2. Key Metrics to Track

Performance Metrics:

  • Query response time (p50, p95, p99)
  • Discovery Engine API response time
  • BigQuery insert latency
  • Error rates by endpoint

Business Metrics:

  • Active users per day/week
  • Queries per user session
  • Popular query patterns
  • User satisfaction (feedback scores)
  • Session duration and engagement

System Metrics:

  • Memory usage patterns
  • CPU utilization trends
  • Network I/O
  • Authentication success/failure rates

3. Dashboard Implementation

Google Cloud Monitoring Dashboards:

# monitoring/dashboards/answer-app-overview.json
{
  "displayName": "Answer App Overview",
  "widgets": [
    {
      "title": "Query Response Time",
      "scorecard": {
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_PERCENTILE_95"
            }
          }
        }
      }
    }
  ]
}
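Dashboard JSON under `monitoring/dashboards/` can be sanity-checked in CI before Terraform applies it. A minimal sketch, assuming dashboards are plain `.json` files and checking only the two top-level keys used above:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"displayName", "widgets"}

def validate_dashboard(text: str) -> list[str]:
    """Return a list of problems found in one dashboard JSON document."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    missing = REQUIRED_KEYS - doc.keys()
    return [f"missing key: {k}" for k in sorted(missing)]

def validate_dashboard_dir(path: str = "monitoring/dashboards") -> dict[str, list[str]]:
    """Validate every *.json file in the dashboards directory."""
    return {
        p.name: problems
        for p in sorted(Path(path).glob("*.json"))
        if (problems := validate_dashboard(p.read_text()))
    }
```

Running `validate_dashboard_dir()` in a pre-merge check catches malformed dashboards before they reach `terraform apply`.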

4. Alerting Rules

Critical Alerts:

  • Error rate > 5% for 5 minutes
  • Response time p95 > 10 seconds
  • Service availability < 99%
  • Authentication failures > 20/minute

Warning Alerts:

  • Response time p95 > 5 seconds
  • Memory usage > 80%
  • Unusual query volume spikes
  • Discovery Engine API latency increases
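The "error rate > 5% for 5 minutes" condition can be prototyped locally with a sliding window before it is encoded as an alert policy. The window size and threshold mirror the critical alert above; the class name is illustrative:

```python
from collections import deque

class ErrorRateWindow:
    """Sliding-window error-rate check (e.g. > 5% over 300 seconds)."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, success) pairs

    def record(self, timestamp: float, success: bool) -> None:
        self.events.append((timestamp, success))

    def error_rate(self, now: float) -> float:
        # Drop events older than the window, then compute the failure ratio.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def should_alert(self, now: float) -> bool:
        return self.error_rate(now) > self.threshold
```

Cloud Monitoring evaluates the production version of this condition server-side; the sketch is only useful for tuning the threshold against replayed traffic.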

Implementation Areas

Files to Create/Modify:

  • src/answer_app/metrics.py: Metrics collection utilities
  • src/answer_app/main.py: Add metrics middleware
  • src/answer_app/utils.py: Instrument key operations
  • monitoring/dashboards/: Dashboard configurations
  • monitoring/alerts/: Alert policy definitions
  • terraform/modules/monitoring/: Infrastructure for monitoring

Integration Points:

  • FastAPI middleware for request metrics
  • Discovery Engine wrapper with timing
  • BigQuery operations instrumentation
  • Streamlit app user interaction tracking
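For the FastAPI integration point, per-request timing can be captured with a small ASGI middleware. A sketch under stated assumptions: the `record` callback is a stand-in for `MetricsCollector.track_query_duration` and is not existing code:

```python
import time

class MetricsMiddleware:
    """ASGI middleware that times each HTTP request and reports success."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # e.g. collector.track_query_duration

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        status = {"code": 500}  # assume failure unless a response starts

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Count 5xx responses (and raised exceptions) as failures.
            self.record(time.perf_counter() - start, status["code"] < 500)
```

With FastAPI this would be registered via `app.add_middleware(MetricsMiddleware, record=...)`; because it is plain ASGI, it can also be tested without the framework.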

Configuration

Add to config.yaml:

monitoring:
  enabled: true
  metrics_project: ${PROJECT}
  dashboard_enabled: true
  alert_channels:
    - email: ops-team@company.com
    - slack: "#alerts"
  
  thresholds:
    error_rate_critical: 0.05
    response_time_warning: 5.0
    response_time_critical: 10.0
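Once config.yaml is loaded, the thresholds block maps naturally onto a severity classifier. A minimal sketch with the parsed values hard-coded for illustration; the function name is hypothetical:

```python
# Thresholds as they would appear after parsing config.yaml.
THRESHOLDS = {
    "error_rate_critical": 0.05,
    "response_time_warning": 5.0,
    "response_time_critical": 10.0,
}

def classify_response_time(p95_seconds: float) -> str:
    """Map an observed p95 response time to an alert severity."""
    if p95_seconds > THRESHOLDS["response_time_critical"]:
        return "critical"
    if p95_seconds > THRESHOLDS["response_time_warning"]:
        return "warning"
    return "ok"
```

Keeping the thresholds in config rather than in code means the alert policy and any in-app classification stay in sync from a single source.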

Terraform Infrastructure

# terraform/modules/monitoring/main.tf
resource "google_monitoring_dashboard" "answer_app" {
  dashboard_json = file("${path.module}/dashboards/answer-app-overview.json")
}

resource "google_monitoring_alert_policy" "high_error_rate" {
  display_name = "Answer App High Error Rate"
  combiner     = "OR"

  conditions {
    display_name = "Error rate > 5%"
    # Alert condition configuration, e.g. a condition_threshold block
    # on the ratio of 5xx responses over a 300s window
  }
}

Testing Strategy

  • Metrics validation: Ensure metrics are properly recorded
  • Dashboard testing: Verify dashboard displays work correctly
  • Alert testing: Test alert triggering and notification delivery
  • Load testing: Generate metrics under various load conditions
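The metrics-validation item can be exercised without calling Cloud Monitoring by substituting an in-memory collector for the real one. `InMemoryCollector` is a hypothetical test double, not existing code:

```python
class InMemoryCollector:
    """Test double that records metric calls instead of exporting them."""

    def __init__(self):
        self.durations = []  # (duration, success) pairs

    def track_query_duration(self, duration: float, success: bool) -> None:
        self.durations.append((duration, success))

def test_query_duration_recorded():
    collector = InMemoryCollector()
    # Simulate the application reporting one successful and one failed query.
    collector.track_query_duration(0.42, True)
    collector.track_query_duration(1.30, False)
    assert collector.durations == [(0.42, True), (1.30, False)]
    assert all(d >= 0 for d, _ in collector.durations)
```

The same double slots into any code that depends only on the collector's interface, keeping metric tests fast and free of GCP credentials.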

Acceptance Criteria

  • Application metrics implemented for key operations
  • Custom dashboards created for business and technical metrics
  • Alert policies configured for critical and warning conditions
  • Notification channels set up and tested
  • Runbooks created for common alert scenarios
  • Metrics documented for team understanding
  • Performance impact assessment completed

Priority

Low - Important for production operations but not critical for current functionality.

When to Implement

This becomes more valuable when:

  • Application has consistent production traffic
  • Team needs proactive issue detection
  • Performance optimization becomes important
  • Multiple team members support the application
  • SLA/SLO requirements are established
