
feat: Add comprehensive monitoring and alerting for production observability #15

@doughayden

Description


Feature Request

Implement comprehensive monitoring and alerting to improve production observability and proactive issue detection.

Level of Effort: 🔥 Large (5-7 days)

  • Metrics implementation: 2-3 days for application-level metrics
  • Dashboard creation: 1-2 days for monitoring dashboards
  • Alerting setup: 1-2 days for alert rules and notification channels
  • Documentation: 1 day for runbooks and monitoring guides

Current Monitoring State

What we have:

  • Basic Cloud Run monitoring (CPU, memory, request count)
  • Default Google Cloud logging
  • Basic health check endpoint (/healthz)

What's missing:

  • Application-level metrics (query success rates, response times, user patterns)
  • Custom dashboards for business metrics
  • Proactive alerting for issues
  • Performance trend analysis
  • Error rate monitoring and alerting

Recommended Implementation

1. Application Metrics

Add custom metrics throughout the application:

# src/answer_app/metrics.py
import time

from google.cloud import monitoring_v3

class MetricsCollector:
    def __init__(self, project_id: str):
        self.client = monitoring_v3.MetricServiceClient()
        self.project_name = f"projects/{project_id}"

    def track_query_duration(self, duration: float, success: bool) -> None:
        """Track query processing time and success rate."""
        series = monitoring_v3.TimeSeries()
        series.metric.type = "custom.googleapis.com/answer_app/query_duration"
        series.metric.labels["success"] = str(success).lower()
        series.resource.type = "global"
        # Build a single point stamped with the current time.
        now = time.time()
        seconds = int(now)
        interval = monitoring_v3.TimeInterval(
            {"end_time": {"seconds": seconds, "nanos": int((now - seconds) * 1e9)}}
        )
        point = monitoring_v3.Point(
            {"interval": interval, "value": {"double_value": duration}}
        )
        series.points = [point]
        self.client.create_time_series(name=self.project_name, time_series=[series])

    def track_user_session(self, user_email: str, action: str) -> None:
        """Track user session activities (e.g. login, query, feedback)."""

    def track_discovery_engine_performance(self, response_time: float, token_count: int) -> None:
        """Track Discovery Engine API response time and token usage."""
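The collector could be wired into query handlers with a small timing decorator. This is a sketch; `timed_query` and the `collector` argument are illustrative names, not part of the existing codebase:

```python
import time
from functools import wraps

def timed_query(collector):
    """Decorator that reports call duration and success to a collector."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            success = True
            try:
                return func(*args, **kwargs)
            except Exception:
                success = False
                raise
            finally:
                # Report elapsed seconds and whether the call raised.
                collector.track_query_duration(time.perf_counter() - start, success)
        return wrapper
    return decorator
```

Any object exposing `track_query_duration(duration, success)` works here, which also makes the decorator easy to unit-test with a fake collector.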

2. Key Metrics to Track

Performance Metrics:

  • Query response time (p50, p95, p99)
  • Discovery Engine API response time
  • BigQuery insert latency
  • Error rates by endpoint

Business Metrics:

  • Active users per day/week
  • Queries per user session
  • Popular query patterns
  • User satisfaction (feedback scores)
  • Session duration and engagement

System Metrics:

  • Memory usage patterns
  • CPU utilization trends
  • Network I/O
  • Authentication success/failure rates

3. Dashboard Implementation

Google Cloud Monitoring Dashboards:

# monitoring/dashboards/answer-app-overview.json
{
  "displayName": "Answer App Overview",
  "widgets": [
    {
      "title": "Query Response Time",
      "scorecard": {
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "resource.type=\"cloud_run_revision\" AND metric.type=\"run.googleapis.com/request_latencies\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_PERCENTILE_95"
            }
          }
        }
      }
    }
  ]
}
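Dashboard JSON under `monitoring/dashboards/` can be sanity-checked in CI before Terraform applies it. A minimal sketch, assuming dashboards are plain `.json` files and checking only the two top-level keys used above:

```python
import json
from pathlib import Path

REQUIRED_KEYS = {"displayName", "widgets"}

def validate_dashboard(text: str) -> list[str]:
    """Return a list of problems found in one dashboard JSON document."""
    try:
        doc = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    missing = REQUIRED_KEYS - doc.keys()
    return [f"missing key: {k}" for k in sorted(missing)]

def validate_dashboard_dir(path: str = "monitoring/dashboards") -> dict[str, list[str]]:
    """Validate every *.json file in the dashboards directory."""
    return {
        p.name: problems
        for p in sorted(Path(path).glob("*.json"))
        if (problems := validate_dashboard(p.read_text()))
    }
```

Running `validate_dashboard_dir()` in a pre-merge check catches malformed dashboards before they reach `terraform apply`.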

4. Alerting Rules

Critical Alerts:

  • Error rate > 5% for 5 minutes
  • Response time p95 > 10 seconds
  • Service availability < 99%
  • Authentication failures > 20/minute

Warning Alerts:

  • Response time p95 > 5 seconds
  • Memory usage > 80%
  • Unusual query volume spikes
  • Discovery Engine API latency increases
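The "error rate > 5% for 5 minutes" condition can be prototyped locally with a sliding window before it is encoded as an alert policy. The window size and threshold mirror the critical alert above; the class name is illustrative:

```python
from collections import deque

class ErrorRateWindow:
    """Sliding-window error-rate check (e.g. > 5% over 300 seconds)."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.05):
        self.window = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, success) pairs

    def record(self, timestamp: float, success: bool) -> None:
        self.events.append((timestamp, success))

    def error_rate(self, now: float) -> float:
        # Drop events older than the window, then compute the failure ratio.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        if not self.events:
            return 0.0
        failures = sum(1 for _, ok in self.events if not ok)
        return failures / len(self.events)

    def should_alert(self, now: float) -> bool:
        return self.error_rate(now) > self.threshold
```

Cloud Monitoring evaluates the production version of this condition server-side; the sketch is only useful for tuning the threshold against replayed traffic.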

Implementation Areas

Files to Create/Modify:

  • src/answer_app/metrics.py: Metrics collection utilities
  • src/answer_app/main.py: Add metrics middleware
  • src/answer_app/utils.py: Instrument key operations
  • monitoring/dashboards/: Dashboard configurations
  • monitoring/alerts/: Alert policy definitions
  • terraform/modules/monitoring/: Infrastructure for monitoring

Integration Points:

  • FastAPI middleware for request metrics
  • Discovery Engine wrapper with timing
  • BigQuery operations instrumentation
  • Streamlit app user interaction tracking
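For the FastAPI integration point, per-request timing can be captured with a small ASGI middleware. A sketch under stated assumptions: the `record` callback is a stand-in for `MetricsCollector.track_query_duration` and is not existing code:

```python
import time

class MetricsMiddleware:
    """ASGI middleware that times each HTTP request and reports success."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # e.g. collector.track_query_duration

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        status = {"code": 500}  # assume failure unless a response starts

        async def send_wrapper(message):
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        try:
            await self.app(scope, receive, send_wrapper)
        finally:
            # Count 5xx responses (and raised exceptions) as failures.
            self.record(time.perf_counter() - start, status["code"] < 500)
```

With FastAPI this would be registered via `app.add_middleware(MetricsMiddleware, record=...)`; because it is plain ASGI, it can also be tested without the framework.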

Configuration

Add to config.yaml:

monitoring:
  enabled: true
  metrics_project: ${PROJECT}
  dashboard_enabled: true
  alert_channels:
    - email: ops-team@company.com
    - slack: "#alerts"
  
  thresholds:
    error_rate_critical: 0.05
    response_time_warning: 5.0
    response_time_critical: 10.0
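Once config.yaml is loaded, the thresholds block maps naturally onto a severity classifier. A minimal sketch with the parsed values hard-coded for illustration; the function name is hypothetical:

```python
# Thresholds as they would appear after parsing config.yaml.
THRESHOLDS = {
    "error_rate_critical": 0.05,
    "response_time_warning": 5.0,
    "response_time_critical": 10.0,
}

def classify_response_time(p95_seconds: float) -> str:
    """Map an observed p95 response time to an alert severity."""
    if p95_seconds > THRESHOLDS["response_time_critical"]:
        return "critical"
    if p95_seconds > THRESHOLDS["response_time_warning"]:
        return "warning"
    return "ok"
```

Keeping the thresholds in config rather than in code means the alert policy and any in-app classification stay in sync from a single source.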

Terraform Infrastructure

# terraform/modules/monitoring/main.tf
resource "google_monitoring_dashboard" "answer_app" {
  dashboard_json = file("${path.module}/dashboards/answer-app-overview.json")
}

resource "google_monitoring_alert_policy" "high_error_rate" {
  display_name = "Answer App High Error Rate"
  combiner     = "OR"

  conditions {
    display_name = "Error rate > 5%"
    # Alert condition configuration, e.g. a condition_threshold block
    # on the ratio of 5xx responses over a 300s window
  }
}

Testing Strategy

  • Metrics validation: Ensure metrics are properly recorded
  • Dashboard testing: Verify dashboard displays work correctly
  • Alert testing: Test alert triggering and notification delivery
  • Load testing: Generate metrics under various load conditions
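The metrics-validation item can be exercised without calling Cloud Monitoring by substituting an in-memory collector for the real one. `InMemoryCollector` is a hypothetical test double, not existing code:

```python
class InMemoryCollector:
    """Test double that records metric calls instead of exporting them."""

    def __init__(self):
        self.durations = []  # (duration, success) pairs

    def track_query_duration(self, duration: float, success: bool) -> None:
        self.durations.append((duration, success))

def test_query_duration_recorded():
    collector = InMemoryCollector()
    # Simulate the application reporting one successful and one failed query.
    collector.track_query_duration(0.42, True)
    collector.track_query_duration(1.30, False)
    assert collector.durations == [(0.42, True), (1.30, False)]
    assert all(d >= 0 for d, _ in collector.durations)
```

The same double slots into any code that depends only on the collector's interface, keeping metric tests fast and free of GCP credentials.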

Acceptance Criteria

  • Application metrics implemented for key operations
  • Custom dashboards created for business and technical metrics
  • Alert policies configured for critical and warning conditions
  • Notification channels set up and tested
  • Runbooks created for common alert scenarios
  • Metrics documented for team understanding
  • Performance impact assessment completed

Priority

Low - Important for production operations but not critical for current functionality.

When to Implement

This becomes more valuable when:

  • Application has consistent production traffic
  • Team needs proactive issue detection
  • Performance optimization becomes important
  • Multiple team members support the application
  • SLA/SLO requirements are established
