# Feature Request

Implement comprehensive monitoring and alerting to improve production observability and proactive issue detection.

**Level of Effort:** 🔥 Large (5-7 days)

- Metrics implementation: 2-3 days for application-level metrics
- Dashboard creation: 1-2 days for monitoring dashboards
- Alerting setup: 1-2 days for alert rules and notification channels
- Documentation: 1 day for runbooks and monitoring guides
## Current Monitoring State

**What we have:**

- Basic Cloud Run monitoring (CPU, memory, request count)
- Default Google Cloud logging
- Basic health check endpoint (`/healthz`)

**What's missing:**

- Application-level metrics (query success rates, response times, user patterns)
- Custom dashboards for business metrics
- Proactive alerting for issues
- Performance trend analysis
- Error rate monitoring and alerting
## Recommended Implementation

### 1. Application Metrics

Add custom metrics throughout the application:
```python
# src/answer_app/metrics.py
from google.cloud import monitoring_v3
import time
from functools import wraps


class MetricsCollector:
    def __init__(self, project_id: str):
        self.client = monitoring_v3.MetricServiceClient()
        self.project_name = f"projects/{project_id}"

    def track_query_duration(self, duration: float, success: bool):
        """Track query processing time and success rate."""

    def track_user_session(self, user_email: str, action: str):
        """Track user session activities."""

    def track_discovery_engine_performance(self, response_time: float, token_count: int):
        """Track Discovery Engine API performance."""
```
### 2. Key Metrics to Track

**Performance Metrics:**

- Query response time (p50, p95, p99)
- Discovery Engine API response time
- BigQuery insert latency
- Error rates by endpoint

**Business Metrics:**

- Active users per day/week
- Queries per user session
- Popular query patterns
- User satisfaction (feedback scores)
- Session duration and engagement

**System Metrics:**

- Memory usage patterns
- CPU utilization trends
- Network I/O
- Authentication success/failure rates
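Percentile latencies like p50/p95/p99 are normally computed by Cloud Monitoring itself, but for local sanity checks over collected samples the calculation can be sketched as (pure Python, nearest-rank method; illustrative):

```python
def percentile(samples, pct):
    """Nearest-rank percentile of collected latency samples (illustrative).

    `pct` is a percentage in (0, 100], e.g. percentile(latencies, 95).
    """
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = -(-len(ordered) * pct // 100)  # ceil(n * pct / 100), 1-indexed
    return ordered[max(int(rank), 1) - 1]
```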
### 3. Dashboard Implementation

**Google Cloud Monitoring Dashboards** (`monitoring/dashboards/answer-app-overview.json`):

```json
{
  "displayName": "Answer App Overview",
  "widgets": [
    {
      "title": "Query Response Time",
      "scorecard": {
        "timeSeriesQuery": {
          "timeSeriesFilter": {
            "filter": "resource.type=\"cloud_run_revision\"",
            "aggregation": {
              "alignmentPeriod": "60s",
              "perSeriesAligner": "ALIGN_RATE"
            }
          }
        }
      }
    }
  ]
}
```
### 4. Alerting Rules

**Critical Alerts:**

- Error rate > 5% for 5 minutes
- Response time p95 > 10 seconds
- Service availability < 99%
- Authentication failures > 20/minute

**Warning Alerts:**

- Response time p95 > 5 seconds
- Memory usage > 80%
- Unusual query volume spikes
- Discovery Engine API latency increases
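The "error rate > 5% for 5 minutes" rule is evaluated by Cloud Monitoring, but the underlying computation is simple enough to sketch (illustrative; timestamps in seconds, function names are hypothetical):

```python
def error_rate(events, now, window=300):
    """Fraction of failed requests in the trailing `window` seconds.

    `events` is an iterable of (timestamp, success) pairs (illustrative).
    """
    recent = [ok for ts, ok in events if now - ts <= window]
    if not recent:
        return 0.0
    return recent.count(False) / len(recent)


def should_page(events, now, threshold=0.05, window=300):
    """True when the windowed error rate exceeds the critical threshold."""
    return error_rate(events, now, window) > threshold
```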
## Implementation Areas

**Files to Create/Modify:**

- `src/answer_app/metrics.py`: Metrics collection utilities
- `src/answer_app/main.py`: Add metrics middleware
- `src/answer_app/utils.py`: Instrument key operations
- `monitoring/dashboards/`: Dashboard configurations
- `monitoring/alerts/`: Alert policy definitions
- `terraform/modules/monitoring/`: Infrastructure for monitoring

**Integration Points:**

- FastAPI middleware for request metrics
- Discovery Engine wrapper with timing
- BigQuery operations instrumentation
- Streamlit app user interaction tracking
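For the FastAPI middleware integration point, a minimal sketch at the ASGI level (FastAPI apps are ASGI apps, so a class like this can be attached via `app.add_middleware`; the `record` callback is a hypothetical hook into the metrics collector):

```python
import time


class TimingMiddleware:
    """ASGI middleware that reports each request's path, status code, and
    duration to a `record` callback (illustrative sketch)."""

    def __init__(self, app, record):
        self.app = app
        self.record = record  # e.g. lambda path, status, duration: ...

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        start = time.perf_counter()
        status = {"code": None}

        async def send_wrapper(message):
            # Capture the status code as the response starts.
            if message["type"] == "http.response.start":
                status["code"] = message["status"]
            await send(message)

        await self.app(scope, receive, send_wrapper)
        self.record(scope["path"], status["code"], time.perf_counter() - start)
```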
## Configuration

Add to `config.yaml`:

```yaml
monitoring:
  enabled: true
  metrics_project: ${PROJECT}
  dashboard_enabled: true
  alert_channels:
    - email: ops-team@company.com
    - slack: "#alerts"
  thresholds:
    error_rate_critical: 0.05
    response_time_warning: 5.0
    response_time_critical: 10.0
```
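On the application side, these values might be read and validated at startup; a sketch assuming PyYAML and the keys above (the `load_thresholds` helper is hypothetical):

```python
import yaml

CONFIG_YAML = """
monitoring:
  enabled: true
  thresholds:
    error_rate_critical: 0.05
    response_time_warning: 5.0
    response_time_critical: 10.0
"""


def load_thresholds(text):
    """Parse the monitoring section and return its thresholds (sketch)."""
    cfg = yaml.safe_load(text)["monitoring"]
    if not cfg.get("enabled", False):
        return {}
    return cfg["thresholds"]
```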
Terraform Infrastructure
# terraform/modules/monitoring/main.tf
resource "google_monitoring_dashboard" "answer_app" {
dashboard_json = file("${path.module}/dashboards/answer-app-overview.json")
}
resource "google_monitoring_alert_policy" "high_error_rate" {
display_name = "Answer App High Error Rate"
conditions {
display_name = "Error rate > 5%"
# Alert condition configuration
}
}
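The alert condition placeholder could be filled in along these lines (a sketch only; the metric filter, threshold, and the `email` notification channel resource are assumptions that would need to match the real infrastructure):

```hcl
resource "google_monitoring_alert_policy" "high_error_rate" {
  display_name = "Answer App High Error Rate"
  combiner     = "OR"

  conditions {
    display_name = "Error rate > 5%"

    condition_threshold {
      # Hypothetical filter on Cloud Run 5xx request counts.
      filter          = "resource.type = \"cloud_run_revision\" AND metric.type = \"run.googleapis.com/request_count\" AND metric.labels.response_code_class = \"5xx\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.05
      duration        = "300s"

      aggregations {
        alignment_period   = "60s"
        per_series_aligner = "ALIGN_RATE"
      }
    }
  }

  # Assumes a notification channel defined elsewhere in the module.
  notification_channels = [google_monitoring_notification_channel.email.id]
}
```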
## Testing Strategy

- **Metrics validation:** Ensure metrics are properly recorded
- **Dashboard testing:** Verify dashboards display correctly
- **Alert testing:** Test alert triggering and notification delivery
- **Load testing:** Generate metrics under various load conditions
## Acceptance Criteria
## Priority

**Low** - important for production operations but not critical for current functionality.
## When to Implement

This becomes more valuable when:

- Application has consistent production traffic
- Team needs proactive issue detection
- Performance optimization becomes important
- Multiple team members support the application
- SLA/SLO requirements are established