Runbook

Alex-CloudOps edited this page Mar 8, 2026 · 1 revision

NOC-style operational procedures for responding to ecosystem health status changes detected by observability-dashboard.


Health Status Response Procedures

OBS-001 — Ecosystem Health: CRITICAL

Trigger: Overall ecosystem health score is CRITICAL

Severity: CRITICAL — Immediate response required

Initial Response:

  1. Run the dashboard immediately to get the latest snapshot:
   python main.py
  2. Review the Component Health section and identify which components are CRITICAL
  3. Review the Incidents section and count the open CRITICAL incidents
  4. Determine whether this is:
    • Single component failure — Isolated issue, targeted response
    • Multiple component failure — Systemic issue, all-hands response
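
The single-versus-multiple determination above can be sketched in Python. This is a minimal sketch, assuming the component statuses are available as a mapping of component name to status string; that mapping shape and the function name `classify_failure_scope` are hypothetical and are not taken from the dashboard code.

```python
def classify_failure_scope(components):
    """Classify CRITICAL scope from a {component_name: status} mapping.

    The mapping shape is an assumption for illustration; adapt it to the
    actual structure of the dashboard output.
    """
    critical = sorted(
        name for name, status in components.items()
        if str(status).upper() == "CRITICAL"
    )
    if not critical:
        scope = "none"
    elif len(critical) == 1:
        scope = "single"      # isolated issue, targeted response
    else:
        scope = "multiple"    # systemic issue, all-hands response
    return scope, critical


if __name__ == "__main__":
    example = {
        "Telemetry": "CRITICAL",
        "Uptime": "HEALTHY",
        "Log Intel": "DEGRADED",
        "Incident Pipeline": "HEALTHY",
    }
    print(classify_failure_scope(example))
```

A "single" result points to the component runbooks listed below; a "multiple" result points to the systemic procedure.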

If Single Component CRITICAL:

Navigate to the affected component's runbook:

  • Telemetry CRITICAL → See cloud-telemetry-agent Runbook
  • Uptime CRITICAL → See synthetic-uptime-monitor Runbook
  • Log Intel CRITICAL → See log-intelligence-engine Runbook
  • Incident Pipeline CRITICAL → See incident-alert-pipeline Runbook

If Multiple Components CRITICAL:

  1. Escalate immediately — systemic failure in progress
  2. Check AWS service health: https://health.aws.amazon.com
  3. Check network connectivity:
   ping 8.8.8.8
  4. Notify all stakeholders
  5. Begin incident documentation immediately

Resolution Verification: Run the dashboard again after remediation:

python main.py

Confirm that overall health returns to HEALTHY or DEGRADED before closing the incident.


OBS-002 — Ecosystem Health: DEGRADED

Trigger: Overall ecosystem health score is DEGRADED

Severity: MEDIUM (HIGH if worsening)

Initial Response:

  1. Run the dashboard to identify degraded components:
   python main.py
  2. Review the component health grid to find which component is DEGRADED
  3. Review the key metrics to find which specific metric is elevated
  4. Determine the trend: is it improving, stable, or worsening?
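
The trend question can be answered mechanically once a few readings of the elevated metric are on hand. The sketch below is one possible approach, assuming a list of readings ordered oldest first and a metric where higher means worse. The function name `classify_trend` and the 5% tolerance are illustrative assumptions, not thresholds used by the dashboard.

```python
def classify_trend(values, tolerance=0.05):
    """Classify a metric series (oldest first) where higher means worse.

    Returns "improving", "stable", or "worsening". The 5% tolerance is an
    arbitrary illustrative default, not a documented threshold.
    """
    if len(values) < 2:
        raise ValueError("need at least two readings to determine a trend")
    first, last = values[0], values[-1]
    # Scale the change by the starting value; fall back to 1 to avoid
    # dividing by zero when the series starts at 0.
    baseline = abs(first) if first != 0 else 1.0
    change = (last - first) / baseline
    if change > tolerance:
        return "worsening"
    if change < -tolerance:
        return "improving"
    return "stable"
```

Comparing only the first and last readings keeps the rule easy to reason about during an incident; a noisy metric may need a longer window or a moving average instead.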

If Stable Degradation:

  • Document the condition
  • Monitor on next scheduled run
  • Investigate root cause during business hours

If Worsening Degradation:

  • Escalate to the CRITICAL response procedures (OBS-001) immediately
  • Notify component owner

OBS-003 — Open Critical Incidents Driving CRITICAL Status

Trigger: Ecosystem health is CRITICAL because of open CRITICAL incidents in incident-alert-pipeline

Initial Response:

  1. Review open CRITICAL incidents:
   type incident-alert-pipeline\data\incidents.json
  2. Identify incidents with "severity": "CRITICAL" and "status": "OPEN"
  3. Investigate and resolve each incident
  4. Close resolved incidents:
   from pipeline.incident_manager import close_incident
   close_incident('INC-XXXXXXXXXXXXXXXX', notes='Resolved — describe action')
  5. Re-run the dashboard to confirm health status improves:
   python main.py
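
Scanning incidents.json by eye gets error-prone as the file grows. The following sketch filters for open CRITICAL incidents using the "severity" and "status" fields named above. It assumes the file holds a top-level JSON list of incident objects, which may not match the real layout; the helper names `load_incidents` and `open_critical_incidents` are hypothetical.

```python
import json
from pathlib import Path


def open_critical_incidents(incidents):
    """Return incidents whose severity is CRITICAL and status is OPEN."""
    return [
        inc for inc in incidents
        if inc.get("severity") == "CRITICAL" and inc.get("status") == "OPEN"
    ]


def load_incidents(path):
    """Load incidents from a JSON file assumed to hold a top-level list."""
    with Path(path).open(encoding="utf-8") as fh:
        data = json.load(fh)
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON list of incidents")
    return data


if __name__ == "__main__":
    path = Path("incident-alert-pipeline") / "data" / "incidents.json"
    for inc in open_critical_incidents(load_incidents(path)):
        print(inc)
```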

Operational Procedures

Scheduled Dashboard Run

For recurring ecosystem health checks, establish a consistent procedure:

  1. Collect the latest data from all portfolio repos
  2. Place the data files in the data/ directory
  3. Run the dashboard:
   python main.py
  4. Review terminal output for immediate health status
  5. Open exports/dashboard_summary.json for full detail
  6. Archive the summary with a timestamp:
   copy exports\dashboard_summary.json exports\summary-2026-03-08.json
  7. Document findings if health status is not HEALTHY
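
The archive step uses a hard-coded date in the copy command, which is easy to forget to update. As a hedged alternative, the sketch below derives the date automatically. The function name `archive_summary` is hypothetical; the file names follow the ones used in this runbook.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_summary(exports_dir="exports", run_date=None):
    """Copy dashboard_summary.json to summary-YYYY-MM-DD.json and return the new path."""
    exports = Path(exports_dir)
    run_date = run_date or date.today()
    source = exports / "dashboard_summary.json"
    target = exports / f"summary-{run_date.isoformat()}.json"
    shutil.copy2(source, target)  # copy2 also preserves the file's modification time
    return target


if __name__ == "__main__":
    print(archive_summary())
```

Note that a second run on the same day overwrites that day's archive. Add a time component to the file name if several runs per day need to be kept.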

Comparing Dashboard Runs

Maintain dated summary archives to identify trends:

exports/
├── summary-2026-03-05.json
├── summary-2026-03-06.json
├── summary-2026-03-07.json
├── summary-2026-03-08.json
└── dashboard_summary.json  ← always the latest run

Compare overall_health, open_incidents, and key metrics across dates to identify:

  • Improving trends — Health scores stabilizing, incident count dropping
  • Degrading trends — Metrics climbing, incident count growing
  • Sudden changes — Isolated event vs. ongoing pattern
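
A short script can line up the dated archives for comparison. This is a minimal sketch, assuming `overall_health` and `open_incidents` are top-level keys in each summary file; their actual position in dashboard_summary.json may differ. The function name `load_history` is hypothetical.

```python
import json
from pathlib import Path


def load_history(exports_dir="exports"):
    """Return [(date_string, overall_health, open_incidents)] sorted by date.

    Reads files named summary-YYYY-MM-DD.json. Keys missing from a file
    come back as None rather than raising.
    """
    rows = []
    # ISO dates sort chronologically when sorted as plain text.
    for path in sorted(Path(exports_dir).glob("summary-*.json")):
        with path.open(encoding="utf-8") as fh:
            summary = json.load(fh)
        run_date = path.stem[len("summary-"):]
        rows.append((run_date, summary.get("overall_health"), summary.get("open_incidents")))
    return rows


if __name__ == "__main__":
    for run_date, health, incidents in load_history():
        print(f"{run_date}  {health}  open incidents: {incidents}")
```

The glob pattern skips dashboard_summary.json itself, so only the dated archives appear in the history.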

Refreshing Power BI Data

After each dashboard run, refresh your Power BI report:

  1. Open your Power BI Desktop report
  2. Click Refresh in the Home ribbon
  3. Power BI will reload exports/powerbi_dataset.json
  4. All visualizations update automatically

Adding a New Data Source

When onboarding a new portfolio repository to the dashboard:

  1. Add the new source to config/sources.json:
   {
       "name": "New Monitor",
       "repo": "new-monitor-repo",
       "type": "custom",
       "data_file": "data/new_monitor.json",
       "enabled": true,
       "description": "Description of what this source provides"
   }
  2. Add aggregation logic to aggregator.py
  3. Add transformation logic to transformer.py
  4. Update health scoring in summary.py if needed
  5. Update Power BI export structure in exporter.py
  6. Test with sample data before enabling live data
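
Before testing with sample data, it can help to sanity-check the new entry itself. The sketch below validates one source entry against the fields shown in the example above. Treating every field as required, and the types chosen for each, are assumptions based only on that example; the dashboard may accept more or fewer fields. `REQUIRED_FIELDS` and `validate_source` are hypothetical names.

```python
# Field names come from the example entry; required-ness and types are assumed.
REQUIRED_FIELDS = {
    "name": str,
    "repo": str,
    "type": str,
    "data_file": str,
    "enabled": bool,
    "description": str,
}


def validate_source(entry):
    """Return a list of problems found in one source entry (empty if none)."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], expected):
            problems.append(f"{field} should be {expected.__name__}")
    return problems
```

An empty list means the entry has the expected shape. A common catch is `"enabled": "true"` written as a string rather than a JSON boolean.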

Escalation Matrix

| Condition | Severity | Action |
|---|---|---|
| Ecosystem CRITICAL, single component | HIGH | Follow component runbook immediately |
| Ecosystem CRITICAL, multiple components | CRITICAL | All-hands, systemic failure |
| Ecosystem CRITICAL, open incidents only | HIGH | Close resolved incidents, investigate open |
| Ecosystem DEGRADED, stable | MEDIUM | Monitor, investigate during business hours |
| Ecosystem DEGRADED, worsening | HIGH | Escalate immediately |
| Dashboard not producing exports | MEDIUM | Check config paths, verify data files |
| Power BI data stale | LOW | Refresh dataset, verify export path |
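
The health-related rows of the matrix can be encoded as a lookup for use in scripts or alerts. This is a sketch under stated assumptions: a CRITICAL status with no CRITICAL component is read as the "open incidents only" row, and HEALTHY returns no escalation. The matrix states neither rule explicitly, and the function name `escalation_for` is hypothetical.

```python
def escalation_for(overall_health, critical_components=0, worsening=False):
    """Map a dashboard result to (severity, action) per the escalation matrix."""
    health = overall_health.upper()
    if health == "CRITICAL":
        if critical_components >= 2:
            return "CRITICAL", "All-hands, systemic failure"
        if critical_components == 1:
            return "HIGH", "Follow component runbook immediately"
        # Assumption: no CRITICAL component means open incidents drive the status.
        return "HIGH", "Close resolved incidents, investigate open"
    if health == "DEGRADED":
        if worsening:
            return "HIGH", "Escalate immediately"
        return "MEDIUM", "Monitor, investigate during business hours"
    # Assumption: any other status (for example HEALTHY) needs no escalation.
    return None, "No escalation required"
```

The two tooling rows (missing exports and stale Power BI data) are left out because they describe dashboard faults rather than health states.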

Post-Incident Documentation

After every CRITICAL ecosystem health event, document the following:

  • Dashboard run timestamp — When was the CRITICAL status detected
  • Components affected — Which repos contributed to CRITICAL status
  • Key metrics at time of incident — From dashboard_summary.json
  • Open incidents count — How many CRITICAL incidents were open
  • Root cause — What drove the ecosystem to CRITICAL
  • Time to detect — When did the condition start vs. when dashboard caught it
  • Time to resolve — When ecosystem health returned to HEALTHY
  • Action taken — What was done across each affected component
  • Follow-up — Monitoring adjustments, config changes, tickets raised
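
The two timing fields are simple differences between timestamps. The sketch below computes them from ISO 8601 strings. It measures time to resolve from detection to recovery, which is one convention among several and is an assumption here; the function name `incident_durations` is hypothetical.

```python
from datetime import datetime


def incident_durations(started_at, detected_at, resolved_at):
    """Return (time_to_detect, time_to_resolve) as timedeltas.

    Time to resolve is measured here from detection to recovery; measuring
    from condition start is an equally valid convention.
    """
    start = datetime.fromisoformat(started_at)
    detected = datetime.fromisoformat(detected_at)
    resolved = datetime.fromisoformat(resolved_at)
    if not start <= detected <= resolved:
        raise ValueError("timestamps must be in order: start, detect, resolve")
    return detected - start, resolved - detected
```

Whichever convention is chosen, use it consistently across incident reports so the figures stay comparable.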