Runbook

NOC-style operational procedures for responding to ecosystem health status changes detected by observability-dashboard.

Health Status Response Procedures

OBS-001 — Ecosystem Health: CRITICAL

Trigger: Overall ecosystem health score is CRITICAL

Severity: CRITICAL — Immediate response required

Initial Response:

Run the dashboard immediately to get the latest snapshot:

   python main.py

Review the Component Health section — identify which components are CRITICAL
Review the Incidents section — how many CRITICAL incidents are open?
Determine if this is:
- Single component failure — Isolated issue, targeted response
- Multiple component failure — Systemic issue, all-hands response

If Single Component CRITICAL:

Navigate to the affected component's runbook:

Telemetry CRITICAL → See cloud-telemetry-agent Runbook
Uptime CRITICAL → See synthetic-uptime-monitor Runbook
Log Intel CRITICAL → See log-intelligence-engine Runbook
Incident Pipeline CRITICAL → See incident-alert-pipeline Runbook
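
The single-versus-multiple decision and the runbook lookup above can be sketched in a few lines of Python. This is only an illustration: the runbook names are taken from the list above, but the `triage` function and the shape of its input (a mapping of component name to status string) are assumptions, not part of the dashboard's code.

```python
# Runbook names are copied from the list above.
RUNBOOKS = {
    "Telemetry": "cloud-telemetry-agent Runbook",
    "Uptime": "synthetic-uptime-monitor Runbook",
    "Log Intel": "log-intelligence-engine Runbook",
    "Incident Pipeline": "incident-alert-pipeline Runbook",
}


def triage(component_status):
    """Return (response_type, runbooks) for a component -> status mapping.

    Hypothetical helper: the input shape is assumed, not read from the dashboard.
    """
    critical = sorted(
        name for name, status in component_status.items() if status == "CRITICAL"
    )
    if not critical:
        return "none", []
    runbooks = [RUNBOOKS.get(name, "unknown component") for name in critical]
    if len(critical) == 1:
        return "targeted", runbooks   # isolated issue
    return "all-hands", runbooks      # systemic issue


# Example with made-up statuses
print(triage({"Telemetry": "CRITICAL", "Uptime": "HEALTHY"}))
```

A result of "all-hands" corresponds to the multiple-component procedure that follows.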

If Multiple Components CRITICAL:

Escalate immediately — systemic failure in progress
Check AWS service health: https://health.aws.amazon.com
Check network connectivity:

   ping 8.8.8.8

Notify all stakeholders
Begin incident documentation immediately

Resolution Verification: Run the dashboard again after remediation:

   python main.py

Confirm overall health returns to HEALTHY or DEGRADED before closing incident.
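
The closing check can also be scripted against the exported summary. The sketch below assumes that exports/dashboard_summary.json holds a top-level `overall_health` key with a string value; the helper names are hypothetical.

```python
import json

# Statuses that this runbook accepts before an incident may be closed.
ACCEPTABLE = {"HEALTHY", "DEGRADED"}


def health_allows_closure(summary):
    """True when the summary's overall_health is HEALTHY or DEGRADED.

    Assumption: overall_health is a top-level string in the summary.
    """
    return summary.get("overall_health") in ACCEPTABLE


def check_summary_file(path="exports/dashboard_summary.json"):
    """Load the exported summary from disk and apply the check above."""
    with open(path, encoding="utf-8") as fh:
        return health_allows_closure(json.load(fh))
```

Call `check_summary_file()` from the dashboard directory after re-running `python main.py`; a False result means the incident should stay open.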

OBS-002 — Ecosystem Health: DEGRADED

Trigger: Overall ecosystem health score is DEGRADED

Severity: MEDIUM, rising to HIGH if the degradation is worsening

Initial Response:

Run the dashboard to identify degraded components:

   python main.py

Review component health grid — which component is DEGRADED?
Review key metrics — which specific metric is elevated?
Determine trend — is it improving, stable, or worsening?
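
One way to make the trend call less subjective is to compare the first and last readings of the elevated metric across recent runs. The sketch below is an assumption-laden illustration: it treats higher values as worse, and the 5% tolerance is an arbitrary example threshold, not a value defined by the dashboard.

```python
def classify_trend(values, tolerance=0.05):
    """Classify a metric series (oldest first) where higher means worse.

    Hypothetical helper: compares the last reading with the first and
    treats relative changes within the tolerance as stable.
    """
    if len(values) < 2:
        return "unknown"
    first, last = values[0], values[-1]
    baseline = abs(first) if first else 1.0  # avoid dividing by zero
    change = (last - first) / baseline
    if change > tolerance:
        return "worsening"
    if change < -tolerance:
        return "improving"
    return "stable"


# Example with made-up readings from three consecutive runs
print(classify_trend([100, 120, 150]))
```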

If Stable Degradation:

Document the condition
Monitor on next scheduled run
Investigate root cause during business hours

If Worsening Degradation:

Escalate to CRITICAL response procedures immediately
Notify component owner

OBS-003 — Open Critical Incidents Driving CRITICAL Status

Trigger: Ecosystem health CRITICAL due to open CRITICAL incidents in incident-alert-pipeline

Initial Response:

Review open CRITICAL incidents:

   type incident-alert-pipeline\data\incidents.json

Identify incidents with "severity": "CRITICAL" and "status": "OPEN"
Investigate and resolve each incident
Close resolved incidents:

   from pipeline.incident_manager import close_incident
   close_incident('INC-XXXXXXXXXXXXXXXX', notes='Resolved — describe action')

Re-run the dashboard to confirm health status improves:

   python main.py
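
For a long incidents file, the filtering step can be done in Python instead of reading the raw JSON by eye. This sketch assumes incidents.json contains a top-level list of objects; the `severity` and `status` keys come from the step above, while the `id` key and the helper names are hypothetical.

```python
import json
from pathlib import Path


def open_critical(incidents):
    """Return the incidents that are both CRITICAL and OPEN."""
    return [
        incident
        for incident in incidents
        if incident.get("severity") == "CRITICAL" and incident.get("status") == "OPEN"
    ]


def load_incidents(path=Path("incident-alert-pipeline") / "data" / "incidents.json"):
    """Read the incidents file. Assumption: it is a JSON list of objects."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# Example with made-up records ("id" is an assumed key name)
sample = [
    {"id": "INC-EXAMPLE-1", "severity": "CRITICAL", "status": "OPEN"},
    {"id": "INC-EXAMPLE-2", "severity": "CRITICAL", "status": "CLOSED"},
    {"id": "INC-EXAMPLE-3", "severity": "LOW", "status": "OPEN"},
]
print(open_critical(sample))
```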

Operational Procedures

Scheduled Dashboard Run

For recurring ecosystem health checks, follow a consistent procedure:

Collect latest data from all portfolio repos
Place data files in data/ directory
Run the dashboard:

   python main.py

Review terminal output for immediate health status
Open exports/dashboard_summary.json for full detail
Archive the summary with a timestamp:

   copy exports\dashboard_summary.json exports\summary-2026-03-08.json

Document findings if health status is not HEALTHY
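
The archive step can be automated so that the date in the file name always matches the run date. The sketch below is a Python equivalent of the copy command above; the function names are hypothetical, and it assumes the script runs from the dashboard directory.

```python
import shutil
from datetime import date
from pathlib import Path


def archive_name(run_date):
    """Build the dated file name, e.g. summary-2026-03-08.json."""
    return f"summary-{run_date.isoformat()}.json"


def archive_summary(exports_dir="exports", run_date=None):
    """Copy dashboard_summary.json to a dated archive file and return its path."""
    run_date = run_date or date.today()
    source = Path(exports_dir) / "dashboard_summary.json"
    target = Path(exports_dir) / archive_name(run_date)
    shutil.copy2(source, target)  # copy2 also preserves the file timestamps
    return target
```

Calling `archive_summary()` right after `python main.py` keeps one dated copy per day; a second run on the same day overwrites that day's archive.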

Comparing Dashboard Runs

Maintain dated summary archives to identify trends:

exports/
├── summary-2026-03-05.json
├── summary-2026-03-06.json
├── summary-2026-03-07.json
├── summary-2026-03-08.json
└── dashboard_summary.json  ← always the latest run

Compare overall_health, open_incidents, and key metrics across dates to identify:

Improving trends — Health scores stabilizing, incident count dropping
Degrading trends — Metrics climbing, incident count growing
Sudden changes — Isolated event vs. ongoing pattern
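
The comparison can be sketched as a small script that walks the dated archives in order. It assumes each summary has a top-level `overall_health` string and an integer `open_incidents` count; if the real export nests these fields or stores incidents as a list, the lookups need adjusting. The function names are hypothetical.

```python
import json
from pathlib import Path


def load_history(exports_dir="exports"):
    """Return (date_string, summary) pairs, oldest first.

    ISO dates in the file names sort correctly as plain strings.
    """
    history = []
    for path in sorted(Path(exports_dir).glob("summary-*.json")):
        summary = json.loads(path.read_text(encoding="utf-8"))
        history.append((path.stem[len("summary-"):], summary))
    return history


def describe_changes(history):
    """Return one line per consecutive pair of runs."""
    lines = []
    for (prev_date, prev), (curr_date, curr) in zip(history, history[1:]):
        delta = curr.get("open_incidents", 0) - prev.get("open_incidents", 0)
        lines.append(
            f"{prev_date} -> {curr_date}: "
            f"health {prev.get('overall_health')} -> {curr.get('overall_health')}, "
            f"open incidents {delta:+d}"
        )
    return lines
```

Running `describe_changes(load_history())` from the dashboard directory gives a run-by-run view in which a steady climb in open incidents stands out from a one-off spike.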

Refreshing Power BI Data

After each dashboard run, refresh your Power BI report:

Open your Power BI Desktop report
Click Refresh in the Home ribbon
Power BI will reload exports/powerbi_dataset.json
All visualizations update automatically

Adding a New Data Source

When onboarding a new portfolio repository to the dashboard:

Add the new source to config/sources.json:

   {
       "name": "New Monitor",
       "repo": "new-monitor-repo",
       "type": "custom",
       "data_file": "data/new_monitor.json",
       "enabled": true,
       "description": "Description of what this source provides"
   }

Add aggregation logic to aggregator.py
Add transformation logic to transformer.py
Update health scoring in summary.py if needed
Update Power BI export structure in exporter.py
Test with sample data before enabling live data
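
Before enabling a new source, it can help to sanity-check its entry in config/sources.json. The sketch below treats all six keys from the example entry as required and checks their types; that requirement is an assumption drawn from the example, not a documented schema, and `validate_source` is a hypothetical helper.

```python
# Keys and types taken from the example entry above (assumed to be required).
REQUIRED_KEYS = {
    "name": str,
    "repo": str,
    "type": str,
    "data_file": str,
    "enabled": bool,
    "description": str,
}


def validate_source(entry):
    """Return a list of problems found in one source entry (empty if none)."""
    problems = []
    for key, expected in REQUIRED_KEYS.items():
        if key not in entry:
            problems.append(f"missing key: {key}")
        elif not isinstance(entry[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


# Example using the entry shown above
new_source = {
    "name": "New Monitor",
    "repo": "new-monitor-repo",
    "type": "custom",
    "data_file": "data/new_monitor.json",
    "enabled": True,
    "description": "Description of what this source provides",
}
print(validate_source(new_source))
```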

Escalation Matrix

Condition	Severity	Action
Ecosystem CRITICAL, single component	HIGH	Follow component runbook immediately
Ecosystem CRITICAL, multiple components	CRITICAL	All-hands, systemic failure
Ecosystem CRITICAL, open incidents only	HIGH	Close resolved incidents, investigate open
Ecosystem DEGRADED, stable	MEDIUM	Monitor, investigate during business hours
Ecosystem DEGRADED, worsening	HIGH	Escalate immediately
Dashboard not producing exports	MEDIUM	Check config paths, verify data files
Power BI data stale	LOW	Refresh dataset, verify export path

Post-Incident Documentation

After every CRITICAL ecosystem health event, document the following:

Dashboard run timestamp — When was the CRITICAL status detected
Components affected — Which repos contributed to CRITICAL status
Key metrics at time of incident — From dashboard_summary.json
Open incidents count — How many CRITICAL incidents were open
Root cause — What drove the ecosystem to CRITICAL
Time to detect — When did the condition start vs. when dashboard caught it
Time to resolve — When ecosystem health returned to HEALTHY
Action taken — What was done across each affected component
Follow-up — Monitoring adjustments, config changes, tickets raised
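
The two timing fields are simple differences between timestamps. The sketch below measures time to resolve from the moment of detection, which is an assumption since the list above does not say where that clock starts; the function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime


def incident_timings(started, detected, resolved):
    """Return time to detect and time to resolve, both in minutes.

    Assumption: time to resolve is measured from detection, not from onset.
    """
    return {
        "time_to_detect_min": (detected - started).total_seconds() / 60,
        "time_to_resolve_min": (resolved - detected).total_seconds() / 60,
    }


# Example with made-up timestamps
timings = incident_timings(
    started=datetime(2026, 3, 8, 9, 0),
    detected=datetime(2026, 3, 8, 9, 30),
    resolved=datetime(2026, 3, 8, 11, 0),
)
print(timings)
```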
