Runbook
NOC-style operational procedures for responding to ecosystem health status changes detected by
observability-dashboard.
Trigger: Overall ecosystem health score is CRITICAL
Severity: CRITICAL — Immediate response required
Initial Response:
- Run the dashboard immediately to get the latest snapshot: `python main.py`
- Review the Component Health section and identify which components are CRITICAL
- Review the Incidents section — how many CRITICAL incidents are open?
- Determine if this is:
- Single component failure — Isolated issue, targeted response
- Multiple component failure — Systemic issue, all-hands response
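The single-versus-multiple decision above can be sketched in a few lines. This is a minimal illustration, not the dashboard's own logic: the function name `classify_critical_scope` and the shape of the `component_health` mapping (component name to status string) are assumptions made for the example.

```python
from typing import Dict, List, Tuple


def classify_critical_scope(component_health: Dict[str, str]) -> Tuple[str, List[str]]:
    """Return (scope, critical_components) for a component -> status mapping.

    scope is "none", "single", or "multiple". The mapping shape is an
    assumption for this sketch, not the dashboard's documented output.
    """
    critical = sorted(
        name for name, status in component_health.items()
        if status.upper() == "CRITICAL"
    )
    if not critical:
        scope = "none"
    elif len(critical) == 1:
        scope = "single"
    else:
        scope = "multiple"
    return scope, critical


# Hypothetical snapshot; real statuses come from the dashboard run.
snapshot = {
    "Telemetry": "CRITICAL",
    "Uptime": "HEALTHY",
    "Log Intel": "DEGRADED",
    "Incident Pipeline": "HEALTHY",
}
scope, affected = classify_critical_scope(snapshot)
print(scope, affected)  # single ['Telemetry']
```

A result of `"single"` points to the targeted response; `"multiple"` points to the all-hands response.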
If Single Component CRITICAL:
Navigate to the affected component's runbook:
- Telemetry CRITICAL → See cloud-telemetry-agent Runbook
- Uptime CRITICAL → See synthetic-uptime-monitor Runbook
- Log Intel CRITICAL → See log-intelligence-engine Runbook
- Incident Pipeline CRITICAL → See incident-alert-pipeline Runbook
If Multiple Components CRITICAL:
- Escalate immediately — systemic failure in progress
- Check AWS service health: https://health.aws.amazon.com
- Check network connectivity: `ping 8.8.8.8`
- Notify all stakeholders
- Begin incident documentation immediately
Resolution Verification: Run the dashboard again after remediation: `python main.py`
Confirm overall health returns to HEALTHY or DEGRADED before closing the incident.
Trigger: Overall ecosystem health score is DEGRADED
Severity: MEDIUM → HIGH if worsening
Initial Response:
- Run the dashboard to identify degraded components: `python main.py`
- Review the component health grid to see which component is DEGRADED
- Review key metrics — which specific metric is elevated?
- Determine trend — is it improving, stable, or worsening?
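One way to make the trend call repeatable is to compare the latest reading of the elevated metric with the earliest one in the window. The sketch below assumes that a higher value is worse and uses an arbitrary 5% tolerance band; both are illustrative assumptions, and `classify_trend` is a hypothetical helper rather than part of the dashboard.

```python
from typing import Sequence


def classify_trend(values: Sequence[float], tolerance: float = 0.05) -> str:
    """Label a metric series as improving, stable, or worsening.

    Assumes higher values are worse. The relative change between the first
    and last reading is compared against the tolerance band.
    """
    if len(values) < 2:
        return "unknown"  # a single reading cannot show a trend
    first, last = values[0], values[-1]
    baseline = abs(first) if first != 0 else 1.0  # avoid division by zero
    change = (last - first) / baseline
    if change > tolerance:
        return "worsening"
    if change < -tolerance:
        return "improving"
    return "stable"


# Hypothetical readings taken from three consecutive dashboard runs.
print(classify_trend([100.0, 110.0, 125.0]))  # worsening
```

Comparing only the endpoints keeps the rule easy to explain during an incident; a noisy metric may need a longer window or a moving average instead.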
If Stable Degradation:
- Document the condition
- Monitor on next scheduled run
- Investigate root cause during business hours
If Worsening Degradation:
- Escalate to CRITICAL response procedures immediately
- Notify component owner
Trigger: Ecosystem health CRITICAL due to open CRITICAL incidents in incident-alert-pipeline
Initial Response:
- Review open CRITICAL incidents: `type incident-alert-pipeline\data\incidents.json`
- Identify incidents with `"severity": "CRITICAL"` and `"status": "OPEN"`
- Investigate and resolve each incident
- Close resolved incidents:
    from pipeline.incident_manager import close_incident
    close_incident('INC-XXXXXXXXXXXXXXXX', notes='Resolved: describe action')
- Re-run the dashboard to confirm health status improves: `python main.py`
For recurring ecosystem health checks, establish a consistent procedure:
- Collect latest data from all portfolio repos
- Place data files in the `data/` directory
- Run the dashboard: `python main.py`
- Review terminal output for immediate health status
- Open `exports/dashboard_summary.json` for full detail
- Archive the summary with a timestamp: `copy exports\dashboard_summary.json exports\summary-2026-03-08.json`
- Document findings if health status is not HEALTHY
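The archive step can also be scripted so the date in the file name is never typed by hand. This is a sketch only: `archive_summary` is a hypothetical helper, and it assumes the `exports/` layout and `summary-YYYY-MM-DD.json` naming used in this runbook.

```python
import shutil
from datetime import date
from pathlib import Path
from typing import Optional


def archive_summary(exports_dir: Path, run_date: Optional[date] = None) -> Path:
    """Copy dashboard_summary.json to summary-YYYY-MM-DD.json in the same folder.

    Defaults to today's date. An existing archive for that date is overwritten.
    """
    run_date = run_date or date.today()
    source = exports_dir / "dashboard_summary.json"
    target = exports_dir / f"summary-{run_date.isoformat()}.json"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Calling `archive_summary(Path("exports"))` after a dashboard run would produce the same file as the manual `copy` command, and works on platforms where `copy` is not available.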
Maintain dated summary archives to identify trends:
exports/
├── summary-2026-03-05.json
├── summary-2026-03-06.json
├── summary-2026-03-07.json
├── summary-2026-03-08.json
└── dashboard_summary.json ← always the latest run
Compare overall_health, open_incidents, and key metrics across dates to identify:
- Improving trends — Health scores stabilizing, incident count dropping
- Degrading trends — Metrics climbing, incident count growing
- Sudden changes — Isolated event vs. ongoing pattern
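A small script can line up the dated archives for this comparison. The sketch below assumes that `overall_health` and `open_incidents` are top-level keys in each summary file and that `open_incidents` is a number; the runbook names these fields but does not show the file's structure, so adjust the lookups to match the real export. `load_history` is a hypothetical helper.

```python
import json
from pathlib import Path
from typing import List, Tuple


def load_history(exports_dir: Path) -> List[Tuple[str, str, int]]:
    """Return (date, overall_health, open_incidents) rows, oldest first.

    ISO dates in the file names sort chronologically as plain strings.
    """
    rows = []
    for path in sorted(exports_dir.glob("summary-*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        day = path.stem[len("summary-"):]
        rows.append((
            day,
            data.get("overall_health", "UNKNOWN"),  # assumed top-level key
            data.get("open_incidents", 0),          # assumed top-level key
        ))
    return rows


def print_history(exports_dir: Path) -> None:
    """Print one line per archived run for a quick visual comparison."""
    for day, health, incidents in load_history(exports_dir):
        print(f"{day}  {health:<9} open_incidents={incidents}")
```

Running `print_history(Path("exports"))` gives a one-line-per-day view in which a growing incident count or a status change between consecutive dates is easy to spot.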
After each dashboard run, refresh your Power BI report:
- Open your Power BI Desktop report
- Click Refresh in the Home ribbon
- Power BI will reload `exports/powerbi_dataset.json`
- All visualizations update automatically
When onboarding a new portfolio repository to the dashboard:
- Add the new source to `config/sources.json`:
    {
      "name": "New Monitor",
      "repo": "new-monitor-repo",
      "type": "custom",
      "data_file": "data/new_monitor.json",
      "enabled": true,
      "description": "Description of what this source provides"
    }
- Add aggregation logic to `aggregator.py`
- Add transformation logic to `transformer.py`
- Update health scoring in `summary.py` if needed
- Update Power BI export structure in `exporter.py`
- Test with sample data before enabling live data
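Before enabling a new source, it can help to check the entry for missing or mistyped fields. The sketch below validates a single entry against the six fields shown in the example above. Treating all six as required is an assumption, and `validate_source` is a hypothetical helper, not part of the dashboard.

```python
from typing import Any, Dict, List

# Field names come from the example entry; treating all as required is an assumption.
EXPECTED_FIELDS = {
    "name": str,
    "repo": str,
    "type": str,
    "data_file": str,
    "enabled": bool,
    "description": str,
}


def validate_source(entry: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one source entry (empty if none)."""
    problems = []
    for key, expected in EXPECTED_FIELDS.items():
        if key not in entry:
            problems.append(f"missing key: {key}")
        elif not isinstance(entry[key], expected):
            problems.append(f"{key} should be {expected.__name__}")
    return problems


new_source = {
    "name": "New Monitor",
    "repo": "new-monitor-repo",
    "type": "custom",
    "data_file": "data/new_monitor.json",
    "enabled": True,
    "description": "Description of what this source provides",
}
print(validate_source(new_source))  # []
```

A common mistake this catches is `"enabled": "true"` written as a string instead of a JSON boolean.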
| Condition | Severity | Action |
|---|---|---|
| Ecosystem CRITICAL, single component | HIGH | Follow component runbook immediately |
| Ecosystem CRITICAL, multiple components | CRITICAL | All-hands, systemic failure |
| Ecosystem CRITICAL, open incidents only | HIGH | Close resolved incidents, investigate open |
| Ecosystem DEGRADED, stable | MEDIUM | Monitor, investigate during business hours |
| Ecosystem DEGRADED, worsening | HIGH | Escalate immediately |
| Dashboard not producing exports | MEDIUM | Check config paths, verify data files |
| Power BI data stale | LOW | Refresh dataset, verify export path |
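If the escalation matrix needs to be consulted from a script (for example, to annotate an alert), the table can be carried over as a plain lookup. This is an illustrative sketch: the conditions, severities, and actions are copied from the table above, while `ESCALATION_MATRIX` and `lookup_escalation` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

# condition -> (severity, action), copied from the escalation table.
ESCALATION_MATRIX: Dict[str, Tuple[str, str]] = {
    "Ecosystem CRITICAL, single component": ("HIGH", "Follow component runbook immediately"),
    "Ecosystem CRITICAL, multiple components": ("CRITICAL", "All-hands, systemic failure"),
    "Ecosystem CRITICAL, open incidents only": ("HIGH", "Close resolved incidents, investigate open"),
    "Ecosystem DEGRADED, stable": ("MEDIUM", "Monitor, investigate during business hours"),
    "Ecosystem DEGRADED, worsening": ("HIGH", "Escalate immediately"),
    "Dashboard not producing exports": ("MEDIUM", "Check config paths, verify data files"),
    "Power BI data stale": ("LOW", "Refresh dataset, verify export path"),
}


def lookup_escalation(condition: str) -> Optional[Tuple[str, str]]:
    """Return (severity, action) for a condition, ignoring case; None if unknown."""
    wanted = condition.strip().lower()
    for known, response in ESCALATION_MATRIX.items():
        if known.lower() == wanted:
            return response
    return None


print(lookup_escalation("ecosystem degraded, worsening"))  # ('HIGH', 'Escalate immediately')
```

Returning `None` for an unknown condition leaves the decision with the operator rather than guessing a severity.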
After every CRITICAL ecosystem health event, document the following:
- Dashboard run timestamp: when the CRITICAL status was detected
- Components affected: which repos contributed to CRITICAL status
- Key metrics at time of incident: from `dashboard_summary.json`
- Open incidents count: how many CRITICAL incidents were open
- Root cause: what drove the ecosystem to CRITICAL
- Time to detect: when the condition started vs. when the dashboard caught it
- Time to resolve: when ecosystem health returned to HEALTHY
- Action taken: what was done across each affected component
- Follow-up: monitoring adjustments, config changes, tickets raised
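Two of the items above are simple calculations and can be scripted to keep reports consistent. The sketch below counts open CRITICAL incidents using the `severity` and `status` fields named earlier in this runbook, and derives time to detect and time to resolve from three timestamps. Measuring time to resolve from the moment of detection is an assumption, and both function names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Mapping


def count_open_critical(incidents: Iterable[Mapping[str, str]]) -> int:
    """Count incidents with severity CRITICAL and status OPEN."""
    return sum(
        1 for inc in incidents
        if inc.get("severity") == "CRITICAL" and inc.get("status") == "OPEN"
    )


def incident_timings(started: datetime, detected: datetime,
                     resolved: datetime) -> Dict[str, timedelta]:
    """Return time to detect and time to resolve (measured from detection)."""
    if not started <= detected <= resolved:
        raise ValueError("expected started <= detected <= resolved")
    return {
        "time_to_detect": detected - started,
        "time_to_resolve": resolved - detected,
    }


# Hypothetical timestamps for illustration only.
timings = incident_timings(
    started=datetime(2026, 3, 8, 9, 0),
    detected=datetime(2026, 3, 8, 9, 20),
    resolved=datetime(2026, 3, 8, 11, 5),
)
print(timings["time_to_detect"], timings["time_to_resolve"])  # 0:20:00 1:45:00
```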