Predictive SRE Agent — Stop cascading failures before they happen
Echo is an AI-powered SRE agent that models microservice dependencies as temporal graphs, learns cascade failure patterns using a GNN + LSTM architecture, and autonomously fires remediation actions 5–30 minutes before impact — turning reactive incident response into proactive prevention. When a prediction fires, Echo simultaneously:
- Posts a rich Slack Block Kit alert to #sre-alerts with an AI-generated incident summary
- Creates a structured Notion incident page in your database
- Raises an autonomous GitHub PR with a remediation report on the target repo
Microservices (Auth → Orders → Payments → Ledger + dynamic services)
│
│ Metrics (latency, errors, RPS)
│ via direct /metrics scrape → Prometheus
▼
┌──────────────────────────────────────────────────────┐
│ ECHO INFERENCE LOOP (2s) │
│ │
│ Prometheus ──▶ Normalizer ──▶ Graph Manager ──▶ │
│ (cached 15s) (clamp/NaN) (feature deques) │
│ │ │
│ GCN + LSTM ◀─────┘ │
│ (predict) │
│ │ │
│ Policy Engine │
│ (4-tier severity) │
│ │ │
│ Action Executor │
│ (scale / circuit-break │
│ / alert / dry-run) │
│ │ │
│ Groq LLM Engine │
│ (incident summary + memory) │
│ │ │
│ Notifier │
│ (Slack · Notion · PR) │
└──────────────────────────────────────────────────────┘
│ │
▼ ▼
React Dashboard Kubernetes API
(WebSocket, 2s) (real or dry-run)
Every 2 seconds, Echo:
- Reads cached Prometheus metrics (polled every 15s, reused between ticks)
- Normalizes — clamps to safe bounds, handles NaN/Inf, falls back to last-good
- Updates the sliding-window service graph (16-dim node features × 60-tick history)
- Runs inference through GCN → LSTM → Attention with three prediction heads
- Evaluates a 4-tier severity policy (critical / high / medium / low)
- Executes remediation actions — or simulates them in dry-run mode. Live mode requires a real Kubernetes client.
- Generates an incident summary using Groq's LLM (llama-3.1), augmented by a rolling memory of the last 5 incidents to detect cascading patterns
- Notifies via Slack, Notion, and GitHub (with per-service cooldown)
- Broadcasts the full state over WebSocket to the live dashboard
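The normalization step in the list above (clamp, NaN/Inf handling, last-good fallback) could look roughly like the sketch below. The metric names and bounds in `BOUNDS` are hypothetical placeholders, not the values used in `backend/ingestion/normalizer.py`.

```python
import math
from typing import Dict, Optional, Tuple

# Hypothetical safe bounds per metric (assumption, for illustration only).
BOUNDS: Dict[str, Tuple[float, float]] = {
    "latency_p99_ms": (0.0, 60_000.0),
    "error_rate": (0.0, 1.0),
    "rps": (0.0, 1_000_000.0),
}


def normalize(
    raw: Dict[str, Optional[float]], last_good: Dict[str, float]
) -> Tuple[Dict[str, float], bool]:
    """Return (clean metrics, stale flag); stale is True if any value fell back."""
    clean: Dict[str, float] = {}
    stale = False
    for name, (lo, hi) in BOUNDS.items():
        value = raw.get(name)
        if value is None or not math.isfinite(value):
            # Missing, NaN, or Inf: reuse the last good reading,
            # or the lower bound when no good reading exists yet.
            value = last_good.get(name, lo)
            stale = True
        # Clamp every value into its safe range.
        clean[name] = min(max(value, lo), hi)
    return clean, stale
```

A caller would typically store `clean` as the next `last_good` only when the stale flag is false, so that fallback values are never treated as fresh telemetry.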
echo_/
├── backend/
│ ├── api/
│ │ └── main.py # FastAPI app, REST + WebSocket endpoints
│ ├── core/
│ │ └── orchestrator.py # 2s deterministic tick loop, TickContext, warmup
│ ├── ml/
│ │ ├── echo_model.py # GCNConv×2 → LSTM×2 → Attention → 3 heads
│ │ └── inference_wrapper.py # Thread-safe async inference with tensor cloning
│ ├── graph/
│ │ ├── features.py # NodeFeatures + EdgeFeatures (shared dataclasses)
│ │ ├── state_manager.py # Static/dynamic graph topology + feature deques
│ │ ├── graph_builder.py # Physics simulator (OFFLINE training only)
│ │ └── snapshot.py # Redis state persistence (zstd + orjson)
│ ├── ingestion/
│ │ ├── prometheus.py # Real PromQL queries, 15s async poll cache
│ │ └── normalizer.py # Clamp, NaN handling, stale fallback
│ ├── executor/
│ │ ├── policy_engine.py # 4-tier severity + rate limiter
│ │ ├── action_executor.py # K8s scale actions + dry-run + rollback
│ │ └── notifier.py # Slack · Notion · GitHub PR (w/ Groq LLM + memory)
│ ├── config/
│ │ └── settings.py # Pydantic-settings, all env vars
│ └── requirements.txt
│
├── frontend/
│ └── src/
│ ├── App.tsx # React Router v6 entry
│ ├── store/echo.ts # Zustand store (graph, predictions, actions)
│ ├── hooks/useWebSocket.ts # Auto-reconnect WebSocket client
│ ├── components/ # ServiceGraph, PredictionPanel, ActionLog, etc.
│ └── pages/ # Landing, Dashboard, Incidents, Services, Settings
│
├── services/
│ ├── auth/ # :8001 — token issuing + fault injection
│ ├── orders/ # :8002 — depends on auth
│ ├── payments/ # :8003 — depends on orders
│ └── ledger/ # :8004 — depends on payments
│
├── scripts/
│ ├── chaos/ # Fault injection scenarios (3 scripts)
│ ├── demo/ # start-demo.sh, reset-demo.sh, test-notifications.sh
│ ├── generate-training-data.py # 2000 labeled samples
│ └── train-model.py # PyTorch training loop
│
├── docker/
│ ├── backend.Dockerfile
│ ├── otel-collector-config.yaml
│ └── prometheus.yml
│
├── models/
│ └── echo_model_best.ckpt # Pre-trained weights
├── docker-compose.yml # Full stack: all 10 services
├── .env.example # All supported environment variables
└── k8s/kind-config.yaml # Kind cluster (1 control-plane + 2 workers)
Echo uses a Temporal Graph Neural Network combining spatial graph learning with temporal sequence modeling:
Input: 30 graph snapshots × N services × 16 features
│
▼
┌──────────────────────┐
│ GCNConv × 2 │ ← Propagate signals across service edges per snapshot
│ (per snapshot) │
└──────────┬───────────┘
│ Global mean pool → one embedding per timestep
▼
┌──────────────────────┐
│ LSTM × 2 │ ← Learn temporal degradation patterns over 30 ticks
│ hidden=64, drop=0.2 │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ Temporal Attention │ ← Attend to the most anomalous timesteps
└──────────┬───────────┘
│
┌────┼────┐
▼ ▼ ▼
P(fail) TTF Conf ← sigmoid · ReLU · sigmoid
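To make the temporal attention stage concrete, here is a dependency-free sketch of softmax attention pooling over per-timestep LSTM outputs. It assumes dot-product scoring against a single `query` vector, which is a hypothetical stand-in for learned parameters; the attention form actually used in `echo_model.py` may differ.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(
    hidden: Sequence[Sequence[float]], query: Sequence[float]
) -> List[float]:
    """Score each timestep by dot(hidden_t, query), then return the weighted sum."""
    scores = [sum(h * q for h, q in zip(step, query)) for step in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    return [sum(w * step[i] for w, step in zip(weights, hidden)) for i in range(dim)]
```

Timesteps whose hidden state aligns with the query receive larger weights, so a few anomalous ticks can dominate the pooled vector that feeds the three prediction heads.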
| Component | Detail |
|---|---|
| Node features | 16-dim: mean + std of latency p50/p99, error rate, RPS, CPU %, memory %, replicas, restarts |
| Graph convolution | 2-layer GCNConv (PyG), fallback to linear if PyG unavailable |
| Temporal model | 2-layer LSTM (hidden=64, dropout=0.2) + softmax attention over 30 steps |
| Prediction heads | Failure probability (sigmoid), time-to-failure minutes (ReLU), confidence (sigmoid) |
| Cascade path | Beam-search PathRanker over node embeddings + edge structure |
| Training | PyTorch, BCE + MSE + calibration loss, cosine LR scheduler |
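The 16-dimensional node feature in the table (mean and standard deviation of eight metrics) can be sketched with a bounded deque, as below. The metric key names and the interleaved mean/std ordering are assumptions for illustration; the real layout lives in `backend/graph/features.py`.

```python
from collections import deque
from statistics import fmean, pstdev
from typing import Deque, Dict, List

# Hypothetical metric keys; eight metrics x (mean, std) = 16 features.
METRICS = (
    "latency_p50", "latency_p99", "error_rate", "rps",
    "cpu_pct", "memory_pct", "replicas", "restarts",
)


class FeatureWindow:
    """Sliding window of raw samples for one service."""

    def __init__(self, maxlen: int = 60) -> None:
        # deque(maxlen=...) evicts the oldest tick automatically.
        self.history: Deque[Dict[str, float]] = deque(maxlen=maxlen)

    def push(self, sample: Dict[str, float]) -> None:
        self.history.append(sample)

    def vector(self) -> List[float]:
        """Return [mean, std] per metric, or zeros before any sample arrives."""
        if not self.history:
            return [0.0] * (2 * len(METRICS))
        out: List[float] = []
        for metric in METRICS:
            values = [sample[metric] for sample in self.history]
            out.append(fmean(values))
            out.append(pstdev(values))
        return out
```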
Predictions pass through a 4-tier policy system with built-in safety guards:
| Severity | Trigger | Actions |
|---|---|---|
| 🔴 Critical | P(fail) ≥ 90% | Scale to 4 replicas + circuit-breaker + downstream pre-scale |
| 🟠 High | P(fail) ≥ 70% | Scale to 3 replicas + page on-call |
| 🟡 Medium | P(fail) ≥ 50% | Pre-warm 2 replicas + warning alert |
| 🔵 Low | P(fail) ≥ 30% | Info alert, continue monitoring |
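The thresholds in the table translate into a simple first-match lookup. This is a minimal sketch of the mapping only; the function name `classify` is hypothetical and the action lists attached to each tier are left out.

```python
from typing import Optional

# Thresholds taken from the severity table, highest tier first.
TIERS = ((0.90, "critical"), (0.70, "high"), (0.50, "medium"), (0.30, "low"))


def classify(failure_prob: float) -> Optional[str]:
    """Return the first tier whose threshold is met, or None below 30%."""
    for threshold, severity in TIERS:
        if failure_prob >= threshold:
            return severity
    return None
```

For example, `classify(0.73)` falls into the high tier, and anything below 0.30 produces no severity at all.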
Safety mechanisms:
- Shadow mode — predictions are logged but no actions fire (default: enabled)
- Confidence gating — predictions below threshold (default 0.7) are silently dropped
- Warmup suppression — first 30 ticks (~60s) suppress all actions to let LSTM fill
- Rate limiting — max 3 actions per service per 5-minute window
- Stale degradation — confidence halved if telemetry is stale for 3+ consecutive ticks
- Dry-run mode — all actions simulated by default; no infrastructure is touched
- Notifier cooldown — one notification set per service per 5 minutes (configurable)
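Several of the guards above compose naturally into one gate. The sketch below covers warmup suppression, stale degradation, confidence gating, and the per-service rate limit; shadow mode, dry-run, and the notifier cooldown are omitted. The class name `SafetyGate`, its parameters, and the assumption that ticks count from 1 are illustrative, not taken from `policy_engine.py`.

```python
import time
from collections import defaultdict, deque
from typing import Deque, Dict, Optional


class SafetyGate:
    def __init__(
        self,
        confidence_threshold: float = 0.7,
        warmup_ticks: int = 30,
        max_actions: int = 3,
        window_seconds: float = 300.0,
        stale_ticks: int = 3,
    ) -> None:
        self.confidence_threshold = confidence_threshold
        self.warmup_ticks = warmup_ticks
        self.max_actions = max_actions
        self.window_seconds = window_seconds
        self.stale_ticks = stale_ticks
        # Timestamps of recently allowed actions, per service.
        self._recent: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(
        self,
        service: str,
        confidence: float,
        tick_id: int,
        stale_streak: int,
        now: Optional[float] = None,
    ) -> bool:
        now = time.monotonic() if now is None else now
        # Warmup suppression: assumes tick_id starts at 1.
        if tick_id <= self.warmup_ticks:
            return False
        # Stale degradation: halve confidence after 3+ stale ticks.
        if stale_streak >= self.stale_ticks:
            confidence *= 0.5
        # Confidence gating.
        if confidence < self.confidence_threshold:
            return False
        # Sliding-window rate limit: drop timestamps older than the window.
        recent = self._recent[service]
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_actions:
            return False
        recent.append(now)
        return True
```

Passing `now` explicitly keeps the gate deterministic in tests; in a live loop the monotonic clock default avoids problems with wall-clock adjustments.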
When a plan is executed, Echo fires three real HTTP calls concurrently:
Posts a rich Slack Block Kit message to #sre-alerts, including:
- Severity header, service name, failure probability, ETF countdown
- Cascade propagation path
- Actions taken (scale-up targets, circuit-breaker mode)
- Link to the auto-raised GitHub PR
Creates a structured incident page in your Notion database with these properties:
Severity, Service, Status (Open), Failure Prob, ETF Minutes, Cascade Path, PR URL
The page body contains three sections: Incident Summary · Actions Taken · Reasoning.
Raises a GitHub pull request on the configured repo:
- Branch: echo-fix/{service}/{prediction_id[:8]}
- File: remediation/{service}-{pred_id[:8]}.md (full remediation report)
- PR title: [Echo Auto-Fix] CRITICAL: auth degradation detected
- Merging the PR = acknowledging the incident
Setup instructions are in the docstring at the top of backend/executor/notifier.py.
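The concurrent fan-out to the three integrations can be sketched with `asyncio.gather`. The three sender coroutines here are stubs standing in for real HTTP calls, and every function name is hypothetical; only the branch and file naming patterns come from the text above.

```python
import asyncio
from typing import Dict, List


def branch_name(service: str, prediction_id: str) -> str:
    # Pattern from the README: echo-fix/{service}/{prediction_id[:8]}
    return f"echo-fix/{service}/{prediction_id[:8]}"


def report_path(service: str, prediction_id: str) -> str:
    # Pattern from the README: remediation/{service}-{pred_id[:8]}.md
    return f"remediation/{service}-{prediction_id[:8]}.md"


async def send_slack(incident: Dict[str, str]) -> str:
    await asyncio.sleep(0)  # stub: a real sender would POST to the webhook
    return "slack: sent"


async def send_notion(incident: Dict[str, str]) -> str:
    await asyncio.sleep(0)  # stub: a real sender would call the Notion API
    return "notion: sent"


async def open_pr(incident: Dict[str, str]) -> str:
    await asyncio.sleep(0)  # stub: a real sender would call the GitHub API
    return "github: " + branch_name(incident["service"], incident["prediction_id"])


async def notify_all(incident: Dict[str, str]) -> List[str]:
    # return_exceptions=True keeps one failing integration from cancelling the others.
    results = await asyncio.gather(
        send_slack(incident),
        send_notion(incident),
        open_pr(incident),
        return_exceptions=True,
    )
    return [r if isinstance(r, str) else f"failed: {r!r}" for r in results]
```

Collecting exceptions instead of raising them is a deliberate choice in this sketch: a Notion outage should not prevent the Slack alert from going out.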
| Tool | Min version |
|---|---|
| Docker | 20+ (Compose v2 included) |
| Python | 3.11+ (local dev only) |
| Node.js | 20+ (frontend dev only) |
git clone https://github.com/your-username/echo-onSRE
cd echo-onSRE
cp .env.example .env # fill in Slack/Notion/GitHub tokens to enable notifications
docker-compose up --build -d
The dashboard opens at http://localhost:5173.
# Health check
curl http://localhost:8000/health
# Check current topology
curl http://localhost:8000/api/services
# Check notification integrations
curl http://localhost:8000/api/notifications/status
# Non-UI MVP demo: runtime service growth + inference/action path
./scripts/demo/non-ui-mvp-demo.sh

| Service | URL |
|---|---|
| Dashboard | localhost:5173 |
| Landing Page | localhost:5173/ |
| API + Swagger | localhost:8000/docs |
| Prometheus | localhost:9090 |
| Jaeger | localhost:16686 |
# Backend
python3 -m venv venv && source venv/bin/activate
pip install -r backend/requirements.txt
PYTHONPATH=. uvicorn backend.api.main:app --reload --port 8000
# Frontend (separate terminal)
cd frontend && npm install && npm run dev
docker-compose up --build -d
curl http://localhost:8000/health
# → {"status": "ok", "tick_id": 1, "warmup_remaining": 30, ...}
Wait ~60s for warmup to complete (warmup_remaining: 0).
# Latency spike on auth service
curl -X POST http://localhost:8000/api/inject/auth \
-H "Content-Type: application/json" \
-d '{"type": "latency", "magnitude": 0.9}'
# Poll predictions (or watch the dashboard at localhost:5173/dashboard/overview)
curl http://localhost:8000/api/graph | python -m json.tool
curl -X POST http://localhost:8000/api/predict | python -m json.tool
curl -X POST http://localhost:8000/api/services \
-H "Content-Type: application/json" \
-d '{"name": "inventory", "edges": [["orders", "inventory"], ["payments", "inventory"]]}'
# → {"status": "registered", "service": "inventory", "edges_added": 2, ...}
For the curated non-UI demo flow, see DEMO.md or run:
./scripts/demo/non-ui-mvp-demo.sh
# Test all integrations (fires a test alert)
curl -X POST http://localhost:8000/api/notifications/test
# Clear a specific fault
curl -X DELETE http://localhost:8000/api/inject/auth
# Or reset all services
for svc in auth orders payments ledger; do
curl -X DELETE "http://localhost:8000/api/inject/$svc"
done

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/graph | Current graph state (nodes, edges, metrics) |
| POST | /api/predict | Trigger on-demand inference |
| GET | /api/actions | Action execution history |
| GET | /api/config | Runtime configuration |
| PATCH | /api/config | Update dry_run, confidence_threshold |
| GET | /api/services | List current topology (services + edges) |
| POST | /api/services | Register a new service at runtime |
| POST | /api/inject/{service} | Inject fault (type: latency/error_rate/cpu/traffic_spike, magnitude: 0-1) |
| DELETE | /api/inject/{service} | Clear fault injection for a service |
| GET | /api/notifications/status | Which integrations are configured |
| POST | /api/notifications/test | Fire test notification to all integrations |
| GET | /health | Health check with tick_id, staleness, warmup status |
| WS | /ws | Live graph + prediction stream (2s interval) |
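For scripted chaos runs, the fault-injection endpoint can be driven from Python instead of curl. The sketch below only builds and validates the request; `build_inject_request` and `BASE_URL` are hypothetical names, while the fault types and the 0-1 magnitude range come from the endpoint table.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: local docker-compose stack
FAULT_TYPES = {"latency", "error_rate", "cpu", "traffic_spike"}


def build_inject_request(
    service: str, fault_type: str, magnitude: float
) -> urllib.request.Request:
    """Build a POST /api/inject/{service} request without sending it."""
    if fault_type not in FAULT_TYPES:
        raise ValueError(f"unknown fault type: {fault_type!r}")
    if not 0.0 <= magnitude <= 1.0:
        raise ValueError("magnitude must be between 0 and 1")
    body = json.dumps({"type": fault_type, "magnitude": magnitude}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/api/inject/{service}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending is then a matter of passing the result to `urllib.request.urlopen`, which keeps validation separate from network I/O.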
The frontend connects to ws://localhost:8000/ws. The backend pushes a JSON payload every 2 seconds:
{
"type": "update",
"graph": { "nodes": [...], "edges": [...], "tick_id": 42 },
"prediction": {
"failure_prob": 0.73,
"time_to_failure_minutes": 8.2,
"confidence": 0.85,
"cascade_path": [0, 1, 2],
"node_names": ["auth", "orders", "payments"]
},
"action_plan": { "severity": "high", "actions": [...] },
"dry_run": true
}
On connect, the server sends a "type": "snapshot" message with the current state.
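In the sample update payload, `cascade_path` looks like a list of indices into `node_names`. Under that assumption (the README does not state it explicitly), a client could render the path as service names with a helper like this hypothetical `describe_cascade`:

```python
import json


def describe_cascade(message: str) -> str:
    """Render cascade_path as 'a -> b -> c', assuming entries index node_names."""
    data = json.loads(message)
    # Snapshot or error frames may carry no prediction; treat that as empty.
    prediction = data.get("prediction") or {}
    names = prediction.get("node_names", [])
    path = prediction.get("cascade_path", [])
    # Out-of-range indices get a placeholder instead of raising.
    hops = [names[i] if 0 <= i < len(names) else f"node-{i}" for i in path]
    return " -> ".join(hops)
```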
If the backend loop encounters 5+ consecutive errors, it sends:
{ "type": "error", "message": "...", "consecutive_errors": 5 }
Connecting from any client:
# Python (websockets)
import asyncio
import json
import websockets

async def main():
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "error":
                continue  # error frames carry no graph payload
            prediction = data.get("prediction") or {}
            print(f"Tick {data['graph']['tick_id']}: failure_prob={prediction.get('failure_prob', 0):.3f}")

asyncio.run(main())
// JavaScript (browser or Node.js)
const ws = new WebSocket('ws://localhost:8000/ws');
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  console.log('Tick:', data.graph?.tick_id, 'Failure:', data.prediction?.failure_prob);
};
All settings are managed via environment variables (see .env.example):
| Variable | Default | Description |
|---|---|---|
| DRY_RUN | true | Simulate actions without touching infrastructure |
| SHADOW_MODE | true | Log predictions and suppress actions (for calibration) |
| CONFIDENCE_THRESHOLD | 0.7 | Minimum model confidence to trigger policy |
| GRAPH_UPDATE_INTERVAL | 2.0 | Seconds between inference cycles (LOCKED to LSTM training) |
| MODEL_CHECKPOINT | models/echo_model_best.ckpt | Path to trained weights |
| PREDICTION_HORIZON_MINUTES | 15 | Look-ahead window for TTF estimates |
| K8S_NAMESPACE | echo-demo | Target Kubernetes namespace for real actions |
| K8S_IN_CLUSTER | false | Set true when running inside a cluster |
| SLACK_WEBHOOK_URL | "" | Incoming webhook URL for #sre-alerts |
| NOTION_TOKEN | "" | Notion internal integration token |
| NOTION_DATABASE_ID | "" | 32-char hex ID of your incident database |
| GITHUB_TOKEN | "" | PAT with repo or public_repo scope |
| GITHUB_REPO | "" | Target repo in owner/repo format |
| NOTIFIER_COOLDOWN | 300 | Seconds between notifications per service |
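The project reads these variables through pydantic-settings in `backend/config/settings.py`. As a rough illustration of the same idea without that dependency, the sketch below parses a handful of them with the standard library; `EchoSettings` and `_as_bool` are hypothetical names, and only the defaults are taken from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; everything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EchoSettings:
    dry_run: bool = True
    shadow_mode: bool = True
    confidence_threshold: float = 0.7
    graph_update_interval: float = 2.0
    notifier_cooldown: int = 300

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "EchoSettings":
        # Unset variables fall back to the documented defaults.
        return cls(
            dry_run=_as_bool(env.get("DRY_RUN", "true")),
            shadow_mode=_as_bool(env.get("SHADOW_MODE", "true")),
            confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.7")),
            graph_update_interval=float(env.get("GRAPH_UPDATE_INTERVAL", "2.0")),
            notifier_cooldown=int(env.get("NOTIFIER_COOLDOWN", "300")),
        )
```

Accepting any mapping rather than reading `os.environ` directly makes the parser easy to exercise with a plain dict.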
# Generate 2,000 labeled graph snapshots
python scripts/generate-training-data.py
# Train — cosine LR, best checkpoint auto-saved
python scripts/train-model.py
# Output: models/echo_model_best.ckpt
Training data distribution: 20% healthy + 15% recovery + 10% near-failure + 12% delayed cascade + 43% standard cascade.
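Applied to the 2,000 generated samples, that distribution works out to concrete per-class counts. The helper below is a small sketch of the arithmetic; the class keys and the rule of assigning any rounding remainder to the largest class are assumptions, not taken from `generate-training-data.py`.

```python
from typing import Dict

# Percentages from the stated training data distribution.
DISTRIBUTION: Dict[str, int] = {
    "healthy": 20,
    "recovery": 15,
    "near_failure": 10,
    "delayed_cascade": 12,
    "standard_cascade": 43,
}


def samples_per_class(total: int, percentages: Dict[str, int]) -> Dict[str, int]:
    """Split `total` samples by integer percentages that sum to 100."""
    if sum(percentages.values()) != 100:
        raise ValueError("percentages must sum to 100")
    counts = {name: total * pct // 100 for name, pct in percentages.items()}
    # Assumption: any rounding remainder goes to the largest class.
    remainder = total - sum(counts.values())
    largest = max(percentages, key=lambda name: percentages[name])
    counts[largest] += remainder
    return counts
```

With `total=2000` this gives 400 healthy, 300 recovery, 200 near-failure, 240 delayed-cascade, and 860 standard-cascade samples.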
| Layer | Technology |
|---|---|
| ML | PyTorch 2.3, PyTorch Geometric |
| Backend | FastAPI 0.111, Uvicorn, WebSockets, Pydantic Settings, httpx |
| Frontend | React 18, TypeScript, Vite 5, React Router v6, Framer Motion, TanStack Query |
| UI components | ReactFlow 11, Recharts, Zustand, Tailwind CSS v3, Lucide React |
| Integrations | Slack Incoming Webhooks, Notion API, GitHub REST API |
| Observability | Prometheus service metrics, OpenTelemetry traces, Jaeger |
| Infrastructure | Docker Compose, Kubernetes (Kind), Redis |
- Circuit-breaker actions mutate Kubernetes. Echo creates/patches an Istio DestinationRule when Istio is installed, otherwise it records the breaker state in ConfigMap/echo-circuit-breakers.
- DRY_RUN=false enables live Kubernetes actions. Keep DRY_RUN=true for notification-only demos.
- No mock Kubernetes fallback. If DRY_RUN=false, a real Kubernetes client must be available or actions fail explicitly.
- Model checkpoint is required by default. Set ALLOW_UNTRAINED_MODEL=true only for local demo work without trained weights.
- No authentication. The API and WebSocket are open. Add an API key middleware before any public deployment.
- Dynamic services don't persist across restarts. Services added via POST /api/services exist only in memory.
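As a starting point for the missing authentication, the core check of an API key middleware might look like the sketch below. The `x-api-key` header name and the `is_authorized` helper are hypothetical; Echo ships nothing like this today, and wiring it into FastAPI is left out.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_key: Optional[str]) -> bool:
    """Check a request's API key; assumes header names are already lowercased."""
    if not expected_key:
        # Fail closed: with no key configured, nothing is authorized.
        return False
    supplied = headers.get("x-api-key", "")
    # compare_digest avoids leaking key contents through timing differences.
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))
```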
MIT