⚡ Echo

Predictive SRE Agent — Stop cascading failures before they happen

Echo is an AI-powered SRE agent that models microservice dependencies as temporal graphs, learns cascade failure patterns using a GNN + LSTM architecture, and autonomously fires remediation actions 5–30 minutes before impact — turning reactive incident response into proactive prevention. When a prediction fires, Echo simultaneously:

Posts a rich Slack Block Kit alert to #sre-alerts with an AI-generated incident summary
Creates a structured Notion incident page in your database
Raises an autonomous GitHub PR with a remediation report on the target repo

How It Works

  Microservices (Auth → Orders → Payments → Ledger + dynamic services)
         │
         │  Metrics (latency, errors, RPS)
         │  via direct /metrics scrape → Prometheus
         ▼
  ┌──────────────────────────────────────────────────────┐
  │                ECHO  INFERENCE  LOOP  (2s)           │
  │                                                      │
  │  Prometheus ──▶ Normalizer ──▶ Graph Manager ──▶     │
  │  (cached 15s)   (clamp/NaN)    (feature deques)      │
  │                                      │               │
  │                     GCN + LSTM ◀─────┘               │
  │                     (predict)                        │
  │                         │                            │
  │                  Policy Engine                       │
  │               (4-tier severity)                      │
  │                         │                            │
  │                  Action Executor                     │
  │               (scale / circuit-break                 │
  │                / alert / dry-run)                    │
  │                         │                            │
  │                  Groq LLM Engine                     │
  │              (incident summary + memory)             │
  │                         │                            │
  │                    Notifier                          │
  │               (Slack · Notion · PR)                  │
  └──────────────────────────────────────────────────────┘
         │                          │
         ▼                          ▼
  React Dashboard           Kubernetes API
  (WebSocket, 2s)           (real or dry-run)

Every 2 seconds, Echo:

Reads cached Prometheus metrics (polled every 15s, reused between ticks)
Normalizes — clamps to safe bounds, handles NaN/Inf, falls back to last-good
Updates the sliding-window service graph (16-dim node features × 60-tick history)
Runs inference through GCN → LSTM → Attention with three prediction heads
Evaluates a 4-tier severity policy (critical / high / medium / low)
Executes remediation actions — or simulates them in dry-run mode. Live mode requires a real Kubernetes client.
Generates an intelligent incident summary using Groq's LLM (llama-3.1), augmented by a rolling memory of the last 5 incidents to detect cascading patterns
Notifies via Slack, Notion, and GitHub (with per-service cooldown)
Broadcasts the full state over WebSocket to the live dashboard
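
The normalization step (clamp, NaN/Inf handling, last-good fallback) might look roughly like the sketch below. The metric names and bounds in `BOUNDS` and the `Normalizer` class are illustrative assumptions, not the contents of backend/ingestion/normalizer.py.

```python
import math
from typing import Dict, Optional

# Hypothetical bounds for illustration; the real limits are defined by the project.
BOUNDS = {
    "latency_p99_ms": (0.0, 10_000.0),
    "error_rate": (0.0, 1.0),
    "rps": (0.0, 100_000.0),
}


class Normalizer:
    """Clamp metrics to safe bounds and fall back to the last good value."""

    def __init__(self) -> None:
        self._last_good: Dict[str, float] = {}

    def normalize(self, name: str, value: Optional[float]) -> float:
        low, high = BOUNDS[name]
        if value is None or not math.isfinite(value):
            # NaN, +/-Inf, or a missing sample: reuse the last good reading,
            # or the lower bound if nothing has been seen yet (an assumption).
            return self._last_good.get(name, low)
        clamped = min(max(value, low), high)
        self._last_good[name] = clamped
        return clamped
```

Keeping the last good value per metric means a single bad scrape does not inject a spurious zero into the feature history.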

Architecture

echo_/
├── backend/
│   ├── api/
│   │   └── main.py                 # FastAPI app, REST + WebSocket endpoints
│   ├── core/
│   │   └── orchestrator.py         # 2s deterministic tick loop, TickContext, warmup
│   ├── ml/
│   │   ├── echo_model.py           # GCNConv×2 → LSTM×2 → Attention → 3 heads
│   │   └── inference_wrapper.py    # Thread-safe async inference with tensor cloning
│   ├── graph/
│   │   ├── features.py             # NodeFeatures + EdgeFeatures (shared dataclasses)
│   │   ├── state_manager.py        # Static/dynamic graph topology + feature deques
│   │   ├── graph_builder.py        # Physics simulator (OFFLINE training only)
│   │   └── snapshot.py             # Redis state persistence (zstd + orjson)
│   ├── ingestion/
│   │   ├── prometheus.py           # Real PromQL queries, 15s async poll cache
│   │   └── normalizer.py           # Clamp, NaN handling, stale fallback
│   ├── executor/
│   │   ├── policy_engine.py        # 4-tier severity + rate limiter
│   │   ├── action_executor.py      # K8s scale actions + dry-run + rollback
│   │   └── notifier.py             # Slack · Notion · GitHub PR (w/ Groq LLM + memory)
│   ├── config/
│   │   └── settings.py             # Pydantic-settings, all env vars
│   └── requirements.txt
│
├── frontend/
│   └── src/
│       ├── App.tsx                  # React Router v6 entry
│       ├── store/echo.ts            # Zustand store (graph, predictions, actions)
│       ├── hooks/useWebSocket.ts    # Auto-reconnect WebSocket client
│       ├── components/              # ServiceGraph, PredictionPanel, ActionLog, etc.
│       └── pages/                   # Landing, Dashboard, Incidents, Services, Settings
│
├── services/
│   ├── auth/                        # :8001 — token issuing + fault injection
│   ├── orders/                      # :8002 — depends on auth
│   ├── payments/                    # :8003 — depends on orders
│   └── ledger/                      # :8004 — depends on payments
│
├── scripts/
│   ├── chaos/                       # Fault injection scenarios (3 scripts)
│   ├── demo/                        # start-demo.sh, reset-demo.sh, test-notifications.sh
│   ├── generate-training-data.py    # 2000 labeled samples
│   └── train-model.py               # PyTorch training loop
│
├── docker/
│   ├── backend.Dockerfile
│   ├── otel-collector-config.yaml
│   └── prometheus.yml
│
├── models/
│   └── echo_model_best.ckpt         # Pre-trained weights
├── docker-compose.yml               # Full stack: all 10 services
├── .env.example                     # All supported environment variables
└── k8s/kind-config.yaml             # Kind cluster (1 control-plane + 2 workers)

ML Model

Echo uses a Temporal Graph Neural Network combining spatial graph learning with temporal sequence modeling:

Input: 30 graph snapshots × N services × 16 features
    │
    ▼
┌──────────────────────┐
│  GCNConv × 2         │  ← Propagate signals across service edges per snapshot
│  (per snapshot)      │
└──────────┬───────────┘
           │  Global mean pool → one embedding per timestep
           ▼
┌──────────────────────┐
│  LSTM × 2            │  ← Learn temporal degradation patterns over 30 ticks
│  hidden=64, drop=0.2 │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Temporal Attention  │  ← Attend to the most anomalous timesteps
└──────────┬───────────┘
           │
      ┌────┼────┐
      ▼    ▼    ▼
   P(fail) TTF Conf     ← sigmoid · ReLU · sigmoid
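
To make the temporal attention stage concrete, here is a dependency-free sketch of softmax attention pooling over per-timestep hidden states. Scoring each timestep by a dot product with a learned query vector is an assumption; the README only states that softmax attention is applied over the LSTM outputs, and the real model uses PyTorch tensors rather than Python lists.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden: Sequence[Sequence[float]],
                   query: Sequence[float]) -> List[float]:
    """Return the attention-weighted sum of per-timestep hidden states.

    `query` stands in for a learned parameter (hypothetical here).
    """
    scores = [sum(h_i * q_i for h_i, q_i in zip(h, query)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    return [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
```

Timesteps whose hidden state aligns with the query receive most of the weight, which is how a few anomalous ticks can dominate the pooled summary.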

Component	Detail
Node features	16-dim: mean + std of latency p50/p99, error rate, RPS, CPU %, memory %, replicas, restarts
Graph convolution	2-layer GCNConv (PyG), fallback to linear if PyG unavailable
Temporal model	2-layer LSTM (hidden=64, dropout=0.2) + softmax attention over 30 steps
Prediction heads	Failure probability (sigmoid), time-to-failure minutes (ReLU), confidence (sigmoid)
Cascade path	Beam-search PathRanker over node embeddings + edge structure
Training	PyTorch, BCE + MSE + calibration loss, cosine LR scheduler
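
The 16-dimensional node vector follows from taking a mean and a standard deviation for each of the eight metrics. A minimal sketch of that computation over a sliding window is shown below. The metric key names, the interleaved (mean, std) ordering, the use of population standard deviation, and the `FeatureWindow` class are all assumptions for illustration.

```python
from collections import deque
from statistics import fmean, pstdev
from typing import Deque, Dict, List

# Eight metrics listed in the table; the key names are hypothetical.
METRICS = ["latency_p50", "latency_p99", "error_rate", "rps",
           "cpu_pct", "memory_pct", "replicas", "restarts"]


class FeatureWindow:
    """Per-service sliding window that yields a 16-dim feature vector."""

    def __init__(self, maxlen: int = 60) -> None:
        self._history: Dict[str, Deque[float]] = {
            m: deque(maxlen=maxlen) for m in METRICS
        }

    def push(self, sample: Dict[str, float]) -> None:
        for metric in METRICS:
            self._history[metric].append(sample[metric])

    def vector(self) -> List[float]:
        out: List[float] = []
        for metric in METRICS:
            values = self._history[metric]
            if not values:
                out.extend([0.0, 0.0])  # no data yet: neutral placeholder
                continue
            out.append(fmean(values))   # mean over the window
            out.append(pstdev(values))  # population std over the window
        return out
```

Because each deque has a fixed `maxlen`, old samples drop off automatically and memory stays bounded per service.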

Policy Engine

Predictions pass through a 4-tier policy system with built-in safety guards:

Severity	Trigger	Actions
🔴 Critical	P(fail) ≥ 90%	Scale to 4 replicas + circuit-breaker + downstream pre-scale
🟠 High	P(fail) ≥ 70%	Scale to 3 replicas + page on-call
🟡 Medium	P(fail) ≥ 50%	Pre-warm 2 replicas + warning alert
🔵 Low	P(fail) ≥ 30%	Info alert, continue monitoring
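
The tier thresholds in the table map onto a simple ordered lookup. The sketch below uses a hypothetical `classify` helper and is not the code in backend/executor/policy_engine.py.

```python
from typing import Optional

# Thresholds from the severity table, highest tier first.
TIERS = [(0.90, "critical"), (0.70, "high"), (0.50, "medium"), (0.30, "low")]


def classify(failure_prob: float) -> Optional[str]:
    """Return the first tier whose floor the probability meets, else None."""
    for floor, severity in TIERS:
        if failure_prob >= floor:
            return severity
    return None  # below 30%: no tier applies
```

Checking tiers from highest to lowest guarantees that a prediction is assigned exactly one severity.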

Safety mechanisms:

Shadow mode — predictions are logged but no actions fire (default: enabled)
Confidence gating — predictions below threshold (default 0.7) are silently dropped
Warmup suppression — first 30 ticks (~60s) suppress all actions to let LSTM fill
Rate limiting — max 3 actions per service per 5-minute window
Stale degradation — confidence halved if telemetry is stale for 3+ consecutive ticks
Dry-run mode — all actions simulated by default; no infrastructure is touched
Notifier cooldown — one notification set per service per 5 minutes (configurable)
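
Several of these guards compose into a single gate in front of the executor. The following sketch assumes a sliding-window rate limiter, a one-time halving of confidence for stale telemetry, and suppression while `tick_id <= 30`; the real ordering and semantics may differ, and all names are hypothetical.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict


def effective_confidence(confidence: float, stale_ticks: int) -> float:
    """Halve confidence once telemetry has been stale for 3+ consecutive ticks."""
    return confidence * 0.5 if stale_ticks >= 3 else confidence


class RateLimiter:
    """Allow at most `max_actions` per service inside a sliding window."""

    def __init__(self, max_actions: int = 3, window_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_actions = max_actions
        self.window_s = window_s
        self._clock = clock
        self._events: Dict[str, Deque[float]] = defaultdict(deque)

    def allow(self, service: str) -> bool:
        now = self._clock()
        events = self._events[service]
        while events and now - events[0] >= self.window_s:
            events.popleft()  # drop actions that have left the window
        if len(events) >= self.max_actions:
            return False
        events.append(now)
        return True


def should_act(confidence: float, stale_ticks: int, tick_id: int,
               shadow_mode: bool, limiter: RateLimiter, service: str,
               threshold: float = 0.7, warmup_ticks: int = 30) -> bool:
    """Apply shadow mode, warmup, confidence gating, then rate limiting."""
    if shadow_mode or tick_id <= warmup_ticks:
        return False
    if effective_confidence(confidence, stale_ticks) < threshold:
        return False
    return limiter.allow(service)
```

Running the rate limiter last means suppressed predictions do not consume a service's action budget.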

Autonomous Integrations

When a plan is executed, Echo fires three real HTTP calls concurrently:

Slack

Posts a rich Block Kit message to #sre-alerts including:

Severity header, service name, failure probability, ETF countdown
Cascade propagation path
Actions taken (scale-up targets, circuit-breaker mode)
Link to the auto-raised GitHub PR
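
For orientation, a Block Kit payload carrying the fields listed above could be assembled as in this sketch. The block types (`header`, `section`, `mrkdwn`, `plain_text`) are standard Block Kit, but the exact layout, wording, and the `build_slack_alert` helper are assumptions rather than Echo's actual message.

```python
from typing import Dict, List


def build_slack_alert(severity: str, service: str, failure_prob: float,
                      ttf_minutes: float, cascade_path: List[str],
                      pr_url: str) -> Dict:
    """Build a Slack incoming-webhook body using Block Kit blocks."""
    return {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f"{severity.upper()}: {service} degradation predicted",
                },
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn",
                     "text": f"*Failure probability:*\n{failure_prob:.0%}"},
                    {"type": "mrkdwn",
                     "text": f"*Time to failure:*\n{ttf_minutes:.1f} min"},
                ],
            },
            {
                "type": "section",
                "text": {"type": "mrkdwn",
                         "text": "*Cascade path:* " + " → ".join(cascade_path)},
            },
            {
                "type": "section",
                # <url|label> is Slack's mrkdwn link syntax.
                "text": {"type": "mrkdwn",
                         "text": f"<{pr_url}|View remediation PR>"},
            },
        ]
    }
```

The resulting dictionary would be sent as the JSON body of a POST to `SLACK_WEBHOOK_URL`.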

Notion

Creates a structured incident page in your database with the properties Severity, Service, Status (Open), Failure Prob, ETF Minutes, Cascade Path, and PR URL.

The page body has three sections: Incident Summary, Actions Taken, and Reasoning.

GitHub

Raises a pull request on the configured repo:

Branch: echo-fix/{service}/{prediction_id[:8]}
File: remediation/{service}-{prediction_id[:8]}.md (full remediation report)
PR title: [Echo Auto-Fix] CRITICAL: auth degradation detected
Merging the PR acknowledges the incident.
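
The naming scheme above can be expressed as a small helper. `remediation_names` is a hypothetical function for illustration, and the title template is inferred from the single example shown.

```python
from typing import Dict


def remediation_names(service: str, prediction_id: str, severity: str) -> Dict[str, str]:
    """Derive branch, file path, and PR title from a prediction."""
    short_id = prediction_id[:8]  # first 8 characters of the prediction ID
    return {
        "branch": f"echo-fix/{service}/{short_id}",
        "file": f"remediation/{service}-{short_id}.md",
        "title": f"[Echo Auto-Fix] {severity.upper()}: {service} degradation detected",
    }
```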

Setup instructions are in the docstring at the top of backend/executor/notifier.py.
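
The concurrent fan-out and per-service cooldown described in this section can be sketched with `asyncio.gather`. The `Notifier` class below and its sender callables are hypothetical stand-ins for the Slack, Notion, and GitHub calls; it is not the implementation in backend/executor/notifier.py.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List

# A sender takes a service name and performs one integration call.
Sender = Callable[[str], Awaitable[str]]


class Notifier:
    """Fan out to all senders at once, at most once per cooldown per service."""

    def __init__(self, senders: List[Sender], cooldown_s: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._senders = senders
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._last_sent: Dict[str, float] = {}

    async def notify(self, service: str) -> List[object]:
        now = self._clock()
        last = self._last_sent.get(service)
        if last is not None and now - last < self._cooldown_s:
            return []  # still cooling down for this service
        self._last_sent[service] = now
        # return_exceptions=True collects failures as values, so one failing
        # integration does not hide the results of the others.
        return await asyncio.gather(
            *(send(service) for send in self._senders),
            return_exceptions=True,
        )
```

Returning exceptions as values lets the caller log a failed Notion call while still reporting that Slack and GitHub succeeded.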

Quick Start

Prerequisites

Tool	Min version
Docker	20+ (Compose v2 included)
Python	3.11+ (local dev only)
Node.js	20+ (frontend dev only)

One-command demo

git clone https://github.com/your-username/echo-onSRE
cd echo-onSRE
cp .env.example .env        # fill in Slack/Notion/GitHub tokens to enable notifications
docker compose up --build -d

Once the containers are up, the UI is served at http://localhost:5173.

Verify

# Health check
curl http://localhost:8000/health

# Check current topology
curl http://localhost:8000/api/services

# Check notification integrations
curl http://localhost:8000/api/notifications/status

# Non-UI MVP demo: runtime service growth + inference/action path
./scripts/demo/non-ui-mvp-demo.sh

Service	URL
Dashboard	localhost:5173/dashboard/overview
Landing Page	localhost:5173/
API + Swagger	localhost:8000/docs
Prometheus	localhost:9090
Jaeger	localhost:16686

Local development

# Backend
python3 -m venv venv && source venv/bin/activate
pip install -r backend/requirements.txt
PYTHONPATH=. uvicorn backend.api.main:app --reload --port 8000

# Frontend (separate terminal)
cd frontend && npm install && npm run dev

Demo Walkthrough

1. Start the stack

docker compose up --build -d
curl http://localhost:8000/health
# → {"status": "ok", "tick_id": 1, "warmup_remaining": 30, ...}

Wait ~60s for warmup to complete (warmup_remaining: 0).
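
Instead of waiting a fixed 60 seconds, a script can poll `/health` until `warmup_remaining` reaches 0. The sketch below uses only the standard library; the function names, poll interval, and retry budget are assumptions.

```python
import json
import time
import urllib.request
from typing import Callable, Dict


def fetch_health(url: str = "http://localhost:8000/health") -> Dict:
    """GET the health endpoint and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)


def wait_for_warmup(fetch: Callable[[], Dict] = fetch_health,
                    poll_s: float = 2.0, max_polls: int = 60,
                    sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Poll until warmup_remaining is 0, or raise after max_polls attempts."""
    for _ in range(max_polls):
        health = fetch()
        # A missing key is treated as "still warming up".
        if health.get("warmup_remaining", 1) == 0:
            return health
        sleep(poll_s)
    raise TimeoutError("Echo did not finish warmup in time")
```

Passing `fetch` and `sleep` as parameters keeps the loop testable without a running backend.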

2. Inject a failure

# Latency spike on auth service
curl -X POST http://localhost:8000/api/inject/auth \
  -H "Content-Type: application/json" \
  -d '{"type": "latency", "magnitude": 0.9}'

3. Watch Echo predict

# Poll predictions (or watch the dashboard at localhost:5173/dashboard/overview)
curl http://localhost:8000/api/graph | python -m json.tool
curl -X POST http://localhost:8000/api/predict | python -m json.tool

4. Add a 5th service live

curl -X POST http://localhost:8000/api/services \
  -H "Content-Type: application/json" \
  -d '{"name": "inventory", "edges": [["orders", "inventory"], ["payments", "inventory"]]}'
# → {"status": "registered", "service": "inventory", "edges_added": 2, ...}

For the curated non-UI demo flow, see DEMO.md or run:

./scripts/demo/non-ui-mvp-demo.sh

5. Check notifications

# Test all integrations (fires a test alert)
curl -X POST http://localhost:8000/api/notifications/test

6. Reset

# Clear a specific fault
curl -X DELETE http://localhost:8000/api/inject/auth

# Or reset all services
for svc in auth orders payments ledger; do
  curl -X DELETE "http://localhost:8000/api/inject/$svc"
done

API Reference

Method	Endpoint	Description
`GET`	`/api/graph`	Current graph state (nodes, edges, metrics)
`POST`	`/api/predict`	Trigger on-demand inference
`GET`	`/api/actions`	Action execution history
`GET`	`/api/config`	Runtime configuration
`PATCH`	`/api/config`	Update `dry_run`, `confidence_threshold`
`GET`	`/api/services`	List current topology (services + edges)
`POST`	`/api/services`	Register a new service at runtime
`POST`	`/api/inject/{service}`	Inject fault (`type`: latency/error_rate/cpu/traffic_spike, `magnitude`: 0-1)
`DELETE`	`/api/inject/{service}`	Clear fault injection for a service
`GET`	`/api/notifications/status`	Which integrations are configured
`POST`	`/api/notifications/test`	Fire test notification to all integrations
`GET`	`/health`	Health check with tick_id, staleness, warmup status
`WS`	`/ws`	Live graph + prediction stream (2s interval)
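
As an illustration of calling the API from code, the sketch below builds the fault-injection request with the standard library and validates the documented `type` and `magnitude` constraints client-side. `build_inject_request` is a hypothetical helper; whether the server enforces the same checks is not stated here.

```python
import json
import urllib.request

# Fault types listed in the API table.
FAULT_TYPES = {"latency", "error_rate", "cpu", "traffic_spike"}


def build_inject_request(service: str, fault_type: str, magnitude: float,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Return a POST /api/inject/{service} request without sending it."""
    if fault_type not in FAULT_TYPES:
        raise ValueError(f"unknown fault type: {fault_type!r}")
    if not 0.0 <= magnitude <= 1.0:
        raise ValueError("magnitude must be between 0 and 1")
    body = json.dumps({"type": fault_type, "magnitude": magnitude}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/api/inject/{service}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send it, pass the returned object to `urllib.request.urlopen` while the stack is running.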

WebSocket Protocol

The frontend connects to ws://localhost:8000/ws. The backend pushes a JSON payload every 2 seconds:

{
  "type": "update",
  "graph": { "nodes": [...], "edges": [...], "tick_id": 42 },
  "prediction": {
    "failure_prob": 0.73,
    "time_to_failure_minutes": 8.2,
    "confidence": 0.85,
    "cascade_path": [0, 1, 2],
    "node_names": ["auth", "orders", "payments"]
  },
  "action_plan": { "severity": "high", "actions": [...] },
  "dry_run": true
}

On connect, the server sends a "type": "snapshot" message with the current state.

If the backend loop encounters 5+ consecutive errors, it sends:

{ "type": "error", "message": "...", "consecutive_errors": 5 }
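
A client has to distinguish the three message types before reading fields, since error frames carry no graph or prediction. The sketch below assumes that `snapshot` messages share the shape of `update` messages and that `cascade_path` holds indices into `node_names`, as the example payload suggests; `describe` is a hypothetical helper.

```python
import json


def describe(raw: str) -> str:
    """Summarize one WebSocket frame as a single log line."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "error":
        return f"backend error x{msg.get('consecutive_errors')}: {msg.get('message')}"
    if kind in ("snapshot", "update"):
        tick = (msg.get("graph") or {}).get("tick_id")
        prediction = msg.get("prediction") or {}
        names = prediction.get("node_names", [])
        # Resolve path indices to service names, skipping out-of-range entries.
        path = [names[i] for i in prediction.get("cascade_path", [])
                if 0 <= i < len(names)]
        prob = prediction.get("failure_prob")
        prob_text = "n/a" if prob is None else f"{prob:.2f}"
        return f"tick {tick}: P(fail)={prob_text}, path={' -> '.join(path)}"
    return f"unknown message type: {kind!r}"
```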

Connecting from any client:

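# Python (websockets)
import asyncio
import json

import websockets


async def main():
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        async for message in ws:
            data = json.loads(message)
            if data.get("type") == "error":
                # Error frames carry no graph or prediction payload.
                print(f"Backend error: {data.get('message')}")
                continue
            tick = (data.get("graph") or {}).get("tick_id")
            prob = (data.get("prediction") or {}).get("failure_prob") or 0.0
            print(f"Tick {tick}: failure_prob={prob:.3f}")


asyncio.run(main())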

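// JavaScript (browser or Node.js)
const ws = new WebSocket('ws://localhost:8000/ws');
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  if (data.type === 'error') {
    // Error frames carry no graph or prediction payload.
    console.error('Backend error:', data.message);
    return;
  }
  console.log('Tick:', data.graph?.tick_id, 'Failure:', data.prediction?.failure_prob);
};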

Configuration

All settings are managed via environment variables (see .env.example):

Variable	Default	Description
`DRY_RUN`	`true`	Simulate actions without touching infrastructure
`SHADOW_MODE`	`true`	Log predictions and suppress actions (for calibration)
`CONFIDENCE_THRESHOLD`	`0.7`	Minimum model confidence to trigger policy
`GRAPH_UPDATE_INTERVAL`	`2.0`	Seconds between inference cycles (must stay at the tick interval the LSTM was trained on)
`MODEL_CHECKPOINT`	`models/echo_model_best.ckpt`	Path to trained weights
`PREDICTION_HORIZON_MINUTES`	`15`	Look-ahead window for TTF estimates
`K8S_NAMESPACE`	`echo-demo`	Target Kubernetes namespace for real actions
`K8S_IN_CLUSTER`	`false`	Set `true` when running inside a cluster
`SLACK_WEBHOOK_URL`	`""`	Incoming webhook URL for `#sre-alerts`
`NOTION_TOKEN`	`""`	Notion internal integration token
`NOTION_DATABASE_ID`	`""`	32-char hex ID of your incident database
`GITHUB_TOKEN`	`""`	PAT with `repo` or `public_repo` scope
`GITHUB_REPO`	`""`	Target repo in `owner/repo` format
`NOTIFIER_COOLDOWN`	`300`	Seconds between notifications per service
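
To show how these variables and defaults translate into typed settings, here is a standard-library sketch. The project itself uses pydantic-settings in backend/config/settings.py; the `EchoSettings` dataclass, its field subset, and the accepted boolean spellings are assumptions for illustration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    """Interpret common truthy spellings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EchoSettings:
    dry_run: bool
    shadow_mode: bool
    confidence_threshold: float
    graph_update_interval: float
    notifier_cooldown: int

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "EchoSettings":
        # Defaults mirror the table above.
        return cls(
            dry_run=_as_bool(env.get("DRY_RUN", "true")),
            shadow_mode=_as_bool(env.get("SHADOW_MODE", "true")),
            confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.7")),
            graph_update_interval=float(env.get("GRAPH_UPDATE_INTERVAL", "2.0")),
            notifier_cooldown=int(env.get("NOTIFIER_COOLDOWN", "300")),
        )
```

With no overrides, both `dry_run` and `shadow_mode` come out `True`, so a fresh checkout never touches infrastructure.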

Training

# Generate 2,000 labeled graph snapshots
python scripts/generate-training-data.py

# Train — cosine LR, best checkpoint auto-saved
python scripts/train-model.py

# Output: models/echo_model_best.ckpt

Training data distribution: 20% healthy + 15% recovery + 10% near-failure + 12% delayed cascade + 43% standard cascade.
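
Applied to the 2,000 generated samples, that distribution works out to the per-scenario counts computed below. The scenario key names and the `samples_per_class` helper are illustrative; how scripts/generate-training-data.py rounds or shuffles is not specified here.

```python
from typing import Dict

# Percentages from the stated distribution; they sum to 100.
MIX = {
    "healthy": 20,
    "recovery": 15,
    "near_failure": 10,
    "delayed_cascade": 12,
    "standard_cascade": 43,
}


def samples_per_class(total: int = 2000) -> Dict[str, int]:
    """Split `total` samples by percentage using integer arithmetic."""
    if sum(MIX.values()) != 100:
        raise ValueError("percentages must sum to 100")
    return {name: total * pct // 100 for name, pct in MIX.items()}


print(samples_per_class())
# {'healthy': 400, 'recovery': 300, 'near_failure': 200,
#  'delayed_cascade': 240, 'standard_cascade': 860}
```

Roughly 55% of the data is cascade scenarios, so the failure class is well represented relative to healthy traffic.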

Tech Stack

Layer	Technology
ML	PyTorch 2.3, PyTorch Geometric
Backend	FastAPI 0.111, Uvicorn, WebSockets, Pydantic Settings, httpx
Frontend	React 18, TypeScript, Vite 5, React Router v6, Framer Motion, TanStack Query
UI components	ReactFlow 11, Recharts, Zustand, Tailwind CSS v3, Lucide React
Integrations	Slack Incoming Webhooks, Notion API, GitHub REST API
Observability	Prometheus service metrics, OpenTelemetry traces, Jaeger
Infrastructure	Docker Compose, Kubernetes (Kind), Redis

Known Limitations (MVP)

Circuit-breaker actions mutate Kubernetes. Echo creates/patches an Istio DestinationRule when Istio is installed, otherwise it records the breaker state in ConfigMap/echo-circuit-breakers.
DRY_RUN=false enables live Kubernetes actions. Keep DRY_RUN=true for notification-only demos.
No mock Kubernetes fallback. If DRY_RUN=false, a real Kubernetes client must be available or actions fail explicitly.
Model checkpoint is required by default. Set ALLOW_UNTRAINED_MODEL=true only for local demo work without trained weights.
No authentication. The API and WebSocket are open. Add an API key middleware before any public deployment.
Dynamic services don't persist across restarts. Services added via POST /api/services exist only in memory.

License

MIT
