siddhi1229/echo-nitte

⚡ Echo

Predictive SRE Agent — Stop cascading failures before they happen


Echo is an AI-powered SRE agent that models microservice dependencies as temporal graphs, learns cascade failure patterns using a GNN + LSTM architecture, and autonomously fires remediation actions 5–30 minutes before impact — turning reactive incident response into proactive prevention. When a prediction fires, Echo simultaneously:

  • Posts a rich Slack Block Kit alert to #sre-alerts with an AI-generated incident summary
  • Creates a structured Notion incident page in your database
  • Raises an autonomous GitHub PR with a remediation report on the target repo

How It Works

  Microservices (Auth → Orders → Payments → Ledger + dynamic services)
         │
         │  Metrics (latency, errors, RPS)
         │  via direct /metrics scrape → Prometheus
         ▼
  ┌──────────────────────────────────────────────────────┐
  │                ECHO  INFERENCE  LOOP  (2s)           │
  │                                                      │
  │  Prometheus ──▶ Normalizer ──▶ Graph Manager ──▶    │
  │  (cached 15s)   (clamp/NaN)    (feature deques)      │
  │                                      │               │
  │                     GCN + LSTM ◀─────┘               │
  │                     (predict)                        │
  │                         │                            │
  │                  Policy Engine                       │
  │               (4-tier severity)                      │
  │                         │                            │
  │                  Action Executor                     │
  │               (scale / circuit-break                 │
  │                / alert / dry-run)                    │
  │                         │                            │
  │                  Groq LLM Engine                     │
  │              (incident summary + memory)             │
  │                         │                            │
  │                    Notifier                          │
  │               (Slack · Notion · PR)                  │
  └──────────────────────────────────────────────────────┘
         │                          │
         ▼                          ▼
  React Dashboard           Kubernetes API
  (WebSocket, 2s)           (real or dry-run)

Every 2 seconds, Echo:

  1. Reads cached Prometheus metrics (polled every 15s, reused between ticks)
  2. Normalizes — clamps to safe bounds, handles NaN/Inf, falls back to last-good
  3. Updates the sliding-window service graph (16-dim node features × 60-tick history)
  4. Runs inference through GCN → LSTM → Attention with three prediction heads
  5. Evaluates a 4-tier severity policy (critical / high / medium / low)
  6. Executes remediation actions — or simulates them in dry-run mode. Live mode requires a real Kubernetes client.
  7. Generates an intelligent incident summary using Groq's LLM (llama-3.1), augmented by a rolling memory of the last 5 incidents to detect cascading patterns
  8. Notifies via Slack, Notion, and GitHub (with per-service cooldown)
  9. Broadcasts the full state over WebSocket to the live dashboard
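
Steps 2 and 3 can be illustrated with a minimal sketch. It assumes hypothetical metric names and bounds (`BOUNDS`) and a hypothetical `ServiceWindow` class; the project's actual normalizer and graph manager may differ.

```python
import math
from collections import deque

# Hypothetical per-metric bounds, chosen for illustration only.
BOUNDS = {
    "latency_p99_ms": (0.0, 10_000.0),
    "error_rate": (0.0, 1.0),
    "rps": (0.0, 100_000.0),
}
HISTORY_TICKS = 60  # the 60-tick history from step 3


class ServiceWindow:
    """Tracks the last-good value per metric and a bounded sample history."""

    def __init__(self):
        self.last_good = {}
        self.history = deque(maxlen=HISTORY_TICKS)

    def normalize(self, raw):
        clean = {}
        for name, (lo, hi) in BOUNDS.items():
            value = raw.get(name)
            if value is None or not math.isfinite(value):
                # Missing, NaN, or Inf: fall back to the last good reading,
                # or to the lower bound if nothing has been seen yet.
                value = self.last_good.get(name, lo)
            value = min(max(value, lo), hi)  # clamp to safe bounds
            self.last_good[name] = value
            clean[name] = value
        return clean

    def push(self, raw):
        clean = self.normalize(raw)
        self.history.append(clean)  # the deque drops the oldest tick itself
        return clean
```

A bounded `deque` keeps memory constant per service, which suits a loop that runs every 2 seconds indefinitely.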

Architecture

echo_/
├── backend/
│   ├── api/
│   │   └── main.py                 # FastAPI app, REST + WebSocket endpoints
│   ├── core/
│   │   └── orchestrator.py         # 2s deterministic tick loop, TickContext, warmup
│   ├── ml/
│   │   ├── echo_model.py           # GCNConv×2 → LSTM×2 → Attention → 3 heads
│   │   └── inference_wrapper.py    # Thread-safe async inference with tensor cloning
│   ├── graph/
│   │   ├── features.py             # NodeFeatures + EdgeFeatures (shared dataclasses)
│   │   ├── state_manager.py        # Static/dynamic graph topology + feature deques
│   │   ├── graph_builder.py        # Physics simulator (OFFLINE training only)
│   │   └── snapshot.py             # Redis state persistence (zstd + orjson)
│   ├── ingestion/
│   │   ├── prometheus.py           # Real PromQL queries, 15s async poll cache
│   │   └── normalizer.py           # Clamp, NaN handling, stale fallback
│   ├── executor/
│   │   ├── policy_engine.py        # 4-tier severity + rate limiter
│   │   ├── action_executor.py      # K8s scale actions + dry-run + rollback
│   │   └── notifier.py             # Slack · Notion · GitHub PR (w/ Groq LLM + memory)
│   ├── config/
│   │   └── settings.py             # Pydantic-settings, all env vars
│   └── requirements.txt
│
├── frontend/
│   └── src/
│       ├── App.tsx                  # React Router v6 entry
│       ├── store/echo.ts            # Zustand store (graph, predictions, actions)
│       ├── hooks/useWebSocket.ts    # Auto-reconnect WebSocket client
│       ├── components/              # ServiceGraph, PredictionPanel, ActionLog, etc.
│       └── pages/                   # Landing, Dashboard, Incidents, Services, Settings
│
├── services/
│   ├── auth/                        # :8001 — token issuing + fault injection
│   ├── orders/                      # :8002 — depends on auth
│   ├── payments/                    # :8003 — depends on orders
│   └── ledger/                      # :8004 — depends on payments
│
├── scripts/
│   ├── chaos/                       # Fault injection scenarios (3 scripts)
│   ├── demo/                        # start-demo.sh, reset-demo.sh, test-notifications.sh
│   ├── generate-training-data.py    # 2000 labeled samples
│   └── train-model.py               # PyTorch training loop
│
├── docker/
│   ├── backend.Dockerfile
│   ├── otel-collector-config.yaml
│   └── prometheus.yml
│
├── models/
│   └── echo_model_best.ckpt         # Pre-trained weights
├── docker-compose.yml               # Full stack: all 10 services
├── .env.example                     # All supported environment variables
└── k8s/kind-config.yaml             # Kind cluster (1 control-plane + 2 workers)

ML Model

Echo uses a Temporal Graph Neural Network combining spatial graph learning with temporal sequence modeling:

Input: 30 graph snapshots × N services × 16 features
    │
    ▼
┌──────────────────────┐
│  GCNConv × 2         │  ← Propagate signals across service edges per snapshot
│  (per snapshot)      │
└──────────┬───────────┘
           │  Global mean pool → one embedding per timestep
           ▼
┌──────────────────────┐
│  LSTM × 2            │  ← Learn temporal degradation patterns over 30 ticks
│  hidden=64, drop=0.2 │
└──────────┬───────────┘
           │
           ▼
┌──────────────────────┐
│  Temporal Attention  │  ← Attend to the most anomalous timesteps
└──────────┬───────────┘
           │
      ┌────┼────┐
      ▼    ▼    ▼
   P(fail) TTF Conf     ← sigmoid · ReLU · sigmoid

| Component | Detail |
|---|---|
| Node features | 16-dim: mean + std of latency p50/p99, error rate, RPS, CPU %, memory %, replicas, restarts |
| Graph convolution | 2-layer GCNConv (PyG), fallback to linear if PyG unavailable |
| Temporal model | 2-layer LSTM (hidden=64, dropout=0.2) + softmax attention over 30 steps |
| Prediction heads | Failure probability (sigmoid), time-to-failure minutes (ReLU), confidence (sigmoid) |
| Cascade path | Beam-search PathRanker over node embeddings + edge structure |
| Training | PyTorch, BCE + MSE + calibration loss, cosine LR scheduler |
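
To make the temporal attention step concrete, here is a dependency-free sketch of softmax attention pooling over per-timestep embeddings, followed by the head activations. The dot-product scoring against a `query` vector is an assumption made for illustration; the real layer in `echo_model.py` is a learned PyTorch module and may score timesteps differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Collapse T timestep vectors into one context vector.

    hidden_states: list of T vectors, one per timestep (e.g. LSTM outputs).
    query: scoring vector (assumed learned; hypothetical here).
    """
    scores = [sum(h * q for h, q in zip(state, query)) for state in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * state[d] for w, state in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


def sigmoid(x):
    """Squashes a logit into (0, 1); used for P(fail) and confidence."""
    return 1.0 / (1.0 + math.exp(-x))


def relu(x):
    """Keeps the time-to-failure estimate non-negative."""
    return max(0.0, x)
```

Timesteps with higher scores dominate the context vector, which is how the model can weight the most anomalous ticks more heavily than quiet ones.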

Policy Engine

Predictions pass through a 4-tier policy system with built-in safety guards:

| Severity | Trigger | Actions |
|---|---|---|
| 🔴 Critical | P(fail) ≥ 90% | Scale to 4 replicas + circuit-breaker + downstream pre-scale |
| 🟠 High | P(fail) ≥ 70% | Scale to 3 replicas + page on-call |
| 🟡 Medium | P(fail) ≥ 50% | Pre-warm 2 replicas + warning alert |
| 🔵 Low | P(fail) ≥ 30% | Info alert, continue monitoring |
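
The tier thresholds above amount to a simple ordered lookup. The sketch below is illustrative only; `TIERS` and `classify` are hypothetical names, not the policy engine's actual API.

```python
# Thresholds from the severity table, highest first.
TIERS = [
    (0.90, "critical"),
    (0.70, "high"),
    (0.50, "medium"),
    (0.30, "low"),
]


def classify(failure_prob):
    """Return the severity tier for a failure probability, or None below 30%."""
    for threshold, severity in TIERS:
        if failure_prob >= threshold:
            return severity
    return None
```

For example, `classify(0.73)` falls in the high tier, and anything below 0.30 produces no tier at all.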

Safety mechanisms:

  • Shadow mode — predictions are logged but no actions fire (default: enabled)
  • Confidence gating — predictions below threshold (default 0.7) are silently dropped
  • Warmup suppression — first 30 ticks (~60s) suppress all actions to let LSTM fill
  • Rate limiting — max 3 actions per service per 5-minute window
  • Stale degradation — confidence halved if telemetry is stale for 3+ consecutive ticks
  • Dry-run mode — all actions simulated by default; no infrastructure is touched
  • Notifier cooldown — one notification set per service per 5 minutes (configurable)
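
Several of these guards can be combined in a single gate. The following is a sketch under stated assumptions: `ActionGate` and its parameters are hypothetical names, tick IDs are assumed to start at 1, and shadow mode and dry-run are handled elsewhere.

```python
import time
from collections import defaultdict, deque


class ActionGate:
    """Warmup suppression, stale degradation, confidence gating, rate limiting."""

    def __init__(self, confidence_threshold=0.7, warmup_ticks=30,
                 max_actions=3, window_s=300.0, stale_ticks=3,
                 clock=time.monotonic):
        self.confidence_threshold = confidence_threshold
        self.warmup_ticks = warmup_ticks
        self.max_actions = max_actions
        self.window_s = window_s
        self.stale_ticks = stale_ticks
        self.clock = clock  # injectable so the gate can be tested without sleeping
        self._recent = defaultdict(deque)  # service -> timestamps of allowed actions

    def allow(self, service, confidence, tick_id, stale_streak=0):
        if tick_id <= self.warmup_ticks:
            return False  # the LSTM window is still filling
        if stale_streak >= self.stale_ticks:
            confidence *= 0.5  # halve confidence on stale telemetry
        if confidence < self.confidence_threshold:
            return False
        now = self.clock()
        recent = self._recent[service]
        while recent and now - recent[0] >= self.window_s:
            recent.popleft()  # forget actions older than the window
        if len(recent) >= self.max_actions:
            return False
        recent.append(now)
        return True
```

Checking the cheap conditions first means a suppressed prediction never consumes a slot in the rate-limit window.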

Autonomous Integrations

When a plan is executed, Echo fires three real HTTP calls concurrently:

Slack

Posts a rich Block Kit message to #sre-alerts including:

  • Severity header, service name, failure probability, estimated time-to-failure (ETF) countdown
  • Cascade propagation path
  • Actions taken (scale-up targets, circuit-breaker mode)
  • Link to the auto-raised GitHub PR
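
As a rough illustration, a Block Kit payload carrying those fields could be assembled like this. The function name, argument names, and exact wording are assumptions; only the `header` and `section` block types and the `<url|label>` link syntax come from Slack's Block Kit format.

```python
import json


def build_slack_alert(severity, service, failure_prob, ttf_minutes,
                      cascade_path, actions, pr_url):
    """Build a webhook payload; the layout is a hypothetical example."""
    return {
        "blocks": [
            {
                "type": "header",
                "text": {
                    "type": "plain_text",
                    "text": f"{severity.upper()}: {service} degradation predicted",
                },
            },
            {
                "type": "section",
                "fields": [
                    {"type": "mrkdwn", "text": f"*Service:*\n{service}"},
                    {"type": "mrkdwn", "text": f"*Failure probability:*\n{failure_prob:.0%}"},
                    {"type": "mrkdwn", "text": f"*Time to failure:*\n{ttf_minutes:.1f} min"},
                    {"type": "mrkdwn", "text": "*Cascade path:*\n" + " → ".join(cascade_path)},
                ],
            },
            {
                "type": "section",
                "text": {
                    "type": "mrkdwn",
                    "text": "*Actions taken:*\n" + "\n".join(f"• {a}" for a in actions),
                },
            },
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": f"<{pr_url}|View remediation PR>"},
            },
        ]
    }


payload = build_slack_alert(
    "critical", "auth", 0.93, 8.2,
    ["auth", "orders", "payments"],
    ["scale auth to 4 replicas"],
    "https://example.com/pr/1",  # placeholder URL
)
body = json.dumps(payload)  # this string would be POSTed to SLACK_WEBHOOK_URL
```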

Notion

Creates a structured incident page in your database with the properties Severity, Service, Status (Open), Failure Prob, ETF Minutes, Cascade Path, and PR URL.

The page body contains three sections: Incident Summary, Actions Taken, and Reasoning.

GitHub

Raises a pull request on the configured repo:

  • Branch: echo-fix/{service}/{prediction_id[:8]}
  • File: remediation/{service}-{pred_id[:8]}.md — full remediation report
  • PR title: [Echo Auto-Fix] CRITICAL: auth degradation detected
  • Merging the PR = acknowledging the incident
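
The naming scheme above is easy to reproduce. This sketch uses a hypothetical helper, `pr_artifacts`, and a made-up prediction ID; it only mirrors the patterns listed in the bullets.

```python
def pr_artifacts(service, prediction_id, severity):
    """Derive branch, file path, and PR title from a prediction."""
    short_id = prediction_id[:8]  # first 8 characters of the prediction ID
    return {
        "branch": f"echo-fix/{service}/{short_id}",
        "path": f"remediation/{service}-{short_id}.md",
        "title": f"[Echo Auto-Fix] {severity.upper()}: {service} degradation detected",
    }


artifacts = pr_artifacts("auth", "3f2a9c1e7b4d4e0f", "critical")
```

Truncating the ID keeps branch names short while still making collisions between concurrent predictions unlikely.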

Setup instructions are in the docstring at the top of backend/executor/notifier.py.


Quick Start

Prerequisites

| Tool | Min version |
|---|---|
| Docker | 20+ (Compose v2 included) |
| Python | 3.11+ (local dev only) |
| Node.js | 20+ (frontend dev only) |

One-command demo

git clone https://github.com/your-username/echo-onSRE
cd echo-onSRE
cp .env.example .env        # fill in Slack/Notion/GitHub tokens to enable notifications
docker-compose up --build -d

Once the containers are up, the dashboard is available at http://localhost:5173.

Verify

# Health check
curl http://localhost:8000/health

# Check current topology
curl http://localhost:8000/api/services

# Check notification integrations
curl http://localhost:8000/api/notifications/status

# Non-UI MVP demo: runtime service growth + inference/action path
./scripts/demo/non-ui-mvp-demo.sh

| Service | URL |
|---|---|
| Dashboard | localhost:5173 |
| Landing Page | localhost:5173/ |
| API + Swagger | localhost:8000/docs |
| Prometheus | localhost:9090 |
| Jaeger | localhost:16686 |

Local development

# Backend
python3 -m venv venv && source venv/bin/activate
pip install -r backend/requirements.txt
PYTHONPATH=. uvicorn backend.api.main:app --reload --port 8000

# Frontend (separate terminal)
cd frontend && npm install && npm run dev

Demo Walkthrough

1. Start the stack

docker-compose up --build -d
curl http://localhost:8000/health
# → {"status": "ok", "tick_id": 1, "warmup_remaining": 30, ...}

Wait ~60s for warmup to complete (warmup_remaining: 0).

2. Inject a failure

# Latency spike on auth service
curl -X POST http://localhost:8000/api/inject/auth \
  -H "Content-Type: application/json" \
  -d '{"type": "latency", "magnitude": 0.9}'

3. Watch Echo predict

# Poll predictions (or watch the dashboard at localhost:5173/dashboard/overview)
curl http://localhost:8000/api/graph | python -m json.tool
curl -X POST http://localhost:8000/api/predict | python -m json.tool

4. Add a 5th service live

curl -X POST http://localhost:8000/api/services \
  -H "Content-Type: application/json" \
  -d '{"name": "inventory", "edges": [["orders", "inventory"], ["payments", "inventory"]]}'
# → {"status": "registered", "service": "inventory", "edges_added": 2, ...}

For the curated non-UI demo flow, see DEMO.md or run:

./scripts/demo/non-ui-mvp-demo.sh

5. Check notifications

# Test all integrations (fires a test alert)
curl -X POST http://localhost:8000/api/notifications/test

6. Reset

# Clear a specific fault
curl -X DELETE http://localhost:8000/api/inject/auth

# Or reset all services
for svc in auth orders payments ledger; do
  curl -X DELETE "http://localhost:8000/api/inject/$svc"
done

API Reference

| Method | Endpoint | Description |
|---|---|---|
| GET | /api/graph | Current graph state (nodes, edges, metrics) |
| POST | /api/predict | Trigger on-demand inference |
| GET | /api/actions | Action execution history |
| GET | /api/config | Runtime configuration |
| PATCH | /api/config | Update dry_run, confidence_threshold |
| GET | /api/services | List current topology (services + edges) |
| POST | /api/services | Register a new service at runtime |
| POST | /api/inject/{service} | Inject fault (type: latency/error_rate/cpu/traffic_spike, magnitude: 0-1) |
| DELETE | /api/inject/{service} | Clear fault injection for a service |
| GET | /api/notifications/status | Which integrations are configured |
| POST | /api/notifications/test | Fire test notification to all integrations |
| GET | /health | Health check with tick_id, staleness, warmup status |
| WS | /ws | Live graph + prediction stream (2s interval) |
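
The endpoints can also be called from Python with the standard library. The sketch below builds a `PATCH /api/config` request; the JSON body shape (top-level `dry_run` and `confidence_threshold` keys) is an assumption inferred from the table, so check the Swagger page at localhost:8000/docs for the actual schema.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"


def build_config_patch(dry_run=None, confidence_threshold=None):
    """Build (but do not send) a PATCH request for the runtime config."""
    body = {}
    if dry_run is not None:
        body["dry_run"] = dry_run
    if confidence_threshold is not None:
        body["confidence_threshold"] = confidence_threshold
    return urllib.request.Request(
        f"{BASE_URL}/api/config",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PATCH",
    )


req = build_config_patch(confidence_threshold=0.8)
# To send it against a running backend:
#     with urllib.request.urlopen(req) as resp:
#         print(resp.read().decode())
```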

WebSocket Protocol

The frontend connects to ws://localhost:8000/ws. The backend pushes a JSON payload every 2 seconds:

{
  "type": "update",
  "graph": { "nodes": [...], "edges": [...], "tick_id": 42 },
  "prediction": {
    "failure_prob": 0.73,
    "time_to_failure_minutes": 8.2,
    "confidence": 0.85,
    "cascade_path": [0, 1, 2],
    "node_names": ["auth", "orders", "payments"]
  },
  "action_plan": { "severity": "high", "actions": [...] },
  "dry_run": true
}

On connect, the server sends a "type": "snapshot" message with the current state.

If the backend loop encounters 5+ consecutive errors, it sends:

{ "type": "error", "message": "...", "consecutive_errors": 5 }
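
Because three message types share one socket, a client should branch on `type` before reading other fields. Here is a small sketch of such a dispatcher; `summarize` is a hypothetical helper, and the fields of the snapshot message are not assumed beyond its `type`.

```python
import json


def summarize(raw):
    """Turn one raw WebSocket frame into a short log line."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind == "update":
        pred = msg.get("prediction") or {}
        tick = msg.get("graph", {}).get("tick_id")
        prob = pred.get("failure_prob", 0.0)
        return f"tick {tick}: failure_prob={prob:.2f}"
    if kind == "snapshot":
        return "snapshot received"
    if kind == "error":
        count = msg.get("consecutive_errors")
        return f"backend error after {count} consecutive failures: {msg.get('message')}"
    return f"ignored message type {kind!r}"
```

Using `.get` with defaults keeps the client alive when a field is absent, which matters for long-running dashboards and log tailers.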

Connecting from any client:

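# Python (websockets)
import asyncio
import json

import websockets

async def main():
    async with websockets.connect("ws://localhost:8000/ws") as ws:
        async for message in ws:
            data = json.loads(message)
            # Only "update" messages are guaranteed to carry graph + prediction;
            # skip "snapshot" and "error" messages to avoid a KeyError.
            if data.get("type") != "update":
                continue
            prob = (data.get("prediction") or {}).get("failure_prob", 0)
            print(f"Tick {data['graph']['tick_id']}: failure_prob={prob:.3f}")

asyncio.run(main())

// JavaScript (browser or Node.js)
const ws = new WebSocket('ws://localhost:8000/ws');
ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  // Ignore "snapshot" and "error" messages in this minimal example
  if (data.type !== 'update') return;
  console.log('Tick:', data.graph.tick_id, 'Failure:', data.prediction?.failure_prob);
};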

Configuration

All settings are managed via environment variables (see .env.example):

| Variable | Default | Description |
|---|---|---|
| DRY_RUN | true | Simulate actions without touching infrastructure |
| SHADOW_MODE | true | Log predictions and suppress actions (for calibration) |
| CONFIDENCE_THRESHOLD | 0.7 | Minimum model confidence to trigger policy |
| GRAPH_UPDATE_INTERVAL | 2.0 | Seconds between inference cycles (LOCKED to LSTM training) |
| MODEL_CHECKPOINT | models/echo_model_best.ckpt | Path to trained weights |
| PREDICTION_HORIZON_MINUTES | 15 | Look-ahead window for TTF estimates |
| K8S_NAMESPACE | echo-demo | Target Kubernetes namespace for real actions |
| K8S_IN_CLUSTER | false | Set true when running inside a cluster |
| SLACK_WEBHOOK_URL | "" | Incoming webhook URL for #sre-alerts |
| NOTION_TOKEN | "" | Notion internal integration token |
| NOTION_DATABASE_ID | "" | 32-char hex ID of your incident database |
| GITHUB_TOKEN | "" | PAT with repo or public_repo scope |
| GITHUB_REPO | "" | Target repo in owner/repo format |
| NOTIFIER_COOLDOWN | 300 | Seconds between notifications per service |
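
The project loads these values with pydantic-settings (see `backend/config/settings.py`). For readers who want to see how the defaults behave without that dependency, here is a standard-library sketch; `EchoSettings` and `load_settings` are hypothetical names, and only a subset of the variables is covered.

```python
import os
from dataclasses import dataclass


def _as_bool(value):
    """Interpret common truthy strings; anything else is False."""
    return value.strip().lower() in {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class EchoSettings:
    dry_run: bool
    shadow_mode: bool
    confidence_threshold: float
    notifier_cooldown: int
    k8s_namespace: str


def load_settings(env=None):
    """Read settings from a mapping (os.environ by default), applying the table's defaults."""
    env = os.environ if env is None else env
    return EchoSettings(
        dry_run=_as_bool(env.get("DRY_RUN", "true")),
        shadow_mode=_as_bool(env.get("SHADOW_MODE", "true")),
        confidence_threshold=float(env.get("CONFIDENCE_THRESHOLD", "0.7")),
        notifier_cooldown=int(env.get("NOTIFIER_COOLDOWN", "300")),
        k8s_namespace=env.get("K8S_NAMESPACE", "echo-demo"),
    )
```

With an empty environment, both safety switches stay on, so nothing touches infrastructure until `DRY_RUN` and `SHADOW_MODE` are explicitly disabled.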

Training

# Generate 2,000 labeled graph snapshots
python scripts/generate-training-data.py

# Train — cosine LR, best checkpoint auto-saved
python scripts/train-model.py

# Output: models/echo_model_best.ckpt

Training data distribution: 20% healthy + 15% recovery + 10% near-failure + 12% delayed cascade + 43% standard cascade.
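
One way to draw scenario labels with that mix is weighted sampling. The sketch below is an assumption about how such a split could be produced, not a description of `generate-training-data.py`; the scenario keys and `sample_scenarios` are hypothetical names.

```python
import random
from collections import Counter

# Percentages from the distribution above; they sum to 100.
SCENARIO_WEIGHTS = {
    "healthy": 20,
    "recovery": 15,
    "near_failure": 10,
    "delayed_cascade": 12,
    "standard_cascade": 43,
}


def sample_scenarios(n=2000, seed=0):
    """Draw n scenario labels according to SCENARIO_WEIGHTS."""
    rng = random.Random(seed)  # seeded for reproducible datasets
    names = list(SCENARIO_WEIGHTS)
    weights = [SCENARIO_WEIGHTS[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


labels = sample_scenarios()
counts = Counter(labels)
```

Sampling gives proportions that are only approximately 20/15/10/12/43; a generator that needs exact counts would allocate fixed quotas per scenario instead.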


Tech Stack

| Layer | Technology |
|---|---|
| ML | PyTorch 2.3, PyTorch Geometric |
| Backend | FastAPI 0.111, Uvicorn, WebSockets, Pydantic Settings, httpx |
| Frontend | React 18, TypeScript, Vite 5, React Router v6, Framer Motion, TanStack Query |
| UI components | ReactFlow 11, Recharts, Zustand, Tailwind CSS v3, Lucide React |
| Integrations | Slack Incoming Webhooks, Notion API, GitHub REST API |
| Observability | Prometheus service metrics, OpenTelemetry traces, Jaeger |
| Infrastructure | Docker Compose, Kubernetes (Kind), Redis |

Known Limitations (MVP)

  • Circuit-breaker actions mutate Kubernetes. Echo creates/patches an Istio DestinationRule when Istio is installed, otherwise it records the breaker state in ConfigMap/echo-circuit-breakers.
  • DRY_RUN=false enables live Kubernetes actions. Keep DRY_RUN=true for notification-only demos.
  • No mock Kubernetes fallback. If DRY_RUN=false, a real Kubernetes client must be available or actions fail explicitly.
  • Model checkpoint is required by default. Set ALLOW_UNTRAINED_MODEL=true only for local demo work without trained weights.
  • No authentication. The API and WebSocket are open. Add an API key middleware before any public deployment.
  • Dynamic services don't persist across restarts. Services added via POST /api/services exist only in memory.

License

MIT
