Real-time pod anomaly detection and LLM-powered self-healing for Kubernetes workloads.
KubePulse is an autonomous AIOps agent that continuously monitors Kubernetes pod memory usage, predicts anomalies before they cause outages, and leverages a Large Language Model to automatically suggest remediation commands — all in real time.
Traditional Kubernetes monitoring tools alert you after a problem occurs. KubePulse is predictive: it uses a hybrid LSTM + LightGBM ML pipeline to forecast memory behavior and detect anomalies before they escalate into CrashLoopBackOff or OOMKilled events.
┌─────────────────────────────────────────────────────────────────┐
│                         KubePulse Agent                         │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌───────────────────┐  │
│  │  Prometheus  │───▶│   Node.js    │───▶│ Python ML Engine  │  │
│  │  (Metrics)   │    │ Orchestrator │    │ (LSTM + LightGBM) │  │
│  └──────────────┘    └──────┬───────┘    └────────┬──────────┘  │
│          ▲                  │                     │             │
│          │             POST │            Anomaly? │             │
│ ┌────────┴───────┐          │                     ▼             │
│ │ Go Prometheus  │◀─────────┘        ┌───────────────────────┐  │
│ │    Exporter    │ /report           │ Gemini 1.5 Pro (LLM)  │  │
│ │    /metrics    │                   │ kubectl fix command   │  │
│ └────────────────┘                   └───────────────────────┘  │
│  ┌──────────────┐                                               │
│  │  Kubernetes  │◀── HPA can scale based on custom metrics      │
│  │  API Server  │                                               │
│  └──────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘
- Metrics Collection: Queries Prometheus every second for `container_memory_usage_bytes` of the target pod
- Normalization: Memory values are normalized against the pod's configured memory limit (fetched from the K8s API)
- Sliding Window: Maintains a rolling window of 10 data points fed as a time-series sequence
- ML Prediction: The Node.js process streams the window to a Python subprocess via `stdin`; the LSTM model forecasts the next memory value
- Anomaly Classification: The predicted value is passed to a LightGBM classifier to determine if it's anomalous
- Metrics Export: Every prediction (healthy or anomalous) is POSTed to the Go Prometheus Exporter, which exposes custom metrics on `/metrics` for Prometheus to scrape
- LLM Remediation: If an anomaly is detected, the agent queries Gemini 1.5 Pro with full pod context and gets a targeted `kubectl` fix command
- Critical Triage: If the pod is already in `CrashLoopBackOff`, the agent escalates to a cluster administrator alert instead of auto-remediating
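The first three steps (collection, normalization, sliding window) can be sketched roughly as follows. This is an illustrative Python sketch, not the agent's actual Node.js code: the function and class names are hypothetical, the `pod` label selector is an assumption about how the metric is labeled in your cluster, and `/api/v1/query` is the standard Prometheus instant-query endpoint.

```python
from collections import deque
from urllib.parse import urlencode

WINDOW_SIZE = 10  # rolling window length described above


def build_query_url(prometheus_base: str, pod_name: str) -> str:
    """Build a Prometheus instant-query URL for one pod's memory usage.

    Assumption: the metric carries a `pod` label; adjust the selector if
    your cluster labels it differently.
    """
    promql = f'container_memory_usage_bytes{{pod="{pod_name}"}}'
    return f"{prometheus_base}/api/v1/query?{urlencode({'query': promql})}"


def normalize(usage_bytes: float, limit_bytes: float) -> float:
    """Express memory usage as a fraction of the pod's memory limit."""
    if limit_bytes <= 0:
        raise ValueError("memory limit must be positive")
    return usage_bytes / limit_bytes


class SlidingWindow:
    """Keeps only the most recent WINDOW_SIZE normalized samples."""

    def __init__(self, size: int = WINDOW_SIZE) -> None:
        self._values = deque(maxlen=size)

    def push(self, value: float) -> None:
        # deque(maxlen=...) silently drops the oldest sample when full.
        self._values.append(value)

    def ready(self) -> bool:
        # Prediction should only start once the window is full.
        return len(self._values) == self._values.maxlen

    def snapshot(self) -> list:
        return list(self._values)
```

At one sample per second, `ready()` first returns true after ten pushes, which matches the 10-second warm-up before predictions begin.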
| Component | Role | Details |
|---|---|---|
| LSTM | Time-series forecasting | Trained on historical pod memory sequences (10-step input → next value) |
| LightGBM | Anomaly classification | Binary classifier on LSTM-predicted output: 0 = healthy, 1 = anomaly |
The hybrid approach separates concerns: the LSTM captures temporal patterns in memory usage, while LightGBM provides fast, interpretable anomaly classification on top of the forecast. The design is intended to reduce false positives relative to threshold-based alerting.
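The two-stage structure can be illustrated with a small sketch in which the models are passed in as plain callables. The `naive_forecast` and `naive_classify` functions below are deliberately simple stand-ins (linear extrapolation and a fixed cutoff), not the trained LSTM or LightGBM models; all names here are hypothetical.

```python
from typing import Callable, Sequence

Forecaster = Callable[[Sequence[float]], float]  # window -> next value
Classifier = Callable[[float], int]              # predicted value -> 0 or 1


def hybrid_predict(window: Sequence[float],
                   forecast: Forecaster,
                   classify: Classifier) -> dict:
    """Stage 1 forecasts the next value; stage 2 labels that forecast."""
    if len(window) != 10:
        raise ValueError("expected a 10-step window")
    predicted = forecast(window)
    label = classify(predicted)  # 0 = healthy, 1 = anomaly
    return {"predicted": predicted, "anomaly": label == 1}


def naive_forecast(window: Sequence[float]) -> float:
    """Stand-in for the LSTM: extrapolate from the last two points."""
    return window[-1] + (window[-1] - window[-2])


def naive_classify(predicted: float) -> int:
    """Stand-in for LightGBM: flag forecasts above the memory limit."""
    return 1 if predicted > 1.0 else 0
```

Keeping the two stages behind separate callables means either model could be retrained or swapped without touching the other.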
| Layer | Technology |
|---|---|
| Orchestration | Node.js (ESM), Express |
| Kubernetes Integration | Kubernetes REST API (kubectl proxy) |
| Metrics | Prometheus + PromQL |
| Custom Metrics Exporter | Go 1.22, Prometheus client_golang SDK |
| ML Inference | TensorFlow/Keras (LSTM), LightGBM, NumPy |
| LLM Integration | LangChain + Google Gemini 1.5 Pro |
| IPC | Node.js child_process → Python subprocess via stdin/stdout |
| Containerization | Docker (multi-stage build) |
| Deployment | Kubernetes Deployment + Service YAML |
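For the IPC row, the Python side of a persistent stdin/stdout worker might look like the sketch below. The wire format shown (one JSON array per input line, one JSON object per output line) is an assumption for illustration; the README does not specify the actual protocol used between `index.js` and `hybrid_predict.py`.

```python
import json
import sys
from typing import Callable, IO, Sequence


def serve(predict: Callable[[Sequence[float]], float],
          stdin: IO[str] = sys.stdin,
          stdout: IO[str] = sys.stdout) -> None:
    """Read one JSON window per line and answer with one JSON line each."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            window = json.loads(line)
            result = {"ok": True, "predicted": predict(window)}
        except (ValueError, TypeError) as exc:
            # Report bad input instead of crashing the long-lived worker.
            result = {"ok": False, "error": str(exc)}
        stdout.write(json.dumps(result) + "\n")
        # Flush after every reply so the parent process is not left
        # waiting on a buffered pipe.
        stdout.flush()
```

Because the worker stays alive between requests, the models would only need to be loaded once at startup rather than on every prediction.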
KubePulse/
├── index.js # Main agent: metrics collection, orchestration, LLM integration
├── hybrid_predict.py # ML inference engine: LSTM + LightGBM anomaly detection
├── hybrid_lstm_model.keras # Pre-trained LSTM model (TensorFlow/Keras)
├── hybrid_lstm_model.h5 # LSTM model (HDF5 format)
├── lightgbm_anomaly.pkl # Trained LightGBM anomaly classifier
├── exp.js # Utility: K8s API exploration script
├── package.json # Node.js dependencies
└── kubepulse-exporter/ # Go Prometheus custom metrics exporter
    ├── main.go # HTTP server exposing /report and /metrics
    ├── go.mod # Go module definition
    ├── Dockerfile # Multi-stage Docker build
    └── k8s/
        └── exporter-deployment.yaml # Kubernetes Deployment + Service
A purpose-built Go microservice that acts as the metrics bridge between the KubePulse agent and Prometheus.
| Endpoint | Method | Description |
|---|---|---|
| `POST /report` | POST | Receives JSON predictions from the Node.js agent |
| `GET /metrics` | GET | Prometheus scrape endpoint with custom metrics |
| `GET /healthz` | GET | Liveness/readiness health check |
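As a rough illustration of the `/report` call, the sketch below builds the POST request using only the Python standard library. The JSON field names (`pod`, `predicted`, `anomaly`) and the exporter URL are hypothetical; check `main.go` for the schema the exporter actually expects.

```python
import json
import urllib.request


def build_report_request(exporter_url: str, pod: str,
                         predicted: float, anomaly: bool) -> urllib.request.Request:
    """Build (but do not send) a JSON POST to the exporter's /report endpoint."""
    # Hypothetical payload shape, for illustration only.
    body = json.dumps(
        {"pod": pod, "predicted": predicted, "anomaly": anomaly}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{exporter_url}/report",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

The request could then be sent with `urllib.request.urlopen(req, timeout=2)`; a short timeout keeps a slow exporter from stalling the 1 Hz loop.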
| Metric | Type | Description |
|---|---|---|
| `kubepulse_predicted_memory_ratio` | Gauge | LSTM-forecasted memory usage (0–1, per pod label) |
| `kubepulse_anomaly_detected` | Gauge | 1 = anomaly active, 0 = healthy (per pod label) |
| `kubepulse_anomalies_total` | Counter | Cumulative anomaly count since exporter start |
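To make the table concrete, the sketch below renders these three metrics in the Prometheus text exposition format, which is roughly what a scrape of `/metrics` would return. The real exporter uses `client_golang` to produce this output; the Python function, the `pod` label name, and the unlabeled counter are assumptions made for illustration.

```python
def render_metrics(state: dict, anomalies_total: int) -> str:
    """Render metrics as Prometheus exposition text.

    `state` maps pod name -> (predicted_ratio, anomaly_flag).
    """
    lines = ["# TYPE kubepulse_predicted_memory_ratio gauge"]
    for pod, (ratio, _) in sorted(state.items()):
        lines.append(f'kubepulse_predicted_memory_ratio{{pod="{pod}"}} {ratio}')

    lines.append("# TYPE kubepulse_anomaly_detected gauge")
    for pod, (_, flag) in sorted(state.items()):
        # Gauges carry numbers, so the boolean becomes 1 or 0.
        lines.append(f'kubepulse_anomaly_detected{{pod="{pod}"}} {int(flag)}')

    lines.append("# TYPE kubepulse_anomalies_total counter")
    lines.append(f"kubepulse_anomalies_total {anomalies_total}")
    return "\n".join(lines) + "\n"
```

The two gauges report the latest state per pod, while the counter only ever increases, so rate queries over it remain meaningful between scrapes.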
# Build the Docker image
cd kubepulse-exporter
docker build -t kubepulse-exporter:latest .
# Deploy to Kubernetes
kubectl apply -f k8s/exporter-deployment.yaml
# Verify it's running
kubectl get pods -l app=kubepulse-exporter
Since KubePulse exposes `kubepulse_predicted_memory_ratio` as a real Prometheus metric, Kubernetes HPA can use it via the Prometheus Adapter to auto-scale workloads based on predicted memory, closing the full AIOps loop.
- Kubernetes cluster running locally (e.g., minikube or kind)
- Prometheus deployed in the cluster with `container_memory_usage_bytes` metrics available
- Python 3.10+ with pip
- Node.js 20+
- `kubectl proxy` running on `localhost:8001`
git clone https://github.com/Mounusha25/Kube_Pulse.git
cd Kube_Pulse
npm install
pip install tensorflow keras lightgbm numpy joblib
kubectl proxy --port=8001
In `index.js`, update the pod name to match your deployment:
const podName = "your-pod-name-here"
npm run dev
KubePulse will begin collecting metrics. After 10 seconds of warm-up, ML predictions start streaming and anomaly detection goes live.
[KubePulse] Collecting memory metrics...
[0.42, 0.45, 0.48, 0.51, 0.53, 0.58, 0.63, 0.71, 0.79, 0.88]
✅ Healthy. Predicted Memory: 0.91
⚠️ Anomaly Detected! Predicted Memory: 1.24 (exceeds limit)
🧠 LLM suggests: kubectl set resources deployment/finalpod --limits=memory=512Mi
⚠️ CRITICAL: Pod finalpod-77c649c5fc-tzvnb is leaking memory. Notify cluster administrator.
| Approach | Detection Timing | False Positives | Auto-Remediation |
|---|---|---|---|
| Threshold alerts | After breach | High | ❌ |
| Prometheus alerting rules | After breach | Medium | ❌ |
| KubePulse (LSTM + LightGBM) | Before breach | Low | ✅ LLM-powered |
- Subprocess IPC over REST: The Python ML engine runs as a persistent subprocess rather than a separate microservice, eliminating HTTP overhead for high-frequency (1 Hz) inference
- Go for the metrics layer: The exporter is written in Go using the official `prometheus/client_golang` SDK, the same stack used by production K8s operators, keeping the scrape path lightweight and idiomatic
- Normalized memory inputs: Memory is normalized against each pod's individual limit, making the model portable across pods with different memory configurations
- LLM-as-last-resort: The LLM is only invoked on confirmed anomalies, keeping API costs minimal while providing intelligent, context-aware remediation
- CrashLoopBackOff triage: Distinguishes between recoverable anomalies (auto-fix) and critical failures (human escalation), avoiding dangerous automated actions on already-failing pods
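The triage decision can be sketched as a small pure function over the pod object returned by the Kubernetes API. The `status.containerStatuses[].state.waiting.reason` path is the standard place where `CrashLoopBackOff` appears; the function names, the returned action strings, and the choice to check for a crash loop before the anomaly flag are assumptions, not the agent's confirmed logic.

```python
def is_crash_looping(pod: dict) -> bool:
    """Return True if any container is waiting with reason CrashLoopBackOff."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for container in statuses:
        waiting = container.get("state", {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False


def triage(pod: dict, anomaly: bool) -> str:
    """Pick the next action (hypothetical action names)."""
    if is_crash_looping(pod):
        return "escalate"  # already failing: alert a human, do not auto-fix
    if anomaly:
        return "ask_llm"   # recoverable: request a kubectl suggestion
    return "none"          # healthy: no LLM call, no API cost
```

Keeping this decision free of I/O makes it straightforward to unit-test against recorded pod objects before any automated action is wired in.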
- Multi-pod monitoring with dynamic pod discovery
- CPU usage anomaly detection alongside memory
- Slack/PagerDuty integration for critical alerts
- Automatic execution of LLM-suggested commands (with approval workflow)
- Grafana dashboard for real-time anomaly visualization
- Model retraining pipeline on new cluster data
Mounusha — GitHub
Built with a focus on proactive reliability engineering for production Kubernetes environments.
