🛡️ KubePulse — Autonomous Kubernetes AI Monitoring Agent

Real-time pod anomaly detection and LLM-powered self-healing for Kubernetes workloads.

Overview

KubePulse is an autonomous AIOps agent that continuously monitors Kubernetes pod memory usage, predicts anomalies before they cause outages, and leverages a Large Language Model to automatically suggest remediation commands — all in real time.

Traditional Kubernetes monitoring tools typically alert you after a problem occurs. KubePulse is predictive: it uses a hybrid LSTM + LightGBM ML pipeline to forecast memory behavior and flag anomalies before they escalate into CrashLoopBackOff or OOMKilled events.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        KubePulse Agent                          │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌───────────────────┐  │
│  │  Prometheus  │───▶│  Node.js     │───▶│  Python ML Engine │  │
│  │  (Metrics)   │    │  Orchestrator│    │  (LSTM + LightGBM)│  │
│  └──────────────┘    └──────┬───────┘    └────────┬──────────┘  │
│           ▲                 │                     │             │
│           │        POST     │            Anomaly? │             │
│  ┌────────┴───────┐         │                     ▼             │
│  │  Go Prometheus │◀────────┘         ┌───────────────────────┐ │
│  │  Exporter      │  /report          │  Gemini 1.5 Pro (LLM) │ │
│  │  /metrics      │                   │  kubectl fix command  │ │
│  └────────────────┘                   └───────────────────────┘ │
│  ┌──────────────┐                                               │
│  │  Kubernetes  │◀── HPA can scale based on custom metrics      │
│  │  API Server  │                                               │
│  └──────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘

Data Flow

1. Metrics Collection: Queries Prometheus once per second for container_memory_usage_bytes of the target pod.
2. Normalization: Each memory value is divided by the pod's configured memory limit (fetched from the Kubernetes API), giving a ratio where 1.0 means the limit.
3. Sliding Window: Maintains a rolling window of the 10 most recent ratios, fed to the model as a time-series sequence.
4. ML Prediction: The Node.js process streams the window to a Python subprocess via stdin; the LSTM model forecasts the next memory value.
5. Anomaly Classification: The predicted value is passed to a LightGBM classifier, which labels it healthy or anomalous.
6. Metrics Export: Every prediction (healthy or anomalous) is POSTed to the Go Prometheus Exporter, which exposes custom metrics on /metrics for Prometheus to scrape.
7. LLM Remediation: If an anomaly is detected, the agent queries Gemini 1.5 Pro with the pod's context and receives a suggested kubectl fix command.
8. Critical Triage: If the pod is already in CrashLoopBackOff, the agent raises a cluster-administrator alert instead of suggesting an automated fix.
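
The normalization and sliding-window steps above can be sketched in a few lines of Python. This is a minimal illustration, not the code in index.js (where these steps run in Node.js); the class name SlidingWindow and its methods are hypothetical, and only the window size of 10 and the divide-by-limit normalization come from the description above.

```python
from collections import deque
from typing import Deque, List, Optional

WINDOW_SIZE = 10  # window length taken from the Data Flow description


class SlidingWindow:
    """Rolling window of normalized memory ratios (hypothetical helper)."""

    def __init__(self, limit_bytes: int, size: int = WINDOW_SIZE) -> None:
        if limit_bytes <= 0:
            raise ValueError("limit_bytes must be positive")
        self.limit_bytes = limit_bytes
        self.size = size
        # deque(maxlen=...) drops the oldest sample automatically.
        self.values: Deque[float] = deque(maxlen=size)

    def push(self, usage_bytes: float) -> Optional[List[float]]:
        """Add one sample; return the window once it is full, else None."""
        self.values.append(usage_bytes / self.limit_bytes)
        if len(self.values) < self.size:
            return None  # still warming up
        return list(self.values)


if __name__ == "__main__":
    window = SlidingWindow(limit_bytes=512 * 1024 * 1024)
    for second in range(12):
        usage = (200 + 10 * second) * 1024 * 1024  # synthetic, steadily growing usage
        ready = window.push(usage)
        print(second, "warming up" if ready is None else [round(v, 3) for v in ready])
```

With one sample per second, the first nine pushes return None, which corresponds to the warm-up period before predictions start.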

ML Pipeline

Hybrid LSTM + LightGBM Model

Component	Role	Details
LSTM	Time-series forecasting	Trained on historical pod memory sequences (10-step input → next value)
LightGBM	Anomaly classification	Binary classifier on LSTM-predicted output: `0` = healthy, `1` = anomaly

The hybrid approach separates concerns: the LSTM captures temporal patterns in memory usage, while LightGBM provides fast, interpretable anomaly classification on top of the forecast. The design is intended to produce fewer false positives than static threshold alerting.
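
The two-stage flow (forecast first, then classify the forecast) can be sketched as follows. The forecaster and classifier here are deliberately trivial stand-ins so the example runs without TensorFlow or LightGBM; they are not the trained models shipped in the repository, and the function names are hypothetical. In the real engine, the two callables would presumably wrap the loaded Keras model and the pickled LightGBM classifier.

```python
from typing import Callable, Sequence, Tuple

Forecaster = Callable[[Sequence[float]], float]
Classifier = Callable[[float], int]


def naive_forecaster(window: Sequence[float]) -> float:
    """Stand-in for the LSTM: extend the last step linearly."""
    return window[-1] + (window[-1] - window[-2])


def placeholder_classifier(predicted: float) -> int:
    """Stand-in for LightGBM: 1 = anomaly, 0 = healthy.

    A fixed cut-off is used only to keep the sketch runnable; the
    real classifier is a trained model, not a threshold.
    """
    return 1 if predicted >= 1.0 else 0


def predict(
    window: Sequence[float],
    forecaster: Forecaster = naive_forecaster,
    classifier: Classifier = placeholder_classifier,
) -> Tuple[float, int]:
    """Stage 1: forecast the next ratio. Stage 2: classify the forecast."""
    if len(window) != 10:
        raise ValueError("expected a 10-step window")
    predicted = forecaster(window)
    return predicted, classifier(predicted)


if __name__ == "__main__":
    sample = [0.42, 0.45, 0.48, 0.51, 0.53, 0.58, 0.63, 0.71, 0.79, 0.88]
    print(predict(sample))
```

Keeping the two stages behind plain callables makes it easy to swap either model without touching the calling code.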

Tech Stack

Layer	Technology
Orchestration	Node.js (ESM), Express
Kubernetes Integration	Kubernetes REST API (`kubectl proxy`)
Metrics	Prometheus + PromQL
Custom Metrics Exporter	Go 1.22, Prometheus `client_golang` SDK
ML Inference	TensorFlow/Keras (LSTM), LightGBM, NumPy
LLM Integration	LangChain + Google Gemini 1.5 Pro
IPC	Node.js `child_process` → Python subprocess via `stdin/stdout`
Containerization	Docker (multi-stage build)
Deployment	Kubernetes Deployment + Service YAML
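
The IPC row above describes a Python subprocess driven over stdin/stdout. A minimal sketch of the Python side of such a loop is shown below. The wire format (one JSON array per line in, one JSON object per line out) and the reply field names are assumptions made for illustration; the README does not specify the actual protocol used by hybrid_predict.py.

```python
import io
import json
import sys
from typing import Callable, Sequence, TextIO, Tuple

PredictFn = Callable[[Sequence[float]], Tuple[float, int]]


def serve(predict_fn: PredictFn, stdin: TextIO = sys.stdin, stdout: TextIO = sys.stdout) -> None:
    """Read one JSON window per line and write one JSON reply per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            window = json.loads(line)
            predicted, label = predict_fn(window)
            reply = {"predicted": predicted, "anomaly": label}  # assumed field names
        except Exception as exc:  # report the failure instead of killing the loop
            reply = {"error": str(exc)}
        stdout.write(json.dumps(reply) + "\n")
        stdout.flush()  # the parent process reads from a pipe, so flush each reply


if __name__ == "__main__":
    # Demo with in-memory streams instead of a real parent process.
    fake_in = io.StringIO("[0.5, 0.5]\nnot json\n")
    fake_out = io.StringIO()
    serve(lambda w: (sum(w) / len(w), 0), fake_in, fake_out)
    print(fake_out.getvalue(), end="")
```

Flushing after every reply matters here: without it, output written to a pipe is block-buffered and the Node.js side could wait indefinitely for a prediction.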

Project Structure

KubePulse/
├── index.js                        # Main agent: metrics collection, orchestration, LLM integration
├── hybrid_predict.py               # ML inference engine: LSTM + LightGBM anomaly detection
├── hybrid_lstm_model.keras         # Pre-trained LSTM model (TensorFlow/Keras)
├── hybrid_lstm_model.h5            # LSTM model (HDF5 format)
├── lightgbm_anomaly.pkl            # Trained LightGBM anomaly classifier
├── exp.js                          # Utility: K8s API exploration script
├── package.json                    # Node.js dependencies
└── kubeguard-exporter/             # Go Prometheus custom metrics exporter
    ├── main.go                     # HTTP server exposing /report and /metrics
    ├── go.mod                      # Go module definition
    ├── Dockerfile                  # Multi-stage Docker build
    └── k8s/
        └── exporter-deployment.yaml  # Kubernetes Deployment + Service

Go Prometheus Exporter (`kubepulse-exporter`)

A purpose-built Go microservice that acts as the metrics bridge between the KubePulse agent and Prometheus.

Exposed Endpoints

Endpoint	Method	Description
`/report`	POST	Receives JSON predictions from the Node.js agent
`/metrics`	GET	Prometheus scrape endpoint with custom metrics
`/healthz`	GET	Liveness/readiness health check

Custom Prometheus Metrics

Metric	Type	Description
`kubepulse_predicted_memory_ratio`	Gauge	LSTM-forecasted memory usage as a fraction of the pod's limit (nominally 0–1; values above 1 mean the forecast exceeds the limit), labeled per pod
`kubepulse_anomaly_detected`	Gauge	1 = anomaly active, 0 = healthy (per pod label)
`kubepulse_anomalies_total`	Counter	Cumulative anomaly count since exporter start
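
To give a feel for what a scrape of /metrics might return, the sketch below renders the three metrics in the Prometheus text exposition format. It is written in Python purely for illustration; the actual exporter is the Go service and produces this output through client_golang. The label name pod and the HELP strings are assumptions, as is the choice to leave the counter unlabeled.

```python
def escape_label_value(value: str) -> str:
    """Escape backslash, double quote and newline, as the text format requires."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_metrics(pod: str, predicted_ratio: float, anomaly: bool, anomalies_total: int) -> str:
    """Render one pod's KubePulse metrics as Prometheus exposition text."""
    label = f'{{pod="{escape_label_value(pod)}"}}'  # assumed label name: pod
    lines = [
        "# HELP kubepulse_predicted_memory_ratio LSTM-forecasted memory usage ratio.",
        "# TYPE kubepulse_predicted_memory_ratio gauge",
        f"kubepulse_predicted_memory_ratio{label} {predicted_ratio}",
        "# HELP kubepulse_anomaly_detected 1 if an anomaly is active, 0 if healthy.",
        "# TYPE kubepulse_anomaly_detected gauge",
        f"kubepulse_anomaly_detected{label} {int(anomaly)}",
        "# HELP kubepulse_anomalies_total Anomalies seen since exporter start.",
        "# TYPE kubepulse_anomalies_total counter",
        f"kubepulse_anomalies_total {anomalies_total}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_metrics("finalpod-77c649c5fc-tzvnb", 1.24, True, 3), end="")
```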

Deploy the Exporter

# Build the Docker image
cd kubeguard-exporter
docker build -t kubepulse-exporter:latest .

# Deploy to Kubernetes
kubectl apply -f k8s/exporter-deployment.yaml

# Verify it's running
kubectl get pods -l app=kubepulse-exporter

HPA Integration (Horizontal Pod Autoscaler)

Because KubePulse exposes kubepulse_predicted_memory_ratio as a regular Prometheus metric, a Kubernetes HPA can consume it through the Prometheus Adapter and scale workloads on predicted memory rather than observed memory.
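
As a rough sketch of what such an HPA could look like, the Python snippet below builds an autoscaling/v2 HorizontalPodAutoscaler manifest and prints it as JSON, which kubectl accepts alongside YAML. Everything specific here is an assumption: the target Deployment name, the replica bounds, the 0.8 target ratio (written as the quantity 800m), and above all that the Prometheus Adapter has been configured to expose the metric as a per-pod custom metric. The repository does not ship this manifest.

```python
import json
from typing import Any, Dict


def build_hpa(
    deployment: str,
    namespace: str = "default",
    min_replicas: int = 1,
    max_replicas: int = 5,
    target_ratio: str = "800m",  # Kubernetes quantity for 0.8
) -> Dict[str, Any]:
    """Build an HPA manifest that scales on the predicted memory ratio."""
    return {
        "apiVersion": "autoscaling/v2",
        "kind": "HorizontalPodAutoscaler",
        "metadata": {"name": f"{deployment}-kubepulse", "namespace": namespace},
        "spec": {
            "scaleTargetRef": {
                "apiVersion": "apps/v1",
                "kind": "Deployment",
                "name": deployment,
            },
            "minReplicas": min_replicas,
            "maxReplicas": max_replicas,
            "metrics": [
                {
                    "type": "Pods",  # assumes the adapter maps the metric to pods
                    "pods": {
                        "metric": {"name": "kubepulse_predicted_memory_ratio"},
                        "target": {"type": "AverageValue", "averageValue": target_ratio},
                    },
                }
            ],
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_hpa("finalpod"), indent=2))
```

The printed JSON can be piped to kubectl apply -f - once the adapter actually serves the metric through the custom metrics API.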

Getting Started

Prerequisites

Kubernetes cluster running locally (e.g., minikube or kind)
Prometheus deployed in the cluster with container_memory_usage_bytes metrics available
Python 3.10+ with pip
Node.js 20+
kubectl proxy running on localhost:8001
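
To check that the Prometheus prerequisite is met, it can help to query the metric by hand through the Prometheus HTTP API. The sketch below builds an instant-query URL and parses the standard response envelope. The base address http://localhost:9090 and the pod label selector are assumptions about your setup; the helper names are hypothetical and the actual query in index.js may differ.

```python
import json
from typing import List
from urllib.parse import urlencode


def build_query_url(base_url: str, pod: str) -> str:
    """Instant-query URL for the pod's container_memory_usage_bytes."""
    promql = f'container_memory_usage_bytes{{pod="{pod}"}}'
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_vector(body: str) -> List[float]:
    """Extract sample values from a Prometheus instant-query response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each sample is {"metric": {...}, "value": [timestamp, "value-as-string"]}.
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


if __name__ == "__main__":
    print(build_query_url("http://localhost:9090", "your-pod-name-here"))
    canned = json.dumps({
        "status": "success",
        "data": {
            "resultType": "vector",
            "result": [{"metric": {"pod": "demo"}, "value": [1700000000.0, "225443840"]}],
        },
    })
    print(parse_instant_vector(canned))
```

An empty result list usually means the selector does not match any series, for example because the pod name is wrong or cAdvisor metrics are not being scraped.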

1. Clone the Repository

git clone https://github.com/Mounusha25/Kube_Pulse.git
cd Kube_Pulse

2. Install Node.js Dependencies

npm install

3. Install Python Dependencies

pip install tensorflow keras lightgbm numpy joblib

4. Start the Kubernetes API Proxy

kubectl proxy --port=8001
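
With the proxy running, the pod object is available at http://localhost:8001/api/v1/namespaces/<namespace>/pods/<pod-name>, and the memory limit used for normalization sits under spec.containers[].resources.limits.memory. The sketch below shows one way to turn that value into bytes. It is a simplified parser covering common binary and decimal suffixes only, not the full Kubernetes quantity grammar, and the helper names are hypothetical.

```python
from typing import Any, Dict, Optional

# Common suffixes only; milli-units and exponent notation are not handled.
_SUFFIXES = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory_quantity(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '1G' to bytes."""
    # Try two-character suffixes first so 'Mi' is not mistaken for 'M'.
    for suffix in sorted(_SUFFIXES, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * _SUFFIXES[suffix])
    return int(float(quantity))  # plain byte count


def pod_memory_limit(pod: Dict[str, Any], container_index: int = 0) -> Optional[int]:
    """Return the container's memory limit in bytes, or None if unset."""
    container = pod["spec"]["containers"][container_index]
    limits = (container.get("resources") or {}).get("limits") or {}
    memory = limits.get("memory")
    return parse_memory_quantity(memory) if memory else None


if __name__ == "__main__":
    pod = {"spec": {"containers": [{"name": "app", "resources": {"limits": {"memory": "512Mi"}}}]}}
    print(pod_memory_limit(pod))
```

A pod without a memory limit returns None here; normalization has no denominator in that case, so the caller needs to decide how to handle it.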

5. Configure the Target Pod

In index.js, update the pod name to match your deployment:

const podName = "your-pod-name-here"

6. Run KubePulse

npm run dev

KubePulse will begin collecting metrics. After about 10 seconds of warm-up (the time needed to fill the 10-point window at one sample per second), ML predictions start streaming and anomaly detection goes live.

Sample Output

[KubePulse] Collecting memory metrics...
[0.42, 0.45, 0.48, 0.51, 0.53, 0.58, 0.63, 0.71, 0.79, 0.88]

✅ Healthy. Predicted Memory: 0.91

⚠️  Anomaly Detected! Predicted Memory: 1.24 (exceeds limit)
🧠 LLM suggests: kubectl set resources deployment/finalpod --limits=memory=512Mi

⚠️  CRITICAL: Pod finalpod-77c649c5fc-tzvnb is leaking memory. Notify cluster administrator.

How It Compares to Traditional Approaches

Approach	Detection Timing	False Positives	Auto-Remediation
Threshold alerts	After breach	High	❌
Prometheus alerting rules	After breach	Medium	❌
KubePulse (LSTM + LightGBM)	Before breach	Low	✅ LLM-suggested fix (not auto-executed)

Key Engineering Decisions

Subprocess IPC over REST: The Python ML engine runs as a persistent subprocess rather than a separate microservice, avoiding HTTP overhead for high-frequency (1 Hz) inference.
Go for the metrics layer: The exporter is written in Go using the official prometheus/client_golang SDK, the same library used by production Kubernetes operators, which keeps the scrape path lightweight and idiomatic.
Normalized memory inputs: Memory is normalized against each pod's individual limit, making the model portable across pods with different memory configurations.
LLM-as-last-resort: The LLM is invoked only when the classifier flags an anomaly, keeping API costs low while still providing context-aware remediation suggestions.
CrashLoopBackOff triage: Distinguishes between recoverable anomalies (LLM-suggested fix) and critical failures (human escalation), avoiding risky automated advice for pods that are already failing.
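
The triage decision can be sketched as a small function over the pod object returned by the Kubernetes API. Detecting CrashLoopBackOff through status.containerStatuses[].state.waiting.reason matches how the API reports it; the function names, the returned action strings, and the choice to run triage only after an anomaly are assumptions for illustration.

```python
from typing import Any, Dict


def is_crash_looping(pod: Dict[str, Any]) -> bool:
    """True if any container is waiting with reason CrashLoopBackOff."""
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for status in statuses:
        waiting = (status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False


def triage(anomaly: int, pod: Dict[str, Any]) -> str:
    """Pick the next action: 'healthy', 'ask_llm' or 'escalate'."""
    if not anomaly:
        return "healthy"
    if is_crash_looping(pod):
        return "escalate"  # notify a human instead of suggesting a fix
    return "ask_llm"


if __name__ == "__main__":
    failing = {
        "status": {
            "containerStatuses": [
                {"name": "app", "state": {"waiting": {"reason": "CrashLoopBackOff"}}}
            ]
        }
    }
    running = {"status": {"containerStatuses": [{"name": "app", "state": {"running": {}}}]}}
    print(triage(1, failing), triage(1, running), triage(0, running))
```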

Future Improvements

Multi-pod monitoring with dynamic pod discovery
CPU usage anomaly detection alongside memory
Slack/PagerDuty integration for critical alerts
Automatic execution of LLM-suggested commands (with approval workflow)
Grafana dashboard for real-time anomaly visualization
Model retraining pipeline on new cluster data

Author

Mounusha — GitHub

Built with a focus on proactive reliability engineering for production Kubernetes environments.
