🛡️ KubePulse — Autonomous Kubernetes AI Monitoring Agent

Real-time pod anomaly detection and LLM-powered self-healing for Kubernetes workloads.

Built with: Node.js, Python, Go, Kubernetes, Prometheus, TensorFlow, LangChain


Overview

KubePulse is an autonomous AIOps agent that continuously monitors Kubernetes pod memory usage, predicts anomalies before they cause outages, and leverages a Large Language Model to automatically suggest remediation commands — all in real time.

Traditional Kubernetes monitoring tools alert you after a problem occurs. KubePulse is predictive: it uses a hybrid LSTM + LightGBM ML pipeline to forecast memory behavior and detect anomalies before they escalate into CrashLoopBackOff or OOMKilled events.


Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        KubePulse Agent                          │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌───────────────────┐  │
│  │  Prometheus  │───▶│  Node.js     │───▶│  Python ML Engine │  │
│  │  (Metrics)   │    │  Orchestrator│    │  (LSTM + LightGBM)│  │
│  └──────────────┘    └──────┬───────┘    └────────┬──────────┘  │
│           ▲                 │                     │             │
│           │        POST     │            Anomaly? │             │
│  ┌────────┴───────┐         │                     ▼             │
│  │  Go Prometheus │◀────────┘         ┌───────────────────────┐ │
│  │  Exporter      │  /report          │  Gemini 1.5 Pro (LLM) │ │
│  │  /metrics      │                   │  kubectl fix command  │ │
│  └────────────────┘                   └───────────────────────┘ │
│  ┌──────────────┐                                               │
│  │  Kubernetes  │◀── HPA can scale based on custom metrics      │
│  │  API Server  │                                               │
│  └──────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘

Data Flow

  1. Metrics Collection — Queries Prometheus every second for container_memory_usage_bytes of the target pod
  2. Normalization — Memory values are normalized against the pod's configured memory limit (fetched from K8s API)
  3. Sliding Window — Maintains a rolling window of 10 data points fed as a time-series sequence
  4. ML Prediction — The Node.js process streams the window to a Python subprocess via stdin; the LSTM model forecasts the next memory value
  5. Anomaly Classification — The predicted value is passed to a LightGBM classifier to determine if it's anomalous
  6. Metrics Export — Every prediction (healthy or anomalous) is POSTed to the Go Prometheus Exporter which exposes custom metrics on /metrics for Prometheus to scrape
  7. LLM Remediation — If an anomaly is detected, the agent queries Gemini 1.5 Pro with full pod context and gets a targeted kubectl fix command
  8. Critical Triage — If the pod is already in CrashLoopBackOff, the agent escalates to a cluster administrator alert instead of auto-remediating
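
Steps 1 to 3 above can be sketched in a few lines. The agent itself does this work in Node.js (index.js); the Python below is only an illustration of the same logic, and every name in it (build_query_url, parse_instant_value, normalize, SlidingWindow) is hypothetical. It assumes the standard Prometheus instant-query endpoint /api/v1/query and, as a simplification, reads only the first series in the response.

```python
from collections import deque
from urllib.parse import urlencode

WINDOW_SIZE = 10  # rolling window length described in step 3


def build_query_url(prometheus_base: str, pod: str) -> str:
    """Build a Prometheus instant-query URL for one pod's memory usage."""
    promql = f'container_memory_usage_bytes{{pod="{pod}"}}'
    return f"{prometheus_base}/api/v1/query?{urlencode({'query': promql})}"


def parse_instant_value(response: dict) -> float:
    """Pull the sample out of an instant-query response.

    Simplification: only the first returned series is used.
    """
    result = response["data"]["result"]
    if not result:
        raise ValueError("Prometheus returned no series for this pod")
    _timestamp, raw_value = result[0]["value"]  # value is [ts, "string"]
    return float(raw_value)


def normalize(usage_bytes: float, limit_bytes: float) -> float:
    """Express usage as a fraction of the pod's memory limit."""
    if limit_bytes <= 0:
        raise ValueError("pod has no positive memory limit")
    return usage_bytes / limit_bytes


class SlidingWindow:
    """Keeps the most recent WINDOW_SIZE normalized samples."""

    def __init__(self, size: int = WINDOW_SIZE):
        self._values = deque(maxlen=size)

    def push(self, value: float):
        """Add a sample; return the full window once warm-up is over."""
        self._values.append(value)
        if len(self._values) == self._values.maxlen:
            return list(self._values)
        return None  # still warming up
```

With one sample per second, push() returns None for the first nine samples and a 10-element list from the tenth onward, which matches the 10-second warm-up described under Getting Started.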

ML Pipeline

Hybrid LSTM + LightGBM Model

| Component | Role | Details |
| --- | --- | --- |
| LSTM | Time-series forecasting | Trained on historical pod memory sequences (10-step input → next value) |
| LightGBM | Anomaly classification | Binary classifier on LSTM-predicted output: 0 = healthy, 1 = anomaly |

The hybrid approach separates concerns: the LSTM captures temporal patterns in memory usage, while LightGBM provides fast, interpretable anomaly classification on top of the forecast. The design is intended to reduce false positives relative to static threshold alerting.
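
A minimal sketch of the two-stage flow, written as the kind of stdin/stdout loop a persistent inference subprocess could run. It is not the contents of hybrid_predict.py: the line-delimited JSON wire format and the function names are assumptions, and naive_forecast and threshold_classifier are stand-ins for the trained LSTM and LightGBM models so the example runs without TensorFlow or LightGBM installed.

```python
import io
import json
from typing import Callable, Sequence, TextIO

Forecaster = Callable[[Sequence[float]], float]
Classifier = Callable[[float], int]


def serve(stdin: TextIO, stdout: TextIO,
          forecast: Forecaster, classify: Classifier) -> None:
    """Read one JSON window per line, write one JSON result per line."""
    for line in stdin:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between messages
        window = json.loads(line)
        predicted = float(forecast(window))   # stage 1: forecast next value
        anomaly = int(classify(predicted))    # stage 2: classify the forecast
        stdout.write(json.dumps({"predicted": predicted,
                                 "anomaly": anomaly}) + "\n")
        stdout.flush()  # the parent process is waiting on this line


def naive_forecast(window: Sequence[float]) -> float:
    """Stand-in for the LSTM: extend the last step linearly."""
    return window[-1] + (window[-1] - window[-2])


def threshold_classifier(predicted: float) -> int:
    """Stand-in for LightGBM: flag forecasts above the memory limit."""
    return 1 if predicted > 1.0 else 0


# In-memory demo using the window from the Sample Output section.
demo_in = io.StringIO(
    "[0.42, 0.45, 0.48, 0.51, 0.53, 0.58, 0.63, 0.71, 0.79, 0.88]\n")
demo_out = io.StringIO()
serve(demo_in, demo_out, naive_forecast, threshold_classifier)
print(demo_out.getvalue(), end="")
```

Passing the models in as plain callables keeps the loop testable; in a real run, forecast and classify would presumably wrap the loaded hybrid_lstm_model.keras and lightgbm_anomaly.pkl files, and stdin/stdout would be the process's own streams.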


Tech Stack

| Layer | Technology |
| --- | --- |
| Orchestration | Node.js (ESM), Express |
| Kubernetes Integration | Kubernetes REST API (kubectl proxy) |
| Metrics | Prometheus + PromQL |
| Custom Metrics Exporter | Go 1.22, Prometheus client_golang SDK |
| ML Inference | TensorFlow/Keras (LSTM), LightGBM, NumPy |
| LLM Integration | LangChain + Google Gemini 1.5 Pro |
| IPC | Node.js child_process → Python subprocess via stdin/stdout |
| Containerization | Docker (multi-stage build) |
| Deployment | Kubernetes Deployment + Service YAML |

Project Structure

KubePulse/
├── index.js                        # Main agent: metrics collection, orchestration, LLM integration
├── hybrid_predict.py               # ML inference engine: LSTM + LightGBM anomaly detection
├── hybrid_lstm_model.keras         # Pre-trained LSTM model (TensorFlow/Keras)
├── hybrid_lstm_model.h5            # LSTM model (HDF5 format)
├── lightgbm_anomaly.pkl            # Trained LightGBM anomaly classifier
├── exp.js                          # Utility: K8s API exploration script
├── package.json                    # Node.js dependencies
└── kubepulse-exporter/             # Go Prometheus custom metrics exporter
    ├── main.go                     # HTTP server exposing /report and /metrics
    ├── go.mod                      # Go module definition
    ├── Dockerfile                  # Multi-stage Docker build
    └── k8s/
        └── exporter-deployment.yaml  # Kubernetes Deployment + Service

Go Prometheus Exporter (kubepulse-exporter)

A purpose-built Go microservice that acts as the metrics bridge between the KubePulse agent and Prometheus.

Exposed Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /report | POST | Receives JSON predictions from the Node.js agent |
| /metrics | GET | Prometheus scrape endpoint with custom metrics |
| /healthz | GET | Liveness/readiness health check |

Custom Prometheus Metrics

| Metric | Type | Description |
| --- | --- | --- |
| kubepulse_predicted_memory_ratio | Gauge | LSTM-forecasted memory usage (0–1, per pod label) |
| kubepulse_anomaly_detected | Gauge | 1 = anomaly active, 0 = healthy (per pod label) |
| kubepulse_anomalies_total | Counter | Cumulative anomaly count since exporter start |
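
The difference between the two gauges and the counter is easiest to see in code. The real exporter is written in Go with client_golang; the Python model below only illustrates the update semantics. The report fields (pod, predicted, anomaly), the ExporterState class, and the choice to leave the counter unlabeled are assumptions, not the exporter's documented schema.

```python
from dataclasses import dataclass, field


@dataclass
class ExporterState:
    """In-memory model of the three KubePulse metrics."""
    predicted_ratio: dict = field(default_factory=dict)    # gauge, per pod
    anomaly_detected: dict = field(default_factory=dict)   # gauge, per pod
    anomalies_total: int = 0                                # counter

    def report(self, pod: str, predicted: float, anomaly: bool) -> None:
        """Apply one prediction, as a POST to /report might."""
        # Gauges are overwritten with the latest value.
        self.predicted_ratio[pod] = float(predicted)
        self.anomaly_detected[pod] = 1 if anomaly else 0
        # The counter only ever goes up (assumed: once per anomalous report).
        if anomaly:
            self.anomalies_total += 1

    def render(self) -> str:
        """Render the state in the Prometheus text exposition format."""
        lines = ["# TYPE kubepulse_predicted_memory_ratio gauge"]
        for pod, value in sorted(self.predicted_ratio.items()):
            lines.append(
                f'kubepulse_predicted_memory_ratio{{pod="{pod}"}} {value}')
        lines.append("# TYPE kubepulse_anomaly_detected gauge")
        for pod, value in sorted(self.anomaly_detected.items()):
            lines.append(f'kubepulse_anomaly_detected{{pod="{pod}"}} {value}')
        lines.append("# TYPE kubepulse_anomalies_total counter")
        lines.append(f"kubepulse_anomalies_total {self.anomalies_total}")
        return "\n".join(lines) + "\n"
```

After a healthy report, an anomalous report, and another healthy report for the same pod, kubepulse_anomaly_detected is back at 0 while kubepulse_anomalies_total stays at 1. That is why the gauge suits "is something wrong right now" alerts and the counter suits rate() queries over time.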

Deploy the Exporter

# Build the Docker image
cd kubepulse-exporter
docker build -t kubepulse-exporter:latest .

# Deploy to Kubernetes
kubectl apply -f k8s/exporter-deployment.yaml

# Verify it's running
kubectl get pods -l app=kubepulse-exporter

HPA Integration (Horizontal Pod Autoscaler)

Since KubePulse exposes kubepulse_predicted_memory_ratio as a real Prometheus metric, Kubernetes HPA can use it via the Prometheus Adapter to auto-scale workloads based on predicted memory — closing the full AIOps loop.
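
To make the scaling behavior concrete, here is a sketch of the replica calculation the HPA controller documents: desired = ceil(current × metric / target), skipped when the ratio is within a tolerance of 1.0 (0.1 by default). The function name and the example target values are hypothetical, and the real controller also applies stabilization windows and min/max replica bounds that are omitted here.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target metric must be positive")
    ratio = current_metric / target_metric
    # Within tolerance of the target: the controller leaves replicas alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return max(1, math.ceil(current_replicas * ratio))
```

For example, with 3 replicas, a predicted memory ratio of 0.9, and a hypothetical target of 0.7, the function asks for 4 replicas. Because the input is a forecast rather than a measurement, the scale-up would be requested before the pods actually reach that memory level.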


Getting Started

Prerequisites

  • Kubernetes cluster running locally (e.g., minikube or kind)
  • Prometheus deployed in the cluster with container_memory_usage_bytes metrics available
  • Python 3.10+ with pip
  • Node.js 20+
  • kubectl proxy running on localhost:8001

1. Clone the Repository

git clone https://github.com/Mounusha25/Kube_Pulse.git
cd Kube_Pulse

2. Install Node.js Dependencies

npm install

3. Install Python Dependencies

pip install tensorflow keras lightgbm numpy joblib

4. Start the Kubernetes API Proxy

kubectl proxy --port=8001

5. Configure the Target Pod

In index.js, update the pod name to match your deployment:

const podName = "your-pod-name-here"

6. Run KubePulse

npm run dev

KubePulse will begin collecting metrics. After 10 seconds of warm-up, ML predictions start streaming and anomaly detection goes live.


Sample Output

[KubePulse] Collecting memory metrics...
[0.42, 0.45, 0.48, 0.51, 0.53, 0.58, 0.63, 0.71, 0.79, 0.88]

✅ Healthy. Predicted Memory: 0.91

⚠️  Anomaly Detected! Predicted Memory: 1.24 (exceeds limit)
🧠 LLM suggests: kubectl set resources deployment/finalpod --limits=memory=512Mi

⚠️  CRITICAL: Pod finalpod-77c649c5fc-tzvnb is leaking memory. Notify cluster administrator.

How It Compares to Traditional Approaches

| Approach | Detection Timing | False Positives | Auto-Remediation |
| --- | --- | --- | --- |
| Threshold alerts | After breach | High | No |
| Prometheus alerting rules | After breach | Medium | No |
| KubePulse (LSTM + LightGBM) | Before breach | Low | ✅ LLM-powered |

Key Engineering Decisions

  • Subprocess IPC over REST — The Python ML engine runs as a persistent subprocess rather than a separate microservice, eliminating HTTP overhead for high-frequency (1 Hz) inference
  • Go for the metrics layer — The exporter is written in Go using the official prometheus/client_golang SDK — the same stack used by production K8s operators — keeping the scrape path lightweight and idiomatic
  • Normalized memory inputs — Memory is normalized against each pod's individual limit, making the model portable across pods with different memory configurations
  • LLM-as-last-resort — The LLM is only invoked on confirmed anomalies, keeping API costs minimal while providing intelligent, context-aware remediation
  • CrashLoopBackOff triage — Distinguishes between recoverable anomalies (auto-fix) and critical failures (human escalation), avoiding dangerous automated actions on already-failing pods
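
The last two decisions combine into a small routing function. The sketch below is an assumption about how they fit together, not a copy of index.js: the Action enum and function names are hypothetical, and the ordering (check for an anomaly first, then check pod state) is one reasonable reading of the list above. The pod dictionary follows the shape of a Pod object returned by the Kubernetes API, where a crash-looping container reports state.waiting.reason as "CrashLoopBackOff".

```python
from enum import Enum


class Action(Enum):
    NONE = "none"            # healthy: do nothing, spend no LLM calls
    ASK_LLM = "ask_llm"      # recoverable anomaly: request a kubectl fix
    ESCALATE = "escalate"    # already failing: notify an administrator


def is_crash_looping(pod: dict) -> bool:
    """True if any container in the pod is waiting in CrashLoopBackOff."""
    statuses = (pod.get("status") or {}).get("containerStatuses") or []
    for status in statuses:
        waiting = (status.get("state") or {}).get("waiting") or {}
        if waiting.get("reason") == "CrashLoopBackOff":
            return True
    return False


def triage(anomaly: bool, pod: dict) -> Action:
    """Decide what to do with one prediction for one pod."""
    if not anomaly:
        return Action.NONE        # LLM is only used on confirmed anomalies
    if is_crash_looping(pod):
        return Action.ESCALATE    # avoid automated changes on failing pods
    return Action.ASK_LLM
```

Keeping the decision in a pure function like this makes it straightforward to unit test the escalation path without a cluster or an LLM API key.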

Future Improvements

  • Multi-pod monitoring with dynamic pod discovery
  • CPU usage anomaly detection alongside memory
  • Slack/PagerDuty integration for critical alerts
  • Automatic execution of LLM-suggested commands (with approval workflow)
  • Grafana dashboard for real-time anomaly visualization
  • Model retraining pipeline on new cluster data

Author

Mounusha (GitHub: Mounusha25)


Built with a focus on proactive reliability engineering for production Kubernetes environments.
