Detector Documentation

This document describes all detectors included in infranow MVP and how they work.

Overview

Detectors are pluggable components that query Prometheus metrics and identify infrastructure problems. Each detector:

Runs at a configurable interval (default: 30-60 seconds)
Queries Prometheus for specific metrics
Applies deterministic rules to identify problems
Returns a list of Problem objects with severity, entity, and hints

Kubernetes Detectors

OOMKillDetector

Purpose: Detects containers that have been killed due to Out Of Memory (OOM) errors.

Entity Type: kubernetes_pod

Query:

increase(kube_pod_container_status_restarts_total{reason="OOMKilled"}[5m]) > 0

Severity: CRITICAL

Hint: "Container memory limit too low or memory leak detected"

Detection Logic:

Checks for any container restarts with reason "OOMKilled" in the last 5 minutes
Creates a problem for each affected container
Blast radius: 1 (single container)

Remediation:

Check container memory limits: kubectl describe pod <pod-name>
Review application memory usage patterns
Increase memory limits if necessary
Investigate for memory leaks

CrashLoopBackOffDetector

Purpose: Detects pods stuck in CrashLoopBackOff state.

Entity Type: kubernetes_pod

Query:

kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} > 0

Severity: FATAL

Hint: "Application startup failure or fatal runtime error"

Detection Logic:

Identifies containers waiting with reason "CrashLoopBackOff"
Indicates repeated application crashes
Blast radius: 1 (single container)

Remediation:

Check pod logs: kubectl logs <pod-name> --previous
Describe pod for events: kubectl describe pod <pod-name>
Review application startup code and dependencies
Check environment variables and configuration

ImagePullBackOffDetector

Purpose: Detects pods unable to pull container images.

Entity Type: kubernetes_pod

Query:

kube_pod_container_status_waiting_reason{reason=~"ImagePullBackOff|ErrImagePull"} > 0

Severity: CRITICAL

Hint: "Image not found or registry authentication failure"

Detection Logic:

Checks for containers waiting due to image pull failures
Matches both "ImagePullBackOff" and "ErrImagePull" reasons
Blast radius: 1 (single container)

Remediation:

Verify image name and tag: kubectl describe pod <pod-name>
Check registry authentication: kubectl get secrets
Verify network connectivity to registry
Ensure image exists in the registry

PodPendingDetector

Purpose: Detects pods stuck in Pending state for more than 5 minutes.

Entity Type: kubernetes_pod

Query:

kube_pod_status_phase{phase="Pending"} * on(namespace, pod) group_left() (time() - kube_pod_created > 300)

Severity: CRITICAL

Hint: "Insufficient cluster resources or scheduling constraints"

Detection Logic:

Identifies pods in Pending phase for >5 minutes
Combines pod phase and creation time metrics
Blast radius: 1 (single pod)

Remediation:

Check pod events: kubectl describe pod <pod-name>
Verify node resources: kubectl top nodes
Check for taints and tolerations
Review node selectors and affinity rules
Consider adding nodes if cluster is under-provisioned

Generic Detectors

HighErrorRateDetector

Purpose: Detects services with high HTTP 5xx error rates.

Entity Type: service

Query:

(rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m])) > 0.05

Severity: CRITICAL

Hint: "5xx error rate above 5% threshold"

Detection Logic:

Calculates 5xx error rate over 5-minute window
Threshold: 5% (configurable in detector)
Blast radius: 5 (assumes service affects multiple entities)

Remediation:

Check service logs for error details
Review recent deployments or changes
Verify backend service health
Check database connection pools
Review application error handling

Requirements:

Requires http_requests_total metric with status label
Common in services instrumented with Prometheus client libraries

DiskSpaceDetector

Purpose: Detects filesystems with low available space.

Entity Type: filesystem

Query:

(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) > 0.90

Severity:

WARNING: Usage > 90%
CRITICAL: Usage > 95%

Hint: "Disk usage above 90%"

Detection Logic:

Calculates filesystem usage percentage
WARNING threshold: 90%
CRITICAL threshold: 95%
Blast radius: 3 (could affect multiple services on node)

Remediation:

Identify large files: du -h /path | sort -rh | head -20
Clean up old logs or temporary files
Rotate or compress logs
Consider increasing disk size
Review log retention policies

Requirements:

Requires node_filesystem_avail_bytes and node_filesystem_size_bytes metrics
Typically provided by node_exporter

HighMemoryPressureDetector

Purpose: Detects nodes with high memory pressure.

Entity Type: node

Query:

(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.90

Severity: CRITICAL

Hint: "Memory pressure above 90%"

Detection Logic:

Calculates memory usage as percentage
Threshold: 90% of total memory
Uses MemAvailable (accounts for caches/buffers)
Blast radius: 10 (high impact on node)

Remediation:

Check memory usage by process: top or htop
Identify memory-hungry pods: kubectl top pods --all-namespaces
Review pod memory limits
Consider evicting or scaling down pods
Add memory to node or scale cluster

Requirements:

Requires node_memory_MemAvailable_bytes and node_memory_MemTotal_bytes metrics
Typically provided by node_exporter

Service Mesh Detectors

LinkerdControlPlaneDetector

Purpose: Detects linkerd control plane deployments with zero available replicas. When the control plane is down, proxy injection fails, mTLS breaks, and traffic routing stops for all meshed services.

Entity Type: service_mesh_control_plane

Query:

kube_deployment_status_replicas_available{namespace="linkerd"} == 0

Severity: FATAL

Blast Radius: 15 (affects all meshed services)

Hint: "Check pod status: kubectl get pods -n linkerd"

LinkerdProxyInjectionDetector

Purpose: Detects linkerd pods in CrashLoopBackOff. Catches proxy-injector, identity, and destination service failures.

Entity Type: service_mesh_control_plane

Query:

kube_pod_container_status_waiting_reason{namespace="linkerd",reason="CrashLoopBackOff"} > 0

Severity: CRITICAL

Blast Radius: 10

Hint: "Proxy injector or identity service failure"

IstioControlPlaneDetector

Purpose: Detects istiod with zero available replicas. When istiod is down, xDS config distribution stops and new deployments break.

Entity Type: service_mesh_control_plane

Query:

kube_deployment_status_replicas_available{namespace="istio-system",deployment="istiod"} == 0

Severity: FATAL

Blast Radius: 15

Hint: "Check pod status: kubectl get pods -n istio-system"

IstioSidecarInjectionDetector

Purpose: Detects istio-system pods in CrashLoopBackOff. Catches istiod, gateway, and pilot failures.

Entity Type: service_mesh_control_plane

Query:

kube_pod_container_status_waiting_reason{namespace="istio-system",reason="CrashLoopBackOff"} > 0

Severity: CRITICAL

Blast Radius: 10

Hint: "Sidecar injector or pilot failure"

Service Mesh Certificate Detectors

LinkerdCertExpiryDetector

Purpose: Detects linkerd identity certificates approaching expiry. Certificate expiry is the silent killer of service meshes — mTLS fails across all meshed services without warning.

Entity Type: service_mesh_certificate

Query:

(identity_cert_expiry_timestamp - time()) < 604800

Severity (tiered):

WARNING: < 7 days remaining
CRITICAL: < 48 hours remaining
FATAL: < 24 hours remaining or expired

Blast Radius: 20 (highest — cert expiry kills the entire mesh)

Interval: 60s

Hint: "Rotate certs: linkerd check --proxy; Renew: linkerd upgrade | kubectl apply -f -"

IstioCertExpiryDetector

Purpose: Detects istio root certificate approaching expiry. When the Citadel root cert expires, all workload certificates become invalid.

Entity Type: service_mesh_certificate

Query:

(citadel_server_root_cert_expiry_timestamp - time()) < 604800

Severity (tiered):

WARNING: < 7 days remaining
CRITICAL: < 48 hours remaining
FATAL: < 24 hours remaining or expired

Blast Radius: 20

Interval: 60s

Hint: "Check status: istioctl proxy-status; Rotate: istioctl create-remote-secret"

Trustwatch Detectors

TrustwatchCertExpiryDetector

Purpose: Detects certificates nearing expiry via trustwatch Prometheus metrics. Covers a broader trust surface than mesh-specific detectors: webhooks, API aggregation, external TLS, and mesh issuers.

Entity Type: trustwatch_certificate

Query:

trustwatch_cert_expires_in_seconds < 604800

Severity (tiered):

WARNING: < 7 days remaining
CRITICAL: < 48 hours remaining
FATAL: < 24 hours remaining or expired

Blast Radius: 10

Interval: 60s

Entity Format: trustwatch/{source}/{namespace}/{name} (e.g. trustwatch/webhook/kube-system/cert-manager-webhook)

Hint: "Run: trustwatch now"

Graceful absence: When trustwatch is not installed, the query returns empty results and no problems are reported.

TrustwatchProbeFailureDetector

Purpose: Detects TLS endpoints that trustwatch cannot reach. A probe failure means the endpoint is unreachable or returning invalid TLS.

Entity Type: trustwatch_certificate

Query:

trustwatch_probe_success == 0

Severity: CRITICAL

Blast Radius: 5

Interval: 60s

Entity Format: trustwatch/{source}/{namespace}/{name}

Hint: "Run: trustwatch now"

Graceful absence: When trustwatch is not installed, the query returns empty results and no problems are reported.

Trustwatch Metric Requirements

Metric	Source	Detector
`trustwatch_cert_expires_in_seconds`	trustwatch Helm chart + ServiceMonitor	Cert expiry
`trustwatch_probe_success`	trustwatch Helm chart + ServiceMonitor	Probe failure
`trustwatch_findings_total`	trustwatch Helm chart + ServiceMonitor	(future use)

Certificate Monitoring Setup

Service mesh certificate metrics are often missing from Prometheus. This is the most common reason cert expiry goes undetected.

Why cert metrics are missing:

Linkerd identity service not in Prometheus scrape targets
Istiod metrics endpoint not scraped
cert-manager not exporting metrics
ServiceMonitor or PodMonitor CRDs missing

How to verify:

# Check if linkerd metrics are being scraped
curl -s http://prometheus:9090/api/v1/targets | \
  jq '.data.activeTargets[] | select(.labels.job | contains("linkerd"))'

# Verify cert metric exists
curl -s http://prometheus:9090/api/v1/query?query=identity_cert_expiry_timestamp
curl -s http://prometheus:9090/api/v1/query?query=citadel_server_root_cert_expiry_timestamp

Required scrape targets:

Linkerd: linkerd-identity (port 9990), linkerd-proxy-injector (port 9995)
Istio: istiod (port 15014)
cert-manager (optional): certmanager_certificate_expiration_timestamp_seconds

Service Mesh Metric Requirements

Metric	Source	Detector
`kube_deployment_status_replicas_available`	kube-state-metrics	Control plane health
`kube_pod_container_status_waiting_reason`	kube-state-metrics	Component crashes
`identity_cert_expiry_timestamp`	linkerd-identity	Linkerd cert expiry
`citadel_server_root_cert_expiry_timestamp`	istiod	Istio cert expiry

Adding Custom Detectors

To add a custom detector:

Create detector file in internal/detector/:

package detector

type MyDetector struct {
    interval time.Duration
}

func NewMyDetector() *MyDetector {
    return &MyDetector{
        interval: 30 * time.Second,
    }
}

func (d *MyDetector) Name() string {
    return "my_custom_detector"
}

func (d *MyDetector) EntityTypes() []string {
    return []string{"my_entity_type"}
}

func (d *MyDetector) Interval() time.Duration {
    return d.interval
}

func (d *MyDetector) Detect(ctx context.Context, provider metrics.MetricsProvider, window time.Duration) ([]*models.Problem, error) {
    query := "my_metric > threshold"
    result, err := provider.QueryInstant(ctx, query, time.Now())
    if err != nil {
        return nil, err
    }

    problems := make([]*models.Problem, 0)
    for _, sample := range result {
        // Extract labels and create problem
        problem := &models.Problem{
            ID:         "unique-id",
            Entity:     "entity-identifier",
            EntityType: "my_entity_type",
            Type:       "problem_type",
            Severity:   models.SeverityCritical,
            Title:      "Problem Title",
            Message:    "Detailed message",
            Hint:       "Actionable hint",
            BlastRadius: 1,
            Labels:     map[string]string{},
            Metrics:    map[string]float64{},
        }
        problems = append(problems, problem)
    }

    return problems, nil
}

Add tests in internal/detector/my_test.go
Register detector in internal/cli/monitor.go:

func registerDetectors(registry *detector.Registry) {
    // ... existing
    registry.Register(detector.NewMyDetector())
}

Document detector in this file

Detector Best Practices

Query Design

Use instant queries for current state
Use range queries for rate calculations
Keep queries simple and efficient
Test queries in Prometheus UI first

Severity Selection

FATAL: Service down, data loss imminent
CRITICAL: Degraded performance, risk of failure
WARNING: Anomaly detected, no immediate impact

Blast Radius

Estimate affected entities:

1: Single container/pod
3-5: Multiple pods/services
10+: Node-level or cluster-wide impact

Hints

Provide actionable, specific hints:

✅ "Container memory limit too low or memory leak detected"
❌ "There's a problem with memory"

Testing

Test with:

No data (empty metrics)
Partial data (some labels missing)
Edge cases (exactly at threshold)
Multiple problems simultaneously

Metric Requirements

infranow detectors expect standard Prometheus metrics:

Kubernetes Metrics

kube_pod_container_status_restarts_total
kube_pod_container_status_waiting_reason
kube_pod_status_phase
kube_pod_created

Provided by: kube-state-metrics

Node Metrics

node_filesystem_avail_bytes
node_filesystem_size_bytes
node_memory_MemAvailable_bytes
node_memory_MemTotal_bytes

Provided by: node_exporter

Application Metrics

http_requests_total (with status label)

Provided by: Application instrumentation (Prometheus client libraries)

Future Detector Ideas

Post-MVP detector candidates:

DatabaseReplicationLagDetector: PostgreSQL/MySQL replication lag
KafkaUnderReplicatedPartitionsDetector: Kafka partition health
RedisMemoryPressureDetector: Redis memory usage
CertificateExpirationDetector: TLS certificate expiration (implemented in v0.1.1 as LinkerdCertExpiry + IstioCertExpiry)
PVCFullDetector: PersistentVolumeClaim usage
NodeNotReadyDetector: Kubernetes node health
DeploymentReplicaMismatchDetector: Desired vs actual replicas

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Detector Documentation

Overview

Kubernetes Detectors

OOMKillDetector

CrashLoopBackOffDetector

ImagePullBackOffDetector

PodPendingDetector

Generic Detectors

HighErrorRateDetector

DiskSpaceDetector

HighMemoryPressureDetector

Service Mesh Detectors

LinkerdControlPlaneDetector

LinkerdProxyInjectionDetector

IstioControlPlaneDetector

IstioSidecarInjectionDetector

Service Mesh Certificate Detectors

LinkerdCertExpiryDetector

IstioCertExpiryDetector

Trustwatch Detectors

TrustwatchCertExpiryDetector

TrustwatchProbeFailureDetector

Trustwatch Metric Requirements

Certificate Monitoring Setup

Service Mesh Metric Requirements

Adding Custom Detectors

Detector Best Practices

Query Design

Severity Selection

Blast Radius

Hints

Testing

Metric Requirements

Kubernetes Metrics

Node Metrics

Application Metrics

Future Detector Ideas

FilesExpand file tree

DETECTORS.md

Latest commit

History

DETECTORS.md

File metadata and controls

Detector Documentation

Overview

Kubernetes Detectors

OOMKillDetector

CrashLoopBackOffDetector

ImagePullBackOffDetector

PodPendingDetector

Generic Detectors

HighErrorRateDetector

DiskSpaceDetector

HighMemoryPressureDetector

Service Mesh Detectors

LinkerdControlPlaneDetector

LinkerdProxyInjectionDetector

IstioControlPlaneDetector

IstioSidecarInjectionDetector

Service Mesh Certificate Detectors

LinkerdCertExpiryDetector

IstioCertExpiryDetector

Trustwatch Detectors

TrustwatchCertExpiryDetector

TrustwatchProbeFailureDetector

Trustwatch Metric Requirements

Certificate Monitoring Setup

Service Mesh Metric Requirements

Adding Custom Detectors

Detector Best Practices

Query Design

Severity Selection

Blast Radius

Hints

Testing

Metric Requirements

Kubernetes Metrics

Node Metrics

Application Metrics

Future Detector Ideas