Monitoring the CloudZero Agent

Evan Nemerson edited this page Apr 24, 2026 · 5 revisions

This document explains how to monitor the health and performance of the CloudZero Agent using metrics exposed by the agent itself, the Kubernetes API server, and standard Kubernetes infrastructure metrics. Each section covers a specific failure mode, which metrics to watch, what to alert on, and where to go for remediation.

How this document is organized:

| Section | What It Covers | Primary Metrics Source |
| --- | --- | --- |
| Webhook Admission Latency | Impact on cluster API operations | Kubernetes API server |
| Webhook TLS Certificate Health | Silent webhook failures from cert issues | Webhook server + API server |
| Data Pipeline Health | Metrics flowing into the collector | Collector (port 8080) |
| Shipper Health | Data delivery to CloudZero | Shipper (port 8081) |
| Webhook Metadata Delivery | Resource metadata pushed to CloudZero API | Webhook server (port 8443) |
| Webhook Event Processing | Webhook receiving and processing events | Webhook server (port 8443) |
| Prometheus Agent Scrape Health | Prometheus scraping targets correctly | Prometheus agent (port 9090) |
| Pod and Container Health | OOM kills, restarts, memory pressure | kube-state-metrics, cAdvisor |

Accessing agent metrics:

All agent components expose /metrics endpoints. Here is the port mapping:

| Component | Container | Port | Protocol | What It Exposes |
| --- | --- | --- | --- | --- |
| Aggregator | collector | 8080 | HTTP | metrics_received_*, http_request_duration_seconds |
| Aggregator | shipper | 8081 | HTTP | shipper_*, function_execution_seconds |
| Server | Prometheus agent | 9090 | HTTP | prometheus_* (Prometheus native metrics) |
| Webhook | webhook-server | 8443 | HTTPS | czo_webhook_*, remote_write_*, function_execution_seconds |

If you are running a monitoring Prometheus in your cluster (or Datadog, Grafana Agent, etc.), add scrape targets for the agent services. The shipper sidecar (port 8081) is not exposed via the Kubernetes service by default; to scrape it, use pod-level service discovery or annotations.

Note on counter metrics: Some agent metrics (e.g., shipper_file_upload_error_total, storage_write_failure_total) are Prometheus counters that only appear in /metrics output after their first event. If you don't see a metric listed here, it means that event type hasn't occurred yet, which is normal.
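
When charting one of these counters, a missing series can be treated as zero. The query below is a minimal sketch of that pattern, assuming your monitoring Prometheus scrapes the shipper and that an empty result should read as "no errors so far":

```promql
# sum() returns no data while the counter has never been incremented;
# "or vector(0)" substitutes a literal 0 in that case.
sum(rate(shipper_file_upload_error_total[30m])) or vector(0)
```

This is mainly useful for dashboards. It cannot distinguish "no events yet" from "target not being scraped", so it should not replace a check that the scrape itself is working.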

Filtering for CloudZero Agent resources in queries:

The queries in this document filter by namespace. Replace <agent-namespace> with the namespace where the CloudZero Agent is installed (typically cloudzero). If you need finer-grained filtering, all agent pods carry the label app.kubernetes.io/part-of: cloudzero-agent, which can be used in joins with kube_pod_labels.
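
As an illustration of label-based filtering, the query below restricts a kube-state-metrics series to agent pods. It assumes kube-state-metrics is configured to expose this pod label on kube_pod_labels (recent versions require an explicit allowlist, for example through --metric-labels-allowlist), and that the label appears under the sanitized name label_app_kubernetes_io_part_of:

```promql
# Container restart counters for agent pods only, selected by pod label
# instead of by namespace. kube_pod_labels has the value 1, so multiplying
# by it keeps the left-hand value and simply filters the series.
kube_pod_container_status_restarts_total
  * on(namespace, pod) group_left()
  kube_pod_labels{label_app_kubernetes_io_part_of="cloudzero-agent"}
```

group_left() is needed because a pod can have several containers (many series on the left) but only one kube_pod_labels series on the right.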


Monitoring Webhook Admission Latency

This section explains how to monitor the CloudZero Agent webhook's performance impact on your Kubernetes cluster using Kubernetes API server metrics.

Why Monitor Webhook Admission Latency?

The CloudZero Agent webhook operates as a Kubernetes validating admission controller that intercepts resource operations (CREATE, UPDATE, DELETE) before they're persisted to etcd. While the webhook is designed to be fast and non-blocking, it's important to monitor its performance impact to ensure it doesn't introduce significant latency into your cluster operations.

The agent uses failurePolicy: Ignore, which means an unreachable or slow webhook will not block operations -- but each operation must wait for the webhook to respond or time out. With the default timeout, every pod create/update/delete could be delayed by seconds if the webhook is unhealthy.

Key concerns include:

  • API request latency: How much time does the webhook add to resource operations?
  • Network overhead: Time spent on TLS handshakes, network transit, and proxy components
  • Service mesh impact: Additional latency introduced by service meshes (e.g., Istio, Linkerd)
  • Operational visibility: Understanding the webhook's behavior during high-load scenarios

Why API Server Metrics?

Monitoring webhook performance from the webhook server itself only captures part of the picture. The webhook server can measure how long it takes to process a request once received, but it cannot measure:

  • TLS handshake time
  • Network transit time (both directions)
  • Service mesh overhead (Istio, Linkerd, etc.)
  • API server queuing time
  • Connection setup overhead

The API server metrics provide the complete end-to-end latency from when the API server initiates the webhook call to when it receives the response. This is the metric that matters for understanding the webhook's impact on cluster operations.

The Right Metric: apiserver_admission_webhook_admission_duration_seconds

Kubernetes exposes a STABLE metric specifically for tracking admission webhook latency (see the Kubernetes Metrics Reference):

apiserver_admission_webhook_admission_duration_seconds

Metric Details:

  • Type: Histogram
  • Stability: STABLE
  • Labels:
    • name: Webhook name (e.g., "cz-agent-cloudzero-agent-webhook-server-webhook.cloudzero-agent.svc")
    • operation: API operation (CREATE, UPDATE, DELETE, CONNECT)
    • rejected: Whether the request was rejected ("true" or "false")
    • type: Webhook type ("validating" for validating webhooks, "admit" for mutating webhooks)
  • Buckets: [0.005, 0.025, 0.1, 0.5, 1, 2.5, 10, 25] seconds

This metric captures the complete round-trip time including:

  • TLS handshake and connection setup
  • Network transit (to webhook server and back)
  • Service mesh proxy processing (if applicable)
  • Webhook server processing time
  • Any queuing or retry logic

Accessing the Metrics

The metric is exposed by the Kubernetes API server on the /metrics endpoint. If you have Prometheus configured to scrape the API server, this metric will be available automatically. You can query it directly using:

kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds

To see metrics specific to the CloudZero webhook, filter by the webhook name in your Prometheus queries.
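
For example, the following sketch filters on the webhook name and breaks p99 latency down by operation. The name regex is the same one used in the alerts in this section; adjust it if your release name does not contain "cloudzero":

```promql
# p99 end-to-end admission latency for the CloudZero webhook, per operation.
# Aggregating by (le, name, operation) merges series from multiple API server
# instances before the quantile is computed.
histogram_quantile(0.99,
  sum by (le, name, operation) (
    rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
      name=~".*cloudzero.*"
    }[5m])
  )
)
```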

What to Alert On

p99 latency exceeding 1 second:

histogram_quantile(0.99,
  rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
    name=~".*cloudzero.*"
  }[5m])
) > 1

When the webhook is healthy, p99 latency is typically under 25ms. A p99 above 1 second indicates either webhook processing issues or network/TLS problems. When the webhook is completely unreachable, every request will show the full timeout duration (default 10 seconds).

Median latency exceeding 1 second (indicates widespread failure):

histogram_quantile(0.50,
  rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
    name=~".*cloudzero.*"
  }[5m])
) > 1

If the median jumps from <10ms to >1s, nearly every webhook call is failing. This is a strong signal of TLS certificate mismatch or total webhook unavailability. See Monitoring Webhook TLS Certificate Health.

Triage: See Webhook Unreachable / API Server Latency in the Debugging Guide.
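
Quantiles are interpolated between bucket boundaries, so it can help to look at the share of slow calls directly. The sketch below computes the fraction of CloudZero webhook calls that took longer than 500ms, using the 0.5 bucket listed above. The 500ms cut-off is an illustrative choice, not a recommended threshold:

```promql
# Fraction of webhook calls slower than 500ms over the last 5 minutes.
# The le="0.5" bucket counts calls that completed within 0.5 seconds.
1 - (
  sum(rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
    name=~".*cloudzero.*", le="0.5"
  }[5m]))
  /
  sum(rate(apiserver_admission_webhook_admission_duration_seconds_count{
    name=~".*cloudzero.*"
  }[5m]))
)
```

In a healthy cluster this value should stay close to 0. A value near 1 matches the "widespread failure" case described above.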

Webhook Server Metrics (Supplementary)

While API server metrics provide the complete picture, the webhook server itself also exposes metrics that can help diagnose issues:

Available at: https://<webhook-pod-ip>:8443/metrics

Key metrics:

  • http_request_duration_seconds: Server-side processing time (excludes network/TLS)
  • http_requests_total: Request count by status code
  • czo_webhook_types_total: Webhook events by resource type and operation

Note: These metrics only show webhook server processing time and don't include network latency or TLS overhead. They're useful for isolating whether high latency is due to webhook processing or network/infrastructure issues.

Comparison: API Server vs. Webhook Server Metrics

| Aspect | API Server Metrics | Webhook Server Metrics |
| --- | --- | --- |
| Scope | Complete end-to-end latency | Server processing only |
| Includes TLS | Yes | No |
| Includes Network | Yes | No |
| Includes Sidecars | Yes | No |
| Granularity | Per webhook name | All requests |
| Recommended for | Performance monitoring | Debugging webhook logic |

Recommendation: Use API server metrics for understanding the webhook's impact on cluster operations. Use webhook server metrics only for debugging specific webhook processing issues.


Monitoring Webhook TLS Certificate Health

This section covers how to detect silent webhook failures caused by TLS certificate issues. This is one of the most dangerous failure modes because it's completely invisible to standard Kubernetes monitoring.

Why This Matters

The CloudZero Agent webhook uses TLS certificates for communication with the Kubernetes API server. When there's a certificate mismatch -- for example, after a certificate rotation where the API server's caBundle hasn't been updated -- every admission request fails the TLS handshake. Because the webhook uses failurePolicy: Ignore, these failures are silently ignored: the API server lets operations proceed, but the webhook records nothing.

The result is complete, silent loss of all resource metadata (pod creates, deletes, updates, namespace changes, etc.) with no alerts from the Kubernetes side. The webhook pods appear healthy, all deployments show ready, and there are no error events in the namespace.

Detection Strategy

There is currently no Prometheus metric exported by the webhook for TLS handshake failures specifically. Detection relies on observing the absence of successful events, which is actually a reliable signal: in any active cluster, the webhook should be continuously receiving admission requests.

Metrics endpoint: https://<webhook-pod>:8443/metrics and the Kubernetes API server /metrics endpoint.

What to Alert On

1. Webhook running but receiving zero events:

sum(rate(czo_webhook_types_total[30m])) == 0

This is the primary detection method. In any active cluster, pods are being created, updated, and deleted regularly. A zero rate for 30 minutes while the webhook pods are Running means the webhook is not receiving admission requests.

The most common cause is a TLS certificate mismatch between the webhook's TLS secret and the caBundle in the ValidatingWebhookConfiguration. Other causes include network policy blocking API server traffic to the webhook, or the ValidatingWebhookConfiguration being deleted.

Triage: See Webhook Diagnostics > Certificate not issued or expired and CA bundle mismatch in the Debugging Guide. To verify whether the caBundle matches:

# Compare fingerprints -- these should match
kubectl get validatingwebhookconfiguration <release>-cz-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | \
  openssl x509 -noout -fingerprint

kubectl get secret -n <agent-namespace> <release>-cz-webhook-tls \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | \
  openssl x509 -noout -fingerprint
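
A rate-based check only fires if the counter exists. If the webhook has never processed an event since it started (see the note on counter metrics at the top of this document), the series may be missing entirely. A complementary check, sketched here on the assumption that your monitoring Prometheus scrapes the webhook server:

```promql
# Returns 1 when no czo_webhook_types_total series exists at all,
# for example when the webhook has never received an admission request
# or when the webhook /metrics endpoint is not being scraped.
absent(czo_webhook_types_total)
```

Both causes are worth investigating, so this expression can be alerted on alongside the zero-rate check.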

2. API server webhook latency at timeout duration:

histogram_quantile(0.50,
  rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
    name=~".*cloudzero.*"
  }[5m])
) > 1

When TLS handshakes fail, the API server waits for the timeout before proceeding. This manifests as every webhook call taking the full timeout duration. A median latency jumping from <10ms to >1s is a strong signal of TLS failure or total webhook unavailability.

Triage: Same as alert #1 above.


Monitoring Data Pipeline Health

This section covers monitoring the collector, which is the entry point for all metrics in the CloudZero Agent.

Why This Matters

The collector receives Prometheus remote-write requests from the internal Prometheus agent, classifies them as cost or observability metrics, and stores them to disk. If the collector stops receiving or processing metrics, CloudZero loses visibility into the cluster.

Metrics endpoint: http://<aggregator-pod>:8080/metrics (the collector container in the aggregator deployment)

Key Metrics

| Metric | Type | What It Means |
| --- | --- | --- |
| metrics_received_total | Counter | Total raw metrics received from Prometheus remote-write. Should increase steadily every scrape interval. |
| metrics_received_cost_total | Counter | Subset classified as cost-allocation metrics (cAdvisor, KSM). This is the data CloudZero uses for cost attribution. |
| metrics_received_observability_total | Counter | Subset classified as observability metrics (agent self-monitoring data). |
| http_requests_total{code="204",method="post"} | Counter | Successful remote-write ingestion requests (HTTP 204 = accepted). |
| http_request_duration_seconds{method="post"} | Histogram | Latency of processing each remote-write batch. |

What to Alert On

1. No metrics received:

rate(metrics_received_total[10m]) == 0

No metrics received in 10 minutes. This means the internal Prometheus agent is not sending data to the collector, or the collector is down. In a healthy agent, this counter increments every scrape interval (typically ~60s).

Triage: See Data Pipeline Diagnostics > All Pods Healthy But No Data and CrashLoopBackOff Diagnostics in the Debugging Guide.

2. Cost metrics not being received:

rate(metrics_received_cost_total[10m]) == 0
  and
rate(metrics_received_total[10m]) > 0

Metrics are arriving but none are classified as cost metrics. This typically means KSM or cAdvisor scrape targets are not being scraped. The total will still increase from observability metrics, but CloudZero cannot perform cost attribution without the cost metrics.

Triage: See Data Pipeline Diagnostics > Missing KSM Metrics and Missing cAdvisor Metrics in the Debugging Guide.
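
To see a partial loss of cost metrics rather than only a complete one, the share of cost metrics in the total can be graphed. This is a sketch only: what counts as a normal share depends on the cluster, so compare against your own baseline rather than a fixed threshold:

```promql
# Share of received metrics that were classified as cost metrics.
# A sudden drop suggests a KSM or cAdvisor scrape target has stopped reporting.
sum(rate(metrics_received_cost_total[10m]))
/
sum(rate(metrics_received_total[10m]))
```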

3. High collector ingestion latency:

histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{code="204",method="post"}[5m])
) > 1

p99 collector ingestion latency exceeding 1 second. In healthy operation, the p99 for POST requests is well under 25ms. Values above 1s suggest memory pressure, disk I/O contention, or a cluster generating more data than the collector can process.

Triage: See Performance Diagnostics > High Memory Usage in the Debugging Guide.


Monitoring Shipper Health

This section covers monitoring the shipper, which is responsible for uploading collected metric data to CloudZero.

Why This Matters

The shipper is the critical "last mile" of the data pipeline. It uploads collected metric data files to CloudZero's S3 buckets. When the shipper fails, data accumulates on disk. The agent automatically manages disk space by cleaning up shipped data based on age and shipped status, and becomes more aggressive about cleanup under disk pressure. This means persistent shipper failures are the real concern: if files can't be shipped, eventually even unshipped data will be purged to keep the volume from filling, resulting in data loss.

Metrics endpoint: http://<aggregator-pod>:8081/metrics (the shipper sidecar in the aggregator deployment)

Key Metrics

| Metric | Type | What It Means |
| --- | --- | --- |
| shipper_run_fail_total | Counter | Total shipper cycle failures, labeled by error_status_code. |
| shipper_new_files_error_total | Counter | File processing errors, labeled by error_status_code. |
| shipper_presigned_url_error_total | Counter | Pre-signed URL allocation failures (from CloudZero API). |
| shipper_file_upload_error_total | Counter | S3 upload failures, labeled by error_status_code. |
| shipper_handle_request_success_total | Counter | Successful file upload batches. Should increase every shipper cycle (~10 min). |
| shipper_handle_request_file_count | Histogram | Number of files processed per upload cycle. |
| function_execution_seconds{function_name="shipper_runShipper"} | Histogram | Duration of each shipper cycle, with error label showing failure reason on failure. |

What to Alert On

1. Shipper cycle failures:

rate(shipper_run_fail_total[30m]) > 0

The shipper is failing to complete its upload cycle. The error_status_code label identifies the category of failure:

  • err-unauthorized: Invalid or revoked API key. The most common cause.
  • err-network: Cannot reach api.cloudzero.com or S3.
  • Other codes: See shipper logs for details.

Triage: For err-unauthorized, see Data Pipeline Diagnostics > API key invalid or revoked. For err-network, see Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.
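
To see which failure category is firing without reading logs, the counter can be grouped by its error_status_code label. A minimal sketch, with the one-hour window chosen only for illustration:

```promql
# Number of failed shipper cycles in the last hour, per failure category
# (for example err-unauthorized or err-network).
sum by (error_status_code) (
  increase(shipper_run_fail_total[1h])
)
```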

2. Pre-signed URL failures:

rate(shipper_presigned_url_error_total[30m]) > 0

Cannot obtain pre-signed URLs from CloudZero API. This blocks all file uploads. The shipper must obtain pre-signed URLs from api.cloudzero.com before uploading to S3. Failures here mean the API key is invalid/revoked, the network path to the API is blocked, or the CloudZero API is experiencing issues.

Triage: See Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.

3. S3 upload failures:

rate(shipper_file_upload_error_total[30m]) > 0

Pre-signed URLs were obtained successfully, but the actual file transfer to S3 failed. This typically indicates network policy blocking S3 or transient connectivity issues.

Triage: See Network Diagnostics > Cannot Reach S3 Buckets in the Debugging Guide.

4. No successful uploads:

rate(shipper_handle_request_success_total[30m]) == 0

No files have been successfully uploaded in 30 minutes. The shipper runs approximately every 10 minutes, so three consecutive cycles have failed. This is a broader check than alerts 1-3 that catches any delivery failure regardless of the specific error category.
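
A successful cycle does not say how much data it moved. Since shipper_handle_request_file_count is a histogram, its _sum and _count series give the average number of files per cycle. This sketch assumes the standard Prometheus histogram suffixes:

```promql
# Average number of files handled per shipper cycle over the last hour.
sum(rate(shipper_handle_request_file_count_sum[1h]))
/
sum(rate(shipper_handle_request_file_count_count[1h]))
```

A rising average after an outage is expected while accumulated files are shipped. What counts as a normal value depends on cluster size.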


Monitoring Webhook Metadata Delivery

This section covers monitoring the webhook's delivery of Kubernetes resource metadata to the CloudZero API via Prometheus remote-write.

Why This Matters

The webhook captures Kubernetes resource metadata (pod creates/deletes/updates, storage classes, ingress classes, etc.) and pushes it to CloudZero. This metadata is essential for cost attribution -- without it, CloudZero cannot map resource costs to teams, services, or labels.

Metrics endpoint: https://<webhook-pod>:8443/metrics

Key Metrics

| Metric | Type | What It Means |
| --- | --- | --- |
| remote_write_failures_total | Counter | Failed attempts to push metadata to CloudZero API. |
| remote_write_backlog_records | Gauge | Records queued but not yet sent. A growing backlog means delivery is failing. |
| remote_write_records_processed_total | Counter | Records successfully sent and confirmed. |
| remote_write_response_codes_total | Counter | Response codes from CloudZero API, labeled by status_code (bucketed values: 2xx, 4xx, 5xx, no_response). |
| remote_write_request_duration_seconds | Histogram | Latency of push requests to CloudZero API. |
| remote_write_db_failures_total | Counter | Database failures when tracking record state internally. |
| storage_write_failure_total | Counter | Failures writing admission data to local storage, labeled by resource_type, namespace, resource_name, action. |

What to Alert On

1. Remote write failures:

rate(remote_write_failures_total[30m]) > 0

The webhook cannot push metadata to CloudZero. Cost attribution metadata is not being delivered.

Triage: See Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.

2. Growing backlog:

remote_write_backlog_records > 1000

Data is being captured but not delivered. A small transient backlog during high activity is normal, but a persistently growing backlog means delivery is failing or too slow. Check remote_write_response_codes_total for non-2xx codes.
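
To separate a transient spike from a persistently growing backlog, the size check can be combined with a trend check. A sketch, where the 30-minute window is an assumption to tune for your environment:

```promql
# Backlog is large AND has been trending upward over the last 30 minutes.
# deriv() estimates the per-second slope of a gauge using linear regression.
remote_write_backlog_records > 1000
  and
deriv(remote_write_backlog_records[30m]) > 0
```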

3. Non-success responses from CloudZero API:

rate(remote_write_response_codes_total{status_code!="2xx"}[10m]) > 0

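The status_code label uses bucketed values: 2xx, 4xx, 5xx, and no_response. Any value other than 2xx indicates that a push to the CloudZero API did not succeed.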
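
For a sense of scale, the non-success responses can be expressed as a fraction of all responses. A minimal sketch using only the status_code label described in this section:

```promql
# Fraction of remote-write responses that were not 2xx in the last 10 minutes.
sum(rate(remote_write_response_codes_total{status_code!="2xx"}[10m]))
/
sum(rate(remote_write_response_codes_total[10m]))
```

Grouping the numerator by status_code instead shows whether the failures are client errors (4xx), server errors (5xx), or missing responses (no_response).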

4. Local storage write failures:

rate(storage_write_failure_total[10m]) > 0

The webhook is receiving admission requests but cannot persist them to its local data store. The resource_type, namespace, resource_name, and action labels identify exactly which resource operations are failing.


Monitoring Webhook Event Processing

This section covers monitoring whether the webhook is actually receiving and processing Kubernetes admission requests.

Why This Matters

It's possible for the webhook to be "running" but not receiving any requests. This can happen due to a misconfigured ValidatingWebhookConfiguration, network policy blocking API server traffic, or TLS certificate issues (see Monitoring Webhook TLS Certificate Health). When the webhook isn't receiving events, all resource metadata is silently lost.

Metrics endpoint: https://<webhook-pod>:8443/metrics

Key Metrics

| Metric | Type | What It Means |
| --- | --- | --- |
| czo_webhook_types_total | Counter | Webhook events processed, labeled by kind_group, kind_version, kind_resource, operation. |
| function_execution_seconds{function_name=~"executeAdmissionsReviewRequest_.*"} | Histogram | Server-side processing time per admission request type (Create, Update, Delete). |
| function_execution_seconds{function_name="writeDataToStorage"} | Histogram | Time to persist admission data to local storage. |

What to Alert On

1. No pod events received:

rate(czo_webhook_types_total{kind_resource="pod",operation="create"}[1h]) == 0

No pod CREATE events seen in 1 hour. In any active cluster, pods are being created regularly. A zero rate means the webhook is not receiving admission requests.

Triage: See Webhook Not Receiving Admission Requests in the Debugging Guide. Check the ValidatingWebhookConfiguration, CA bundle, and network policies allowing API server to webhook on port 8443. Also see Monitoring Webhook TLS Certificate Health -- TLS certificate mismatch is the most common cause of this symptom.
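
When investigating, it helps to see which resource kinds and operations the webhook is actually receiving. A sketch using the labels listed in the table above:

```promql
# Event rate per resource kind and operation over the last hour.
# Kinds that are expected but missing from the result point to a gap
# in the ValidatingWebhookConfiguration rules.
sum by (kind_resource, operation) (
  rate(czo_webhook_types_total[1h])
)
```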

2. High webhook server-side processing latency:

histogram_quantile(0.99,
  rate(function_execution_seconds_bucket{
    function_name=~"executeAdmissionsReviewRequest_.*",
    error=""
  }[5m])
) > 0.5

Webhook server-side processing p99 exceeding 500ms. This is the processing time inside the webhook; the API server sees this plus network/TLS overhead. In healthy operation, the p99 for admission request processing is typically under 20ms.

Triage: See Performance Diagnostics > Slow Webhook Response Times in the Debugging Guide. Cross-reference with the API server metric apiserver_admission_webhook_admission_duration_seconds to determine whether the latency is from processing or network.
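
One way to do this cross-reference is to subtract the server-side median from the API server median. This is a rough sketch: both values are interpolated from histogram buckets, and the API server buckets are coarse at the low end, so read the result as an order-of-magnitude estimate of network and TLS overhead rather than a precise figure:

```promql
# Approximate overhead (seconds) added outside the webhook process:
# end-to-end median seen by the API server minus median processing time
# measured inside the webhook.
histogram_quantile(0.50,
  sum by (le) (rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
    name=~".*cloudzero.*"
  }[5m]))
)
-
histogram_quantile(0.50,
  sum by (le) (rate(function_execution_seconds_bucket{
    function_name=~"executeAdmissionsReviewRequest_.*"
  }[5m]))
)
```

If this difference is large while server-side latency stays low, the problem is more likely in the network path, TLS, or a service mesh than in webhook processing.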


Monitoring Prometheus Agent Scrape Health

This section covers monitoring the internal Prometheus agent that scrapes Kubernetes metrics and feeds them to the collector.

Why This Matters

The CloudZero Agent uses Prometheus in agent mode to scrape Kubernetes metrics (cAdvisor from kubelets, KSM) and remote-write them to the collector. If Prometheus can't discover scrape targets, can't scrape them, or can't remote-write to the collector, no data enters the pipeline.

Metrics endpoint: http://<server-pod>:9090/metrics (the Prometheus process on the server deployment)

Key Metrics

| Metric | Type | What It Means |
| --- | --- | --- |
| prometheus_sd_discovered_targets | Gauge | Targets discovered by service discovery. |
| prometheus_target_scrape_pool_targets | Gauge | Current number of active scrape targets. |
| prometheus_remote_storage_samples_failed_total | Counter | Samples that permanently failed to send via remote-write. These are lost. |
| prometheus_remote_storage_samples_pending | Gauge | Samples queued for remote-write. Growing = collector can't keep up. |
| prometheus_remote_storage_queue_highest_sent_timestamp_seconds | Gauge | Most recent successfully sent timestamp. |
| prometheus_remote_storage_queue_highest_timestamp_seconds | Gauge | Most recent enqueued timestamp. |
| prometheus_remote_storage_shards | Gauge | Current number of remote-write shards. |
| prometheus_target_scrapes_sample_out_of_order_total | Counter | Samples dropped due to out-of-order timestamps. |

What to Alert On

1. No scrape targets discovered:

prometheus_sd_discovered_targets == 0

Prometheus has discovered zero scrape targets. Service discovery is broken and no metrics will be collected.

Triage: See Data Pipeline Diagnostics > Some Metrics Missing in the Debugging Guide. Check the agent-server Prometheus configuration and RBAC permissions.

2. Remote-write samples permanently failing:

rate(prometheus_remote_storage_samples_failed_total[10m]) > 0

Prometheus permanently failed to send samples to the collector via remote-write. These samples are lost. Check collector health (see Monitoring Data Pipeline Health) first; if the collector is healthy, check Prometheus logs for remote-write errors.

3. Remote-write lag exceeding 5 minutes:

(
  prometheus_remote_storage_queue_highest_timestamp_seconds
  -
  prometheus_remote_storage_queue_highest_sent_timestamp_seconds
) > 300

The remote-write queue is more than 5 minutes behind. Data is being scraped but not delivered to the collector in a timely manner. Small lag during startup or after a restart is normal. Persistent lag indicates the collector can't keep up, or there's a network issue between Prometheus and the collector.

Triage: See Performance Diagnostics > Data Processing Delays in the Debugging Guide.
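
To ignore the short-lived lag seen during startup, the check can require that the lag stayed above the threshold for a whole window. The sketch below does this with a subquery. The 15-minute window and 1-minute resolution are illustrative assumptions, and an alert rule with a "for" duration achieves a similar effect:

```promql
# Fires only if the remote-write lag has been above 5 minutes
# at every 1-minute evaluation step over the last 15 minutes.
min_over_time(
  (
    prometheus_remote_storage_queue_highest_timestamp_seconds
    -
    prometheus_remote_storage_queue_highest_sent_timestamp_seconds
  )[15m:1m]
) > 300
```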

4. Rapid remote-write resharding:

changes(prometheus_remote_storage_shards[5m]) > 10

The Prometheus remote-write subsystem is rapidly oscillating its shard count. This is a symptom of write pressure where the remote-write queue can't maintain a stable throughput rate. Rapid resharding is a leading indicator of imminent memory exhaustion on the server -- the constant allocation and deallocation of shards drives memory usage up.

Triage: Increase server memory limits. For very large clusters, consider enabling federated mode to reduce the number of targets the server must scrape. See Performance Diagnostics > High Memory Usage in the Debugging Guide.

5. Out-of-order sample drops:

rate(prometheus_target_scrapes_sample_out_of_order_total[10m]) > 0

Samples are being dropped because their timestamps are out of order. This can indicate cAdvisor timestamp issues (common with some container runtimes) or scrape interval misconfiguration. Each dropped sample is a gap in the collected data.

Triage: See Data Pipeline Diagnostics > Missing cAdvisor Metrics in the Debugging Guide.


Monitoring Pod and Container Health

This section covers standard Kubernetes monitoring for the CloudZero Agent's pods and containers, including OOM kills, container restarts, and proactive memory pressure detection.

Required Metrics Source: kube-state-metrics

kube-state-metrics provides metrics about pod and container state. Most Kubernetes monitoring stacks (kube-prometheus-stack, etc.) include it by default.

Note: To ensure OOM kill metrics are available, your kube-state-metrics deployment must have the pods collector enabled. If you do not see the kube_pod_container_status_last_terminated_reason metric, check your KSM configuration to ensure the pods collector is active and not explicitly excluded.

# Verify kube-state-metrics is running
kubectl get pods -A | grep kube-state-metrics

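# Verify OOM metrics are exposed (port-forward blocks, so run it in a separate terminal;
# adjust the namespace if kube-state-metrics is installed elsewhere)
kubectl port-forward -n kube-system deployment/kube-state-metrics 8080:8080
curl -s http://localhost:8080/metrics | grep kube_pod_container_status_last_terminated_reason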

Detecting OOM Kills

OOM kills occur when a container exceeds its memory limit and is terminated by the kernel. This causes service disruption, potential data loss (in-flight requests and unsaved state), and restart delays.

Primary metric (from kube-state-metrics):

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

What to alert on:

Current OOMKilled containers:

kube_pod_container_status_last_terminated_reason{
  namespace="<agent-namespace>",
  reason="OOMKilled"
} == 1

OOM kill rate over time:

sum(rate(kube_pod_container_status_restarts_total{
  namespace="<agent-namespace>"
}[5m])) by (namespace, pod, container)
  and on(namespace, pod, container)
  kube_pod_container_status_last_terminated_reason{
    namespace="<agent-namespace>",
    reason="OOMKilled"
  } == 1

Triage: See CrashLoopBackOff Diagnostics > OOMKilled in the Debugging Guide. Increase memory limits for the affected container. See Memory Sizing Guidelines below for recommendations based on cluster size.

Detecting Container Restarts

Any container restart in agent pods:

rate(kube_pod_container_status_restarts_total{
  namespace="<agent-namespace>"
}[15m]) > 0

A single restart during initial startup is normal. Repeated restarts indicate CrashLoopBackOff. Check the exit code: 137 = OOMKilled (increase memory), other codes = application error (check logs).

Triage: See CrashLoopBackOff Diagnostics in the Debugging Guide.
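
To distinguish a single startup restart from a crash loop, the alert can require several restarts in a window, or look at the container's waiting reason reported by kube-state-metrics. A sketch, where the threshold of more than 2 restarts per hour is an assumption to adjust:

```promql
# More than 2 restarts in the last hour...
increase(kube_pod_container_status_restarts_total{
  namespace="<agent-namespace>"
}[1h]) > 2
  or
# ...or the container is currently waiting in CrashLoopBackOff.
kube_pod_container_status_waiting_reason{
  namespace="<agent-namespace>",
  reason="CrashLoopBackOff"
} == 1
```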

Detecting Unavailable Replicas

kube_deployment_status_replicas_unavailable{
  namespace="<agent-namespace>"
} > 0

Any agent deployment has unavailable replicas. Run kubectl get pods -n <agent-namespace> and follow the General kubectl Workflow in the Debugging Guide.

Proactive Memory Pressure Detection

Why Monitor Against Limits?

Monitoring memory usage against limits provides early warning before containers are OOM killed. This is the single most common operational issue observed in production CloudZero Agent deployments across clusters of all sizes.

Memory usage as percentage of limit:

(
  container_memory_working_set_bytes{
    namespace="<agent-namespace>",
    container!="", container!="POD"
  }
  / on(namespace, pod, container)
  kube_pod_container_resource_limits{
    namespace="<agent-namespace>",
    resource="memory"
  }
) > 0.85

Alert when memory usage exceeds 85% of the limit. This gives you time to increase limits before the container is killed.

Why container_memory_working_set_bytes? This metric excludes cached data that can be evicted, so it better represents true memory pressure. It is also the value the kubelet uses when making memory eviction decisions.
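
For a steadily growing container, a trend-based check can warn earlier than a fixed percentage. The sketch below extrapolates the last hour of usage four hours ahead and compares it to the limit. Both windows are illustrative assumptions, and linear extrapolation is a poor fit for bursty workloads:

```promql
# Containers whose working set, extrapolated linearly from the last hour,
# would exceed their memory limit within 4 hours.
predict_linear(
  container_memory_working_set_bytes{
    namespace="<agent-namespace>",
    container!="", container!="POD"
  }[1h],
  4 * 3600
)
> on(namespace, pod, container)
kube_pod_container_resource_limits{
  namespace="<agent-namespace>",
  resource="memory"
}
```

The on(namespace, pod, container) clause is required because the cAdvisor and kube-state-metrics series carry different label sets.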

Why Also Monitor Against Requests?

Monitoring against requests provides a different signal: it detects when actual usage has significantly diverged from what was expected at scheduling time, which can indicate undersized requests (scheduling problems), unexpected memory growth, or workload characteristics that have changed.

Memory usage as percentage of request:

(
  container_memory_working_set_bytes{
    namespace="<agent-namespace>",
    container!="", container!="POD"
  }
  / on(namespace, pod, container)
  kube_pod_container_resource_requests{
    namespace="<agent-namespace>",
    resource="memory"
  }
) * 100

Memory Sizing Guidelines

Memory usage for CloudZero Agent components scales with cluster size. The following table provides approximate observed ranges to help with capacity planning. Memory limits should be set to at least 1.5x the observed peak for your cluster size tier.

| Component | Small (<50 nodes) | Medium (50-200 nodes) | Large (200-600 nodes) | Very Large (600+ nodes) |
| --- | --- | --- | --- | --- |
| Server (Prometheus) | 200-700 Mi | 900-2500 Mi | 4000-11000 Mi | 14000+ Mi |
| Aggregator (per replica) | 80-200 Mi | 200-1100 Mi | 1000-1500 Mi | Scale replicas |
| Webhook (per replica) | 10-50 Mi | 250-350 Mi | 250-650 Mi | 250-650 Mi |
| KSM | 20-80 Mi | 70-360 Mi | 300-650 Mi | 2500+ Mi |

The Prometheus server is the component most sensitive to cluster size. For clusters above 200 nodes, consider enabling federated mode to distribute the scrape load.
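
The 1.5x guideline can be applied to your own data instead of the table. The sketch below computes a suggested limit per container from the observed peak. The 7-day lookback is an assumption and requires that your monitoring Prometheus retains at least that much history:

```promql
# Suggested memory limit in bytes: 1.5x the peak working set
# observed for each agent container over the last 7 days.
1.5 * max by (container) (
  max_over_time(
    container_memory_working_set_bytes{
      namespace="<agent-namespace>",
      container!="", container!="POD"
    }[7d]
  )
)
```

Divide the result by 1024 * 1024 to read it in Mi for comparison with the table above.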


Quick Reference: Priority Summary

| Priority | What to Monitor | Key Alert | Common Cause |
| --- | --- | --- | --- |
| P0 | Webhook admission latency | API server metric p99 > 1s | Webhook unreachable, TLS failure, resource pressure |
| P0 | Server memory vs limit | container_memory_working_set_bytes / limit > 0.85 | Cluster too large for current memory limits |
| P0 | Webhook receiving zero events | rate(czo_webhook_types_total) == 0 for 30m | TLS certificate mismatch after rotation |
| P1 | Shipper upload failures | rate(shipper_run_fail_total) > 0 for 30m | Invalid API key, network policy blocking egress |
| P1 | Aggregator memory vs limit | container_memory_working_set_bytes / limit > 0.85 | Default 1Gi limit too low for cluster size |
| P1 | OOM kills | kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} | Undersized memory limits |
| P2 | Remote-write resharding churn | changes(prometheus_remote_storage_shards[5m]) > 10 | Write pressure, imminent server OOM |
| P2 | No metrics received | rate(metrics_received_total) == 0 for 10m | Collector down, Prometheus not sending data |
| P2 | Webhook metadata backlog | remote_write_backlog_records > 1000 | API connectivity or auth issues |
| P2 | Out-of-order sample drops | rate(prometheus_target_scrapes_sample_out_of_order_total) > 0 | cAdvisor timestamp issues |
| P3 | Container restarts | rate(kube_pod_container_status_restarts_total) > 0 | Various (check exit code) |