Monitoring the CloudZero Agent
This document explains how to monitor the health and performance of the CloudZero Agent using metrics exposed by the agent itself, the Kubernetes API server, and standard Kubernetes infrastructure metrics. Each section covers a specific failure mode, which metrics to watch, what to alert on, and where to go for remediation.
How this document is organized:
| Section | What It Covers | Primary Metrics Source |
|---|---|---|
| Webhook Admission Latency | Impact on cluster API operations | Kubernetes API server |
| Webhook TLS Certificate Health | Silent webhook failures from cert issues | Webhook server + API server |
| Data Pipeline Health | Metrics flowing into the collector | Collector (port 8080) |
| Shipper Health | Data delivery to CloudZero | Shipper (port 8081) |
| Webhook Metadata Delivery | Resource metadata pushed to CloudZero API | Webhook server (port 8443) |
| Webhook Event Processing | Webhook receiving and processing events | Webhook server (port 8443) |
| Prometheus Agent Scrape Health | Prometheus scraping targets correctly | Prometheus agent (port 9090) |
| Pod and Container Health | OOM kills, restarts, memory pressure | kube-state-metrics, cAdvisor |
Accessing agent metrics:
All agent components expose /metrics endpoints. Here is the port mapping:
| Component | Container | Port | Protocol | What It Exposes |
|---|---|---|---|---|
| Aggregator | collector | 8080 | HTTP | metrics_received_*, http_request_duration_seconds |
| Aggregator | shipper | 8081 | HTTP | shipper_*, function_execution_seconds |
| Server | Prometheus agent | 9090 | HTTP | prometheus_* (Prometheus native metrics) |
| Webhook | webhook-server | 8443 | HTTPS | czo_webhook_*, remote_write_*, function_execution_seconds |
If you are running a monitoring Prometheus in your cluster (or Datadog, Grafana Agent, etc.), add scrape targets for the agent services. The shipper sidecar (port 8081) is not exposed via the Kubernetes service by default; to scrape it, use pod-level service discovery or annotations.
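As one way to do the pod-level discovery mentioned above, a monitoring Prometheus could use a scrape job along these lines. This is a sketch, not configuration shipped with the agent: the job name is made up, the namespace cloudzero stands in for <agent-namespace>, and it assumes the shipper container declares port 8081 in its pod spec so that Kubernetes service discovery produces a target for it.

```yaml
scrape_configs:
  - job_name: cloudzero-agent-shipper        # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [cloudzero]                 # replace with <agent-namespace>
    relabel_configs:
      # Keep only pods that carry the agent's part-of label.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_part_of]
        regex: cloudzero-agent
        action: keep
      # Keep only the target for the shipper's metrics port
      # (assumes the container declares port 8081).
      - source_labels: [__meta_kubernetes_pod_container_port_number]
        regex: "8081"
        action: keep
      # Carry namespace and pod name onto the scraped series.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

If the port is not declared on the container, the second keep rule matches nothing; in that case the target address would have to be built from the pod IP instead.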
Note on counter metrics: Some agent metrics (e.g., shipper_file_upload_error_total, storage_write_failure_total) are Prometheus counters that only appear in /metrics output after their first event. If you don't see a metric listed here, it means that event type hasn't occurred yet, which is normal.
Filtering for CloudZero Agent resources in queries:
The queries in this document filter by namespace. Replace <agent-namespace> with the namespace where the CloudZero Agent is installed (typically cloudzero). If you need finer-grained filtering, all agent pods carry the label app.kubernetes.io/part-of: cloudzero-agent, which can be used in joins with kube_pod_labels.
This section explains how to monitor the CloudZero Agent webhook's performance impact on your Kubernetes cluster using Kubernetes API server metrics.
The CloudZero Agent webhook operates as a Kubernetes validating admission controller that intercepts resource operations (CREATE, UPDATE, DELETE) before they're persisted to etcd. While the webhook is designed to be fast and non-blocking, it's important to monitor its performance impact to ensure it doesn't introduce significant latency into your cluster operations.
The agent uses failurePolicy: Ignore, which means an unreachable or slow webhook will not block operations -- but each operation must still wait for the webhook to respond or time out. With the default 10-second timeout, every pod create/update/delete can be delayed by up to 10 seconds if the webhook is unhealthy.
Key concerns include:
- API request latency: How much time does the webhook add to resource operations?
- Network overhead: Time spent on TLS handshakes, network transit, and proxy components
- Service mesh impact: Additional latency introduced by service meshes (e.g., Istio, Linkerd)
- Operational visibility: Understanding the webhook's behavior during high-load scenarios
Monitoring webhook performance from the webhook server itself only captures part of the picture. The webhook server can measure how long it takes to process a request once received, but it cannot measure:
- TLS handshake time
- Network transit time (both directions)
- Service mesh overhead (Istio, Linkerd, etc.)
- API server queuing time
- Connection setup overhead
The API server metrics provide the complete end-to-end latency from when the API server initiates the webhook call to when it receives the response. This is the metric that matters for understanding the webhook's impact on cluster operations.
Kubernetes exposes a STABLE metric specifically for tracking admission webhook latency (see the Kubernetes Metrics Reference):
apiserver_admission_webhook_admission_duration_seconds
Metric Details:
- Type: Histogram
- Stability: STABLE
- Labels:
  - name: Webhook name (e.g., "cz-agent-cloudzero-agent-webhook-server-webhook.cloudzero-agent.svc")
  - operation: API operation (CREATE, UPDATE, DELETE, CONNECT)
  - rejected: Whether the request was rejected ("true" or "false")
  - type: Webhook type ("validating" or "admit")
- Buckets: [0.005, 0.025, 0.1, 0.5, 1, 2.5, 10, 25] seconds
This metric captures the complete round-trip time including:
- TLS handshake and connection setup
- Network transit (to webhook server and back)
- Service mesh proxy processing (if applicable)
- Webhook server processing time
- Any queuing or retry logic
The metric is exposed by the Kubernetes API server on the /metrics endpoint. If you have Prometheus configured to scrape the API server, this metric will be available automatically. You can query it directly using:
kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds
To see metrics specific to the CloudZero webhook, filter by the webhook name in your Prometheus queries.
p99 latency exceeding 1 second:
histogram_quantile(0.99,
rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
name=~".*cloudzero.*"
}[5m])
) > 1
When the webhook is healthy, p99 latency is typically under 25ms. A p99 above 1 second indicates either webhook processing issues or network/TLS problems. When the webhook is completely unreachable, every request will show the full timeout duration (default 10 seconds).
Median latency exceeding 1 second (indicates widespread failure):
histogram_quantile(0.50,
rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
name=~".*cloudzero.*"
}[5m])
) > 1
If the median jumps from <10ms to >1s, nearly every webhook call is failing. This is a strong signal of TLS certificate mismatch or total webhook unavailability. See Monitoring Webhook TLS Certificate Health.
Triage: See Webhook Unreachable / API Server Latency in the Debugging Guide.
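The two latency conditions above could be wired into a Prometheus rule file roughly as follows. The expressions are the ones given in this section; the group name, alert names, for durations, and severity labels are illustrative assumptions and should be adapted to your own alerting conventions.

```yaml
groups:
  - name: cloudzero-webhook-admission-latency    # hypothetical group name
    rules:
      - alert: CloudZeroWebhookP99LatencyHigh    # hypothetical alert name
        expr: |
          histogram_quantile(0.99,
            rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
              name=~".*cloudzero.*"
            }[5m])
          ) > 1
        for: 10m                                 # assumed hold time
        labels:
          severity: warning                      # assumed severity scheme
        annotations:
          summary: CloudZero webhook p99 admission latency is above 1s
      - alert: CloudZeroWebhookMedianLatencyHigh # hypothetical alert name
        expr: |
          histogram_quantile(0.50,
            rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
              name=~".*cloudzero.*"
            }[5m])
          ) > 1
        for: 5m                                  # assumed hold time
        labels:
          severity: critical
        annotations:
          summary: Most CloudZero webhook calls are slow or failing
```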
While API server metrics provide the complete picture, the webhook server itself also exposes metrics that can help diagnose issues:
Available at: https://<webhook-pod-ip>:8443/metrics
Key metrics:
- http_request_duration_seconds: Server-side processing time (excludes network/TLS)
- http_requests_total: Request count by status code
- czo_webhook_types_total: Webhook events by resource type and operation
Note: These metrics only show webhook server processing time and don't include network latency or TLS overhead. They're useful for isolating whether high latency is due to webhook processing or network/infrastructure issues.
| Aspect | API Server Metrics | Webhook Server Metrics |
|---|---|---|
| Scope | Complete end-to-end latency | Server processing only |
| Includes TLS | Yes | No |
| Includes Network | Yes | No |
| Includes Sidecars | Yes | No |
| Granularity | Per webhook name | All requests |
| Recommended for | Performance monitoring | Debugging webhook logic |
Recommendation: Use API server metrics for understanding the webhook's impact on cluster operations. Use webhook server metrics only for debugging specific webhook processing issues.
This section covers how to detect silent webhook failures caused by TLS certificate issues. This is one of the most dangerous failure modes because it's completely invisible to standard Kubernetes monitoring.
The CloudZero Agent webhook uses TLS certificates for communication with the Kubernetes API server. When there's a certificate mismatch -- for example, after a certificate rotation where the API server's caBundle hasn't been updated -- every admission request fails the TLS handshake. Because the webhook uses failurePolicy: Ignore, these failures are silently ignored: the API server lets operations proceed, but the webhook records nothing.
The result is complete, silent loss of all resource metadata (pod creates, deletes, updates, namespace changes, etc.) with no alerts from the Kubernetes side. The webhook pods appear healthy, all deployments show ready, and there are no error events in the namespace.
There is currently no Prometheus metric exported by the webhook for TLS handshake failures specifically. Detection relies on observing the absence of successful events, which is actually a reliable signal: in any active cluster, the webhook should be continuously receiving admission requests.
Metrics endpoint: https://<webhook-pod>:8443/metrics and the Kubernetes API server /metrics endpoint.
1. Webhook running but receiving zero events:
rate(czo_webhook_types_total[30m]) == 0
This is the primary detection method. In any active cluster, pods are being created, updated, and deleted regularly. A zero rate for 30 minutes while the webhook pods are Running means the webhook is not receiving admission requests.
The most common cause is a TLS certificate mismatch between the webhook's TLS secret and the caBundle in the ValidatingWebhookConfiguration. Other causes include network policy blocking API server traffic to the webhook, or the ValidatingWebhookConfiguration being deleted.
Triage: See Webhook Diagnostics > Certificate not issued or expired and CA bundle mismatch in the Debugging Guide. To verify whether the caBundle matches:
# Compare fingerprints -- these should match
kubectl get validatingwebhookconfiguration <release>-cz-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | \
  openssl x509 -noout -fingerprint
kubectl get secret -n <namespace> <release>-cz-webhook-tls \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | \
  openssl x509 -noout -fingerprint
2. API server webhook latency at timeout duration:
histogram_quantile(0.50,
rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
name=~".*cloudzero.*"
}[5m])
) > 1
When TLS handshakes fail, the API server waits for the timeout before proceeding. This manifests as every webhook call taking the full timeout duration. A median latency jumping from <10ms to >1s is a strong signal of TLS failure or total webhook unavailability.
Triage: Same as alert #1 above.
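A possible rule-file form of the zero-events check is sketched below, with two adjustments that are assumptions rather than part of the original query. First, the rate is summed across all label combinations, so that a single rarely seen resource type sitting at zero does not fire on its own. Second, because counters only appear after their first event, a separate absent() rule covers the case where the counter has never been emitted or is not being scraped at all. Group and alert names, for durations, and severities are illustrative.

```yaml
groups:
  - name: cloudzero-webhook-silent-failure       # hypothetical group name
    rules:
      - alert: CloudZeroWebhookNoEvents          # hypothetical alert name
        # Total event rate across every resource type and operation.
        expr: sum(rate(czo_webhook_types_total[30m])) == 0
        for: 5m                                  # assumed hold time
        labels:
          severity: critical                     # assumed severity scheme
        annotations:
          summary: CloudZero webhook has processed no admission events for 30 minutes
      - alert: CloudZeroWebhookEventCounterAbsent
        # Fires when no czo_webhook_types_total series exist at all.
        expr: absent(czo_webhook_types_total)
        for: 30m                                 # assumed hold time
        labels:
          severity: warning
        annotations:
          summary: czo_webhook_types_total is missing (never emitted or not scraped)
```

The absent() rule also fires when the webhook's /metrics endpoint is simply not a scrape target, so it is only meaningful once scraping is known to work.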
This section covers monitoring the collector, which is the entry point for all metrics in the CloudZero Agent.
The collector receives Prometheus remote-write requests from the internal Prometheus agent, classifies them as cost or observability metrics, and stores them to disk. If the collector stops receiving or processing metrics, CloudZero loses visibility into the cluster.
Metrics endpoint: http://<aggregator-pod>:8080/metrics (the collector container in the aggregator deployment)
| Metric | Type | What It Means |
|---|---|---|
| metrics_received_total | Counter | Total raw metrics received from Prometheus remote-write. Should increase steadily every scrape interval. |
| metrics_received_cost_total | Counter | Subset classified as cost-allocation metrics (cAdvisor, KSM). This is the data CloudZero uses for cost attribution. |
| metrics_received_observability_total | Counter | Subset classified as observability metrics (agent self-monitoring data). |
| http_requests_total{code="204",method="post"} | Counter | Successful remote-write ingestion requests (HTTP 204 = accepted). |
| http_request_duration_seconds{method="post"} | Histogram | Latency of processing each remote-write batch. |
1. No metrics received:
rate(metrics_received_total[10m]) == 0
No metrics received in 10 minutes. This means the internal Prometheus agent is not sending data to the collector, or the collector is down. In a healthy agent, this counter increments every scrape interval (typically ~60s).
Triage: See Data Pipeline Diagnostics > All Pods Healthy But No Data and CrashLoopBackOff Diagnostics in the Debugging Guide.
2. Cost metrics not being received:
rate(metrics_received_cost_total[10m]) == 0
and
rate(metrics_received_total[10m]) > 0
Metrics are arriving but none are classified as cost metrics. This typically means KSM or cAdvisor scrape targets are not being scraped. The total will still increase from observability metrics, but CloudZero cannot perform cost attribution without the cost metrics.
Triage: See Data Pipeline Diagnostics > Missing KSM Metrics and Missing cAdvisor Metrics in the Debugging Guide.
3. High collector ingestion latency:
histogram_quantile(0.99,
rate(http_request_duration_seconds_bucket{code="204",method="post"}[5m])
) > 1
p99 collector ingestion latency exceeding 1 second. In healthy operation, the p99 for POST requests is well under 25ms. Values above 1s suggest memory pressure, disk I/O contention, or a cluster generating more data than the collector can process.
Triage: See Performance Diagnostics > High Memory Usage in the Debugging Guide.
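For reference, the three collector checks above might be grouped into one rule file like this. The expressions are taken from this section unchanged; the group name, alert names, for durations, and severities are placeholders chosen for illustration.

```yaml
groups:
  - name: cloudzero-collector-pipeline           # hypothetical group name
    rules:
      - alert: CloudZeroCollectorNoMetrics       # hypothetical alert name
        expr: rate(metrics_received_total[10m]) == 0
        for: 5m                                  # assumed hold time
        labels:
          severity: critical                     # assumed severity scheme
        annotations:
          summary: Collector has received no metrics in the last 10 minutes
      - alert: CloudZeroCollectorNoCostMetrics
        expr: |
          rate(metrics_received_cost_total[10m]) == 0
          and
          rate(metrics_received_total[10m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Metrics are arriving but none are classified as cost metrics
      - alert: CloudZeroCollectorSlowIngestion
        expr: |
          histogram_quantile(0.99,
            rate(http_request_duration_seconds_bucket{code="204",method="post"}[5m])
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: Collector p99 ingestion latency is above 1s
```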
This section covers monitoring the shipper, which is responsible for uploading collected metric data to CloudZero.
The shipper is the critical "last mile" of the data pipeline. It uploads collected metric data files to CloudZero's S3 buckets. When the shipper fails, data accumulates on disk. The agent automatically manages disk space by cleaning up shipped data based on age and shipped status, and becomes more aggressive about cleanup under disk pressure. This means persistent shipper failures are the real concern: if files can't be shipped, eventually even unshipped data will be purged to keep the volume from filling, resulting in data loss.
Metrics endpoint: http://<aggregator-pod>:8081/metrics (the shipper sidecar in the aggregator deployment)
| Metric | Type | What It Means |
|---|---|---|
| shipper_run_fail_total | Counter | Total shipper cycle failures, labeled by error_status_code. |
| shipper_new_files_error_total | Counter | File processing errors, labeled by error_status_code. |
| shipper_presigned_url_error_total | Counter | Pre-signed URL allocation failures (from CloudZero API). |
| shipper_file_upload_error_total | Counter | S3 upload failures, labeled by error_status_code. |
| shipper_handle_request_success_total | Counter | Successful file upload batches. Should increase every shipper cycle (~10 min). |
| shipper_handle_request_file_count | Histogram | Number of files processed per upload cycle. |
| function_execution_seconds{function_name="shipper_runShipper"} | Histogram | Duration of each shipper cycle, with error label showing failure reason on failure. |
1. Shipper cycle failures:
rate(shipper_run_fail_total[30m]) > 0
The shipper is failing to complete its upload cycle. The error_status_code label identifies the category of failure:
- err-unauthorized: Invalid or revoked API key. The most common cause.
- err-network: Cannot reach api.cloudzero.com or S3.
- Other codes: See shipper logs for details.
Triage: For err-unauthorized, see Data Pipeline Diagnostics > API key invalid or revoked. For err-network, see Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.
2. Pre-signed URL failures:
rate(shipper_presigned_url_error_total[30m]) > 0
Cannot obtain pre-signed URLs from CloudZero API. This blocks all file uploads. The shipper must obtain pre-signed URLs from api.cloudzero.com before uploading to S3. Failures here mean the API key is invalid/revoked, the network path to the API is blocked, or the CloudZero API is experiencing issues.
Triage: See Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.
3. S3 upload failures:
rate(shipper_file_upload_error_total[30m]) > 0
Pre-signed URLs were obtained successfully, but the actual file transfer to S3 failed. This typically indicates network policy blocking S3 or transient connectivity issues.
Triage: See Network Diagnostics > Cannot Reach S3 Buckets in the Debugging Guide.
4. No successful uploads:
rate(shipper_handle_request_success_total[30m]) == 0
No files have been successfully uploaded in 30 minutes. The shipper runs approximately every 10 minutes, so three consecutive cycles have failed. This is a broader check than alerts 1-3 that catches any delivery failure regardless of the specific error category.
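A rule-file sketch for the shipper could pair the specific failure counter with the broad "no successful uploads" check, surfacing the error_status_code label in the alert text. Names, for durations, and severities are assumptions. Note that these counters only exist after their first event, so the success-based rule returns no data on an agent that has never completed an upload.

```yaml
groups:
  - name: cloudzero-shipper                      # hypothetical group name
    rules:
      - alert: CloudZeroShipperCycleFailing      # hypothetical alert name
        expr: rate(shipper_run_fail_total[30m]) > 0
        for: 10m                                 # assumed hold time
        labels:
          severity: warning                      # assumed severity scheme
        annotations:
          summary: "Shipper cycles are failing ({{ $labels.error_status_code }})"
      - alert: CloudZeroShipperNoSuccessfulUploads
        expr: rate(shipper_handle_request_success_total[30m]) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: No files uploaded to CloudZero in the last 30 minutes
```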
This section covers monitoring the webhook's delivery of Kubernetes resource metadata to the CloudZero API via Prometheus remote-write.
The webhook captures Kubernetes resource metadata (pod creates/deletes/updates, storage classes, ingress classes, etc.) and pushes it to CloudZero. This metadata is essential for cost attribution -- without it, CloudZero cannot map resource costs to teams, services, or labels.
Metrics endpoint: https://<webhook-pod>:8443/metrics
| Metric | Type | What It Means |
|---|---|---|
| remote_write_failures_total | Counter | Failed attempts to push metadata to CloudZero API. |
| remote_write_backlog_records | Gauge | Records queued but not yet sent. A growing backlog means delivery is failing. |
| remote_write_records_processed_total | Counter | Records successfully sent and confirmed. |
| remote_write_response_codes_total | Counter | Response codes from CloudZero API, labeled by status_code (bucketed values: 2xx, 4xx, 5xx, no_response). |
| remote_write_request_duration_seconds | Histogram | Latency of push requests to CloudZero API. |
| remote_write_db_failures_total | Counter | Database failures when tracking record state internally. |
| storage_write_failure_total | Counter | Failures writing admission data to local storage, labeled by resource_type, namespace, resource_name, action. |
1. Remote write failures:
rate(remote_write_failures_total[30m]) > 0
The webhook cannot push metadata to CloudZero. Cost attribution metadata is not being delivered.
Triage: See Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.
2. Growing backlog:
remote_write_backlog_records > 1000
Data is being captured but not delivered. A small transient backlog during high activity is normal, but a persistently growing backlog means delivery is failing or too slow. Check remote_write_response_codes_total for non-2xx codes.
3. Non-success responses from CloudZero API:
rate(remote_write_response_codes_total{status_code!="2xx"}[10m]) > 0
The status_code label uses bucketed values:
- 4xx: Client error, most commonly an invalid or revoked API key. See Job Failure Diagnostics > Invalid API key in the Debugging Guide.
- 5xx: CloudZero API issue. Typically transient unless persistent.
- no_response: No response received (network timeout, DNS failure, connection refused). See Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.
4. Local storage write failures:
rate(storage_write_failure_total[10m]) > 0
The webhook is receiving admission requests but cannot persist them to its local data store. The resource_type, namespace, resource_name, and action labels identify exactly which resource operations are failing.
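Because a small transient backlog is described as normal, one way to separate it from a persistent one is to require the threshold to hold for some time before firing. The sketch below does that with a for clause; the 30-minute hold, the names, and the severities are assumptions, not values from the agent's documentation.

```yaml
groups:
  - name: cloudzero-webhook-metadata-delivery    # hypothetical group name
    rules:
      - alert: CloudZeroWebhookBacklogPersistent # hypothetical alert name
        expr: remote_write_backlog_records > 1000
        for: 30m                                 # assumed: ignore short spikes
        labels:
          severity: warning                      # assumed severity scheme
        annotations:
          summary: Webhook metadata backlog has stayed above 1000 records
      - alert: CloudZeroWebhookNonSuccessResponses
        expr: rate(remote_write_response_codes_total{status_code!="2xx"}[10m]) > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "CloudZero API is returning {{ $labels.status_code }} to the webhook"
```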
This section covers monitoring whether the webhook is actually receiving and processing Kubernetes admission requests.
It's possible for the webhook to be "running" but not receiving any requests. This can happen due to a misconfigured ValidatingWebhookConfiguration, network policy blocking API server traffic, or TLS certificate issues (see Monitoring Webhook TLS Certificate Health). When the webhook isn't receiving events, all resource metadata is silently lost.
Metrics endpoint: https://<webhook-pod>:8443/metrics
| Metric | Type | What It Means |
|---|---|---|
| czo_webhook_types_total | Counter | Webhook events processed, labeled by kind_group, kind_version, kind_resource, operation. |
| function_execution_seconds{function_name=~"executeAdmissionsReviewRequest_.*"} | Histogram | Server-side processing time per admission request type (Create, Update, Delete). |
| function_execution_seconds{function_name="writeDataToStorage"} | Histogram | Time to persist admission data to local storage. |
1. No pod events received:
rate(czo_webhook_types_total{kind_resource="pod",operation="create"}[1h]) == 0
No pod CREATE events seen in 1 hour. In any active cluster, pods are being created regularly. A zero rate means the webhook is not receiving admission requests.
Triage: See Webhook Not Receiving Admission Requests in the Debugging Guide. Check the ValidatingWebhookConfiguration, CA bundle, and network policies allowing API server to webhook on port 8443. Also see Monitoring Webhook TLS Certificate Health -- TLS certificate mismatch is the most common cause of this symptom.
2. High webhook server-side processing latency:
histogram_quantile(0.99,
rate(function_execution_seconds_bucket{
function_name=~"executeAdmissionsReviewRequest_.*",
error=""
}[5m])
) > 0.5
Webhook server-side processing p99 exceeding 500ms. This is the processing time inside the webhook; the API server sees this plus network/TLS overhead. In healthy operation, the p99 for admission request processing is typically under 20ms.
Triage: See Performance Diagnostics > Slow Webhook Response Times in the Debugging Guide. Cross-reference with the API server metric apiserver_admission_webhook_admission_duration_seconds to determine whether the latency is from processing or network.
This section covers monitoring the internal Prometheus agent that scrapes Kubernetes metrics and feeds them to the collector.
The CloudZero Agent uses Prometheus in agent mode to scrape Kubernetes metrics (cAdvisor from kubelets, KSM) and remote-write them to the collector. If Prometheus can't discover scrape targets, can't scrape them, or can't remote-write to the collector, no data enters the pipeline.
Metrics endpoint: http://<server-pod>:9090/metrics (the Prometheus process on the server deployment)
| Metric | Type | What It Means |
|---|---|---|
| prometheus_sd_discovered_targets | Gauge | Targets discovered by service discovery. |
| prometheus_target_scrape_pool_targets | Gauge | Current number of active scrape targets. |
| prometheus_remote_storage_samples_failed_total | Counter | Samples that permanently failed to send via remote-write. These are lost. |
| prometheus_remote_storage_samples_pending | Gauge | Samples queued for remote-write. Growing = collector can't keep up. |
| prometheus_remote_storage_queue_highest_sent_timestamp_seconds | Gauge | Most recent successfully sent timestamp. |
| prometheus_remote_storage_queue_highest_timestamp_seconds | Gauge | Most recent enqueued timestamp. |
| prometheus_remote_storage_shards | Gauge | Current number of remote-write shards. |
| prometheus_target_scrapes_sample_out_of_order_total | Counter | Samples dropped due to out-of-order timestamps. |
1. No scrape targets discovered:
prometheus_sd_discovered_targets == 0
Prometheus has discovered zero scrape targets. Service discovery is broken and no metrics will be collected.
Triage: See Data Pipeline Diagnostics > Some Metrics Missing in the Debugging Guide. Check the agent-server Prometheus configuration and RBAC permissions.
2. Remote-write samples permanently failing:
rate(prometheus_remote_storage_samples_failed_total[10m]) > 0
Prometheus permanently failed to send samples to the collector via remote-write. These samples are lost. Check collector health (see Monitoring Data Pipeline Health) first; if the collector is healthy, check Prometheus logs for remote-write errors.
3. Remote-write lag exceeding 5 minutes:
(
prometheus_remote_storage_queue_highest_timestamp_seconds
-
prometheus_remote_storage_queue_highest_sent_timestamp_seconds
) > 300
The remote-write queue is more than 5 minutes behind. Data is being scraped but not delivered to the collector in a timely manner. Small lag during startup or after a restart is normal. Persistent lag indicates the collector can't keep up, or there's a network issue between Prometheus and the collector.
Triage: See Performance Diagnostics > Data Processing Delays in the Debugging Guide.
4. Rapid remote-write resharding:
changes(prometheus_remote_storage_shards[5m]) > 10
The Prometheus remote-write subsystem is rapidly oscillating its shard count. This is a symptom of write pressure where the remote-write queue can't maintain a stable throughput rate. Rapid resharding is a leading indicator of imminent memory exhaustion on the server -- the constant allocation and deallocation of shards drives memory usage up.
Triage: Increase server memory limits. For very large clusters, consider enabling federated mode to reduce the number of targets the server must scrape. See Performance Diagnostics > High Memory Usage in the Debugging Guide.
5. Out-of-order sample drops:
rate(prometheus_target_scrapes_sample_out_of_order_total[10m]) > 0
Samples are being dropped because their timestamps are out of order. This can indicate cAdvisor timestamp issues (common with some container runtimes) or scrape interval misconfiguration. Each dropped sample is a gap in the collected data.
Triage: See Data Pipeline Diagnostics > Missing cAdvisor Metrics in the Debugging Guide.
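The remote-write checks from this section could be expressed as alerting rules along these lines. The expressions are copied from above; the for clause on the lag rule is an assumption added so that the brief lag described as normal after a restart does not page anyone. Names and severities are likewise illustrative.

```yaml
groups:
  - name: cloudzero-prometheus-agent             # hypothetical group name
    rules:
      - alert: CloudZeroRemoteWriteSamplesFailed # hypothetical alert name
        expr: rate(prometheus_remote_storage_samples_failed_total[10m]) > 0
        for: 5m                                  # assumed hold time
        labels:
          severity: warning                      # assumed severity scheme
        annotations:
          summary: Prometheus agent is permanently dropping remote-write samples
      - alert: CloudZeroRemoteWriteLagging
        expr: |
          (
            prometheus_remote_storage_queue_highest_timestamp_seconds
            -
            prometheus_remote_storage_queue_highest_sent_timestamp_seconds
          ) > 300
        for: 10m                                 # assumed: tolerate startup lag
        labels:
          severity: warning
        annotations:
          summary: Remote-write to the collector is more than 5 minutes behind
      - alert: CloudZeroRemoteWriteReshardingChurn
        expr: changes(prometheus_remote_storage_shards[5m]) > 10
        labels:
          severity: warning
        annotations:
          summary: Remote-write shard count is oscillating rapidly
```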
This section covers standard Kubernetes monitoring for the CloudZero Agent's pods and containers, including OOM kills, container restarts, and proactive memory pressure detection.
kube-state-metrics provides metrics about pod and container state. Most Kubernetes monitoring stacks (kube-prometheus-stack, etc.) include it by default.
Note: To ensure OOM kill metrics are available, your kube-state-metrics deployment must have the pods collector enabled. If you do not see the kube_pod_container_status_last_terminated_reason metric, check your KSM configuration to ensure the pods collector is active and not explicitly excluded.
# Verify kube-state-metrics is running
kubectl get pods -A | grep kube-state-metrics
# Verify OOM metrics are exposed (adjust the namespace and deployment name to your install)
kubectl port-forward -n kube-system deployment/kube-state-metrics 8080:8080 &
sleep 2  # give the background port-forward a moment to start
curl -s http://localhost:8080/metrics | grep kube_pod_container_status_last_terminated_reason
kill %1  # stop the port-forward
OOM kills occur when a container exceeds its memory limit and is terminated by the kernel. This causes service disruption, potential data loss (in-flight requests and unsaved state), and restart delays.
Primary metric (from kube-state-metrics):
kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}
What to alert on:
Current OOMKilled containers:
kube_pod_container_status_last_terminated_reason{
namespace="<agent-namespace>",
reason="OOMKilled"
} == 1
OOM kill rate over time:
# namespace must survive the aggregation so that on(namespace, pod, container) can match
sum(rate(kube_pod_container_status_restarts_total{
namespace="<agent-namespace>"
}[5m])) by (namespace, pod, container)
and on(namespace, pod, container)
kube_pod_container_status_last_terminated_reason{
namespace="<agent-namespace>",
reason="OOMKilled"
} == 1
Triage: See CrashLoopBackOff Diagnostics > OOMKilled in the Debugging Guide. Increase memory limits for the affected container. See Memory Sizing Guidelines below for recommendations based on cluster size.
Any container restart in agent pods:
rate(kube_pod_container_status_restarts_total{
namespace="<agent-namespace>"
}[15m]) > 0
A single restart during initial startup is normal. Repeated restarts indicate CrashLoopBackOff. Check the exit code: 137 (SIGKILL) usually means OOMKilled (increase memory); other codes indicate an application error (check logs).
Triage: See CrashLoopBackOff Diagnostics in the Debugging Guide.
kube_deployment_status_replicas_unavailable{
namespace="<agent-namespace>"
} > 0
Any agent deployment has unavailable replicas. Run kubectl get pods -n <namespace> and follow the General kubectl Workflow in the Debugging Guide.
Monitoring memory usage against limits provides early warning before containers are OOM killed. This is the single most common operational issue observed in production CloudZero Agent deployments across clusters of all sizes.
Memory usage as percentage of limit:
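# cAdvisor and kube-state-metrics series carry different label sets,
# so both sides are reduced to (namespace, pod, container) before dividing
(
  max by (namespace, pod, container) (
    container_memory_working_set_bytes{
      namespace="<agent-namespace>",
      container!="", container!="POD"
    }
  )
  / on(namespace, pod, container)
  max by (namespace, pod, container) (
    kube_pod_container_resource_limits{
      namespace="<agent-namespace>",
      resource="memory"
    }
  )
) > 0.85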
Alert when memory usage exceeds 85% of the limit. This gives you time to increase limits before the container is killed.
Why container_memory_working_set_bytes? This metric excludes cached data that can be evicted, so it reflects true memory pressure. It is the value the kubelet uses for memory-pressure eviction decisions, and it closely tracks the usage that counts against the container's memory limit.
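As an alerting rule, the 85% check might look like the sketch below. The expression aggregates both sides to the namespace, pod, and container labels so that the cAdvisor and kube-state-metrics series line up for the division. The group name, alert name, for duration, and severity are assumptions.

```yaml
groups:
  - name: cloudzero-agent-memory                 # hypothetical group name
    rules:
      - alert: CloudZeroAgentMemoryNearLimit     # hypothetical alert name
        expr: |
          (
            max by (namespace, pod, container) (
              container_memory_working_set_bytes{
                namespace="<agent-namespace>",
                container!="", container!="POD"
              }
            )
            / on(namespace, pod, container)
            max by (namespace, pod, container) (
              kube_pod_container_resource_limits{
                namespace="<agent-namespace>",
                resource="memory"
              }
            )
          ) > 0.85
        for: 15m                                 # assumed hold time
        labels:
          severity: warning                      # assumed severity scheme
        annotations:
          summary: "{{ $labels.pod }}/{{ $labels.container }} is above 85% of its memory limit"
```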
Monitoring against requests provides a different signal: it detects when actual usage has significantly diverged from what was expected at scheduling time, which can indicate undersized requests (scheduling problems), unexpected memory growth, or workload characteristics that have changed.
Memory usage as percentage of request:
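# Both sides are reduced to (namespace, pod, container) so the label sets match
(
  max by (namespace, pod, container) (
    container_memory_working_set_bytes{
      namespace="<agent-namespace>",
      container!="", container!="POD"
    }
  )
  / on(namespace, pod, container)
  max by (namespace, pod, container) (
    kube_pod_container_resource_requests{
      namespace="<agent-namespace>",
      resource="memory"
    }
  )
) * 100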
Memory usage for CloudZero Agent components scales with cluster size. The following table provides approximate observed ranges to help with capacity planning. Memory limits should be set to at least 1.5x the observed peak for your cluster size tier.
| Component | Small (<50 nodes) | Medium (50-200 nodes) | Large (200-600 nodes) | Very Large (600+ nodes) |
|---|---|---|---|---|
| Server (Prometheus) | 200-700 Mi | 900-2500 Mi | 4000-11000 Mi | 14000+ Mi |
| Aggregator (per replica) | 80-200 Mi | 200-1100 Mi | 1000-1500 Mi | Scale replicas |
| Webhook (per replica) | 10-50 Mi | 250-350 Mi | 250-650 Mi | 250-650 Mi |
| KSM | 20-80 Mi | 70-360 Mi | 300-650 Mi | 2500+ Mi |
The Prometheus server is the component most sensitive to cluster size. For clusters above 200 nodes, consider enabling federated mode to distribute the scrape load.
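To make the 1.5x guideline concrete: if the server in a small cluster peaks at roughly 700 Mi, the limit would be at least 1.5 x 700 = 1050 Mi. In a generic Kubernetes container spec that could be written as below. The numbers are a worked example only, setting the request to the observed peak is an assumption, and where this stanza lives in the agent's Helm values depends on the chart and is not shown here.

```yaml
# Generic Kubernetes container resources stanza (illustrative values).
resources:
  requests:
    memory: 700Mi     # assumed: roughly the observed peak in this example
  limits:
    memory: 1050Mi    # 1.5 x 700Mi observed peak
```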
| Priority | What to Monitor | Key Alert | Common Cause |
|---|---|---|---|
| P0 | Webhook admission latency | API server metric p99 > 1s | Webhook unreachable, TLS failure, resource pressure |
| P0 | Server memory vs limit | container_memory_working_set_bytes / limit > 0.85 | Cluster too large for current memory limits |
| P0 | Webhook receiving zero events | rate(czo_webhook_types_total) == 0 for 30m | TLS certificate mismatch after rotation |
| P1 | Shipper upload failures | rate(shipper_run_fail_total) > 0 for 30m | Invalid API key, network policy blocking egress |
| P1 | Aggregator memory vs limit | container_memory_working_set_bytes / limit > 0.85 | Default 1Gi limit too low for cluster size |
| P1 | OOM kills | kube_pod_container_status_last_terminated_reason{reason="OOMKilled"} | Undersized memory limits |
| P2 | Remote-write resharding churn | changes(prometheus_remote_storage_shards[5m]) > 10 | Write pressure, imminent server OOM |
| P2 | No metrics received | rate(metrics_received_total) == 0 for 10m | Collector down, Prometheus not sending data |
| P2 | Webhook metadata backlog | remote_write_backlog_records > 1000 | API connectivity or auth issues |
| P2 | Out-of-order sample drops | rate(prometheus_target_scrapes_sample_out_of_order_total) > 0 | cAdvisor timestamp issues |
| P3 | Container restarts | rate(kube_pod_container_status_restarts_total) > 0 | Various (check exit code) |