Monitoring the CloudZero Agent

This document explains how to monitor the health and performance of the CloudZero Agent using metrics exposed by the agent itself, the Kubernetes API server, and standard Kubernetes infrastructure metrics. Each section covers a specific failure mode, which metrics to watch, what to alert on, and where to go for remediation.

How this document is organized:

Section	What It Covers	Primary Metrics Source
Webhook Admission Latency	Impact on cluster API operations	Kubernetes API server
Webhook TLS Certificate Health	Silent webhook failures from cert issues	Webhook server + API server
Data Pipeline Health	Metrics flowing into the collector	Collector (`port 8080`)
Shipper Health	Data delivery to CloudZero	Shipper (`port 8081`)
Webhook Metadata Delivery	Resource metadata pushed to CloudZero API	Webhook server (`port 8443`)
Webhook Event Processing	Webhook receiving and processing events	Webhook server (`port 8443`)
Prometheus Agent Scrape Health	Prometheus scraping targets correctly	Prometheus agent (`port 9090`)
Pod and Container Health	OOM kills, restarts, memory pressure	kube-state-metrics, cAdvisor

Accessing agent metrics:

All agent components expose /metrics endpoints. Here is the port mapping:

Component	Container	Port	Protocol	What It Exposes
Aggregator	collector	8080	HTTP	`metrics_received_*`, `http_request_duration_seconds`
Aggregator	shipper	8081	HTTP	`shipper_*`, `function_execution_seconds`
Server	Prometheus agent	9090	HTTP	`prometheus_*` (Prometheus native metrics)
Webhook	webhook-server	8443	HTTPS	`czo_webhook_*`, `remote_write_*`, `function_execution_seconds`

If you are running a monitoring Prometheus in your cluster (or Datadog, Grafana Agent, etc.), add scrape targets for the agent services. The shipper sidecar (port 8081) is not exposed via the Kubernetes service by default; to scrape it, use pod-level service discovery or annotations.
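As a rough sketch of pod-level discovery for the shipper, the following Prometheus scrape job keeps only containers named shipper in pods labeled app.kubernetes.io/part-of: cloudzero-agent and points the target at port 8081. The job name and the cloudzero namespace are assumptions; adjust both to your installation.

```yaml
scrape_configs:
  - job_name: cloudzero-agent-shipper      # hypothetical job name
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [cloudzero]               # assumed agent namespace
    relabel_configs:
      # Keep only pods that belong to the CloudZero Agent.
      - source_labels: [__meta_kubernetes_pod_label_app_kubernetes_io_part_of]
        regex: cloudzero-agent
        action: keep
      # Keep only the shipper container (container name from the port table).
      - source_labels: [__meta_kubernetes_pod_container_name]
        regex: shipper
        action: keep
      # Scrape the shipper's metrics port on the pod IP.
      - source_labels: [__meta_kubernetes_pod_ip]
        regex: (.+)
        target_label: __address__
        replacement: "${1}:8081"
      # Carry namespace and pod through as target labels.
      - source_labels: [__meta_kubernetes_namespace]
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
```

The same pattern should carry over to the other components by changing the container name and port. The webhook on 8443 would additionally need scheme: https and a tls_config that trusts the webhook certificate.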

Note on counter metrics: Some agent metrics (e.g., shipper_file_upload_error_total, storage_write_failure_total) are Prometheus counters that only appear in /metrics output after their first event. If you don't see a metric listed here, it means that event type hasn't occurred yet, which is normal.

Filtering for CloudZero Agent resources in queries:

The queries in this document filter by namespace. Replace <agent-namespace> with the namespace where the CloudZero Agent is installed (typically cloudzero). If you need finer-grained filtering, all agent pods carry the label app.kubernetes.io/part-of: cloudzero-agent, which can be used in joins with kube_pod_labels.
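For example, a query can be restricted to agent pods by joining against kube_pod_labels. This is a sketch that assumes kube-state-metrics is configured to expose the app.kubernetes.io/part-of pod label (recent versions only export labels listed in --metric-labels-allowlist). The group and alert names are hypothetical.

```yaml
groups:
  - name: cloudzero-agent-label-filter             # hypothetical group name
    rules:
      - alert: CloudZeroAgentContainerRestarting   # hypothetical alert name
        # kube_pod_labels has the value 1, so multiplying by it keeps the
        # restart rate unchanged while dropping pods without the label.
        expr: |
          (
            rate(kube_pod_container_status_restarts_total[15m])
              * on (namespace, pod) group_left ()
            kube_pod_labels{label_app_kubernetes_io_part_of="cloudzero-agent"}
          ) > 0
```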

Monitoring Webhook Admission Latency

This section explains how to monitor the CloudZero Agent webhook's performance impact on your Kubernetes cluster using Kubernetes API server metrics.

Why Monitor Webhook Admission Latency?

The CloudZero Agent webhook operates as a Kubernetes validating admission controller that intercepts resource operations (CREATE, UPDATE, DELETE) before they're persisted to etcd. While the webhook is designed to be fast and non-blocking, it's important to monitor its performance impact to ensure it doesn't introduce significant latency into your cluster operations.

The agent uses failurePolicy: Ignore, which means an unreachable or slow webhook will not block operations -- but each operation must wait for the webhook to respond or time out. With the default 10-second timeout, every pod create/update/delete can be delayed by up to 10 seconds if the webhook is unhealthy.

Key concerns include:

API request latency: How much time does the webhook add to resource operations?
Network overhead: Time spent on TLS handshakes, network transit, and proxy components
Service mesh impact: Additional latency introduced by service meshes (e.g., Istio, Linkerd)
Operational visibility: Understanding the webhook's behavior during high-load scenarios

Why API Server Metrics?

Monitoring webhook performance from the webhook server itself only captures part of the picture. The webhook server can measure how long it takes to process a request once received, but it cannot measure:

TLS handshake time
Network transit time (both directions)
Service mesh overhead (Istio, Linkerd, etc.)
API server queuing time
Connection setup overhead

The API server metrics provide the complete end-to-end latency from when the API server initiates the webhook call to when it receives the response. This is the metric that matters for understanding the webhook's impact on cluster operations.

The Right Metric: `apiserver_admission_webhook_admission_duration_seconds`

Kubernetes exposes a STABLE metric specifically for tracking admission webhook latency (see the Kubernetes Metrics Reference):

apiserver_admission_webhook_admission_duration_seconds

Metric Details:

Type: Histogram
Stability: STABLE
Labels:
- name: Webhook name (e.g., "cz-agent-cloudzero-agent-webhook-server-webhook.cloudzero-agent.svc")
- operation: API operation (CREATE, UPDATE, DELETE, CONNECT)
- rejected: Whether the request was rejected ("true" or "false")
- type: Webhook type ("validating", or "admit" for mutating webhooks)
Buckets: [0.005, 0.025, 0.1, 0.5, 1, 2.5, 10, 25] seconds

This metric captures the complete round-trip time including:

TLS handshake and connection setup
Network transit (to webhook server and back)
Service mesh proxy processing (if applicable)
Webhook server processing time
Any queuing or retry logic

Accessing the Metrics

The metric is exposed by the Kubernetes API server on the /metrics endpoint. If you have Prometheus configured to scrape the API server, this metric will be available automatically. You can query it directly using:

kubectl get --raw /metrics | grep apiserver_admission_webhook_admission_duration_seconds

To see metrics specific to the CloudZero webhook, filter by the webhook name in your Prometheus queries.

What to Alert On

p99 latency exceeding 1 second:

histogram_quantile(0.99,
  rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
    name=~".*cloudzero.*"
  }[5m])
) > 1

When the webhook is healthy, p99 latency is typically under 25ms. A p99 above 1 second indicates either webhook processing issues or network/TLS problems. When the webhook is completely unreachable, every request will show the full timeout duration (default 10 seconds).
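Wrapped in a Prometheus alerting rule, the p99 check might look like the sketch below. The group name, alert name, hold time, and severity label are assumptions. The expression also sums the buckets by le and name, a variation on the query above, so that a single quantile is computed per webhook across all API server instances and operations.

```yaml
groups:
  - name: cloudzero-webhook-latency          # hypothetical group name
    rules:
      - alert: CloudZeroWebhookLatencyHigh   # hypothetical alert name
        expr: |
          histogram_quantile(0.99,
            sum by (le, name) (
              rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
                name=~".*cloudzero.*"
              }[5m])
            )
          ) > 1
        for: 10m            # assumed hold time; tune to your tolerance
        labels:
          severity: page    # assumed label scheme
        annotations:
          summary: "CloudZero webhook p99 admission latency above 1s ({{ $labels.name }})"
```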

Median latency exceeding 1 second (indicates widespread failure):

histogram_quantile(0.50,
  rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
    name=~".*cloudzero.*"
  }[5m])
) > 1

If the median jumps from <10ms to >1s, nearly every webhook call is failing. This is a strong signal of TLS certificate mismatch or total webhook unavailability. See Monitoring Webhook TLS Certificate Health.

Triage: See Webhook Unreachable / API Server Latency in the Debugging Guide.

Webhook Server Metrics (Supplementary)

While API server metrics provide the complete picture, the webhook server itself also exposes metrics that can help diagnose issues:

Available at: https://<webhook-pod-ip>:8443/metrics

Key metrics:

http_request_duration_seconds: Server-side processing time (excludes network/TLS)
http_requests_total: Request count by status code
czo_webhook_types_total: Webhook events by resource type and operation

Note: These metrics only show webhook server processing time and don't include network latency or TLS overhead. They're useful for isolating whether high latency is due to webhook processing or network/infrastructure issues.

Comparison: API Server vs. Webhook Server Metrics

Aspect	API Server Metrics	Webhook Server Metrics
Scope	Complete end-to-end latency	Server processing only
Includes TLS	Yes	No
Includes Network	Yes	No
Includes Sidecars	Yes	No
Granularity	Per webhook name	All requests
Recommended for	Performance monitoring	Debugging webhook logic

Recommendation: Use API server metrics for understanding the webhook's impact on cluster operations. Use webhook server metrics only for debugging specific webhook processing issues.

Monitoring Webhook TLS Certificate Health

This section covers how to detect silent webhook failures caused by TLS certificate issues. This is one of the most dangerous failure modes because it's completely invisible to standard Kubernetes monitoring.

Why This Matters

The CloudZero Agent webhook uses TLS certificates for communication with the Kubernetes API server. When there's a certificate mismatch -- for example, after a certificate rotation where the API server's caBundle hasn't been updated -- every admission request fails the TLS handshake. Because the webhook uses failurePolicy: Ignore, these failures are silently ignored: the API server lets operations proceed, but the webhook records nothing.

The result is complete, silent loss of all resource metadata (pod creates, deletes, updates, namespace changes, etc.) with no alerts from the Kubernetes side. The webhook pods appear healthy, all deployments show ready, and there are no error events in the namespace.

Detection Strategy

There is currently no Prometheus metric exported by the webhook for TLS handshake failures specifically. Detection relies on observing the absence of successful events, which is actually a reliable signal: in any active cluster, the webhook should be continuously receiving admission requests.

Metrics endpoint: https://<webhook-pod>:8443/metrics and the Kubernetes API server /metrics endpoint.

What to Alert On

1. Webhook running but receiving zero events:

rate(czo_webhook_types_total[30m]) == 0

This is the primary detection method. In any active cluster, pods are being created, updated, and deleted regularly. A zero rate for 30 minutes while the webhook pods are Running means the webhook is not receiving admission requests.

The most common cause is a TLS certificate mismatch between the webhook's TLS secret and the caBundle in the ValidatingWebhookConfiguration. Other causes include network policy blocking API server traffic to the webhook, or the ValidatingWebhookConfiguration being deleted.
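One caveat when turning this into a rule: counters only appear after their first event, so a webhook that has never received a request exposes no czo_webhook_types_total series and a plain == 0 comparison returns nothing. A minimal sketch that covers both cases, with hypothetical names and an assumed hold time:

```yaml
groups:
  - name: cloudzero-webhook-events        # hypothetical group name
    rules:
      - alert: CloudZeroWebhookNoEvents   # hypothetical alert name
        # First branch: the counter exists but has not moved in 30 minutes.
        # Second branch: the counter has never been exposed at all.
        expr: |
          sum(rate(czo_webhook_types_total[30m])) == 0
            or
          absent(czo_webhook_types_total)
        for: 5m                           # assumed hold time
```

Note that the absent() branch also fires when the webhook is simply not being scraped, so it is worth confirming the scrape target is up before concluding that TLS is at fault.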

Triage: See Webhook Diagnostics > Certificate not issued or expired and CA bundle mismatch in the Debugging Guide. To verify whether the caBundle matches:

# Compare fingerprints -- these should match
kubectl get validatingwebhookconfiguration <release>-cz-webhook \
  -o jsonpath='{.webhooks[0].clientConfig.caBundle}' | base64 -d | \
  openssl x509 -noout -fingerprint

kubectl get secret -n <namespace> <release>-cz-webhook-tls \
  -o jsonpath='{.data.ca\.crt}' | base64 -d | \
  openssl x509 -noout -fingerprint

2. API server webhook latency at timeout duration:

histogram_quantile(0.50,
  rate(apiserver_admission_webhook_admission_duration_seconds_bucket{
    name=~".*cloudzero.*"
  }[5m])
) > 1

When TLS handshakes fail, the API server waits for the timeout before proceeding. This manifests as every webhook call taking the full timeout duration. A median latency jumping from <10ms to >1s is a strong signal of TLS failure, though total webhook unavailability produces the same pattern.

Triage: Same as alert #1 above.

Monitoring Data Pipeline Health

This section covers monitoring the collector, which is the entry point for all metrics in the CloudZero Agent.

Why This Matters

The collector receives Prometheus remote-write requests from the internal Prometheus agent, classifies them as cost or observability metrics, and stores them to disk. If the collector stops receiving or processing metrics, CloudZero loses visibility into the cluster.

Metrics endpoint: http://<aggregator-pod>:8080/metrics (the collector container in the aggregator deployment)

Key Metrics

Metric	Type	What It Means
`metrics_received_total`	Counter	Total raw metrics received from Prometheus remote-write. Should increase steadily every scrape interval.
`metrics_received_cost_total`	Counter	Subset classified as cost-allocation metrics (cAdvisor, KSM). This is the data CloudZero uses for cost attribution.
`metrics_received_observability_total`	Counter	Subset classified as observability metrics (agent self-monitoring data).
`http_requests_total{code="204",method="post"}`	Counter	Successful remote-write ingestion requests (HTTP 204 = accepted).
`http_request_duration_seconds{method="post"}`	Histogram	Latency of processing each remote-write batch.

What to Alert On

1. No metrics received:

rate(metrics_received_total[10m]) == 0

No metrics received in 10 minutes. This means the internal Prometheus agent is not sending data to the collector, or the collector is down. In a healthy agent, this counter increments every scrape interval (typically ~60s).

Triage: See Data Pipeline Diagnostics > All Pods Healthy But No Data and CrashLoopBackOff Diagnostics in the Debugging Guide.

2. Cost metrics not being received:

rate(metrics_received_cost_total[10m]) == 0
  and
rate(metrics_received_total[10m]) > 0

Metrics are arriving but none are classified as cost metrics. This typically means KSM or cAdvisor scrape targets are not being scraped. The total will still increase from observability metrics, but CloudZero cannot perform cost attribution without the cost metrics.

Triage: See Data Pipeline Diagnostics > Missing KSM Metrics and Missing cAdvisor Metrics in the Debugging Guide.
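The first two collector alerts could be expressed as rules along these lines. This is a sketch: the names and hold times are assumptions, and the expressions sum across aggregator replicas so the alerts fire only when no replica is receiving data, which differs slightly from the per-series queries above.

```yaml
groups:
  - name: cloudzero-collector                   # hypothetical group name
    rules:
      - alert: CloudZeroCollectorNoMetrics      # hypothetical alert name
        expr: sum(rate(metrics_received_total[10m])) == 0
        for: 5m                                 # assumed hold time
      - alert: CloudZeroCollectorNoCostMetrics  # hypothetical alert name
        # Metrics are arriving, but none are classified as cost metrics.
        expr: |
          sum(rate(metrics_received_cost_total[10m])) == 0
            and
          sum(rate(metrics_received_total[10m])) > 0
        for: 10m                                # assumed hold time
```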

3. High collector ingestion latency:

histogram_quantile(0.99,
  rate(http_request_duration_seconds_bucket{code="204",method="post"}[5m])
) > 1

p99 collector ingestion latency exceeding 1 second. In healthy operation, the p99 for POST requests is well under 25ms. Values above 1s suggest memory pressure, disk I/O contention, or a cluster generating more data than the collector can process.

Triage: See Performance Diagnostics > High Memory Usage in the Debugging Guide.

Monitoring Shipper Health

This section covers monitoring the shipper, which is responsible for uploading collected metric data to CloudZero.

Why This Matters

The shipper is the critical "last mile" of the data pipeline. It uploads collected metric data files to CloudZero's S3 buckets. When the shipper fails, data accumulates on disk. The agent automatically manages disk space by cleaning up shipped data based on age and shipped status, and becomes more aggressive about cleanup under disk pressure. This means persistent shipper failures are the real concern: if files can't be shipped, eventually even unshipped data will be purged to keep the volume from filling, resulting in data loss.

Metrics endpoint: http://<aggregator-pod>:8081/metrics (the shipper sidecar in the aggregator deployment)

Key Metrics

Metric	Type	What It Means
`shipper_run_fail_total`	Counter	Total shipper cycle failures, labeled by `error_status_code`.
`shipper_new_files_error_total`	Counter	File processing errors, labeled by `error_status_code`.
`shipper_presigned_url_error_total`	Counter	Pre-signed URL allocation failures (from CloudZero API).
`shipper_file_upload_error_total`	Counter	S3 upload failures, labeled by `error_status_code`.
`shipper_handle_request_success_total`	Counter	Successful file upload batches. Should increase every shipper cycle (~10 min).
`shipper_handle_request_file_count`	Histogram	Number of files processed per upload cycle.
`function_execution_seconds{function_name="shipper_runShipper"}`	Histogram	Duration of each shipper cycle, with `error` label showing failure reason on failure.

What to Alert On

1. Shipper cycle failures:

rate(shipper_run_fail_total[30m]) > 0

The shipper is failing to complete its upload cycle. The error_status_code label identifies the category of failure:

err-unauthorized: Invalid or revoked API key. The most common cause.
err-network: Cannot reach api.cloudzero.com or S3.
Other codes: See shipper logs for details.

Triage: For err-unauthorized, see Data Pipeline Diagnostics > API key invalid or revoked. For err-network, see Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.

2. Pre-signed URL failures:

rate(shipper_presigned_url_error_total[30m]) > 0

Cannot obtain pre-signed URLs from CloudZero API. This blocks all file uploads. The shipper must obtain pre-signed URLs from api.cloudzero.com before uploading to S3. Failures here mean the API key is invalid/revoked, the network path to the API is blocked, or the CloudZero API is experiencing issues.

Triage: See Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.

3. S3 upload failures:

rate(shipper_file_upload_error_total[30m]) > 0

Pre-signed URLs were obtained successfully, but the actual file transfer to S3 failed. This typically indicates network policy blocking S3 or transient connectivity issues.

Triage: See Network Diagnostics > Cannot Reach S3 Buckets in the Debugging Guide.

4. No successful uploads:

rate(shipper_handle_request_success_total[30m]) == 0

No files have been successfully uploaded in 30 minutes. The shipper runs approximately every 10 minutes, so three consecutive cycles have failed. This is a broader check than alerts 1-3 that catches any delivery failure regardless of the specific error category.
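A possible rule group for the shipper, pairing the specific failure alert with the broad "no successful uploads" check. Names and hold times are assumptions; grouping by error_status_code lets the alert text name the failure category.

```yaml
groups:
  - name: cloudzero-shipper                         # hypothetical group name
    rules:
      - alert: CloudZeroShipperRunFailing           # hypothetical alert name
        expr: sum by (error_status_code) (rate(shipper_run_fail_total[30m])) > 0
        for: 10m                                    # assumed hold time
        annotations:
          summary: "Shipper cycles failing with {{ $labels.error_status_code }}"
      - alert: CloudZeroShipperNoSuccessfulUploads  # hypothetical alert name
        # The shipper runs roughly every 10 minutes, so a zero rate over
        # 30 minutes spans about three cycles.
        expr: sum(rate(shipper_handle_request_success_total[30m])) == 0
        for: 10m                                    # assumed hold time
```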

Monitoring Webhook Metadata Delivery

This section covers monitoring the webhook's delivery of Kubernetes resource metadata to the CloudZero API via Prometheus remote-write.

Why This Matters

The webhook captures Kubernetes resource metadata (pod creates/deletes/updates, storage classes, ingress classes, etc.) and pushes it to CloudZero. This metadata is essential for cost attribution -- without it, CloudZero cannot map resource costs to teams, services, or labels.

Metrics endpoint: https://<webhook-pod>:8443/metrics

Key Metrics

Metric	Type	What It Means
`remote_write_failures_total`	Counter	Failed attempts to push metadata to CloudZero API.
`remote_write_backlog_records`	Gauge	Records queued but not yet sent. A growing backlog means delivery is failing.
`remote_write_records_processed_total`	Counter	Records successfully sent and confirmed.
`remote_write_response_codes_total`	Counter	Response codes from CloudZero API, labeled by `status_code` (bucketed values: `2xx`, `4xx`, `5xx`, `no_response`).
`remote_write_request_duration_seconds`	Histogram	Latency of push requests to CloudZero API.
`remote_write_db_failures_total`	Counter	Database failures when tracking record state internally.
`storage_write_failure_total`	Counter	Failures writing admission data to local storage, labeled by `resource_type`, `namespace`, `resource_name`, `action`.

What to Alert On

1. Remote write failures:

rate(remote_write_failures_total[30m]) > 0

The webhook cannot push metadata to CloudZero. Cost attribution metadata is not being delivered.

Triage: See Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.

2. Growing backlog:

remote_write_backlog_records > 1000

Data is being captured but not delivered. A small transient backlog during high activity is normal, but a persistently growing backlog means delivery is failing or too slow. Check remote_write_response_codes_total for non-2xx codes.
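To separate a persistently growing backlog from a transient spike, the threshold can be combined with a trend check. The sketch below uses deriv() over an assumed 30-minute window and an assumed hold time; the names are hypothetical.

```yaml
groups:
  - name: cloudzero-webhook-delivery            # hypothetical group name
    rules:
      - alert: CloudZeroWebhookBacklogGrowing   # hypothetical alert name
        # Backlog is large and its fitted slope over 30 minutes is positive.
        expr: |
          remote_write_backlog_records > 1000
            and
          deriv(remote_write_backlog_records[30m]) > 0
        for: 15m                                # assumed hold time
```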

3. Non-success responses from CloudZero API:

rate(remote_write_response_codes_total{status_code!="2xx"}[10m]) > 0

The status_code label uses bucketed values:

4xx: Client error, most commonly an invalid or revoked API key. See Job Failure Diagnostics > Invalid API key in the Debugging Guide.
5xx: CloudZero API issue. Usually transient; investigate if it persists.
no_response: No response received (network timeout, DNS failure, connection refused). See Network Diagnostics > Cannot Reach CloudZero API in the Debugging Guide.

4. Local storage write failures:

rate(storage_write_failure_total[10m]) > 0

The webhook is receiving admission requests but cannot persist them to its local data store. The resource_type, namespace, resource_name, and action labels identify exactly which resource operations are failing.

Monitoring Webhook Event Processing

This section covers monitoring whether the webhook is actually receiving and processing Kubernetes admission requests.

Why This Matters

It's possible for the webhook to be "running" but not receiving any requests. This can happen due to a misconfigured ValidatingWebhookConfiguration, network policy blocking API server traffic, or TLS certificate issues (see Monitoring Webhook TLS Certificate Health). When the webhook isn't receiving events, all resource metadata is silently lost.

Metrics endpoint: https://<webhook-pod>:8443/metrics

Key Metrics

Metric	Type	What It Means
`czo_webhook_types_total`	Counter	Webhook events processed, labeled by `kind_group`, `kind_version`, `kind_resource`, `operation`.
`function_execution_seconds{function_name=~"executeAdmissionsReviewRequest_.*"}`	Histogram	Server-side processing time per admission request type (Create, Update, Delete).
`function_execution_seconds{function_name="writeDataToStorage"}`	Histogram	Time to persist admission data to local storage.

What to Alert On

1. No pod events received:

rate(czo_webhook_types_total{kind_resource="pod",operation="create"}[1h]) == 0

No pod CREATE events seen in 1 hour. In any active cluster, pods are being created regularly. A zero rate means the webhook is not receiving admission requests.

Triage: See Webhook Not Receiving Admission Requests in the Debugging Guide. Check the ValidatingWebhookConfiguration, CA bundle, and network policies allowing API server to webhook on port 8443. Also see Monitoring Webhook TLS Certificate Health -- TLS certificate mismatch is the most common cause of this symptom.

2. High webhook server-side processing latency:

histogram_quantile(0.99,
  rate(function_execution_seconds_bucket{
    function_name=~"executeAdmissionsReviewRequest_.*",
    error=""
  }[5m])
) > 0.5

Webhook server-side processing p99 exceeding 500ms. This is the processing time inside the webhook; the API server sees this plus network/TLS overhead. In healthy operation, the p99 for admission request processing is typically under 20ms.

Triage: See Performance Diagnostics > Slow Webhook Response Times in the Debugging Guide. Cross-reference with the API server metric apiserver_admission_webhook_admission_duration_seconds to determine whether the latency is from processing or network.

Monitoring Prometheus Agent Scrape Health

This section covers monitoring the internal Prometheus agent that scrapes Kubernetes metrics and feeds them to the collector.

Why This Matters

The CloudZero Agent uses Prometheus in agent mode to scrape Kubernetes metrics (cAdvisor from kubelets, KSM) and remote-write them to the collector. If Prometheus can't discover scrape targets, can't scrape them, or can't remote-write to the collector, no data enters the pipeline.

Metrics endpoint: http://<server-pod>:9090/metrics (the Prometheus process on the server deployment)

Key Metrics

Metric	Type	What It Means
`prometheus_sd_discovered_targets`	Gauge	Targets discovered by service discovery.
`prometheus_target_scrape_pool_targets`	Gauge	Current number of active scrape targets.
`prometheus_remote_storage_samples_failed_total`	Counter	Samples that permanently failed to send via remote-write. These are lost.
`prometheus_remote_storage_samples_pending`	Gauge	Samples queued for remote-write. Growing = collector can't keep up.
`prometheus_remote_storage_queue_highest_sent_timestamp_seconds`	Gauge	Most recent successfully sent timestamp.
`prometheus_remote_storage_queue_highest_timestamp_seconds`	Gauge	Most recent enqueued timestamp.
`prometheus_remote_storage_shards`	Gauge	Current number of remote-write shards.
`prometheus_target_scrapes_sample_out_of_order_total`	Counter	Samples dropped due to out-of-order timestamps.

What to Alert On

1. No scrape targets discovered:

prometheus_sd_discovered_targets == 0

Prometheus has discovered zero scrape targets. Service discovery is broken and no metrics will be collected.

Triage: See Data Pipeline Diagnostics > Some Metrics Missing in the Debugging Guide. Check the agent-server Prometheus configuration and RBAC permissions.

2. Remote-write samples permanently failing:

rate(prometheus_remote_storage_samples_failed_total[10m]) > 0

Prometheus permanently failed to send samples to the collector via remote-write. These samples are lost. Check collector health (see Monitoring Data Pipeline Health) first; if the collector is healthy, check Prometheus logs for remote-write errors.

3. Remote-write lag exceeding 5 minutes:

(
  prometheus_remote_storage_queue_highest_timestamp_seconds
  -
  prometheus_remote_storage_queue_highest_sent_timestamp_seconds
) > 300

The remote-write queue is more than 5 minutes behind. Data is being scraped but not delivered to the collector in a timely manner. Small lag during startup or after a restart is normal. Persistent lag indicates the collector can't keep up, or there's a network issue between Prometheus and the collector.

Triage: See Performance Diagnostics > Data Processing Delays in the Debugging Guide.
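Because a short lag after startup or a restart is expected, a rule for this check benefits from a hold time. A sketch using the expressions from this section unchanged, with hypothetical names and an assumed 15-minute hold:

```yaml
groups:
  - name: cloudzero-prometheus-agent               # hypothetical group name
    rules:
      - alert: CloudZeroRemoteWriteLagging         # hypothetical alert name
        expr: |
          (
            prometheus_remote_storage_queue_highest_timestamp_seconds
            -
            prometheus_remote_storage_queue_highest_sent_timestamp_seconds
          ) > 300
        for: 15m    # assumed; ignores short lag after a restart
      - alert: CloudZeroRemoteWriteSamplesFailing  # hypothetical alert name
        expr: rate(prometheus_remote_storage_samples_failed_total[10m]) > 0
```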

4. Rapid remote-write resharding:

changes(prometheus_remote_storage_shards[5m]) > 10

The Prometheus remote-write subsystem is rapidly oscillating its shard count. This is a symptom of write pressure where the remote-write queue can't maintain a stable throughput rate. Rapid resharding is a leading indicator of imminent memory exhaustion on the server -- the constant allocation and deallocation of shards drives memory usage up.

Triage: Increase server memory limits. For very large clusters, consider enabling federated mode to reduce the number of targets the server must scrape. See Performance Diagnostics > High Memory Usage in the Debugging Guide.

5. Out-of-order sample drops:

rate(prometheus_target_scrapes_sample_out_of_order_total[10m]) > 0

Samples are being dropped because their timestamps are out of order. This can indicate cAdvisor timestamp issues (common with some container runtimes) or scrape interval misconfiguration. Each dropped sample is a gap in the collected data.

Triage: See Data Pipeline Diagnostics > Missing cAdvisor Metrics in the Debugging Guide.

Monitoring Pod and Container Health

This section covers standard Kubernetes monitoring for the CloudZero Agent's pods and containers, including OOM kills, container restarts, and proactive memory pressure detection.

Required Metrics Source: kube-state-metrics

kube-state-metrics provides metrics about pod and container state. Most Kubernetes monitoring stacks (kube-prometheus-stack, etc.) include it by default.

Note: To ensure OOM kill metrics are available, your kube-state-metrics deployment must have the pods collector enabled. If you do not see the kube_pod_container_status_last_terminated_reason metric, check your KSM configuration to ensure the pods collector is active and not explicitly excluded.

# Verify kube-state-metrics is running
kubectl get pods -A | grep kube-state-metrics

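# Verify OOM metrics are exposed (port-forward blocks, so run it in the background;
# adjust the namespace to wherever kube-state-metrics is installed)
kubectl port-forward -n kube-system deployment/kube-state-metrics 8080:8080 &
sleep 2   # give the port-forward a moment to establish
curl -s http://localhost:8080/metrics | grep kube_pod_container_status_last_terminated_reason
kill $!   # stop the background port-forward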

Detecting OOM Kills

OOM kills occur when a container exceeds its memory limit and is terminated by the kernel. This causes service disruption, potential data loss (in-flight requests and unsaved state), and restart delays.

Primary metric (from kube-state-metrics):

kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}

What to alert on:

Current OOMKilled containers:

kube_pod_container_status_last_terminated_reason{
  namespace="<agent-namespace>",
  reason="OOMKilled"
} == 1

OOM kill rate over time:

sum by (namespace, pod, container) (rate(kube_pod_container_status_restarts_total{
  namespace="<agent-namespace>"
}[5m]))
  and on(namespace, pod, container)
  kube_pod_container_status_last_terminated_reason{
    namespace="<agent-namespace>",
    reason="OOMKilled"
  } == 1

Triage: See CrashLoopBackOff Diagnostics > OOMKilled in the Debugging Guide. Increase memory limits for the affected container. See Memory Sizing Guidelines below for recommendations based on cluster size.

Detecting Container Restarts

Any container restart in agent pods:

rate(kube_pod_container_status_restarts_total{
  namespace="<agent-namespace>"
}[15m]) > 0

A single restart during initial startup is normal. Repeated restarts indicate CrashLoopBackOff. Check the exit code: 137 means the container was killed with SIGKILL, most often an OOM kill (increase memory); other codes indicate an application error (check logs).

Triage: See CrashLoopBackOff Diagnostics in the Debugging Guide.

Detecting Unavailable Replicas

kube_deployment_status_replicas_unavailable{
  namespace="<agent-namespace>"
} > 0

This fires when any agent deployment has unavailable replicas. Run kubectl get pods -n <agent-namespace> and follow the General kubectl Workflow in the Debugging Guide.
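The pod health checks in this section could be collected into one rule group, sketched below. The names, the cloudzero namespace, and the hold time are assumptions. The OOM rule additionally requires a recent restart, because kube_pod_container_status_last_terminated_reason keeps reporting OOMKilled until the container terminates for some other reason.

```yaml
groups:
  - name: cloudzero-agent-pods                    # hypothetical group name
    rules:
      - alert: CloudZeroAgentOOMKilled            # hypothetical alert name
        # Last termination was an OOM kill and the container restarted
        # within the past 15 minutes.
        expr: |
          kube_pod_container_status_last_terminated_reason{
            namespace="cloudzero", reason="OOMKilled"
          } == 1
            and on (namespace, pod, container)
          increase(kube_pod_container_status_restarts_total{
            namespace="cloudzero"
          }[15m]) > 0
      - alert: CloudZeroAgentReplicasUnavailable  # hypothetical alert name
        expr: kube_deployment_status_replicas_unavailable{namespace="cloudzero"} > 0
        for: 10m    # assumed; rides out rolling updates
```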

Proactive Memory Pressure Detection

Why Monitor Against Limits?

Monitoring memory usage against limits provides early warning before containers are OOM killed. This is the single most common operational issue observed in production CloudZero Agent deployments across clusters of all sizes.

Memory usage as percentage of limit:

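(
  max by (namespace, pod, container) (
    container_memory_working_set_bytes{
      namespace="<agent-namespace>",
      container!="", container!="POD"
    }
  )
  / on (namespace, pod, container)
  max by (namespace, pod, container) (
    kube_pod_container_resource_limits{
      namespace="<agent-namespace>",
      resource="memory"
    }
  )
) > 0.85

Alert when memory usage exceeds 85% of the limit. This gives you time to increase limits before the container is killed. The aggregation on namespace, pod, and container aligns the cAdvisor and kube-state-metrics series, which otherwise carry different label sets and would not match in a direct division.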

Why container_memory_working_set_bytes? This metric excludes cached data that can be evicted, so it reflects true memory pressure. It is the figure the kubelet uses for memory eviction decisions and the closest available proxy for what counts against the container's limit.

Why Also Monitor Against Requests?

Monitoring against requests provides a different signal: it detects when actual usage has significantly diverged from what was expected at scheduling time, which can indicate undersized requests (scheduling problems), unexpected memory growth, or workload characteristics that have changed.

Memory usage as percentage of request:

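(
  max by (namespace, pod, container) (
    container_memory_working_set_bytes{
      namespace="<agent-namespace>",
      container!="", container!="POD"
    }
  )
  / on (namespace, pod, container)
  max by (namespace, pod, container) (
    kube_pod_container_resource_requests{
      namespace="<agent-namespace>",
      resource="memory"
    }
  )
) * 100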

Memory Sizing Guidelines

Memory usage for CloudZero Agent components scales with cluster size. The following table provides approximate observed ranges to help with capacity planning. Memory limits should be set to at least 1.5x the observed peak for your cluster size tier.

Component	Small (<50 nodes)	Medium (50-200 nodes)	Large (200-600 nodes)	Very Large (600+ nodes)
Server (Prometheus)	200-700 Mi	900-2500 Mi	4000-11000 Mi	14000+ Mi
Aggregator (per replica)	80-200 Mi	200-1100 Mi	1000-1500 Mi	Scale replicas
Webhook (per replica)	10-50 Mi	250-350 Mi	250-650 Mi	250-650 Mi
KSM	20-80 Mi	70-360 Mi	300-650 Mi	2500+ Mi

The Prometheus server is the component most sensitive to cluster size. For clusters above 200 nodes, consider enabling federated mode to distribute the scrape load.
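As a worked example of the 1.5x rule, take a medium cluster whose aggregator peaks at the top of the observed range, 1100 Mi: the limit comes out to 1100 x 1.5 = 1650 Mi. The fragment below is a plain Kubernetes container resources block; where it belongs in your Helm values depends on the chart, and the request value is an assumption rather than a documented recommendation.

```yaml
# Aggregator container, medium cluster (50-200 nodes), observed peak 1100 Mi.
resources:
  requests:
    memory: 1100Mi   # assumed: request sized at the observed peak
  limits:
    memory: 1650Mi   # 1.5 x 1100 Mi
```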

Quick Reference: Priority Summary

Priority	What to Monitor	Key Alert	Common Cause
P0	Webhook admission latency	API server metric p99 > 1s	Webhook unreachable, TLS failure, resource pressure
P0	Server memory vs limit	`container_memory_working_set_bytes / limit > 0.85`	Cluster too large for current memory limits
P0	Webhook receiving zero events	`rate(czo_webhook_types_total[30m]) == 0`	TLS certificate mismatch after rotation
P1	Shipper upload failures	`rate(shipper_run_fail_total[30m]) > 0`	Invalid API key, network policy blocking egress
P1	Aggregator memory vs limit	`container_memory_working_set_bytes / limit > 0.85`	Default 1Gi limit too low for cluster size
P1	OOM kills	`kube_pod_container_status_last_terminated_reason{reason="OOMKilled"}`	Undersized memory limits
P2	Remote-write resharding churn	`changes(prometheus_remote_storage_shards[5m]) > 10`	Write pressure, imminent server OOM
P2	No metrics received	`rate(metrics_received_total[10m]) == 0`	Collector down, Prometheus not sending data
P2	Webhook metadata backlog	`remote_write_backlog_records > 1000`	API connectivity or auth issues
P2	Out-of-order sample drops	`rate(prometheus_target_scrapes_sample_out_of_order_total[10m]) > 0`	cAdvisor timestamp issues
P3	Container restarts	`rate(kube_pod_container_status_restarts_total[15m]) > 0`	Various (check exit code)
