postgres controllers metrics #1811

Draft
limak9182 wants to merge 2 commits into feature/database-controllers from feature/postgres-controllers-metrics

limak9182 commented Apr 2, 2026

Description

Adds comprehensive Prometheus metrics for the PostgreSQL controllers using a hexagonal
(ports & adapters) pattern — the domain code depends only on a Recorder interface, never
on Prometheus directly.

New package: pkg/postgresql/metrics/

  • ports.go: Recorder interface + typed constants for all label values (controller names,
    result labels, error classes, action names, resource kinds). Compile-time checked, no magic
    strings.
  • prometheus.go: PrometheusRecorder adapter with 13 metric families under the
    splunk_operator_postgres_ prefix, registered against the controller-runtime metrics registry.
  • noop.go: NoopRecorder for unit tests.
  • collector.go: FleetCollector with per-resource-type rate limiting (2s) to recompute
    fleet-state gauges from the informer cache after each reconcile.

Metrics emitted (13 families):

Metric | Type | Description
reconcile_total | Counter | Reconcile attempts by controller and result
reconcile_duration_seconds | Histogram | End-to-end reconcile latency (p50/p99)
reconcile_errors_total | Counter | Errors by class (not_found, conflict, validation, unknown)
reconcile_requeues_total | Counter | Requeues by reason
validation_failures_total | Counter | Config/validation failures by reason
clusters | Gauge | Clusters by phase and pooler status
databases | Gauge | Databases by phase
managed_users | Gauge | User counts by state (desired/reconciled/pending/failed)
user_actions_total | Counter | User-management actions (secret, role, privilege, etc.)
poolers | Gauge | PgBouncer poolers by type and state
pooler_instances | Gauge | Pooler instance count
finalizer_operations_total | Counter | Finalizer success/failure
owned_resource_operations_total | Counter | CRUD on owned resources (Secret, Cluster, Pooler, ConfigMap)

Three-layer collection:

  1. Controller shell — timing, outcome classification
  2. Fleet collector — gauges from observed CR state (rate-limited)
  3. Explicit rc.Metrics.* calls in service code — next to existing event emissions

Design decisions:

  • No reconcile-level metrics: controller-runtime handles these for free
    (controller_runtime_reconcile_total{controller=~"postgresCluster|postgresdatabase"})
  • Low-cardinality labels only (controller, result, phase, reason) — no
    per-resource name/namespace labels, following Prometheus best practices
  • Recorder interface enables testability (NoopRecorder) and adapter swappability
  • All label values are typed constants in ports.go — typos become compile errors
  • Existing pkg/splunk/client/metrics/ is untouched

Key Changes

Highlight the updates in specific files

Testing and Verification

Setting up Grafana + Prometheus on KIND

1. Install the monitoring stack

kubectl create namespace monitoring

# Add helm repos
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update

# Install kube-prometheus-stack
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=admin \
  --set alertmanager.enabled=false \
  --set kubeStateMetrics.enabled=false \
  --set nodeExporter.enabled=false

2. Grant Prometheus access to scrape the operator metrics

# Create RBAC for the Prometheus SA to read /metrics
kubectl apply -f - <<EOF
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: prometheus-splunk-operator-metrics
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: splunk-operator-metrics-reader
subjects:
- kind: ServiceAccount
  name: prometheus
  namespace: monitoring
EOF

3. Create a ServiceMonitor

kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: splunk-operator-postgres
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
    - splunk-operator
  selector:
    matchLabels:
      control-plane: controller-manager
  endpoints:
  - port: metric
    path: /metrics
    interval: 5s
    bearerTokenFile: /var/run/secrets/kubernetes.io/serviceaccount/token
EOF

4. Access Grafana

kubectl port-forward svc/kube-prometheus-grafana -n monitoring 3000:80

Open http://localhost:3000 — login: admin / admin

The Prometheus datasource is auto-configured. Query any metric with the splunk_operator_postgres_ prefix.

5. Example PromQL queries

# Reconcile rate by controller
sum by (controller) (rate(splunk_operator_postgres_reconcile_total[5m]))

# p99 latency per controller
histogram_quantile(0.99, sum by (controller, le) (rate(splunk_operator_postgres_reconcile_duration_seconds_bucket[5m])))

# Databases by phase
splunk_operator_postgres_databases

# Error breakdown
rate(splunk_operator_postgres_reconcile_errors_total[5m])

Related Issues

Jira tickets, GitHub issues, Support tickets...

PR Checklist

  • Code changes adhere to the project's coding standards.
  • Relevant unit and integration tests are included.
  • Documentation has been updated accordingly.
  • All tests pass locally.
  • The PR description follows the project's guidelines.

github-actions bot commented Apr 2, 2026

CLA Assistant Lite bot:
Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contribution License Agreement before we can accept your contribution. You can sign the CLA by just posting a Pull Request Comment with the exact sentence copied from below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request

limak9182 changed the title from "metrics" to "postgres controllers metrics" Apr 2, 2026