feat: Prometheus metrics collection via OTEL Collector #36

@JacobPEvans-personal

Summary

No metrics are currently collected from Cribl Stream, Cribl Edge, or the OTEL Collector itself. Without metrics, alerting and dashboards are impossible. This issue adds Prometheus as the metrics store, fed by OTEL Collector scraping internal Cribl APIs.

Motivation

  • Cribl Stream exposes internal stats at /api/v1/system/stats (port 9000)
  • Cribl Edge exposes metrics at port 9420
  • OTEL Collector exposes self-metrics at :8888/metrics
  • All three are currently invisible at the infrastructure level

Approach

OTEL Collector changes (k8s/base/otel-collector/configmap.yaml)

Add prometheus receiver scraping:

  • cribl-stream-standalone:9000/api/v1/system/stats (Cribl internal metrics)
  • cribl-edge-standalone:9420/metrics (Edge self-metrics)
  • localhost:8888/metrics (OTEL self-metrics)

Add prometheusremotewrite exporter pointing to Prometheus StatefulSet.

Expose OTEL self-metrics port 8888 in the Service.
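The collector changes above might look roughly like the fragment below. This is a sketch rather than the final config: the job names, the 30s scrape interval, and the remote-write URL (the prometheus Service in the same namespace) are assumptions. It also presumes each target serves Prometheus text exposition format without authentication, which this issue does not confirm for the Cribl endpoints.

```yaml
# Sketch of the collector config embedded in k8s/base/otel-collector/configmap.yaml.
# Job names and the scrape interval are illustrative assumptions.
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: cribl-stream            # hypothetical job name
          scrape_interval: 30s              # assumed; not specified in this issue
          metrics_path: /api/v1/system/stats
          static_configs:
            - targets: ["cribl-stream-standalone:9000"]
        - job_name: cribl-edge              # hypothetical job name
          scrape_interval: 30s
          metrics_path: /metrics
          static_configs:
            - targets: ["cribl-edge-standalone:9420"]
        - job_name: otel-collector-self     # hypothetical job name
          scrape_interval: 30s
          static_configs:
            - targets: ["localhost:8888"]

exporters:
  prometheusremotewrite:
    # Assumes the Prometheus Service is named "prometheus" and accepts
    # remote write on its standard /api/v1/write path.
    endpoint: http://prometheus:9090/api/v1/write

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      exporters: [prometheusremotewrite]
```

The localhost:8888 scrape works from inside the collector pod regardless of the Service. Exposing 8888 through the Service additionally requires the self-metrics endpoint to listen on a non-loopback address, which depends on the collector version and its telemetry settings.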

New k8s/base/prometheus/ directory

  • StatefulSet: prom/prometheus:latest, ~256Mi memory, 10Gi PVC
  • ConfigMap: prometheus.yml with scrape configs and retention (15d)
  • Service: ClusterIP for OTEL remote write, NodePort :30090 for local access
  • NetworkPolicy: ingress from otel-collector on remote-write port (9090), egress to Cribl pods on stats ports
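As a rough illustration of the StatefulSet and NodePort Service listed above, the manifests might start from something like this. The app: prometheus labels, the prometheus-config ConfigMap name, the prometheus-nodeport Service name, and the mount paths are assumptions. Retention is shown as a command-line flag, which is one common way to set it, and the remote-write receiver flag is included because Prometheus does not accept remote writes unless it is enabled.

```yaml
# Sketch only: names, labels, and mount paths are illustrative assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: prometheus
  namespace: monitoring
spec:
  serviceName: prometheus
  replicas: 1
  selector:
    matchLabels:
      app: prometheus
  template:
    metadata:
      labels:
        app: prometheus
    spec:
      containers:
        - name: prometheus
          image: prom/prometheus:latest
          args:
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.tsdb.path=/prometheus
            - --storage.tsdb.retention.time=15d   # 15d retention from this issue
            - --web.enable-remote-write-receiver  # needed for OTEL remote write
          ports:
            - name: web
              containerPort: 9090
          resources:
            requests:
              memory: 256Mi
          volumeMounts:
            - name: config
              mountPath: /etc/prometheus
            - name: data
              mountPath: /prometheus
      volumes:
        - name: config
          configMap:
            name: prometheus-config   # hypothetical ConfigMap name
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
---
# NodePort Service for local access; the ClusterIP Service is omitted here.
apiVersion: v1
kind: Service
metadata:
  name: prometheus-nodeport   # hypothetical name
  namespace: monitoring
spec:
  type: NodePort
  selector:
    app: prometheus
  ports:
    - name: web
      port: 9090
      targetPort: 9090
      nodePort: 30090
```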

NetworkPolicy updates

  • OTEL Collector: add egress to stream:9000, edge:9420
  • Prometheus: ingress from OTEL on 9090
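The two policy changes above could be sketched as follows. The policy names and every app: label are assumptions and need to match the labels the existing manifests actually use. NetworkPolicies are additive, so these are written as separate policies, but the same rules could instead be merged into the existing ones.

```yaml
# Sketch only: policy names and pod labels are illustrative assumptions.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: prometheus-ingress-from-otel
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: prometheus
  policyTypes: ["Ingress"]
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: otel-collector
      ports:
        - protocol: TCP
          port: 9090
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: otel-collector-egress-scrape
  namespace: monitoring
spec:
  podSelector:
    matchLabels:
      app: otel-collector
  policyTypes: ["Egress"]
  egress:
    # Scrape Cribl Stream stats
    - to:
        - podSelector:
            matchLabels:
              app: cribl-stream
      ports:
        - protocol: TCP
          port: 9000
    # Scrape Cribl Edge metrics
    - to:
        - podSelector:
            matchLabels:
              app: cribl-edge
      ports:
        - protocol: TCP
          port: 9420
```

If the collector's egress is otherwise default-deny, it will likely also need egress to Prometheus on 9090 for remote write and to cluster DNS. Neither is listed in this issue, so both are left out of the sketch.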

Acceptance Criteria

  • Running kubectl --context orbstack -n monitoring port-forward svc/prometheus 9090 and opening http://localhost:9090 shows Cribl metrics in the Prometheus UI
  • OTEL self-metrics visible (e.g., otelcol_process_uptime_seconds)
  • Stream outBytes, inBytes, outputDroppedEventsTotal are queryable
  • All pods remain Running after deploy
  • NetworkPolicies still pass make validate

Notes

This is a foundational dependency for:

  • Alertmanager rules (see related issue: Alertmanager with pipeline stall rules)
  • Grafana dashboards (see related issue: Grafana dashboards for pipeline visibility)
  • OTEL Collector self-monitoring pipeline

Implement this before alerting or dashboards.


Observability Roadmap

This is the foundational issue for the monitoring observability stack. The following issues have been consolidated here (2026-04-24):

Implement this first; the above will be reopened when scheduled.

Labels: enhancement (New feature or request)