feat(metrics): add v1 of chronos prometheus metrics #14
Open
aidanhall34 wants to merge 36 commits into
Document Chronos project context, verification commands, and the action-trail expectations future agents should follow.
Verification: not run (docs-only change).
Model-version: GPT-5
Add a mock metrics abstraction that defines Chronos metrics once and records through either a Prometheus client registry or an OTLP metrics provider selected by OTEL_METRICS_EXPORTER. The OTLP path uses the gRPC exporter configuration from standard OTLP environment variables, while the Prometheus path uses the prometheus client crate directly.
Register the mock as an explicit chronos_ex example target and enable the OpenTelemetry metrics features required by the design.
Verification:
- cargo fmt --check
- cargo check --manifest-path /tmp/chronos_prom_otlp_mock_check/Cargo.toml
- cargo check -p chronos_ex --example prom_otlp_mock (blocked: missing system libsasl2 development package)
Model-version: GPT-5
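As a rough illustration of the exporter selection described above, the sketch below maps an OTEL_METRICS_EXPORTER value to a backend. The `MetricsExporter` enum and `select_exporter` function are hypothetical names, and falling back to Prometheus when the variable is unset or empty is an assumption, not confirmed Chronos behaviour.

```rust
use std::env;

/// Backends the metrics facade could record through (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum MetricsExporter {
    Prometheus,
    Otlp,
}

/// Map an OTEL_METRICS_EXPORTER value to a backend.
/// Assumption: an unset or empty value falls back to Prometheus.
fn select_exporter(value: Option<&str>) -> Result<MetricsExporter, String> {
    match value.map(str::trim) {
        None | Some("") | Some("prometheus") => Ok(MetricsExporter::Prometheus),
        Some("otlp") => Ok(MetricsExporter::Otlp),
        Some(other) => Err(format!("unsupported OTEL_METRICS_EXPORTER value: {other}")),
    }
}

fn main() {
    // Read the variable once and pass it in, so the selection logic stays testable.
    let raw = env::var("OTEL_METRICS_EXPORTER").ok();
    match select_exporter(raw.as_deref()) {
        Ok(exporter) => println!("selected exporter: {exporter:?}"),
        Err(message) => eprintln!("{message}"),
    }
}
```

Passing the value in as an argument, instead of reading the environment inside the function, keeps the selection rule easy to unit test.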
Add a metrics.mock recipe that runs the Prometheus/OTLP mock with EXPORTER=prom or EXPORTER=otlp. Use a minimal standalone example package so the mock can run without pulling in Chronos Kafka dependencies and the local libsasl2 development package.
The OTLP mode sets OTEL_METRICS_EXPORTER=otlp and OTEL_EXPORTER_OTLP_PROTOCOL=grpc, while Prometheus mode sets OTEL_METRICS_EXPORTER=prometheus and prints the text exposition.
Verification:
- cargo fmt --check
- make -n metrics.mock EXPORTER=prom
- make -n metrics.mock EXPORTER=otlp
- make metrics.mock EXPORTER=prom
- make metrics.mock EXPORTER=otlp
- make metrics.mock EXPORTER=bad
Model-version: GPT-5
Set up a pinned grafana/otel-lgtm:0.24.1 compose overlay with local Prometheus, OpenTelemetry Collector, and Grafana dashboard provisioning overrides. The Prometheus config keeps the upstream OTLP/resource defaults and scrapes the LGTM services plus Chronos on chronos:9091.
Wire Chronos metrics binding through OTEL_EXPORTER_PROMETHEUS_HOST and OTEL_EXPORTER_PROMETHEUS_PORT, keeping METRICS_PORT as a backward-compatible fallback. Update local integration helpers and docs to use the OpenTelemetry Prometheus exporter variables.
Verification:
- cargo fmt --check
- docker compose -f docker-compose.yml -f dev/docker-compose-lgtm.yaml config
- make lgtm.validate
- sh scripts/pre-commit-checks.sh (fails: host is missing libsasl2 development headers required by sasl2-sys)
Model-version: GPT-5
Configure the LGTM OpenTelemetry Collector override to emit JSON logs and detailed internal metrics via service.telemetry.
Verification:
- make lgtm.validate
Model-version: GPT-5
Add lgtm.up and lgtm.down recipes that operate only on the LGTM service while using the main compose file for the shared chronos network. Update the local observability docs to point at the new start command.
Verification:
- docker compose -f docker-compose.yml -f dev/docker-compose-lgtm.yaml config --services
- make -n lgtm.up
- make lgtm.validate
Model-version: GPT-5
Enable LGTM service logging, route backend stdout and stderr through a JSON-line logging wrapper, and have the embedded OpenTelemetry Collector re-ingest the generated log files with the filelog receiver.
Add an LGTM healthcheck script that checks Grafana, Loki, Tempo, Pyroscope, Prometheus, and the OpenTelemetry Collector readiness endpoints.
Tempo 2.10.3 does not expose a JSON log-format flag, so the wrapper normalizes non-JSON service lines into JSON records while preserving native JSON records from services that support them.
Verification:
- make lgtm.validate
- docker compose -f docker-compose.yml -f dev/docker-compose-lgtm.yaml config
- make lgtm.up
- docker ps --filter name=lgtm --format '{{.Names}} {{.Status}}'
- docker exec lgtm sh -c 'sh /otel-lgtm/chronos-healthcheck.sh'
- docker exec lgtm sh -c 'ls -1 /data/lgtm/logs && for f in /data/lgtm/logs/*.jsonl; do head -n 2 "$f"; done'
- docker exec lgtm sh -c 'curl -sfG http://127.0.0.1:3100/loki/api/v1/query_range --data-urlencode query={service_name="unknown_service"} --data-urlencode limit=1'
Model-version: GPT-5
Derive LGTM log filenames from service names instead of the full service label that includes component versions. Map the OpenTelemetry Collector label to otelcol.jsonl for a stable service-specific file name.
Verification:
- make lgtm.validate
- docker compose -f docker-compose.yml -f dev/docker-compose-lgtm.yaml up -d --force-recreate lgtm
- docker exec lgtm sh -c 'ls -1 /data/lgtm/logs'
- docker ps --filter name=lgtm --format '{{.Names}} {{.Status}}'
Model-version: GPT-5
Make the OTLP metrics mock explicit about the LGTM gRPC endpoint and service resource metadata so make metrics.mock EXPORTER=otlp sends to the local LGTM collector instead of relying on older SDK endpoint defaults. Flush the metrics provider before shutdown so short-lived mock runs export their points.
Route LGTM file-ingested logs through a transform processor that fills a missing resource service.name from log.file.name without the .jsonl extension, preserving records that already carry a service name.
Verification:
- cargo fmt --check
- cargo check -p prom_otlp_mock_runner
- make lgtm.validate
- make metrics.mock EXPORTER=otlp
- docker exec lgtm sh -c 'curl -sf http://127.0.0.1:3100/loki/api/v1/label/service_name/values'
Note: direct curl to localhost:4318/9090 from this sandbox network namespace fails even while Docker reports LGTM ports published; Prometheus/Loki checks were run from inside the LGTM container.
Model-version: GPT-5
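The service.name fallback rule above lives in the collector's transform processor, not in Rust. Purely to make the rule concrete, here is a minimal sketch of the same logic; `resolve_service_name` is a hypothetical helper, and treating an empty existing name as missing is an assumption.

```rust
/// Return the service name for a file-ingested log record.
/// An existing non-empty name is preserved; otherwise the name is derived
/// from the log file name with the `.jsonl` extension removed.
/// Assumption: an empty existing name counts as missing.
fn resolve_service_name(existing: Option<&str>, log_file_name: &str) -> String {
    match existing {
        Some(name) if !name.is_empty() => name.to_string(),
        _ => log_file_name
            .strip_suffix(".jsonl")
            .unwrap_or(log_file_name)
            .to_string(),
    }
}

fn main() {
    // A record without a service name picks it up from the file name.
    println!("{}", resolve_service_name(None, "otelcol.jsonl"));
    // A record that already carries a service name is left untouched.
    println!("{}", resolve_service_name(Some("chronos"), "otelcol.jsonl"));
}
```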
Add an example OpenTelemetry Weaver registry for Chronos metrics, a Rust template skeleton, and a checked-in generated definition example that follows the Prometheus/OTLP abstraction in examples/prom_otlp_mock.rs.
The template has been verified with the otel/weaver:v0.23.0 Docker image. The generated Rust example is checked in after rustfmt so future work has a concrete target for integrating generated metric definitions.
Verification:
- docker run --rm otel/weaver:v0.23.0 --version
- docker run --rm -v /home/ah34/work/opensource/chronos:/work -w /work otel/weaver:v0.23.0 registry check -r examples/weaver/registry
- docker run --rm -v /home/ah34/work/opensource/chronos:/work -v /tmp/chronos-weaver-out:/out -w /work otel/weaver:v0.23.0 registry generate -r examples/weaver/registry --templates examples/weaver/templates rust /out
- rustfmt --check examples/weaver/generated/chronos_metric_definitions.rs
- rustfmt --config-path rustfmt.toml /tmp/chronos-weaver-out/chronos_metric_definitions.rs && diff -u examples/weaver/generated/chronos_metric_definitions.rs /tmp/chronos-weaver-out/chronos_metric_definitions.rs
- python3 -c 'import yaml; yaml.safe_load(open("examples/weaver/registry/chronos/metrics.yaml")); yaml.safe_load(open("examples/weaver/templates/registry/rust/weaver.yaml")); print("yaml ok")'
- git diff --cached --check
Model-version: GPT-5
Keep the Prometheus/OTLP metrics mock running until interrupted and record counter/histogram samples on every cycle. Use OpenTelemetry environment variables and messaging semantic convention names for emitted metric attributes, with Prometheus-safe rendered names where required.
Add Docker-backed Weaver Make recipes for registry checks, Rust generation, markdown generation, JSON schema generation, and live-check. Check in the generated markdown and resolved-registry schema outputs.
Verification:
- cargo check --package prom_otlp_mock_runner
- cargo fmt --package prom_otlp_mock_runner -- --check
- rustfmt --edition 2021 --check examples/prom_otlp_mock.rs examples/weaver/generated/chronos_metric_definitions.rs
- python3 -c 'import yaml; yaml.safe_load(open("examples/weaver/registry/chronos/metrics.yaml")); yaml.safe_load(open("examples/weaver/templates/registry/rust/weaver.yaml")); yaml.safe_load(open("examples/weaver/templates/registry/markdown/weaver.yaml")); print("yaml ok")'
- make weaver.generate
- make weaver.check
- make weaver.live-check
- Prometheus mock smoke test on http://127.0.0.1:19092/metrics
- git diff --cached --check
Model-version: GPT-5
Capture the current Chronos Prometheus metric surface in chronos_bin/src/metrics/spec.yaml, cross-referenced with issue kindredgroup#12 and the Prometheus/OTLP abstraction sketched in examples/prom_otlp_mock.rs.
Add an unused generated Rust definition table with metric IDs, Prometheus names, OTLP names, labels, buckets, and pre-warm label values. The generated module is intentionally not imported yet, so runtime metrics remain hand-written.
Verification:
- cargo fmt -- --check
- sh scripts/pre-commit-checks.sh (fails: missing libsasl2 development package; cargo check also reports time 0.3.30 type inference error before tests run)
Model-version: GPT-5
Make the standard lint and unit-test targets run with warnings denied, remove the obsolete clippy crate dependency, and fix the existing Rust and clippy warnings that blocked that policy. Update the lockfile for the Rust 1.94-compatible time dependency set and refreshed native Kafka build dependencies.
Add a pre-commit GitHub Actions workflow for pushes to non-main branches, plus an act-backed Make target to run that workflow locally. Add .github/config.json and a repo.config.apply target for applying repository settings, Actions permissions, and main branch protection through gh.
Verification:
- sh scripts/pre-commit-checks.sh
- make workflow.pre-commit.act
Model-version: GPT-5
Wire Chronos runtime metrics through the Weaver-generated metric definition table and add Prometheus/OTLP exporter selection via OTEL_METRICS_EXPORTER. Prometheus metrics now use the chronos namespace while preserving the issue kindredgroup#12 metric dimensions and buckets.
Generate Weaver Rust definitions, Markdown metric docs, and resolved registry JSON schema into chronos_bin/src/metrics/generated as part of make build.
Verification:
- make weaver.generate
- make lint
- env CARGO_HUSKY_DONT_INSTALL_HOOKS=true cargo test -p chronos_bin
- make build
- sh scripts/pre-commit-checks.sh
- make integration
Model-version: GPT-5
Add reusable GitHub Actions for pre-commit, unit tests, Trivy scanning, static binary builds, container builds, and SBOM generation. CI.yaml now runs the required non-main branch checks in parallel and gates merge protection through a final CI aggregate job.
Add act recipes under dev/makefiles/act.mk for running the central CI workflow, individual jobs, and SBOM workflow inputs locally. Keep sbom reusable/manual only so it is not called from CI yet.
Verification:
- git diff --check
- python3 YAML parse for .github/workflows
- make -f dev/makefiles/act.mk -n act.ci act.ci.job act.sbom.release
- docker run --rm -v /home/ah34/work/opensource/chronos:/repo -w /repo rhysd/actionlint:latest
- sh scripts/pre-commit-checks.sh
Model-version: GPT-5
Switch the slim image away from fully static scratch output because the current rdkafka SASL/OpenSSL feature set pulls in system libraries that are not practical to link statically on Alpine. Keep the image small by using an Alpine runtime with only the required shared libraries, add bash for librdkafka configure, and disable Rust musl crt-static during the build.
Update the composable binary workflow to extract the Alpine-built binary path and add a .dockerignore so image builds do not send target or git state as context.
Verification:
- docker build -f Dockerfile.chronos-slim -t chronos-slim:test .
- docker run --rm --entrypoint /bin/sh chronos-slim:test -lc 'ldd /chronos'
- sh scripts/pre-commit-checks.sh
Model-version: GPT-5
Move Docker Compose files under dev/docker-compose, split infra and observability backends, and have make up start Chronos with PostgreSQL, Kafka, and Jaeger/OpenTelemetry by default. Add the LGTM backend as an alternate make up lgtm path and move LGTM configuration under dev/lgtm.
Split the root Makefile into logical includes under dev/makefiles while keeping legacy target aliases where useful. Replace scripts/pre-commit-checks.sh with the make pre-commit target and update CI and agent documentation references.
Verification:
- make setup
- make help
- make docker.config
- make docker.config BACKEND=lgtm
- make pre-commit
- make build
- make lgtm.validate
Model-version: GPT-5
Move production Weaver registry and templates under dev/weaver/production so application generation no longer depends on examples/weaver. Keep examples/weaver as the explicit example input set.
Route production Weaver docs to docs/chronos_metrics.md and the resolved registry schema to docs/schema/resolved-registry.schema.json. Make build depend on weaver.production.generate; example artifacts now require an explicit make weaver.example.generate call.
Verification:
- make build
- make weaver.example.generate
- make pre-commit
Model-version: GPT-5
Move Dockerfiles into docker/ and update Docker Compose build paths to reference the new location. Add make docker.build plus per-image recipes for the Chronos and PostgreSQL migration images.
Fix the Docker runtime stage user reference while moving the files so the images build without Dockerfile warnings for the stage names or undefined USER.
Verification:
- make docker.config
- make docker.config BACKEND=lgtm
- make docker.build
- make pre-commit
Model-version: GPT-5
Replace separate production and example Weaver recipes with shared weaver.check and weaver.generate targets selected by WEAVER_TARGET. The default target remains production, while example artifacts are generated with WEAVER_TARGET=example.
Keep make build pinned to production generation by invoking make weaver.generate WEAVER_TARGET=production before cargo build.
Verification:
- make -n build
- make -n weaver.generate
- make -n weaver.generate WEAVER_TARGET=example
- make weaver.generate
- make weaver.generate WEAVER_TARGET=example
- make build
- make pre-commit
Model-version: GPT-5
Add a custom k6 image with xk6-kafka, Make targets for contract and load workloads, and k6 scripts that publish results through OTLP. The contract test exercises immediate publish, delayed DB publish, invalid delayed payload, and missing-key immediate failure paths. The load test models the README throughput and jitter SLOs with configurable rate/duration inputs.
Route Chronos logs to the LGTM filelog directory when the LGTM compose overlay is active, and send Chronos metrics to LGTM over OTLP while leaving traces on the existing HTTP OTLP path.
Verification:
- make k6.build
- docker run --rm chronos-k6:1.7.1 version
- k6 inspect for dev/k6/contract.js and dev/k6/load.js through the custom image
- make up lgtm
- make k6.contract
- K6_LOAD_RATE=10 K6_LOAD_DURATION=5s K6_LOAD_CONSUME_DURATION=15s K6_LOAD_EXPECTED_MESSAGES=50 make k6.load (expected SLO failure: p99.9 jitter exceeded 500ms on the local stack)
- make docker.config BACKEND=lgtm
- make pre-commit
Model-version: GPT-5
Set OTEL_SERVICE_NAME to chronos during Chronos startup when callers have not provided a value, before tracing and metrics exporters initialize. Also make the Jaeger helper use the effective service name instead of formatting the Result returned by env lookup.
Verification:
- make pre-commit
Model-version: GPT-5
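A minimal sketch of the service-name fallback described above, assuming a hypothetical `effective_service_name` helper. It resolves the lookup Result to a plain string before anything formats it, which is the kind of bug the commit mentions. Treating a blank value as unset is an assumption.

```rust
use std::env::{self, VarError};

/// Default service name used when the caller provides none.
const DEFAULT_SERVICE_NAME: &str = "chronos";

/// Resolve the effective service name from an environment lookup.
/// Assumption: a blank value is treated the same as an unset variable.
fn effective_service_name(lookup: Result<String, VarError>) -> String {
    match lookup {
        Ok(name) if !name.trim().is_empty() => name,
        _ => DEFAULT_SERVICE_NAME.to_string(),
    }
}

fn main() {
    // Formatting the raw Result would render as Ok("...") or Err(NotPresent);
    // resolving it first yields a usable name.
    let name = effective_service_name(env::var("OTEL_SERVICE_NAME"));
    println!("service name: {name}");
}
```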
Change the default k6 load profile to 100 messages/sec while keeping the production-scale 1,000 messages/sec profile available through K6_FULL_LOAD=true make k6.load. Document that the full load profile depends on k6, Docker host, Kafka, PostgreSQL, and Chronos capacity and may require production-like infrastructure to pass.
Verification:
- node --check dev/k6/load.js
- make -n k6.load
- make -n k6.load K6_FULL_LOAD=true
- k6 inspect for dev/k6/load.js through the custom image
- make pre-commit
Model-version: GPT-5
Change the k6 load test to calculate scheduling jitter from the Kafka output record timestamp minus the requested scheduled timestamp. This removes k6 drain/consume timing from the jitter measurement and keeps the original input publish timestamp as separate payload data.
Add a timestamp error counter threshold so missing or unparsable output record timestamps fail visibly.
Verification:
- node --check dev/k6/load.js
- k6 inspect for dev/k6/load.js through the custom image
- make pre-commit
Model-version: GPT-5
Record an explicit Kafka output record timestamp when Chronos publishes a delayed message and use that timestamp for chronos.message.jitter instead of measuring after the Kafka delivery future completes. This removes broker acknowledgement/backpressure wait time from the jitter observation.
Add focused tests proving jitter milliseconds are converted to seconds and clock skew floors at zero.
Verification:
- make pre-commit
Model-version: GPT-5
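The jitter rule above (output record timestamp minus scheduled deadline, reported in seconds, never negative) can be sketched as follows. The function name and the epoch-millisecond inputs are assumptions for illustration; the real Chronos code may represent timestamps differently.

```rust
/// Compute scheduling jitter in seconds from two epoch-millisecond timestamps.
/// A record timestamp earlier than the deadline (clock skew) floors at zero.
fn jitter_seconds(scheduled_ms: i64, record_ms: i64) -> f64 {
    let diff_ms = record_ms.saturating_sub(scheduled_ms).max(0);
    diff_ms as f64 / 1000.0
}

fn main() {
    // Published 250 ms after the deadline.
    println!("late publish: {} s", jitter_seconds(1_000, 1_250));
    // Record timestamp before the deadline because of clock skew.
    println!("skewed clock: {} s", jitter_seconds(2_000, 1_500));
}
```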
Apply Chronos' generated histogram bucket boundaries to the OTLP metrics exporter so LGTM does not fall back to the SDK default histogram buckets. This prevents low-latency Chronos histograms from quantiling into the broad default 0..5s bucket and reporting p95 values around 4.75s.
Update the k6 load workload to publish a default mix of immediate and delayed messages. Immediate messages exercise the receiver-to-Kafka path, while delayed messages enter PostgreSQL and exercise the processor path. Keep the scheduling jitter trend scoped to delayed messages and record immediate delivery delay separately.
Verification:
- node --check dev/k6/load.js
- make -n k6.load
- make -n k6.load K6_FULL_LOAD=true
- cargo test -p chronos_bin metrics::registry::tests::otlp_histograms_use_generated_second_boundaries
- make pre-commit
- docker run --rm -v /home/ah34/work/opensource/chronos/dev/k6:/scripts:ro chronos-k6:1.7.1 inspect /scripts/load.js
Model-version: GPT-5
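The 4.75s figure is what linear interpolation inside one wide bucket would produce: if every observation lands in a 0 to 5 second bucket, the p95 estimate is 0 + 5 × 0.95 = 4.75. The sketch below is a simplified quantile estimator in that style, not the Prometheus or Chronos implementation, and the finer bucket boundaries in it are hypothetical.

```rust
/// Estimate a quantile from cumulative bucket counts by interpolating
/// linearly inside the bucket that contains the target rank.
/// `upper_bounds` must be ascending; the first bucket is assumed to start at 0.
fn histogram_quantile(q: f64, upper_bounds: &[f64], cumulative_counts: &[u64]) -> Option<f64> {
    let total = *cumulative_counts.last()?;
    if total == 0 || upper_bounds.len() != cumulative_counts.len() {
        return None;
    }
    let rank = q * total as f64;
    let mut prev_bound = 0.0;
    let mut prev_count = 0u64;
    for (&bound, &count) in upper_bounds.iter().zip(cumulative_counts) {
        if count as f64 >= rank {
            let in_bucket = (count - prev_count) as f64;
            if in_bucket == 0.0 {
                return Some(prev_bound);
            }
            let fraction = (rank - prev_count as f64) / in_bucket;
            return Some(prev_bound + (bound - prev_bound) * fraction);
        }
        prev_bound = bound;
        prev_count = count;
    }
    None
}

fn main() {
    // 100 fast observations that all fall into a single wide 0..5s bucket.
    let coarse = histogram_quantile(0.95, &[5.0, 10.0], &[100, 100]);
    // The same observations split by finer (hypothetical) boundaries.
    let fine = histogram_quantile(0.95, &[0.005, 0.01, 5.0], &[60, 100, 100]);
    println!("coarse p95: {coarse:?}, fine p95: {fine:?}");
}
```

With the finer boundaries the estimate stays below 10 ms, which is why exporting the generated boundaries matters for low-latency histograms.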
Change the k6 load workload default to send 10% of messages with already-expired deadlines and 90% with future deadlines. Use the global k6 scenario iteration counter to spread immediate messages through the run instead of grouping them by VU-local iteration.
Add tagged published and consumed thresholds for immediate and delayed paths so the load test verifies that both receiver-to-Kafka and PostgreSQL-backed processor paths are exercised.
Verification:
- node --check dev/k6/load.js
- docker run --rm -v /home/ah34/work/opensource/chronos/dev/k6:/scripts:ro chronos-k6:1.7.1 inspect /scripts/load.js
- make -n k6.load
- make -n k6.load K6_FULL_LOAD=true
- make pre-commit
Model-version: GPT-5
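The actual workload is a k6 JavaScript script; the Rust sketch below only illustrates one way a global iteration counter can spread a fixed percentage of immediate messages evenly through a run. The `is_immediate` function and its modulo rule are assumptions, not the logic in dev/k6/load.js.

```rust
/// Decide whether the message for a given global iteration should carry an
/// already-expired deadline. With `immediate_percent = 10`, every tenth
/// iteration is immediate, so immediate messages are spread through the run.
fn is_immediate(global_iteration: u64, immediate_percent: u64) -> bool {
    let percent = immediate_percent.min(100);
    ((global_iteration % 100) * percent) % 100 < percent
}

fn main() {
    let total = 1_000u64;
    let immediate = (0..total).filter(|&i| is_immediate(i, 10)).count();
    println!("{immediate} of {total} messages are immediate");
}
```

Keying the decision on a counter shared across the scenario, instead of a per-VU counter, avoids several VUs all sending their immediate messages at the same point in the run.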
Make the production Weaver registry use one canonical metric name and label set for both OTEL and Prometheus. Remove duplicated prometheus_name and prometheus_label_names annotations, derive Prometheus names and labels by normalizing canonical identifiers, and regenerate the Rust definitions and metrics docs.
Update the runtime registry to use canonical metric names for OTLP, normalized names for Prometheus exposition, and cumulative temporality for histograms. Remove the older unused chronos_bin/src/metrics/spec.yaml so the production Weaver registry is the single metric definition source.
Verification:
- make weaver.generate
- cargo test -p chronos_bin metrics::registry
- make weaver.check
- make pre-commit
Model-version: GPT-5
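To illustrate deriving Prometheus identifiers from canonical ones, the sketch below replaces characters that Prometheus does not accept with underscores. The function names are hypothetical, and the real normalization in the generated registry may differ (for example, in how units or suffixes are handled).

```rust
/// Replace every character outside the allowed set with an underscore and
/// guard against a leading digit. Metric names may also contain ':'.
fn normalize(canonical: &str, allow_colon: bool) -> String {
    let mut name: String = canonical
        .chars()
        .map(|c| {
            if c.is_ascii_alphanumeric() || c == '_' || (allow_colon && c == ':') {
                c
            } else {
                '_'
            }
        })
        .collect();
    if name.chars().next().map_or(false, |c| c.is_ascii_digit()) {
        name.insert(0, '_');
    }
    name
}

/// Normalize a canonical metric name for Prometheus exposition.
fn prometheus_metric_name(canonical: &str) -> String {
    normalize(canonical, true)
}

/// Normalize a canonical attribute name into a Prometheus label name.
fn prometheus_label_name(canonical: &str) -> String {
    normalize(canonical, false)
}

fn main() {
    println!("{}", prometheus_metric_name("chronos.message.jitter"));
    println!("{}", prometheus_label_name("messaging.destination.name"));
}
```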
Add cAdvisor, postgres_exporter, KMinion, and sql_exporter to the LGTM compose overlay, with health checks and dependency ordering. Configure Prometheus scrape jobs for those exporters and add sql_exporter configuration for the chronos_rows hanger table row-count metric.
Limit the local Chronos compose container to 2 CPUs and 2 GiB of memory, and limit k6 runner containers launched from Make targets to 1 CPU and 1 GiB of memory. Document the new local LGTM exporters and limits in How-to.md.
Verification:
- docker compose --project-name chronos -f dev/docker-compose/compose.yaml -f dev/docker-compose/lgtm.yaml config
- make lgtm.validate
- docker run --rm -v /home/ah34/work/opensource/chronos/dev/lgtm/sql_exporter.yaml:/etc/sql_exporter/sql_exporter.yaml:ro burningalchemist/sql_exporter:0.18.3 --config.file=/etc/sql_exporter/sql_exporter.yaml --config.check
- docker compose --project-name chronos -f dev/docker-compose/compose.yaml -f dev/docker-compose/lgtm.yaml up -d --build postgres kafka chronos-pg-migrations postgres-exporter sql-exporter kminion cadvisor lgtm
- docker exec lgtm curl -sf 'http://127.0.0.1:9090/api/v1/query?query=chronos_rows'
- docker compose --project-name chronos -f dev/docker-compose/compose.yaml -f dev/docker-compose/lgtm.yaml down
- make pre-commit
Model-version: GPT-5
Smoke test every Make recipe with a 10 second timeout and fix immediate target-level failures. Set the default Make goal to help, make withenv deterministic without RECIPE, load .env for local app/database recipes, and let dev.run fall back to a single cargo run when cargo-watch is unavailable.
Repair supporting targets found by the smoke pass: add the missing coverage-report script with a cargo-llvm-cov path and raw LLVM coverage fallback, correct the example OTLP endpoint wiring, fix the act artifact server address, use the repository's master branch in .github/config.json, and reference Trivy actions with the v-prefixed tag.
Verification:
- timeout 10s make <every discovered Make target>; results recorded under /tmp/chronos-make-smoke and /tmp/chronos-make-smoke-escalated
- timeout -k 2s 10s make withenv
- timeout -k 2s 10s make repo.config.apply
- timeout -k 2s 10s make act.scan
- timeout -k 2s 10s make act.sbom
- make lgtm.validate
- make pre-commit
Model-version: GPT-5
Add a reusable GitHub Actions workflow that installs the pinned Rust toolchain and system dependencies, then runs make weaver.live-check. Include the workflow in the top-level CI fan-out so live Weaver validation is part of CI tests.
Update the OTLP metrics mock to use the production ChronosMetrics facade instead of a stale standalone metrics list. This keeps live-check samples aligned with the production Weaver registry and removes duplicate mock metric definitions.
Verification:
- make weaver.live-check
- make weaver.check
- rg '"highest_advice_level": "violation"|"level": "violation"' /tmp/chronos-weaver-live-check/live_check.json || true
- make pre-commit
Model-version: GPT-5
Point container build workflows at the Dockerfiles under docker/, make the Weaver live-check output directory writable for the container on hosted runners, and run Trivy through its pinned container image to avoid the failing action-side installer.
Verification:
- make weaver.live-check
- docker build --target builder -f docker/Dockerfile.chronos-slim -t chronos-binary-builder:ci-smoke .
- docker build -f docker/Dockerfile.chronos -t chronos:ci-smoke .
- docker build -f docker/Dockerfile.chronos-slim -t chronos-scratch:ci-smoke .
- make pre-commit
- cargo build --release -p chronos_bin
- docker run --rm -v "/home/ah34/work/opensource/chronos:/work:ro" aquasec/trivy:0.64.1 fs --scanners vuln --severity CRITICAL,HIGH --ignore-unfixed --exit-code 1 /work/target/release/chronos
Model-version: GPT-5
While looking at the telemetry I found an "ultra span" bug:


Resolves issue:
#12