Swap Prometheus for Cortex, wire otel-demo metrics, and pre-can alerting #226

lezzago wants to merge 5 commits into opensearch-project:main

Conversation
Replaces vanilla Prometheus with Cortex to get the full Prometheus HTTP
API surface (query + Ruler + Alertmanager), wires otel-demo metrics
end-to-end, surfaces the Alert Manager UI in OpenSearch Dashboards, and
ships pre-canned alerting rules, monitors, and routing.
Backend swap
- docker-compose: run cortexproject/cortex:v1.18.1 under service name
`prometheus` so PROMETHEUS_HOST/PORT continue to work everywhere.
Add Alertmanager (prom/alertmanager:v0.27.0) as a sibling service,
always running (routing is a harmless no-op when the otel-demo overlay
is off).
- cortex.yaml: single-binary dev config with filesystem-backed ruler
storage. Per-tenant limits widened so span-derived RED metrics don't
blow past defaults; max_label_names_per_series raised to 50 because
blanket resource_to_telemetry_conversion on JVM/.NET/Node.js services
pushes label counts past Cortex's default of 30.
- OTel collector: point the metrics pipeline at Cortex's /api/v1/push
remote-write endpoint with resource_to_telemetry_conversion so every
sample lands with service_name. Add a prometheus/self scrape of
localhost:8888 so otelcol_* series reach Cortex (the stack rules
depend on this). Add an envoy scrape so ingress HTTP RPS/latency is
visible for otel-demo.
- data-prepper: split the service-map pipeline so the Cortex branch
strips per-event randomKey UUIDs before remote-write (otel_apm_service_map's
grouping by telemetry.sdk.language is preserved so multi-language
emissions don't collide on the same timestamp).
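The service aliasing described above can be sketched as a compose fragment (a sketch only; volume names, ports, and the command line are illustrative assumptions, not the shipped file):

```yaml
# docker-compose.yml (sketch): Cortex keeps the `prometheus` service name so
# existing PROMETHEUS_HOST/PORT references resolve unchanged.
services:
  prometheus:
    image: cortexproject/cortex:v1.18.1
    command: ["-config.file=/etc/cortex/cortex.yaml", "-target=all"]
    volumes:
      - ./cortex.yaml:/etc/cortex/cortex.yaml:ro
      - cortex-data:/data
  alertmanager:
    image: prom/alertmanager:v0.27.0   # sibling service, always running
    volumes:
      - alertmanager-data:/alertmanager  # silences/state persist here
volumes:
  cortex-data:
  alertmanager-data:
```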
OpenSearch Dashboards integration
- Create the Prometheus datasource with Cortex-correct URIs:
prometheus.uri uses /prometheus, prometheus.ruler.uri is the
unprefixed Cortex root, alertmanager.uri targets the new service.
- Turn on observability.alertManager.enabled so the Observability
plugin surfaces the Alert Manager UI.
Pre-canned alerting
- Alertmanager template: catch-all webhook that indexes alerts into
OpenSearch, plus demo-match routes and dummy Slack/email/PagerDuty
receivers as integration-shape examples. Silences/state persist in a
named volume.
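The catch-all-plus-demo-routes shape described above might look like this (receiver names, matchers, and the webhook URL are illustrative assumptions, not the shipped template):

```yaml
# alertmanager.yml (sketch)
route:
  receiver: opensearch-webhook        # catch-all: index every alert
  routes:
    - matchers: ['namespace="otel_demo"']
      receiver: demo-slack            # dummy integration-shape example
receivers:
  - name: opensearch-webhook
    webhook_configs:
      - url: http://alert-indexer:8080/alerts   # hypothetical indexing endpoint
  - name: demo-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/T000/B000/XXXX  # dummy
        channel: '#demo-alerts'
```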
- Cortex rules:
- rules-stack/stack-alerts.yml (always loaded): scrape-target down,
collector export failures, high memory, queue near capacity.
- rules-otel-demo/otel-demo-alerts.yml (loaded with the overlay): RED
alerts built on span-derived latency_seconds_* since most demo
services don't emit their own RED metrics.
- cortex-rules-init container upserts every group via POST
/api/v1/rules/{namespace} with a retry budget on Cortex readiness
and a non-zero exit on any failure, so rule edits take effect on
re-run and partial failures don't go unnoticed.
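The upsert-with-retry-budget behavior can be sketched as follows (a sketch only: the function name, the injected `post`/`ready` callables, and the status handling are illustrative, not the actual init-cortex-rules.py):

```python
import time

def upsert_groups(post, groups, namespace, ready, retries=30, delay=1.0):
    """Wait for Cortex readiness within a retry budget, then POST every
    rule group; return the failure count so the caller can exit non-zero
    if any upsert did not land."""
    for _ in range(retries):
        if ready():
            break
        time.sleep(delay)
    else:
        raise RuntimeError("Cortex never became ready within the retry budget")
    failures = 0
    for group in groups:
        # POST /api/v1/rules/{namespace} creates or replaces the group,
        # so re-running after a rule edit always takes effect.
        status = post(f"/api/v1/rules/{namespace}", group)
        if status not in (200, 201, 202):
            failures += 1
    return failures
```

Injecting `post` and `ready` keeps the readiness/retry logic testable without a live Cortex.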
- OpenSearch monitors:
- init-stack-monitors.py (always): cluster_metrics_monitor for
cluster health red. Red-only so single-node yellow doesn't flap.
Init container has required: true on opensearch since the script
hardcodes https://opensearch:9200.
- init-otel-demo-monitors.py (with overlay): 5 query-level monitors
on checkout/payment/cart/frontend traces and logs.
Demo propagation
- .env + otel-demo compose: enable otel-demo by default, set
OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER=otlp globally, and propagate
them into every demo service so Node.js/Python SDKs actually emit
metrics (not just traces).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
Mend finding is a false-positive attribution, not a regression from this PR. The Mend gate flags it, but the fix belongs in a separate CDK dependency-bump PR (upgrade …).
@lezzago why is alert manager not coming up in the left nav?
There is a bug in the Alert Manager UI, I think, with the new side-nav changes that were coming in.
    <<: *network
    logging: *logging

    # ******************

If demo-only, then this should move to another compose file, e.g. docker-compose.otel-demo.yml.
    # continue to work everywhere without changes.
    prometheus:
    -  image: prom/prometheus:${PROMETHEUS_VERSION}
    +  image: cortexproject/cortex:v1.18.1

Use a CORTEX_VERSION env var.
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.path=/prometheus'
    # Retention period from environment variable
    - '--storage.tsdb.retention.time=${PROMETHEUS_RETENTION}'

How are we handling data retention? With this option removed, I am concerned metrics data storage will grow unbounded.
    # HTTP 5xx at the customer-facing boundary, so a single scrape unlocks
    # full RED visibility from the edge. The scrape is a no-op when the
    # otel-demo compose file isn't enabled (no DNS → drop).
    prometheus/envoy:

What if users don't deploy the otel demo?
Verified silent (no logs, no refused metrics) when the demo is off.
    @@ -0,0 +1,306 @@
    #!/usr/bin/env python3

Is this init idempotent, i.e. can you run it multiple times and it will work all the same and not duplicate any resources?
Verified idempotent (5 monitors before and after rerun).
    explore.agentTraces.enabled: true
    # Surfaces the Alert Manager UI in the Observability plugin, backed by the
    # alertmanager.uri configured on the Prometheus datasource.
    observability.alertManager.enabled: true

Below mentions the alertmanager config only applies with otel-demo enabled. Is this UI-only and not subject to the same logic?
Yeah, sadly the name of the UI makes it confusing, but this flag only controls whether the UI page is enabled.
    threshold:
      max_events: 500
      flush_interval: 5s
    routes: [service_processed_metrics]

Why remove `routes: [service_processed_metrics]`?
It is still there on line 116 of the file; it moved into a sub-pipeline.
    namespace = os.path.basename(namespace_dir)

    for rules_file in sorted(glob.glob(f"{namespace_dir}/*.yml")):
        loaded, failed = load_rules_file(rules_file, namespace)

Verified idempotent (rule counts match before and after rerun; unconditional upsert).
I am concerned how this may affect existing deployments. Some areas to test:

Code review found 1 issue: observability-stack/docker-compose.yml lines 274 to 279 (in d93fd52); compare with the accurate description at observability-stack/docker-compose.yml lines 145 to 152 (in d93fd52).
- CORTEX_VERSION env var: add to .env (v1.18.1) and template into the cortex image reference, matching the *_VERSION convention used by OTEL_COLLECTOR_VERSION, OPENSEARCH_VERSION, DATA_PREPPER_VERSION, ALERTMANAGER_VERSION. Flagged by @kylehounslow.
- Cortex retention: wire PROMETHEUS_RETENTION through to -compactor.blocks-retention-period so metrics storage doesn't grow unbounded. Restores the retention behavior vanilla Prometheus had before the backend swap. Flagged by @kylehounslow.
- Stale alertmanager comment in the opensearch-dashboards-init environment: corrected to reflect that alertmanager now runs unconditionally in the base compose (post-467bb07). Flagged by @kylehounslow (opensearch-project#6) and @joshuali925.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
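Assuming the *_VERSION convention described above, the wiring might look like this (a sketch; the retention value and flag placement are illustrative):

```yaml
# .env (sketch)
#   CORTEX_VERSION=v1.18.1
#   PROMETHEUS_RETENTION=15d

# docker-compose.yml (sketch)
services:
  prometheus:
    image: cortexproject/cortex:${CORTEX_VERSION}
    command:
      - "-config.file=/etc/cortex/cortex.yaml"
      # Reuse the existing retention knob so storage doesn't grow unbounded.
      - "-compactor.blocks-retention-period=${PROMETHEUS_RETENTION}"
```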
…g bash
Folds four P0/P1 fixes surfaced during stability testing:
P0.1 — opensearch-stack-monitors-init / otel-demo-monitors-init race
with the OpenSearch alerting plugin on cold bring-up. `/_cluster/health`
reporting green is not sufficient — the alerting plugin's internal
indices still need to allocate. `POST /_plugins/_alerting/monitors`
now retries up to 12× / 5s on 5xx and "all shards failed" responses.
Without this, fresh installs silently dropped the stack "Cluster
Health Red" monitor (5/6 monitors ending up created).
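The retry condition in P0.1 can be sketched as a small predicate plus a retry loop (names, the injected `post` callable, and the pluggable `sleep` are illustrative, not the actual init scripts):

```python
def should_retry(status, body):
    """Retry monitor creation on 5xx responses or on the transient
    'all shards failed' error the alerting plugin returns while its
    internal indices are still allocating."""
    return status >= 500 or "all shards failed" in body

def create_with_retry(post, monitor, attempts=12, delay=5.0, sleep=lambda s: None):
    """POST a monitor, retrying up to `attempts` times on transient errors."""
    for _ in range(attempts):
        status, body = post(monitor)
        if status < 300:
            return True
        if not should_retry(status, body):
            return False  # genuine 4xx: don't mask it by retrying
        sleep(delay)
    return False
```

Bailing out early on non-retryable 4xx keeps a genuinely malformed monitor from being masked by the retry budget.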
P0.2 — in-place upgrades from pre-PR main left the
`ObservabilityStack_Prometheus` datasource with only
`{prometheus.uri, prometheus.auth.*}`, so the OSD Alert Manager UI
silently surfaced zero alerts. The init script now reads the
authoritative properties via `GET /api/dataconnections`, and when the
new `prometheus.ruler.uri` / `alertmanager.uri` are missing or stale,
migrates via DELETE + POST (the SQL plugin does not expose a working
PUT/PATCH). The DELETE+POST changes the saved-object id, so the
migration also cleans up the orphaned pre-PR `data-connection`
saved-object and any correlations whose references still point at it,
keeping the saved-object graph consistent. Reruns are idempotent.
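The "missing or stale" check driving that DELETE + POST migration can be sketched as (property keys follow the PR description; the helper itself is illustrative):

```python
# URI properties the post-PR datasource must carry.
REQUIRED = ("prometheus.uri", "prometheus.ruler.uri", "alertmanager.uri")

def needs_migration(properties, expected):
    """True when any required URI property is absent or differs from the
    desired value, i.e. the datasource predates this PR or is stale."""
    return any(properties.get(key) != expected[key] for key in REQUIRED)
```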
P1.3 — `cortex-rules-init` / `cortex-rules-init-otel-demo` had no
healthcheck, so `docker compose up -d --wait` returned while rules
were still being loaded. init-cortex-rules.py now writes
`/tmp/rules-loaded` on clean completion and both services test for
that file. Rule counts at `--wait` return are now (1, 3) instead of
(0, 0).
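The sentinel-file healthcheck in P1.3 might be wired like this (a sketch; interval/retries values are illustrative assumptions):

```yaml
services:
  cortex-rules-init:
    # init-cortex-rules.py touches /tmp/rules-loaded only on clean completion,
    # so `docker compose up -d --wait` blocks until rules are actually loaded.
    healthcheck:
      test: ["CMD", "test", "-f", "/tmp/rules-loaded"]
      interval: 5s
      retries: 24
```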
P1.4 — an in-place upgrade left stale vanilla-Prometheus TSDB
directories (`chunks_head`, `wal`, `wbl`, `lock`, `queries.active`)
dangling under `/data` because Cortex writes its own layout
(`/data/tsdb`, `/data/ruler-storage`) alongside them. The Cortex
entrypoint now cleans these up on first boot only — gated on
`/data/tsdb` being absent AND `/data/chunks_head` being present,
so fresh deploys and subsequent restarts are untouched.
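The first-boot gating in P1.4 amounts to the following (a Python sketch under the stated gating conditions; the actual entrypoint is presumably shell):

```python
import os
import shutil

# Vanilla-Prometheus leftovers that Cortex never uses.
STALE = ["chunks_head", "wal", "wbl", "lock", "queries.active"]

def clean_stale_tsdb(data_dir):
    """Remove old Prometheus artifacts, but only on the first Cortex boot
    after an upgrade: /data/tsdb absent (Cortex hasn't written its layout
    yet) AND /data/chunks_head present (old Prometheus data exists)."""
    if os.path.exists(os.path.join(data_dir, "tsdb")):
        return False  # Cortex already initialized: not first boot
    if not os.path.exists(os.path.join(data_dir, "chunks_head")):
        return False  # fresh deploy: nothing stale to remove
    for name in STALE:
        path = os.path.join(data_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        elif os.path.exists(path):
            os.remove(path)
    return True
```

The two guards make the cleanup a no-op on both fresh deploys and every restart after the first.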
Verified via the full Scenario 2b cold bring-up:
- 35s `docker compose up -d --wait` exits 0
- All 6 OpenSearch monitors created
- Cortex rules loaded at `--wait` return (1 stack + 3 otel_demo groups)
- Datasource has all 3 required URI properties after upgrade
- Exactly 1 data-connection saved-object, no orphans
- Correlations reference the current datasource id
- 21 Cortex / 20 Alertmanager alerts firing after 5-min demo soak
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
Good call — I ran a full stability test pass against those three scenarios and surfaced four real issues. All fixed now.
Race surfaced: stack-monitor init vs. OpenSearch alerting plugin readiness. /_cluster/health green isn't enough — the plugin's indices still need to allocate.
Tested three teardown cycles + force-recreate of every init container:
Related bug fixed: cortex-rules-init had no healthcheck, so --wait returned while rules were still loading (counts (0, 0) for ~30s). It now writes /tmp/rules-loaded on clean completion; --wait blocks properly and counts are (1, 3) at return.
- README.md: mention Cortex under the `prometheus` service name, add Alertmanager to the components list and ports table (9093), update the 9090 description to reflect Cortex's Ruler + PromQL endpoints, and add an "Upgrading from Previous Releases" section that documents the (unavoidable) historical-metric loss and the `docker compose down -v` clean-slate path.
- docs/starlight-docs/src/content/docs/alerting/index.md: add a "Prometheus/Cortex alerting" section explaining the two alerting surfaces, rule file locations (stack/ and otel-demo/), the Alertmanager routing tree, and the unified Alert Manager UI in OSD, including a troubleshooting note for upgrades where the datasource still lacks the new URI properties.

Starlight build passes: `✓ All internal links are valid.`

Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
Codecov Report: all modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main     #226   +/- ##
=====================================
  Coverage   55.62%   55.62%
=====================================
  Files           4        4
  Lines         169      169
  Branches       48       48
=====================================
  Hits           94       94
  Misses         74       74
  Partials        1        1
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
Summary
- Replace vanilla Prometheus with Cortex (kept under the service name `prometheus` for backward compat); add Alertmanager as a sibling service that always runs.
- Set `OTEL_METRICS_EXPORTER=otlp` globally, add envoy + collector self-scrape, and strip high-cardinality `randomKey` on the Cortex branch of the service-map pipeline.
- Surface the Alert Manager UI in the observability stack.
Notable design calls
- `cortex.yaml` raises `max_label_names_per_series` to 50 because `resource_to_telemetry_conversion: true` pushes JVM/.NET/Node.js series past the default 30-label cap.
- `otel-demo-alerts.yml` uses span-derived `latency_seconds_*{namespace="span_derived"}` instead of per-service `rpc_server_*` / `http_server_*`: the demo services don't emit RED metrics uniformly, so span-derived gives full coverage.
- `cortex-rules-init` always upserts via `POST /api/v1/rules/{namespace}` so rule edits take effect on re-run. It adds a Cortex-readiness retry budget and exits non-zero on any POST failure.
- Init containers use `required: true` on `opensearch`: the scripts hardcode `https://opensearch:9200`, so the old `required: false` with an infinite retry loop was a latent footgun.

Stability + upgrade-path fixes (commit d69b8e6)

Four issues were surfaced during a stability test pass (scenarios: in-place upgrade from pre-PR main, with and without the otel-demo overlay, re-deploy idempotency). Details and raw verification output are in `PR-226-TEST-RESULTS.md` / `PR-226-FIX-REPORT.md` (kept local, not committed).

- P0.1: `/_cluster/health` green is not sufficient; the alerting plugin allocates its own internal indices after cluster green, and `POST /_plugins/_alerting/monitors` returns `500 "all shards failed"` until they settle. `init-stack-monitors.py` / `init-otel-demo-monitors.py` now retry up to 12x / 5s on 5xx. Without this, fresh installs silently ended up with 5/6 monitors.
- P0.2: in-place upgrades left the `ObservabilityStack_Prometheus` datasource with only `{prometheus.uri, prometheus.auth.*}`, so OSD Alert Manager silently surfaced zero alerts. The init now reads authoritative properties via `GET /api/dataconnections`, and on mismatch migrates via DELETE + POST (the SQL plugin doesn't expose a working PUT/PATCH). The DELETE + POST changes the saved-object id, so the migration also cleans up the orphaned pre-PR `data-connection` saved-object and any correlations whose references still point at it. Reruns are idempotent.
- P1.3: `cortex-rules-init` / `cortex-rules-init-otel-demo` had no healthcheck, so `docker compose up -d --wait` returned while rules were still loading; rule counts at `--wait` return were (0, 0) for ~30s. `init-cortex-rules.py` now writes `/tmp/rules-loaded` on clean completion and both services test for that file. Counts at `--wait` return are now (1, 3) as expected.
- P1.4: an in-place upgrade left stale vanilla-Prometheus TSDB directories (`chunks_head`, `wal`, `wbl`, `lock`, `queries.active`) dangling under `/data` alongside Cortex's own layout (`/data/tsdb`, `/data/ruler-storage`). The Cortex entrypoint now cleans these up on first boot only, gated on `/data/tsdb` being absent AND `/data/chunks_head` being present, so fresh deploys and subsequent restarts are untouched.

Known follow-ups (not blocking)
- Duplicate `latency_seconds_*` emissions within a single sdk.language. Root cause is `otel_apm_service_map`'s emit cadence within a 10s window; the proper fix is migrating RED metrics to the OTel Collector's `spanmetrics` connector. Tracked separately.
- `opensearch-dashboards-init` has no healthcheck, so `docker compose up -d --wait` returns ~70s before it completes. Not a regression (same behavior as pre-PR main); CI/scripts that query OSD state immediately after `--wait` may see stale state during that window. The same sentinel-file + healthcheck pattern as P1.3 would fix it.
- Several SDK exporters (.NET, JVM, Node.js, Kafka, etc.) fail with "invalid temporality and type combination". This reproduces on fresh deploy (not an upgrade regression) but reduces the surface of series available to rules. The fix is either a `cumulativetodelta` processor in the collector or per-SDK `OTEL_METRIC_EXPORT_TEMPORALITY_PREFERENCE`.

Test plan
- `promtool check rules` for both rule files → SUCCESS (4 stack, 9 otel-demo)
- `amtool check-config` for the Alertmanager template → SUCCESS (7 receivers, 1 inhibit rule)
- `python3 -m py_compile` on the init scripts → OK
- `docker compose config` renders in both demo-on and demo-off modes, both compose v5.0.2 local and v2.38.2 CI-parity → exit 0
- Full stability pass (commit d69b8e6): `docker compose up -d --wait` exits 0 in 35s; all 6 monitors created at `--wait` return (P0.1); Cortex rules loaded at `--wait` return, 1 stack + 3 otel_demo groups (P1.3); exactly 1 `data-connection` saved-object, no orphans, after upgrade (P0.2 follow-up); Cortex data dir contains only `tsdb/`, `ruler-storage/`, `compactor/` after simulated upgrade, fresh deploy unaffected (P1.4)
- `up{job="otel-collector"}` and `up{job="envoy-frontend-proxy"}` both report 1, with `max_label_names_per_series: 50` in effect
- Re-ran `cortex-rules-init` after an edit: log shows `loaded: 4, failed: 0`

🤖 Generated with Claude Code