
Swap Prometheus for Cortex, wire otel-demo metrics, and pre-can alerting #226

Open
lezzago wants to merge 5 commits into opensearch-project:main from lezzago:cortex-metrics

Conversation

@lezzago (Member) commented May 6, 2026

Summary

  • Replace vanilla Prometheus with Cortex (kept as service name prometheus for backward compat); add Alertmanager as a sibling service that always runs.
  • Wire otel-demo metrics end-to-end: enable the demo by default, set OTEL_METRICS_EXPORTER=otlp globally, add envoy + collector self-scrape, and strip high-cardinality randomKey on the Cortex branch of the service-map pipeline.
  • Surface the Alert Manager UI in OpenSearch Dashboards and point its Prometheus datasource at Cortex's query, Ruler, and Alertmanager URIs.
  • Pre-can alerting: 4 stack-health rules (loaded always), 9 otel-demo RED rules on span-derived latency (loaded with overlay), 1 cluster-health monitor in OS, 5 demo trace/log monitors, and Alertmanager routing with an OpenSearch webhook receiver (routing sketched below).
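As a sketch of that routing shape (not the committed template; the receiver URL, auth, and grouping labels here are illustrative guesses), the catch-all OpenSearch webhook could look like:

```yaml
# alertmanager.yml (hypothetical fragment)
route:
  receiver: opensearch-webhook        # catch-all: every alert gets indexed
  group_by: ['alertname', 'service_name']

receivers:
  - name: opensearch-webhook
    webhook_configs:
      # Illustrative: POST each notification as a document into an
      # OpenSearch index; the real receiver/URL may differ.
      - url: https://opensearch:9200/alertmanager-alerts/_doc
        send_resolved: true
        http_config:
          basic_auth:
            username: admin
            password: admin           # placeholder credentials
          tls_config:
            insecure_skip_verify: true  # demo-stack self-signed certs
```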

Alert Manager UI in observability stack

[Screenshots: Alert Manager UI pages in OpenSearch Dashboards]

Notable design calls

  • cortex.yaml raises max_label_names_per_series to 50 because resource_to_telemetry_conversion: true pushes JVM/.NET/Node.js series past the default 30-label cap (config sketched after this list).
  • otel-demo-alerts.yml uses span-derived latency_seconds_*{namespace="span_derived"} instead of per-service rpc_server_* / http_server_* — the demo services don't emit RED metrics uniformly, so span-derived gives full coverage.
  • cortex-rules-init always upserts via POST /api/v1/rules/{namespace} so rule edits take effect on re-run. Adds a Cortex-readiness retry budget and exits non-zero on any POST failure.
  • OS monitor init containers now have required: true on opensearch — the scripts hardcode https://opensearch:9200 so the old required: false with an infinite retry loop was a latent footgun.
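A minimal sketch of the cortex.yaml fragment behind the first bullet (single-binary mode per the commit message; key names follow Cortex's limits and ruler-storage config, and everything besides the 50-label cap is illustrative):

```yaml
# cortex.yaml (hypothetical fragment)
target: all                          # single-binary dev mode

limits:
  # resource_to_telemetry_conversion copies every resource attribute
  # onto each series as a label; JVM/.NET/Node.js runtime metrics then
  # blow past Cortex's default cap of 30 label names per series.
  max_label_names_per_series: 50

ruler_storage:
  backend: filesystem
  filesystem:
    dir: /data/ruler-storage
```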

Stability + upgrade-path fixes (commit d69b8e6)

Four issues were surfaced during a stability test pass (scenarios: in-place upgrade from pre-PR main, with and without otel-demo overlay, re-deploy idempotency). Details and raw verification output in PR-226-TEST-RESULTS.md / PR-226-FIX-REPORT.md (kept local, not committed).

  • P0.1 — stack-monitor init race with the OpenSearch alerting plugin. /_cluster/health green is not sufficient — the plugin allocates its own internal indices after cluster green, and POST /_plugins/_alerting/monitors returns 500 "all shards failed" until they settle. init-stack-monitors.py / init-otel-demo-monitors.py now retry up to 12× / 5s on 5xx. Without this, fresh installs silently ended up with 5/6 monitors.
  • P0.2 — in-place upgrades left the ObservabilityStack_Prometheus datasource with only {prometheus.uri, prometheus.auth.*}, so OSD Alert Manager silently surfaced zero alerts. The init now reads authoritative properties via GET /api/dataconnections, and on mismatch migrates via DELETE + POST (SQL plugin doesn't expose a working PUT/PATCH). The DELETE+POST changes the saved-object id, so the migration also cleans up the orphaned pre-PR data-connection saved-object and any correlations whose references still point at it. Reruns are idempotent.
  • P1.3 — cortex-rules-init / cortex-rules-init-otel-demo had no healthcheck, so docker compose up -d --wait returned while rules were still loading. Rule counts at --wait return were (0, 0) for ~30s. init-cortex-rules.py now writes /tmp/rules-loaded on clean completion and both services test for that file (compose sketch after this list). Counts at --wait return are now (1, 3) as expected.
  • P1.4 — in-place upgrades left stale vanilla-Prometheus TSDB dirs (chunks_head, wal, wbl, lock, queries.active) dangling under /data alongside Cortex's own layout (/data/tsdb, /data/ruler-storage). Cortex entrypoint now cleans these up on first boot only — gated on /data/tsdb being absent AND /data/chunks_head being present, so fresh deploys and subsequent restarts are untouched.
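For P1.3, the sentinel-file pattern in compose terms would look roughly like this (a sketch; the interval and retry values are illustrative, not the committed ones):

```yaml
# docker-compose.yml (hypothetical fragment)
cortex-rules-init:
  # init-cortex-rules.py touches /tmp/rules-loaded only after every rule
  # group has been POSTed successfully, so --wait now blocks until the
  # rules are actually in Cortex.
  healthcheck:
    test: ["CMD", "test", "-f", "/tmp/rules-loaded"]
    interval: 5s
    retries: 24
    start_period: 10s
```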

Known follow-ups (not blocking)

  • Residual Data Prepper duplicate-sample rejects (~0.15/s) for latency_seconds_* within a single sdk.language. Root cause is otel_apm_service_map's emit cadence within a 10s window; the proper fix is migrating RED metrics to the OTel Collector's spanmetrics connector. Tracked separately.
  • opensearch-dashboards-init has no healthcheck, so docker compose up -d --wait returns ~70s before it completes. Not a regression (same behavior as pre-PR main); CI/scripts that query OSD state immediately after --wait may see stale state during that window. Same sentinel-file + healthcheck pattern as P1.3 would fix it.
  • Cortex rejects many otel-demo runtime metrics (.NET, JVM, Node.js, Kafka, etc.) with "invalid temporality and type combination". Reproduces on fresh deploy — not an upgrade regression — but reduces the surface of series available to rules. Since Prometheus remote-write only accepts cumulative temporality, the fix is either a deltatocumulative processor in the collector or a per-SDK OTEL_METRIC_EXPORT_TEMPORALITY_PREFERENCE=cumulative.
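For that temporality follow-up, the collector-side option could look roughly like this (the processor name is from opentelemetry-collector-contrib; the pipeline wiring here is illustrative):

```yaml
# otel-collector config (hypothetical fragment)
processors:
  # Prometheus remote-write only accepts cumulative temporality, so
  # convert the delta sums/histograms some demo SDKs emit before they
  # reach the Cortex exporter.
  deltatocumulative:
    max_stale: 5m

service:
  pipelines:
    metrics:
      processors: [deltatocumulative]
```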

Test plan

  • promtool check rules for both rule files → SUCCESS (4 stack, 9 otel-demo)
  • amtool check-config for the Alertmanager template → SUCCESS (7 receivers, 1 inhibit rule)
  • python3 -m py_compile on the init scripts → OK
  • docker compose config renders in both demo-on and demo-off modes, with both compose v5.0.2 (local) and v2.38.2 (CI parity) → exit 0
  • Full Scenario 2b cold bring-up verified against final commit (d69b8e6):
    • docker compose up -d --wait exits 0 in 35s
    • All 6 OpenSearch monitors created at --wait return (P0.1)
    • Cortex rules loaded at --wait return: 1 stack + 3 otel_demo groups (P1.3)
    • Datasource has all 3 required URI properties after simulated upgrade (P0.2)
    • Exactly 1 data-connection saved-object — no orphans — after upgrade (P0.2 follow-up)
    • Correlations reference the current datasource id (P0.2 follow-up)
    • Cortex volume contains only tsdb/, ruler-storage/, compactor/ after simulated upgrade; fresh deploy unaffected (P1.4)
    • 21 Cortex / 20 Alertmanager alerts firing after 5-min demo soak
  • Live verification with the otel-demo overlay:
    • Cortex up{job="otel-collector"} and up{job="envoy-frontend-proxy"} both report 1
    • OSD "Alert Manager" UI renders both OS-monitor alerts and Cortex alerts when both datasources are selected
    • Label-cap rejects: 16/5m → 0/5m after max_label_names_per_series: 50
    • Duplicate-sample rejects: 65/5m → 46/5m (partial — residual tracked as follow-up)
    • Dead otel-demo rules: 5 of 9 → 0 of 9
    • Rule-loader upsert verified by re-running cortex-rules-init after an edit — log shows loaded: 4, failed: 0

🤖 Generated with Claude Code

@lezzago force-pushed the cortex-metrics branch 3 times, most recently from 87c1d84 to 9d290e3 on May 6, 2026 20:31
Replaces vanilla Prometheus with Cortex to get the full Prometheus HTTP
API surface (query + Ruler + Alertmanager), wires otel-demo metrics
end-to-end, surfaces the Alert Manager UI in OpenSearch Dashboards, and
ships pre-canned alerting rules, monitors, and routing.

Backend swap
- docker-compose: run cortexproject/cortex:v1.18.1 under service name
  `prometheus` so PROMETHEUS_HOST/PORT continue to work everywhere.
  Add Alertmanager (prom/alertmanager:v0.27.0) as a sibling service,
  always running (routing is a harmless no-op when the otel-demo overlay
  is off).
- cortex.yaml: single-binary dev config with filesystem-backed ruler
  storage. Per-tenant limits widened so span-derived RED metrics don't
  blow past defaults; max_label_names_per_series raised to 50 because
  blanket resource_to_telemetry_conversion on JVM/.NET/Node.js services
  pushes label counts past Cortex's default of 30.
- OTel collector: point the metrics pipeline at Cortex's /api/v1/push
  remote-write endpoint with resource_to_telemetry_conversion so every
  sample lands with service_name. Add a prometheus/self scrape of
  localhost:8888 so otelcol_* series reach Cortex (the stack rules
  depend on this). Add an envoy scrape so ingress HTTP RPS/latency is
  visible for otel-demo (fragments sketched after this section).
- data-prepper: split the service-map pipeline so the Cortex branch
  strips per-event randomKey UUIDs before remote-write (otel_apm_service_map's
  grouping by telemetry.sdk.language is preserved so multi-language
  emissions don't collide on the same timestamp).
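The collector fragments described above could look roughly like this (a sketch assuming the stack's `prometheus` service name and port; the receiver and exporter names follow the upstream collector components, but the committed config may differ):

```yaml
# otel-collector config (hypothetical fragment)
receivers:
  prometheus/self:
    config:
      scrape_configs:
        - job_name: otel-collector
          static_configs:
            - targets: ['localhost:8888']      # otelcol_* self-metrics
        - job_name: envoy-frontend-proxy       # no-op when the demo is off
          metrics_path: /stats/prometheus
          static_configs:
            - targets: ['frontend-proxy:10000']  # admin port illustrative

exporters:
  prometheusremotewrite:
    endpoint: http://prometheus:9090/api/v1/push  # Cortex push API
    resource_to_telemetry_conversion:
      enabled: true    # every sample lands with service_name
```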

OpenSearch Dashboards integration
- Create the Prometheus datasource with Cortex-correct URIs:
  prometheus.uri uses /prometheus, prometheus.ruler.uri is the
  unprefixed Cortex root, alertmanager.uri targets the new service.
- Turn on observability.alertManager.enabled so the Observability
  plugin surfaces the Alert Manager UI.

Pre-canned alerting
- Alertmanager template: catch-all webhook that indexes alerts into
  OpenSearch, plus demo-match routes and dummy Slack/email/PagerDuty
  receivers as integration-shape examples. Silences/state persist in a
  named volume.
- Cortex rules:
  - rules-stack/stack-alerts.yml (always loaded): scrape-target down,
    collector export failures, high memory, queue near capacity.
  - rules-otel-demo/otel-demo-alerts.yml (loaded with the overlay): RED
    alerts built on span-derived latency_seconds_* since most demo
    services don't emit their own RED metrics (rule shape sketched after
    this section).
  - cortex-rules-init container upserts every group via POST
    /api/v1/rules/{namespace} with a retry budget on Cortex readiness
    and a non-zero exit on any failure, so rule edits take effect on
    re-run and partial failures don't go unnoticed.
- OpenSearch monitors:
  - init-stack-monitors.py (always): cluster_metrics_monitor for
    cluster health red. Red-only so single-node yellow doesn't flap.
    Init container has required: true on opensearch since the script
    hardcodes https://opensearch:9200.
  - init-otel-demo-monitors.py (with overlay): 5 query-level monitors
    on checkout/payment/cart/frontend traces and logs.
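The span-derived RED rules could take roughly this shape (a hypothetical example, not one of the nine committed rules; the alert name and threshold are illustrative):

```yaml
# rules-otel-demo/otel-demo-alerts.yml (hypothetical rule shape)
groups:
  - name: otel-demo-red
    rules:
      - alert: SpanDerivedP95LatencyHigh
        expr: |
          histogram_quantile(0.95,
            sum by (le, service_name) (
              rate(latency_seconds_bucket{namespace="span_derived"}[5m])
            )
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "p95 span-derived latency above 1s on {{ $labels.service_name }}"
```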

Demo propagation
- .env + otel-demo compose: enable otel-demo by default, set
  OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER=otlp globally, and propagate
  them into every demo service so Node.js/Python SDKs actually emit
  metrics (not just traces).
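The propagation pattern could be sketched like this (the anchor name and service are illustrative; the real overlay may wire the variables differently):

```yaml
# docker-compose.otel-demo.yml (hypothetical fragment)
# .env sets OTEL_METRICS_EXPORTER=otlp and OTEL_LOGS_EXPORTER=otlp
x-demo-env: &demo-env
  OTEL_METRICS_EXPORTER: ${OTEL_METRICS_EXPORTER}
  OTEL_LOGS_EXPORTER: ${OTEL_LOGS_EXPORTER}

services:
  frontend:
    environment:
      # YAML merge key: reuse the shared exporter env in every service
      <<: *demo-env
```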

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
@lezzago lezzago marked this pull request as ready for review May 6, 2026 20:46
@lezzago (Member, Author) commented May 6, 2026

Mend finding is a false-positive attribution, not a regression from this PR

The Mend gate flags CVE-2026-6321 in fast-uri@3.1.0 under aws/cdk/package.json → aws-cdk-lib@2.248.0 → table → ajv → fast-uri. This PR does not touch the CDK dependency tree:

$ git diff --name-only origin/main HEAD | grep -iE "cdk|package\.json"
(no matches)

aws/cdk/package.json was last modified in #159 (and added in #153), both well before this branch. The CVE is "new" to Mend's database, not new to this branch — the same scan run against current main would report the same vuln. Mend's base-branch commit in the report (1a8c22a3…) predates when the CVE was published to their feed.

Fix belongs in a separate CDK dependency-bump PR (upgrade aws-cdk-lib or add an npm overrides forcing fast-uri@>=3.1.1). I'll leave that out of this PR to keep the scope clean — happy to file a follow-up issue if useful.

@ps48 (Member) commented May 6, 2026

@lezzago why is alert manager not coming up in the left nav?

@lezzago (Member, Author) commented May 6, 2026

@lezzago why is alert manager not coming up in the left nav?

I think there is a bug in the Alert Manager UI related to the new side-nav changes that were coming in.
I plan to look at that issue separately, since it lives in a different repo; this stack will pull in the fix once it lands.
For now, to view the page, set the end of the URI to: app/observability-alerting

<<: *network
logging: *logging

# ******************
Collaborator: If demo-only, then this should move to another compose file, e.g. docker-compose.otel-demo.yml

Comment thread docker-compose.yml Outdated
# continue to work everywhere without changes.
prometheus:
image: prom/prometheus:${PROMETHEUS_VERSION}
image: cortexproject/cortex:v1.18.1
Collaborator: Use a CORTEX_VERSION env var.

Comment thread docker-compose.yml
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
# Retention period from environment variable
- '--storage.tsdb.retention.time=${PROMETHEUS_RETENTION}'
Collaborator: How are we handling data retention? Removing this option, I am concerned metrics data storage will grow infinitely.

# HTTP 5xx at the customer-facing boundary, so a single scrape unlocks
# full RED visibility from the edge. The scrape is a no-op when the
# otel-demo compose file isn't enabled (no DNS → drop).
prometheus/envoy:
Collaborator: What if users don't deploy the otel demo?

Member Author: Verified silent (no logs, no refused metrics) when the demo is off.

@@ -0,0 +1,306 @@
#!/usr/bin/env python3
Collaborator: Is this init idempotent? I.e., can you run it multiple times and it will work all the same and not duplicate any resources?

Member Author: Verified idempotent (5 monitors before/after rerun).

explore.agentTraces.enabled: true
# Surfaces the Alert Manager UI in the Observability plugin, backed by the
# alertmanager.uri configured on the Prometheus datasource.
observability.alertManager.enabled: true
Collaborator: Below it's mentioned that the alertmanager config only applies with otel-demo enabled. Is this UI-only and not subject to the same logic?

Member Author: Yeah, sadly the name of the UI makes it confusing, but this setting only controls whether the UI page is enabled or not.

threshold:
max_events: 500
flush_interval: 5s
routes: [service_processed_metrics]
Collaborator: Why was routes: [service_processed_metrics] removed?

Member Author: routes: [service_processed_metrics] is still there (now on line 116); it moved to a sub-pipeline.

namespace = os.path.basename(namespace_dir)

for rules_file in sorted(glob.glob(f"{namespace_dir}/*.yml")):
loaded, failed = load_rules_file(rules_file, namespace)
Collaborator: Is this idempotent?

Member Author: Verified idempotent (rule counts match before/after rerun; unconditional upsert).

@kylehounslow (Collaborator) commented:

I am concerned how this may affect existing deployments. Some areas to test:

  1. vanilla Prometheus running initially and then making this update to swap to cortex
  2. running with/without otel-demo
  3. re-deploying multiple times and ensure duplicate resources aren't created (and won't error out if resources already exist)

@joshuali925 (Member) commented:

Code review

Found 1 issue:

  1. Stale comment claims alertmanager lives in docker-compose.otel-demo.yml and the URI "dangles when the demo is off", but this PR defines alertmanager unconditionally in the main docker-compose.yml — which the adjacent comment a few lines above correctly describes ("Runs whether or not the otel-demo is enabled"). The misleading comment will confuse future readers about when the service runs.

- PROMETHEUS_PORT=${PROMETHEUS_PORT}
# alertmanager.uri is set on the Prometheus datasource unconditionally.
# The service itself only starts with the otel-demo flag (alertmanager
# lives in docker-compose.otel-demo.yml), so this URI dangles when the
# demo is off — harmless, since no alerts are firing to route anyway.
- ALERTMANAGER_HOST=alertmanager

Compare with the accurate description at:

# Prometheus Alertmanager - Alert routing, grouping, deduplication, and silencing.
# Runs whether or not the otel-demo is enabled: the base stack rules (collector
# health, scrape-target health) alert into it, and demo rules alert in when the
# demo overlay is enabled too. The OSD Prometheus datasource's alertmanager.uri
# points at this service's HTTP API.
alertmanager:
image: prom/alertmanager:${ALERTMANAGER_VERSION}
container_name: alertmanager

🤖 Generated with Claude Code


- CORTEX_VERSION env var: add to .env (v1.18.1) and template into the
  cortex image reference, matching the *_VERSION convention used by
  OTEL_COLLECTOR_VERSION, OPENSEARCH_VERSION, DATA_PREPPER_VERSION,
  ALERTMANAGER_VERSION. Flagged by @kylehounslow.
- Cortex retention: wire PROMETHEUS_RETENTION through to
  -compactor.blocks-retention-period so metrics storage doesn't grow
  unbounded. Restores the retention behavior vanilla Prometheus had
  before the backend swap (compose fragment sketched after this list).
  Flagged by @kylehounslow.
- Stale alertmanager comment in opensearch-dashboards-init environment:
  corrected to reflect that alertmanager now runs unconditionally in
  the base compose (post-467bb07). Flagged by @kylehounslow (opensearch-project#6) and
  joshuali925.
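Both compose-side fixes could land roughly like this (a sketch; the flag spelling follows Cortex's single-dash CLI convention, but the committed wiring may differ):

```yaml
# docker-compose.yml (hypothetical fragment)
prometheus:
  image: cortexproject/cortex:${CORTEX_VERSION}   # CORTEX_VERSION=v1.18.1 in .env
  command:
    - '-config.file=/etc/cortex/cortex.yaml'
    # Reuse the existing retention knob so block storage is compacted
    # away instead of growing without bound.
    - '-compactor.blocks-retention-period=${PROMETHEUS_RETENTION}'
```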

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
lezzago added a commit to lezzago/observability-stack that referenced this pull request May 7, 2026
…g bash

Folds four P0/P1 fixes surfaced during stability testing:

P0.1 — opensearch-stack-monitors-init / otel-demo-monitors-init race
with the OpenSearch alerting plugin on cold bring-up. `/_cluster/health`
reporting green is not sufficient — the alerting plugin's internal
indices still need to allocate. `POST /_plugins/_alerting/monitors`
now retries up to 12× / 5s on 5xx and "all shards failed" responses.
Without this, fresh installs silently dropped the stack "Cluster
Health Red" monitor (5/6 monitors ending up created).

P0.2 — in-place upgrades from pre-PR main left the
`ObservabilityStack_Prometheus` datasource with only
`{prometheus.uri, prometheus.auth.*}`, so the OSD Alert Manager UI
silently surfaced zero alerts. The init script now reads the
authoritative properties via `GET /api/dataconnections`, and when the
new `prometheus.ruler.uri` / `alertmanager.uri` are missing or stale,
migrates via DELETE + POST (the SQL plugin does not expose a working
PUT/PATCH). The DELETE+POST changes the saved-object id, so the
migration also cleans up the orphaned pre-PR `data-connection`
saved-object and any correlations whose references still point at it,
keeping the saved-object graph consistent. Reruns are idempotent.

P1.3 — `cortex-rules-init` / `cortex-rules-init-otel-demo` had no
healthcheck, so `docker compose up -d --wait` returned while rules
were still being loaded. init-cortex-rules.py now writes
`/tmp/rules-loaded` on clean completion and both services test for
that file. Rule counts at `--wait` return are now (1, 3) instead of
(0, 0).

P1.4 — an in-place upgrade left stale vanilla-Prometheus TSDB
directories (`chunks_head`, `wal`, `wbl`, `lock`, `queries.active`)
dangling under `/data` because Cortex writes its own layout
(`/data/tsdb`, `/data/ruler-storage`) alongside them. The Cortex
entrypoint now cleans these up on first boot only — gated on
`/data/tsdb` being absent AND `/data/chunks_head` being present,
so fresh deploys and subsequent restarts are untouched.
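A sketch of that gating logic in compose terms (the binary path and config
location are assumptions about the image, not taken from the committed
entrypoint):

```yaml
# docker-compose.yml (hypothetical entrypoint fragment)
prometheus:
  entrypoint:
    - /bin/sh
    - -c
    - |
      # First boot after an in-place upgrade only: Cortex layout absent
      # AND vanilla-Prometheus layout present.
      if [ ! -d /data/tsdb ] && [ -d /data/chunks_head ]; then
        rm -rf /data/chunks_head /data/wal /data/wbl /data/lock /data/queries.active
      fi
      exec /bin/cortex -config.file=/etc/cortex/cortex.yaml
```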

Verified via the full Scenario 2b cold bring-up:
- 35s `docker compose up -d --wait` exits 0
- All 6 OpenSearch monitors created
- Cortex rules loaded at `--wait` return (1 stack + 3 otel_demo groups)
- Datasource has all 3 required URI properties after upgrade
- Exactly 1 data-connection saved-object, no orphans
- Correlations reference the current datasource id
- 21 Cortex / 20 Alertmanager alerts firing after 5-min demo soak

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
@lezzago (Member, Author) commented May 7, 2026

I am concerned how this may affect existing deployments. Some areas to test:

1. vanilla Prometheus running initially and then making this update to swap to cortex

2. running with/without otel-demo

3. re-deploying multiple times and ensure duplicate resources aren't created (and won't error out if resources already exist)

Good call — I ran a full stability test pass against those three scenarios and surfaced four real issues. All fixed now.

  1. Vanilla Prometheus → Cortex upgrade
  • Datasource not migrated. Pre-PR datasource had only prometheus.uri; init short-circuited on name match, so upgraded deployments silently lost the new alertmanager.uri / prometheus.ruler.uri and OSD Alert Manager UI showed zero Cortex alerts. Init now diffs against /api/dataconnections and reconciles via DELETE + POST (PUT/PATCH aren't exposed); also cleans up the orphaned saved-object wrapper and dangling correlations so the graph stays consistent. Reruns are idempotent.
  • Stale TSDB dirs. Cortex writes /data/tsdb + /data/ruler-storage alongside leftover chunks_head/wal/wbl. Entrypoint now cleans them up on first boot only, gated on /data/tsdb absent AND /data/chunks_head present.
  • Historical metrics aren't migrated — Cortex can't read vanilla-Prometheus blocks. New OTLP writes work immediately but pre-upgrade data is gone. Worth a release note.
  2. With / without otel-demo
  • Base only: 15 containers, 1 stack rule group, AM empty, OSD reachable.
  • With demo: 38 containers, 1 stack + 3 otel_demo rule groups, 6 monitors, 21 Cortex / 20 AM alerts firing after 5-min soak.
  • Compose renders exit 0 on both compose v5.0.2 and v2.38.2 (CI parity).

Race surfaced: stack-monitor init vs. OpenSearch alerting plugin readiness. /_cluster/health green isn't enough — the plugin's indices still need to allocate.
Fresh installs silently ended up with 5/6 monitors. Monitor-create now retries on 5xx / "all shards failed" (up to 12× / 5s).

  3. Re-deploy idempotency

Tested three teardown cycles + force-recreate of every init container:

  • Repeat up: exit 0, IDs stable, zero error log lines.
  • Rule loaders: loaded: N, failed: 0 on rerun; live-edit round-trip works.
  • Monitor loaders: all log Monitor already exists; count stable.
  • OSD init: saved-object counts stable across reruns (no duplicates).

Related bug fixed: cortex-rules-init had no healthcheck, so --wait returned while rules were still loading (counts (0, 0) for ~30s). Now writes /tmp/rules-loaded on clean completion; --wait blocks properly and counts are (1, 3) at return.

- README.md: mention Cortex under the `prometheus` service name,
  add Alertmanager to the components list and ports table (9093),
  update the 9090 description to reflect Cortex's Ruler + PromQL
  endpoints, and add an "Upgrading from Previous Releases" section
  that documents the (unavoidable) historical-metric loss and the
  `docker compose down -v` clean-slate path.

- docs/starlight-docs/src/content/docs/alerting/index.md: add a
  "Prometheus/Cortex alerting" section explaining the two alerting
  surfaces, rule file locations (stack/ and otel-demo/), Alertmanager
  routing tree, and the unified Alert Manager UI in OSD, including
  a troubleshooting note for upgrades where the datasource still
  lacks the new URI properties.

Starlight build passes `✓ All internal links are valid.`.

Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
@codecov bot commented May 7, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 55.62%. Comparing base (15f21e2) to head (033a0d2).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #226   +/-   ##
=======================================
  Coverage   55.62%   55.62%           
=======================================
  Files           4        4           
  Lines         169      169           
  Branches       48       48           
=======================================
  Hits           94       94           
  Misses         74       74           
  Partials        1        1           


Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>