Swap Prometheus for Cortex, wire otel-demo metrics, and pre-can alerting #226

lezzago wants to merge 5 commits into opensearch-project:main

Conversation
Replaces vanilla Prometheus with Cortex to get the full Prometheus HTTP
API surface (query + Ruler + Alertmanager), wires otel-demo metrics
end-to-end, surfaces the Alert Manager UI in OpenSearch Dashboards, and
ships pre-canned alerting rules, monitors, and routing.
Backend swap
- docker-compose: run cortexproject/cortex:v1.18.1 under service name
`prometheus` so PROMETHEUS_HOST/PORT continue to work everywhere.
Add Alertmanager (prom/alertmanager:v0.27.0) as a sibling service,
always running (routing is a harmless no-op when the otel-demo overlay
is off).
- cortex.yaml: single-binary dev config with filesystem-backed ruler
storage. Per-tenant limits widened so span-derived RED metrics don't
blow past defaults; max_label_names_per_series raised to 50 because
blanket resource_to_telemetry_conversion on JVM/.NET/Node.js services
pushes label counts past Cortex's default of 30.
- OTel collector: point the metrics pipeline at Cortex's /api/v1/push
remote-write endpoint with resource_to_telemetry_conversion so every
sample lands with service_name. Add a prometheus/self scrape of
localhost:8888 so otelcol_* series reach Cortex (the stack rules
depend on this). Add an envoy scrape so ingress HTTP RPS/latency is
visible for otel-demo.
- data-prepper: split the service-map pipeline so the Cortex branch
strips per-event randomKey UUIDs before remote-write (otel_apm_service_map's
grouping by telemetry.sdk.language is preserved so multi-language
emissions don't collide on the same timestamp).
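The service aliasing described above can be sketched as a compose fragment (a sketch only; volume names, ports, and the command line are illustrative assumptions, not the shipped file):

```yaml
# docker-compose.yml (sketch): Cortex keeps the `prometheus` service name so
# existing PROMETHEUS_HOST/PORT references resolve unchanged.
services:
  prometheus:
    image: cortexproject/cortex:v1.18.1
    command: ["-config.file=/etc/cortex/cortex.yaml", "-target=all"]
    volumes:
      - ./cortex.yaml:/etc/cortex/cortex.yaml:ro
      - cortex-data:/data
  alertmanager:
    image: prom/alertmanager:v0.27.0   # sibling service, always running
    volumes:
      - alertmanager-data:/alertmanager  # silences/state persist here
volumes:
  cortex-data:
  alertmanager-data:
```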
OpenSearch Dashboards integration
- Create the Prometheus datasource with Cortex-correct URIs:
prometheus.uri uses /prometheus, prometheus.ruler.uri is the
unprefixed Cortex root, alertmanager.uri targets the new service.
- Turn on observability.alertManager.enabled so the Observability
plugin surfaces the Alert Manager UI.
Pre-canned alerting
- Alertmanager template: catch-all webhook that indexes alerts into
OpenSearch, plus demo-match routes and dummy Slack/email/PagerDuty
receivers as integration-shape examples. Silences/state persist in a
named volume.
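The catch-all-plus-demo-routes shape described above might look like this (receiver names, matchers, and the webhook URL are illustrative assumptions, not the shipped template):

```yaml
# alertmanager.yml (sketch)
route:
  receiver: opensearch-webhook        # catch-all: index every alert
  routes:
    - matchers: ['namespace="otel_demo"']
      receiver: demo-slack            # dummy integration-shape example
receivers:
  - name: opensearch-webhook
    webhook_configs:
      - url: http://alert-indexer:8080/alerts   # hypothetical indexing endpoint
  - name: demo-slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/T000/B000/XXXX  # dummy
        channel: '#demo-alerts'
```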
- Cortex rules:
- rules-stack/stack-alerts.yml (always loaded): scrape-target down,
collector export failures, high memory, queue near capacity.
- rules-otel-demo/otel-demo-alerts.yml (loaded with the overlay): RED
alerts built on span-derived latency_seconds_* since most demo
services don't emit their own RED metrics.
- cortex-rules-init container upserts every group via POST
/api/v1/rules/{namespace} with a retry budget on Cortex readiness
and a non-zero exit on any failure, so rule edits take effect on
re-run and partial failures don't go unnoticed.
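The upsert-with-retry-budget behavior can be sketched as follows (a sketch only: the function name, the injected `post`/`ready` callables, and the status handling are illustrative, not the actual init-cortex-rules.py):

```python
import time

def upsert_groups(post, groups, namespace, ready, retries=30, delay=1.0):
    """Wait for Cortex readiness within a retry budget, then POST every
    rule group; return the failure count so the caller can exit non-zero
    if any upsert did not land."""
    for _ in range(retries):
        if ready():
            break
        time.sleep(delay)
    else:
        raise RuntimeError("Cortex never became ready within the retry budget")
    failures = 0
    for group in groups:
        # POST /api/v1/rules/{namespace} creates or replaces the group,
        # so re-running after a rule edit always takes effect.
        status = post(f"/api/v1/rules/{namespace}", group)
        if status not in (200, 201, 202):
            failures += 1
    return failures
```

Injecting `post` and `ready` keeps the readiness/retry logic testable without a live Cortex.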
- OpenSearch monitors:
- init-stack-monitors.py (always): cluster_metrics_monitor for
cluster health red. Red-only so single-node yellow doesn't flap.
Init container has required: true on opensearch since the script
hardcodes https://opensearch:9200.
- init-otel-demo-monitors.py (with overlay): 5 query-level monitors
on checkout/payment/cart/frontend traces and logs.
Demo propagation
- .env + otel-demo compose: enable otel-demo by default, set
OTEL_METRICS_EXPORTER/OTEL_LOGS_EXPORTER=otlp globally, and propagate
them into every demo service so Node.js/Python SDKs actually emit
metrics (not just traces).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
Mend finding is a false-positive attribution, not a regression from this PR. The Mend gate flags it, but the fix belongs in a separate CDK dependency-bump PR (upgrade …).
@lezzago why is alert manager not coming up in the left nav?
There is a bug in the Alert Manager UI, I think, with the new side-nav changes that were coming in.
    <<: *network
    logging: *logging

    # ******************

If demo-only, then this should move to another compose file, e.g. docker-compose.otel-demo.yml.
    # continue to work everywhere without changes.
    prometheus:
    -  image: prom/prometheus:${PROMETHEUS_VERSION}
    +  image: cortexproject/cortex:v1.18.1

Use a CORTEX_VERSION env var.
    - '--config.file=/etc/prometheus/prometheus.yml'
    - '--storage.tsdb.path=/prometheus'
    # Retention period from environment variable
    - '--storage.tsdb.retention.time=${PROMETHEUS_RETENTION}'

How are we handling data retention? With this option removed, I am concerned metrics data storage will grow unbounded.
    # HTTP 5xx at the customer-facing boundary, so a single scrape unlocks
    # full RED visibility from the edge. The scrape is a no-op when the
    # otel-demo compose file isn't enabled (no DNS → drop).
    prometheus/envoy:

What if users don't deploy the otel demo?
Verified silent (no logs, no refused metrics) when the demo is off.
    @@ -0,0 +1,306 @@
    #!/usr/bin/env python3

Is this init idempotent, i.e. can you run it multiple times and it will work all the same and not duplicate any resources?
Verified idempotent (5 monitors before and after rerun).
    explore.agentTraces.enabled: true
    # Surfaces the Alert Manager UI in the Observability plugin, backed by the
    # alertmanager.uri configured on the Prometheus datasource.
    observability.alertManager.enabled: true

Below mentions the alertmanager config only applies with otel-demo enabled. Is this UI-only and not subject to the same logic?
Yeah, sadly the name of the UI makes it confusing, but this flag only controls whether the UI page is enabled.
    threshold:
      max_events: 500
      flush_interval: 5s
    routes: [service_processed_metrics]

Why remove `routes: [service_processed_metrics]`?
It is still there on line 116 of the file; it moved into a sub-pipeline.
    namespace = os.path.basename(namespace_dir)

    for rules_file in sorted(glob.glob(f"{namespace_dir}/*.yml")):
        loaded, failed = load_rules_file(rules_file, namespace)

Verified idempotent (rule counts match before and after rerun; unconditional upsert).
I am concerned how this may affect existing deployments. Some areas to test:

Code review found 1 issue: observability-stack/docker-compose.yml lines 274 to 279 (in d93fd52); compare with the accurate description at observability-stack/docker-compose.yml lines 145 to 152 (in d93fd52).
- CORTEX_VERSION env var: add to .env (v1.18.1) and template into the cortex image reference, matching the *_VERSION convention used by OTEL_COLLECTOR_VERSION, OPENSEARCH_VERSION, DATA_PREPPER_VERSION, ALERTMANAGER_VERSION. Flagged by @kylehounslow.
- Cortex retention: wire PROMETHEUS_RETENTION through to -compactor.blocks-retention-period so metrics storage doesn't grow unbounded. Restores the retention behavior vanilla Prometheus had before the backend swap. Flagged by @kylehounslow.
- Stale alertmanager comment in the opensearch-dashboards-init environment: corrected to reflect that alertmanager now runs unconditionally in the base compose (post-467bb07). Flagged by @kylehounslow (opensearch-project#6) and @joshuali925.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
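Assuming the *_VERSION convention described above, the wiring might look like this (a sketch; the retention value and flag placement are illustrative):

```yaml
# .env (sketch)
#   CORTEX_VERSION=v1.18.1
#   PROMETHEUS_RETENTION=15d

# docker-compose.yml (sketch)
services:
  prometheus:
    image: cortexproject/cortex:${CORTEX_VERSION}
    command:
      - "-config.file=/etc/cortex/cortex.yaml"
      # Reuse the existing retention knob so storage doesn't grow unbounded.
      - "-compactor.blocks-retention-period=${PROMETHEUS_RETENTION}"
```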
…g bash
Folds four P0/P1 fixes surfaced during stability testing:
P0.1 — opensearch-stack-monitors-init / otel-demo-monitors-init race
with the OpenSearch alerting plugin on cold bring-up. `/_cluster/health`
reporting green is not sufficient — the alerting plugin's internal
indices still need to allocate. `POST /_plugins/_alerting/monitors`
now retries up to 12× / 5s on 5xx and "all shards failed" responses.
Without this, fresh installs silently dropped the stack "Cluster
Health Red" monitor (5/6 monitors ending up created).
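The retry condition in P0.1 can be sketched as a small predicate plus a retry loop (names, the injected `post` callable, and the pluggable `sleep` are illustrative, not the actual init scripts):

```python
def should_retry(status, body):
    """Retry monitor creation on 5xx responses or on the transient
    'all shards failed' error the alerting plugin returns while its
    internal indices are still allocating."""
    return status >= 500 or "all shards failed" in body

def create_with_retry(post, monitor, attempts=12, delay=5.0, sleep=lambda s: None):
    """POST a monitor, retrying up to `attempts` times on transient errors."""
    for _ in range(attempts):
        status, body = post(monitor)
        if status < 300:
            return True
        if not should_retry(status, body):
            return False  # genuine 4xx: don't mask it by retrying
        sleep(delay)
    return False
```

Bailing out early on non-retryable 4xx keeps a genuinely malformed monitor from being masked by the retry budget.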
P0.2 — in-place upgrades from pre-PR main left the
`ObservabilityStack_Prometheus` datasource with only
`{prometheus.uri, prometheus.auth.*}`, so the OSD Alert Manager UI
silently surfaced zero alerts. The init script now reads the
authoritative properties via `GET /api/dataconnections`, and when the
new `prometheus.ruler.uri` / `alertmanager.uri` are missing or stale,
migrates via DELETE + POST (the SQL plugin does not expose a working
PUT/PATCH). The DELETE+POST changes the saved-object id, so the
migration also cleans up the orphaned pre-PR `data-connection`
saved-object and any correlations whose references still point at it,
keeping the saved-object graph consistent. Reruns are idempotent.
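The "missing or stale" check driving that DELETE + POST migration can be sketched as (property keys follow the PR description; the helper itself is illustrative):

```python
# URI properties the post-PR datasource must carry.
REQUIRED = ("prometheus.uri", "prometheus.ruler.uri", "alertmanager.uri")

def needs_migration(properties, expected):
    """True when any required URI property is absent or differs from the
    desired value, i.e. the datasource predates this PR or is stale."""
    return any(properties.get(key) != expected[key] for key in REQUIRED)
```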
P1.3 — `cortex-rules-init` / `cortex-rules-init-otel-demo` had no
healthcheck, so `docker compose up -d --wait` returned while rules
were still being loaded. init-cortex-rules.py now writes
`/tmp/rules-loaded` on clean completion and both services test for
that file. Rule counts at `--wait` return are now (1, 3) instead of
(0, 0).
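The sentinel-file healthcheck in P1.3 might be wired like this (a sketch; interval/retries values are illustrative assumptions):

```yaml
services:
  cortex-rules-init:
    # init-cortex-rules.py touches /tmp/rules-loaded only on clean completion,
    # so `docker compose up -d --wait` blocks until rules are actually loaded.
    healthcheck:
      test: ["CMD", "test", "-f", "/tmp/rules-loaded"]
      interval: 5s
      retries: 24
```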
P1.4 — an in-place upgrade left stale vanilla-Prometheus TSDB
directories (`chunks_head`, `wal`, `wbl`, `lock`, `queries.active`)
dangling under `/data` because Cortex writes its own layout
(`/data/tsdb`, `/data/ruler-storage`) alongside them. The Cortex
entrypoint now cleans these up on first boot only — gated on
`/data/tsdb` being absent AND `/data/chunks_head` being present,
so fresh deploys and subsequent restarts are untouched.
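The first-boot gating in P1.4 amounts to the following (a Python sketch under the stated gating conditions; the actual entrypoint is presumably shell):

```python
import os
import shutil

# Vanilla-Prometheus leftovers that Cortex never uses.
STALE = ["chunks_head", "wal", "wbl", "lock", "queries.active"]

def clean_stale_tsdb(data_dir):
    """Remove old Prometheus artifacts, but only on the first Cortex boot
    after an upgrade: /data/tsdb absent (Cortex hasn't written its layout
    yet) AND /data/chunks_head present (old Prometheus data exists)."""
    if os.path.exists(os.path.join(data_dir, "tsdb")):
        return False  # Cortex already initialized: not first boot
    if not os.path.exists(os.path.join(data_dir, "chunks_head")):
        return False  # fresh deploy: nothing stale to remove
    for name in STALE:
        path = os.path.join(data_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
        elif os.path.exists(path):
            os.remove(path)
    return True
```

The two guards make the cleanup a no-op on both fresh deploys and every restart after the first.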
Verified via the full Scenario 2b cold bring-up:
- 35s `docker compose up -d --wait` exits 0
- All 6 OpenSearch monitors created
- Cortex rules loaded at `--wait` return (1 stack + 3 otel_demo groups)
- Datasource has all 3 required URI properties after upgrade
- Exactly 1 data-connection saved-object, no orphans
- Correlations reference the current datasource id
- 21 Cortex / 20 Alertmanager alerts firing after 5-min demo soak
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
Good call — I ran a full stability test pass against those three scenarios and surfaced four real issues. All fixed now.
Race surfaced: stack-monitor init vs. OpenSearch alerting plugin readiness. /_cluster/health green isn't enough — the plugin's indices still need to allocate.
Tested three teardown cycles + force-recreate of every init container:
Related bug fixed: cortex-rules-init had no healthcheck, so --wait returned while rules were still loading (counts (0, 0) for ~30s). It now writes /tmp/rules-loaded on clean completion; --wait blocks properly and counts are (1, 3) at return.
- README.md: mention Cortex under the `prometheus` service name, add Alertmanager to the components list and ports table (9093), update the 9090 description to reflect Cortex's Ruler + PromQL endpoints, and add an "Upgrading from Previous Releases" section that documents the (unavoidable) historical-metric loss and the `docker compose down -v` clean-slate path.
- docs/starlight-docs/src/content/docs/alerting/index.md: add a "Prometheus/Cortex alerting" section explaining the two alerting surfaces, rule file locations (stack/ and otel-demo/), the Alertmanager routing tree, and the unified Alert Manager UI in OSD, including a troubleshooting note for upgrades where the datasource still lacks the new URI properties.

Starlight build passes: `✓ All internal links are valid.`

Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
Codecov Report: all modified and coverable lines are covered by tests.

@@ Coverage Diff @@
##             main     #226   +/- ##
=====================================
  Coverage   55.62%   55.62%
=====================================
  Files           4        4
  Lines         169      169
  Branches       48       48
=====================================
  Hits           94       94
  Misses         74       74
  Partials        1        1
Signed-off-by: Ashish Agrawal <ashisagr@amazon.com>
Summary
- Replace vanilla Prometheus with Cortex (kept under the service name `prometheus` for backward compat); add Alertmanager as a sibling service that always runs.
- Set `OTEL_METRICS_EXPORTER=otlp` globally, add envoy + collector self-scrape, and strip high-cardinality `randomKey` on the Cortex branch of the service-map pipeline.
- Surface the Alert Manager UI in the observability stack.
Notable design calls
- `cortex.yaml` raises `max_label_names_per_series` to 50 because `resource_to_telemetry_conversion: true` pushes JVM/.NET/Node.js series past the default 30-label cap.
- `otel-demo-alerts.yml` uses span-derived `latency_seconds_*{namespace="span_derived"}` instead of per-service `rpc_server_*` / `http_server_*`: the demo services don't emit RED metrics uniformly, so span-derived gives full coverage.
- `cortex-rules-init` always upserts via `POST /api/v1/rules/{namespace}` so rule edits take effect on re-run. It adds a Cortex-readiness retry budget and exits non-zero on any POST failure.
- Init containers use `required: true` on `opensearch`: the scripts hardcode `https://opensearch:9200`, so the old `required: false` with an infinite retry loop was a latent footgun.

Stability + upgrade-path fixes (commit d69b8e6)

Four issues were surfaced during a stability test pass (scenarios: in-place upgrade from pre-PR main, with and without the otel-demo overlay, re-deploy idempotency). Details and raw verification output are in `PR-226-TEST-RESULTS.md` / `PR-226-FIX-REPORT.md` (kept local, not committed).

- P0.1: `/_cluster/health` green is not sufficient; the alerting plugin allocates its own internal indices after cluster green, and `POST /_plugins/_alerting/monitors` returns `500 "all shards failed"` until they settle. `init-stack-monitors.py` / `init-otel-demo-monitors.py` now retry up to 12x / 5s on 5xx. Without this, fresh installs silently ended up with 5/6 monitors.
- P0.2: in-place upgrades left the `ObservabilityStack_Prometheus` datasource with only `{prometheus.uri, prometheus.auth.*}`, so OSD Alert Manager silently surfaced zero alerts. The init now reads authoritative properties via `GET /api/dataconnections`, and on mismatch migrates via DELETE + POST (the SQL plugin doesn't expose a working PUT/PATCH). The DELETE + POST changes the saved-object id, so the migration also cleans up the orphaned pre-PR `data-connection` saved-object and any correlations whose references still point at it. Reruns are idempotent.
- P1.3: `cortex-rules-init` / `cortex-rules-init-otel-demo` had no healthcheck, so `docker compose up -d --wait` returned while rules were still loading; rule counts at `--wait` return were (0, 0) for ~30s. `init-cortex-rules.py` now writes `/tmp/rules-loaded` on clean completion and both services test for that file. Counts at `--wait` return are now (1, 3) as expected.
- P1.4: an in-place upgrade left stale vanilla-Prometheus TSDB directories (`chunks_head`, `wal`, `wbl`, `lock`, `queries.active`) dangling under `/data` alongside Cortex's own layout (`/data/tsdb`, `/data/ruler-storage`). The Cortex entrypoint now cleans these up on first boot only, gated on `/data/tsdb` being absent AND `/data/chunks_head` being present, so fresh deploys and subsequent restarts are untouched.

Known follow-ups (not blocking)
- Duplicate `latency_seconds_*` emissions within a single sdk.language. Root cause is `otel_apm_service_map`'s emit cadence within a 10s window; the proper fix is migrating RED metrics to the OTel Collector's `spanmetrics` connector. Tracked separately.
- `opensearch-dashboards-init` has no healthcheck, so `docker compose up -d --wait` returns ~70s before it completes. Not a regression (same behavior as pre-PR main); CI/scripts that query OSD state immediately after `--wait` may see stale state during that window. The same sentinel-file + healthcheck pattern as P1.3 would fix it.
- Several SDK exporters (.NET, JVM, Node.js, Kafka, etc.) fail with "invalid temporality and type combination". This reproduces on fresh deploy (not an upgrade regression) but reduces the surface of series available to rules. The fix is either a `cumulativetodelta` processor in the collector or per-SDK `OTEL_METRIC_EXPORT_TEMPORALITY_PREFERENCE`.

Test plan
- `promtool check rules` for both rule files → SUCCESS (4 stack, 9 otel-demo)
- `amtool check-config` for the Alertmanager template → SUCCESS (7 receivers, 1 inhibit rule)
- `python3 -m py_compile` on the init scripts → OK
- `docker compose config` renders in both demo-on and demo-off modes, both compose v5.0.2 local and v2.38.2 CI-parity → exit 0
- Full stability pass (commit d69b8e6): `docker compose up -d --wait` exits 0 in 35s; all 6 monitors created at `--wait` return (P0.1); Cortex rules loaded at `--wait` return, 1 stack + 3 otel_demo groups (P1.3); exactly 1 `data-connection` saved-object, no orphans, after upgrade (P0.2 follow-up); Cortex data dir contains only `tsdb/`, `ruler-storage/`, `compactor/` after simulated upgrade, fresh deploy unaffected (P1.4)
- `up{job="otel-collector"}` and `up{job="envoy-frontend-proxy"}` both report 1, with `max_label_names_per_series: 50` in effect
- Re-ran `cortex-rules-init` after an edit: log shows `loaded: 4, failed: 0`

🤖 Generated with Claude Code