28 changes: 28 additions & 0 deletions .env
@@ -22,6 +22,15 @@ OPENSEARCH_HOST=opensearch
OPENSEARCH_PORT=9200
OPENSEARCH_PROTOCOL=https
OPENSEARCH_JAVA_OPTS=-Xms1g -Xmx1g
# Endpoint written into the `local_cluster` data-source saved object that the
# init container seeds into OpenSearch. When OpenSearch Dashboards (OSD) runs
# on the host, point it at the host-reachable port (`https://localhost:9200`,
# published by the compose file): the host-side OSD process cannot resolve the
# docker-compose service name `opensearch`, so any OSD feature scoped to
# multiple data sources (MDS) that dials this saved object's endpoint would
# fail with `getaddrinfo ENOTFOUND opensearch`. Leave blank/commented when OSD
# itself runs inside the compose network.
OSD_DATASOURCE_ENDPOINT=https://localhost:9200

# OpenSearch Dashboards Configuration
OPENSEARCH_DASHBOARDS_VERSION=3.7.0
@@ -49,11 +58,20 @@ DATA_PREPPER_HTTP_PORT=21892
ISM_RETENTION_DAYS=7

# Prometheus Configuration
# The "prometheus" service now runs Cortex under the hood (see docker-compose.yml),
# which is wire-compatible for remote-write/query/ruler/alertmanager APIs.
# PROMETHEUS_VERSION is retained for legacy references; the actual image tag
# comes from CORTEX_VERSION below.
PROMETHEUS_VERSION=v3.8.1
CORTEX_VERSION=v1.18.1
PROMETHEUS_HOST=prometheus.observability-stack-network
PROMETHEUS_PORT=9090
PROMETHEUS_RETENTION=15d

# Alertmanager Configuration
ALERTMANAGER_VERSION=v0.27.0
ALERTMANAGER_PORT=9093

# Resource Limits
OPENSEARCH_MEMORY_LIMIT=2G
PROMETHEUS_MEMORY_LIMIT=500M
@@ -62,6 +80,7 @@ DATA_PREPPER_MEMORY_LIMIT=1G
DASHBOARDS_MEMORY_LIMIT=2G
WEATHER_AGENT_MEMORY_LIMIT=200M
CANARY_MEMORY_LIMIT=100M
ALERTMANAGER_MEMORY_LIMIT=128M

# Network Configuration
NETWORK_NAME=observability-stack-network
@@ -110,6 +129,15 @@ OTEL_RESOURCE_ATTRIBUTES=service.namespace=opentelemetry-demo,service.version=${
# Metrics Temporality
OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative

# Enable metrics + logs export on every OTel-instrumented service (Node.js,
# Python, Go, .NET, Java, Rust). Without this, Node.js SDKs in particular
# default to NOT exporting metrics even when traces are being emitted — so
# the frontend container would only show nodejs_* runtime metrics in Cortex
# and no http_server_duration_* counters. "otlp" matches the existing trace
# exporter so all three signals go to the same collector pipeline.
OTEL_METRICS_EXPORTER=otlp
OTEL_LOGS_EXPORTER=otlp

# OTLP Endpoints
OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_GRPC}
PUBLIC_OTEL_EXPORTER_OTLP_TRACES_ENDPOINT=http://localhost:8080/otlp-http/v1/traces
24 changes: 20 additions & 4 deletions README.md
@@ -13,8 +13,9 @@ Observability Stack is an open-source stack designed for modern distributed syst
- **OpenTelemetry Collector**: Receives OTLP data and routes it to appropriate backends
- **Data Prepper**: Transforms and enriches logs and traces before storage
- **OpenSearch**: Stores and indexes logs and traces for search and analysis
- **Prometheus**: Stores time-series metrics data
- **OpenSearch Dashboards**: Provides web-based visualization and exploration
- **Prometheus**: Stores time-series metrics data — runs the Cortex engine under the service name `prometheus` (same API surface, plus Ruler and Alertmanager endpoints)
- **Alertmanager**: Routes alerts from Cortex-side PromQL rules to notification channels
- **OpenSearch Dashboards**: Provides web-based visualization and exploration — includes the Alert Manager UI for viewing both OpenSearch monitors and Cortex alerts in one place
- **PPL (Piped Processing Language)**: Native query language for logs and traces — pipe-based, human-readable, 50+ commands

## See it in action
@@ -148,6 +149,20 @@ To stop the stack and remove all data volumes:
docker compose down -v
```

## Upgrading from Previous Releases

This release swaps vanilla Prometheus for Cortex (kept under the same `prometheus` service name) and adds an always-on Alertmanager. Existing deployments can upgrade in place, with two caveats worth calling out:

- **Historical metrics do not carry over.** Cortex writes to a different on-disk layout (`/data/tsdb`, `/data/ruler-storage`) than vanilla Prometheus (`/prometheus/chunks_head`, `/prometheus/wal`). Cortex does not read the old TSDB blocks, so any metrics stored in the `prometheus-data` volume before the upgrade are unreadable after it. New OTLP writes work immediately.
- **The in-place upgrade migrates OpenSearch Dashboards (OSD) state automatically.** If you prefer a clean slate, wipe the volumes before bringing the new stack up:
```bash
docker compose down -v
docker compose up -d
```
The `docker compose down -v` path is the safest if you're on an older build, but it removes all data volumes, not just the metrics. The automatic migration, by contrast, reconciles the `ObservabilityStack_Prometheus` datasource to add the new `prometheus.ruler.uri` / `alertmanager.uri` properties, cleans up the old saved-object wrapper, and removes stale vanilla-Prometheus directories from the data volume on first Cortex boot.

See [Alerting](docs/starlight-docs/src/content/docs/alerting/index.md) for a tour of the new Cortex rules, Alertmanager routing, and the Alert Manager UI in OpenSearch Dashboards.
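
The layout difference described above can be checked before upgrading. The following is a minimal sketch, not part of the stack: `classify_data_dir` is a hypothetical helper, and it assumes the volume contents are reachable at a local path and that the directory names match the ones listed above (`chunks_head` and `wal` for vanilla Prometheus, `tsdb` and `ruler-storage` for Cortex).

```python
from pathlib import Path

# Directory names taken from the upgrade notes above; treating their presence
# as a layout marker is an assumption made for this sketch.
PROMETHEUS_MARKERS = ("chunks_head", "wal")
CORTEX_MARKERS = ("tsdb", "ruler-storage")


def classify_data_dir(root: str) -> str:
    """Return 'prometheus', 'cortex', 'mixed', or 'empty' for a data directory."""
    base = Path(root)
    has_prometheus = any((base / name).is_dir() for name in PROMETHEUS_MARKERS)
    has_cortex = any((base / name).is_dir() for name in CORTEX_MARKERS)
    if has_prometheus and has_cortex:
        return "mixed"
    if has_prometheus:
        return "prometheus"
    if has_cortex:
        return "cortex"
    return "empty"
```

A result of `prometheus` or `mixed` would suggest that pre-upgrade directories are still present in the volume; how the stack itself detects and removes them is not shown here.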

## Instrumenting Your Agent

Observability Stack accepts telemetry data via the OpenTelemetry Protocol (OTLP) and follows the [OpenTelemetry Gen-AI Semantic Conventions](https://opentelemetry.io/docs/specs/semconv/gen-ai/) for standardized attribute naming and structure for AI agents.
@@ -264,8 +279,9 @@ docker compose ps
|------|---------|----------|-------------|
| **4317** | OTel Collector | gRPC | OTLP gRPC receiver — used by most OpenTelemetry SDKs |
| **4318** | OTel Collector | HTTP | OTLP HTTP receiver — used by Strands SDK, browser-based exporters |
| **5601** | OpenSearch Dashboards | HTTP | Web UI for logs, traces, and dashboards |
| **9090** | Prometheus | HTTP | Prometheus Web UI and API |
| **5601** | OpenSearch Dashboards | HTTP | Web UI for logs, traces, dashboards, and Alert Manager |
| **9090** | Prometheus (Cortex) | HTTP | PromQL query API (`/prometheus/...`) and Ruler admin API (`/api/v1/rules/...`) |
| **9093** | Alertmanager | HTTP | Alert routing UI and API for Cortex-side PromQL alerts |
| **9200** | OpenSearch | HTTPS | REST API (self-signed cert, use `curl -k`) |
| **21890** | Data Prepper | gRPC | Internal OTLP receiver (from OTel Collector) |
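
The port table can be turned into a quick reachability check. This is a sketch under assumptions: the `PORTS` mapping simply copies the rows above, `is_port_open` and `check_stack` are hypothetical helper names, and `localhost` is assumed to be where the ports are published (whether each port is actually exposed on the host depends on the compose file). It only tests that a TCP connection succeeds and says nothing about service health.

```python
import socket

# Ports copied from the table above.
PORTS = {
    4317: "OTel Collector (gRPC)",
    4318: "OTel Collector (HTTP)",
    5601: "OpenSearch Dashboards",
    9090: "Prometheus (Cortex)",
    9093: "Alertmanager",
    9200: "OpenSearch",
    21890: "Data Prepper",
}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_stack(host: str = "localhost") -> dict:
    """Map each service label to whether its port accepts connections."""
    return {label: is_port_open(host, port) for port, label in PORTS.items()}
```

Calling `check_stack()` after `docker compose up -d` returns a dictionary keyed by service label, which can be printed or asserted on in a smoke test.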

84 changes: 84 additions & 0 deletions docker-compose.otel-demo.yml
@@ -39,6 +39,8 @@ services:
environment:
- KAFKA_ADDR
- OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_HTTP}
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=accounting
@@ -68,6 +70,8 @@ services:
- FLAGD_HOST
- FLAGD_PORT
- OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_HTTP}
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_LOGS_EXPORTER=otlp
@@ -99,6 +103,8 @@ services:
- FLAGD_PORT
- VALKEY_ADDR
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=cart
@@ -137,6 +143,8 @@ services:
- KAFKA_ADDR
- GOMEMLIMIT=16MiB
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=checkout
@@ -178,6 +186,8 @@ services:
- IPV6_ENABLED
- VERSION=${IMAGE_VERSION}
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=currency
@@ -204,6 +214,8 @@ services:
- FLAGD_HOST
- FLAGD_PORT
- OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_HTTP}
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=email
@@ -227,6 +239,8 @@ services:
- FLAGD_PORT
- KAFKA_ADDR
- OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_HTTP}
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_INSTRUMENTATION_KAFKA_EXPERIMENTAL_SPAN_ATTRIBUTES=true
- OTEL_INSTRUMENTATION_MESSAGING_EXPERIMENTAL_RECEIVE_TELEMETRY_ENABLED=true
@@ -263,6 +277,8 @@ services:
- RECOMMENDATION_ADDR
- SHIPPING_ADDR
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_RESOURCE_ATTRIBUTES
- ENV_PLATFORM
- OTEL_SERVICE_NAME=frontend
@@ -387,6 +403,8 @@ services:
- LOCUST_AUTOSTART
- LOCUST_BROWSER_TRAFFIC_ENABLED=false
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=load-generator
@@ -420,6 +438,8 @@ services:
- FLAGD_HOST
- FLAGD_PORT
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=payment
@@ -448,6 +468,8 @@ services:
- FLAGD_PORT
- GOMEMLIMIT=16MiB
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=product-catalog
@@ -478,6 +500,8 @@ services:
- PRODUCT_REVIEWS_PORT
- OTEL_PYTHON_LOG_CORRELATION=true
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=product-reviews
@@ -518,6 +542,8 @@ services:
environment:
- IPV6_ENABLED
- OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_HTTP}
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_PHP_AUTOLOAD_ENABLED=true
- QUOTE_PORT
@@ -548,6 +574,8 @@ services:
- FLAGD_PORT
- OTEL_PYTHON_LOG_CORRELATION=true
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=recommendation
@@ -578,6 +606,8 @@ services:
- SHIPPING_PORT
- QUOTE_ADDR
- OTEL_EXPORTER_OTLP_ENDPOINT
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=shipping
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
@@ -631,6 +661,8 @@ services:
environment:
- FLAGD_UI_PORT
- OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_HTTP}
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=flagd-ui
@@ -661,6 +693,8 @@ services:
- KAFKA_LISTENERS=PLAINTEXT://${KAFKA_HOST}:9092,CONTROLLER://${KAFKA_HOST}:9093
- KAFKA_CONTROLLER_QUORUM_VOTERS=1@${KAFKA_HOST}:9093
- OTEL_EXPORTER_OTLP_ENDPOINT=http://${OTEL_COLLECTOR_HOST}:${OTEL_COLLECTOR_PORT_HTTP}
- OTEL_METRICS_EXPORTER
- OTEL_LOGS_EXPORTER
- OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE
- OTEL_RESOURCE_ATTRIBUTES
- OTEL_SERVICE_NAME=kafka
@@ -731,4 +765,54 @@ services:
<<: *network
logging: *logging

# ******************
Review comment (Collaborator): If this is demo-only, it should move to another compose file, e.g. docker-compose.otel-demo.yml.

# Demo-only Alerting Extensions
# ******************
# Demo-only rule loader. Runs alongside the base cortex-rules-init container
# (defined in docker-compose.yml) and loads the `otel_demo` Cortex namespace
# only. Named separately rather than overlaying the base service because
# Docker Compose >= v2.38 rejects service-name overlays on `include:`-imported
# resources. Both containers hit the same idempotent Ruler upsert API.
cortex-rules-init-otel-demo:
image: python:3.11-alpine
container_name: cortex-rules-init-otel-demo
# `sleep infinity` after success so `docker compose up --wait` is happy.
command: sh -c "pip install requests pyyaml && python /init.py && exec sleep infinity"
depends_on:
prometheus:
condition: service_healthy
volumes:
- ./docker-compose/cortex/init-cortex-rules.py:/init.py
- ./docker-compose/prometheus/rules-otel-demo:/rules/otel_demo:ro
<<: *network
restart: "no"
# Mirror the base cortex-rules-init healthcheck: the script touches
# /tmp/rules-loaded on a clean load, and `--wait` blocks on it so callers
# that query /api/v1/rules/otel_demo after --wait see the rules already in
# Cortex. 40×3s=120s covers pip install + load time.
healthcheck:
test: ["CMD", "test", "-f", "/tmp/rules-loaded"]
interval: 3s
timeout: 2s
retries: 40
start_period: 10s
logging: *logging

# OTel Demo Monitors Init - Creates OpenSearch alerting monitors for demo
# traces/logs (checkout, payment, cart, frontend). Idempotent.
otel-demo-monitors-init:
image: python:3.11-alpine
container_name: otel-demo-monitors-init
command: sh -c "pip install requests && python /init.py"
depends_on:
opensearch:
condition: service_healthy
environment:
- OPENSEARCH_USER=${OPENSEARCH_USER}
- OPENSEARCH_PASSWORD=${OPENSEARCH_PASSWORD}
volumes:
- ./docker-compose/opentelemetry-demo/init-otel-demo-monitors.py:/init.py
<<: *network
restart: "no"
logging: *logging
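
The healthcheck on `cortex-rules-init-otel-demo` relies on a marker file: per the comment in the diff, the init script touches `/tmp/rules-loaded` after a clean load and the healthcheck polls for it. The same pattern can be sketched in Python. This is an illustration only: `wait_for_marker` is a hypothetical helper, not code from `init-cortex-rules.py`, and its defaults simply mirror the `interval: 3s` and `retries: 40` values shown. It does not model Docker's `start_period` or `timeout` settings.

```python
import os
import time


def wait_for_marker(path, interval=3.0, retries=40, sleep=time.sleep):
    """Poll for a marker file; return True once it exists, False when retries run out.

    With the defaults, the loop sleeps at most 40 times for 3 seconds each,
    matching the 120-second budget mentioned in the compose comment.
    """
    for _ in range(retries):
        if os.path.isfile(path):
            return True
        sleep(interval)
    return False
```

The `sleep` parameter is injectable so the loop can be exercised in tests without real delays.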
