Runtime observability stack: metrics, logs, alerts #271

## Problem

The indexer (`services/indexer/`) exposes a `/metrics` endpoint and the reconciler (#258) logs structured events, but there is **no production observability stack** collecting either of them. Running mainnet without monitoring means operating blind.

## What this blocks

`MAINNET_READINESS.md` §3 currently has no row for observability, which is a real gap. The indexer reconciler runbook (`docs/operations/indexer-reorg-recovery.md`) also references alert conditions that have no alerting backend behind them.

## Proposed solution

Choose a stack: either **Grafana Cloud + Loki + Prometheus + PagerDuty** (a free or cheap tier is fine initially) or **Datadog** (heavier but turnkey).

Wire up:

1. **Prometheus scrape** of `GET /metrics` on the indexer
2. **Loki ingest** of structured JSON logs from indexer + reconciler daemon
3. **Grafana dashboard** with:
   - Indexer lag (`lastIndexedSlot` vs cluster slot, target ≤ 64 slots)
   - Reconciler `_unresolved` row count per event table
   - RPC quorum divergence counter
   - Backfill cron health
4. **PagerDuty alerts** (severity tiers from the `docs/operations/indexer-reorg-recovery.md` runbook):
   - P1: `event tx never finalized` flood (> 5 in 1 h)
   - P2: indexer lag > 256 slots sustained for 10 min
   - P2: `_unresolved` count > 1000 in any table
   - P3: cross-validation gap detected
5. **Runbook update** — `docs/operations/indexer-reorg-recovery.md` Section 3 (Step 3 — pause B2B oracle) gets a working PagerDuty URL
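
The four alert conditions in step 4 can be written down as plain threshold checks before they are translated into Prometheus, Loki, or Datadog rules. The following is a minimal sketch only: the `Snapshot` fields, the `evaluate` function, and the alert strings are hypothetical names introduced for illustration, and the thresholds are copied from the list above rather than from the runbook itself.

```python
from dataclasses import dataclass
from typing import Dict, List

# Thresholds taken from the alert list in this issue.
NEVER_FINALIZED_MAX = 5        # P1 fires above 5 occurrences in 1 hour
LAG_PAGE_SLOTS = 256           # P2 fires above 256 slots of lag...
LAG_PAGE_SECONDS = 10 * 60     # ...once it has lasted 10 minutes
UNRESOLVED_MAX = 1000          # P2 fires above 1000 rows in any table


@dataclass
class Snapshot:
    """Hypothetical point-in-time view of the signals the alerts need."""
    never_finalized_last_hour: int       # `event tx never finalized` log count
    lag_slots: int                       # cluster slot minus lastIndexedSlot
    lag_over_threshold_seconds: int      # how long lag has stayed above 256
    unresolved_by_table: Dict[str, int]  # `_unresolved` rows per event table
    cross_validation_gap: bool           # True if a gap was detected


def evaluate(s: Snapshot) -> List[str]:
    """Return the alerts that should fire, ordered by severity tier."""
    alerts: List[str] = []
    if s.never_finalized_last_hour > NEVER_FINALIZED_MAX:
        alerts.append("P1: event tx never finalized flood")
    if (s.lag_slots > LAG_PAGE_SLOTS
            and s.lag_over_threshold_seconds >= LAG_PAGE_SECONDS):
        alerts.append("P2: indexer lag")
    for table, count in sorted(s.unresolved_by_table.items()):
        if count > UNRESOLVED_MAX:
            alerts.append(f"P2: _unresolved backlog in {table}")
    if s.cross_validation_gap:
        alerts.append("P3: cross-validation gap")
    return alerts
```

Keeping the conditions in one small, testable function like this may make it easier to confirm that whichever backend is chosen implements the same boundaries (strictly greater than for the counts, at least 10 minutes for the lag duration).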

## Acceptance criteria

- [ ] Grafana / Datadog account procured
- [ ] Prometheus scrape of indexer `/metrics` live on staging
- [ ] Loki / Datadog log ingestion live
- [ ] Dashboard published (link in `docs/operations/`)
- [ ] 4 PagerDuty alert rules configured
- [ ] On-call rotation defined (1 primary + 1 secondary)
- [ ] Runbook updated with working escalation channels
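
One possible way to check the "scrape live on staging" criterion is a small smoke test that reads the body of `/metrics` and confirms the series the dashboard depends on are actually exported. This sketch assumes the endpoint serves the Prometheus text exposition format; the metric names in `REQUIRED_METRICS` are hypothetical placeholders, since this issue does not list what the indexer exports.

```python
import re
from typing import Iterable, List, Set

# Hypothetical metric names; replace with whatever the indexer really exports.
REQUIRED_METRICS = (
    "indexer_last_indexed_slot",
    "indexer_cluster_slot",
    "reconciler_unresolved_rows",
    "rpc_quorum_divergence_total",
)

# A metric name starts each sample line, before any `{labels}` or whitespace.
_NAME = re.compile(r"[a-zA-Z_:][a-zA-Z0-9_:]*")


def metric_names(exposition: str) -> Set[str]:
    """Collect metric names from sample lines, skipping comments and blanks."""
    names: Set[str] = set()
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # `# HELP` / `# TYPE` lines carry no sample
        match = _NAME.match(line)
        if match:
            names.add(match.group(0))
    return names


def missing_metrics(exposition: str,
                    required: Iterable[str] = REQUIRED_METRICS) -> List[str]:
    """Return the required metric names that are absent, sorted by name."""
    present = metric_names(exposition)
    return sorted(name for name in required if name not in present)
```

In practice the `exposition` string would come from an HTTP GET against the staging indexer, and an empty result from `missing_metrics` would indicate the dashboard panels have data to draw from.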

## Estimated scope

**Medium**: 1 dev × 1 week. Most of the cost is operational decisions (who pays for Datadog, who is on call) rather than code.

## References

- PR #258 — indexer reconciler with structured logs ready for ingestion
- [`docs/operations/indexer-reorg-recovery.md`](../docs/operations/indexer-reorg-recovery.md) — runbook waiting for working alerts
