Runtime observability stack — metrics, logs, alerts #271

@alrimarleskovar

Description

Problem

The indexer (services/indexer/) has a /metrics endpoint and the reconciler (#258) logs structured events — but there's no production observability stack wiring them anywhere. Mainnet without monitoring = operating blind.

What this blocks

  • MAINNET_READINESS.md §3: no current row for observability; this is a real gap.
  • Indexer reconciler runbook (docs/operations/indexer-reorg-recovery.md) references alert conditions that have no alerting backend.

Proposed solution

Choose stack: Grafana Cloud + Loki + Prometheus + PagerDuty (free / cheap tier OK initially) OR Datadog (heavier but turnkey).

Wire up:

  1. Prometheus scrape of GET /metrics on the indexer
  2. Loki ingest of structured JSON logs from indexer + reconciler daemon
  3. Grafana dashboard with:
    • Indexer lag (lastIndexedSlot vs cluster slot, target ≤ 64 slots)
    • Reconciler _unresolved row count per event table
    • RPC quorum divergence counter
    • Backfill cron health
  4. PagerDuty alerts (severity tiers from indexer-reorg-recovery.md runbook):
    • P1: event tx never finalized flood (> 5 in 1h)
    • P2: indexer lag > 256 slots for 10min
    • P2: _unresolved count > 1000 in any table
    • P3: cross-validation gap detected
  5. Runbook update: docs/operations/indexer-reorg-recovery.md Section 3 (Step 3 — pause B2B oracle) gets a working PagerDuty URL
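
To make the severity tiers in step 4 concrete, here is a minimal sketch of the four rules as plain threshold checks. It is not the indexer's code or a PagerDuty/Prometheus rule file: the `Snapshot` type, its field names, and the table names in the usage note are hypothetical, and "for 10min" is read as "every lag sample in the 10-minute window exceeds 256 slots". Only the thresholds come from the list above.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical point-in-time view of the signals named in step 4."""
    never_finalized_last_hour: int       # "event tx never finalized" events in the past 1h
    lag_slot_samples: List[int]          # indexer lag samples covering the last 10 minutes
    unresolved_by_table: Dict[str, int]  # _unresolved row count per event table
    cross_validation_gap: bool           # True if a cross-validation gap was detected


def evaluate(snapshot: Snapshot) -> List[Tuple[str, str]]:
    """Return (severity, reason) pairs for every rule that fires."""
    alerts: List[Tuple[str, str]] = []

    # P1: more than 5 never-finalized events within one hour.
    if snapshot.never_finalized_last_hour > 5:
        alerts.append(("P1", "event tx never finalized flood"))

    # P2: lag stayed above 256 slots for the whole window (assumed reading of "for 10min").
    samples = snapshot.lag_slot_samples
    if samples and min(samples) > 256:
        alerts.append(("P2", "indexer lag > 256 slots for 10min"))

    # P2: any single event table with more than 1000 unresolved rows.
    for table, count in sorted(snapshot.unresolved_by_table.items()):
        if count > 1000:
            alerts.append(("P2", f"_unresolved count > 1000 in {table}"))

    # P3: cross-validation gap detected.
    if snapshot.cross_validation_gap:
        alerts.append(("P3", "cross-validation gap detected"))

    return alerts
```

For example, `evaluate(Snapshot(6, [300, 310, 290], {"deposits": 1200, "claims": 3}, False))` fires the P1 rule and both P2 rules but not P3. In the real stack these checks would presumably live in the alerting backend rather than in application code; the sketch only pins down the comparison semantics (strict `>` for every threshold).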

Acceptance criteria

  • Grafana / Datadog account procured
  • Prometheus scrape of indexer /metrics live on staging
  • Loki / Datadog log ingestion live
  • Dashboard published (link in docs/operations/)
  • 4 PagerDuty alert rules configured
  • On-call rotation defined (1 primary + 1 secondary)
  • Runbook updated with working escalation channels
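
One way to check the "scrape live on staging" criterion is a small smoke test that confirms the /metrics payload exposes the series the dashboard needs. The sketch below parses Prometheus text exposition output with the standard library only. The metric names are hypothetical (the issue does not name them), and the parser is deliberately simplified: it ignores timestamps and would mis-parse label values containing `}`.

```python
import re
from typing import Iterable, Set

# Matches "<name>{<labels>} <value>" or "<name> <value>" sample lines.
_SAMPLE_LINE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)")

# Hypothetical payload; real metric names depend on the indexer implementation.
SAMPLE_PAYLOAD = """\
# HELP indexer_last_indexed_slot Last slot processed by the indexer.
# TYPE indexer_last_indexed_slot gauge
indexer_last_indexed_slot 250000000
# TYPE reconciler_unresolved_rows gauge
reconciler_unresolved_rows{table="deposits"} 12
"""


def metric_names(exposition: str) -> Set[str]:
    """Collect the metric names that have at least one sample line."""
    names: Set[str] = set()
    for raw in exposition.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        match = _SAMPLE_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


def missing_metrics(exposition: str, required: Iterable[str]) -> Set[str]:
    """Return the required metric names that are absent from the payload."""
    return set(required) - metric_names(exposition)
```

Calling `missing_metrics(SAMPLE_PAYLOAD, ["indexer_last_indexed_slot", "reconciler_unresolved_rows", "rpc_quorum_divergence_total"])` returns `{"rpc_quorum_divergence_total"}`, which would flag the RPC quorum divergence panel as having no backing series. Against staging, the payload would come from an HTTP GET of the indexer's /metrics endpoint instead of the inline sample.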

Estimated scope

Medium — 1 dev × 1 week. Most of the cost is operational decisions (who pays Datadog, who's on-call) rather than code.
