Problem
The indexer (services/indexer/) has a /metrics endpoint and the reconciler (#258) logs structured events — but there's no production observability stack wiring them anywhere. Mainnet without monitoring = operating blind.
What this blocks
MAINNET_READINESS.md §3 — no current row for observability; this is a real gap. Indexer reconciler runbook (docs/operations/indexer-reorg-recovery.md) references alert conditions that have no alerting backend.
Proposed solution
Choose stack: Grafana Cloud + Loki + Prometheus + PagerDuty (free / cheap tier OK initially) OR Datadog (heavier but turnkey).
Wire up:
- Prometheus scrape of GET /metrics on the indexer
- Loki ingest of structured JSON logs from indexer + reconciler daemon
- Grafana dashboard with:
  - Indexer lag (lastIndexedSlot vs cluster slot, target ≤ 64 slots)
  - Reconciler _unresolved row count per event table
  - RPC quorum divergence counter
  - Backfill cron health
- PagerDuty alerts (severity tiers from indexer-reorg-recovery.md runbook):
  - P1: "event tx never finalized" flood (> 5 in 1h)
  - P2: indexer lag > 256 slots for 10min
  - P2: _unresolved count > 1000 in any table
  - P3: cross-validation gap detected
- Runbook update: docs/operations/indexer-reorg-recovery.md Section 3 (Step 3: pause B2B oracle) gets a working PagerDuty URL
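To make the alert tiers above easy to review and unit-test before they are translated into real alerting rules, here is a minimal Python sketch of the thresholds. The `Snapshot` type, its field names, and `evaluate_alerts` are hypothetical and not part of the indexer. The sketch assumes something upstream has already aggregated the metrics and log counts. In production these conditions would more likely live in the chosen backend's alert rules than in application code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

# Thresholds copied from the alert list in this issue.
NEVER_FINALIZED_PER_HOUR = 5
LAG_SLOTS = 256
LAG_DURATION_SECONDS = 10 * 60
UNRESOLVED_ROWS = 1000


@dataclass
class Snapshot:
    """Hypothetical pre-aggregated view of indexer metrics and log counts."""
    never_finalized_last_hour: int = 0
    lag_slots: int = 0                    # cluster slot minus lastIndexedSlot
    lag_above_threshold_seconds: int = 0  # how long lag has stayed above LAG_SLOTS
    unresolved_rows: Dict[str, int] = field(default_factory=dict)  # per event table
    cross_validation_gap: bool = False


def evaluate_alerts(s: Snapshot) -> List[Tuple[str, str]]:
    """Return (severity, reason) pairs for every condition that currently fires."""
    alerts: List[Tuple[str, str]] = []
    if s.never_finalized_last_hour > NEVER_FINALIZED_PER_HOUR:
        alerts.append(("P1", "event tx never finalized flood"))
    if s.lag_slots > LAG_SLOTS and s.lag_above_threshold_seconds >= LAG_DURATION_SECONDS:
        alerts.append(("P2", "indexer lag"))
    for table, rows in sorted(s.unresolved_rows.items()):
        if rows > UNRESOLVED_ROWS:
            alerts.append(("P2", f"_unresolved backlog in {table}"))
    if s.cross_validation_gap:
        alerts.append(("P3", "cross-validation gap"))
    return alerts
```

Keeping the thresholds as named constants means the runbook, the dashboard, and the alert rules can be checked against a single list when any tier changes.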
Acceptance criteria
- ... /metrics live on staging
- ... docs/operations/)
Estimated scope
Medium: 1 dev × 1 week. Most of the cost is operational decisions (who pays for Datadog, who's on-call) rather than code.
References
- docs/operations/indexer-reorg-recovery.md: runbook waiting for working alerts