This guide explains the monitoring infrastructure for Matchbook using Prometheus and Grafana.
Matchbook uses:
- Prometheus for metrics collection and alerting
- Grafana for visualization and dashboards
To bring the stack up locally:

```bash
# Start all services including monitoring
docker-compose -f Docker/docker-compose.yml up -d

# Access Grafana
open http://localhost:3000
# Default credentials: admin/admin

# Access Prometheus
open http://localhost:9092
```

Metrics flow from the services into Prometheus, and from there to Grafana and Alertmanager:

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Indexer   │     │     API     │     │    Crank    │
│    :9090    │     │    :8080    │     │    :9091    │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ /metrics          │ /metrics          │ /metrics
       └───────────────────┼───────────────────┘
                           │
                    ┌──────▼──────┐
                    │ Prometheus  │
                    │    :9092    │
                    └──────┬──────┘
                           │
              ┌────────────┼────────────┐
              │                         │
       ┌──────▼──────┐          ┌───────▼───────┐
       │   Grafana   │          │ Alertmanager  │
       │    :3000    │          │     :9093     │
       └─────────────┘          └───────┬───────┘
                                        │
                          ┌─────────────┼─────────────┐
                          │             │             │
                     ┌────▼────┐  ┌─────▼─────┐  ┌────▼────┐
                     │  Slack  │  │ PagerDuty │  │  Email  │
                     └─────────┘  └───────────┘  └─────────┘
```
The indexer exposes the following metrics:

| Metric | Type | Description |
|---|---|---|
| `indexer_slot_lag` | Gauge | Slots behind the tip |
| `indexer_events_processed_total` | Counter | Total events processed |
| `indexer_accounts_processed_total` | Counter | Total accounts processed |
| `indexer_parse_errors_total` | Counter | Parse errors by type |
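As a sketch of how these might be queried (the `5m` windows are illustrative, and the `type` label on parse errors is an assumption based on the description above):

```promql
# Events processed per second over the last 5 minutes
rate(indexer_events_processed_total[5m])

# Parse error rate, split by error type (assumes a `type` label)
sum by (type) (rate(indexer_parse_errors_total[5m]))
```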
The API server exposes HTTP and WebSocket metrics:

| Metric | Type | Description |
|---|---|---|
| `api_requests_total` | Counter | Total requests by method, path, status |
| `api_request_duration_seconds` | Histogram | Request latency |
| `api_active_connections` | Gauge | Current HTTP connections |
| `ws_active_connections` | Gauge | Current WebSocket connections |
| `ws_messages_sent_total` | Counter | WebSocket messages sent |
| `ws_subscriptions` | Gauge | Active subscriptions |
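Latency percentiles and error rate can be derived from these. A sketch, assuming the `status` label holds HTTP status codes as the table suggests:

```promql
# p99 request latency over 5 minutes
histogram_quantile(0.99, sum by (le) (rate(api_request_duration_seconds_bucket[5m])))

# Fraction of requests returning 5xx
sum(rate(api_requests_total{status=~"5.."}[5m]))
  / sum(rate(api_requests_total[5m]))
```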
The crank exposes:

| Metric | Type | Description |
|---|---|---|
| `crank_matches_executed_total` | Counter | Total matches executed |
| `crank_transactions_total` | Counter | Transactions by status |
| `crank_profit_lamports` | Gauge | Current profit |
| `crank_priority_fee_lamports` | Gauge | Current priority fee |
| `crank_fees_paid_lamports` | Counter | Total fees paid |
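A sketch of derived crank health queries; the `status="success"` label value is an assumption, since the table only says transactions are labeled by status:

```promql
# Transaction success rate over 5 minutes
sum(rate(crank_transactions_total{status="success"}[5m]))
  / sum(rate(crank_transactions_total[5m]))

# Match execution rate
rate(crank_matches_executed_total[5m])
```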
Per-market metrics:

| Metric | Type | Description |
|---|---|---|
| `market_volume_total` | Counter | Trading volume by market |
| `market_trades_total` | Counter | Trade count by market |
| `market_order_count` | Gauge | Active orders by market/side |
| `market_spread_bps` | Gauge | Spread in basis points |
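For instance, the 24h figures used by the markets dashboard below can be computed with `increase`. A sketch, assuming these counters carry a `market` label:

```promql
# 24h trading volume per market
sum by (market) (increase(market_volume_total[24h]))

# 24h trade count per market
sum by (market) (increase(market_trades_total[24h]))
```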
Grafana includes an overview dashboard and one dashboard per service. The overview dashboard shows:

- Service health status
- Slot lag
- Request rate
- Latency percentiles
- Active connections
- Crank matches

The indexer dashboard shows:

- Slot lag over time
- Processing throughput
- Parse errors by type

The API dashboard shows:

- Request rate by endpoint
- Latency percentiles (p50, p95, p99)
- Error rate
- Requests by status code
- Active connections

The crank dashboard shows:

- Matches over time
- Transaction success rate
- Profitability
- Transactions by status

The markets dashboard shows:

- 24h volume
- 24h trades
- Spread
- Order book depth
- Volume over time
Matchbook uses Prometheus Alertmanager for alert routing and notification.
```bash
# Local development
open http://localhost:9093

# View active alerts
curl http://localhost:9093/api/v2/alerts
```

Alerts are classified by severity:

| Severity | Response | Notification |
|---|---|---|
| Critical | Immediate action required | PagerDuty + Slack |
| Warning | Investigate soon | Slack |
| Info | Informational | Log only |
Critical alerts:

| Alert | Condition | Description |
|---|---|---|
| `IndexerDown` | `up == 0` for 2m | Indexer not responding |
| `IndexerSlotLagCritical` | `slot_lag > 500` for 2m | Indexer severely behind |
| `APIDown` | `up == 0` for 2m | API not responding |
| `APIHighErrorRate` | Error rate > 5% for 5m | High API error rate |
| `CrankDown` | `up == 0` for 2m | Crank not responding |
| `CrankNotMatching` | Crossed orders, no matches for 5m | Crank not executing matches |
| `EventQueueNearFull` | Queue > 80% capacity for 5m | Event queue needs draining |
| `DatabaseConnectionErrors` | > 10 errors/min for 2m | Database connection issues |
| `DiskSpaceLow` | < 10% free for 5m | Disk space critical |
Warning alerts:

| Alert | Condition | Description |
|---|---|---|
| `IndexerSlotLagHigh` | `slot_lag > 100` for 5m | Indexer falling behind |
| `APIHighLatency` | p99 > 500ms for 5m | High API latency |
| `CrankHighFailureRate` | Failure rate > 10% for 5m | Crank transactions failing |
| `CrankLowProfitability` | Profit < 0 for 10m | Crank not profitable |
| `HighMemoryUsage` | > 90% for 10m | High memory usage |
| `HighCPUUsage` | > 90% for 10m | High CPU usage |
| `WebSocketHighConnectionCount` | > 5000 for 5m | Many WebSocket connections |
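As an illustration, here is how two of these conditions might be written as Prometheus rules. This is a sketch consistent with the tables above, not the exact rules shipped in `monitoring/prometheus/alerts/matchbook.yml`:

```yaml
- alert: IndexerSlotLagCritical
  expr: indexer_slot_lag > 500
  for: 2m
  labels:
    severity: critical
    service: indexer

- alert: APIHighLatency
  expr: histogram_quantile(0.99, sum by (le) (rate(api_request_duration_seconds_bucket[5m]))) > 0.5
  for: 5m
  labels:
    severity: warning
    service: api
```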
The Alertmanager configuration is in `monitoring/alertmanager/alertmanager.yml`:
```yaml
route:
  receiver: 'slack-warnings'
  group_by: ['alertname', 'service', 'severity']
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty-critical'
    - match:
        severity: warning
      receiver: 'slack-warnings'

receivers:
  - name: 'slack-warnings'
    slack_configs:
      - channel: '#matchbook-alerts'
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: '<YOUR_KEY>'
```

To enable Slack notifications:

- Create a Slack webhook: https://api.slack.com/messaging/webhooks
- Set the webhook URL in `alertmanager.yml`:

```yaml
global:
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
```

To enable PagerDuty notifications:

- Create a PagerDuty service and get the integration key
- Set the service key in `alertmanager.yml`:

```yaml
receivers:
  - name: 'pagerduty-critical'
    pagerduty_configs:
      - service_key: 'YOUR_SERVICE_KEY'
```

To enable email notifications, configure SMTP settings in `alertmanager.yml`:

```yaml
global:
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@matchbook.taunais.com'
  smtp_auth_username: 'user'
  smtp_auth_password: 'password'
```
To add a new alert, add the rule to `monitoring/prometheus/alerts/matchbook.yml`:

```yaml
- alert: MyNewAlert
  expr: my_metric > threshold
  for: 5m
  labels:
    severity: warning
    service: my-service
  annotations:
    summary: "Alert summary"
    description: "Detailed description with {{ $value }}"
    runbook_url: "https://github.com/joaquinbejar/matchbook/blob/main/docs/runbooks/my-alert.md"
```

Then reload Prometheus:

```bash
curl -X POST http://localhost:9092/-/reload
```

Note that the `/-/reload` endpoint only works when Prometheus is started with `--web.enable-lifecycle`.
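The rule file can also be validated before reloading with `promtool`, which is bundled with Prometheus:

```bash
# Check the rule file for syntax and semantic errors
promtool check rules monitoring/prometheus/alerts/matchbook.yml
```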
To temporarily silence an alert:
```bash
# Via Alertmanager UI
open http://localhost:9093/#/silences

# Via API
curl -X POST http://localhost:9093/api/v2/silences \
  -H "Content-Type: application/json" \
  -d '{
    "matchers": [{"name": "alertname", "value": "MyAlert", "isRegex": false}],
    "startsAt": "2024-01-01T00:00:00Z",
    "endsAt": "2024-01-01T01:00:00Z",
    "createdBy": "admin",
    "comment": "Maintenance window"
  }'
```

Alerts are automatically suppressed when related critical alerts fire:

- `IndexerDown` suppresses `IndexerSlotLagHigh`
- `APIDown` suppresses `APIHighLatency` and `APIHighErrorRate`
- `CrankDown` suppresses `CrankHighFailureRate`
- Critical alerts suppress warning alerts for the same service
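These suppressions are implemented as Alertmanager inhibition rules. A minimal sketch of the last one (critical suppresses warning within the same service), assuming Alertmanager v0.22+ matcher syntax:

```yaml
inhibit_rules:
  # Mute warning-severity alerts while a critical alert
  # for the same service is firing
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['service']
```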
To instrument a service, register metrics with the `prometheus` crate:

```rust
use prometheus::{Counter, Gauge, Histogram, register_counter, register_gauge, register_histogram};
use lazy_static::lazy_static;

lazy_static! {
    pub static ref MY_COUNTER: Counter = register_counter!(
        "my_counter_total",
        "Description of my counter"
    ).unwrap();

    pub static ref MY_HISTOGRAM: Histogram = register_histogram!(
        "my_duration_seconds",
        "Description of my histogram"
    ).unwrap();

    pub static ref MY_GAUGE: Gauge = register_gauge!(
        "my_gauge",
        "Description of my gauge"
    ).unwrap();
}

// Usage:
MY_COUNTER.inc();                             // monotonically increasing count
MY_HISTOGRAM.observe(duration.as_secs_f64()); // record one latency sample
MY_GAUGE.set(value);                          // set to the current value
```

Follow these naming conventions:

- Use snake_case
- Include a unit suffix: `_seconds`, `_bytes`, `_total`
- Counters should end with `_total`
- Use labels for dimensions: `{method="GET", path="/v1/markets"}` (see the sketch below)
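For labeled metrics, the `prometheus` crate provides the `*Vec` types. A minimal sketch; the metric and label names here are illustrative, not Matchbook's actual code:

```rust
use lazy_static::lazy_static;
use prometheus::{register_counter_vec, CounterVec};

lazy_static! {
    // One time series per distinct (method, path, status) combination,
    // mirroring the shape of api_requests_total above
    pub static ref REQUESTS: CounterVec = register_counter_vec!(
        "my_requests_total",
        "Requests by method, path, and status",
        &["method", "path", "status"]
    ).unwrap();
}

fn main() {
    REQUESTS.with_label_values(&["GET", "/v1/markets", "200"]).inc();
}
```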
Add a scrape job for the new service to the Prometheus configuration:

```yaml
scrape_configs:
  - job_name: my-service
    static_configs:
      - targets:
          - my-service:9090
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: (.+):.*
        replacement: $1
```

The relabel rule strips the port from the `instance` label. Pre-compute expensive queries with recording rules:
```yaml
groups:
  - name: my-rules
    rules:
      - record: my:metric:5m
        expr: sum(rate(my_counter_total[5m]))
```

Alert rules use the same group structure:

```yaml
groups:
  - name: my-alerts
    rules:
      - alert: MyAlert
        expr: my_metric > 100
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "My alert fired"
          description: "Value is {{ $value }}"
```

In Kubernetes, the Prometheus Operator discovers scrape targets through ServiceMonitor resources:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-service
spec:
  selector:
    matchLabels:
      app: my-service
  endpoints:
    - port: metrics
      interval: 15s
```

Alternatively, annotate the pod for annotation-based discovery:

```yaml
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "9090"
    prometheus.io/path: "/metrics"
```

If metrics are not being scraped:

- Check target status in the Prometheus UI (`/targets`), or query the targets API as sketched at the end of this section
- Verify network connectivity
- Check service annotations/labels

If metrics are missing from a scraped target:

- Verify the metrics endpoint: `curl http://service:port/metrics`
- Check metric registration in code
- Verify the scrape config matches the service

If Grafana shows no data:

- Check the Grafana datasource configuration
- Verify Prometheus is accessible from Grafana
- Check dashboard JSON syntax
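As referenced above, target health can also be inspected from the command line via the Prometheus HTTP API (this sketch assumes `jq` is installed):

```bash
# List scrape targets and their health as Prometheus sees them
curl -s http://localhost:9092/api/v1/targets \
  | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
```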