Merged
8 changes: 8 additions & 0 deletions .env.example
@@ -36,6 +36,14 @@ SOROBAN_CONTRACT_ID=
# CB_SUCCESS_THRESHOLD=1 # consecutive successes in HALF_OPEN before closing (default: 1)
# CB_TIMEOUT_MS=30000 # ms to wait in OPEN before probing again (default: 30000)

# Webhook Delivery Circuit Breaker
# Per-provider circuit breaker for outbound webhook delivery.
# Intentionally separate from CB_* so webhook and RPC failure modes can be tuned independently.
# WEBHOOK_CB_FAILURE_THRESHOLD=5 # consecutive failures before opening (1-100, default: 5)
# WEBHOOK_CB_SUCCESS_THRESHOLD=1 # consecutive successes in HALF_OPEN before closing (1-20, default: 1)
# WEBHOOK_CB_TIMEOUT_MS=60000 # cooldown ms before probing (1000-300000, default: 60000)
# # Recommended: >= max retry backoff delay (default: 30000)

# Graceful Shutdown — Webhook Delivery Drain
# How long (ms) to wait for in-flight webhook deliveries to finish after SIGTERM
# before force-flushing them to the DLQ. Set to a value that covers your p99
67 changes: 67 additions & 0 deletions docs/WEBHOOK-DLQ.md
@@ -196,6 +196,73 @@ interface WebhookDLQEntry {
}
```

## Circuit Breaker

`WebhookDeliveryService` maintains a **per-provider circuit breaker** that
prevents repeated HTTP calls to providers that are persistently down. When a
provider's breaker is OPEN, deliveries are routed directly to the DLQ without
making a network request.

### State machine

```
CLOSED ──(failures ≥ threshold)──► OPEN
OPEN ──(cooldown elapsed) ──► HALF_OPEN
HALF_OPEN ──(probe succeeds) ──► CLOSED
HALF_OPEN ──(probe fails) ──► OPEN
```

| State | Behaviour |
|------------|---------------------------------------------------------------------------|
| `CLOSED` | Normal delivery. Consecutive failures are counted. |
| `OPEN` | Fast-path: delivery is skipped, payload is routed to DLQ immediately. |
| `HALF_OPEN`| One probe attempt is allowed. `WEBHOOK_CB_SUCCESS_THRESHOLD` consecutive successes (default: 1) → CLOSED; any failure → OPEN. |
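
The transitions above can be sketched as a small class. This is a minimal illustration under stated assumptions, not the `WebhookDeliveryService` source: the names `ProviderBreaker`, `BreakerConfig`, `canAttempt`, `recordSuccess`, and `recordFailure` are hypothetical, and the sketch does not limit how many probes may run concurrently in `HALF_OPEN`.

```typescript
// Hypothetical names throughout; the real service is not shown in this diff.
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

interface BreakerConfig {
  failureThreshold: number; // consecutive failures before opening
  successThreshold: number; // consecutive HALF_OPEN successes before closing
  timeoutMs: number;        // cooldown spent in OPEN before a probe is allowed
}

class ProviderBreaker {
  private state: BreakerState = 'CLOSED';
  private failures = 0;
  private successes = 0;
  private openedAt = 0;
  private readonly config: BreakerConfig;
  private readonly now: () => number;

  // `now` is injectable so the cooldown can be tested without real timers.
  constructor(config: BreakerConfig, now: () => number = () => Date.now()) {
    this.config = config;
    this.now = now;
  }

  /** Returns true if a real HTTP attempt may be made right now. */
  canAttempt(): boolean {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.config.timeoutMs) {
      this.state = 'HALF_OPEN'; // cooldown elapsed: allow a probe
      this.successes = 0;
    }
    return this.state !== 'OPEN';
  }

  recordSuccess(): void {
    if (this.state === 'HALF_OPEN') {
      this.successes += 1;
      if (this.successes >= this.config.successThreshold) {
        this.state = 'CLOSED';
        this.failures = 0;
      }
    } else {
      this.failures = 0; // a success ends the consecutive-failure streak
    }
  }

  recordFailure(): void {
    this.failures += 1;
    // A failed probe re-opens immediately; otherwise open at the threshold.
    if (this.state === 'HALF_OPEN' || this.failures >= this.config.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}
```

A caller would check `canAttempt()` before the HTTP request and route the payload to the DLQ when it returns `false`, keeping one `ProviderBreaker` instance per provider.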

### Retry / backoff coordination

The circuit breaker counts *consecutive* failures at the delivery layer.
Retry backoff (exponential with jitter, see `src/queue/webhook-retry-policy.ts`)
is applied by the queue layer *before* calling `deliver()` again. Each call to
`deliver()` therefore represents one real attempt — the breaker and the retry
policy do not double-count.

Set `WEBHOOK_CB_TIMEOUT_MS` to **at least the maximum retry backoff delay**
(default: 30 s) so the breaker does not re-open immediately on the first probe
after cooldown. The default cooldown is 60 s.
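
To make the relationship between backoff and cooldown concrete, here is a minimal sketch of one common variant of exponential backoff with jitter ("full jitter"), plus a check that the cooldown covers the backoff cap. The actual policy lives in `src/queue/webhook-retry-policy.ts` and is not shown here, so `RetryPolicy`, `backoffDelayMs`, `cooldownCoversBackoff`, and the 1 s base delay are all assumptions for illustration.

```typescript
// Hypothetical shape; the real retry policy may use different fields and jitter.
interface RetryPolicy {
  baseDelayMs: number; // assumed value, not taken from webhook-retry-policy.ts
  maxDelayMs: number;  // the maximum retry backoff delay (30 s by default per the text above)
}

/** Exponential backoff with full jitter, capped at maxDelayMs. `attempt` starts at 1. */
function backoffDelayMs(
  attempt: number,
  policy: RetryPolicy,
  random: () => number = Math.random,
): number {
  const exponential = policy.baseDelayMs * 2 ** (attempt - 1);
  const capped = Math.min(exponential, policy.maxDelayMs);
  return Math.floor(random() * capped); // uniform in [0, capped)
}

/** True when the breaker cooldown is at least the longest possible retry delay. */
function cooldownCoversBackoff(cooldownMs: number, policy: RetryPolicy): boolean {
  return cooldownMs >= policy.maxDelayMs;
}

const policy: RetryPolicy = { baseDelayMs: 1_000, maxDelayMs: 30_000 };
const defaultCooldownOk = cooldownCoversBackoff(60_000, policy); // default cooldown vs. default cap
```

With these assumed numbers the delay can never exceed 30 s, so the default 60 s cooldown satisfies the recommendation, while a cooldown of 10 s would not.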

### Configuration

All thresholds are read from environment variables and validated/clamped at
startup. They are intentionally separate from the RPC circuit breaker
(`CB_*`) so webhook and RPC failure modes can be tuned independently.

| Variable | Default | Description |
|----------|---------|-------------|
| `WEBHOOK_CB_FAILURE_THRESHOLD` | `5` | Consecutive failures before opening (1–100) |
| `WEBHOOK_CB_SUCCESS_THRESHOLD` | `1` | Consecutive successes in HALF_OPEN before closing (1–20) |
| `WEBHOOK_CB_TIMEOUT_MS` | `60000` | Cooldown ms before probing (1 000–300 000) |
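
As an illustration of the clamping described above, the sketch below shows how out-of-range or unparsable values might resolve. The `clamp(toNumber(...))` call shape matches `src/appConfiguration.ts`, but the helper bodies are not part of this change, so the implementations of `toNumber` and `clamp` here are assumptions, as is the `loadWebhookBreakerConfig` wrapper.

```typescript
// Assumed helper implementations; the real ones are not shown in this diff.
function toNumber(value: string | undefined, fallback: number): number {
  if (value === undefined || value.trim() === '') return fallback;
  const parsed = Number(value);
  return Number.isFinite(parsed) ? parsed : fallback;
}

function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(value, min), max);
}

type Env = Record<string, string | undefined>;

// Hypothetical wrapper mirroring the webhookCircuitBreaker block in appConfiguration.ts.
function loadWebhookBreakerConfig(env: Env) {
  return {
    failureThreshold: clamp(toNumber(env.WEBHOOK_CB_FAILURE_THRESHOLD, 5), 1, 100),
    successThreshold: clamp(toNumber(env.WEBHOOK_CB_SUCCESS_THRESHOLD, 1), 1, 20),
    timeoutMs: clamp(toNumber(env.WEBHOOK_CB_TIMEOUT_MS, 60_000), 1_000, 300_000),
  };
}

const config = loadWebhookBreakerConfig({
  WEBHOOK_CB_FAILURE_THRESHOLD: '0',   // below range: clamped up to 1
  WEBHOOK_CB_SUCCESS_THRESHOLD: 'abc', // not a number: falls back to the default 1
  WEBHOOK_CB_TIMEOUT_MS: '900000',     // above range: clamped down to 300000
});
```

Under these assumptions a misconfigured value never disables the breaker outright; it is pulled back to the nearest bound or to the default.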

### Metrics

| Metric | Labels | Description |
|--------|--------|-------------|
| `webhook_breaker_state` | `provider` | Current state: 0=CLOSED, 1=OPEN, 2=HALF_OPEN |
| `webhook_delivery_attempts_total` | `status`, `provider`, `reason` | Includes `reason=circuit_open` for fast-path deliveries |

Use `webhook_breaker_state` in Grafana dashboards and alerting rules to detect
providers that are persistently down.

### Security notes

- Provider labels are sanitized to a finite allow-list (`stripe`, `github`,
`slack`, `sendgrid`, `generic`) to prevent metric cardinality explosion.
- The `resetBreaker()` method is intended for admin use only; any API endpoint
that exposes it must be protected behind an authenticated admin route.
- No PII or raw error messages are recorded in metrics — only the error code
(e.g. `ECONNREFUSED`) is captured.
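
A minimal sketch of the allow-list sanitization described in the first note follows. The list of providers comes from that note; the function name `sanitizeProviderLabel`, the case-insensitive matching, and the choice to map unknown values to `generic` are assumptions rather than the confirmed implementation.

```typescript
// Allow-list from the security note above; everything else here is hypothetical.
const ALLOWED_PROVIDERS = ['stripe', 'github', 'slack', 'sendgrid', 'generic'] as const;
type ProviderLabel = (typeof ALLOWED_PROVIDERS)[number];

function sanitizeProviderLabel(raw: string | undefined): ProviderLabel {
  const normalized = (raw ?? '').trim().toLowerCase();
  const match = ALLOWED_PROVIDERS.find(provider => provider === normalized);
  // Assumption: unknown or missing providers collapse into a single bucket,
  // so arbitrary input can never create a new label value.
  return match ?? 'generic';
}
```

Because the return type is a closed union, the number of distinct `provider` label values is bounded by the allow-list regardless of what callers pass in.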

---

## Metrics

DLQ operations are tracked via Prometheus counters in `webhookMetrics.ts`:
1 change: 0 additions & 1 deletion jest.config.js
@@ -11,7 +11,6 @@ module.exports = {
'tests/load',
'tests/stress',
'webhook.service.test.ts',
'webhookDelivery.test.ts',
'reputation-scheduler.service.test.ts',
'occ.integration.test.ts',
'deployment/integration.test.ts',
35 changes: 18 additions & 17 deletions package-lock.json

Some generated files are not rendered by default.

11 changes: 9 additions & 2 deletions package.json
@@ -27,7 +27,14 @@
"engines": {
"node": ">=18"
},
- "keywords": ["blockchain", "events", "ingestion", "idempotency", "talenttrust", "api"],
+ "keywords": [
+ "blockchain",
+ "events",
+ "ingestion",
+ "idempotency",
+ "talenttrust",
+ "api"
+ ],
"author": "Talenttrust Team",
"license": "MIT",
"dependencies": {
Expand Down Expand Up @@ -85,7 +92,7 @@
"pino-pretty": "^13.0.0",
"supertest": "^7.0.0",
"swagger-cli": "^4.0.4",
- "ts-jest": "^29.2.5",
+ "ts-jest": "^29.4.11",
"ts-node": "^10.9.2",
"ts-node-dev": "^2.0.0",
"typescript": "^5.9.3"
11 changes: 11 additions & 0 deletions src/appConfiguration.ts
@@ -17,6 +17,12 @@
chaosTargets: string[];
chaosProbability: number;
circuitBreaker: CircuitBreakerConfig;
/**
* Per-provider circuit-breaker configuration for outbound webhook delivery.
* Thresholds are intentionally separate from the RPC circuit breaker so
* webhook and RPC failure modes can be tuned independently.
*/
webhookCircuitBreaker: CircuitBreakerConfig;
}

const MAX_TIMEOUT_MS = 10_000;
@@ -62,7 +68,7 @@
.filter(Boolean);
}

function parseAssets(value: string | undefined): string[] {

// Lint warning (GitHub Actions / Lint, line 71): 'parseAssets' is defined but never used. Allowed unused vars must match /^_/u
if (!value) {
return ['USDC', 'XLM', 'BTC', 'ETH']; // Default assets
}
@@ -98,5 +104,10 @@
successThreshold: clamp(toNumber(env.CB_SUCCESS_THRESHOLD, 1), 1, 20),
timeoutMs: clamp(toNumber(env.CB_TIMEOUT_MS, 30_000), 1_000, 300_000),
},
webhookCircuitBreaker: {
failureThreshold: clamp(toNumber(env.WEBHOOK_CB_FAILURE_THRESHOLD, 5), 1, 100),
successThreshold: clamp(toNumber(env.WEBHOOK_CB_SUCCESS_THRESHOLD, 1), 1, 20),
timeoutMs: clamp(toNumber(env.WEBHOOK_CB_TIMEOUT_MS, 60_000), 1_000, 300_000),
},
};
}
25 changes: 25 additions & 0 deletions src/logger.ts
@@ -63,6 +63,31 @@ function serializeError(err: Error): Record<string, unknown> {
* Sanitise a context object so that sensitive keys are never logged.
* Extend this list as the domain grows.
*/
const SENSITIVE_KEYS = new Set([
'authorization',
'cookie',
'set-cookie',
'x-api-key',
'x-api-secret',
'x-auth-token',
'x-access-token',
'proxy-authorization',
'password',
'passwd',
'secret',
'token',
'access_token',
'refresh_token',
'api_key',
'apikey',
'credential',
'private',
'ssn',
'credit_card',
'webhooksecret',
'webhook_secret',
]);

const redactionPaths = [...SENSITIVE_KEYS].flatMap(key => [
key,
`*.${key}`,
2 changes: 1 addition & 1 deletion src/queue/index.ts
@@ -20,7 +20,7 @@ export { queueConfig, getRedisConfig } from './config';
export {
WebhookDLQEntry,
WebhookDLQQuery,
- WebhookDLQConfig,
+ DLQConfig as WebhookDLQConfig,
getWebhookDLQStorage,
clearWebhookDLQInstance,
initializeDLQMetrics,