The Dead-Letter Queue (DLQ) is a failure recovery mechanism that captures and isolates repeatedly failing tasks from the normal execution flow. This prevents resource waste, reduces noise in logs, and provides operators with diagnostic information to resolve persistent issues.
Some tasks may fail repeatedly due to:
- Invalid configuration
- Broken target contracts
- Persistent permission problems
- Insufficient gas balance
- Network issues
Blindly retrying every failing task can:
- Consume unnecessary resources
- Create noise in logs and metrics
- Hide the difference between temporary and permanent failures
- Block healthy task execution
The Dead-Letter Queue provides:
- Automatic Quarantine: Tasks that exceed failure thresholds are automatically isolated
- Failure Context: Full execution context is stored for diagnosis
- Error Pattern Analysis: Identifies consistent error patterns
- Operator Visibility: HTTP endpoints for inspecting quarantined tasks
- Recovery Mechanism: Manual recovery of tasks after issues are resolved
- Metrics Integration: Prometheus metrics for monitoring
┌─────────────────────────────────────────────────────────────┐
│                       Execution Flow                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Task Due ──> Check Quarantine ──> Execute                  │
│                      │                │                     │
│                      │                │                     │
│                Quarantined?    Success/Failure              │
│                      │                │                     │
│                      ▼                ▼                     │
│                  Skip Task     Record Failure               │
│                                       │                     │
│                                       ▼                     │
│                            Check Failure Threshold          │
│                                       │                     │
│                                       ▼                     │
│                                   Exceeded? ──> Quarantine  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
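The flow in the diagram can be sketched roughly as follows. This is a minimal illustration, not the project's implementation: `createStubDlq` and `processDueTask` are hypothetical names, the stub keeps everything in memory, and it assumes a task is quarantined as soon as its failure count reaches the threshold.

```javascript
// Minimal in-memory stand-in for the Dead-Letter Queue (hypothetical, illustration only).
function createStubDlq(maxFailures) {
  const failures = new Map(); // taskId -> number of recorded failures
  const quarantined = new Set();
  return {
    isQuarantined: (taskId) => quarantined.has(taskId),
    recordFailure(taskId) {
      const count = (failures.get(taskId) || 0) + 1;
      failures.set(taskId, count);
      // Assumption: quarantine once the count reaches the threshold.
      if (count >= maxFailures) quarantined.add(taskId);
      return count;
    },
  };
}

// One pass through the flow for a task that has become due.
function processDueTask(taskId, dlq, execute) {
  if (dlq.isQuarantined(taskId)) return 'skipped'; // Check Quarantine -> Skip Task
  try {
    execute(taskId); // Execute
    return 'success';
  } catch (error) {
    dlq.recordFailure(taskId, { error }); // Record Failure + threshold check
    return dlq.isQuarantined(taskId) ? 'quarantined' : 'failed';
  }
}

const dlq = createStubDlq(2);
const alwaysFails = () => { throw new Error('Request timeout'); };
const outcomes = [1, 2, 3].map(() => processDueTask(42, dlq, alwaysFails));
console.log(outcomes); // [ 'failed', 'quarantined', 'skipped' ]
```

The third attempt never reaches `execute`, which is the point of the quarantine check: a task that keeps failing stops consuming execution resources.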
Configure the Dead-Letter Queue via environment variables:
# Maximum number of failures before quarantine (default: 5)
DLQ_MAX_FAILURES=5
# Time window for counting failures in milliseconds (default: 3600000 = 1 hour)
DLQ_FAILURE_WINDOW_MS=3600000
# Enable automatic quarantine (default: true)
DLQ_AUTO_QUARANTINE=true
# Maximum number of dead-letter records to keep (default: 1000)
DLQ_MAX_RECORDS=1000
A task is quarantined when:
- Failure Count Threshold: Task fails DLQ_MAX_FAILURES times within DLQ_FAILURE_WINDOW_MS
- Non-Retryable Error: Task encounters a non-retryable error (e.g., INVALID_ARGS, CONTRACT_PANIC)
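As a rough sketch of how the environment variables and the two criteria above could fit together: `loadDlqConfig` and `shouldQuarantine` are hypothetical helpers, not part of the documented API. The sketch assumes that only the string `false` disables auto-quarantine, that the non-retryable classification is spelled `non_retryable`, and that disabling auto-quarantine also disables the non-retryable rule.

```javascript
// Read DLQ settings from the environment, falling back to the documented defaults.
function loadDlqConfig(env = process.env) {
  const toInt = (value, fallback) => {
    const parsed = Number.parseInt(value, 10);
    return Number.isNaN(parsed) ? fallback : parsed;
  };
  return {
    maxFailures: toInt(env.DLQ_MAX_FAILURES, 5),
    failureWindowMs: toInt(env.DLQ_FAILURE_WINDOW_MS, 3600000),
    autoQuarantine: env.DLQ_AUTO_QUARANTINE !== 'false', // assumption: only 'false' disables
    maxRecords: toInt(env.DLQ_MAX_RECORDS, 1000),
  };
}

// Decide whether a task should be quarantined, given its failure timestamps in ms.
function shouldQuarantine(failureTimestamps, classification, config, now = Date.now()) {
  if (!config.autoQuarantine) return false;
  // Non-retryable errors quarantine immediately ('non_retryable' is an assumed spelling).
  if (classification === 'non_retryable') return true;
  // Otherwise count only the failures that fall inside the sliding window.
  const windowStart = now - config.failureWindowMs;
  const recent = failureTimestamps.filter((t) => t >= windowStart);
  return recent.length >= config.maxFailures;
}

const config = loadDlqConfig({ DLQ_MAX_FAILURES: '3', DLQ_FAILURE_WINDOW_MS: '60000' });
const now = 1000000;
// Only two of the three failures fall inside the 60 s window.
console.log(shouldQuarantine([now - 90000, now - 30000, now - 1000], 'retryable', config, now)); // false
// All three failures fall inside the window.
console.log(shouldQuarantine([now - 50000, now - 30000, now - 1000], 'retryable', config, now)); // true
```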
Errors are classified as:
- Retryable: Temporary failures (network errors, timeouts, rate limits)
- Non-Retryable: Permanent failures (invalid arguments, contract panics, auth errors)
- Duplicate: Transaction already processed (treated as success)
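A classifier along these lines could map error codes to the three classifications. This is only a sketch: `classifyError` is a hypothetical helper, `INVALID_ARGS`, `CONTRACT_PANIC` and `TIMEOUT` appear in this document, and every other code below is made up for illustration. The returned strings other than `retryable` are assumed spellings.

```javascript
// Codes taken from this document.
const NON_RETRYABLE_CODES = new Set(['INVALID_ARGS', 'CONTRACT_PANIC']);
// Hypothetical code for an already-processed transaction.
const DUPLICATE_CODES = new Set(['DUPLICATE_TRANSACTION']);

function classifyError(error) {
  const code = error && error.code;
  if (DUPLICATE_CODES.has(code)) return 'duplicate'; // treated as success by the caller
  if (NON_RETRYABLE_CODES.has(code)) return 'non_retryable';
  // Everything else, including unknown codes, is treated as temporary.
  return 'retryable';
}

console.log(classifyError({ code: 'TIMEOUT' }));        // retryable
console.log(classifyError({ code: 'CONTRACT_PANIC' })); // non_retryable
console.log(classifyError(new Error('no code set')));   // retryable
```

Defaulting unknown errors to `retryable` errs on the side of retrying: an unrecognized error still counts toward the failure threshold, but does not quarantine the task on its first occurrence.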
Each quarantined task has a dead-letter record containing:
{
  taskId: 123,
  quarantinedAt: 1234567890000,
  reason: "max_failures_exceeded",
  status: "quarantined",
  failureCount: 5,
  firstFailure: 1234567800000,
  lastFailure: 1234567890000,
  // Error pattern analysis
  errorPattern: {
    type: "TIMEOUT",
    classification: "retryable",
    phase: "execution",
    confidence: 0.8,
    totalFailures: 5,
    uniqueErrors: 2
  },
  // Last 10 failures for diagnosis
  failureHistory: [
    {
      timestamp: 1234567890000,
      error: {
        message: "Request timeout",
        code: "TIMEOUT",
        stack: "..."
      },
      errorClassification: "retryable",
      attempt: 5,
      txHash: "abc123...",
      phase: "execution",
      taskConfig: {
        last_run: 1000,
        interval: 3600,
        gas_balance: 5000,
        target: "CTEST...",
        function: "test_fn"
      }
    },
    // ... more failures
  ],
  // Recovery info (if recovered)
  recoveredAt: null,
  recoveryReason: null
}
GET http://localhost:3000/dead-letter
Response:
{
  "stats": {
    "totalQuarantined": 10,
    "totalRecovered": 3,
    "activeQuarantined": 7,
    "totalRecords": 10,
    "config": {
      "maxFailures": 5,
      "failureWindowMs": 3600000,
      "autoQuarantine": true,
      "maxRecords": 1000
    }
  },
  "records": [
    {
      "taskId": 123,
      "quarantinedAt": 1234567890000,
      "reason": "max_failures_exceeded",
      "failureCount": 5,
      "errorPattern": { ... }
    }
  ],
  "quarantinedTasks": [123, 456, 789]
}
GET http://localhost:3000/dead-letter/123
Response:
{
  "taskId": 123,
  "quarantinedAt": 1234567890000,
  "reason": "max_failures_exceeded",
  "status": "quarantined",
  "failureCount": 5,
  "failureHistory": [ ... ],
  "errorPattern": { ... }
}
The following metrics are exposed at /metrics/prometheus:
# Number of tasks currently quarantined
keeper_quarantined_tasks_count
# Total tasks that have been quarantined
keeper_tasks_quarantined_total
# Total tasks recovered from quarantine
keeper_tasks_recovered_total
# Total tasks skipped due to quarantine
keeper_tasks_quarantined_skipped_total
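To read these values from a script, the metrics text can be parsed in a few lines. The sketch below is deliberately simplified: `parseMetrics` is a hypothetical helper that assumes plain `name value` lines without labels or timestamps, and the sample text is made up for illustration rather than captured from a running keeper.

```javascript
// Parse simple "name value" lines from Prometheus text output, skipping comments.
function parseMetrics(text) {
  const values = {};
  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '' || trimmed.startsWith('#')) continue; // blank, # HELP, # TYPE
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)\s+(\S+)$/.exec(trimmed);
    if (match) values[match[1]] = Number(match[2]);
  }
  return values;
}

// Illustrative sample; real output comes from /metrics/prometheus.
const sample = [
  '# TYPE keeper_quarantined_tasks_count gauge',
  'keeper_quarantined_tasks_count 7',
  'keeper_tasks_quarantined_total 10',
  'keeper_tasks_recovered_total 3',
].join('\n');

const metrics = parseMetrics(sample);
console.log(metrics.keeper_quarantined_tasks_count); // 7
```

For production alerting, a Prometheus server scraping the endpoint is the more usual route; a parser like this is mainly useful for quick checks and smoke tests.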
Failures are automatically recorded by the ExecutionQueue when tasks fail:
// Automatic recording in queue.js
try {
  // ... execute the task ...
} catch (error) {
  if (deadLetterQueue) {
    deadLetterQueue.recordFailure(taskId, {
      error,
      errorClassification: error.classification || 'retryable',
      attempt,
      txHash: error.txHash || null,
      taskConfig,
      phase: error.phase || 'execution',
    });
  }
}
// Manually quarantine a task
deadLetterQueue.quarantine(taskId, 'manual_quarantine', {
  operator: 'admin',
  reason: 'Suspected malicious task'
});
// Recover a task from quarantine
const success = deadLetterQueue.recover(taskId, 'issue_resolved');
if (success) {
  console.log(`Task ${taskId} recovered and will be retried`);
}
// Check whether a task is quarantined
if (deadLetterQueue.isQuarantined(taskId)) {
  console.log('Task is quarantined, skipping execution');
}
// Read queue statistics
const stats = deadLetterQueue.getStats();
console.log(`Active quarantined tasks: ${stats.activeQuarantined}`);
console.log(`Total quarantined: ${stats.totalQuarantined}`);
console.log(`Total recovered: ${stats.totalRecovered}`);
The Dead-Letter Queue emits events for monitoring:
// Task quarantined
deadLetterQueue.on('task:quarantined', ({ taskId, record }) => {
  console.log(`Task ${taskId} quarantined: ${record.reason}`);
  // Send alert to operators
});
// Task recovered
deadLetterQueue.on('task:recovered', ({ taskId, recoveryReason }) => {
  console.log(`Task ${taskId} recovered: ${recoveryReason}`);
});
// Failure recorded
deadLetterQueue.on('failure:recorded', ({ taskId, failureRecord }) => {
  console.log(`Failure recorded for task ${taskId}`);
});
// DLQ cleared
deadLetterQueue.on('dlq:cleared', (options) => {
  console.log('Dead-letter queue cleared');
});
1. Check DLQ Overview:
   curl http://localhost:3000/dead-letter
2. Inspect Specific Task:
   curl http://localhost:3000/dead-letter/123
3. Analyze Error Pattern:
   - Check errorPattern.type for dominant error
   - Review errorPattern.confidence for consistency
   - Examine failureHistory for detailed context
4. Review Task Configuration:
   - Check taskConfig.gas_balance for insufficient gas
   - Verify taskConfig.target contract is valid
   - Ensure taskConfig.function exists
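When reading `errorPattern`, it can help to see how such a summary might be derived from `failureHistory`. The sketch below is an assumption, not the project's actual analysis: `summarizeErrorPattern` is a hypothetical helper that takes the most frequent error code as `type` and that code's share of all failures as `confidence`. With four `TIMEOUT` failures and one other error it yields 0.8, which happens to match the sample record, but the real formula may differ. The `RATE_LIMITED` code is made up for illustration.

```javascript
// Summarize a failure history: dominant error code and how consistently it occurs.
function summarizeErrorPattern(failureHistory) {
  const counts = new Map(); // error code -> occurrences
  for (const failure of failureHistory) {
    const code = (failure.error && failure.error.code) || 'UNKNOWN';
    counts.set(code, (counts.get(code) || 0) + 1);
  }
  let type = null;
  let max = 0;
  for (const [code, count] of counts) {
    if (count > max) {
      type = code;
      max = count;
    }
  }
  const totalFailures = failureHistory.length;
  return {
    type,
    // Assumed definition: share of failures that carry the dominant code.
    confidence: totalFailures === 0 ? 0 : max / totalFailures,
    totalFailures,
    uniqueErrors: counts.size,
  };
}

const history = [
  { error: { code: 'TIMEOUT' } },
  { error: { code: 'TIMEOUT' } },
  { error: { code: 'RATE_LIMITED' } }, // hypothetical code
  { error: { code: 'TIMEOUT' } },
  { error: { code: 'TIMEOUT' } },
];
console.log(summarizeErrorPattern(history));
// { type: 'TIMEOUT', confidence: 0.8, totalFailures: 5, uniqueErrors: 2 }
```

Under this reading, a high confidence with a single dominant type points at one root cause, while a low confidence spread over many codes suggests a flaky environment rather than a broken task.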
After resolving the underlying issue:
1. Programmatic Recovery:
   deadLetterQueue.recover(taskId, 'gas_refilled');
2. Bulk Recovery (if needed):
   const quarantined = deadLetterQueue.getQuarantinedTasks();
   quarantined.forEach(taskId => {
     deadLetterQueue.recover(taskId, 'bulk_recovery');
   });
// Clear only recovered records
deadLetterQueue.clear({ recoveredOnly: true });
// Clear all records (use with caution)
deadLetterQueue.clear();
- Monitor Quarantine Rate: Set up alerts when keeper_quarantined_tasks_count increases
- Regular Review: Periodically review quarantined tasks to identify systemic issues
- Adjust Thresholds: Tune DLQ_MAX_FAILURES based on your network conditions
- Document Recoveries: Always provide a meaningful recoveryReason when recovering tasks
- Investigate Patterns: Use error pattern analysis to identify root causes
- Clean Up: Regularly clear recovered records to prevent unbounded growth
Tasks not being quarantined:
- Check DLQ_AUTO_QUARANTINE=true
- Verify the failure count has reached DLQ_MAX_FAILURES
- Ensure the failures occur within DLQ_FAILURE_WINDOW_MS
Too many tasks quarantined:
- Increase the DLQ_MAX_FAILURES threshold
- Shorten the DLQ_FAILURE_WINDOW_MS window so that fewer failures are counted together
- Review error classification logic
Dead-letter queue growing too large:
- Reduce the DLQ_MAX_RECORDS limit
- Regularly clear recovered records
- Monitor the data/dead-letter-queue.json file size
The Dead-Letter Queue integrates with:
- ExecutionQueue: Automatic failure recording and quarantine checking
- MetricsServer: HTTP endpoints and Prometheus metrics
- Logger: Structured logging for all DLQ operations
- Retry Logic: Respects error classifications from the retry module
Potential improvements:
- Automatic Recovery: Retry quarantined tasks after a cooldown period
- Webhook Notifications: Alert operators when tasks are quarantined
- Dashboard UI: Web interface for managing quarantined tasks
- Pattern-Based Rules: Custom quarantine rules based on error patterns
- Export/Import: Backup and restore dead-letter records