Summary
`ServiceNotificationFunction` attaches a DLQ to the SNS subscription, but not to the Lambda function itself. This means Lambda-side failures (thrown errors, init errors, timeouts) don't land in any DLQ and are silently dropped once Lambda's internal async retries are exhausted.
This was surfaced while diagnosing the Node 24 outage from #58: init errors on the notification handler were attempted 3 times (the original invocation plus 2 retries) by Lambda's async-invoke mechanism and then gone forever, with no trace in the SNS DLQ.
Why the existing DLQ doesn't help
For SNS → Lambda async invocations, SNS considers delivery successful the moment Lambda's `Invoke` API returns 202. What happens inside Lambda (init error, handler exception, timeout) is entirely Lambda's problem: Lambda retries the invocation up to 2 more times with backoff, and if no `DeadLetterConfig` or `onFailure` destination is configured on the function, the message is silently discarded.
The current `ServiceNotificationFunction` construct (`constructs/service-notification-function.ts`) only configures `deadLetterQueue` on the `SnsEventSource`, which is the SNS subscription-level DLQ. That DLQ catches SNS-side delivery failures (e.g. Lambda throttled, Lambda doesn't exist, permission denied) but not Lambda invocation failures, which account for the vast majority of real-world failure modes.
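The split of responsibility described above can be summarized as a small decision function. This is an illustrative model only, not AWS or CDK code: `FailureStage`, `CaptureConfig`, and `failureSink` are hypothetical names, and the model ignores SNS's own delivery retries.

```ts
// Illustrative model only: these names are hypothetical, not part of any AWS SDK or CDK API.

// Where in the SNS -> Lambda path the failure happened.
type FailureStage =
  | 'sns-delivery'       // SNS could not hand the message to Lambda (throttle, missing function, permissions)
  | 'lambda-invocation'; // Lambda accepted the event (202), then init, the handler, or a timeout failed

interface CaptureConfig {
  subscriptionDlq: boolean;      // deadLetterQueue on the SnsEventSource
  lambdaFailureCapture: boolean; // DeadLetterConfig or onFailure destination on the function
}

type Sink = 'sns-subscription-dlq' | 'lambda-failure-capture' | 'dropped';

function failureSink(stage: FailureStage, cfg: CaptureConfig): Sink {
  if (stage === 'sns-delivery') {
    return cfg.subscriptionDlq ? 'sns-subscription-dlq' : 'dropped';
  }
  // Lambda-side failure: SNS already considers the delivery successful,
  // so the subscription DLQ is never consulted.
  return cfg.lambdaFailureCapture ? 'lambda-failure-capture' : 'dropped';
}

// The construct as it stands today: subscription DLQ only.
const current: CaptureConfig = { subscriptionDlq: true, lambdaFailureCapture: false };
console.log(failureSink('lambda-invocation', current));
```

With the current configuration, any `lambda-invocation` failure maps to `dropped`, which matches what the incident showed.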
Observed behavior
From the Node 24 incident:
- 2 SNS messages delivered to the Lambda
- Each attempted 3 times (original + 2 retries) with `Runtime.CallbackHandlerDeprecated` init errors
- SNS subscription DLQ: 0 messages (as expected — SNS did its job)
- Messages: gone. No DLQ, no destination, no record beyond CloudWatch logs.
Only because SES writes the raw emails to S3 upstream was it possible to know exactly what was lost.
Suggested fix
Add a Lambda-level failure destination so function-side failures are captured. Two options:
Option A — on-failure destination (preferred)
Use `configureAsyncInvoke` with `onFailure: new SqsDestination(dlq)`. This captures the full async invocation context (request ID, error type, payload) in the DLQ message, which is substantially more useful for replay than the raw event alone.
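```ts
import * as cdk from 'aws-cdk-lib';
import * as destinations from 'aws-cdk-lib/aws-lambda-destinations';
import * as sqs from 'aws-cdk-lib/aws-sqs';

// Inside the construct's constructor; `this` is assumed to be the Lambda function construct.
const lambdaDlq = new sqs.Queue(this, 'LambdaDLQ', {
  receiveMessageWaitTime: cdk.Duration.seconds(20), // long polling for consumers
});
// Send events that still fail after Lambda's async retries to the queue.
this.configureAsyncInvoke({
  onFailure: new destinations.SqsDestination(lambdaDlq),
  retryAttempts: 2, // Lambda's default, stated explicitly
});
```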
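To illustrate why the destination record helps with replay, here is a minimal sketch of a consumer-side helper that reads one message body from the failure queue. The field names follow the invocation record format AWS documents for async destinations as I understand it, so check them against the current docs. `AsyncFailureRecord`, `ReplayCandidate`, and `toReplayCandidate` are hypothetical names, and the error fields are treated as optional because not every failure type is guaranteed to populate them.

```ts
// Assumed shape of the record Lambda writes to an on-failure destination.
interface AsyncFailureRecord {
  version: string;
  timestamp: string;
  requestContext: {
    requestId: string;
    functionArn: string;
    condition: string;              // e.g. "RetriesExhausted"
    approximateInvokeCount: number; // total attempts, including the first
  };
  requestPayload: unknown;          // the original event (here: the SNS event)
  responseContext?: { statusCode: number; executedVersion: string; functionError?: string };
  responsePayload?: { errorType?: string; errorMessage?: string };
}

// Hypothetical summary used to decide what to replay.
interface ReplayCandidate {
  requestId: string;
  condition: string;
  attempts: number;
  errorType: string;
  payload: unknown;
}

function toReplayCandidate(sqsBody: string): ReplayCandidate {
  const record = JSON.parse(sqsBody) as AsyncFailureRecord;
  return {
    requestId: record.requestContext.requestId,
    condition: record.requestContext.condition,
    attempts: record.requestContext.approximateInvokeCount,
    errorType: record.responsePayload?.errorType ?? 'unknown',
    payload: record.requestPayload, // re-invoke the function with this to replay
  };
}
```

A replay tool could feed `payload` back into the function once the underlying bug is fixed, and use `errorType` to group failures by cause.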
Option B — function-level DLQ (simpler, less context)
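Pass `deadLetterQueue` or `deadLetterQueueEnabled: true` when constructing the `NodejsFunction`. This captures the original event with only minimal error metadata (the `RequestID`, `ErrorCode`, and `ErrorMessage` message attributes) rather than the full invocation record.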
Option A is the modern pattern (Lambda destinations were introduced in 2019 as the more capable alternative to function-level DLQs) and gives you the full error context, including the response payload, for free.
Consistency note
The queue-handler wrappers (`ServiceQueueFunction`) don't need this because SQS event source mappings handle retries and DLQs on the queue side. But `ServiceNotificationFunction`, `ServiceEventFunction`, and `ServiceCronFunction` all invoke Lambda asynchronously and would benefit from the same pattern. Worth auditing whether all async-invoke wrappers need the fix, not just notification.
Environment