ServiceNotificationFunction silently drops messages on Lambda invocation failure #59

@mckalexee

Summary

ServiceNotificationFunction attaches a DLQ to the SNS subscription, but not to the Lambda function itself. This means Lambda-side failures — thrown errors, init errors, timeouts — don't land in any DLQ and are silently dropped after Lambda's internal async retries exhaust.

This surfaced while diagnosing the Node 24 outage from #58: each invocation of the notification handler failed with an init error, Lambda's async-invoke mechanism attempted it 3 times (the original attempt plus 2 retries), and the message was then gone for good, with no trace in the SNS DLQ.

Why the existing DLQ doesn't help

For SNS → Lambda async invocations, SNS considers delivery successful the moment Lambda's `Invoke` API returns 202. What happens inside Lambda (init error, handler exception, timeout) is entirely Lambda's problem: Lambda retries the invocation up to 2 more times with backoff, and if no `DeadLetterConfig` or `onFailure` destination is configured on the function, the message is silently discarded.

The current `ServiceNotificationFunction` construct (`constructs/service-notification-function.ts`) only configures `deadLetterQueue` on the `SnsEventSource`, which is the SNS subscription-level DLQ. That DLQ catches SNS-side delivery failures (e.g. Lambda throttled, Lambda doesn't exist, permission denied) but not Lambda invocation failures, which in practice are the far more common failure mode.
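To make the two failure paths concrete, here is a small illustrative model of where a message ends up. It is a sketch only: `FailureStage`, `FailureCapture`, and `finalDisposition` are hypothetical names invented for this issue, not AWS or `@faceteer/cdk` APIs, and the model ignores retries and timing.

```typescript
// Illustrative model only. All names here are hypothetical.
type FailureStage =
  | 'none'           // handler succeeded
  | 'sns-delivery'   // SNS could not hand the message to Lambda (e.g. permission denied)
  | 'lambda-invoke'; // Lambda accepted the event (202), then every attempt failed

interface FailureCapture {
  snsSubscriptionDlq: boolean;   // deadLetterQueue on the SnsEventSource
  lambdaFailureCapture: boolean; // DeadLetterConfig or onFailure destination on the function
}

type Disposition =
  | 'processed'
  | 'sns-subscription-dlq'
  | 'lambda-failure-capture'
  | 'dropped';

function finalDisposition(stage: FailureStage, cfg: FailureCapture): Disposition {
  switch (stage) {
    case 'none':
      return 'processed';
    case 'sns-delivery':
      return cfg.snsSubscriptionDlq ? 'sns-subscription-dlq' : 'dropped';
    case 'lambda-invoke':
      // The subscription-level DLQ is irrelevant here: SNS already saw a 202.
      return cfg.lambdaFailureCapture ? 'lambda-failure-capture' : 'dropped';
  }
}

// Today's construct: subscription DLQ only.
const current: FailureCapture = { snsSubscriptionDlq: true, lambdaFailureCapture: false };
// With a Lambda-level DLQ or failure destination added.
const proposed: FailureCapture = { snsSubscriptionDlq: true, lambdaFailureCapture: true };
```

Under `current`, `finalDisposition('lambda-invoke', current)` evaluates to `'dropped'`, which is the gap this issue describes; under `proposed` the same failure is captured.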

Observed behavior

From the Node 24 incident:

  • 2 SNS messages delivered to the Lambda
  • Each attempted 3 times (original + 2 retries), failing every time with `Runtime.CallbackHandlerDeprecated` init errors
  • SNS subscription DLQ: 0 messages (as expected — SNS did its job)
  • Messages: gone. No DLQ, no destination, no record beyond CloudWatch logs.

Only because SES writes the raw emails to S3 upstream was it possible to know exactly what was lost.

Suggested fix

Add a Lambda-level failure destination so function-side failures are captured. Two options:

Option A — on-failure destination (preferred)

Use `configureAsyncInvoke` with `onFailure: new SqsDestination(dlq)`. This captures the full async invocation context (request ID, error type, payload) in the DLQ message, which is substantially more useful for replay than the raw event alone.

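```ts
// Assumes aws-cdk-lib v2 and that `this` is the Lambda function construct.
import * as cdk from 'aws-cdk-lib';
import * as sqs from 'aws-cdk-lib/aws-sqs';
import * as destinations from 'aws-cdk-lib/aws-lambda-destinations';

const lambdaDlq = new sqs.Queue(this, 'LambdaDLQ', {
  receiveMessageWaitTime: cdk.Duration.seconds(20), // long polling for consumers
});
// Send a record to the queue once an async invocation has failed for good.
this.configureAsyncInvoke({
  onFailure: new destinations.SqsDestination(lambdaDlq),
  retryAttempts: 2, // Lambda's default: 3 attempts in total
});
```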
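For replay, a consumer of that queue would read the failure record from the SQS message body. The sketch below assumes the record layout AWS documents for async invocation destinations (`requestContext`, `requestPayload`, `responseContext`, `responsePayload`); it is worth checking the field names against the current Lambda docs before relying on them. `summarizeFailure` and `FailureSummary` are hypothetical helpers, and the sample values are made up for illustration.

```typescript
// Assumed shape of the record Lambda writes to an on-failure destination.
interface AsyncFailureRecord<TPayload> {
  version: string;
  timestamp: string;
  requestContext: {
    requestId: string;
    functionArn: string;
    condition: string; // e.g. "RetriesExhausted"
    approximateInvokeCount: number;
  };
  requestPayload: TPayload; // the original event (here, the SNS event)
  responseContext?: { statusCode: number; executedVersion: string; functionError?: string };
  responsePayload?: { errorType?: string; errorMessage?: string };
}

// Hypothetical summary used to decide whether and how to replay.
interface FailureSummary<TPayload> {
  requestId: string;
  attempts: number;
  condition: string;
  errorType: string;
  payload: TPayload;
}

function summarizeFailure<TPayload>(sqsBody: string): FailureSummary<TPayload> {
  const record = JSON.parse(sqsBody) as AsyncFailureRecord<TPayload>;
  return {
    requestId: record.requestContext.requestId,
    attempts: record.requestContext.approximateInvokeCount,
    condition: record.requestContext.condition,
    // responsePayload may be absent, so fall back to a placeholder.
    errorType: record.responsePayload?.errorType ?? 'Unknown',
    payload: record.requestPayload,
  };
}

// Made-up sample body, loosely modelled on the Node 24 incident.
const sampleBody = JSON.stringify({
  version: '1.0',
  timestamp: '2025-01-01T00:00:00.000Z',
  requestContext: {
    requestId: 'example-request-id',
    functionArn: 'arn:aws:lambda:us-east-1:123456789012:function:example',
    condition: 'RetriesExhausted',
    approximateInvokeCount: 3,
  },
  requestPayload: { Records: [{ Sns: { Message: 'hello' } }] },
  responsePayload: { errorType: 'Runtime.CallbackHandlerDeprecated', errorMessage: 'example' },
});

const summary = summarizeFailure<{ Records: unknown[] }>(sampleBody);
```

Because `summary.payload` is the original SNS event, a replay tool could re-invoke the handler with it directly once the underlying bug is fixed.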

Option B — function-level DLQ (simpler, less context)

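Pass `deadLetterQueue` or `deadLetterQueueEnabled: true` when constructing the `NodejsFunction`. This captures the original event as the message body, with only minimal error detail (request ID, error code, truncated error message) in the SQS message attributes rather than a structured failure record.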

Option A is the modern pattern (destinations supersede DLQs as of 2019) and gives you the error reason for free.

Consistency note

The queue-handler wrappers (`ServiceQueueFunction`) don't need this because SQS event source mappings handle retries and DLQs on the queue side. But `ServiceNotificationFunction`, `ServiceEventFunction`, and `ServiceCronFunction` all invoke Lambda asynchronously and would benefit from the same pattern. Worth auditing whether all async-invoke wrappers need the fix, not just notification.

Environment

  • `@faceteer/cdk` 8.0.0
