RFC: List undelivered events #711

@alexluong

Description

Problem

There's no way to know if an event was successfully delivered to all its matched destinations. Discovering delivery gaps requires exporting logstore data and running offline analysis.

Users want to know: "is something broken that I need to act on?"

Background

  • Events are stateless, immutable facts
  • Attempts are stateless log entries with a status (success/failed)
  • Delivery (event + destination pair) is an implicit concept, not a persisted entity (intentionally removed in chore: remove delivery event concept #653)
  • Retry state lives in RSMQ (Redis sorted set), separate from the logstore
  • Destination state (disabled/deleted) lives in Dragonfly/Redis, is mutable and not historical

Proposal

A new query operation — event.listUndelivered() — surfaces event+destination pairs where the delivery journey ended without success.

What "undelivered" means

An event+destination pair is undelivered when the journey is over and no attempt succeeded:

  • Retry exhausted (max attempts reached)
  • Destination disabled at delivery time
  • Destination deleted/not found at delivery time

Pairs with a pending retry are not undelivered — they're still in progress.
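The definition above can be sketched as a small classifier. This is illustrative only — the names, fields, and "shouldn't happen" fallback are assumptions, not the actual codebase:

```python
from dataclasses import dataclass
from typing import Optional

# Terminal "reason" values from this proposal.
RETRY_EXHAUSTED = "retry_exhausted"
DESTINATION_DISABLED = "destination_disabled"
DESTINATION_NOT_FOUND = "destination_not_found"


@dataclass
class PairState:
    """Hypothetical snapshot of one event+destination pair."""
    any_success: bool        # did any attempt succeed?
    retry_pending: bool      # is a retry still scheduled in RSMQ?
    attempts: int            # attempts made so far
    max_attempts: int        # retry policy limit
    destination_status: str  # "active" | "disabled" | "not_found"


def undelivered_reason(p: PairState) -> Optional[str]:
    """Return a terminal reason, or None if delivered / still in progress."""
    if p.any_success:
        return None  # delivered
    if p.retry_pending:
        return None  # journey not over yet — not undelivered
    if p.destination_status == "disabled":
        return DESTINATION_DISABLED
    if p.destination_status == "not_found":
        return DESTINATION_NOT_FOUND
    if p.attempts >= p.max_attempts:
        return RETRY_EXHAUSTED
    return None  # no terminal condition matched
```

The key property: a pending retry short-circuits everything — the pair is in progress, not undelivered.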

Response shape

[
  {
    "event_id": "evt_...",
    "destination_id": "des_...",
    "attempts": 3,
    "last_code": "429",
    "last_attempt_time": "2026-02-19T08:53:22Z",
    "reason": "retry_exhausted"
  }
]

reason values:

  • retry_exhausted — max attempts reached, no success
  • destination_disabled — destination was disabled when delivery was attempted
  • destination_not_found — destination was deleted/not found

Options

Option A: Logstore aggregation query

Query the logstore for event+destination pairs with no successful attempt.

SELECT event_id, destination_id, count(*) as attempts, ...
FROM attempts
GROUP BY event_id, destination_id
HAVING countIf(status = 'success') = 0

Pros:

  • No new store, no new write path
  • Stateless, derived from existing data

Cons:

  • Expensive aggregation query on large datasets (needs materialized view in ClickHouse)
  • Can't distinguish "retry pending" from "retry dropped" without cross-checking RSMQ
  • Filtering out in-progress pairs breaks pagination (fetch 100, filter to 20, page is incomplete)
  • Can't reliably provide reason — destination state is mutable and not historical

Option B: Logstore query + RSMQ enrichment

Same as Option A, but enrich each result with a live RSMQ ZSCORE lookup to get next_attempt_time.

  • Retry ID is deterministic: sha256(event_id + ":" + destination_id) → RSMQ message ID
  • ZSCORE is O(1) per lookup, ~100 per page is cheap
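A sketch of the enrichment step. The key name, row shape, and the dict standing in for the RSMQ sorted set are assumptions — a real implementation would call redis.zscore() against the actual retry queue:

```python
import hashlib


def retry_id(event_id: str, destination_id: str) -> str:
    """Deterministic RSMQ message ID: sha256(event_id + ":" + destination_id)."""
    return hashlib.sha256(f"{event_id}:{destination_id}".encode()).hexdigest()


def enrich(rows, zset):
    """Attach next_attempt_time to each aggregated logstore row.

    `zset` stands in for the RSMQ sorted set (member -> score); with a real
    client this would be one ZSCORE lookup per row. A missing member yields
    None, i.e. no retry pending — the pair is terminal.
    """
    out = []
    for row in rows:
        score = zset.get(retry_id(row["event_id"], row["destination_id"]))
        out.append({**row, "next_attempt_time": score})
    return out
```

Because the ID is a pure function of the pair, no reverse index is needed to go from a logstore row to its retry entry.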

Pros:

  • No new store
  • next_attempt_time present = still in progress, null = terminal
  • Live data, not stale

Cons:

  • Same expensive logstore aggregation as Option A
  • Pagination problem remains — can't server-side filter "in progress" pairs without over-fetching
  • Cross-store query (logstore + RSMQ), no atomicity (retry could fire between queries)
  • reason still hard to derive

Option C: Dedicated undelivered store

Maintain a dedicated store that tracks event+destination pairs needing attention. Could be a Redis sorted set, or a new table/space within the existing logstore (ClickHouse or Postgres).

Write triggers (add to store):

  • Retry scheduler exhausts max attempts → add with reason retry_exhausted
  • Delivery worker encounters disabled destination → add with reason destination_disabled
  • Delivery worker encounters deleted/not-found destination → add with reason destination_not_found

Removal triggers:

  • Manual retry succeeds → remove
  • User acknowledges/dismisses → remove
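The triggers above can be sketched as a minimal in-memory store. This is a shape sketch only — a real version would be Redis (ZADD with a timestamp score, ZREM on resolution) or a logstore table, and the method names are invented:

```python
import time


class UndeliveredStore:
    """In-memory sketch of the dedicated store, keyed by (event, destination)."""

    def __init__(self):
        self._entries = {}

    def add(self, event_id, destination_id, reason, ts=None):
        # Write trigger: retry scheduler / delivery worker on terminal failure.
        # reason is captured point-in-time, so it stays accurate even if the
        # destination's state later changes.
        self._entries[(event_id, destination_id)] = {
            "event_id": event_id,
            "destination_id": destination_id,
            "reason": reason,
            "added_at": ts if ts is not None else time.time(),
        }

    def remove(self, event_id, destination_id):
        # Removal trigger: manual retry succeeded, or user dismissed the entry.
        self._entries.pop((event_id, destination_id), None)

    def list(self, limit=100):
        # listUndelivered becomes a plain bounded read: the store already
        # contains only problematic pairs, so pagination needs no filtering.
        entries = sorted(self._entries.values(), key=lambda e: e["added_at"])
        return entries[:limit]
```

Note that add() is an upsert: if the same pair fails terminally again after a partial manual retry, the entry is refreshed rather than duplicated.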

Pros:

  • Fast reads, naturally filtered to just problematic cases
  • Clean pagination — no over-fetching or cross-store filtering
  • reason is captured at write time (point-in-time, accurate)
  • No expensive aggregation queries
  • listUndelivered is a simple read from this store

Cons:

  • New store and write path to maintain
  • Write triggers must be added to retry scheduler + delivery worker
  • If a write is missed (bug, crash), the pair won't appear — no self-healing
  • Store can drift from reality (e.g. destination re-enabled after being marked as destination_disabled)

Out of scope

  • Replay/resolution — acting on undelivered events (replay, dismiss, bulk retry) is a separate concern. This endpoint is a read-only view. The undelivered store could be one input to a replay mechanism, but replay can also be triggered from other sources (user selection, time range, post-outage bulk replay).

Open questions

  • API path? GET /events/undelivered conflicts with GET /events/:event_id
  • Scoping — accepts tenant_id filter (or uses JWT tenant scope), otherwise returns global results
  • Filtering — by destination, time range, failure code, reason?
  • An event can be undelivered without ever having an attempt (e.g. destination disabled/deleted at publish time). In that case, the event was never written to the logstore. Should the undelivered store capture the full event data (self-contained, can replay) or just IDs (lightweight, but event data may not exist anywhere)?
  • For Option C: what store? Redis sorted set by timestamp? Separate table in logstore?
