RFC: List undelivered events #711

@alexluong

Description

Problem

There's no way to know if an event was successfully delivered to all its matched destinations. Discovering delivery gaps requires exporting logstore data and running offline analysis.

Users want to know: "is something broken that I need to act on?"

Background

  • Events are stateless, immutable facts
  • Attempts are stateless log entries with a status (success/failed)
  • Delivery (event + destination pair) is an implicit concept, not a persisted entity (intentionally removed in chore: remove delivery event concept #653)
  • Retry state lives in RSMQ (Redis sorted set), separate from the logstore
  • Destination state (disabled/deleted) lives in Dragonfly/Redis, is mutable and not historical

Proposal

A new query operation — event.listUndelivered() — surfaces event+destination pairs where the delivery journey ended without success.

What "undelivered" means

An event+destination pair is undelivered when the journey is over and no attempt succeeded:

  • Retry exhausted (max attempts reached)
  • Destination disabled at delivery time
  • Destination deleted/not found at delivery time

Pairs with a pending retry are not undelivered — they're still in progress.
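The definition above can be sketched as a small classifier. This is illustrative only — the names, fields, and "shouldn't happen" fallback are assumptions, not the actual codebase:

```python
from dataclasses import dataclass
from typing import Optional

# Terminal "reason" values from this proposal.
RETRY_EXHAUSTED = "retry_exhausted"
DESTINATION_DISABLED = "destination_disabled"
DESTINATION_NOT_FOUND = "destination_not_found"


@dataclass
class PairState:
    """Hypothetical snapshot of one event+destination pair."""
    any_success: bool        # did any attempt succeed?
    retry_pending: bool      # is a retry still scheduled in RSMQ?
    attempts: int            # attempts made so far
    max_attempts: int        # retry policy limit
    destination_status: str  # "active" | "disabled" | "not_found"


def undelivered_reason(p: PairState) -> Optional[str]:
    """Return a terminal reason, or None if delivered / still in progress."""
    if p.any_success:
        return None  # delivered
    if p.retry_pending:
        return None  # journey not over yet — not undelivered
    if p.destination_status == "disabled":
        return DESTINATION_DISABLED
    if p.destination_status == "not_found":
        return DESTINATION_NOT_FOUND
    if p.attempts >= p.max_attempts:
        return RETRY_EXHAUSTED
    return None  # no terminal condition matched
```

The key property: a pending retry short-circuits everything — the pair is in progress, not undelivered.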

Response shape

[
  {
    "event_id": "evt_...",
    "destination_id": "des_...",
    "attempts": 3,
    "last_code": "429",
    "last_attempt_time": "2026-02-19T08:53:22Z",
    "reason": "retry_exhausted"
  }
]

reason values:

  • retry_exhausted — max attempts reached, no success
  • destination_disabled — destination was disabled when delivery was attempted
  • destination_not_found — destination was deleted/not found

Options

Option A: Logstore aggregation query

Query the logstore for event+destination pairs with no successful attempt.

SELECT event_id, destination_id, count(*) as attempts, ...
FROM attempts
GROUP BY event_id, destination_id
HAVING countIf(status = 'success') = 0

Pros:

  • No new store, no new write path
  • Stateless, derived from existing data

Cons:

  • Expensive aggregation query on large datasets (needs materialized view in ClickHouse)
  • Can't distinguish "retry pending" from "retry dropped" without cross-checking RSMQ
  • Filtering out in-progress pairs breaks pagination (fetch 100, filter to 20, page is incomplete)
  • Can't reliably provide reason — destination state is mutable and not historical

Option B: Logstore query + RSMQ enrichment

Same as Option A, but enrich each result with a live RSMQ ZSCORE lookup to get next_attempt_time.

  • Retry ID is deterministic: sha256(event_id + ":" + destination_id) → RSMQ message ID
  • ZSCORE is O(1) per lookup, ~100 per page is cheap
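A sketch of the enrichment step. The key name, row shape, and the dict standing in for the RSMQ sorted set are assumptions — a real implementation would call redis.zscore() against the actual retry queue:

```python
import hashlib


def retry_id(event_id: str, destination_id: str) -> str:
    """Deterministic RSMQ message ID: sha256(event_id + ":" + destination_id)."""
    return hashlib.sha256(f"{event_id}:{destination_id}".encode()).hexdigest()


def enrich(rows, zset):
    """Attach next_attempt_time to each aggregated logstore row.

    `zset` stands in for the RSMQ sorted set (member -> score); with a real
    client this would be one ZSCORE lookup per row. A missing member yields
    None, i.e. no retry pending — the pair is terminal.
    """
    out = []
    for row in rows:
        score = zset.get(retry_id(row["event_id"], row["destination_id"]))
        out.append({**row, "next_attempt_time": score})
    return out
```

Because the ID is a pure function of the pair, no reverse index is needed to go from a logstore row to its retry entry.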

Pros:

  • No new store
  • next_attempt_time present = still in progress, null = terminal
  • Live data, not stale

Cons:

  • Same expensive logstore aggregation as Option A
  • Pagination problem remains — can't server-side filter "in progress" pairs without over-fetching
  • Cross-store query (logstore + RSMQ), no atomicity (retry could fire between queries)
  • reason still hard to derive

Option C: Dedicated undelivered store

Maintain a dedicated store that tracks event+destination pairs needing attention. Could be a Redis sorted set, or a new table/space within the existing logstore (ClickHouse or Postgres).

Write triggers (add to store):

  • Retry scheduler exhausts max attempts → add with reason retry_exhausted
  • Delivery worker encounters disabled destination → add with reason destination_disabled
  • Delivery worker encounters deleted/not-found destination → add with reason destination_not_found

Removal triggers:

  • Manual retry succeeds → remove
  • User acknowledges/dismisses → remove
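The triggers above can be sketched as a minimal in-memory store. This is a shape sketch only — a real version would be Redis (ZADD with a timestamp score, ZREM on resolution) or a logstore table, and the method names are invented:

```python
import time


class UndeliveredStore:
    """In-memory sketch of the dedicated store, keyed by (event, destination)."""

    def __init__(self):
        self._entries = {}

    def add(self, event_id, destination_id, reason, ts=None):
        # Write trigger: retry scheduler / delivery worker on terminal failure.
        # reason is captured point-in-time, so it stays accurate even if the
        # destination's state later changes.
        self._entries[(event_id, destination_id)] = {
            "event_id": event_id,
            "destination_id": destination_id,
            "reason": reason,
            "added_at": ts if ts is not None else time.time(),
        }

    def remove(self, event_id, destination_id):
        # Removal trigger: manual retry succeeded, or user dismissed the entry.
        self._entries.pop((event_id, destination_id), None)

    def list(self, limit=100):
        # listUndelivered becomes a plain bounded read: the store already
        # contains only problematic pairs, so pagination needs no filtering.
        entries = sorted(self._entries.values(), key=lambda e: e["added_at"])
        return entries[:limit]
```

Note that add() is an upsert: if the same pair fails terminally again after a partial manual retry, the entry is refreshed rather than duplicated.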

Pros:

  • Fast reads, naturally filtered to just problematic cases
  • Clean pagination — no over-fetching or cross-store filtering
  • reason is captured at write time (point-in-time, accurate)
  • No expensive aggregation queries
  • listUndelivered is a simple read from this store

Cons:

  • New store and write path to maintain
  • Write triggers must be added to retry scheduler + delivery worker
  • If a write is missed (bug, crash), the pair won't appear — no self-healing
  • Store can drift from reality (e.g. destination re-enabled after being marked as destination_disabled)

Out of scope

  • Replay/resolution — acting on undelivered events (replay, dismiss, bulk retry) is a separate concern. This endpoint is a read-only view. The undelivered store could be one input to a replay mechanism, but replay can also be triggered from other sources (user selection, time range, post-outage bulk replay).

Open questions

  • API path? GET /events/undelivered conflicts with GET /events/:event_id
  • Scoping — accepts tenant_id filter (or uses JWT tenant scope), otherwise returns global results
  • Filtering — by destination, time range, failure code, reason?
  • An event can be undelivered without ever having an attempt (e.g. destination disabled/deleted at publish time). In that case, the event was never written to the logstore. Should the undelivered store capture the full event data (self-contained, can replay) or just IDs (lightweight, but event data may not exist anywhere)?
  • For Option C: what store? Redis sorted set by timestamp? Separate table in logstore?
