Description
Problem
There's no way to know if an event was successfully delivered to all its matched destinations. Discovering delivery gaps requires exporting logstore data and running offline analysis.
Users want to know: "is something broken that I need to act on?"
Background
- Events are stateless, immutable facts
- Attempts are stateless log entries with a status (success/failed)
- Delivery (event + destination pair) is an implicit concept, not a persisted entity (intentionally removed in #653, "chore: remove delivery event concept")
- Retry state lives in RSMQ (Redis sorted set), separate from the logstore
- Destination state (disabled/deleted) lives in Dragonfly/Redis, is mutable and not historical
Proposal
A new query operation — `event.listUndelivered()` — that surfaces event+destination pairs where the delivery journey ended without success.
What "undelivered" means
An event+destination pair is undelivered when the journey is over and no attempt succeeded:
- Retry exhausted (max attempts reached)
- Destination disabled at delivery time
- Destination deleted/not found at delivery time
Pairs with a pending retry are not undelivered — they're still in progress.
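The classification above can be sketched as a pure function. This is a hedged illustration, not code from the project — the inputs and the function name are hypothetical:

```python
from typing import Optional

def undelivered_reason(
    has_successful_attempt: bool,
    retry_pending: bool,
    retry_exhausted: bool,
    destination_disabled: bool,
    destination_deleted: bool,
) -> Optional[str]:
    """Return why an event+destination pair is undelivered, or None.

    None means the pair was delivered or a retry is still pending
    (in progress), so it should not appear in listUndelivered results.
    """
    if has_successful_attempt:
        return None  # delivered
    if retry_pending:
        return None  # journey not over yet -> in progress, not undelivered
    if destination_deleted:
        return "destination_not_found"
    if destination_disabled:
        return "destination_disabled"
    if retry_exhausted:
        return "retry_exhausted"
    return None
```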
Response shape
```json
[
  {
    "event_id": "evt_...",
    "destination_id": "des_...",
    "attempts": 3,
    "last_code": "429",
    "last_attempt_time": "2026-02-19T08:53:22Z",
    "reason": "retry_exhausted"
  }
]
```

`reason` values:

- `retry_exhausted` — max attempts reached, no success
- `destination_disabled` — destination was disabled when delivery was attempted
- `destination_not_found` — destination was deleted/not found
Options
Option A: Logstore aggregation query
Query the logstore for event+destination pairs with no successful attempt.
```sql
SELECT event_id, destination_id, count(*) as attempts, ...
FROM attempts
GROUP BY event_id, destination_id
HAVING countIf(status = 'success') = 0
```

Pros:
- No new store, no new write path
- Stateless, derived from existing data
Cons:
- Expensive aggregation query on large datasets (needs materialized view in ClickHouse)
- Can't distinguish "retry pending" from "retry dropped" without cross-checking RSMQ
- Filtering out in-progress pairs breaks pagination (fetch 100, filter to 20, page is incomplete)
- Can't reliably provide `reason` — destination state is mutable and not historical
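The pagination problem can be made concrete with a toy example — in-memory data standing in for the logstore and the RSMQ cross-check, all values illustrative:

```python
# Toy illustration of the over-fetch problem: the aggregation query can only
# page over *all* pairs with no successful attempt; pairs still waiting on a
# pending retry must be filtered out afterwards, shrinking the page.
PAGE_SIZE = 100

# Pretend the aggregation returned a full page of 100 pairs, 80 of which
# still have a pending retry in RSMQ (in progress, so not undelivered).
pairs = [
    {"event_id": f"evt_{i}", "retry_pending": i % 5 != 0}
    for i in range(PAGE_SIZE)
]

page = [p for p in pairs if not p["retry_pending"]]
print(len(page))  # only 20 results survive for a 100-item page
```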
Option B: Logstore query + RSMQ enrichment
Same as Option A, but enrich each result with a live RSMQ `ZSCORE` lookup to get `next_attempt_time`.

- Retry ID is deterministic: `sha256(event_id + ":" + destination_id)` → RSMQ message ID
- `ZSCORE` is O(1) per lookup; ~100 per page is cheap
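Assuming the deterministic ID scheme above, the per-row enrichment might look like this — the ID derivation follows the proposal, while the Redis usage is sketched in comments (key name and client are hypothetical):

```python
import hashlib

def retry_message_id(event_id: str, destination_id: str) -> str:
    """Derive the RSMQ message ID from the event+destination pair."""
    return hashlib.sha256(f"{event_id}:{destination_id}".encode()).hexdigest()

# Enrichment per result row (sketch; `r` would be a Redis client,
# "rsmq:retries" a hypothetical sorted-set key):
#   score = r.zscore("rsmq:retries", retry_message_id(event_id, destination_id))
#   next_attempt_time = score  # present -> still in progress; None -> terminal
```

Because the ID is derived rather than stored, no extra index is needed to join a logstore row to its retry entry.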
Pros:
- No new store
- `next_attempt_time` present = still in progress, `null` = terminal
- Live data, not stale
Cons:
- Same expensive logstore aggregation as Option A
- Pagination problem remains — can't server-side filter "in progress" pairs without over-fetching
- Cross-store query (logstore + RSMQ), no atomicity (retry could fire between queries)
- `reason` still hard to derive
Option C: Dedicated undelivered store
Maintain a dedicated store that tracks event+destination pairs needing attention. Could be a Redis sorted set, or a new table/space within the existing logstore (ClickHouse or Postgres).
Write triggers (add to store):
- Retry scheduler exhausts max attempts → add with reason `retry_exhausted`
- Delivery worker encounters a disabled destination → add with reason `destination_disabled`
- Delivery worker encounters a deleted/not-found destination → add with reason `destination_not_found`
Removal triggers:
- Manual retry succeeds → remove
- User acknowledges/dismisses → remove
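A minimal in-memory sketch of the store's add/remove lifecycle — a dict standing in for a Redis sorted set or table, all names hypothetical:

```python
import time

class UndeliveredStore:
    """Tracks event+destination pairs needing attention, keyed by pair."""

    def __init__(self):
        # (event_id, destination_id) -> {"reason": ..., "added_at": ...}
        self._entries = {}

    def add(self, event_id, destination_id, reason):
        # Called by the retry scheduler (retry_exhausted) or the delivery
        # worker (destination_disabled / destination_not_found).
        self._entries[(event_id, destination_id)] = {
            "reason": reason,
            "added_at": time.time(),
        }

    def remove(self, event_id, destination_id):
        # Called when a manual retry succeeds or the user dismisses the pair.
        self._entries.pop((event_id, destination_id), None)

    def list_undelivered(self, limit=100):
        # The store holds only problematic pairs, so pagination is clean:
        # no over-fetching, no cross-store filtering.
        items = sorted(self._entries.items(), key=lambda kv: kv[1]["added_at"])
        return [
            {"event_id": e, "destination_id": d, "reason": v["reason"]}
            for (e, d), v in items[:limit]
        ]
```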
Pros:
- Fast reads, naturally filtered to just the problematic cases
- Clean pagination — no over-fetching or cross-store filtering
- `reason` is captured at write time (point-in-time, accurate)
- No expensive aggregation queries
- `listUndelivered` is a simple read from this store
Cons:
- New store and write path to maintain
- Write triggers must be added to retry scheduler + delivery worker
- If a write is missed (bug, crash), the pair won't appear — no self-healing
- Store can drift from reality (e.g. a destination re-enabled after being marked `destination_disabled`)
Out of scope
- Replay/resolution — acting on undelivered events (replay, dismiss, bulk retry) is a separate concern. This endpoint is a read-only view. The undelivered store could be one input to a replay mechanism, but replay can also be triggered from other sources (user selection, time range, post-outage bulk replay).
Open questions
- API path? `GET /events/undelivered` conflicts with `GET /events/:event_id`
- Scoping — accept a `tenant_id` filter (or use the JWT tenant scope); otherwise return global results
- Filtering — by destination, time range, failure code, reason?
- An event can be undelivered without ever having an attempt (e.g. destination disabled/deleted at publish time). In that case, the event was never written to the logstore. Should the undelivered store capture the full event data (self-contained, can replay) or just IDs (lightweight, but event data may not exist anywhere)?
- For Option C: what store? Redis sorted set by timestamp? Separate table in logstore?