Skip to content

Fix item scans around dead letters and lease extension#7

Open
ccarvalho-eng wants to merge 6 commits into
bedrock-kv:mainfrom
ccarvalho-eng:fix/raw-item-range-scans
Open

Fix item scans around dead letters and lease extension#7
ccarvalho-eng wants to merge 6 commits into
bedrock-kv:mainfrom
ccarvalho-eng:fix/raw-item-range-scans

Conversation

@ccarvalho-eng

@ccarvalho-eng ccarvalho-eng commented Jun 3, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • keep dead-lettered jobs outside the live items/ scan prefix
  • scan item ranges as raw storage keys and ignore legacy non-item rows under items/
  • use the stored lease record's current item_key when completing or requeueing leased jobs
  • make lease-extender log assertions deterministic in CI

Root Cause

This PR fixes two queue-consumer failure modes found while running the downstream Beacon app:

https://github.com/ccarvalho-eng/the-beacon

Non-item rows under items/

Tuple-encoded keys can be used for ordered bounded scans. The problem here was not tuple ordering itself.

The queue stored dead-letter rows by deriving the dead-letter keyspace from keyspaces.items. That placed non-item rows beneath the live item scan prefix. Those keys are encoded as multiple tuple values under the items/ prefix, so item-scan code that decodes each suffix as one {priority, vesting_time, id} key can raise:

** (ArgumentError) Extra data after key: ...
    (bedrock) lib/bedrock/encoding/tuple.ex:17: Bedrock.Encoding.Tuple.unpack/1
    (bedrock) lib/bedrock/keyspace.ex:136: anonymous fn/2 in Bedrock.Keyspace.get_range_from_repo/3

The fix makes dead_letter/ a sibling of items/, and item scans now use raw storage-key ranges with a key-shape filter. That prevents future dead-letter writes from polluting live item scans and lets queues skip legacy malformed rows already present under the old prefix.

Stale lease item keys

Lease extension can move an item to a new {priority, vesting_time, id} key while the manager still holds the original lease struct. Terminal actions previously trusted the caller lease's item_key when completing or requeueing. If the stored lease had already been updated by the extender, dead-letter or retry paths could clear the stale key and leave an orphaned item behind.

Those orphaned rows remain visible to peek/4, but obtain_lease/5 cannot find them by the decoded value's key, so they can block later jobs in the same priority range.

The fix makes the stored lease record authoritative for terminal item keys.

Flow

flowchart LR
  enqueue[enqueue item] --> items[queues/:id/items]
  worker[worker leases item] --> lease[leases/:item_id stores current item_key]
  extend[lease extender moves item key] --> lease
  retry[complete or requeue] --> lease
  retry --> current_key[use stored lease item_key]
  dead[dead letter] --> dead_letter[queues/:id/dead_letter]
  scan[item scan] --> filter[raw range plus item-key filter]
  filter --> items
Loading

Reproduction

Beacon exposed both bugs while using the Bedrock-backed queue:

  • the consumer crashed with Extra data after key when item scans encountered non-item dead-letter rows under the old items/ prefix
  • after repeated failed payload attempts with lease extension enabled, visible payload rows had lease_id set, no corresponding lease record, and an item value whose computed Item.key/1 did not match the raw storage key

This PR adds direct regressions for both cases:

  • Store.peek/4 ignores legacy non-item rows under the item scan prefix
  • max-retry requeue writes dead letters outside the item scan prefix
  • complete/3 and requeue/4 use the stored lease item key when the caller lease is stale

Validation

  • git diff --check
  • syntax checks with Code.string_to_quoted!/1 for touched files

Local mix test and mix format are blocked before project tests by the checkout's dev/test dependency toolchain loading under local Elixir/OTP (styler/earmark_parser compile/load error). GitHub Actions runs the supported matrix and is the authoritative verification for this branch.

Avoid using keyspace range decoding for queue item scans because item keys are tuple-encoded as {priority, vesting_time, id}.

The keyspace decoder expects one complete tuple value and raises when a range scan returns nested tuple key suffixes. Scan raw key ranges by prefix for peek, minimum vesting time, and queue-empty checks instead.
@ccarvalho-eng ccarvalho-eng requested a review from jallum June 3, 2026 12:41
@ccarvalho-eng

Copy link
Copy Markdown
Collaborator Author

Correction after Jason pointed out the tuple encoding behavior:

The tuple encoding itself is still suitable for ordered range scans. A tuple key like {1, 2} does not need to be treated as a literal byte prefix of {1, 2, 3}; the important property is that tuple-packed keys sort so a bounded range can cover {1, 2, ...} and stop before {1, 3}.

So the first bug should not be framed as "tuple item keys cannot be range scanned." The more precise issue we reproduced in Beacon is that non-item rows can live under the items/ keyspace prefix and then get decoded by item-scan code. In particular, move_to_dead_letter/5 currently derives dead_letter from keyspaces.items, so dead-letter rows are placed beneath the item scan prefix. A later item scan can see those rows and Keyspace.get_range_from_repo/3 / tuple decoding raises with Extra data after key because the suffix is not one item tuple key.

That still needs an upstream fix, but the fix should be targeted at keeping dead-letter storage outside the items/ scan prefix, or otherwise ensuring item scans only range over valid item keys. The second bug in this PR remains separate: complete/requeue should trust the stored lease record item key after lease extension, not a stale caller lease struct.

@ccarvalho-eng ccarvalho-eng changed the title Fix tuple item key range scans Fix item scans around dead letters and lease extension Jun 3, 2026
Store dead-lettered jobs in a sibling queue keyspace and filter item scans by valid item key shape so legacy non-item rows cannot crash or block queue scans. Make lease-extender log capture deterministic in CI.
@ccarvalho-eng ccarvalho-eng force-pushed the fix/raw-item-range-scans branch from b6fd0ed to afa8e3d Compare June 3, 2026 14:04
Keep the log-capture tests from waiting through a second extender interval while still flushing Logger inside the capture window.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant