
Optimize TaskBroker scanning: larger batches + cursor-based scan #213

Open

airhorns wants to merge 6 commits into main from task-broker-opt

Conversation

@airhorns
Contributor

@airhorns airhorns commented Mar 23, 2026

Summary

  • Increase scan_batch from 1024 → 4096 so a single scan pass can refill the buffer from low-watermark to target, eliminating 3x redundant re-reads per refill cycle
  • Cache the SlateDB scan iterator across passes: the scanner marches forward through the key range without recreating the iterator or re-reading buffered entries, stops at future-timestamped tasks, and resumes from there on the next pass
  • When a future task is encountered, the consumed entry is held back and re-evaluated on the next pass with a fresh now_ms, so it is never lost from the iterator
  • wakeup() invalidates the cached iterator, ensuring newly-created tasks (from enqueue, expedite, restart, concurrency grants, retries) that sort before the old cursor position are picked up via a fresh scan from the beginning
  • Iterator is also refreshed after MAX_ITER_AGE (5s) to pick up scheduled tasks that have become ready
  • Add 9 tests covering every code path that creates tasks "behind the cursor"
  • Add real-shard dequeue benchmark exercising the full TaskBroker pipeline

Test plan

  • All 9 scanner ordering tests pass (enqueue at time=0, expedite, restart, concurrency grants, retries, priority ordering, multi-group, bulk drain)
  • All 21 floating concurrency tests pass
  • Existing dequeue, enqueue, cancel, and metrics tests pass
  • Clippy clean
  • CI green

🤖 Generated with Claude Code


Note

Medium Risk
Changes core dequeue/scanner behavior by introducing a cached iterator and new stop/resume semantics around future tasks and wakeups, which could affect task availability/latency if incorrect. Risk is mitigated by extensive new regression tests covering behind-cursor task creation paths.

Overview
Optimizes TaskBroker scanning to avoid redundant DB re-reads by caching a SlateDB DbIterator across scan passes, advancing a cursor through the task-group key range instead of restarting from the beginning every time.

The scan loop now refreshes the iterator on wakeup(), on iterator exhaustion, or after MAX_ITER_AGE (5s) to pick up newly-ready scheduled work; it also halts at the first future task and carries that consumed entry forward for re-evaluation on the next pass.

Adds a new task_scanner_ordering_tests.rs suite to ensure tasks created “behind the cursor” (e.g., start_at_ms=0, expedite/restart, concurrency grants, retries, priority ordering, multi-group, bulk enqueue) are still eventually dequeued, and updates task_scanner_profile benchmark to measure real sustained dequeue throughput instead of synthetic scan timing.

Written by Cursor Bugbot for commit cdc0501. This will update automatically on new commits. Configure here.

@airhorns airhorns changed the title from "Adjust task scanner batch size up to make sure we can fill the whole buffer in one scan" to "Optimize TaskBroker scanning: larger batches + cursor-based scan" on Mar 23, 2026

@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.


Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

airhorns and others added 6 commits March 23, 2026 18:14
Tests that all code paths creating task keys (enqueue, expedite, restart,
concurrency grants, retries, priority ordering) result in tasks being
picked up by the TaskBroker scanner even when those tasks sort before
entries already in the buffer. This coverage is important before
optimizing the scanner to use cursor-based scanning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The TaskBroker scanner previously always started from the beginning of
the task group key range on every scan pass. When the buffer already
contained N entries, each scan had to read through all N before finding
new entries to insert — O(buffer_size) wasted reads per scan.

Now the scanner tracks a cursor (the last key read) and resumes from
there on subsequent passes within a fill cycle. This eliminates the
redundant re-reads of already-buffered entries.

The cursor is invalidated whenever wakeup() fires, which happens on
every code path that creates new task keys (enqueue, expedite, restart,
concurrency grants, retries). This ensures tasks created behind the
cursor are always picked up on the next scan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replaces the synthetic scan simulation with a benchmark that exercises
the full TaskBroker pipeline end-to-end: enqueue 50K tasks via import,
then drain them via sustained dequeue calls measuring throughput and
per-batch latency.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of tracking a cursor key and creating a new SlateDB iterator
each scan pass, hold the actual DbIterator as a local variable in the
scan loop and march it forward. This avoids both the iterator creation
cost (~160us) and the cost of re-reading already-buffered entries.

The iterator stops when it hits future-timestamped tasks (HitFuture)
or exhausts the range (Exhausted). It is dropped and recreated when:
- wakeup() fires (new tasks may have been written behind the cursor)
- The iterator is older than MAX_ITER_AGE (5s)
- The iterator is exhausted

When a future task is encountered, the consumed entry is held back and
re-evaluated on the next pass with a fresh now_ms, so it is never lost
from the iterator.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>