Skip to content

Fix scan_requested fast-path defeating scanner backoff#215

Open
airhorns wants to merge 2 commits intomainfrom
fix-scan-requested-fastpath
Open

Fix scan_requested fast-path defeating scanner backoff#215
airhorns wants to merge 2 commits intomainfrom
fix-scan-requested-fastpath

Conversation

@airhorns
Copy link
Copy Markdown
Contributor

@airhorns airhorns commented Mar 24, 2026

Summary

The scan_requested fast-path in the TaskBroker scan loop unconditionally used a 1ms sleep whenever wakeup() had been called. With continuous enqueues (each calling wakeup()), scan_requested was permanently true, causing the scanner to spin at 100+ scans/s even when every scan found 0 new entries.

Before: scan_requested=true → always 1ms sleep → 100-150 scans/s, buffer never fills past ~40 entries
After: scan_requested=true + found tasks → 1ms sleep (fast-path preserved). scan_requested=true + found nothing → 5ms sleep (caps empty scans at ~200/s, responsive to nudge loop)

Single-line behavioral change in the scan loop — no architectural changes.

Test plan

  • All 71 existing tests pass (dequeue, enqueue, cancel, floating concurrency, metrics)
  • Clippy clean

🤖 Generated with Claude Code


Note

Medium Risk
Adjusts TaskBroker scanner timing and adds per-scan info logging, which can impact dequeue latency and production log volume under load. Benchmark changes are low risk, but scanner-loop behavior is performance-critical.

Overview
Fixes the scan_requested fast-path in the TaskBroker background scan loop so wakeups no longer force a 1ms sleep when a scan inserts zero tasks; instead it uses a short (5ms) delay to avoid busy-spinning during continuous enqueues while staying responsive when work is found.

Adds scan_tasks breakdown counters and emits a tracing::info! log summarizing read/skip/insert reasons per scan, and exposes a new shard helper broker_buffer_len() for total broker buffer size.

Reworks the task_scanner_profile bench to measure sustained dequeue throughput via real dequeue() calls (batch sizes 32 and 1) rather than synthetic key-range scan microbenchmarks.

Written by Cursor Bugbot for commit 940879a. This will update automatically on new commits. Configure here.

// (50ms, too slow for the nudge loop's 25ms window).
// 5ms matches the nudge poll interval and caps the
// scan rate at ~200/s during active workloads.
sleep_ms = 5;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notification permit defeats intended 5ms sleep path

Medium Severity

When scan_requested is true and inserted == 0, sleep_ms is set to 5 and execution falls through to the tokio::select! block. However, the same wakeup() call that set scan_requested = true also called notify.notify_one(), which stores a permit in the Notify. When the scanner enters tokio::select!, notified() resolves immediately by consuming that stored permit, so the 5ms sleep never actually happens. With continuous enqueues, a fresh permit is stored during every scan, making the effective sleep 0ms — potentially spinning faster than the old 1ms continue path this change replaces.

Additional Locations (1)
Fix in Cursor Fix in Web

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

buffer_len = self.buffer.len(),
tombstone_count = self.ack_tombstones.lock().unwrap().len(),
"scan_tasks breakdown"
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hot-path info-level log causes excessive production log volume

Medium Severity

The new tracing::info! in scan_tasks fires on every scan iteration that reads at least one entry from the DB. During active scanning, this loop runs up to 200+ times per second per task group. All other logging in this module uses debug! level. This info!-level log on the hot path would produce massive log volume in production and is likely a leftover from debugging the backoff fix.

Fix in Cursor Fix in Web

/// Get the total number of tasks in the broker buffer across all task groups.
pub fn broker_buffer_len(&self) -> usize {
self.brokers.buffer_len()
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New public method broker_buffer_len is never called

Low Severity

The newly added broker_buffer_len method on JobStoreShard is not called anywhere in the codebase — not in source, tests, or benchmarks. It expands the public API surface without a consumer.

Fix in Cursor Fix in Web

airhorns and others added 2 commits March 25, 2026 21:23
The scan_requested fast-path unconditionally used a 1ms sleep whenever
wakeup() had been called, regardless of whether the scan found anything.
With continuous enqueues (each calling wakeup), scan_requested was
permanently true, causing the scanner to spin at 100+ scans/s even when
every scan found 0 new entries to buffer.

Now the fast-path only uses 1ms when the scan actually inserted tasks.
When scan_requested was set but the scan found nothing new, use a 5ms
sleep instead — short enough for the nudge loop's 25ms polling window
but long enough to prevent the scanner from spinning at 1000/s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Adds debug-level logging to scan_tasks showing per-scan breakdown:
total entries read, inserted, and counts for each skip reason (future,
inflight, tombstone, already_buffered, defunct). Helps diagnose why the
buffer isn't filling in production.

Also exposes broker_buffer_len() on JobStoreShard for tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@airhorns airhorns force-pushed the fix-scan-requested-fastpath branch from ca2beda to 940879a Compare March 26, 2026 01:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant