Skip to content

fix(ssd-cache): inline LRU unlinks so eviction frees queue capacity#1451

Open
cfbraun wants to merge 1 commit into
jundot:mainfrom
cfbraun:pr/inline-lru-unlinks
Open

fix(ssd-cache): inline LRU unlinks so eviction frees queue capacity#1451
cfbraun wants to merge 1 commit into
jundot:mainfrom
cfbraun:pr/inline-lru-unlinks

Conversation

@cfbraun
Copy link
Copy Markdown
Contributor

@cfbraun cfbraun commented May 27, 2026

Summary

_enforce_size_limit_for_new_block enqueues evicted file unlinks as ("unlink", path) items onto _write_queue — the same bounded queue that carries pending writes. Combined with the pre-eviction _write_queue.full() short-circuit at the top of save_block, this creates a deadlock under sustained save pressure:

  1. Writer is saturated → _write_queue is full.
  2. save_block's pre-eviction check sees full() → returns False immediately, before calling _enforce_size_limit_for_new_block.
  3. Eviction never runs → cache stays at the size cap.
  4. Every subsequent save drops; ssd_write_drops climbs forever while _total_size sits pinned at the cap.

Fix: inline the unlinks on the eviction-calling thread instead. Eviction typically removes a single block per save (evict_until_size stops as soon as total_size <= target), so this is one syscall per save in steady state. The deferred-unlink justification ("avoid blocking the inference thread with N file delete syscalls") doesn't materialise under normal load, and inlining removes the bounded-queue contention entirely.

Bounded inline burst. The ENOSPC-recovery path invalidates the 30 s disk-usage cache, which can shrink _get_effective_max_size sharply — evict_until_size would then return hundreds of entries at once and the inline-unlink loop would stall the inference thread on a syscall storm. Cap the burst at _MAX_INLINE_UNLINKS_PER_SAVE = 32 and reinsert the deferred metadata into the index so subsequent saves drain the remainder.

Also adds an evict_unlink_failures stats counter (eviction now decrements the index before the on-disk unlink; if Path.unlink raises OSError, surfacing the counter lets operators see when on-disk size has drifted above what the index reports).

The dead writer-thread ("unlink", file_path) dispatch branch is removed since no path enqueues such tuples anymore.

Test plan

  • pytest tests/test_paged_ssd_cache.py::TestInlineLRUUnlinks — 4 passed
  • pytest tests/test_paged_ssd_cache.py tests/test_hot_cache.py — 125 passed (4 new + 121 existing)

@cfbraun cfbraun force-pushed the pr/inline-lru-unlinks branch 2 times, most recently from a9792b3 to 6577237 Compare May 27, 2026 09:30
@cfbraun cfbraun force-pushed the pr/inline-lru-unlinks branch 8 times, most recently from 0c7cbd8 to 73f0184 Compare June 2, 2026 06:08
``_enforce_size_limit_for_new_block`` enqueues evicted file unlinks as
``("unlink", path)`` items onto ``_write_queue`` — the same bounded
queue that carries pending writes. Combined with the pre-eviction
``_write_queue.full()`` short-circuit at the top of ``save_block``, this
creates a deadlock under sustained save pressure:

  1. Writer is saturated → ``_write_queue`` is full.
  2. ``save_block``'s pre-eviction check sees ``full()`` → returns False
     immediately, BEFORE calling ``_enforce_size_limit_for_new_block``.
  3. Eviction never runs → cache stays at the size cap.
  4. Every subsequent save drops; ``ssd_write_drops`` climbs forever
     while ``_total_size`` sits pinned at the cap.

Inline the unlinks on the eviction-calling thread instead. Eviction
typically removes a single block per save (``evict_until_size`` stops
as soon as ``total_size <= target``), so this is one syscall per save
in steady state. The deferred-unlink justification ("avoid blocking
the inference thread with N file delete syscalls") doesn't materialise
under normal load, and inlining removes the bounded-queue contention
entirely.

Bounded inline burst. The ENOSPC-recovery path invalidates the 30 s
disk-usage cache, which can shrink ``_get_effective_max_size``
sharply on the next save — ``evict_until_size`` would then return
hundreds of entries at once and the inline-unlink loop would stall
the inference thread on a syscall storm. Cap the burst at
``_MAX_INLINE_UNLINKS_PER_SAVE = 32`` and reinsert the deferred
metadata into the index so subsequent saves drain the remainder.
Bounds per-call latency at the cost of taking multiple saves to fully
reconverge.

``evict_unlink_failures`` stats counter. Eviction now decrements the
index before the on-disk unlink; if ``Path.unlink`` raises ``OSError``,
the previous "log a warning and move on" pattern silently lost the
signal. Surfacing the counter lets operators see that the on-disk
size has drifted above what the index reports.

Tests (tests/test_paged_ssd_cache.py::TestInlineLRUUnlinks):
- test_eviction_does_not_enqueue_unlink_tasks: sentinel-patch on
  ``put_nowait`` asserts no ``("unlink", ...)`` items ever enter the
  queue.
- test_eviction_frees_capacity_under_pressure: with the writer busy,
  eviction still keeps ``_index.total_size`` near the configured cap.
- test_inline_eviction_burst_is_capped: forced mass-eviction removes
  at most ``_MAX_INLINE_UNLINKS_PER_SAVE`` entries; the rest reinsert
  so subsequent saves can drain.
- test_unlink_failure_increments_counter: a patched ``OSError`` from
  ``Path.unlink`` increments ``evict_unlink_failures``.

The dead writer-thread ``("unlink", file_path)`` dispatch branch is
removed since no path enqueues such tuples anymore.

87 existing paged_ssd_cache tests + 34 hot_cache tests + 4 new tests
pass.
@cfbraun cfbraun force-pushed the pr/inline-lru-unlinks branch from 73f0184 to 84fedc4 Compare June 2, 2026 13:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant