Disable paged cache when SSD cache init fails#1601

Open
lvsijian8 wants to merge 1 commit into
jundot:mainfrom
lvsijian8:fix/unwriteable-cache-dir

Conversation

@lvsijian8
Contributor

Problem

When paged SSD cache is configured but the backing directory is unavailable or not writable, scheduler startup can leave partial cache state behind: PagedSSDCacheManager fails to initialize, but PagedCacheManager, BlockAwarePrefixCache, and the store-cache executor can still remain active.

This is easy to hit when the cache directory points to an external drive that is not currently mounted.
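
For illustration only, the failure condition can be probed with a small helper. This is a sketch, not code from this PR: `cache_dir_is_usable` is a hypothetical name, and the real PagedSSDCacheManager may detect the problem differently. It treats a missing directory (for example an unmounted external drive) and a directory that rejects file creation the same way.

```python
import tempfile
from pathlib import Path


def cache_dir_is_usable(cache_dir: str) -> bool:
    """Return True if cache_dir exists and a file can be created in it.

    Hypothetical helper for illustration; not part of the project.
    """
    path = Path(cache_dir)
    # An unmounted drive usually shows up as a missing directory.
    if not path.is_dir():
        return False
    try:
        # Actually creating a file is a more direct check than os.access(),
        # which only inspects permission bits.
        with tempfile.TemporaryFile(dir=path):
            pass
    except OSError:
        return False
    return True
```

A probe like this only describes the state at the moment it runs, so initialization code would still need to handle errors raised later.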

Root cause

_init_tiered_cache() did not report whether paged SSD cache initialization succeeded. The scheduler therefore continued setting up the cache pipeline even after initialization failed.

Fix

  • Make _init_tiered_cache() return whether setup succeeded.
  • Only attach the cold-restore callback and start the store-cache executor when setup succeeds.
  • Clear paged cache runtime state when setup fails, so the scheduler falls back to normal no-cache operation for that run.
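
The three points above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual diff: `SketchSSDCache`, `SketchScheduler`, the attribute names, `_clear_paged_cache_state`, and `_restore_block` are all hypothetical stand-ins, and only `_init_tiered_cache()` returning a success flag is taken from the description.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor
from typing import Optional


class SketchSSDCache:
    """Stand-in for the SSD cache manager; raises if the directory is unusable."""

    def __init__(self, cache_dir: str) -> None:
        if not os.path.isdir(cache_dir):
            raise OSError(f"cache directory unavailable: {cache_dir}")
        # Creating a temporary file raises OSError if the directory is not writable.
        with tempfile.TemporaryFile(dir=cache_dir):
            pass
        self.cache_dir = cache_dir


class SketchScheduler:
    """Hypothetical scheduler showing the fallback pattern only."""

    def __init__(self, cache_dir: Optional[str]) -> None:
        self.paged_cache = None
        self.prefix_cache = None
        self.ssd_cache = None
        self.cold_restore_callback = None
        self.store_executor = None

        # Attach the callback and start the executor only when setup succeeded.
        if cache_dir is not None and self._init_tiered_cache(cache_dir):
            self.cold_restore_callback = self._restore_block
            self.store_executor = ThreadPoolExecutor(max_workers=1)

    def _init_tiered_cache(self, cache_dir: str) -> bool:
        """Return True on success; on failure, clear partial state and return False."""
        try:
            self.paged_cache = object()   # placeholder for the paged cache manager
            self.prefix_cache = object()  # placeholder for the prefix cache
            self.ssd_cache = SketchSSDCache(cache_dir)
            return True
        except OSError:
            self._clear_paged_cache_state()
            return False

    def _clear_paged_cache_state(self) -> None:
        # Drop everything created before the failure so no partial state remains.
        self.paged_cache = None
        self.prefix_cache = None
        self.ssd_cache = None

    def _restore_block(self, block_id: int) -> None:
        # Placeholder for a cold-restore callback.
        return None

    def close(self) -> None:
        if self.store_executor is not None:
            self.store_executor.shutdown(wait=True)
            self.store_executor = None
```

With a missing directory, every cache attribute in this sketch stays `None` and no executor thread is started, which mirrors the no-cache fallback described above.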

Tests

  • .venv/bin/pytest tests/test_scheduler.py::TestSchedulerInitialization::test_init_falls_back_when_paged_ssd_cache_unavailable -q
  • .venv/bin/pytest tests/test_scheduler.py -q
  • .venv/bin/pytest tests/test_scheduler.py::TestSchedulerConfig tests/test_cache_factory.py tests/test_paged_ssd_cache.py::TestPagedSSDCacheManager::test_initialization -q
  • .venv/bin/pytest tests/test_cache_factory.py tests/test_paged_ssd_cache.py::TestPagedSSDCacheManager::test_initialization -q
  • .venv/bin/ruff check --select I001,F401,F811 tests/test_scheduler.py
  • git diff --check

Fixes #1455

Development

Successfully merging this pull request may close these issues.

RFE: Allow starting/working with unwriteable cache directory
