feat(dev): add file-based caching for library objects during development#3029
feat(dev): add file-based caching for library objects during development#3029
Conversation
Add optional file-based caching for expensive library objects (AutoCompleter, Linker, topic pools) that persist across Django server reloads during development. This significantly speeds up the development iteration cycle by avoiding re-initialization of these objects on every code change. New settings: - USE_DEV_FILE_CACHE: Enable/disable the feature (default: False) - DEV_FILE_CACHE_DIR: Cache directory (default: /tmp/sefaria_dev_cache) Modified methods to use file cache: - Library.build_full_auto_completer() - Library.build_lexicon_auto_completers() - Library.build_cross_lexicon_auto_completer() - Library.build_linker() - TopicManager.build_slug_to_pools_cache() Added management command: - python manage.py clear_dev_cache Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR adds optional file-based caching for expensive library objects (AutoCompleter, Linker, topic pools) during development to speed up Django server reloads. The feature uses pickle serialization to persist objects to disk at /tmp/sefaria_dev_cache.
Changes:
- Added file-based cache functions in
sefaria/system/cache.pyusing pickle serialization - Modified Library methods to check file cache before rebuilding objects
- Added management command
clear_dev_cacheto manage cached entries
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| sefaria/system/cache.py | Added file cache functions using pickle for save/load/clear operations |
| sefaria/settings.py | Added default settings for USE_DEV_FILE_CACHE and DEV_FILE_CACHE_DIR |
| sefaria/local_settings_example.py | Added documentation and example configuration for dev file cache |
| sefaria/model/text.py | Integrated file cache into AutoCompleter and Linker build methods |
| django_topics/models/topic.py | Integrated file cache into TopicManager's slug_to_pools cache building |
| reader/management/commands/clear_dev_cache.py | Added management command to clear and list cache entries |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # File-based caching for library objects (AutoCompleter, Linker) during development. | ||
| # When enabled, these expensive objects are serialized to disk after first build | ||
| # and loaded from disk on subsequent server reloads, significantly speeding up | ||
| # development iteration. Should NOT be used in production. |
There was a problem hiding this comment.
Consider adding documentation about when developers should clear the cache, such as after database changes, model updates, or when cached objects appear stale. This could be included in the docstring or as a comment in local_settings_example.py to help developers avoid debugging stale cache issues.
| # development iteration. Should NOT be used in production. | |
| # development iteration. Should NOT be used in production. | |
| # Cached files are stored under DEV_FILE_CACHE_DIR; if you change the database contents, | |
| # update models/serialization for these objects, or observe stale behavior, clear this | |
| # directory (delete its contents) so that fresh cache files will be rebuilt. |
| import pickle | ||
| from pathlib import Path |
There was a problem hiding this comment.
The pickle import should be placed at the top of the file with other standard library imports for better code organization and adherence to PEP 8 standards. Currently it's imported in the middle of the file after other code sections.
| try: | ||
| cache_path = _get_dev_cache_path(name) | ||
| with open(cache_path, 'wb') as f: | ||
| pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL) |
There was a problem hiding this comment.
Using pickle to deserialize untrusted data is a security risk as it can lead to arbitrary code execution. While this is intended for development use only, consider adding a warning in the documentation or using a safer serialization format like JSON where possible. At minimum, ensure the cache directory has appropriate permissions and is not world-writable.
| self.style.WARNING( | ||
| 'Dev file cache is not enabled. Set USE_DEV_FILE_CACHE = True in local_settings.py' | ||
| ) | ||
| ) |
There was a problem hiding this comment.
The command will still attempt to list or clear the cache even when USE_DEV_FILE_CACHE is False, which could be confusing. Consider returning early after displaying the warning message to prevent executing the cache operations when the feature is disabled.
| ) | |
| ) | |
| return |
| # ============================================================================= | ||
|
|
||
| import pickle | ||
| from pathlib import Path |
There was a problem hiding this comment.
The Path import should be placed at the top of the file with other imports for better code organization and adherence to PEP 8 standards. Currently it's imported in the middle of the file after other code sections.
| # Default values for settings that may be overridden in local_settings. | ||
| # These ensure the application works even if local_settings doesn't define them. | ||
| USE_DEV_FILE_CACHE = False | ||
| DEV_FILE_CACHE_DIR = "/tmp/sefaria_dev_cache" |
There was a problem hiding this comment.
Using /tmp as the default cache directory may cause issues on systems where /tmp is cleaned on reboot or in containerized environments. Consider using a more persistent location or documenting that developers should configure DEV_FILE_CACHE_DIR in local_settings.py for reliable caching across system restarts.
| DEV_FILE_CACHE_DIR = "/tmp/sefaria_dev_cache" | |
| DEV_FILE_CACHE_DIR = os.path.join(os.path.expanduser("~"), ".sefaria_dev_cache") |
| cached = load_from_dev_file_cache("topic_slug_to_pools") | ||
| if cached is not None: | ||
| self.slug_to_pools = cached |
There was a problem hiding this comment.
After loading from cache, slug_to_pools is a regular dict, but when built from scratch it's a defaultdict(list). This inconsistency could cause issues. Consider either: 1) wrapping the cached dict in defaultdict(list, cached) after loading, or 2) always using a regular dict and avoiding defaultdict behavior.
| def _get_dev_file_cache_dir() -> Path: | ||
| """Get the development file cache directory from settings.""" | ||
| cache_dir = Path(getattr(settings, 'DEV_FILE_CACHE_DIR', '/tmp/sefaria_dev_cache')) | ||
| cache_dir.mkdir(parents=True, exist_ok=True) |
There was a problem hiding this comment.
The cache directory is created with default permissions using mkdir(parents=True, exist_ok=True). For security, consider explicitly setting restrictive permissions (e.g., mode=0o700) to ensure the cache directory is only accessible by the application user, preventing unauthorized access to pickled objects.
| cache_dir.mkdir(parents=True, exist_ok=True) | |
| cache_dir.mkdir(parents=True, exist_ok=True, mode=0o700) | |
| try: | |
| cache_dir.chmod(0o700) | |
| except Exception as e: | |
| # Log a warning but do not fail if we cannot enforce strict permissions | |
| logger.warning(f"Unable to set permissions on dev file cache dir {cache_dir}: {e}") |
🧪 CI InsightsHere's what we observed from your CI run for 1baf6d6. 🟢 All jobs passed!But CI Insights is watching 👀 |
Add optional file-based caching for expensive library objects (AutoCompleter, Linker, topic pools) that persist across Django server reloads during development. This significantly speeds up the development iteration cycle by avoiding re-initialization of these objects on every code change.
New settings:
Modified methods to use file cache:
Added management command:
Description
A brief description of the PR
Code Changes
The following changes were made to the files below
Notes
Any additional notes go here