Refactor: Centralize cache directory management with new utilities#19
Refactor: Centralize cache directory management with new utilities#19lee-t wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Pull request overview
This PR refactors the codebase to centralize cache directory management and file download operations. The changes eliminate code duplication across multiple data pipeline modules by introducing reusable utilities in data_import_funcs.py, including get_cache_dir(), cached_download(), and related helper functions. The refactoring updates five data pipeline modules to use these centralized utilities instead of implementing their own download and caching logic.
Key changes:
- Introduced centralized cache management utilities (
get_cache_dir,set_default_cache_dir,cached_download) with support for environment variable configuration (PROTEINGYM_CACHE_DIR) - Migrated five data pipeline modules to use the new utilities, removing ~150 lines of duplicated download code
- Updated function signatures to accept
Noneforcache_dirparameters (defaults to.cacheor environment variable)
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.
Show a summary per file
| File | Description |
|---|---|
src/proteingympy/data_import_funcs.py |
Added centralized cache utilities including get_cache_dir(), cached_download(), decorator pattern for downloads, and batch download support. Updated existing download functions to use new utilities. |
src/proteingympy/make_zeroshot_dms_benchmarks.py |
Migrated from manual download implementation to cached_download(). Updated imports and function signatures to support optional cache directory. |
src/proteingympy/make_zero_shot_substitutions.py |
Replaced manual download logic with cached_download(). Updated imports and removed unused re module. |
src/proteingympy/make_supervised_scores.py |
Converted to use centralized caching utilities. Adopted Path objects for file operations. Updated imports, removing os and re modules. |
src/proteingympy/make_dms_substitutions.py |
Migrated download logic to cached_download(). Updated imports and function parameters to use optional cache directory. |
src/proteingympy/make_alphamissense_supplementary.py |
Integrated get_cache_dir() for cache path resolution. Updated to use Path objects for file operations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| import pandas as pd | ||
| import requests | ||
| import tempfile | ||
| from pathlib import Path | ||
| import zipfile | ||
| from typing import Dict, List, Optional, Tuple | ||
| import re | ||
|
|
||
| from .data_import_funcs import cached_download, get_cache_dir |
There was a problem hiding this comment.
The os module import was removed but it's still being used in the _load_supervised_fold_data function at lines 93-94 (os.path.basename and os.path.splitext). Either add back import os or replace these with Path equivalents (e.g., Path(csv_file).stem).
| import pandas as pd | ||
| import requests | ||
| from typing import Dict, List, Optional | ||
| import tempfile | ||
| from pathlib import Path | ||
| import zipfile | ||
|
|
||
| from .data_import_funcs import cached_download, get_cache_dir |
There was a problem hiding this comment.
The os module import was removed but it's still being used at line 52 (os.path.basename and os.path.splitext). Either add back import os or replace with Path equivalents (e.g., Path(csv_file).stem).
| import pandas as pd | ||
| import requests | ||
| import tempfile | ||
| from pathlib import Path | ||
| import zipfile | ||
| from typing import Dict, List, Optional, Any | ||
| import re | ||
|
|
||
| from .data_import_funcs import cached_download, get_cache_dir |
There was a problem hiding this comment.
The os module import was removed but it's still being used in the _load_zero_shot_data function at lines 75-76 (os.path.basename and os.path.splitext). Either add back import os or replace with Path equivalents (e.g., Path(csv_file).stem).
| downloads: list[tuple[str, str]], | ||
| cache_dir: Optional[Union[str, Path]] = None, | ||
| use_cache: bool = True, | ||
| chunk_size: int = 8192 | ||
| ) -> dict[str, Path]: |
There was a problem hiding this comment.
The lowercase list and dict type hints (lines 180, 184) require Python 3.9+. For broader compatibility, use List[Tuple[str, str]] and Dict[str, Path] from the typing module instead. Add from typing import List, Dict, Tuple to the imports.
| import io | ||
| import json | ||
| import os | ||
| import re | ||
| import zipfile | ||
| from typing import Dict, Optional | ||
| from pathlib import Path | ||
|
|
||
| import pandas as pd | ||
| import requests | ||
|
|
||
| from .data_import_funcs import cached_download, get_cache_dir |
There was a problem hiding this comment.
The os module import was removed but the module is still being used in multiple functions: os.makedirs and os.path.dirname (line 31), os.path.exists (line 50), and os.replace (line 64). Add back import os to prevent NameError at runtime.
| import pandas as pd | ||
| import requests | ||
| import tempfile | ||
| from pathlib import Path |
There was a problem hiding this comment.
Import of 'Path' is not used.
| from pathlib import Path |
| import pandas as pd | ||
| import requests | ||
| import tempfile | ||
| from pathlib import Path |
There was a problem hiding this comment.
Import of 'Path' is not used.
| from pathlib import Path |
| from typing import Dict, List, Optional, Any | ||
| import numpy as np | ||
|
|
||
| from .data_import_funcs import cached_download, get_cache_dir |
There was a problem hiding this comment.
Import of 'get_cache_dir' is not used.
| from .data_import_funcs import cached_download, get_cache_dir | |
| from .data_import_funcs import cached_download |
| import pandas as pd | ||
| import requests | ||
| import tempfile | ||
| from pathlib import Path |
There was a problem hiding this comment.
Import of 'Path' is not used.
| from pathlib import Path |
| from typing import Dict, List, Optional, Any | ||
| import re | ||
|
|
||
| from .data_import_funcs import cached_download, get_cache_dir |
There was a problem hiding this comment.
Import of 'get_cache_dir' is not used.
| from .data_import_funcs import cached_download, get_cache_dir | |
| from .data_import_funcs import cached_download |
No description provided.