Skip to content

Refactor: Centralize cache directory management with new utilities#19

Open
lee-t wants to merge 1 commit into
mainfrom
cache-utils
Open

Refactor: Centralize cache directory management with new utilities#19
lee-t wants to merge 1 commit into
mainfrom
cache-utils

Conversation

@lee-t
Copy link
Copy Markdown
Collaborator

@lee-t lee-t commented Nov 21, 2025

No description provided.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR refactors the codebase to centralize cache directory management and file download operations. The changes eliminate code duplication across multiple data pipeline modules by introducing reusable utilities in data_import_funcs.py, including get_cache_dir(), cached_download(), and related helper functions. The refactoring updates five data pipeline modules to use these centralized utilities instead of implementing their own download and caching logic.

Key changes:

  • Introduced centralized cache management utilities (get_cache_dir, set_default_cache_dir, cached_download) with support for environment variable configuration (PROTEINGYM_CACHE_DIR)
  • Migrated five data pipeline modules to use the new utilities, removing ~150 lines of duplicated download code
  • Updated function signatures to accept None for cache_dir parameters (defaults to .cache or environment variable)

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 13 comments.

Show a summary per file
File Description
src/proteingympy/data_import_funcs.py Added centralized cache utilities including get_cache_dir(), cached_download(), decorator pattern for downloads, and batch download support. Updated existing download functions to use new utilities.
src/proteingympy/make_zeroshot_dms_benchmarks.py Migrated from manual download implementation to cached_download(). Updated imports and function signatures to support optional cache directory.
src/proteingympy/make_zero_shot_substitutions.py Replaced manual download logic with cached_download(). Updated imports and removed unused re module.
src/proteingympy/make_supervised_scores.py Converted to use centralized caching utilities. Adopted Path objects for file operations. Updated imports, removing os and re modules.
src/proteingympy/make_dms_substitutions.py Migrated download logic to cached_download(). Updated imports and function parameters to use optional cache directory.
src/proteingympy/make_alphamissense_supplementary.py Integrated get_cache_dir() for cache path resolution. Updated to use Path objects for file operations.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 8 to +14
import pandas as pd
import requests
import tempfile
from pathlib import Path
import zipfile
from typing import Dict, List, Optional, Tuple
import re

from .data_import_funcs import cached_download, get_cache_dir
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The os module import was removed but it's still being used in the _load_supervised_fold_data function at lines 93-94 (os.path.basename and os.path.splitext). Either add back import os or replace these with Path equivalents (e.g., Path(csv_file).stem).

Copilot uses AI. Check for mistakes.
Comment on lines 8 to +14
import pandas as pd
import requests
from typing import Dict, List, Optional
import tempfile
from pathlib import Path
import zipfile

from .data_import_funcs import cached_download, get_cache_dir
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The os module import was removed but it's still being used at line 52 (os.path.basename and os.path.splitext). Either add back import os or replace with Path equivalents (e.g., Path(csv_file).stem).

Copilot uses AI. Check for mistakes.
Comment on lines 7 to +13
import pandas as pd
import requests
import tempfile
from pathlib import Path
import zipfile
from typing import Dict, List, Optional, Any
import re

from .data_import_funcs import cached_download, get_cache_dir
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The os module import was removed but it's still being used in the _load_zero_shot_data function at lines 75-76 (os.path.basename and os.path.splitext). Either add back import os or replace with Path equivalents (e.g., Path(csv_file).stem).

Copilot uses AI. Check for mistakes.
Comment on lines +180 to +184
downloads: list[tuple[str, str]],
cache_dir: Optional[Union[str, Path]] = None,
use_cache: bool = True,
chunk_size: int = 8192
) -> dict[str, Path]:
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lowercase list and dict type hints (lines 180, 184) require Python 3.9+. For broader compatibility, use List[Tuple[str, str]] and Dict[str, Path] from the typing module instead. Add from typing import List, Dict, Tuple to the imports.

Copilot uses AI. Check for mistakes.
Comment on lines 7 to +17
import io
import json
import os
import re
import zipfile
from typing import Dict, Optional
from pathlib import Path

import pandas as pd
import requests

from .data_import_funcs import cached_download, get_cache_dir
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The os module import was removed but the module is still being used in multiple functions: os.makedirs and os.path.dirname (line 31), os.path.exists (line 50), and os.replace (line 64). Add back import os to prevent NameError at runtime.

Copilot uses AI. Check for mistakes.
import pandas as pd
import requests
import tempfile
from pathlib import Path
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'Path' is not used.

Suggested change
from pathlib import Path

Copilot uses AI. Check for mistakes.
import pandas as pd
import requests
import tempfile
from pathlib import Path
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'Path' is not used.

Suggested change
from pathlib import Path

Copilot uses AI. Check for mistakes.
from typing import Dict, List, Optional, Any
import numpy as np

from .data_import_funcs import cached_download, get_cache_dir
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'get_cache_dir' is not used.

Suggested change
from .data_import_funcs import cached_download, get_cache_dir
from .data_import_funcs import cached_download

Copilot uses AI. Check for mistakes.
import pandas as pd
import requests
import tempfile
from pathlib import Path
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'Path' is not used.

Suggested change
from pathlib import Path

Copilot uses AI. Check for mistakes.
from typing import Dict, List, Optional, Any
import re

from .data_import_funcs import cached_download, get_cache_dir
Copy link

Copilot AI Nov 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import of 'get_cache_dir' is not used.

Suggested change
from .data_import_funcs import cached_download, get_cache_dir
from .data_import_funcs import cached_download

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants