perf: mtime+size fast skip for indexing#5
Conversation
Indexing 35K files (214GB) took 34+ minutes even when all files were already indexed because _check_unchanged() computed SHA-256 on every file. Now stat() mtime+size is checked first — if both match the DB record, the expensive checksum is skipped entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a significant performance enhancement to the file indexing process. By leveraging file modification times and sizes, the system can now quickly identify and skip files that have not changed, avoiding the computationally intensive SHA-256 checksum calculation. This optimization drastically reduces the time required for re-indexing large collections of static files. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant performance optimization for re-indexing files by using file modification time and size to skip expensive checksum calculations. The changes are well-implemented across the database and indexer modules. A new mtime column is added to the files table with a corresponding migration script that correctly handles existing databases. The core logic in _check_unchanged effectively uses this new information for a "fast path" check. My review includes a couple of minor suggestions to improve code clarity and make the contracts of the new function signatures more explicit. Overall, this is a great improvement.
| def _extract_one( | ||
| path: Path, checksum: str, file_size: int, existing_id: int | None | ||
| path: Path, | ||
| checksum: str, | ||
| file_size: int, | ||
| existing_id: int | None, | ||
| mtime: float = 0.0, | ||
| ) -> dict[str, Any]: |
There was a problem hiding this comment.
The mtime parameter has a default value (0.0), but it appears to be a required piece of information that is always provided by all callers. To make the function's contract clearer and avoid potential bugs if it's ever called without this argument, it would be better to make it a required positional argument by removing the default value.
| def _extract_one( | |
| path: Path, checksum: str, file_size: int, existing_id: int | None | |
| path: Path, | |
| checksum: str, | |
| file_size: int, | |
| existing_id: int | None, | |
| mtime: float = 0.0, | |
| ) -> dict[str, Any]: | |
| def _extract_one( | |
| path: Path, | |
| checksum: str, | |
| file_size: int, | |
| existing_id: int | None, | |
| mtime: float, | |
| ) -> dict[str, Any]: |
| checksum = str(extraction["checksum"]) | ||
| file_type = str(extraction["file_type"]) | ||
| file_size = int(extraction["file_size"]) | ||
| mtime = float(extraction.get("mtime") or 0.0) |
There was a problem hiding this comment.
The extraction dictionary is expected to always contain the mtime key for files that are being stored, as it's consistently added in _extract_one. Using extraction.get("mtime") or 0.0 is unnecessarily defensive. It's better to access it directly with extraction["mtime"]. This simplifies the code and ensures that if this assumption is ever broken, it will raise a KeyError immediately, making debugging easier.
| mtime = float(extraction.get("mtime") or 0.0) | |
| mtime = float(extraction["mtime"]) |
Summary
mtime REALcolumn tofilestable with migration for existing DBs_check_unchanged()now stat()s mtime+size first — if both match the DB record, SHA-256 is skipped entirelyTest plan
uv run ruff check src tests— all checks passeduv run mypy src— no issuesuv run pytest -v— 91 passed, 1 skippedgourmand --full .— 0 violationstrove_index_toolon~/Documents/Recreational/Travelafter deploy — should complete in seconds🤖 Generated with Claude Code