perf: mtime+size fast skip for indexing #5

Merged
fatherlinux merged 1 commit into main from perf/mtime-skip on Mar 21, 2026

@fatherlinux
Member

Summary

  • Adds mtime REAL column to files table with migration for existing DBs
  • _check_unchanged() now stat()s mtime+size first — if both match the DB record, SHA-256 is skipped entirely
  • For 35K files (214GB), re-indexing unchanged files drops from 34+ minutes (reading all 214GB) to seconds (just stat() calls)
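The fast path described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual `_check_unchanged()`: the `FileRecord` class, the `sha256_of` helper, and the `is_unchanged` name are hypothetical stand-ins, and the fallback behavior when mtime or size differ is an assumption.

```python
from __future__ import annotations

import hashlib
import os
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileRecord:
    """Hypothetical stand-in for one row of the files table."""

    checksum: str
    file_size: int
    mtime: float


def sha256_of(path: Path) -> str:
    """Hash the file in 1 MiB chunks; this is the expensive step to avoid."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, record: FileRecord | None) -> bool:
    """Return True when the file on disk still matches the stored record."""
    if record is None:
        return False  # never indexed before
    st = path.stat()
    # Fast path: a single stat() call, no file contents are read.
    if st.st_size == record.file_size and st.st_mtime == record.mtime:
        return True
    # Slow path (assumed): fall back to comparing the full checksum.
    return sha256_of(path) == record.checksum
```

The trade-off is that a file rewritten with identical size and an identical mtime would be treated as unchanged, which is the usual compromise of mtime+size checks.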

Test plan

  • uv run ruff check src tests — all checks passed
  • uv run mypy src — no issues
  • uv run pytest -v — 91 passed, 1 skipped
  • gourmand --full . — 0 violations
  • Re-run trove_index_tool on ~/Documents/Recreational/Travel after deploy — should complete in seconds

🤖 Generated with Claude Code

Indexing 35K files (214GB) took 34+ minutes even when all files were
already indexed because _check_unchanged() computed SHA-256 on every
file. Now stat() mtime+size is checked first — if both match the DB
record, the expensive checksum is skipped entirely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

fatherlinux merged commit 3c6137a into main on Mar 21, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance enhancement to the file indexing process. By leveraging file modification times and sizes, the system can now quickly identify and skip files that have not changed, avoiding the computationally intensive SHA-256 checksum calculation. This optimization drastically reduces the time required for re-indexing large collections of static files.

Highlights

  • Performance Optimization: Implemented a fast-skip mechanism for file indexing by comparing mtime (modification time) and file_size before computing expensive SHA-256 checksums.
  • Database Schema Update: Added an mtime column to the files table in the database schema, along with a migration script for existing databases.
  • Significant Speedup: Achieved substantial performance gains for re-indexing unchanged files, reducing processing time from minutes to seconds for large datasets.
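
The schema update mentioned in the highlights could look roughly like the sketch below. It assumes the database is SQLite accessed through Python's `sqlite3` module, which the PR does not state; the `ensure_mtime_column` name is hypothetical, and only the `files` table and `mtime REAL` column come from the PR.

```python
import sqlite3


def ensure_mtime_column(conn: sqlite3.Connection) -> bool:
    """Add files.mtime if it is missing; return True when it was added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(files)")}
    if "mtime" in columns:
        return False  # already migrated, nothing to do
    conn.execute("ALTER TABLE files ADD COLUMN mtime REAL")
    conn.commit()
    return True
```

Under this approach, rows that existed before the migration get a NULL `mtime`, so they would presumably miss the fast path once and take the checksum path until their `mtime` is stored.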

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a significant performance optimization for re-indexing files by using file modification time and size to skip expensive checksum calculations. The changes are well-implemented across the database and indexer modules. A new mtime column is added to the files table with a corresponding migration script that correctly handles existing databases. The core logic in _check_unchanged effectively uses this new information for a "fast path" check. My review includes a couple of minor suggestions to improve code clarity and make the contracts of the new function signatures more explicit. Overall, this is a great improvement.

Comment on lines 127 to 133
 def _extract_one(
-    path: Path, checksum: str, file_size: int, existing_id: int | None
+    path: Path,
+    checksum: str,
+    file_size: int,
+    existing_id: int | None,
+    mtime: float = 0.0,
 ) -> dict[str, Any]:

medium

The mtime parameter has a default value (0.0), but it appears to be a required piece of information that is always provided by all callers. To make the function's contract clearer and avoid potential bugs if it's ever called without this argument, it would be better to make it a required positional argument by removing the default value.

Suggested change
 def _extract_one(
     path: Path,
     checksum: str,
     file_size: int,
     existing_id: int | None,
-    mtime: float = 0.0,
+    mtime: float,
 ) -> dict[str, Any]:
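
To illustrate the reviewer's point, the following sketch contrasts a defaulted and a required `mtime` parameter. The two functions are simplified, hypothetical stand-ins for `_extract_one` (the real function takes more arguments), and `call_without_mtime` exists only for the demonstration.

```python
from __future__ import annotations

from pathlib import Path
from typing import Any, Callable


def extract_with_default(path: Path, file_size: int, mtime: float = 0.0) -> dict[str, Any]:
    """Defaulted variant: a caller that forgets mtime silently records 0.0."""
    return {"path": str(path), "file_size": file_size, "mtime": mtime}


def extract_required(path: Path, file_size: int, mtime: float) -> dict[str, Any]:
    """Required variant: a caller that forgets mtime fails immediately."""
    return {"path": str(path), "file_size": file_size, "mtime": mtime}


def call_without_mtime(fn: Callable[..., dict[str, Any]]) -> str:
    """Call fn without mtime and report whether Python rejected the call."""
    try:
        fn(Path("example.txt"), 5)
    except TypeError:
        return "TypeError"
    return "ok"
```

With the default in place, a forgotten argument would store an mtime of 0.0, which is unlikely to match any real file, so that file would quietly lose the fast path rather than raise an error.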

checksum = str(extraction["checksum"])
file_type = str(extraction["file_type"])
file_size = int(extraction["file_size"])
mtime = float(extraction.get("mtime") or 0.0)

medium

The extraction dictionary is expected to always contain the mtime key for files that are being stored, as it's consistently added in _extract_one. Using extraction.get("mtime") or 0.0 is unnecessarily defensive. It's better to access it directly with extraction["mtime"]. This simplifies the code and ensures that if this assumption is ever broken, it will raise a KeyError immediately, making debugging easier.

Suggested change
- mtime = float(extraction.get("mtime") or 0.0)
+ mtime = float(extraction["mtime"])
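
The behavioral difference between the two forms can be seen in a small sketch. The helper names `mtime_defensive` and `mtime_strict` are hypothetical and simply wrap the two expressions from the suggested change.

```python
from typing import Any


def mtime_defensive(extraction: dict[str, Any]) -> float:
    """Original form: a missing, None, or zero mtime all collapse to 0.0."""
    return float(extraction.get("mtime") or 0.0)


def mtime_strict(extraction: dict[str, Any]) -> float:
    """Suggested form: a missing key raises KeyError at the call site."""
    return float(extraction["mtime"])


def strict_fails_on_missing_key() -> bool:
    """Return True when the strict form rejects a dict without mtime."""
    try:
        mtime_strict({"checksum": "abc"})
    except KeyError:
        return True
    return False
```

The defensive form hides a broken upstream contract by recording 0.0, while the strict form surfaces it as soon as the key is absent.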
