perf: mtime+size fast skip for indexing by fatherlinux · Pull Request #5 · crunchtools/mcp-trove

fatherlinux · 2026-03-21T07:39:56Z

Summary

Adds mtime REAL column to files table with migration for existing DBs
_check_unchanged() now stat()s mtime+size first — if both match the DB record, SHA-256 is skipped entirely
For 35K files (214GB), re-indexing unchanged files drops from 34+ minutes (reading all 214GB) to seconds (just stat() calls)
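
For readers who want the shape of the fast path, here is a minimal sketch of the check described above. It is not the code from this PR: `FileRecord`, `sha256_of`, and `is_unchanged` are hypothetical names, and the real `_check_unchanged()` may be structured differently.

```python
import hashlib
import os
from pathlib import Path
from typing import NamedTuple, Optional


class FileRecord(NamedTuple):
    """Hypothetical shape of a stored row in the files table."""
    checksum: str
    file_size: int
    mtime: float


def sha256_of(path: Path) -> str:
    """Stream the whole file through SHA-256 (the expensive step)."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path: Path, record: Optional[FileRecord]) -> bool:
    """Return True if the file on disk matches the stored record."""
    if record is None:
        return False
    st = path.stat()
    # Fast path: size and mtime both match, so the file is never read.
    if st.st_size == record.file_size and st.st_mtime == record.mtime:
        return True
    # Slow path: metadata differs, so compare content hashes instead.
    return sha256_of(path) == record.checksum
```

The trade-off is the usual one for mtime+size checks: a file rewritten in place with the same size and a restored mtime would be treated as unchanged.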

Test plan

uv run ruff check src tests — all checks passed
uv run mypy src — no issues
uv run pytest -v — 91 passed, 1 skipped
gourmand --full . — 0 violations
Re-run trove_index_tool on ~/Documents/Recreational/Travel after deploy — should complete in seconds

🤖 Generated with Claude Code

Indexing 35K files (214GB) took 34+ minutes even when all files were already indexed because _check_unchanged() computed SHA-256 on every file. Now stat() mtime+size is checked first; if both match the DB record, the expensive checksum is skipped entirely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

gemini-code-assist · 2026-03-21T07:40:15Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant performance enhancement to the file indexing process. By leveraging file modification times and sizes, the system can now quickly identify and skip files that have not changed, avoiding the computationally intensive SHA-256 checksum calculation. This optimization drastically reduces the time required for re-indexing large collections of static files.

Highlights

Performance Optimization: Implemented a fast-skip mechanism for file indexing by comparing mtime (modification time) and file_size before computing expensive SHA-256 checksums.
Database Schema Update: Added an mtime column to the files table in the database schema, along with a migration script for existing databases.
Significant Speedup: Achieved substantial performance gains for re-indexing unchanged files, reducing processing time from minutes to seconds for large datasets.
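
The migration mentioned in the highlights could look roughly like the sketch below. It assumes the database is SQLite (the `REAL` column type suggests this, but the page does not say so), and `ensure_mtime_column` is a hypothetical helper name, not a function from the PR.

```python
import sqlite3


def ensure_mtime_column(conn: sqlite3.Connection) -> bool:
    """Add files.mtime if it is missing; return True when it was added."""
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
    columns = {row[1] for row in conn.execute("PRAGMA table_info(files)")}
    if "mtime" in columns:
        return False
    # Rows that already exist get NULL in the new column.
    conn.execute("ALTER TABLE files ADD COLUMN mtime REAL")
    conn.commit()
    return True
```

Because pre-existing rows would hold NULL under this approach, the first re-index after the migration presumably still has to fall back to the checksum; the fast path only applies once an mtime has been stored.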

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a significant performance optimization for re-indexing files by using file modification time and size to skip expensive checksum calculations. The changes are well-implemented across the database and indexer modules. A new mtime column is added to the files table with a corresponding migration script that correctly handles existing databases. The core logic in _check_unchanged effectively uses this new information for a "fast path" check. My review includes a couple of minor suggestions to improve code clarity and make the contracts of the new function signatures more explicit. Overall, this is a great improvement.

gemini-code-assist · 2026-03-21T07:41:30Z

 def _extract_one(
-    path: Path, checksum: str, file_size: int, existing_id: int | None
+    path: Path,
+    checksum: str,
+    file_size: int,
+    existing_id: int | None,
+    mtime: float = 0.0,
 ) -> dict[str, Any]:


The mtime parameter has a default value (0.0), but it appears to be a required piece of information that is always provided by all callers. To make the function's contract clearer and avoid potential bugs if it's ever called without this argument, it would be better to make it a required positional argument by removing the default value.

Suggested change

 def _extract_one(
     path: Path,
     checksum: str,
     file_size: int,
     existing_id: int | None,
-    mtime: float = 0.0,
+    mtime: float,
 ) -> dict[str, Any]:

gemini-code-assist · 2026-03-21T07:41:31Z

    checksum = str(extraction["checksum"])
    file_type = str(extraction["file_type"])
    file_size = int(extraction["file_size"])
+    mtime = float(extraction.get("mtime") or 0.0)


The extraction dictionary is expected to always contain the mtime key for files that are being stored, as it's consistently added in _extract_one. Using extraction.get("mtime") or 0.0 is unnecessarily defensive. It's better to access it directly with extraction["mtime"]. This simplifies the code and ensures that if this assumption is ever broken, it will raise a KeyError immediately, making debugging easier.

Suggested change

-    mtime = float(extraction.get("mtime") or 0.0)
+    mtime = float(extraction["mtime"])
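
To make the difference between the two forms concrete, the following sketch wraps each in a small function. `mtime_defensive` and `mtime_strict` are hypothetical names used only for illustration; the expressions inside them are the two lines from the suggestion.

```python
from typing import Any


def mtime_defensive(extraction: dict[str, Any]) -> float:
    # Original form: a missing key or a None value silently becomes 0.0.
    return float(extraction.get("mtime") or 0.0)


def mtime_strict(extraction: dict[str, Any]) -> float:
    # Suggested form: a missing key raises KeyError where the bug occurs.
    return float(extraction["mtime"])
```

With an empty dict, `mtime_defensive({})` returns `0.0` while `mtime_strict({})` raises `KeyError`. Note that the strict form raises `TypeError`, not `KeyError`, if the key is present but holds `None`.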

fatherlinux merged commit 3c6137a into main Mar 21, 2026

gemini-code-assist Bot reviewed Mar 21, 2026

fatherlinux merged 1 commit into main from perf/mtime-skip
