feat: per-file error tracking with trove_quality tool by fatherlinux · Pull Request #2 · crunchtools/mcp-trove

fatherlinux · 2026-03-21T06:21:27Z

Summary

Adds index_errors table to track per-file indexing failures with error type classification (transient vs permanent) and resolution tracking
Wires error recording into _extract_and_store_batched() and auto-resolves on successful re-index in _store_one()
Adds trove_quality MCP tool (tool #10) so LLM clients can query which files failed and decide to retry transient errors via trove_reindex
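To make the record-then-resolve flow above concrete, here is a minimal sketch using the standard-library sqlite3 module. Only the index_errors table name comes from this PR; the column names and the record_error and resolve_errors helpers are hypothetical and may not match the actual schema or code.

```python
import sqlite3

# Hypothetical schema: the PR names the index_errors table, but these
# column names are assumptions made for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS index_errors (
    id         INTEGER PRIMARY KEY,
    path       TEXT NOT NULL,
    error_type TEXT NOT NULL CHECK (error_type IN ('transient', 'permanent')),
    message    TEXT NOT NULL,
    resolved   INTEGER NOT NULL DEFAULT 0
)
"""


def record_error(conn: sqlite3.Connection, path: str, error_type: str, message: str) -> None:
    """Store one indexing failure for a file (hypothetical helper)."""
    conn.execute(
        "INSERT INTO index_errors (path, error_type, message) VALUES (?, ?, ?)",
        (path, error_type, message),
    )


def resolve_errors(conn: sqlite3.Connection, path: str) -> int:
    """Mark open errors for a file as resolved; returns the number of rows changed."""
    cur = conn.execute(
        "UPDATE index_errors SET resolved = 1 WHERE path = ? AND resolved = 0",
        (path,),
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_error(conn, "/docs/a.pdf", "transient", "extractor timed out")
record_error(conn, "/docs/b.bin", "permanent", "unsupported file type")

# Simulates a successful re-index of a.pdf clearing its open error.
changed = resolve_errors(conn, "/docs/a.pdf")
open_paths = [row[0] for row in conn.execute("SELECT path FROM index_errors WHERE resolved = 0")]
```

Keeping resolved rows instead of deleting them preserves the failure history, which is presumably what lets a quality report count resolved and unresolved errors separately.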

Test plan

uv run pytest -v — 91 passed, 1 skipped
uv run ruff check src tests — clean
uv run mypy src — clean
Deploy updated container, index a directory with known failures, call trove_quality to confirm errors are recorded

🤖 Generated with Claude Code

Record individual file failures during indexing in a new index_errors table so they can be queried later instead of grepping stderr logs. Errors are classified as transient (retryable) or permanent, and automatically marked resolved when a file is successfully re-indexed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
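The transient versus permanent split could be sketched as a small classifier over exception types. This is an illustration only: the PR does not show which failures it treats as retryable, so the TRANSIENT_EXCEPTIONS tuple and the classify_error name are assumptions.

```python
# Assumption: the exception types treated as retryable here are a guess
# for illustration, not the classification rules used in this PR.
TRANSIENT_EXCEPTIONS = (TimeoutError, ConnectionError, BlockingIOError)


def classify_error(exc: BaseException) -> str:
    """Return 'transient' for failures worth retrying, otherwise 'permanent'."""
    return "transient" if isinstance(exc, TRANSIENT_EXCEPTIONS) else "permanent"


timeout_kind = classify_error(TimeoutError("extractor timed out"))
format_kind = classify_error(ValueError("unsupported file type"))
```

Defaulting unknown exceptions to permanent is the conservative choice in this sketch, since it avoids retrying files that will never index.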

gemini-code-assist · 2026-03-21T06:21:55Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds per-file error tracking to the indexing pipeline. Indexing failures are logged per file, classified as transient or permanent, and tracked through to resolution. The new trove_quality tool lets external clients monitor indexing health and selectively re-process failed files.

Highlights

Per-file error tracking: Added an index_errors table to track individual file indexing failures, including error type classification (transient vs. permanent) and resolution status.
Automated error recording and resolution: Integrated error recording into the _extract_and_store_batched() process for failed extractions and implemented auto-resolution for errors when a file is successfully re-indexed in _store_one().
New trove_quality MCP tool: Introduced a new trove_quality tool (tool #10) that allows LLM clients to query files that failed indexing, helping them decide which transient errors to retry via trove_reindex.
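As a rough sketch of how a client might act on a trove_quality result, the snippet below picks out files worth passing to trove_reindex. The response shape (an "errors" list with path, error_type, and resolved keys) is hypothetical, since the PR does not show the tool's actual output.

```python
# Hypothetical response shape: the real trove_quality output is not shown
# in this PR, so these keys are assumptions for illustration.
quality_report = {
    "errors": [
        {"path": "/docs/a.pdf", "error_type": "transient", "resolved": False},
        {"path": "/docs/a.pdf", "error_type": "transient", "resolved": False},
        {"path": "/docs/b.bin", "error_type": "permanent", "resolved": False},
        {"path": "/docs/c.txt", "error_type": "transient", "resolved": True},
        {"path": "/docs/d.txt", "error_type": "transient", "resolved": False},
    ]
}


def paths_to_retry(report: dict) -> list[str]:
    """Return unique paths with unresolved transient errors, in first-seen order."""
    candidates = (
        entry["path"]
        for entry in report.get("errors", [])
        if entry["error_type"] == "transient" and not entry["resolved"]
    )
    # dict.fromkeys drops duplicates while preserving insertion order.
    return list(dict.fromkeys(candidates))


retry = paths_to_retry(quality_report)
```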

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces a valuable feature for per-file error tracking, including a new database table, API endpoint, and associated business logic. The implementation is well-structured and includes relevant tests. My review focuses on a significant performance and correctness issue in the trove_quality tool's implementation. I've provided a detailed code suggestion to address this by leveraging more efficient database query patterns, which will improve scalability and ensure accurate statistics.

gemini-code-assist · 2026-03-21T06:23:39Z

+    # Compute aggregate counts across all errors (not just the page returned)
+    all_errors = db.query_errors(resolved=None, path=path, limit=10_000)
+    total = len(all_errors)
+    resolved_count = sum(1 for e in all_errors if e["resolved"])
+    unresolved_count = total - resolved_count
+
+    by_type: dict[str, int] = dict(Counter(e["error_type"] for e in all_errors))


The current implementation for calculating aggregate error statistics is inefficient. It fetches up to 10,000 full error records into memory to perform calculations. This approach is memory-intensive, will be slow at scale, and will produce incorrect totals if the number of errors exceeds the 10,000 limit.

This can be performed much more efficiently by executing aggregate queries (COUNT, SUM, GROUP BY) directly in the database, which avoids loading all records into memory and removes the arbitrary limit.

Suggested change

-    # Compute aggregate counts across all errors (not just the page returned)
-    all_errors = db.query_errors(resolved=None, path=path, limit=10_000)
-    total = len(all_errors)
-    resolved_count = sum(1 for e in all_errors if e["resolved"])
-    unresolved_count = total - resolved_count
-    by_type: dict[str, int] = dict(Counter(e["error_type"] for e in all_errors))
+    # Compute aggregate counts efficiently in the database
+    where_sql = "WHERE path LIKE ?" if path else ""
+    params = (path + "%",) if path else ()
+    stats = db.query_one(
+        f"SELECT COUNT(*) AS total, SUM(resolved) AS resolved_count FROM index_errors {where_sql}",  # noqa: S608
+        params,
+    )
+    total = stats["total"] if stats else 0
+    resolved_count = stats["resolved_count"] if stats and stats["resolved_count"] else 0
+    unresolved_count = total - resolved_count
+    type_rows = db.query(
+        f"SELECT error_type, COUNT(*) as count FROM index_errors {where_sql} GROUP BY error_type",  # noqa: S608
+        params,
+    )
+    by_type: dict[str, int] = {row["error_type"]: row["count"] for row in type_rows}
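For anyone who wants to try the aggregate-query approach in isolation, here is a self-contained sketch against an in-memory SQLite database. It uses the standard-library sqlite3 module directly, not the project's db wrapper, and the table columns and the error_stats helper are assumptions made for illustration.

```python
import sqlite3
from typing import Optional

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access to columns by name
# Assumed minimal columns; the real index_errors schema may differ.
conn.execute(
    "CREATE TABLE index_errors (path TEXT, error_type TEXT, resolved INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO index_errors (path, error_type, resolved) VALUES (?, ?, ?)",
    [
        ("/docs/a.pdf", "transient", 1),
        ("/docs/b.bin", "permanent", 0),
        ("/docs/c.txt", "transient", 0),
        ("/other/d.txt", "permanent", 0),
    ],
)


def error_stats(conn: sqlite3.Connection, path_prefix: Optional[str] = None) -> dict:
    """Aggregate error counts in SQL instead of loading rows into memory."""
    where_sql = "WHERE path LIKE ?" if path_prefix else ""
    params = (path_prefix + "%",) if path_prefix else ()
    row = conn.execute(
        f"SELECT COUNT(*) AS total, SUM(resolved) AS resolved_count FROM index_errors {where_sql}",
        params,
    ).fetchone()
    total = row["total"]
    resolved = row["resolved_count"] or 0  # SUM() yields NULL when no rows match
    by_type = {
        r["error_type"]: r["n"]
        for r in conn.execute(
            f"SELECT error_type, COUNT(*) AS n FROM index_errors {where_sql} GROUP BY error_type",
            params,
        )
    }
    return {"total": total, "resolved": resolved, "unresolved": total - resolved, "by_type": by_type}


docs_stats = error_stats(conn, "/docs/")
empty_stats = error_stats(conn, "/missing/")
```

The `or 0` fallback mirrors the guard in the suggestion: SUM over zero rows returns NULL, while COUNT(*) returns 0. One caveat for either version is that LIKE treats `%` and `_` inside the prefix as wildcards, so a path containing an underscore can match more rows than intended unless the prefix is escaped.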

fatherlinux merged commit 9bffcca into main Mar 21, 2026
9 of 11 checks passed

gemini-code-assist Bot reviewed Mar 21, 2026

fatherlinux mentioned this pull request Mar 21, 2026

Bump version to 0.4.0 #3

Merged

1 task

fatherlinux merged 1 commit into main from feat/per-file-error-tracking
