
feat: per-file error tracking with trove_quality tool #2

Merged
fatherlinux merged 1 commit into main from feat/per-file-error-tracking
Mar 21, 2026
Conversation

@fatherlinux
Member

Summary

  • Adds index_errors table to track per-file indexing failures with error type classification (transient vs permanent) and resolution tracking
  • Wires error recording into _extract_and_store_batched() and auto-resolves on successful re-index in _store_one()
  • Adds trove_quality MCP tool (tool #10) so LLM clients can query which files failed and decide to retry transient errors via trove_reindex
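As a rough illustration of the three points above, here is a minimal sketch of how an index_errors table with record-on-failure and resolve-on-success could look. It assumes a SQLite backend; the column names and the helper functions record_error and resolve_errors are hypothetical and are not taken from the PR diff.

```python
import sqlite3

# Hypothetical schema: column names are assumptions, not copied from the PR.
SCHEMA = """
CREATE TABLE IF NOT EXISTS index_errors (
    id          INTEGER PRIMARY KEY,
    path        TEXT NOT NULL,
    error_type  TEXT NOT NULL CHECK (error_type IN ('transient', 'permanent')),
    message     TEXT NOT NULL,
    resolved    INTEGER NOT NULL DEFAULT 0,
    created_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    resolved_at TEXT
)
"""


def record_error(conn, path, error_type, message):
    """Store one indexing failure for a file (hypothetical helper)."""
    conn.execute(
        "INSERT INTO index_errors (path, error_type, message) VALUES (?, ?, ?)",
        (path, error_type, message),
    )


def resolve_errors(conn, path):
    """Mark open errors for a file as resolved; return how many rows changed."""
    cur = conn.execute(
        "UPDATE index_errors SET resolved = 1, resolved_at = CURRENT_TIMESTAMP "
        "WHERE path = ? AND resolved = 0",
        (path,),
    )
    return cur.rowcount


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
record_error(conn, "/docs/a.pdf", "transient", "extractor timed out")
record_error(conn, "/docs/b.bin", "permanent", "unsupported file type")

# Simulate a successful re-index of a.pdf: its open error is auto-resolved.
resolved_now = resolve_errors(conn, "/docs/a.pdf")
unresolved = conn.execute(
    "SELECT path FROM index_errors WHERE resolved = 0"
).fetchall()
```

In this sketch the resolve step is keyed on the file path only, so a single successful re-index clears every open error for that file.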

Test plan

  • uv run pytest -v — 91 passed, 1 skipped
  • uv run ruff check src tests — clean
  • uv run mypy src — clean
  • Deploy updated container, index a directory with known failures, call trove_quality to confirm errors are recorded

🤖 Generated with Claude Code

Record individual file failures during indexing in a new index_errors
table so they can be queried later instead of grepping stderr logs.
Errors are classified as transient (retryable) or permanent, and
automatically marked resolved when a file is successfully re-indexed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
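The transient/permanent split described in the commit message could be sketched as below. The PR text does not say which failures count as transient, so the exception list and the function name classify_error are assumptions chosen only for illustration.

```python
# Assumption: timeouts and connection problems are worth retrying;
# everything else is treated as permanent. The PR may classify differently.
TRANSIENT_EXCEPTIONS = (TimeoutError, ConnectionError, BlockingIOError)


def classify_error(exc: BaseException) -> str:
    """Return 'transient' for retryable failures, otherwise 'permanent'."""
    if isinstance(exc, TRANSIENT_EXCEPTIONS):
        return "transient"
    return "permanent"


timeout_kind = classify_error(TimeoutError("extractor timed out"))
reset_kind = classify_error(ConnectionResetError("peer closed connection"))
parse_kind = classify_error(ValueError("unsupported file format"))
```

Defaulting unknown exceptions to permanent is a conservative choice here: it avoids retry loops on files that will never index, at the cost of occasionally requiring a manual retry.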
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist [1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request adds per-file error tracking to the indexing pipeline. Indexing failures are recorded per file and classified as transient or permanent, and each error's resolution status is tracked. The new trove_quality tool lets external clients inspect which files failed and re-process those that failed for transient reasons.

Highlights

  • Per-file error tracking: Added an index_errors table to track individual file indexing failures, including error type classification (transient vs. permanent) and resolution status.
  • Automated error recording and resolution: Integrated error recording into the _extract_and_store_batched() process for failed extractions and implemented auto-resolution for errors when a file is successfully re-indexed in _store_one().
  • New trove_quality MCP tool: Introduced a new trove_quality tool (tool #10) that allows LLM clients to query files that failed indexing, helping them decide which transient errors to retry via trove_reindex.
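To make the retry decision in the last highlight concrete, here is a small sketch of how a client might pick retry candidates from a list of error records. The record shape is an assumption: the keys error_type and resolved appear in the code under review, while path and the function name select_retry_candidates are hypothetical. The actual trove_quality response format is not shown in this PR conversation.

```python
def select_retry_candidates(errors):
    """Return sorted, de-duplicated paths with unresolved transient errors."""
    return sorted(
        {
            e["path"]
            for e in errors
            if e["error_type"] == "transient" and not e["resolved"]
        }
    )


# Hypothetical error records, shaped like the dicts used in the reviewed code.
sample_errors = [
    {"path": "/docs/a.pdf", "error_type": "transient", "resolved": 0},
    {"path": "/docs/a.pdf", "error_type": "transient", "resolved": 0},
    {"path": "/docs/b.bin", "error_type": "permanent", "resolved": 0},
    {"path": "/docs/c.txt", "error_type": "transient", "resolved": 1},
]

retry_paths = select_retry_candidates(sample_errors)
```

The resulting paths are what a client would then pass to trove_reindex; permanent and already-resolved errors are left alone.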
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page. Here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@fatherlinux fatherlinux merged commit 9bffcca into main Mar 21, 2026
9 of 11 checks passed

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a valuable feature for per-file error tracking, including a new database table, API endpoint, and associated business logic. The implementation is well-structured and includes relevant tests. My review focuses on a significant performance and correctness issue in the trove_quality tool's implementation. I've provided a detailed code suggestion to address this by leveraging more efficient database query patterns, which will improve scalability and ensure accurate statistics.

Comment on lines +137 to +143
# Compute aggregate counts across all errors (not just the page returned)
all_errors = db.query_errors(resolved=None, path=path, limit=10_000)
total = len(all_errors)
resolved_count = sum(1 for e in all_errors if e["resolved"])
unresolved_count = total - resolved_count

by_type: dict[str, int] = dict(Counter(e["error_type"] for e in all_errors))
high

The current implementation for calculating aggregate error statistics is inefficient. It fetches up to 10,000 full error records into memory to perform calculations. This approach is memory-intensive, will be slow at scale, and will produce incorrect totals if the number of errors exceeds the 10,000 limit.

This can be performed much more efficiently by executing aggregate queries (COUNT, SUM, GROUP BY) directly in the database, which avoids loading all records into memory and removes the arbitrary limit.

Suggested change
- # Compute aggregate counts across all errors (not just the page returned)
- all_errors = db.query_errors(resolved=None, path=path, limit=10_000)
- total = len(all_errors)
- resolved_count = sum(1 for e in all_errors if e["resolved"])
- unresolved_count = total - resolved_count
- by_type: dict[str, int] = dict(Counter(e["error_type"] for e in all_errors))
+ # Compute aggregate counts efficiently in the database
+ where_sql = "WHERE path LIKE ?" if path else ""
+ params = (path + "%",) if path else ()
+ stats = db.query_one(
+     f"SELECT COUNT(*) AS total, SUM(resolved) AS resolved_count FROM index_errors {where_sql}",  # noqa: S608
+     params,
+ )
+ total = stats["total"] if stats else 0
+ resolved_count = stats["resolved_count"] if stats and stats["resolved_count"] else 0
+ unresolved_count = total - resolved_count
+ type_rows = db.query(
+     f"SELECT error_type, COUNT(*) as count FROM index_errors {where_sql} GROUP BY error_type",  # noqa: S608
+     params,
+ )
+ by_type: dict[str, int] = {row["error_type"]: row["count"] for row in type_rows}
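For anyone who wants to try the database-side aggregation outside the project, the following is a self-contained sketch against an in-memory SQLite database. It is not the project's code: the table layout, the error_stats function, and the escape_like helper are assumptions. The sketch also escapes LIKE wildcards in the path prefix, since a literal _ or % in a directory name would otherwise match unintended rows.

```python
import sqlite3


def escape_like(value, escape="\\"):
    """Escape LIKE wildcards so the value is matched literally."""
    return (
        value.replace(escape, escape * 2)
        .replace("%", escape + "%")
        .replace("_", escape + "_")
    )


def error_stats(conn, path=None):
    """Aggregate error counts in SQL instead of loading rows into Python."""
    where_sql, params = "", ()
    if path:
        where_sql = "WHERE path LIKE ? ESCAPE '\\'"
        params = (escape_like(path) + "%",)
    # COALESCE covers the empty-table case, where SUM() returns NULL.
    total, resolved = conn.execute(
        f"SELECT COUNT(*), COALESCE(SUM(resolved), 0) FROM index_errors {where_sql}",
        params,
    ).fetchone()
    by_type = dict(
        conn.execute(
            f"SELECT error_type, COUNT(*) FROM index_errors {where_sql} "
            "GROUP BY error_type",
            params,
        ).fetchall()
    )
    return {
        "total": total,
        "resolved": resolved,
        "unresolved": total - resolved,
        "by_type": by_type,
    }


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE index_errors (path TEXT, error_type TEXT, resolved INTEGER)"
)
conn.executemany(
    "INSERT INTO index_errors VALUES (?, ?, ?)",
    [
        ("/data/a_b/x.txt", "transient", 0),
        ("/data/a_b/y.txt", "permanent", 1),
        ("/data/aXb/z.txt", "transient", 0),
    ],
)
all_stats = error_stats(conn)
prefix_stats = error_stats(conn, "/data/a_b/")
```

Only the fixed WHERE fragment is interpolated into the SQL string; the user-supplied prefix always travels as a bound parameter.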

@fatherlinux fatherlinux mentioned this pull request Mar 21, 2026
1 task