
Use AST-based hashing for semantic file change detection#259

Open
Copilot wants to merge 5 commits into main from copilot/use-ast-hashing-for-file-changes

Conversation

Contributor

Copilot AI commented Mar 19, 2026

File hashing used raw content bytes, so comment edits, whitespace changes, and reformatting all triggered unnecessary re-indexing. This replaces raw hashing with AST-based hashing for the 25+ languages already supported by ast-grep.

How it works

The AST normalizes away whitespace (which is not represented in the tree), and comment nodes are skipped during traversal. The canonical representation is the sequence of node kinds and leaf text values, which is then blake3-hashed as before.

# Same hash — only comments differ
compute_semantic_file_hash(b"# old\ndef f(): pass", Path("x.py"))
compute_semantic_file_hash(b"# new\ndef f(): pass", Path("x.py"))

# Same hash — only formatting differs
compute_semantic_file_hash(b"def add(a,b):\n    return a+b", Path("x.py"))
compute_semantic_file_hash(b"def add(a, b):\n    return a + b", Path("x.py"))

# Different hash — operator changed
compute_semantic_file_hash(b"return a + b", Path("x.py"))
compute_semantic_file_hash(b"return a * b", Path("x.py"))

Non-semantic files (markdown, plain text, etc.) and AST parse failures fall back to raw content hashing.
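The canonicalization and fallback described above can be sketched with a toy tree. The `Node` class, `walk`, and `semantic_hash` below are hypothetical stand-ins for ast-grep's SgNode traversal, and `blake2b` (stdlib) stands in for the blake3 hash the project actually uses:

```python
from dataclasses import dataclass, field
from hashlib import blake2b  # stand-in for blake3, a third-party package


@dataclass
class Node:
    """Minimal stand-in for an ast-grep SgNode: a kind, leaf text, children."""
    kind: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def walk(node: Node, parts: list[str]) -> None:
    """Append the canonical representation of `node` to `parts`."""
    if "comment" in node.kind.lower():  # skip comment nodes entirely
        return
    if not node.children:               # leaf: keep kind and text
        parts.append(f"{node.kind}:{node.text}")
    else:                               # interior node: keep kind only
        parts.append(node.kind)
        for child in node.children:
            walk(child, parts)


def semantic_hash(root: Node) -> str:
    """Hash the canonical node sequence; whitespace never appears in the tree."""
    parts: list[str] = []
    walk(root, parts)
    return blake2b("\n".join(parts).encode()).hexdigest()


# Two trees that differ only in a comment hash identically.
a = Node("module", children=[Node("comment", "# old"), Node("identifier", "f")])
b = Node("module", children=[Node("comment", "# new"), Node("identifier", "f")])
```

In the real implementation the tree comes from parsing the file with ast-grep; the fallback path simply hashes the raw bytes when no tree can be produced.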

Changes

  • discovery.py: Add compute_semantic_file_hash() with internal helpers _compute_ast_hash, _walk_ast_nodes, _get_semantic_language. Update DiscoveredFile.__init__, from_path, and file_hash property to use it. The file_hash property uses self.absolute_path for existence checks and reads to handle relative paths correctly regardless of CWD.
  • indexing_service.py: Replace get_blake_hash(content_bytes) with compute_semantic_file_hash(content_bytes, path) in _process_discovery_batch.
  • test_semantic_hashing.py: 23 unit tests covering comment invariance, formatting invariance, semantic change detection, language detection, fallback behavior, and DiscoveredFile integration.

Notes

  • No new dependencies — uses existing ast_grep_py (tree-sitter).
  • pyo3_runtime.PanicException from unsupported languages inherits from BaseException, not Exception, and cannot be imported directly. The fallback catches Exception for standard errors, then catches BaseException only for pyo3_runtime.PanicException identified by module name — re-raising all other non-recoverable errors (GeneratorExit, etc.). This approach works in environments without pyo3 since ImportError from ast_grep_py is handled by the except Exception clause.
  • Docstring changes do produce different hashes since docstrings are AST expression nodes, not comments.
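The tiered exception handling described in the second note can be sketched as follows. `_FakePanic` and `ast_hash_or_none` are illustrative stand-ins, not the project's code; the real `pyo3_runtime.PanicException` is identified the same way, by its module name, since it cannot be imported directly:

```python
import logging

logger = logging.getLogger(__name__)


class _FakePanic(BaseException):
    """Stand-in for pyo3_runtime.PanicException (BaseException, not Exception)."""


# PanicException lives in the Rust-embedded pyo3_runtime module.
_FakePanic.__module__ = "pyo3_runtime"


def ast_hash_or_none(parse):
    """Run `parse()`; return None on recoverable failure, re-raise otherwise."""
    try:
        return parse()
    except Exception:
        # Standard errors, including ImportError when ast_grep_py is absent.
        logger.debug("AST parsing failed", exc_info=True)
        return None
    except BaseException as exc:
        # Only swallow Rust panics surfaced by pyo3; re-raise everything
        # else (KeyboardInterrupt, GeneratorExit, ...).
        if getattr(type(exc), "__module__", None) == "pyo3_runtime":
            logger.debug("pyo3 panic during AST parsing", exc_info=True)
            return None
        raise
```

Because the `except Exception` clause comes first, the `BaseException` branch only ever sees non-`Exception` types, so the module-name check is the sole gate on swallowing them.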
Original prompt

This section details the original issue you should resolve

<issue_title>Use AST-based hashing for semantic file change detection</issue_title>
<issue_description>## Description

Implement semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection.

Current Approach

  • Files hashed based on raw content
  • Formatting changes trigger unnecessary re-indexing
  • Comments and whitespace changes cause false positives

Proposed Approach

For Semantic Files (discovery.py:165, 182)

Use ast() to compute hash for semantic files like Python, JavaScript, etc.

Benefits:

  • Detect actual semantic changes
  • Ignore formatting, comments, whitespace
  • More intelligent change detection
  • Reduce unnecessary re-indexing

Implementation:

# Hash AST structure instead of raw bytes
ast_hash = hash_ast(parse_file(path))

Considerations

  • Language-specific AST parsers (tree-sitter already available)
  • Fallback to content hash for non-semantic files
  • Performance impact of parsing
  • Cache parsed ASTs when possible

Impact

  • Fewer unnecessary re-indexes
  • Better performance
  • Smarter change detection
  • More efficient resource usage

Source

  • Files: discovery.py:165, discovery.py:182
  • Branch: 003-our-aim-to</issue_description>

<agent_instructions>@copilot we need a new assessment on this issue based on the codebase's current state following major refactoring

If you assess the path to implementation is straight forward, then go ahead, complete it and submit a PR. Otherwise, submit a plan.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@bashandbone @copilot we need a new assessment on this issue based on the codebase's current state following major refactoring


Use ast-grep to parse supported language files (25+ languages) and
compute hashes from the canonical AST representation instead of raw
file bytes. This ignores comments, whitespace, and formatting changes
so that only genuine semantic modifications trigger re-indexing.

Changes:
- Add compute_semantic_file_hash() and helpers to discovery.py
- Update DiscoveredFile.__init__, from_path, and file_hash property
- Update indexing_service._process_discovery_batch to use semantic hashing
- Add comprehensive unit tests (23 tests) for semantic hashing

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement AST-based hashing for semantic file change detection" to "Use AST-based hashing for semantic file change detection" on Mar 19, 2026
Copilot AI requested a review from bashandbone March 19, 2026 19:50
@bashandbone bashandbone marked this pull request as ready for review March 19, 2026 22:43
Copilot AI review requested due to automatic review settings March 19, 2026 22:43
Contributor

Copilot AI left a comment


Pull request overview

Introduces semantic-aware file hashing to reduce unnecessary re-indexing by hashing an AST-derived canonical form (ignoring comments/formatting) for supported languages, with a fallback to raw content hashing when AST parsing isn’t available.

Changes:

  • Add compute_semantic_file_hash() and AST-walking helpers in core/discovery.py, and wire it into DiscoveredFile hashing.
  • Update indexing discovery batching to use semantic hashing when deciding whether a file needs reindexing.
  • Add unit tests covering comment/whitespace invariance, semantic-change detection, language detection, fallback behavior, and DiscoveredFile integration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/codeweaver/core/discovery.py Adds AST-based hashing and updates DiscoveredFile hashing behavior.
src/codeweaver/engine/services/indexing_service.py Uses semantic hashing when determining whether files have changed.
tests/unit/core/test_semantic_hashing.py Adds unit tests validating semantic hashing behavior and integration.


        return get_blake_hash(canonical)
    except (KeyboardInterrupt, SystemExit):
        raise
    except BaseException:

Copilot AI Mar 19, 2026


_compute_ast_hash catches BaseException, which will also swallow MemoryError/GeneratorExit and other non-recoverable errors (falling back to a content hash instead of surfacing the failure). If the goal is specifically to handle pyo3_runtime.PanicException from ast-grep, consider catching that type (or Exception plus that type) and/or explicitly re-raising MemoryError/GeneratorExit similar to how KeyboardInterrupt/SystemExit are handled.

Suggested change:
-    except BaseException:
+    except Exception:

Comment on lines +337 to +344
"""Return the blake3 hash of the file, using AST-based hashing when supported."""
if self._file_hash is not None:
return self._file_hash
if self.path.exists() and self.path.is_file():
content_hash = get_blake_hash(self.path.read_bytes())
content_bytes = self.path.read_bytes()
content_hash = compute_semantic_file_hash(
content_bytes, self.path, ext_category=self.ext_category
)

Copilot AI Mar 19, 2026


DiscoveredFile.file_hash reads via self.path.exists()/read_bytes(), but path is often stored relative to the project root (see set_relative_path usage in from_path and absolute_path). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to get_blake_hash(b""). Prefer using self.absolute_path for the existence/read checks here (and when passing the path into compute_semantic_file_hash) to ensure consistent behavior regardless of CWD.

Contributor


@copilot

Please correct the following two issues:

DiscoveredFile.file_hash reads via self.path.exists()/read_bytes(), but path is often stored relative to the project root (see set_relative_path usage in from_path and absolute_path). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to get_blake_hash(b""). Prefer using self.absolute_path for the existence/read checks here (and when passing the path into compute_semantic_file_hash) to ensure consistent behavior regardless of CWD.

Let's also catch a narrower error than BaseException, like the pyO3 error specifically. We need to be careful because when Codeweaver moves to a monorepo there will be situations where pyO3 isn't a transitive dependency.

Offhand, I know watchfiles and rignore use it, but those are engine dependencies. Some providers use it (DuckDuckGo via primp, for example). We may need to make the catch or caught exception dependent on whether we have pyO3 in the environment

Contributor Author


Fixed both issues in ed23aa2:

  1. file_hash now uses self.absolute_path for existence checks and reads, matching the pattern already used in size, contents, and raw_contents.

  2. Narrowed exception handling in _compute_ast_hash:

    • except Exception catches all standard Python errors (including ImportError if ast_grep_py is missing)
    • except BaseException only fires for non-Exception subclasses; checks getattr(type(exc), "__module__", None) == "pyo3_runtime" to identify PanicException by module name, re-raising everything else (GeneratorExit, etc.)

    This works in environments without pyo3 since the ImportError from ast_grep_py falls into the first except Exception handler. The pyo3 check only activates when a Rust-side panic actually occurs — and PanicException can't be imported directly (it's a Rust-embedded type not registered in sys.modules), hence the module name check.

@github-actions
Contributor

Code Review - AST-based Semantic File Hashing

I've reviewed PR #259 and the implementation looks excellent! This is a well-designed feature that addresses a real problem with smart engineering.

✅ Strengths

1. Sound Architecture

  • Clean separation of concerns with _walk_ast_nodes, _compute_ast_hash, _get_semantic_language, and compute_semantic_file_hash functions
  • Proper fallback strategy: AST parsing → content hash (graceful degradation)
  • Uses existing ast_grep_py dependency (no new dependencies)

2. Robust Error Handling

  • Catches BaseException to handle pyo3_runtime.PanicException (which doesn't inherit from Exception)
  • Re-raises KeyboardInterrupt and SystemExit appropriately
  • Debug logging with exc_info=True for troubleshooting (discovery.py:102, 142)
  • Multiple fallback paths ensure the system remains functional

3. Excellent Test Coverage

  • 23 comprehensive unit tests covering:
    • Comment invariance (hash stays same when only comments change)
    • Formatting invariance (whitespace/spacing changes don't affect hash)
    • Semantic change detection (logic changes produce different hashes)
    • Language detection
    • Fallback behavior
    • DiscoveredFile integration
  • Tests cover both Python and JavaScript
  • Tests verify docstrings ARE considered semantic (correct behavior)

4. Performance Considerations

  • AST parsing happens only for supported semantic files
  • Efficient tree traversal that skips comment nodes
  • Blake3 hashing remains fast

5. Documentation

  • Clear docstrings explaining behavior
  • Updated field descriptions (discovery.py:174)
  • Good code examples in PR description

🔍 Observations

1. Test Failures Are Unrelated
All semantic hashing tests (23/23) are passing in CI. The failing tests are in:

  • tests/e2e/test_user_journeys.py - E2E user journey tests
  • tests/integration/real/test_full_pipeline.py - Integration tests
  • tests/integration/real/test_search_behavior.py - Search behavior tests
  • tests/integration/workflows/test_search_workflows.py - Workflow tests

These appear to be pre-existing flaky tests or issues unrelated to the semantic hashing changes.

2. Implementation Notes

  • The AST hash includes both node kinds and leaf text: f"{kind}:{node.text()}" (discovery.py:71)
  • This ensures semantic content is captured while formatting is normalized
  • Proper handling of empty files (returns None for empty AST)

3. Integration Points

  • DiscoveredFile.__init__ (discovery.py:220)
  • DiscoveredFile.from_path (discovery.py:273)
  • DiscoveredFile.file_hash property (discovery.py:341)
  • IndexingService._process_discovery_batch (indexing_service.py:316)

All integration points properly use the new semantic hashing function.

💡 Minor Suggestions

1. Consider Caching (Future Enhancement)
For large files that are parsed frequently, consider caching the AST or hash. This could be a future optimization if profiling shows it's needed.

2. Metrics/Telemetry (Future Enhancement)
Track AST parsing failures vs. successes to monitor which languages/files are falling back to content hashing. This could help identify parser issues or unsupported edge cases.

3. Documentation Enhancement
Consider adding a comment explaining why BaseException is needed for pyo3_runtime.PanicException (discovery.py:101). The PR description mentions this, but it would be helpful in the code itself.

🎯 Impact Assessment

Positive Impacts:

  • ✅ Significantly reduces unnecessary re-indexing from formatting changes
  • ✅ Better developer experience (linters, formatters won't trigger re-index)
  • ✅ More efficient resource usage
  • ✅ Supports 25+ languages via ast-grep

Security: No security concerns identified.

Performance: Expected improvement due to fewer unnecessary re-indexing operations. AST parsing adds minimal overhead and only runs for supported semantic files.

Backward Compatibility: Fully compatible. Existing content hashes will differ from new semantic hashes, causing a one-time re-index, which is expected and acceptable.

📋 Verdict

This PR is well-implemented and ready to merge once the unrelated test failures are addressed.

The semantic hashing implementation itself is solid. I recommend:

  1. Investigating the failing E2E and integration tests (they appear unrelated to this PR)
  2. If those tests are flaky or pre-existing issues, consider merging this PR and addressing them separately
  3. Consider adding the minor documentation enhancement mentioned above

Great work on this feature! 🎉

@github-actions
Contributor

Code Review - PR #259: Use AST-based hashing for semantic file change detection

I've thoroughly reviewed this PR and the implementation looks excellent! Here's my detailed analysis:

✅ Overview

This PR successfully implements semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection. This is a significant improvement that will reduce unnecessary re-indexing when only formatting, comments, or whitespace changes occur.

✅ Implementation Quality

Strengths:

  1. Well-designed architecture: The implementation follows a clean separation of concerns with three helper functions (_walk_ast_nodes, _compute_ast_hash, _get_semantic_language) and one public API (compute_semantic_file_hash).

  2. Robust error handling:

    • Catches BaseException to handle pyo3_runtime.PanicException which doesn't inherit from Exception
    • Properly re-raises KeyboardInterrupt and SystemExit
    • Graceful fallback to content hashing on any failure
    • Good use of debug logging with exc_info=True for troubleshooting
  3. Comprehensive testing: 23 new unit tests cover:

    • Comment invariance ✓
    • Formatting/whitespace invariance ✓
    • Semantic change detection ✓
    • Language detection ✓
    • Fallback behavior ✓
    • DiscoveredFile integration ✓
    • All tests are passing
  4. Leverages existing dependencies: Uses already-included ast-grep-py (tree-sitter) without adding new dependencies.

  5. Documentation: Clear docstrings explaining behavior, especially the important note about docstrings being semantic content (not comments).

✅ Code Changes Analysis

src/codeweaver/core/discovery.py:

  • _walk_ast_nodes(): Recursively traverses AST, skipping comment nodes by checking if "comment" is in the kind name (case-insensitive). Builds a canonical representation of node kinds and leaf text.
  • _compute_ast_hash(): Wraps AST parsing with proper error handling, returns None on failure.
  • _get_semantic_language(): Determines if a file has AST support via extension or ExtCategory.
  • compute_semantic_file_hash(): Public API that attempts AST hashing, falls back to content hashing.
  • Integration points: Updated DiscoveredFile.__init__, from_path, and file_hash property to use the new function.

src/codeweaver/engine/services/indexing_service.py:

  • Line 316: Changed from get_blake_hash(content_bytes) to compute_semantic_file_hash(content_bytes, path) in _process_discovery_batch.

⚠️ Test Failures (Unrelated to this PR)

The CI shows 23 failing tests, but these are NOT caused by this PR:

  • Most failures are from MCP spec validation errors: ValueError: Output schemas must represent object types due to MCP spec limitations
  • Some e2e test failures with KeyError: 'max_file_size'
  • One FileNotFoundError in vector storage
  • All new semantic hashing tests (23/23) are passing

These appear to be pre-existing issues in the test suite, likely from recent refactoring mentioned in the PR description.

💡 Technical Considerations

Good decisions:

  1. Comment detection (line 67): Using "comment" in kind.lower() is simple and effective for catching comment node types across different languages (e.g., "comment", "line_comment", "block_comment").

  2. Docstrings are semantic: The PR correctly notes that docstrings produce different hashes since they're AST expression nodes, not comments. This is the right choice as docstrings are part of the API contract.

  3. UTF-8 decoding with error handling (line 135): Uses errors="replace" which prevents crashes on invalid UTF-8 while still allowing AST hashing to proceed.

  4. Walrus operator usage: Clean use of assignment expressions (:=) for both language and hash checks.
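The `errors="replace"` decoding behavior noted in point 3 is easy to demonstrate (an illustrative snippet, not project code):

```python
# Bytes ending in an invalid UTF-8 sequence still decode without raising;
# each bad byte becomes the U+FFFD replacement character, so AST hashing
# can proceed on the (slightly lossy) decoded text.
raw = b"def f(): pass\n\xff\xfe"
text = raw.decode("utf-8", errors="replace")
```

A strict `raw.decode("utf-8")` would raise `UnicodeDecodeError` here, forcing an early fallback to the content hash.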

🎯 Security & Performance

Security:

  • ✅ No security concerns
  • ✅ Proper exception handling prevents crashes
  • ✅ No exposure of sensitive information

Performance:

  • ✅ AST parsing is only done for supported semantic languages
  • ✅ Graceful fallback ensures no performance regression for non-semantic files
  • ✅ Reduces unnecessary re-indexing, improving overall system performance
  • ✅ Already uses blake3 which is very fast

📊 Impact Assessment

Benefits:

  • Fewer unnecessary re-indexes when developers format code
  • More intelligent change detection
  • Better resource usage
  • Improves developer experience (auto-formatters won't trigger full re-indexing)

Risks:

  • ⚠️ Very low - comprehensive fallback strategy ensures robustness
  • All changes are backward compatible

✅ Recommendation

APPROVE - This PR is ready to merge.

The implementation is well-designed, thoroughly tested, and delivers significant value. The failing tests are unrelated to this PR and should be addressed separately.

Suggestions for future consideration:

  1. Consider caching parsed ASTs if files are hashed multiple times in quick succession
  2. Monitor AST parsing performance in production to validate the benefits

Test Results: All 23 new semantic hashing tests passing ✓
CI Failures: Unrelated pre-existing issues (23 failures in MCP/e2e tests)
Code Quality: Excellent - clean architecture, robust error handling, comprehensive tests
Documentation: Clear and complete

Great work! This is a valuable improvement to the codebase. 🎉

@github-actions
Contributor

Additional Code Review - PR #259

I've conducted an additional review of the AST-based semantic file hashing implementation. Here are my findings:

✅ Code Quality & Design

Excellent Implementation:

  • Clean abstraction layers: The code properly separates concerns with _walk_ast_nodes, _compute_ast_hash, _get_semantic_language, and compute_semantic_file_hash (discovery.py:59-144)
  • Defensive programming: Proper exception handling with BaseException catch to handle pyo3_runtime.PanicException from unsupported languages
  • Graceful degradation: Falls back to content hashing when AST parsing fails or language is unsupported
  • No new dependencies: Leverages existing ast_grep_py (tree-sitter) dependency

🔍 Technical Observations

1. Comment Detection Algorithm (discovery.py:66-67)

if "comment" in kind.lower():
    return

This simple but effective approach filters out comment nodes. Works across all 25+ supported languages since tree-sitter parsers consistently name comment nodes.
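For illustration, node-kind names modeled on common tree-sitter grammars (the exact names vary per grammar; these are assumptions, not a verified list) all match the substring check, while ordinary nodes do not:

```python
# Hypothetical kind names modeled on common tree-sitter grammars.
comment_kinds = ["comment", "line_comment", "block_comment", "Comment"]
other_kinds = ["function_definition", "string", "identifier"]

# The case-insensitive substring check used by the implementation.
matches = [k for k in comment_kinds + other_kinds if "comment" in k.lower()]
```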

2. AST Canonicalization (discovery.py:71, 97)
The canonical representation combines node kinds and leaf text values, ensuring:

  • Whitespace normalization (not represented in AST)
  • Formatting independence
  • Semantic content preservation

3. Integration Points
All file hashing locations properly updated:

  • DiscoveredFile.__init__ (discovery.py:220)
  • DiscoveredFile.from_path (discovery.py:273)
  • DiscoveredFile.file_hash property (discovery.py:341)
  • IndexingService._process_discovery_batch (indexing_service.py:316)

4. Test Coverage (test_semantic_hashing.py)
Comprehensive 285-line test suite with 23 tests covering:

  • Comment invariance (Python & JavaScript)
  • Whitespace/formatting invariance
  • Semantic change detection
  • Docstring handling (correctly treated as semantic)
  • Language detection
  • Fallback behavior
  • DiscoveredFile integration

🎯 Impact Analysis

Benefits:

  • Reduced re-indexing: Formatting changes (prettier, black, eslint) won't trigger re-index
  • Better CI/CD: Automated formatters won't cause false positives
  • Resource efficiency: Fewer unnecessary embedding computations
  • 25+ language support: Python, JavaScript, TypeScript, Go, Rust, Java, C++, etc.

Migration Impact:
Existing file hashes will differ from new semantic hashes, triggering a one-time re-index of all semantic files. This is expected and acceptable behavior.

Performance:

  • AST parsing adds minimal overhead (only for semantic files)
  • Tree-sitter parsers are highly optimized
  • Blake3 hashing remains fast
  • Net performance gain from avoiding unnecessary re-indexing

⚠️ Test Failures (Unrelated to This PR)

The failing tests are NOT related to semantic hashing:

  • test_full_pipeline_index_then_search: MCP schema validation error
  • E2E user journey tests: Config/CLI issues
  • Integration tests: Pre-existing flakiness

All 23 semantic hashing tests pass successfully.

💡 Minor Recommendations

1. Documentation Enhancement
Consider adding inline comment explaining BaseException usage:

except (KeyboardInterrupt, SystemExit):
    raise
except BaseException:  # Catches pyo3_runtime.PanicException which doesn't inherit from Exception
    logger.debug("AST parsing failed for language %s", language_name, exc_info=True)

2. Future Optimization (Optional)
If profiling shows AST parsing as a bottleneck, consider:

  • Caching parsed ASTs for large frequently-accessed files
  • Parallel AST parsing for batch operations

3. Monitoring (Optional)
Track AST parsing success/failure rates to identify:

  • Languages with parser issues
  • Edge cases requiring fallback

✅ Security & Safety

  • No security concerns: Read-only operations, no external inputs
  • Safe error handling: Properly re-raises KeyboardInterrupt and SystemExit
  • Debug logging: Includes exc_info=True for troubleshooting (discovery.py:102, 142)

📊 Code Style & Best Practices

  • Follows project conventions
  • Type hints properly used
  • Docstrings clear and comprehensive
  • Error messages informative
  • Export added to __all__ (discovery.py:425)

🎉 Recommendation

APPROVE with minor suggestion

This is a well-architected, thoroughly tested feature that solves a real problem. The implementation is production-ready. The test failures are unrelated and should be addressed separately.

Suggested action: Merge this PR once the team confirms the failing tests are pre-existing issues.

Excellent work! 🚀

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
@github-actions
Contributor

PR Review: AST-based Semantic File Hashing

I've completed a thorough review of this PR. Overall, this is excellent work with clean design and comprehensive testing. Here's my detailed assessment:

Strengths ✅

  1. Excellent Architecture

    • Clean separation of concerns with well-named helper functions
    • Proper fallback to raw content hashing for unsupported languages
    • Smart reuse of existing ast_grep_py dependency
  2. Comprehensive Test Coverage

    • 23 unit tests covering all major scenarios
    • Tests for comment/whitespace invariance, semantic changes, language detection, and fallback behavior
    • Multi-language coverage (Python, JavaScript)
  3. High Code Quality

    • Clear, descriptive docstrings
    • Robust error handling with proper exception catching
    • Good use of debug logging with exc_info=True
    • Maintains backward compatibility
  4. Smart Implementation Details

    • Correctly handles ext_category parameter throughout the call chain
    • Proper understanding that docstrings are semantic content (not comments)
    • Well-documented handling of pyo3_runtime.PanicException

Concerns ⚠️

  1. BLOCKING: Test Failures

    • Multiple test runs are failing (Python 3.12, 3.13, 3.14, Docker Compose)
    • These must be investigated and resolved before merge
    • Please verify if these failures are related to this PR or pre-existing
  2. Performance Considerations

    • AST parsing adds computational overhead vs raw hashing
    • For large codebases, this could impact discovery/indexing performance
    • Consider adding performance benchmarks or monitoring
  3. Minor Code Quality Suggestions

    In _walk_ast_nodes (discovery.py:66-67):

    if "comment" in kind.lower():
    • Substring matching for comment detection could be fragile
    • Could match non-comment nodes with "comment" in the name
    • Suggestion: Use more robust comment detection if ast-grep provides a better API

    In compute_semantic_file_hash (discovery.py:135):

    content_str = content_bytes.decode("utf-8", errors="replace")
    • Using errors="replace" could mask encoding issues
    • Suggestion: Consider logging a warning when replacement characters are inserted
  4. Documentation

    • Suggestion: Add inline comments explaining the AST traversal strategy in _walk_ast_nodes

Security & Performance

  • ✅ No security concerns identified
  • ✅ Proper exception handling prevents DoS via malformed files
  • ⚠️ Performance impact needs monitoring (though likely acceptable since it avoids expensive re-indexing)

Recommendations Before Merge

  1. CRITICAL: Investigate and fix all failing CI tests
  2. Important: Consider adding performance benchmarks
  3. Nice to have: Add more inline comments for complex logic
  4. Nice to have: More robust comment node detection
  5. Nice to have: Log warnings for encoding issues

Verdict

This PR demonstrates excellent software engineering practices. The implementation is clean, well-tested, and adds significant value. However, the failing CI tests are a blocking issue.

Once test failures are resolved, I would approve this PR as it:

  • ✅ Solves a real problem (unnecessary re-indexing)
  • ✅ Has excellent design and implementation
  • ✅ Includes comprehensive test coverage
  • ✅ Follows codebase patterns
  • ✅ Adds significant value with minimal risk

Great work overall! Please address the test failures and this will be ready to merge. 🚀

- file_hash property now uses self.absolute_path for existence checks
  and reads so relative paths resolve correctly regardless of CWD
- _compute_ast_hash catches Exception for standard errors, then
  catches BaseException only for pyo3_runtime.PanicException identified
  by module name, re-raising all other non-recoverable errors

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
    except Exception:
        logger.debug("AST parsing failed for language %s", language_name, exc_info=True)
        return None
    except BaseException as exc:
Contributor

@github-actions github-actions bot left a comment


Pull Request Review: AST-based Semantic File Hashing

Summary

This PR implements AST-based semantic file hashing to avoid unnecessary re-indexing when only comments, whitespace, or formatting changes occur. The implementation is well-designed and thoroughly tested.

Code Quality: Excellent ✅

Strengths

1. Clean Architecture

  • Well-organized into small, focused functions (_walk_ast_nodes, _compute_ast_hash, _get_semantic_language)
  • Clear separation of concerns between AST parsing and hash computation
  • Proper use of existing infrastructure (ast-grep, blake3)

2. Robust Error Handling (src/codeweaver/core/discovery.py:78-110)

  • Graceful fallback to content-based hashing on parse failures
  • Clever handling of pyo3_runtime.PanicException that can't be imported directly
  • Proper exception hierarchy: catches Exception first, then BaseException selectively
  • Avoids swallowing critical exceptions like GeneratorExit
  • Good logging with exc_info=True for debugging

3. Semantic Correctness

  • Correctly excludes comments via "comment" in kind.lower() check
  • Builds canonical representation from AST node kinds and leaf text
  • Properly handles both named and unnamed AST nodes
  • Docstrings correctly treated as semantic (not comments) since they're expression nodes

4. Path Handling Fix (src/codeweaver/core/discovery.py:341-351)

  • The file_hash property now correctly uses self.absolute_path instead of self.path
  • This ensures relative paths resolve correctly regardless of current working directory
  • Important fix for reliability

5. Integration Points

  • Updated both DiscoveredFile initialization and indexing_service._process_discovery_batch
  • Consistent API with optional ext_category parameter
  • Properly exported in __all__

Testing: Comprehensive ✅

The test suite (tests/unit/core/test_semantic_hashing.py) is excellent with 23 tests covering:

  • ✅ Comment invariance (Python and JavaScript)
  • ✅ Formatting/whitespace invariance
  • ✅ Semantic change detection (operators, identifiers)
  • ✅ Language detection and fallback
  • ✅ Empty content and invalid syntax
  • ✅ Docstring changes (correctly different hash)
  • ✅ DiscoveredFile integration
  • ✅ ext_category precedence

Notable test quality:

  • Tests both low-level functions and integration
  • Clear test names and docstrings
  • Good coverage of edge cases

Performance Considerations

Positive:

  • Uses existing ast_grep_py dependency (no new deps)
  • Lazy import of SgRoot only when needed
  • Falls back quickly on parse failures

Potential concerns:

  • AST parsing is inherently slower than blake3 hashing raw bytes
  • For large codebases with frequent file scans, this could add latency
  • However, the benefit of avoiding unnecessary re-indexing likely outweighs the parsing cost

Recommendation: Monitor performance in production. If AST parsing becomes a bottleneck, consider:

  • Caching parsed ASTs (but memory cost may be high)
  • Parallel processing of file hashing
  • Size threshold above which to skip AST hashing
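The third option could be as small as a predicate gating the AST path. A minimal sketch with a made-up cutoff constant:

```python
from __future__ import annotations

MAX_AST_HASH_BYTES = 512 * 1024  # hypothetical cutoff, tune from profiling


def should_ast_hash(content: bytes, language: str | None) -> bool:
    """Gate the AST path: non-semantic or oversized files take the raw hash."""
    return language is not None and len(content) <= MAX_AST_HASH_BYTES
```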

Security: Good ✅

  • No credential handling or external network calls
  • Proper exception handling prevents crashes
  • Uses errors='replace' when decoding bytes to UTF-8
  • No code execution risks (ast-grep is a parser, not evaluator)

Documentation: Good ✅

  • Clear docstrings for all new functions
  • Updated field description for _file_hash in DiscoveredFile
  • Helpful inline comments explaining the pyo3_runtime.PanicException handling
  • PR description clearly explains the approach and changes

Minor Observations

  1. Docstring behavior (noted in PR description): Docstring changes produce different hashes. This is semantically correct since docstrings are AST nodes, not comments. However, for tools like linters that auto-format docstrings, this might cause some re-indexing. This seems acceptable.

  2. Language coverage: The implementation supports 25+ languages via ast-grep. The fallback ensures any unsupported files still work.

  3. Type safety: Good use of type hints throughout, including BlakeHashKey | None return types.

CI Status

All checks passing ✅

Recommendation

APPROVE

This is high-quality code that solves a real problem (unnecessary re-indexing) with a well-thought-out approach. The implementation is robust, well-tested, and properly integrated. The use of AST-based hashing for semantic files is the right architectural choice.

Suggested Follow-ups (optional, not blocking)

  1. Monitor performance impact in production environments with large codebases
  2. Consider adding metrics/logging for AST hash vs content hash usage rates
  3. Could add a configuration option to disable AST hashing if needed for performance

@bashandbone

@copilot hey team I think I disagree with the direction here and I'd like you to at least consider an alternative. I may also not fully understand the implementation, so feel free to correct me.

We already parse and cache ASTs as part of the engine pipeline. An AstThing object holds its AST and can query across them. Would it make more sense for each AstThing to store its own hash? You could even potentially isolate only changed nodes by comparing a group of hashes for a file.


Copilot AI commented Mar 21, 2026

@copilot hey team I think I disagree with the direction here and I'd like you to at least consider an alternative. I may also not fully understand the implementation, so feel free to correct me.

We ...

The semantic file hash serves a different purpose from AstThing's AST — it's a pre-indexing gate. The pipeline flow is:

  1. Discovery (_process_discovery_batch) → read bytes, compute hash, compare with manifest → decide if file needs re-indexing
  2. Indexing (SemanticChunker._parse_and_chunk) → parse AST → create AstThings/FileThings → chunk → embed → store

The hash exists to skip step 2 entirely for unchanged files. If we moved hashing to AstThing, we'd need to construct a full FileThing (with all its Pydantic model machinery, ThingRegistry lookups, classification, importance scoring, etc.) for every discovered file just to check if it changed — which is exactly the expensive work we're trying to avoid.
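Under hypothetical names (a dict manifest and a stand-in hash function in place of `compute_semantic_file_hash`), the gate in step 1 reduces to a hash-and-compare:

```python
import hashlib


def semantic_hash(content: bytes) -> str:
    # stand-in for compute_semantic_file_hash(content, path)
    return hashlib.blake2b(content).hexdigest()


def needs_reindex(path: str, content: bytes, manifest: dict[str, str]) -> bool:
    """Compare the new hash against the manifest; skip indexing on a match."""
    new_hash = semantic_hash(content)
    if manifest.get(path) == new_hash:
        return False  # unchanged: skip parse/chunk/embed entirely
    manifest[path] = new_hash
    return True
```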

The lightweight _compute_ast_hash in discovery.py does a bare tree-sitter parse (just SgRoot + walk nodes), collects kind/text tokens, and hashes them. It never creates AstThings, never touches the ThingRegistry, and gets discarded immediately after hashing. On the other hand, AstThing wraps SgNode with computed fields, cached properties, semantic classification, importance scoring, UUID generation, and Pydantic model overhead — all unnecessary for a simple "did this file change?" check.

That said, your idea about node-level hashing for incremental re-indexing (comparing groups of hashes to isolate which nodes changed) is a genuinely interesting future direction. That would live in the chunking layer where AstThings already exist, and could avoid re-embedding unchanged chunks within a changed file. But that's an optimization on top of this change, not a replacement for it — you'd still want the file-level semantic hash as a fast first-pass filter.
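As a rough illustration of that future direction (stdlib `ast` and blake2b as stand-ins, not the engine's AstThing machinery), per-node hashes can isolate which top-level definitions changed between two versions of a file:

```python
import ast
import hashlib


def node_hash(node: ast.AST) -> str:
    return hashlib.blake2b(ast.dump(node).encode()).hexdigest()


def changed_top_level(old_src: str, new_src: str) -> list[str]:
    """Return names of top-level statements whose AST hash changed."""
    old_hashes = {node_hash(n) for n in ast.parse(old_src).body}
    return [
        getattr(n, "name", "<stmt>")
        for n in ast.parse(new_src).body
        if node_hash(n) not in old_hashes
    ]


old = "def f():\n    return 1\n\ndef g():\n    return 2\n"
new = "def f():\n    return 1\n\ndef g():\n    return 3\n"
# Only g changed, so only g needs re-chunking and re-embedding
```

Only the changed chunks would then be re-embedded; unchanged nodes keep their existing vectors.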
