Use AST-based hashing for semantic file change detection #259
Use ast-grep to parse supported language files (25+ languages) and compute hashes from the canonical AST representation instead of raw file bytes. This ignores comments, whitespace, and formatting changes so that only genuine semantic modifications trigger re-indexing.

Changes:
- Add compute_semantic_file_hash() and helpers to discovery.py
- Update DiscoveredFile.__init__, from_path, and file_hash property
- Update indexing_service._process_discovery_batch to use semantic hashing
- Add comprehensive unit tests (23 tests) for semantic hashing

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Pull request overview
Introduces semantic-aware file hashing to reduce unnecessary re-indexing by hashing an AST-derived canonical form (ignoring comments/formatting) for supported languages, with a fallback to raw content hashing when AST parsing isn’t available.
Changes:
- Add `compute_semantic_file_hash()` and AST-walking helpers in `core/discovery.py`, and wire it into `DiscoveredFile` hashing.
- Update indexing discovery batching to use semantic hashing when deciding whether a file needs reindexing.
- Add unit tests covering comment/whitespace invariance, semantic-change detection, language detection, fallback behavior, and `DiscoveredFile` integration.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/codeweaver/core/discovery.py | Adds AST-based hashing and updates DiscoveredFile hashing behavior. |
| src/codeweaver/engine/services/indexing_service.py | Uses semantic hashing when determining whether files have changed. |
| tests/unit/core/test_semantic_hashing.py | Adds unit tests validating semantic hashing behavior and integration. |
src/codeweaver/core/discovery.py
        return get_blake_hash(canonical)
    except (KeyboardInterrupt, SystemExit):
        raise
    except BaseException:
`_compute_ast_hash` catches `BaseException`, which will also swallow `MemoryError`/`GeneratorExit` and other non-recoverable errors (falling back to a content hash instead of surfacing the failure). If the goal is specifically to handle `pyo3_runtime.PanicException` from ast-grep, consider catching that type (or `Exception` plus that type) and/or explicitly re-raising `MemoryError`/`GeneratorExit` similar to how `KeyboardInterrupt`/`SystemExit` are handled.
Suggested change:
-    except BaseException:
+    except Exception:
     """Return the blake3 hash of the file, using AST-based hashing when supported."""
     if self._file_hash is not None:
         return self._file_hash
     if self.path.exists() and self.path.is_file():
-        content_hash = get_blake_hash(self.path.read_bytes())
+        content_bytes = self.path.read_bytes()
+        content_hash = compute_semantic_file_hash(
+            content_bytes, self.path, ext_category=self.ext_category
+        )
DiscoveredFile.file_hash reads via self.path.exists()/read_bytes(), but path is often stored relative to the project root (see set_relative_path usage in from_path and absolute_path). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to get_blake_hash(b""). Prefer using self.absolute_path for the existence/read checks here (and when passing the path into compute_semantic_file_hash) to ensure consistent behavior regardless of CWD.
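The CWD dependence described in this comment can be reproduced with a small sketch. `FileRef`, `project_root`, and `demo` are hypothetical names used only for illustration; they are not CodeWeaver's `DiscoveredFile` API, and the real `absolute_path` may be implemented differently.

```python
import os
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class FileRef:
    """Hypothetical stand-in for a discovered file stored relative to a project root."""

    project_root: Path
    path: Path  # stored relative to project_root

    @property
    def absolute_path(self) -> Path:
        # Resolve against the project root, never against the current working directory.
        return self.path if self.path.is_absolute() else self.project_root / self.path


def demo() -> tuple[bool, bool]:
    """Return (found via relative path, found via absolute_path) when CWD != project root."""
    original_cwd = Path.cwd()
    with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as elsewhere:
        (Path(root) / "module.py").write_text("x = 1\n")
        ref = FileRef(project_root=Path(root), path=Path("module.py"))
        os.chdir(elsewhere)  # simulate running from outside the project
        try:
            return ref.path.exists(), ref.absolute_path.exists()
        finally:
            os.chdir(original_cwd)
```

When the process runs from another directory, the relative check reports the file as missing while the root-anchored check still finds it, which is the failure mode that would lead to hashing empty content.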
Please correct the following two issues:
1. `DiscoveredFile.file_hash` reads via `self.path.exists()`/`read_bytes()`, but `path` is often stored relative to the project root (see `set_relative_path` usage in `from_path` and `absolute_path`). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to `get_blake_hash(b"")`. Prefer using `self.absolute_path` for the existence/read checks here (and when passing the path into `compute_semantic_file_hash`) to ensure consistent behavior regardless of CWD.
2. Let's also catch a narrower error than `BaseException`, like the pyO3 error specifically. We need to be careful because when Codeweaver moves to a monorepo there will be situations where pyO3 isn't a transitive dependency. Offhand, I know watchfiles and rignore use it, but those are engine dependencies. Some providers use it (DuckDuckGo via primp, for example). We may need to make the catch or caught exception dependent on whether we have pyO3 in the environment.
Fixed both issues in ed23aa2:
- `file_hash` now uses `self.absolute_path` for existence checks and reads, matching the pattern already used in `size`, `contents`, and `raw_contents`.
- Narrowed exception handling in `_compute_ast_hash`:
  - `except Exception` catches all standard Python errors (including `ImportError` if `ast_grep_py` is missing)
  - `except BaseException` only fires for non-`Exception` subclasses; checks `getattr(type(exc), "__module__", None) == "pyo3_runtime"` to identify `PanicException` by module name, re-raising everything else (`GeneratorExit`, etc.)

This works in environments without pyo3 since the `ImportError` from `ast_grep_py` falls into the first `except Exception` handler. The pyo3 check only activates when a Rust-side panic actually occurs, and `PanicException` can't be imported directly (it's a Rust-embedded type not registered in `sys.modules`), hence the module name check.
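The two-tier handler described above might look roughly like the following. This is a sketch of the pattern, not the code in `_compute_ast_hash`; `is_pyo3_panic` and `hash_or_none` are hypothetical names, and the only detail taken from the thread is the `getattr(type(exc), "__module__", None) == "pyo3_runtime"` check.

```python
from typing import Callable, Optional


def is_pyo3_panic(exc: BaseException) -> bool:
    """Identify a pyo3 panic by the module name of its type, without importing it."""
    return getattr(type(exc), "__module__", None) == "pyo3_runtime"


def hash_or_none(compute: Callable[[], str]) -> Optional[str]:
    """Run compute(); return None on recoverable failure so the caller can fall back."""
    try:
        return compute()
    except Exception:
        # Standard errors, including ImportError when the parser is not installed.
        return None
    except BaseException as exc:
        # Only reached for non-Exception subclasses (GeneratorExit, KeyboardInterrupt, ...).
        if is_pyo3_panic(exc):
            return None
        raise
```

Because the check inspects only the exception type's module name, the function needs no import of pyo3 and behaves the same whether or not pyo3 is installed.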
Code Review - AST-based Semantic File Hashing
I've reviewed PR #259 and the implementation looks excellent! This is a well-designed feature that addresses a real problem with smart engineering.
✅ Strengths
1. Sound Architecture
2. Robust Error Handling
3. Excellent Test Coverage
4. Performance Considerations
5. Documentation
🔍 Observations
1. Test Failures Are Unrelated
These appear to be pre-existing flaky tests or issues unrelated to the semantic hashing changes.
2. Implementation Notes
3. Integration Points
All integration points properly use the new semantic hashing function.
💡 Minor Suggestions
1. Consider Caching (Future Enhancement)
2. Metrics/Telemetry (Future Enhancement)
3. Documentation Enhancement
🎯 Impact Assessment
Positive Impacts:
Security: No security concerns identified.
Performance: Expected improvement due to fewer unnecessary re-indexing operations. AST parsing adds minimal overhead and only runs for supported semantic files.
Backward Compatibility: Fully compatible. Existing content hashes will differ from new semantic hashes, causing a one-time re-index, which is expected and acceptable.
📋 Verdict
This PR is well-implemented and ready to merge once the unrelated test failures are addressed. The semantic hashing implementation itself is solid. I recommend:
Great work on this feature! 🎉
Code Review - PR #259: Use AST-based hashing for semantic file change detection
I've thoroughly reviewed this PR and the implementation looks excellent! Here's my detailed analysis:
✅ Overview
This PR successfully implements semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection. This is a significant improvement that will reduce unnecessary re-indexing when only formatting, comments, or whitespace changes occur.
✅ Implementation Quality
Strengths:
✅ Code Changes Analysis
src/codeweaver/core/discovery.py:
src/codeweaver/engine/services/indexing_service.py:
Additional Code Review - PR #259
I've conducted an additional review of the AST-based semantic file hashing implementation. Here are my findings:
✅ Code Quality & Design
Excellent Implementation:
🔍 Technical Observations
1. Comment Detection Algorithm (discovery.py:66-67)

    if "comment" in kind.lower():
        return

This simple but effective approach filters out comment nodes. Works across all 25+ supported languages since tree-sitter parsers consistently name comment nodes.
2. AST Canonicalization (discovery.py:71, 97)
3. Integration Points
4. Test Coverage (test_semantic_hashing.py)
🎯 Impact Analysis
Benefits:
Migration Impact:
Performance:
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
PR Review: AST-based Semantic File Hashing
I've completed a thorough review of this PR. Overall, this is excellent work with clean design and comprehensive testing. Here's my detailed assessment:
Strengths ✅
Concerns
- file_hash property now uses self.absolute_path for existence checks and reads so relative paths resolve correctly regardless of CWD
- _compute_ast_hash catches Exception for standard errors, then catches BaseException only for pyo3_runtime.PanicException identified by module name, re-raising all other non-recoverable errors

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Pull Request Review: AST-based Semantic File Hashing
Summary
This PR implements AST-based semantic file hashing to avoid unnecessary re-indexing when only comments, whitespace, or formatting changes occur. The implementation is well-designed and thoroughly tested.
Code Quality: Excellent ✅
Strengths
1. Clean Architecture
- Well-organized into small, focused functions (`_walk_ast_nodes`, `_compute_ast_hash`, `_get_semantic_language`)
- Clear separation of concerns between AST parsing and hash computation
- Proper use of existing infrastructure (ast-grep, blake3)
2. Robust Error Handling (src/codeweaver/core/discovery.py:78-110)
- Graceful fallback to content-based hashing on parse failures
- Clever handling of `pyo3_runtime.PanicException` that can't be imported directly
- Proper exception hierarchy: catches `Exception` first, then `BaseException` selectively
- Avoids swallowing critical exceptions like `GeneratorExit`
- Good logging with `exc_info=True` for debugging
3. Semantic Correctness
- Correctly excludes comments via `"comment" in kind.lower()` check
- Builds canonical representation from AST node kinds and leaf text
- Properly handles both named and unnamed AST nodes
- Docstrings correctly treated as semantic (not comments) since they're expression nodes
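The traversal summarized in this list can be sketched as follows. `Node`, `walk`, and `canonical_hash` are hypothetical illustrations: the PR walks ast-grep nodes rather than a dataclass, the `"\x00"` separator is an assumption, and `hashlib.blake2b` from the standard library stands in for blake3.

```python
from __future__ import annotations

import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical minimal AST node; real tree-sitter nodes expose similar data."""

    kind: str
    text: str = ""
    children: list[Node] = field(default_factory=list)


def walk(node: Node, parts: list[str]) -> None:
    """Append node kinds and leaf text to parts, skipping comment subtrees."""
    if "comment" in node.kind.lower():
        return
    parts.append(node.kind)
    if not node.children:
        # Leaf text carries identifiers, operators, and literals.
        parts.append(node.text)
    for child in node.children:
        walk(child, parts)


def canonical_hash(root: Node) -> str:
    """Hash the canonical sequence of kinds and leaf text."""
    parts: list[str] = []
    walk(root, parts)
    # blake2b is a stand-in for blake3 in this sketch.
    return hashlib.blake2b("\x00".join(parts).encode("utf-8")).hexdigest()
```

Two trees that differ only by a comment node produce the same canonical sequence, while a changed operator or identifier changes a leaf's text and therefore the hash.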
4. Path Handling Fix (src/codeweaver/core/discovery.py:341-351)
- The `file_hash` property now correctly uses `self.absolute_path` instead of `self.path`
- This ensures relative paths resolve correctly regardless of current working directory
- Important fix for reliability
5. Integration Points
- Updated both `DiscoveredFile` initialization and `indexing_service._process_discovery_batch`
- Consistent API with optional `ext_category` parameter
- Properly exported in `__all__`
Testing: Comprehensive ✅
The test suite (tests/unit/core/test_semantic_hashing.py) is excellent with 23 tests covering:
- ✅ Comment invariance (Python and JavaScript)
- ✅ Formatting/whitespace invariance
- ✅ Semantic change detection (operators, identifiers)
- ✅ Language detection and fallback
- ✅ Empty content and invalid syntax
- ✅ Docstring changes (correctly different hash)
- ✅ `DiscoveredFile` integration
- ✅ `ext_category` precedence
Notable test quality:
- Tests both low-level functions and integration
- Clear test names and docstrings
- Good coverage of edge cases
Performance Considerations
Positive:
- Uses existing `ast_grep_py` dependency (no new deps)
- Lazy import of `SgRoot` only when needed
- Falls back quickly on parse failures
Potential concerns:
- AST parsing is inherently slower than blake3 hashing raw bytes
- For large codebases with frequent file scans, this could add latency
- However, the benefit of avoiding unnecessary re-indexing likely outweighs the parsing cost
Recommendation: Monitor performance in production. If AST parsing becomes a bottleneck, consider:
- Caching parsed ASTs (but memory cost may be high)
- Parallel processing of file hashing
- Size threshold above which to skip AST hashing
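As one illustration of the last suggestion, a size gate could wrap whatever semantic hasher is in use. Everything here is hypothetical: `MAX_AST_BYTES` is an arbitrary cutoff, `hash_with_size_gate` is not part of the PR, and `hashlib.blake2b` stands in for blake3.

```python
import hashlib
from typing import Callable, Optional

MAX_AST_BYTES = 1_000_000  # hypothetical cutoff; choose from real measurements


def hash_with_size_gate(
    content: bytes,
    ast_hasher: Callable[[bytes], Optional[str]],
    max_bytes: int = MAX_AST_BYTES,
) -> str:
    """Use the AST hasher for small inputs; hash raw bytes for large or unparseable ones."""
    if len(content) <= max_bytes:
        semantic = ast_hasher(content)
        if semantic is not None:
            return semantic
    # Large file or parse failure: fall back to a plain content hash.
    return hashlib.blake2b(content).hexdigest()
```

One trade-off to keep in mind: a file that grows past the cutoff switches from a semantic hash to a content hash, so it would be re-indexed once at that point even if its meaning did not change.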
Security: Good ✅
- No credential handling or external network calls
- Proper exception handling prevents crashes
- Uses `errors='replace'` when decoding bytes to UTF-8
- No code execution risks (ast-grep is a parser, not evaluator)
Documentation: Good ✅
- Clear docstrings for all new functions
- Updated field description for `_file_hash` in `DiscoveredFile`
- Helpful inline comments explaining the `pyo3_runtime.PanicException` handling
- PR description clearly explains the approach and changes
Minor Observations
- Docstring behavior (noted in PR description): Docstring changes produce different hashes. This is semantically correct since docstrings are AST nodes, not comments. However, for tools like linters that auto-format docstrings, this might cause some re-indexing. This seems acceptable.
- Language coverage: The implementation supports 25+ languages via ast-grep. The fallback ensures any unsupported files still work.
- Type safety: Good use of type hints throughout, including `BlakeHashKey | None` return types.
CI Status
All checks passing ✅
Recommendation
APPROVE ✅
This is high-quality code that solves a real problem (unnecessary re-indexing) with a well-thought-out approach. The implementation is robust, well-tested, and properly integrated. The use of AST-based hashing for semantic files is the right architectural choice.
Suggested Follow-ups (optional, not blocking)
- Monitor performance impact in production environments with large codebases
- Consider adding metrics/logging for AST hash vs content hash usage rates
- Could add a configuration option to disable AST hashing if needed for performance
@copilot hey team, I think I disagree with the direction here and I'd like you to at least consider an alternative. I may also not fully understand the implementation, so feel free to correct me. We already parse and cache ASTs as part of the engine pipeline. An AstThing object holds its AST and can query across them. Would it make more sense for each AstThing to store its own hash? You could even potentially isolate only changed nodes by comparing a group of hashes for a file.
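The node-level comparison raised here could be prototyped along these lines. This is only a sketch of the idea: it uses Python's standard-library `ast` module rather than `AstThing` or ast-grep, covers only top-level functions and classes, and `node_hashes` and `changed_nodes` are hypothetical names.

```python
import ast
import hashlib


def node_hashes(source: str) -> dict[str, str]:
    """Map each top-level function or class name to a hash of its AST dump."""
    hashes: dict[str, str] = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # ast.dump omits line/column attributes by default, so layout does not matter.
            dump = ast.dump(node)
            hashes[node.name] = hashlib.blake2b(dump.encode("utf-8")).hexdigest()
    return hashes


def changed_nodes(old: str, new: str) -> set[str]:
    """Return names that were added, removed, or semantically modified."""
    before, after = node_hashes(old), node_hashes(new)
    return {name for name in before.keys() | after.keys() if before.get(name) != after.get(name)}
```

Comparing the two hash maps isolates the definitions whose structure changed, while comment and spacing edits inside unchanged definitions do not register.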
The semantic file hash serves a different purpose from AstThing's AST: it's a pre-indexing gate. The pipeline flow is:
The hash exists to skip step 2 entirely for unchanged files. If we moved hashing to AstThing, we'd need to construct a full FileThing (with all its Pydantic model machinery, ThingRegistry lookups, classification, importance scoring, etc.) for every discovered file just to check if it changed, which is exactly the expensive work we're trying to avoid. The lightweight
That said, your idea about node-level hashing for incremental re-indexing (comparing groups of hashes to isolate which nodes changed) is a genuinely interesting future direction. That would live in the chunking layer where AstThings already exist, and could avoid re-embedding unchanged chunks within a changed file. But that's an optimization on top of this change, not a replacement for it: you'd still want the file-level semantic hash as a fast first-pass filter.
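The gate role described in this reply can be pictured as a comparison of freshly computed hashes against a stored manifest before any expensive work starts. The function and the dictionary shapes below are hypothetical and are not taken from `indexing_service`.

```python
def files_needing_reindex(current: dict[str, str], stored: dict[str, str]) -> list[str]:
    """Return paths whose hash is new or differs from the stored manifest.

    `current` maps path -> freshly computed file hash; `stored` maps
    path -> hash recorded at the last indexing run.
    """
    return sorted(path for path, digest in current.items() if stored.get(path) != digest)
```

For example, with `stored = {"a.py": "h1", "b.py": "h2"}` and `current = {"a.py": "h1", "b.py": "h9", "c.py": "h3"}`, only `b.py` and `c.py` would be passed on to parsing, chunking, and embedding.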
File hashing used raw content bytes, so comment edits, whitespace changes, and reformatting all triggered unnecessary re-indexing. This replaces raw hashing with AST-based hashing for the 25+ languages already supported by ast-grep.
How it works
The AST normalizes away whitespace (not represented in the tree) and we skip comment nodes during traversal. The canonical representation is the sequence of node kinds and leaf text values, which is then blake3-hashed as before.
Non-semantic files (markdown, plain text, etc.) and AST parse failures fall back to raw content hashing.
Changes
- `discovery.py`: Add `compute_semantic_file_hash()` with internal helpers `_compute_ast_hash`, `_walk_ast_nodes`, `_get_semantic_language`. Update `DiscoveredFile.__init__`, `from_path`, and `file_hash` property to use it. The `file_hash` property uses `self.absolute_path` for existence checks and reads to handle relative paths correctly regardless of CWD.
- `indexing_service.py`: Replace `get_blake_hash(content_bytes)` with `compute_semantic_file_hash(content_bytes, path)` in `_process_discovery_batch`.
- `test_semantic_hashing.py`: 23 unit tests covering comment invariance, formatting invariance, semantic change detection, language detection, fallback behavior, and `DiscoveredFile` integration.
Notes
- `ast_grep_py` (tree-sitter).
- `pyo3_runtime.PanicException` from unsupported languages inherits from `BaseException`, not `Exception`, and cannot be imported directly. The fallback catches `Exception` for standard errors, then catches `BaseException` only for `pyo3_runtime.PanicException` identified by module name, re-raising all other non-recoverable errors (`GeneratorExit`, etc.). This approach works in environments without pyo3 since `ImportError` from `ast_grep_py` is handled by the `except Exception` clause.