
Use AST-based hashing for semantic file change detection#259

Open
Copilot wants to merge 5 commits into main from copilot/use-ast-hashing-for-file-changes

Conversation

Contributor

Copilot AI commented Mar 19, 2026

File hashing used raw content bytes, so comment edits, whitespace changes, and reformatting all triggered unnecessary re-indexing. This replaces raw hashing with AST-based hashing for the 25+ languages already supported by ast-grep.

How it works

The AST normalizes away whitespace (which is not represented in the tree), and comment nodes are skipped during traversal. The canonical representation is the sequence of node kinds and leaf text values, which is then blake3-hashed as before.

# Same hash — only comments differ
compute_semantic_file_hash(b"# old\ndef f(): pass", Path("x.py"))
compute_semantic_file_hash(b"# new\ndef f(): pass", Path("x.py"))

# Same hash — only formatting differs
compute_semantic_file_hash(b"def add(a,b):\n    return a+b", Path("x.py"))
compute_semantic_file_hash(b"def add(a, b):\n    return a + b", Path("x.py"))

# Different hash — operator changed
compute_semantic_file_hash(b"return a + b", Path("x.py"))
compute_semantic_file_hash(b"return a * b", Path("x.py"))

Non-semantic files (markdown, plain text, etc.) and AST parse failures fall back to raw content hashing.
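The canonicalization and fallback described above can be sketched with a toy tree. The `Node` class, `walk`, and `semantic_hash` below are hypothetical stand-ins for ast-grep's SgNode traversal, and `blake2b` (stdlib) stands in for the blake3 hash the project actually uses:

```python
from dataclasses import dataclass, field
from hashlib import blake2b  # stand-in for blake3, a third-party package


@dataclass
class Node:
    """Minimal stand-in for an ast-grep SgNode: a kind, leaf text, children."""
    kind: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


def walk(node: Node, parts: list[str]) -> None:
    """Append the canonical representation of `node` to `parts`."""
    if "comment" in node.kind.lower():  # skip comment nodes entirely
        return
    if not node.children:               # leaf: keep kind and text
        parts.append(f"{node.kind}:{node.text}")
    else:                               # interior node: keep kind only
        parts.append(node.kind)
        for child in node.children:
            walk(child, parts)


def semantic_hash(root: Node) -> str:
    """Hash the canonical node sequence; whitespace never appears in the tree."""
    parts: list[str] = []
    walk(root, parts)
    return blake2b("\n".join(parts).encode()).hexdigest()


# Two trees that differ only in a comment hash identically.
a = Node("module", children=[Node("comment", "# old"), Node("identifier", "f")])
b = Node("module", children=[Node("comment", "# new"), Node("identifier", "f")])
```

In the real implementation the tree comes from parsing the file with ast-grep; the fallback path simply hashes the raw bytes when no tree can be produced.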

Changes

  • discovery.py: Add compute_semantic_file_hash() with internal helpers _compute_ast_hash, _walk_ast_nodes, _get_semantic_language. Update DiscoveredFile.__init__, from_path, and file_hash property to use it. The file_hash property uses self.absolute_path for existence checks and reads to handle relative paths correctly regardless of CWD.
  • indexing_service.py: Replace get_blake_hash(content_bytes) with compute_semantic_file_hash(content_bytes, path) in _process_discovery_batch.
  • test_semantic_hashing.py: 23 unit tests covering comment invariance, formatting invariance, semantic change detection, language detection, fallback behavior, and DiscoveredFile integration.

Notes

  • No new dependencies — uses existing ast_grep_py (tree-sitter).
  • pyo3_runtime.PanicException from unsupported languages inherits from BaseException, not Exception, and cannot be imported directly. The fallback catches Exception for standard errors, then catches BaseException only for pyo3_runtime.PanicException identified by module name — re-raising all other non-recoverable errors (GeneratorExit, etc.). This approach works in environments without pyo3 since ImportError from ast_grep_py is handled by the except Exception clause.
  • Docstring changes do produce different hashes since docstrings are AST expression nodes, not comments.
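The tiered exception handling described in the second note can be sketched as follows. `_FakePanic` and `ast_hash_or_none` are illustrative stand-ins, not the project's code; the real `pyo3_runtime.PanicException` is identified the same way, by its module name, since it cannot be imported directly:

```python
import logging

logger = logging.getLogger(__name__)


class _FakePanic(BaseException):
    """Stand-in for pyo3_runtime.PanicException (BaseException, not Exception)."""


# PanicException lives in the Rust-embedded pyo3_runtime module.
_FakePanic.__module__ = "pyo3_runtime"


def ast_hash_or_none(parse):
    """Run `parse()`; return None on recoverable failure, re-raise otherwise."""
    try:
        return parse()
    except Exception:
        # Standard errors, including ImportError when ast_grep_py is absent.
        logger.debug("AST parsing failed", exc_info=True)
        return None
    except BaseException as exc:
        # Only swallow Rust panics surfaced by pyo3; re-raise everything
        # else (KeyboardInterrupt, GeneratorExit, ...).
        if getattr(type(exc), "__module__", None) == "pyo3_runtime":
            logger.debug("pyo3 panic during AST parsing", exc_info=True)
            return None
        raise
```

Because the `except Exception` clause comes first, the `BaseException` branch only ever sees non-`Exception` types, so the module-name check is the sole gate on swallowing them.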
Original prompt

This section details the original issue you should resolve

<issue_title>Use AST-based hashing for semantic file change detection</issue_title>
<issue_description>## Description

Implement semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection.

Current Approach

  • Files hashed based on raw content
  • Formatting changes trigger unnecessary re-indexing
  • Comments and whitespace changes cause false positives

Proposed Approach

For Semantic Files (discovery.py:165, 182)

Use ast() to compute hash for semantic files like Python, JavaScript, etc.

Benefits:

  • Detect actual semantic changes
  • Ignore formatting, comments, whitespace
  • More intelligent change detection
  • Reduce unnecessary re-indexing

Implementation:

# Hash AST structure instead of raw bytes
ast_hash = hash_ast(parse_file(path))

Considerations

  • Language-specific AST parsers (tree-sitter already available)
  • Fallback to content hash for non-semantic files
  • Performance impact of parsing
  • Cache parsed ASTs when possible

Impact

  • Fewer unnecessary re-indexes
  • Better performance
  • Smarter change detection
  • More efficient resource usage

Source

  • Files: discovery.py:165, discovery.py:182
  • Branch: 003-our-aim-to</issue_description>

<agent_instructions>@copilot we need a new assessment on this issue based on the codebase's current state following major refactoring

If you assess the path to implementation is straight forward, then go ahead, complete it and submit a PR. Otherwise, submit a plan.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@bashandbone @copilot we need a new assessment on this issue based on the codebase's current state following major refactoring


Use ast-grep to parse supported language files (25+ languages) and
compute hashes from the canonical AST representation instead of raw
file bytes. This ignores comments, whitespace, and formatting changes
so that only genuine semantic modifications trigger re-indexing.

Changes:
- Add compute_semantic_file_hash() and helpers to discovery.py
- Update DiscoveredFile.__init__, from_path, and file_hash property
- Update indexing_service._process_discovery_batch to use semantic hashing
- Add comprehensive unit tests (23 tests) for semantic hashing

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Implement AST-based hashing for semantic file change detection" to "Use AST-based hashing for semantic file change detection" on Mar 19, 2026
Copilot AI requested a review from bashandbone March 19, 2026 19:50
@bashandbone bashandbone marked this pull request as ready for review March 19, 2026 22:43
Copilot AI review requested due to automatic review settings March 19, 2026 22:43
Contributor

Copilot AI left a comment


Pull request overview

Introduces semantic-aware file hashing to reduce unnecessary re-indexing by hashing an AST-derived canonical form (ignoring comments/formatting) for supported languages, with a fallback to raw content hashing when AST parsing isn’t available.

Changes:

  • Add compute_semantic_file_hash() and AST-walking helpers in core/discovery.py, and wire it into DiscoveredFile hashing.
  • Update indexing discovery batching to use semantic hashing when deciding whether a file needs reindexing.
  • Add unit tests covering comment/whitespace invariance, semantic-change detection, language detection, fallback behavior, and DiscoveredFile integration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
src/codeweaver/core/discovery.py Adds AST-based hashing and updates DiscoveredFile hashing behavior.
src/codeweaver/engine/services/indexing_service.py Uses semantic hashing when determining whether files have changed.
tests/unit/core/test_semantic_hashing.py Adds unit tests validating semantic hashing behavior and integration.


        return get_blake_hash(canonical)
    except (KeyboardInterrupt, SystemExit):
        raise
    except BaseException:

Copilot AI Mar 19, 2026


_compute_ast_hash catches BaseException, which will also swallow MemoryError/GeneratorExit and other non-recoverable errors (falling back to a content hash instead of surfacing the failure). If the goal is specifically to handle pyo3_runtime.PanicException from ast-grep, consider catching that type (or Exception plus that type) and/or explicitly re-raising MemoryError/GeneratorExit similar to how KeyboardInterrupt/SystemExit are handled.

Suggested change:
-    except BaseException:
+    except Exception:

Comment on lines +337 to +344
"""Return the blake3 hash of the file, using AST-based hashing when supported."""
if self._file_hash is not None:
return self._file_hash
if self.path.exists() and self.path.is_file():
content_hash = get_blake_hash(self.path.read_bytes())
content_bytes = self.path.read_bytes()
content_hash = compute_semantic_file_hash(
content_bytes, self.path, ext_category=self.ext_category
)

Copilot AI Mar 19, 2026


DiscoveredFile.file_hash reads via self.path.exists()/read_bytes(), but path is often stored relative to the project root (see set_relative_path usage in from_path and absolute_path). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to get_blake_hash(b""). Prefer using self.absolute_path for the existence/read checks here (and when passing the path into compute_semantic_file_hash) to ensure consistent behavior regardless of CWD.

Contributor


@copilot

Please correct the following two issues:

DiscoveredFile.file_hash reads via self.path.exists()/read_bytes(), but path is often stored relative to the project root (see set_relative_path usage in from_path and absolute_path). If the current working directory isn’t the project root, this branch can incorrectly treat an existing file as missing and fall back to get_blake_hash(b""). Prefer using self.absolute_path for the existence/read checks here (and when passing the path into compute_semantic_file_hash) to ensure consistent behavior regardless of CWD.

Let's also catch a narrower error than BaseException, like the pyO3 error specifically. We need to be careful because when Codeweaver moves to a monorepo there will be situations where pyO3 isn't a transitive dependency.

Offhand, I know watchfiles and rignore use it, but those are engine dependencies. Some providers use it (DuckDuckGo via primp, for example). We may need to make the catch or caught exception dependent on whether we have pyO3 in the environment

Contributor Author


Fixed both issues in ed23aa2:

  1. file_hash now uses self.absolute_path for existence checks and reads, matching the pattern already used in size, contents, and raw_contents.

  2. Narrowed exception handling in _compute_ast_hash:

    • except Exception catches all standard Python errors (including ImportError if ast_grep_py is missing)
    • except BaseException only fires for non-Exception subclasses; checks getattr(type(exc), "__module__", None) == "pyo3_runtime" to identify PanicException by module name, re-raising everything else (GeneratorExit, etc.)

    This works in environments without pyo3 since the ImportError from ast_grep_py falls into the first except Exception handler. The pyo3 check only activates when a Rust-side panic actually occurs — and PanicException can't be imported directly (it's a Rust-embedded type not registered in sys.modules), hence the module name check.

@github-actions
Contributor

Code Review - AST-based Semantic File Hashing

I've reviewed PR #259 and the implementation looks excellent! This is a well-designed feature that addresses a real problem with smart engineering.

✅ Strengths

1. Sound Architecture

  • Clean separation of concerns with _walk_ast_nodes, _compute_ast_hash, _get_semantic_language, and compute_semantic_file_hash functions
  • Proper fallback strategy: AST parsing → content hash (graceful degradation)
  • Uses existing ast_grep_py dependency (no new dependencies)

2. Robust Error Handling

  • Catches BaseException to handle pyo3_runtime.PanicException (which doesn't inherit from Exception)
  • Re-raises KeyboardInterrupt and SystemExit appropriately
  • Debug logging with exc_info=True for troubleshooting (discovery.py:102, 142)
  • Multiple fallback paths ensure the system remains functional

3. Excellent Test Coverage

  • 23 comprehensive unit tests covering:
    • Comment invariance (hash stays same when only comments change)
    • Formatting invariance (whitespace/spacing changes don't affect hash)
    • Semantic change detection (logic changes produce different hashes)
    • Language detection
    • Fallback behavior
    • DiscoveredFile integration
  • Tests cover both Python and JavaScript
  • Tests verify docstrings ARE considered semantic (correct behavior)

4. Performance Considerations

  • AST parsing happens only for supported semantic files
  • Efficient tree traversal that skips comment nodes
  • Blake3 hashing remains fast

5. Documentation

  • Clear docstrings explaining behavior
  • Updated field descriptions (discovery.py:174)
  • Good code examples in PR description

🔍 Observations

1. Test Failures Are Unrelated
All semantic hashing tests (23/23) are passing in CI. The failing tests are in:

  • tests/e2e/test_user_journeys.py - E2E user journey tests
  • tests/integration/real/test_full_pipeline.py - Integration tests
  • tests/integration/real/test_search_behavior.py - Search behavior tests
  • tests/integration/workflows/test_search_workflows.py - Workflow tests

These appear to be pre-existing flaky tests or issues unrelated to the semantic hashing changes.

2. Implementation Notes

  • The AST hash includes both node kinds and leaf text: f"{kind}:{node.text()}" (discovery.py:71)
  • This ensures semantic content is captured while formatting is normalized
  • Proper handling of empty files (returns None for empty AST)

3. Integration Points

  • DiscoveredFile.__init__ (discovery.py:220)
  • DiscoveredFile.from_path (discovery.py:273)
  • DiscoveredFile.file_hash property (discovery.py:341)
  • IndexingService._process_discovery_batch (indexing_service.py:316)

All integration points properly use the new semantic hashing function.

💡 Minor Suggestions

1. Consider Caching (Future Enhancement)
For large files that are parsed frequently, consider caching the AST or hash. This could be a future optimization if profiling shows it's needed.

2. Metrics/Telemetry (Future Enhancement)
Track AST parsing failures vs. successes to monitor which languages/files are falling back to content hashing. This could help identify parser issues or unsupported edge cases.

3. Documentation Enhancement
Consider adding a comment explaining why BaseException is needed for pyo3_runtime.PanicException (discovery.py:101). The PR description mentions this, but it would be helpful in the code itself.

🎯 Impact Assessment

Positive Impacts:

  • ✅ Significantly reduces unnecessary re-indexing from formatting changes
  • ✅ Better developer experience (linters, formatters won't trigger re-index)
  • ✅ More efficient resource usage
  • ✅ Supports 25+ languages via ast-grep

Security: No security concerns identified.

Performance: Expected improvement due to fewer unnecessary re-indexing operations. AST parsing adds minimal overhead and only runs for supported semantic files.

Backward Compatibility: Fully compatible. Existing content hashes will differ from new semantic hashes, causing a one-time re-index, which is expected and acceptable.

📋 Verdict

This PR is well-implemented and ready to merge once the unrelated test failures are addressed.

The semantic hashing implementation itself is solid. I recommend:

  1. Investigating the failing E2E and integration tests (they appear unrelated to this PR)
  2. If those tests are flaky or pre-existing issues, consider merging this PR and addressing them separately
  3. Consider adding the minor documentation enhancement mentioned above

Great work on this feature! 🎉

@github-actions
Contributor

Code Review - PR #259: Use AST-based hashing for semantic file change detection

I've thoroughly reviewed this PR and the implementation looks excellent! Here's my detailed analysis:

✅ Overview

This PR successfully implements semantic-aware file hashing using AST (Abstract Syntax Tree) instead of raw file contents for better change detection. This is a significant improvement that will reduce unnecessary re-indexing when only formatting, comments, or whitespace changes occur.

✅ Implementation Quality

Strengths:

  1. Well-designed architecture: The implementation follows a clean separation of concerns with three helper functions (_walk_ast_nodes, _compute_ast_hash, _get_semantic_language) and one public API (compute_semantic_file_hash).

  2. Robust error handling:

    • Catches BaseException to handle pyo3_runtime.PanicException which doesn't inherit from Exception
    • Properly re-raises KeyboardInterrupt and SystemExit
    • Graceful fallback to content hashing on any failure
    • Good use of debug logging with exc_info=True for troubleshooting
  3. Comprehensive testing: 23 new unit tests cover:

    • Comment invariance ✓
    • Formatting/whitespace invariance ✓
    • Semantic change detection ✓
    • Language detection ✓
    • Fallback behavior ✓
    • DiscoveredFile integration ✓
    • All tests are passing
  4. Leverages existing dependencies: Uses already-included ast-grep-py (tree-sitter) without adding new dependencies.

  5. Documentation: Clear docstrings explaining behavior, especially the important note about docstrings being semantic content (not comments).

✅ Code Changes Analysis

src/codeweaver/core/discovery.py:

  • _walk_ast_nodes(): Recursively traverses AST, skipping comment nodes by checking if "comment" is in the kind name (case-insensitive). Builds a canonical representation of node kinds and leaf text.
  • _compute_ast_hash(): Wraps AST parsing with proper error handling, returns None on failure.
  • _get_semantic_language(): Determines if a file has AST support via extension or ExtCategory.
  • compute_semantic_file_hash(): Public API that attempts AST hashing, falls back to content hashing.
  • Integration points: Updated DiscoveredFile.__init__, from_path, and file_hash property to use the new function.

src/codeweaver/engine/services/indexing_service.py:

  • Line 316: Changed from get_blake_hash(content_bytes) to compute_semantic_file_hash(content_bytes, path) in _process_discovery_batch.

⚠️ Test Failures (Unrelated to this PR)

The CI shows 23 failing tests, but these are NOT caused by this PR:

  • Most failures are from MCP spec validation errors: ValueError: Output schemas must represent object types due to MCP spec limitations
  • Some e2e test failures with KeyError: 'max_file_size'
  • One FileNotFoundError in vector storage
  • All new semantic hashing tests (23/23) are passing

These appear to be pre-existing issues in the test suite, likely from recent refactoring mentioned in the PR description.

💡 Technical Considerations

Good decisions:

  1. Comment detection (line 67): Using "comment" in kind.lower() is simple and effective for catching comment node types across different languages (e.g., "comment", "line_comment", "block_comment").

  2. Docstrings are semantic: The PR correctly notes that docstrings produce different hashes since they're AST expression nodes, not comments. This is the right choice as docstrings are part of the API contract.

  3. UTF-8 decoding with error handling (line 135): Uses errors="replace" which prevents crashes on invalid UTF-8 while still allowing AST hashing to proceed.

  4. Walrus operator usage: Clean use of assignment expressions (:=) for both language and hash checks.
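The `errors="replace"` decoding behavior noted in point 3 is easy to demonstrate (an illustrative snippet, not project code):

```python
# Bytes ending in an invalid UTF-8 sequence still decode without raising;
# each bad byte becomes the U+FFFD replacement character, so AST hashing
# can proceed on the (slightly lossy) decoded text.
raw = b"def f(): pass\n\xff\xfe"
text = raw.decode("utf-8", errors="replace")
```

A strict `raw.decode("utf-8")` would raise `UnicodeDecodeError` here, forcing an early fallback to the content hash.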

🎯 Security & Performance

Security:

  • ✅ No security concerns
  • ✅ Proper exception handling prevents crashes
  • ✅ No exposure of sensitive information

Performance:

  • ✅ AST parsing is only done for supported semantic languages
  • ✅ Graceful fallback ensures no performance regression for non-semantic files
  • ✅ Reduces unnecessary re-indexing, improving overall system performance
  • ✅ Already uses blake3 which is very fast

📊 Impact Assessment

Benefits:

  • Fewer unnecessary re-indexes when developers format code
  • More intelligent change detection
  • Better resource usage
  • Improves developer experience (auto-formatters won't trigger full re-indexing)

Risks:

  • ⚠️ Very low - comprehensive fallback strategy ensures robustness
  • All changes are backward compatible

✅ Recommendation

APPROVE - This PR is ready to merge.

The implementation is well-designed, thoroughly tested, and delivers significant value. The failing tests are unrelated to this PR and should be addressed separately.

Suggestions for future consideration:

  1. Consider caching parsed ASTs if files are hashed multiple times in quick succession
  2. Monitor AST parsing performance in production to validate the benefits

Test Results: All 23 new semantic hashing tests passing ✓
CI Failures: Unrelated pre-existing issues (23 failures in MCP/e2e tests)
Code Quality: Excellent - clean architecture, robust error handling, comprehensive tests
Documentation: Clear and complete

Great work! This is a valuable improvement to the codebase. 🎉

@github-actions
Contributor

Additional Code Review - PR #259

I've conducted an additional review of the AST-based semantic file hashing implementation. Here are my findings:

✅ Code Quality & Design

Excellent Implementation:

  • Clean abstraction layers: The code properly separates concerns with _walk_ast_nodes, _compute_ast_hash, _get_semantic_language, and compute_semantic_file_hash (discovery.py:59-144)
  • Defensive programming: Proper exception handling with BaseException catch to handle pyo3_runtime.PanicException from unsupported languages
  • Graceful degradation: Falls back to content hashing when AST parsing fails or language is unsupported
  • No new dependencies: Leverages existing ast_grep_py (tree-sitter) dependency

🔍 Technical Observations

1. Comment Detection Algorithm (discovery.py:66-67)

if "comment" in kind.lower():
    return

This simple but effective approach filters out comment nodes. Works across all 25+ supported languages since tree-sitter parsers consistently name comment nodes.
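For illustration, node-kind names modeled on common tree-sitter grammars (the exact names vary per grammar; these are assumptions, not a verified list) all match the substring check, while ordinary nodes do not:

```python
# Hypothetical kind names modeled on common tree-sitter grammars.
comment_kinds = ["comment", "line_comment", "block_comment", "Comment"]
other_kinds = ["function_definition", "string", "identifier"]

# The case-insensitive substring check used by the implementation.
matches = [k for k in comment_kinds + other_kinds if "comment" in k.lower()]
```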

2. AST Canonicalization (discovery.py:71, 97)
The canonical representation combines node kinds and leaf text values, ensuring:

  • Whitespace normalization (not represented in AST)
  • Formatting independence
  • Semantic content preservation

3. Integration Points
All file hashing locations properly updated:

  • DiscoveredFile.__init__ (discovery.py:220)
  • DiscoveredFile.from_path (discovery.py:273)
  • DiscoveredFile.file_hash property (discovery.py:341)
  • IndexingService._process_discovery_batch (indexing_service.py:316)

4. Test Coverage (test_semantic_hashing.py)
Comprehensive 285-line test suite with 23 tests covering:

  • Comment invariance (Python & JavaScript)
  • Whitespace/formatting invariance
  • Semantic change detection
  • Docstring handling (correctly treated as semantic)
  • Language detection
  • Fallback behavior
  • DiscoveredFile integration

🎯 Impact Analysis

Benefits:

  • Reduced re-indexing: Formatting changes (prettier, black, eslint) won't trigger re-index
  • Better CI/CD: Automated formatters won't cause false positives
  • Resource efficiency: Fewer unnecessary embedding computations
  • 25+ language support: Python, JavaScript, TypeScript, Go, Rust, Java, C++, etc.

Migration Impact:
Existing file hashes will differ from new semantic hashes, triggering a one-time re-index of all semantic files. This is expected and acceptable behavior.

Performance:

  • AST parsing adds minimal overhead (only for semantic files)
  • Tree-sitter parsers are highly optimized
  • Blake3 hashing remains fast
  • Net performance gain from avoiding unnecessary re-indexing

⚠️ Test Failures (Unrelated to This PR)

The failing tests are NOT related to semantic hashing:

  • test_full_pipeline_index_then_search: MCP schema validation error
  • E2E user journey tests: Config/CLI issues
  • Integration tests: Pre-existing flakiness

All 23 semantic hashing tests pass successfully.

💡 Minor Recommendations

1. Documentation Enhancement
Consider adding inline comment explaining BaseException usage:

except (KeyboardInterrupt, SystemExit):
    raise
except BaseException:  # Catches pyo3_runtime.PanicException which doesn't inherit from Exception
    logger.debug("AST parsing failed for language %s", language_name, exc_info=True)

2. Future Optimization (Optional)
If profiling shows AST parsing as a bottleneck, consider:

  • Caching parsed ASTs for large frequently-accessed files
  • Parallel AST parsing for batch operations

3. Monitoring (Optional)
Track AST parsing success/failure rates to identify:

  • Languages with parser issues
  • Edge cases requiring fallback

✅ Security & Safety

  • No security concerns: Read-only operations, no external inputs
  • Safe error handling: Properly re-raises KeyboardInterrupt and SystemExit
  • Debug logging: Includes exc_info=True for troubleshooting (discovery.py:102, 142)

📊 Code Style & Best Practices

  • Follows project conventions
  • Type hints properly used
  • Docstrings clear and comprehensive
  • Error messages informative
  • Export added to __all__ (discovery.py:425)

🎉 Recommendation

APPROVE with minor suggestion

This is a well-architected, thoroughly tested feature that solves a real problem. The implementation is production-ready. The test failures are unrelated and should be addressed separately.

Suggested action: Merge this PR once the team confirms the failing tests are pre-existing issues.

Excellent work! 🚀

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Adam Poulemanos <89049923+bashandbone@users.noreply.github.com>
@github-actions
Contributor

PR Review: AST-based Semantic File Hashing

I've completed a thorough review of this PR. Overall, this is excellent work with clean design and comprehensive testing. Here's my detailed assessment:

Strengths ✅

  1. Excellent Architecture

    • Clean separation of concerns with well-named helper functions
    • Proper fallback to raw content hashing for unsupported languages
    • Smart reuse of existing ast_grep_py dependency
  2. Comprehensive Test Coverage

    • 23 unit tests covering all major scenarios
    • Tests for comment/whitespace invariance, semantic changes, language detection, and fallback behavior
    • Multi-language coverage (Python, JavaScript)
  3. High Code Quality

    • Clear, descriptive docstrings
    • Robust error handling with proper exception catching
    • Good use of debug logging with exc_info=True
    • Maintains backward compatibility
  4. Smart Implementation Details

    • Correctly handles ext_category parameter throughout the call chain
    • Proper understanding that docstrings are semantic content (not comments)
    • Well-documented handling of pyo3_runtime.PanicException

Concerns ⚠️

  1. BLOCKING: Test Failures

    • Multiple test runs are failing (Python 3.12, 3.13, 3.14, Docker Compose)
    • These must be investigated and resolved before merge
    • Please verify if these failures are related to this PR or pre-existing
  2. Performance Considerations

    • AST parsing adds computational overhead vs raw hashing
    • For large codebases, this could impact discovery/indexing performance
    • Consider adding performance benchmarks or monitoring
  3. Minor Code Quality Suggestions

    In _walk_ast_nodes (discovery.py:66-67):

    if "comment" in kind.lower():
    • Substring matching for comment detection could be fragile
    • Could match non-comment nodes with "comment" in the name
    • Suggestion: Use more robust comment detection if ast-grep provides a better API

    In compute_semantic_file_hash (discovery.py:135):

    content_str = content_bytes.decode("utf-8", errors="replace")
    • Using errors="replace" could mask encoding issues
    • Suggestion: Consider logging a warning when replacement characters are inserted
  4. Documentation

    • Suggestion: Add inline comments explaining the AST traversal strategy in _walk_ast_nodes

Security & Performance

  • ✅ No security concerns identified
  • ✅ Proper exception handling prevents DoS via malformed files
  • ⚠️ Performance impact needs monitoring (though likely acceptable since it avoids expensive re-indexing)

Recommendations Before Merge

  1. CRITICAL: Investigate and fix all failing CI tests
  2. Important: Consider adding performance benchmarks
  3. Nice to have: Add more inline comments for complex logic
  4. Nice to have: More robust comment node detection
  5. Nice to have: Log warnings for encoding issues

Verdict

This PR demonstrates excellent software engineering practices. The implementation is clean, well-tested, and adds significant value. However, the failing CI tests are a blocking issue.

Once test failures are resolved, I would approve this PR as it:

  • ✅ Solves a real problem (unnecessary re-indexing)
  • ✅ Has excellent design and implementation
  • ✅ Includes comprehensive test coverage
  • ✅ Follows codebase patterns
  • ✅ Adds significant value with minimal risk

Great work overall! Please address the test failures and this will be ready to merge. 🚀

- file_hash property now uses self.absolute_path for existence checks
  and reads so relative paths resolve correctly regardless of CWD
- _compute_ast_hash catches Exception for standard errors, then
  catches BaseException only for pyo3_runtime.PanicException identified
  by module name, re-raising all other non-recoverable errors

Co-authored-by: bashandbone <89049923+bashandbone@users.noreply.github.com>
    except Exception:
        logger.debug("AST parsing failed for language %s", language_name, exc_info=True)
        return None
    except BaseException as exc:
Contributor

@github-actions github-actions bot left a comment


Pull Request Review: AST-based Semantic File Hashing

Summary

This PR implements AST-based semantic file hashing to avoid unnecessary re-indexing when only comments, whitespace, or formatting changes occur. The implementation is well-designed and thoroughly tested.

Code Quality: Excellent ✅

Strengths

1. Clean Architecture

  • Well-organized into small, focused functions (_walk_ast_nodes, _compute_ast_hash, _get_semantic_language)
  • Clear separation of concerns between AST parsing and hash computation
  • Proper use of existing infrastructure (ast-grep, blake3)

2. Robust Error Handling (src/codeweaver/core/discovery.py:78-110)

  • Graceful fallback to content-based hashing on parse failures
  • Clever handling of pyo3_runtime.PanicException that can't be imported directly
  • Proper exception hierarchy: catches Exception first, then BaseException selectively
  • Avoids swallowing critical exceptions like GeneratorExit
  • Good logging with exc_info=True for debugging

3. Semantic Correctness

  • Correctly excludes comments via "comment" in kind.lower() check
  • Builds canonical representation from AST node kinds and leaf text
  • Properly handles both named and unnamed AST nodes
  • Docstrings correctly treated as semantic (not comments) since they're expression nodes

4. Path Handling Fix (src/codeweaver/core/discovery.py:341-351)

  • The file_hash property now correctly uses self.absolute_path instead of self.path
  • This ensures relative paths resolve correctly regardless of current working directory
  • Important fix for reliability

5. Integration Points

  • Updated both DiscoveredFile initialization and indexing_service._process_discovery_batch
  • Consistent API with optional ext_category parameter
  • Properly exported in __all__

Testing: Comprehensive ✅

The test suite (tests/unit/core/test_semantic_hashing.py) is excellent with 23 tests covering:

  • ✅ Comment invariance (Python and JavaScript)
  • ✅ Formatting/whitespace invariance
  • ✅ Semantic change detection (operators, identifiers)
  • ✅ Language detection and fallback
  • ✅ Empty content and invalid syntax
  • ✅ Docstring changes (correctly different hash)
  • ✅ DiscoveredFile integration
  • ✅ ext_category precedence

Notable test quality:

  • Tests both low-level functions and integration
  • Clear test names and docstrings
  • Good coverage of edge cases

Performance Considerations

Positive:

  • Uses existing ast_grep_py dependency (no new deps)
  • Lazy import of SgRoot only when needed
  • Falls back quickly on parse failures

Potential concerns:

  • AST parsing is inherently slower than blake3 hashing raw bytes
  • For large codebases with frequent file scans, this could add latency
  • However, the benefit of avoiding unnecessary re-indexing likely outweighs the parsing cost

Recommendation: Monitor performance in production. If AST parsing becomes a bottleneck, consider:

  • Caching parsed ASTs (but memory cost may be high)
  • Parallel processing of file hashing
  • Size threshold above which to skip AST hashing
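The third option could be as small as a predicate gating the AST path. A minimal sketch with a made-up cutoff constant:

```python
from __future__ import annotations

MAX_AST_HASH_BYTES = 512 * 1024  # hypothetical cutoff, tune from profiling


def should_ast_hash(content: bytes, language: str | None) -> bool:
    """Gate the AST path: non-semantic or oversized files take the raw hash."""
    return language is not None and len(content) <= MAX_AST_HASH_BYTES
```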

Security: Good ✅

  • No credential handling or external network calls
  • Proper exception handling prevents crashes
  • Uses errors='replace' when decoding bytes to UTF-8
  • No code execution risks (ast-grep is a parser, not evaluator)

Documentation: Good ✅

  • Clear docstrings for all new functions
  • Updated field description for _file_hash in DiscoveredFile
  • Helpful inline comments explaining the pyo3_runtime.PanicException handling
  • PR description clearly explains the approach and changes

Minor Observations

  1. Docstring behavior (noted in PR description): Docstring changes produce different hashes. This is semantically correct since docstrings are AST nodes, not comments. However, for tools like linters that auto-format docstrings, this might cause some re-indexing. This seems acceptable.

  2. Language coverage: The implementation supports 25+ languages via ast-grep. The fallback ensures any unsupported files still work.

  3. Type safety: Good use of type hints throughout, including BlakeHashKey | None return types.

CI Status

All checks passing ✅

Recommendation

APPROVE

This is high-quality code that solves a real problem (unnecessary re-indexing) with a well-thought-out approach. The implementation is robust, well-tested, and properly integrated. The use of AST-based hashing for semantic files is the right architectural choice.

Suggested Follow-ups (optional, not blocking)

  1. Monitor performance impact in production environments with large codebases
  2. Consider adding metrics/logging for AST hash vs content hash usage rates
  3. Could add a configuration option to disable AST hashing if needed for performance

@bashandbone

@copilot hey team I think I disagree with the direction here and I'd like you to at least consider an alternative. I may also not fully understand the implementation, so feel free to correct me.

We already parse and cache ASTs as part of the engine pipeline. An AstThing object holds its AST and can query across them. Would it make more sense for each AstThing to store its own hash? You could even potentially isolate only changed nodes by comparing a group of hashes for a file.


Copilot AI commented Mar 21, 2026

@copilot hey team I think I disagree with the direction here and I'd like you to at least consider an alternative. I may also not fully understand the implementation, so feel free to correct me.

We ...

The semantic file hash serves a different purpose from AstThing's AST — it's a pre-indexing gate. The pipeline flow is:

  1. Discovery (_process_discovery_batch) → read bytes, compute hash, compare with manifest → decide if file needs re-indexing
  2. Indexing (SemanticChunker._parse_and_chunk) → parse AST → create AstThings/FileThings → chunk → embed → store

The hash exists to skip step 2 entirely for unchanged files. If we moved hashing to AstThing, we'd need to construct a full FileThing (with all its Pydantic model machinery, ThingRegistry lookups, classification, importance scoring, etc.) for every discovered file just to check if it changed — which is exactly the expensive work we're trying to avoid.
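Under hypothetical names (a dict manifest and a stand-in hash function in place of `compute_semantic_file_hash`), the gate in step 1 reduces to a hash-and-compare:

```python
import hashlib


def semantic_hash(content: bytes) -> str:
    # stand-in for compute_semantic_file_hash(content, path)
    return hashlib.blake2b(content).hexdigest()


def needs_reindex(path: str, content: bytes, manifest: dict[str, str]) -> bool:
    """Compare the new hash against the manifest; skip indexing on a match."""
    new_hash = semantic_hash(content)
    if manifest.get(path) == new_hash:
        return False  # unchanged: skip parse/chunk/embed entirely
    manifest[path] = new_hash
    return True
```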

The lightweight _compute_ast_hash in discovery.py does a bare tree-sitter parse (just SgRoot + walk nodes), collects kind/text tokens, and hashes them. It never creates AstThings, never touches the ThingRegistry, and gets discarded immediately after hashing. On the other hand, AstThing wraps SgNode with computed fields, cached properties, semantic classification, importance scoring, UUID generation, and Pydantic model overhead — all unnecessary for a simple "did this file change?" check.

That said, your idea about node-level hashing for incremental re-indexing (comparing groups of hashes to isolate which nodes changed) is a genuinely interesting future direction. That would live in the chunking layer where AstThings already exist, and could avoid re-embedding unchanged chunks within a changed file. But that's an optimization on top of this change, not a replacement for it — you'd still want the file-level semantic hash as a fast first-pass filter.
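As a rough illustration of that future direction (stdlib `ast` and blake2b as stand-ins, not the engine's AstThing machinery), per-node hashes can isolate which top-level definitions changed between two versions of a file:

```python
import ast
import hashlib


def node_hash(node: ast.AST) -> str:
    return hashlib.blake2b(ast.dump(node).encode()).hexdigest()


def changed_top_level(old_src: str, new_src: str) -> list[str]:
    """Return names of top-level statements whose AST hash changed."""
    old_hashes = {node_hash(n) for n in ast.parse(old_src).body}
    return [
        getattr(n, "name", "<stmt>")
        for n in ast.parse(new_src).body
        if node_hash(n) not in old_hashes
    ]


old = "def f():\n    return 1\n\ndef g():\n    return 2\n"
new = "def f():\n    return 1\n\ndef g():\n    return 3\n"
# Only g changed, so only g needs re-chunking and re-embedding
```

Only the changed chunks would then be re-embedded; unchanged nodes keep their existing vectors.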
