feat(chunker): AST-based semantic chunking for JS/TS/Python (issue #692)#745
jlin53882 wants to merge 7 commits into CortexReach:master
…issue CortexReach#692)
Adds tree-sitter-based chunking to prevent splitting code mid-function.
Changes:
- detectCodeLanguage(): identify JS/TS/Python/Go/Rust from code content
- astChunk(): split code at declaration boundaries (function/class/method)
- smartChunk(): route through astChunk when file is detected as code
- 20 unit tests covering all destructive split scenarios
Test: node --test test/ast-code-chunking.test.mjs (20/20 pass)
Config: astAwareCodeSplit defaults to true
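To illustrate what content-based language detection could look like, here is a minimal sketch. It is not the PR's detectCodeLanguage(): the function name `detectCodeLanguageSketch`, the `CodeLanguage` type, and the regular expressions are all assumptions chosen for illustration, and a heuristic like this can misclassify short or unusual snippets.

```typescript
// Hypothetical sketch: the PR's real detectCodeLanguage() may use different signals.
type CodeLanguage = "javascript" | "typescript" | "python" | "go" | "rust";

// Ordered (language, pattern) pairs; the first pattern that matches wins.
// More specific languages come first so that TS is not reported as JS.
const LANGUAGE_HINTS: Array<[CodeLanguage, RegExp]> = [
  ["go", /^package\s+\w+\s*$/m],
  ["rust", /^\s*(pub\s+)?fn\s+\w+\s*(<[^>]*>)?\s*\(/m],
  ["python", /^\s*(def|class)\s+\w+[^\n]*:\s*$/m],
  ["typescript", /^\s*(export\s+)?(interface|type)\s+\w+|:\s*(string|number|boolean)\b/m],
  ["javascript", /^\s*(export\s+)?(async\s+)?function\s+\w+\s*\(|^\s*(const|let|var)\s+\w+\s*=/m],
];

// Returns the first language whose hint matches, or null for non-code text.
function detectCodeLanguageSketch(text: string): CodeLanguage | null {
  for (const [language, pattern] of LANGUAGE_HINTS) {
    if (pattern.test(text)) return language;
  }
  return null;
}
```

Returning null for unrecognized text lets a caller fall back to the existing character-based path instead of guessing a grammar.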
c9bda50 to 0be5d00
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c9bda50358
    startIndex: child.startIndex,
    endIndex: child.endIndex,
Rebase metadata after prepending non-declaration text
When pendingNonDecl is prepended to a declaration, the emitted fullText chunk no longer starts at child.startIndex, but metadata still reports startIndex/endIndex from the declaration node only. Any consumer that uses metadatas to map chunk text back to original source will get wrong spans for chunks that include leading imports/comments, which can corrupt offset-based downstream logic.
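One way to keep the reported span aligned with the emitted text is sketched below. This is an assumption-laden illustration rather than the PR's astChunk(): `TopLevelNode`, `ChunkMeta`, `chunkAtDeclarations`, and the `DECLARATION_TYPES` list are hypothetical, and the nodes are passed in as plain data instead of coming from a parser. The point is only that the metadata start is taken from the first buffered non-declaration node, not from the declaration.

```typescript
// Minimal node shape. `type`, `startIndex` and `endIndex` mirror the fields the
// review snippet reads from `child`; everything else here is hypothetical.
interface TopLevelNode { type: string; startIndex: number; endIndex: number; }
interface ChunkMeta { startIndex: number; endIndex: number; }

// Illustrative subset of declaration node types.
const DECLARATION_TYPES = new Set(["function_declaration", "class_declaration"]);

function chunkAtDeclarations(source: string, nodes: TopLevelNode[]) {
  const chunks: string[] = [];
  const metadatas: ChunkMeta[] = [];
  // Offset where buffered non-declaration text (imports, comments) begins.
  let pendingStart: number | null = null;

  for (const node of nodes) {
    if (!DECLARATION_TYPES.has(node.type)) {
      if (pendingStart === null) pendingStart = node.startIndex;
      continue;
    }
    // The chunk text starts at the buffered prefix, so the span must too.
    const start = pendingStart ?? node.startIndex;
    chunks.push(source.slice(start, node.endIndex));
    metadatas.push({ startIndex: start, endIndex: node.endIndex });
    pendingStart = null;
  }
  // A non-null pendingStart means trailing non-declaration text is left over.
  return { chunks, metadatas, pendingStart };
}
```

With this invariant, `source.slice(meta.startIndex, meta.endIndex)` reproduces each chunk exactly, which is the property offset-based consumers rely on.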
    const trailing = chunkDocument(pendingNonDecl, config);
    for (let i = 0; i < trailing.chunks.length; i++) {
      chunks.push(trailing.chunks[i]);
      metadatas.push(trailing.metadatas[i]);
Rebase trailing metadata to original document offsets
The trailing non-declaration path chunks pendingNonDecl as a standalone string and then appends trailing.metadatas directly, but those indices are relative to pendingNonDecl (starting near 0), not the original code buffer. This produces incorrect offsets for trailing chunks (e.g., trailing comments), breaking metadata correctness whenever there is non-declaration content at the end of a file.
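A possible fix is to shift the substring-relative spans by the offset at which the trailing text begins in the original buffer. The helper below is a hedged sketch: `rebaseMetadatas`, `ChunkMeta`, and the `baseOffset` parameter are hypothetical names, and it assumes each metadata entry carries only `startIndex` and `endIndex` offsets that need shifting.

```typescript
interface ChunkMeta { startIndex: number; endIndex: number; }

// Shift spans that are relative to a substring so they index into the full buffer.
// `baseOffset` is the position of the substring's first character in that buffer.
function rebaseMetadatas(metadatas: ChunkMeta[], baseOffset: number): ChunkMeta[] {
  return metadatas.map((m) => ({
    ...m,
    startIndex: m.startIndex + baseOffset,
    endIndex: m.endIndex + baseOffset,
  }));
}
```

In the snippet under review, this would mean recording where pendingNonDecl starts in the original code and passing that offset before pushing trailing.metadatas.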
… baseline (merge sync)
b927928 to 3c603d6
Summary
Replaces purely character-based splitting with tree-sitter so that code is no longer cut in the middle of a function or class, which destroys its meaning.
Problem
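findSplitEnd() in chunker.ts splits purely by character count and ignores syntax boundaries. A real breakage case: the vector embedding ends up representing "an incomplete fragment of a login function", which confuses the model when it sees the chunk.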
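The failure mode can be reproduced with a small sketch. This is not findSplitEnd(): `splitByCharCount` is a deliberately naive fixed-width splitter standing in for any count-only strategy, `bracesBalanced` is a hypothetical check, and the body of `verifyPassword` is invented for illustration.

```typescript
// Naive fixed-width splitter, a stand-in for splitting purely by character count.
function splitByCharCount(text: string, maxChars: number): string[] {
  if (maxChars <= 0) throw new RangeError("maxChars must be positive");
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChars) {
    chunks.push(text.slice(i, i + maxChars));
  }
  return chunks;
}

// True when every "{" has a matching "}" (braces in strings/comments are not handled).
function bracesBalanced(chunk: string): boolean {
  let depth = 0;
  for (const ch of chunk) {
    if (ch === "{") depth++;
    if (ch === "}") depth--;
    if (depth < 0) return false;
  }
  return depth === 0;
}

// Invented example function; only the name echoes the PR's test case.
const loginSource = [
  "function verifyPassword(input, stored) {",
  "  if (!input) { return false; }",
  "  return hash(input) === stored;",
  "}",
].join("\n");

const pieces = splitByCharCount(loginSource, 60);
const broken = pieces.filter((p) => !bracesBalanced(p));
console.log(`${broken.length} of ${pieces.length} chunks have unbalanced braces`);
```

The whole function is balanced, but the first 60-character piece opens two blocks and closes neither, so an embedding of that piece describes only a fragment.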
Solution
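Phase 1: parse JS/TS/Python code blocks into an AST with tree-sitter and split at declaration boundaries (function/class/method), so that the { } and def/class structure of each chunk stays intact.
Changes
- src/chunker.ts: detectCodeLanguage(), astChunk(), subChunk()
- smartChunk() routing: when astAwareCodeSplit === true and the content is detected as code, use astChunk(); otherwise use the existing logic
- package.json: tree-sitter + tree-sitter-javascript + tree-sitter-python (native addons; CI has verified that the Windows build works)
- test/ast-code-chunking.test.mjs
Phase 1 tasks vs. implementation
- detectCodeLanguage(): JS/TS/Python/Go/Rust
- astChunk() for JS/TS
- astChunk() for Python
- smartChunk() routing change
- astAwareCodeSplit: true
Additional issues found in adversarial review (fixed)
- astAwareCodeSplit !== false was too permissive (null/0 also took the AST path); changed to === true
- chunkDocument
Phase 2 plan (out of scope for this PR)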
Testing
node --test test/ast-code-chunking.test.mjs # tests 20, pass 20, fail 0
Destructive case verification
- keeps a function intact (verifyPassword case)
- keeps nested braces balanced
- handles oversized single declaration as one atomic chunk
Config
Breaking Changes
None.
The smartChunk() signature is unchanged; astAwareCodeSplit is an optional parameter and does not affect existing call sites.
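The difference between a strict and a loose flag check can be shown with a short sketch. `chooseChunkPath` and `SmartChunkOptions` are hypothetical names, not the PR's code, and the flag is typed as `unknown` only so that stray values such as `0` can be passed. Where the config default of `true` is applied is not shown in this description, so the sketch models only the comparison itself.

```typescript
// Hypothetical options shape; the real smartChunk() parameters may differ.
interface SmartChunkOptions { astAwareCodeSplit?: unknown; }

// Returns which path a call would take: "ast" only on an explicit `true`
// for content detected as code, otherwise the existing ("legacy") logic.
function chooseChunkPath(isCode: boolean, options: SmartChunkOptions = {}): "ast" | "legacy" {
  return options.astAwareCodeSplit === true && isCode ? "ast" : "legacy";
}

// The looser check that the adversarial review flagged: null and 0 also pass.
function chooseChunkPathLoose(isCode: boolean, options: SmartChunkOptions = {}): "ast" | "legacy" {
  return options.astAwareCodeSplit !== false && isCode ? "ast" : "legacy";
}
```

With the strict form, a call that passes `null`, `0`, or nothing at all keeps the existing behavior, while the loose form would route all three through the AST path.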