feat(chunker): AST-based semantic chunking for JS/TS/Python (issue #692) #745

Open
jlin53882 wants to merge 7 commits into CortexReach:master from jlin53882:james/issue-692-ast-chunking

@jlin53882
Contributor

Summary

Replaces pure character-based splitting with tree-sitter so that code is no longer cut in the middle of a function or class, which breaks its semantics.


Problem

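findSplitEnd() in chunker.ts splits purely by character count and ignores syntax boundaries. A real breakage case:

With maxChunkSize = 4000, the handleUserLogin function is split into:

Chunk A (incomplete if-block):
"async function handleUserLogin(...) { ... if (!user) {
    return { success: false, error: 'USER_NOT_FOUND' };"  ← if-block not closed

Chunk B (a } stripped of its context):
"}  // ← orphan closing brace from Chunk A
const passwordValid = await this.verifyPassword(...)"

As a result, the vector embedding represents an "incomplete fragment of a login function", and the model sees confused semantics when it reads the chunk.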
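
The failure mode above can be reproduced in miniature. The sketch below is not the real findSplitEnd(); `splitByChars` and `braceBalance` are hypothetical helpers, and a tiny chunk size stands in for maxChunkSize = 4000. It only illustrates how a fixed-width cut leaves one chunk with unclosed braces and the next with orphan closers.

```typescript
// Naive fixed-width splitter: a stand-in for character-count splitting,
// NOT the actual findSplitEnd() implementation from chunker.ts.
function splitByChars(text: string, maxChunkSize: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChunkSize) {
    chunks.push(text.slice(i, i + maxChunkSize));
  }
  return chunks;
}

// Net brace depth of a chunk: 0 means balanced, a positive value means
// unclosed "{", a negative value means orphan "}".
function braceBalance(chunk: string): number {
  let depth = 0;
  for (let i = 0; i < chunk.length; i++) {
    if (chunk[i] === "{") depth++;
    else if (chunk[i] === "}") depth--;
  }
  return depth;
}

// Simplified version of the login function from the example above.
const source = [
  "async function handleUserLogin(user) {",
  "  if (!user) {",
  "    return { success: false };",
  "  }",
  "  return { success: true };",
  "}",
].join("\n");

const pieces = splitByChars(source, 60);
const balances = pieces.map(braceBalance);
console.log(balances); // the first chunk is left with open braces, the second with orphan closers
```

A brace-balance check like this is only a rough proxy for "semantically complete"; it says nothing about Python, where structure is carried by indentation.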


Solution

Phase 1: For JS/TS/Python code blocks, parse the AST with tree-sitter and split at declaration boundaries (function/class/method), so that each chunk keeps its { } and def/class structure intact.


Changes

src/chunker.ts

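| Addition | Description |
| --- | --- |
| detectCodeLanguage() | Identifies JS/TS/Python/Go/Rust from the code content; inspects only the first 400 characters so that content near the end does not skew detection |
| astChunk() | Parses with tree-sitter, splits at declaration boundaries, and collects comments/imports to merge into the adjacent declaration |
| subChunk() | Statement-level sub-split for oversized declarations (full implementation in Phase 2; currently a fallback) |
| smartChunk() routing | When astAwareCodeSplit === true and the content is detected as code, uses astChunk(); otherwise uses the existing logic |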
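
The grouping step described for astChunk() can be sketched without tree-sitter itself. The sketch below assumes the parser has already produced a flat list of top-level nodes; `TopLevelNode`, `nodeFor`, and `groupByDeclaration` are hypothetical names, the nodes are built by hand, and the PR's actual implementation may differ.

```typescript
// Hypothetical shape of a top-level AST node (type name plus source offsets).
interface TopLevelNode {
  type: string;
  startIndex: number;
  endIndex: number;
}

// Node types treated as declarations in this sketch (an assumption).
const DECLARATION_TYPES: string[] = ["function_declaration", "class_declaration"];

// Buffer non-declaration nodes (imports, comments) and prepend them to the
// next declaration; emit leftover trailing text as its own chunk.
function groupByDeclaration(code: string, nodes: TopLevelNode[]): string[] {
  const chunks: string[] = [];
  let pendingStart = -1; // start offset of buffered text, -1 when the buffer is empty
  let lastEnd = 0;
  for (const node of nodes) {
    if (DECLARATION_TYPES.indexOf(node.type) !== -1) {
      const start = pendingStart >= 0 ? pendingStart : node.startIndex;
      chunks.push(code.slice(start, node.endIndex));
      pendingStart = -1;
    } else if (pendingStart < 0) {
      pendingStart = node.startIndex;
    }
    lastEnd = node.endIndex;
  }
  if (pendingStart >= 0) chunks.push(code.slice(pendingStart, lastEnd));
  return chunks;
}

// Test helper: locate a snippet in the source and build a node for it.
function nodeFor(code: string, type: string, text: string): TopLevelNode {
  const startIndex = code.indexOf(text);
  return { type, startIndex, endIndex: startIndex + text.length };
}

const sample =
  'import fs from "fs";\n// reads a file\nfunction read() { return 1; }\nclass A {}\n// trailing note';
const sampleNodes: TopLevelNode[] = [
  nodeFor(sample, "import_statement", 'import fs from "fs";'),
  nodeFor(sample, "comment", "// reads a file"),
  nodeFor(sample, "function_declaration", "function read() { return 1; }"),
  nodeFor(sample, "class_declaration", "class A {}"),
  nodeFor(sample, "comment", "// trailing note"),
];
const grouped = groupByDeclaration(sample, sampleNodes);
```

Slicing from the start of the buffered text to the end of the declaration keeps each chunk a contiguous span of the original source, which also makes offset metadata straightforward to report.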

package.json

  • tree-sitter + tree-sitter-javascript + tree-sitter-python: native addons; CI has verified that the Windows build works

test/ast-code-chunking.test.mjs

  • 20 unit tests covering all destructive-split scenarios

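Phase 1 tasks vs. implementation

| Task | Content | Status |
| --- | --- | --- |
| T1-1 | detectCodeLanguage() JS/TS/Python/Go/Rust | |
| T1-2 | astChunk() for JS/TS | |
| T1-3 | astChunk() for Python | |
| T1-4 | smartChunk() routing change | |
| T1-5 | Config astAwareCodeSplit: true | |
| T1-6 | Unit tests reversing the destructive cases | ✅ (20/20 pass) |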

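Additional issues found in adversarial review (fixed)

| Issue | Fix |
| --- | --- |
| astAwareCodeSplit !== false was too permissive (null/0 also took the AST path) | Changed to === true |
| Non-declaration nodes such as comments/imports were silently skipped | Collected and prepended to the adjacent declaration |
| TypeScript interface parse error → ERROR node → fallback | Fall back to chunkDocument whenever ERROR nodes are present |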
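
The first fix in the table comes down to how the two comparisons treat non-boolean values. A minimal sketch, with `looseGate` and `strictGate` as hypothetical names for the old and new checks:

```typescript
// Old check: anything other than a literal false takes the AST path.
function looseGate(flag: unknown): boolean {
  return flag !== false;
}

// New check: only a literal true takes the AST path.
function strictGate(flag: unknown): boolean {
  return flag === true;
}

const candidates: unknown[] = [true, false, null, 0, undefined];
const looseResults = candidates.map((flag) => looseGate(flag));
const strictResults = candidates.map((flag) => strictGate(flag));

console.log(looseResults);  // [ true, false, true, true, true ]
console.log(strictResults); // [ true, false, false, false, false ]
```

One consequence: under the strict check an unset (undefined) flag does not take the AST path, so the documented default of true presumably has to be merged into the config before this comparison runs. That is an assumption about the surrounding code, not something the PR text confirms.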

Phase 2 plan (out of scope for this PR)

| Task | Content |
| --- | --- |
| T2-1 | Oversized declaration sub-split at statement level |
| T2-2 | Go and Rust support |
| T2-3 | Embedding quality benchmark vs. character split |

Testing

node --test test/ast-code-chunking.test.mjs
# tests 20, pass 20, fail 0

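Destructive-case verification

| Test | Description |
| --- | --- |
| keeps a function intact (verifyPassword case) | verifyPassword / handleUserLogin land whole within a single chunk |
| keeps nested braces balanced | Nested if/else { } blocks are not cut midway |
| handles oversized single declaration as one atomic chunk | An oversized function is not forcibly cut |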

Config

interface ChunkerConfig {
  // ...
  /** Use AST-aware splitting for code blocks (default: true). */
  astAwareCodeSplit?: boolean;  // defaults to true in Phase 1
}

Breaking Changes

None. The smartChunk() signature is unchanged, and astAwareCodeSplit is an optional parameter, so existing call sites are unaffected.

…issue CortexReach#692)

Adds tree-sitter-based chunking to prevent splitting code mid-function.

Changes:
- detectCodeLanguage(): identify JS/TS/Python/Go/Rust from code content
- astChunk(): split code at declaration boundaries (function/class/method)
- smartChunk(): route through astChunk when file is detected as code
- 20 unit tests covering all destructive split scenarios

Test: node --test test/ast-code-chunking.test.mjs (20/20 pass)
Config: astAwareCodeSplit defaults to true
jlin53882 force-pushed the james/issue-692-ast-chunking branch from c9bda50 to 0be5d00 on May 4, 2026 16:34

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9bda50358


Comment thread src/chunker.ts
Comment on lines +383 to +384
startIndex: child.startIndex,
endIndex: child.endIndex,

P1: Rebase metadata after prepending non-declaration text

When pendingNonDecl is prepended to a declaration, the emitted fullText chunk no longer starts at child.startIndex, but metadata still reports startIndex/endIndex from the declaration node only. Any consumer that uses metadatas to map chunk text back to original source will get wrong spans for chunks that include leading imports/comments, which can corrupt offset-based downstream logic.
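
One way to address this, sketched under the assumption that the buffered non-declaration text is a contiguous slice of the source that ends where the declaration begins: record the offset where the buffer started and report that as the chunk's startIndex. `ChunkSpan` and `emitChunk` are hypothetical names, not the PR's code.

```typescript
// Hypothetical chunk record: text plus its span in the original source.
interface ChunkSpan {
  text: string;
  startIndex: number;
  endIndex: number;
}

// Emit a declaration chunk. If leading imports/comments were buffered,
// the span starts at the buffer's first offset rather than at the declaration.
function emitChunk(
  code: string,
  pendingStart: number | null,
  declStart: number,
  declEnd: number
): ChunkSpan {
  const startIndex = pendingStart !== null ? pendingStart : declStart;
  return { text: code.slice(startIndex, declEnd), startIndex, endIndex: declEnd };
}

const src = 'import fs from "fs";\nfunction load() { return fs; }';
const declStartIdx = src.indexOf("function");

// Rebased span: offsets cover the prepended import as well.
const fixedSpan = emitChunk(src, 0, declStartIdx, src.length);

// Span as described in the review comment: text includes the import,
// but startIndex still points at the declaration node.
const staleSpan: ChunkSpan = {
  text: fixedSpan.text,
  startIndex: declStartIdx,
  endIndex: src.length,
};
```

The invariant worth testing is that `code.slice(startIndex, endIndex)` reproduces the chunk text exactly; the stale span fails it whenever anything was prepended.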


Comment thread src/chunker.ts
const trailing = chunkDocument(pendingNonDecl, config);
for (let i = 0; i < trailing.chunks.length; i++) {
chunks.push(trailing.chunks[i]);
metadatas.push(trailing.metadatas[i]);

P1: Rebase trailing metadata to original document offsets

The trailing non-declaration path chunks pendingNonDecl as a standalone string and then appends trailing.metadatas directly, but those indices are relative to pendingNonDecl (starting near 0), not the original code buffer. This produces incorrect offsets for trailing chunks (e.g., trailing comments), breaking metadata correctness whenever there is non-declaration content at the end of a file.
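
A possible fix is to shift the sub-chunker's local offsets by the position where the trailing text begins in the original buffer. In the sketch below, `chunkLocal` is a fixed-width stand-in for chunkDocument() (whose real signature and metadata shape are not shown in this thread), and `SpanMeta` and `rebaseSpans` are hypothetical names. It assumes the trailing text is a contiguous slice of the original code.

```typescript
// Hypothetical metadata shape: a span expressed as source offsets.
interface SpanMeta {
  startIndex: number;
  endIndex: number;
}

// Stand-in for chunkDocument(): fixed-width pieces with spans that are
// relative to the string it was given (starting at 0).
function chunkLocal(
  text: string,
  size: number
): { chunks: string[]; metadatas: SpanMeta[] } {
  const chunks: string[] = [];
  const metadatas: SpanMeta[] = [];
  for (let i = 0; i < text.length; i += size) {
    const end = Math.min(i + size, text.length);
    chunks.push(text.slice(i, end));
    metadatas.push({ startIndex: i, endIndex: end });
  }
  return { chunks, metadatas };
}

// Shift local spans so they index into the original document.
function rebaseSpans(metadatas: SpanMeta[], baseOffset: number): SpanMeta[] {
  return metadatas.map((m) => ({
    startIndex: m.startIndex + baseOffset,
    endIndex: m.endIndex + baseOffset,
  }));
}

const fullCode = "function f() {}\n// trailing note one\n// trailing note two";
const trailingStart = fullCode.indexOf("// trailing");
const trailingText = fullCode.slice(trailingStart);

const local = chunkLocal(trailingText, 20);
const rebased = rebaseSpans(local.metadatas, trailingStart);
```

If the trailing text is assembled from non-adjacent pieces rather than sliced from the source, a single base offset is not enough and each piece needs its own mapping.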


jlin53882 force-pushed the james/issue-692-ast-chunking branch from b927928 to 3c603d6 on May 5, 2026 13:04