feat(chunker): AST-based semantic chunking for JS/TS/Python (issue #692) by jlin53882 · Pull Request #745 · CortexReach/memory-lancedb-pro

jlin53882 · 2026-05-04T16:31:14Z

Summary

Replace the purely character-based split with tree-sitter so that code is no longer cut in the middle of a function or class, which destroys its semantics.

Problem

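findSplitEnd() in chunker.ts cuts purely by character count and ignores syntax boundaries. A real breakage case:

With maxChunkSize = 4000, the handleUserLogin function is split into:

Chunk A (incomplete if-block):
"async function handleUserLogin(...) { ... if (!user) {
    return { success: false, error: 'USER_NOT_FOUND' };"  ← if-block never closed

Chunk B (a } stripped of its context):
"}  // ← orphan closing brace for the block opened in Chunk A
const passwordValid = await this.verifyPassword(...)"

As a result, the vector embedding represents an "incomplete fragment of a login function", and the model sees muddled semantics when it reads the chunk.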
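The failure mode can be reproduced in miniature. The sketch below is not the real findSplitEnd(); `splitByCharCount` and `braceBalance` are hypothetical helpers, and the chunk size of 60 is chosen only so that the cut lands inside the if-block.

```typescript
// Hypothetical stand-in for a character-count splitter; not the real findSplitEnd().
function splitByCharCount(text: string, maxChunkSize: number): string[] {
  const chunks: string[] = [];
  for (let i = 0; i < text.length; i += maxChunkSize) {
    chunks.push(text.slice(i, i + maxChunkSize));
  }
  return chunks;
}

// Net brace depth: positive means unclosed "{", negative means orphan "}".
function braceBalance(text: string): number {
  let depth = 0;
  for (let i = 0; i < text.length; i++) {
    if (text[i] === "{") depth++;
    else if (text[i] === "}") depth--;
  }
  return depth;
}

const source = [
  "async function handleUserLogin(user) {",
  "  if (!user) {",
  "    return { success: false, error: 'USER_NOT_FOUND' };",
  "  }",
  "  return { success: true };",
  "}",
].join("\n");

const parts = splitByCharCount(source, 60);
const balances = parts.map(braceBalance);
console.log(balances); // the first entry is positive: the chunk ends with open blocks
```

No text is lost, so the balances sum to zero across all chunks, but no individual chunk is syntactically complete.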

Solution

Phase 1: parse JS/TS/Python code blocks into an AST with tree-sitter and split at declaration boundaries (function/class/method), so that each chunk keeps its { } and def/class structure intact.
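A minimal sketch of declaration-boundary splitting, under stated assumptions: the nodes are hand-built rather than produced by tree-sitter, `TopLevelNode`, `DECLARATION_TYPES`, and `splitAtDeclarations` are hypothetical names, and the set of declaration node types in chunker.ts may differ. Non-declaration nodes are buffered and attached to the next declaration.

```typescript
// Minimal shape of a top-level syntax node. The field names mirror what a
// tree-sitter node exposes, but the nodes below are hand-built for illustration.
interface TopLevelNode {
  type: string;
  startIndex: number;
  endIndex: number;
}

// Assumed set of declaration node types; the real list in chunker.ts may differ.
const DECLARATION_TYPES = new Set(["function_declaration", "class_declaration"]);

function splitAtDeclarations(source: string, nodes: TopLevelNode[]): string[] {
  const chunks: string[] = [];
  let pendingStart: number | null = null; // start of buffered non-declaration text

  for (const node of nodes) {
    if (!DECLARATION_TYPES.has(node.type)) {
      if (pendingStart === null) pendingStart = node.startIndex;
      continue;
    }
    // Emit the declaration together with any comments/imports buffered before it.
    const start = pendingStart ?? node.startIndex;
    chunks.push(source.slice(start, node.endIndex));
    pendingStart = null;
  }
  // Non-declaration text at the end of the file becomes its own chunk.
  if (pendingStart !== null) chunks.push(source.slice(pendingStart));
  return chunks;
}

const source = "// auth helpers\nfunction a() { return 1; }\nclass B { m() { return 2; } }";
const fnStart = source.indexOf("function");
const classStart = source.indexOf("class");
const nodes: TopLevelNode[] = [
  { type: "comment", startIndex: 0, endIndex: fnStart - 1 },
  { type: "function_declaration", startIndex: fnStart, endIndex: classStart - 1 },
  { type: "class_declaration", startIndex: classStart, endIndex: source.length },
];
const chunks = splitAtDeclarations(source, nodes);
```

Every chunk begins and ends on a node boundary, so braces opened inside a declaration are closed in the same chunk.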

Changes

`src/chunker.ts`

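Added	Description
`detectCodeLanguage()`	Identifies JS/TS/Python/Go/Rust from the code content; inspects only the first 400 characters so that the end of the input cannot skew detection
`astChunk()`	Parses with tree-sitter, splits at declaration boundaries, and collects comments/imports to merge into the adjacent declaration
`subChunk()`	Statement-level sub-split for oversized declarations (full implementation in Phase 2; currently a fallback)
`smartChunk()` routing	When `astAwareCodeSplit === true` and the input is detected as code, goes through `astChunk()`; otherwise uses the existing logic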
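To illustrate content-based detection limited to the first 400 characters, here is a rough sketch. The regular expressions and the name `detectLanguageSketch` are assumptions made for this example; the patterns used by the real `detectCodeLanguage()` are not shown in this description.

```typescript
type CodeLanguage = "javascript" | "typescript" | "python" | "go" | "rust";

// Hypothetical heuristics for illustration only.
function detectLanguageSketch(code: string): CodeLanguage | null {
  const head = code.slice(0, 400); // inspect only the opening of the input
  if (/^package\s+\w+/m.test(head)) return "go";
  if (/\bfn\s+\w+\s*[(<]/.test(head)) return "rust";
  if (/^\s*(def|class)\s+\w+[^{]*:\s*$/m.test(head)) return "python";
  if (/\binterface\s+\w+\s*\{/.test(head) || /:\s*(string|number|boolean)\b/.test(head)) {
    return "typescript";
  }
  if (/\b(function|const|let|var|class)\b/.test(head) || head.includes("=>")) {
    return "javascript";
  }
  return null; // not recognised as code
}

const samples = {
  python: detectLanguageSketch("def f(x):\n    return x\n"),
  javascript: detectLanguageSketch("function a() { return 1; }"),
  typescript: detectLanguageSketch("interface User { name: string }"),
  // Code that only appears after the first 400 characters is never inspected.
  lateCode: detectLanguageSketch("# pad\n".repeat(100) + "def f():\n    pass\n"),
};
```

The `lateCode` sample shows the trade-off of the 400-character window: anything after it has no influence on the result, in either direction.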

`package.json`

tree-sitter + tree-sitter-javascript + tree-sitter-python: native addons; CI has verified that the Windows build works

`test/ast-code-chunking.test.mjs`

20 unit tests covering all breakage scenarios

Phase 1 tasks vs. implementation

Task	Description	Status
T1-1	`detectCodeLanguage()` JS/TS/Python/Go/Rust	✅
T1-2	`astChunk()` for JS/TS	✅
T1-3	`astChunk()` for Python	✅
T1-4	`smartChunk()` routing change	✅
T1-5	Config `astAwareCodeSplit: true`	✅
T1-6	Unit tests that reverse the breakage cases	✅ (20/20 pass)

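Additional issues found in adversarial review (fixed)

Issue	Fix
`astAwareCodeSplit !== false` was too permissive (`null`/`0` also took the AST path)	Changed to `=== true`
Non-declaration nodes such as comments/imports were silently skipped	Collected and prepended to the adjacent declaration
TypeScript interface parse error → ERROR node → fallback	Any ERROR node triggers a fallback to `chunkDocument`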
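The difference between the two flag checks can be seen directly. In this small sketch, `looseCheck` and `strictCheck` are hypothetical names for the old and new conditions.

```typescript
// Old condition: anything except the literal false enables the AST path.
const looseCheck = (v: unknown): boolean => v !== false;
// New condition: only the literal true enables the AST path.
const strictCheck = (v: unknown): boolean => v === true;

const inputs: unknown[] = [true, false, null, 0, undefined];
const loose = inputs.map(looseCheck);   // [true, false, true, true, true]
const strict = inputs.map(strictCheck); // [true, false, false, false, false]
```

Note that under the strict check an omitted (`undefined`) value does not enable the AST path on its own, so a default of `true` takes effect only if it is filled in before the check runs; whether chunker.ts merges defaults that way is not shown here.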

Phase 2 plan (out of scope for this PR)

Task	Description
T2-1	Oversized declaration sub-split at statement level
T2-2	Go and Rust support
T2-3	Embedding quality benchmark vs. character split

Testing

node --test test/ast-code-chunking.test.mjs
# tests 20, pass 20, fail 0

Breakage case verification

Test	Description
`keeps a function intact (verifyPassword case)`	verifyPassword / handleUserLogin stays complete within a single chunk
`keeps nested braces balanced`	Nested if/else { } blocks are not cut midway
`handles oversized single declaration as one atomic chunk`	An oversized function is not forcibly cut

Config

interface ChunkerConfig {
  // ...
  /** Use AST-aware splitting for code blocks (default: true). */
  astAwareCodeSplit?: boolean;  // defaults to true in Phase 1
}

Breaking Changes

None. The smartChunk() signature is unchanged and astAwareCodeSplit is an optional parameter, so existing call sites are unaffected.

…issue CortexReach#692)

Adds tree-sitter-based chunking to prevent splitting code mid-function.

Changes:
- detectCodeLanguage(): identify JS/TS/Python/Go/Rust from code content
- astChunk(): split code at declaration boundaries (function/class/method)
- smartChunk(): route through astChunk when file is detected as code
- 20 unit tests covering all destructive split scenarios

Test: node --test test/ast-code-chunking.test.mjs (20/20 pass)
Config: astAwareCodeSplit defaults to true

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c9bda50358


chatgpt-codex-connector · 2026-05-04T16:34:33Z

+        startIndex: child.startIndex,
+        endIndex: child.endIndex,


Rebase metadata after prepending non-declaration text

When pendingNonDecl is prepended to a declaration, the emitted fullText chunk no longer starts at child.startIndex, but metadata still reports startIndex/endIndex from the declaration node only. Any consumer that uses metadatas to map chunk text back to original source will get wrong spans for chunks that include leading imports/comments, which can corrupt offset-based downstream logic.
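One way the span could be rebased, as a sketch only: it assumes the emitted chunk text is a contiguous slice of the source running from the first buffered non-declaration node to the end of the declaration. `Span` and `spanForChunk` are hypothetical names, not part of the PR.

```typescript
interface Span {
  startIndex: number;
  endIndex: number;
}

// pendingStart: offset where buffered imports/comments begin, or null if none.
function spanForChunk(pendingStart: number | null, decl: Span): Span {
  return { startIndex: pendingStart ?? decl.startIndex, endIndex: decl.endIndex };
}

const source = "import x from 'x';\nfunction f() {}";
const decl: Span = { startIndex: source.indexOf("function"), endIndex: source.length };

const span = spanForChunk(0, decl); // the import at offset 0 was buffered
const text = source.slice(span.startIndex, span.endIndex);
```

With the rebased span, slicing the source by the metadata returns the chunk text including the leading import. If the implementation joins the buffered text and the declaration with a separator that is not in the source, a single contiguous span no longer describes the chunk exactly.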


chatgpt-codex-connector · 2026-05-04T16:34:33Z

+    const trailing = chunkDocument(pendingNonDecl, config);
+    for (let i = 0; i < trailing.chunks.length; i++) {
+      chunks.push(trailing.chunks[i]);
+      metadatas.push(trailing.metadatas[i]);


Rebase trailing metadata to original document offsets

The trailing non-declaration path chunks pendingNonDecl as a standalone string and then appends trailing.metadatas directly, but those indices are relative to pendingNonDecl (starting near 0), not the original code buffer. This produces incorrect offsets for trailing chunks (e.g., trailing comments), breaking metadata correctness whenever there is non-declaration content at the end of a file.
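A sketch of the offset shift this comment asks for, assuming the metadata carries `startIndex`/`endIndex` fields as in the diff above and that the trailing text is a contiguous slice of the original buffer. `ChunkMeta` and `rebase` are hypothetical names.

```typescript
interface ChunkMeta {
  startIndex: number;
  endIndex: number;
}

// Shift offsets that are relative to a substring back into document coordinates.
function rebase(metas: ChunkMeta[], baseOffset: number): ChunkMeta[] {
  return metas.map((m) => ({
    startIndex: m.startIndex + baseOffset,
    endIndex: m.endIndex + baseOffset,
  }));
}

const source = "function f() {}\n// trailing note";
const trailingStart = source.indexOf("// trailing");
const trailingText = source.slice(trailingStart);

// Pretend a sub-chunker returned one chunk covering the whole trailing text,
// with offsets relative to trailingText (starting at 0).
const local: ChunkMeta[] = [{ startIndex: 0, endIndex: trailingText.length }];
const rebased = rebase(local, trailingStart);
```

After rebasing, `source.slice(rebased[0].startIndex, rebased[0].endIndex)` yields the trailing comment rather than the first characters of the file.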



jlin53882 mentioned this pull request May 4, 2026

feat: AST-based semantic chunking for code blocks #692

Open

jlin53882 force-pushed the james/issue-692-ast-chunking branch from c9bda50 to 0be5d00 on May 4, 2026 16:34

chatgpt-codex-connector Bot reviewed May 4, 2026


jlin53882 and others added 6 commits May 5, 2026 20:16

fix(ci): register ast-code-chunking.test.mjs in verify baseline (issue CortexReach#692)

6b40ba7

fix(ci): add issue606 to verify baseline to match manifest (PR CortexReach#713)

96f7eee

ci: retrigger

b1b28bf

Merge b1b28bf into 09965e3

695288c

fix(ci): remove ast-code-chunking from baseline, add issue606 to manifest (PR CortexReach#713)

ea81818

fix(ci): add ast-code-chunking + issue606 to manifest, rebuild verify baseline (merge sync)

3c603d6

jlin53882 force-pushed the james/issue-692-ast-chunking branch from b927928 to 3c603d6 on May 5, 2026 13:04

jlin53882 wants to merge 7 commits into CortexReach:master from jlin53882:james/issue-692-ast-chunking
