
feat: AST-based semantic chunking for code blocks #692

@jlin53882

Problem Statement

Current state: character splitting breaks code semantics

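When a user pastes a long block of code into a conversation, that code eventually enters the extraction pipeline and may be processed by smartChunk() in chunker.ts. chunker.ts currently does a purely character-based split; the split logic lives in findSplitEnd():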

// src/chunker.ts — findSplitEnd()
if (config.semanticSplit) {
  // Prefer a sentence boundary near the end.
  for (let i = safeMaxEnd - 1; i >= safeMinEnd; i--) {
    if (SENTENCE_ENDING.test(text[i])) { ... return j; }
  }
  // Next best: newline boundary.
  for (let i = safeMaxEnd - 1; i >= safeMinEnd; i--) {
    if (text[i] === "\n") return i + 1;
  }
}

This logic works for natural language but is disastrous for code.
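
To make the failure concrete, here is a minimal sketch of the newline fallback in isolation. Both `splitAtLastNewline` and `braceDepth` are hypothetical helpers written for this illustration; they are not the real findSplitEnd() code, and `braceDepth` ignores braces inside strings and comments.

```typescript
// Hypothetical, simplified stand-in for the newline fallback in findSplitEnd().
// Returns the index just past the last "\n" inside [minEnd, maxEnd), or the cap.
function splitAtLastNewline(text: string, minEnd: number, maxEnd: number): number {
  const hi = Math.min(maxEnd, text.length);
  for (let i = hi - 1; i >= minEnd; i--) {
    if (text[i] === "\n") return i + 1;
  }
  return hi;
}

// Net count of "{" minus "}" in a fragment. Non-zero means the fragment
// opens blocks it never closes (positive) or closes blocks it never opened (negative).
function braceDepth(fragment: string): number {
  let depth = 0;
  for (const ch of fragment) {
    if (ch === "{") depth++;
    else if (ch === "}") depth--;
  }
  return depth;
}

const sample = [
  "function f(x: number) {",
  "  if (x > 0) {",
  "    return 1;",
  "  }",
  "  return 0;",
  "}",
].join("\n");

// A size limit that happens to fall inside the if-block.
const cut = splitAtLastNewline(sample, 10, 55);
const head = sample.slice(0, cut);
const tail = sample.slice(cut);

console.log(braceDepth(head), braceDepth(tail)); // 2 -2
```

Both halves are unbalanced: the head leaves two blocks open and the tail closes two blocks it never opened, which is the same shape of damage as in the example that follows.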

A real breakage example

Suppose the following TypeScript has been extracted and needs to be chunked:

async function handleUserLogin(userId: string, credentials: LoginCredentials): Promise<AuthResult> {
    const user = await this.userRepository.findById(userId);
    if (!user) {
        return { success: false, error: 'USER_NOT_FOUND' };
    }
    
    const passwordValid = await this.verifyPassword(credentials.password, user.passwordHash);
    if (!passwordValid) {
        return { success: false, error: 'INVALID_PASSWORD' };
    }
    
    if (user.mfaEnabled) {
        const mfaToken = await this.generateMFAToken(user.id);
        return { success: false, error: 'MFA_REQUIRED', mfaToken };
    }
    
    const session = await this.createSession(user);
    return { success: true, session };
}

async function verifyPassword(inputPassword: string, storedHash: string): Promise<boolean> {
    const bcrypt = await import('bcrypt');
    return bcrypt.compare(inputPassword, storedHash);
}

With maxChunkSize = 4000 characters, the split looks like this (chunk contents abbreviated, sizes approximate):

Chunk A (~3800 characters):
"async function handleUserLogin(userId: string, credentials: LoginCredentials): Promise<AuthResult> {\n"
"    const user = await this.userRepository.findById(userId);\n"
"    if (!user) {\n"
"        return { success: false, error: 'USER_NOT_FOUND' };"

Chunk B (~900 characters):
"    }\n"
"    \n"
"    const passwordValid = await this.verifyPassword(credentials.password, user.passwordHash);\n"
"    if (!passwordValid) {\n"
"        return { success: false, error: 'INVALID_PASSWORD' };\n"
"    }\n"
"    if (user.mfaEnabled) {\n"
"        const mfaToken = await this.generateMFAToken(user.id);\n"
"        return { success: false, error: 'MFA_REQUIRED', mfaToken };\n"
"    }\n"
"    \n"
"    const session = await this.createSession(user);\n"
"    return { success: true, session };\n"
"}"

Problems:

  • Chunk A ends at return { success: false, error: 'USER_NOT_FOUND' }; which leaves an incomplete if-block
  • Chunk B starts with }, a closing brace stripped of its context
  • The handleUserLogin function definition is cut in two, spanning Chunk A and Chunk B

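Downstream effects

Embedding Chunk A → the vector represents "an incomplete fragment of the login function"
Embedding Chunk B → the vector represents "the password-verification part of the function"
↓
User asks: "How is a failed login handled?"
↓
Vector search → finds Chunk A and Chunk B
↓
Model sees: "... return { success: false, error: 'USER_NOT_FOUND' };\n    }\n    \n    const passwordValid = await..."
↓
Model is confused: the if-block is cut and the function is incomplete → the reply is partial or wrong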

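Sense of scale

| Scenario | Impact |
| --- | --- |
| A conversation contains one 50-line function | It is cut into 2-3 chunks and its meaning is fragmented |
| User asks: "How does that function handle a wrong password?" | Only Chunk B matches, without the surrounding context |
| User asks: "What is the MFA flow?" | The middle of Chunk B matches, but the if structure of handleUserLogin has been cut |
| 10 questions of this kind | At least 7-8 replies are noticeably worse |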

Existing comparison: how claude-context does it

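ast-splitter.ts in zilliztech/claude-context already implements this idea. It parses the code into an AST with tree-sitter and cuts chunks at syntax nodes such as function_declaration / class_declaration / method_definition; unsupported languages fall back to the LangChain character splitter. The logic is roughly: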

// Core logic of claude-context's ast-splitter.ts
const SPLITTABLE_NODE_TYPES = {
  javascript: ['function_declaration', 'arrow_function', 'class_declaration', 'method_definition', 'export_statement'],
  typescript: ['function_declaration', 'arrow_function', 'class_declaration', 'method_definition', 'export_statement', 'interface_declaration', 'type_alias_declaration'],
  python: ['function_definition', 'class_definition', 'decorated_definition'],
  // ...
};
// Each declaration = 1 semantic chunk
// A declaration over the size limit → sub-split within that declaration at statement level
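
The "one declaration = one chunk" rule can be sketched without the tree-sitter dependency by walking any tree with the same shape. `NodeLike` below is a local interface modeled on the `type`, `startIndex`, `endIndex` and `children` fields of a tree-sitter node; `collectDeclarations` and `declarationChunks` are hypothetical names, and stopping at the outermost splittable node is a choice made for this sketch, not a transcription of claude-context's code.

```typescript
// Minimal node shape, declared locally so the sketch runs without tree-sitter.
interface NodeLike {
  type: string;
  startIndex: number;
  endIndex: number;
  children: NodeLike[];
}

const SPLITTABLE = new Set([
  "function_declaration",
  "class_declaration",
  "method_definition",
]);

// Depth-first walk that emits the outermost splittable nodes
// and does not descend into them.
function collectDeclarations(root: NodeLike): NodeLike[] {
  const out: NodeLike[] = [];
  const visit = (node: NodeLike): void => {
    if (SPLITTABLE.has(node.type)) {
      out.push(node);
      return;
    }
    node.children.forEach(visit);
  };
  visit(root);
  return out;
}

// Each collected declaration becomes one chunk of source text.
function declarationChunks(source: string, root: NodeLike): string[] {
  return collectDeclarations(root).map((n) => source.slice(n.startIndex, n.endIndex));
}

// Hand-built tree standing in for a parser result.
const src = "function a() {}\nclass B { m() {} }\n";
const span = (s: string) => ({ startIndex: src.indexOf(s), endIndex: src.indexOf(s) + s.length });
const tree: NodeLike = {
  type: "program",
  startIndex: 0,
  endIndex: src.length,
  children: [
    { type: "function_declaration", ...span("function a() {}"), children: [] },
    {
      type: "class_declaration",
      ...span("class B { m() {} }"),
      children: [{ type: "method_definition", ...span("m() {}"), children: [] }],
    },
  ],
};

console.log(declarationChunks(src, tree));
```

Here the method `m` stays inside the chunk for class `B` instead of becoming its own chunk; whether nested declarations should be emitted separately is a design decision the proposal would need to settle.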

Proposed Solution

Add an AST-aware branch to chunker.ts:

Implementation direction

// New: detectCodeLanguage() decides whether the text is code and, if so, which language
function detectCodeLanguage(text: string): string | null {
  // JS/TS: function/class/import/export/const + arrow function
  // Python: def / class / import / from
  // Go: func / package / import
  // Rust: fn / impl / let mut / pub fn
  const codePatterns = [
    // "=>" sits outside the \b group: \b cannot match next to "=" when it is surrounded by spaces
    { pattern: /\b(?:function|const|let|var|import|export)\b|=>/, lang: 'javascript' },
    // "(" must be escaped; the alternatives are grouped so \b applies to each of them
    { pattern: /\b(?:def |class |import |from |print\()/, lang: 'python' },
    { pattern: /\b(?:func |package |import )/, lang: 'go' },
    { pattern: /\b(?:fn |impl |pub fn |let mut )/, lang: 'rust' },
  ];
  // ...
}
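
One way to fill in the elided part is to score languages instead of taking the first match. A first-match list like the one above would label `import os` as JavaScript, because the JavaScript entry is checked first and also contains `import`. The sketch below is an assumption about how detection could work, not the proposed implementation: `SIGNALS`, `detectCodeLanguageSketch` and the `minHits` threshold are all hypothetical, and the regexes are deliberately line-anchored so ordinary prose rarely matches.

```typescript
type Lang = "javascript" | "python" | "go" | "rust";

// Hypothetical per-language signals; every regex is tested against single lines.
const SIGNALS: Record<Lang, RegExp[]> = {
  javascript: [
    /^\s*(?:export\s+)?(?:async\s+)?function\s+\w+\s*\(/,
    /^\s*(?:const|let|var)\s+\w+/,
    /=>/,
    /^\s*import\s.+\sfrom\s+['"]/,
  ],
  python: [
    /^\s*def\s+\w+\s*\(.*\)\s*(?:->.*)?:\s*$/,
    /^\s*class\s+\w+.*:\s*$/,
    /^\s*from\s+[\w.]+\s+import\s/,
    /^\s*import\s+[\w.]+\s*$/,
  ],
  go: [/^\s*func\s+(?:\(\w+\s+\*?\w+\)\s*)?\w+\s*\(/, /^package\s+\w+\s*$/, /:=/],
  rust: [/^\s*(?:pub\s+)?fn\s+\w+/, /^\s*impl\b/, /\blet\s+mut\s/],
};

// Counts, per language, how many lines match at least one signal.
// Returns the best-scoring language, or null when fewer than minHits lines match.
function detectCodeLanguageSketch(text: string, minHits = 2): Lang | null {
  const lines = text.split("\n");
  let best: Lang | null = null;
  let bestScore = 0;
  for (const lang of Object.keys(SIGNALS) as Lang[]) {
    let score = 0;
    for (const line of lines) {
      if (SIGNALS[lang].some((re) => re.test(line))) score++;
    }
    if (score > bestScore) {
      best = lang;
      bestScore = score;
    }
  }
  return bestScore >= minHits ? best : null;
}
```

Requiring at least two matching lines keeps a single stray keyword in natural language from routing a message into the AST path. Ties go to whichever language is listed first in `SIGNALS`.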

// New: astChunk() splits code with tree-sitter
export async function astChunk(
  code: string,
  language: string,
  config: ChunkerConfig
): Promise<AstChunkResult> {
  // 1. Parse AST with tree-sitter
  // 2. Walk top-level declarations (function_declaration, class_declaration, etc.)
  // 3. Each declaration = 1 chunk (when its size is within the limit)
  // 4. A declaration over the limit → sub-split at statement level
  // 5. Fallback: languages that cannot be parsed → use the existing character split
  throw new Error('astChunk is an outline only; not implemented yet');
}
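
Steps 3 and 4 can be sketched on a hand-built tree. This assumes the function body is a child node of type `statement_block` whose children are the statements, with punctuation tokens left out. `NodeLike` and `chunkDeclaration` are hypothetical names for this illustration; with a real parser the node shape and body type would come from the grammar.

```typescript
// Minimal node shape, declared locally so the sketch runs without tree-sitter.
interface NodeLike {
  type: string;
  startIndex: number;
  endIndex: number;
  children: NodeLike[];
}

function chunkDeclaration(source: string, decl: NodeLike, maxChunkSize: number): string[] {
  const whole = source.slice(decl.startIndex, decl.endIndex);
  // Step 3: a declaration within the limit is exactly one chunk.
  if (whole.length <= maxChunkSize) return [whole];

  const body = decl.children.find((c) => c.type === "statement_block");
  // Nothing to sub-split on: return it oversized and let the caller decide.
  if (!body || body.children.length === 0) return [whole];

  // Step 4: pack consecutive statements into pieces, cutting only between statements.
  // The first piece carries the signature; the last carries the closing brace,
  // so it can exceed the limit by the length of that trailing text.
  const chunks: string[] = [];
  let pieceStart = decl.startIndex;
  let pieceEnd = decl.startIndex;
  for (const stmt of body.children) {
    const wouldOverflow = stmt.endIndex - pieceStart > maxChunkSize;
    if (wouldOverflow && pieceEnd > pieceStart) {
      chunks.push(source.slice(pieceStart, pieceEnd));
      pieceStart = pieceEnd;
    }
    pieceEnd = stmt.endIndex;
  }
  chunks.push(source.slice(pieceStart, decl.endIndex));
  return chunks;
}

// Hand-built tree standing in for a parser result.
const src = "function f() {\n  a();\n  b();\n  c();\n}";
const mkStmt = (s: string): NodeLike => ({
  type: "expression_statement",
  startIndex: src.indexOf(s),
  endIndex: src.indexOf(s) + s.length,
  children: [],
});
const decl: NodeLike = {
  type: "function_declaration",
  startIndex: 0,
  endIndex: src.length,
  children: [
    {
      type: "statement_block",
      startIndex: src.indexOf("{"),
      endIndex: src.length,
      children: [mkStmt("a();"), mkStmt("b();"), mkStmt("c();")],
    },
  ],
};

console.log(chunkDeclaration(src, decl, 25));
```

The pieces concatenate back to the original declaration, so nothing is lost or duplicated. A single statement longer than the limit still comes out oversized and would need the existing character split as a last resort.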

When it triggers

smartChunk(text)
  → detectCodeLanguage(text)
     null          → use the existing character split (natural language)
     "python"      → astChunk(text, "python")
     "javascript"  → astChunk(text, "javascript")

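The dispatch above can be sketched with the three collaborators injected, so that any parse failure quietly falls back to the existing split. Everything here is hypothetical: `DispatchDeps`, `smartChunkSketch` and the `astEnabled` switch are illustration names, and chunks are simplified to plain strings rather than the real chunker result types. One consequence worth noting is that an async astChunk() makes the dispatcher async as well.

```typescript
interface DispatchDeps {
  detect: (text: string) => string | null;
  astChunk: (text: string, lang: string) => Promise<string[]>;
  characterChunk: (text: string) => string[];
  astEnabled: boolean; // hypothetical config switch
}

async function smartChunkSketch(text: string, deps: DispatchDeps): Promise<string[]> {
  // Switch off: behave exactly like the existing chunker.
  if (!deps.astEnabled) return deps.characterChunk(text);

  // Natural language (no language detected) keeps the character split.
  const lang = deps.detect(text);
  if (lang === null) return deps.characterChunk(text);

  try {
    const chunks = await deps.astChunk(text, lang);
    // An empty result is treated like a failure.
    return chunks.length > 0 ? chunks : deps.characterChunk(text);
  } catch {
    // Unsupported grammar or parse error: fall back instead of failing the pipeline.
    return deps.characterChunk(text);
  }
}
```

Passing the collaborators in keeps the sketch testable without a parser; in chunker.ts they would presumably be direct calls.
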
Dependencies (Phase 1)

{
  "dependencies": {
    "tree-sitter": "^0.21.0",
    "tree-sitter-javascript": "^0.21.0",
    "tree-sitter-python": "^0.21.0"
  }
}

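Phase 1 priorities

| Language | Priority | Reason |
| --- | --- | --- |
| JavaScript / TypeScript | P0 | Most common |
| Python | P0 | Common |
| Go | P1 | Commonly used |
| Rust | P1 | Commonly used |
| Others | P2 | Fall back to the existing logic |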

Impact

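| Dimension | Improvement |
| --- | --- |
| Code-understanding accuracy | Chunks keep whole functions, so replies are no longer confused by fragments |
| Cross-function context | Code from the same function stays in the same chunk, avoiding breaks |
| Maintainability | AST splitting matches developer intuition better than character splitting |
| Vector quality | Each chunk vector represents a complete semantic unit, so similarity scores are more accurate |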

Questions for Maintainers

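  1. Is the extra tree-sitter dependency worth it? Or is there a lighter alternative (for example, detecting function boundaries with regex only)?
  2. Should AST chunking be on by default, or should it sit behind a config switch?
  3. For languages other than TS/JS/Python, is the existing semantic split already sufficient?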
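
For question 1, a dependency-free heuristic is possible for brace-delimited languages. The sketch below tracks brace depth line by line and closes a block whenever the depth returns to zero. `topLevelBlocks` is a hypothetical helper, not an existing function in chunker.ts. It miscounts braces inside strings, comments and template literals, and it does nothing for indentation-based languages such as Python, which is the main trade-off against a real parser.

```typescript
// Splits source into top-level blocks using brace depth only.
// Assumption: braces inside strings or comments are rare enough to ignore.
function topLevelBlocks(source: string): string[] {
  const blocks: string[] = [];
  let depth = 0;
  let current: string[] = [];

  const flush = (): void => {
    const block = current.join("\n").trim();
    if (block.length > 0) blocks.push(block);
    current = [];
  };

  for (const line of source.split("\n")) {
    let closedAtZero = false;
    for (const ch of line) {
      if (ch === "{") {
        depth++;
      } else if (ch === "}") {
        depth = Math.max(0, depth - 1);
        if (depth === 0) closedAtZero = true;
      }
    }
    current.push(line);
    // A block ends when a "}" brings the depth back to zero,
    // or at a blank line while no block is open.
    if (closedAtZero || (depth === 0 && line.trim() === "")) flush();
  }
  flush();
  return blocks;
}
```

For example, two functions separated by a blank line come back as two blocks, with nested if-blocks kept inside their enclosing function.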
