feat: AST-based semantic chunking for code blocks

## Problem Statement

### Current state: character splitting breaks code semantics

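When a user pastes a long piece of code into a conversation, that code ends up in the extraction pipeline and may be processed by `smartChunk()` in `chunker.ts`. `chunker.ts` is currently a purely character-based splitter; the split logic lives in `findSplitEnd()`: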

```typescript
// src/chunker.ts — findSplitEnd()
if (config.semanticSplit) {
  // Prefer a sentence boundary near the end.
  for (let i = safeMaxEnd - 1; i >= safeMinEnd; i--) {
    if (SENTENCE_ENDING.test(text[i])) { ... return j; }
  }
  // Next best: newline boundary.
  for (let i = safeMaxEnd - 1; i >= safeMinEnd; i--) {
    if (text[i] === "\n") return i + 1;
  }
}
```

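This logic works for **natural language** but is a disaster for **code**: a sentence ending or a newline says nothing about where a function or block ends.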

### A concrete breakage example

Suppose this TypeScript needs to be chunked after extraction:

```typescript
async function handleUserLogin(userId: string, credentials: LoginCredentials): Promise<AuthResult> {
    const user = await this.userRepository.findById(userId);
    if (!user) {
        return { success: false, error: 'USER_NOT_FOUND' };
    }
    
    const passwordValid = await this.verifyPassword(credentials.password, user.passwordHash);
    if (!passwordValid) {
        return { success: false, error: 'INVALID_PASSWORD' };
    }
    
    if (user.mfaEnabled) {
        const mfaToken = await this.generateMFAToken(user.id);
        return { success: false, error: 'MFA_REQUIRED', mfaToken };
    }
    
    const session = await this.createSession(user);
    return { success: true, session };
}

async function verifyPassword(inputPassword: string, storedHash: string): Promise<boolean> {
    const bcrypt = await import('bcrypt');
    return bcrypt.compare(inputPassword, storedHash);
}
```

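With `maxChunkSize = 4000` characters, and assuming the snippet sits inside a longer message so that the limit is reached mid-function, the split lands here (chunk sizes are illustrative):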

```
Chunk A (~3800 chars):
"async function handleUserLogin(userId: string, credentials: LoginCredentials): Promise<AuthResult> {\n"
"    const user = await this.userRepository.findById(userId);\n"
"    if (!user) {\n"
"        return { success: false, error: 'USER_NOT_FOUND' };"

Chunk B (~900 chars):
"    }\n"
"    \n"
"    const passwordValid = await this.verifyPassword(credentials.password, user.passwordHash);\n"
"    if (!passwordValid) {\n"
"        return { success: false, error: 'INVALID_PASSWORD' };\n"
"    }\n"
"    \n"
"    if (user.mfaEnabled) {\n"
"        const mfaToken = await this.generateMFAToken(user.id);\n"
"        return { success: false, error: 'MFA_REQUIRED', mfaToken };\n"
"    }\n"
"    \n"
"    const session = await this.createSession(user);\n"
"    return { success: true, session };\n"
"}"
```

**Problems:**
- Chunk A ends at `return { success: false, error: 'USER_NOT_FOUND' };`: **an incomplete if-block**
- Chunk B starts with `}`: **a closing brace stripped of its context**
- The `handleUserLogin` definition is cut in two, spanning Chunk A and Chunk B
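
The failure mode above can be reproduced in a few lines. This is a minimal sketch, not the real `findSplitEnd()`: `splitAtLastNewline` and `braceDepth` are hypothetical helpers that only mimic the newline-boundary fallback and count braces naively (strings and comments are not handled).

```typescript
// Hypothetical helper: cut at the last newline before `maxEnd`,
// similar in spirit to the newline-boundary fallback (not the real code).
function splitAtLastNewline(text: string, maxEnd: number): [string, string] {
  const idx = text.lastIndexOf("\n", maxEnd - 1);
  const cut = idx >= 0 ? idx + 1 : Math.min(maxEnd, text.length);
  return [text.slice(0, cut), text.slice(cut)];
}

// Net brace depth of a chunk: opening braces minus closing braces.
// Naive on purpose: braces inside strings or comments are counted too.
function braceDepth(chunk: string): number {
  let depth = 0;
  for (const ch of chunk) {
    if (ch === "{") depth++;
    else if (ch === "}") depth--;
  }
  return depth;
}

const code = [
  "function f(x: number) {",
  "  if (x > 0) {",
  "    return 1;",
  "  }",
  "  return 0;",
  "}",
].join("\n");

const [chunkA, chunkB] = splitAtLastNewline(code, 50);
console.log(braceDepth(chunkA)); // 2: two blocks left open
console.log(braceDepth(chunkB)); // -2: two closing braces with no opener
```

A non-zero depth on either side is a cheap signal that a split landed inside a block, which is exactly what happens to Chunk A and Chunk B.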

### Downstream impact chain

```
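Embedding Chunk A → vector represents "an incomplete fragment of the login function"
Embedding Chunk B → vector represents "password-check code with no function signature"
↓
User asks: "How is a failed login handled?"
↓
Vector search → finds Chunk A and Chunk B
↓
Model sees: "... return { success: false, error: 'USER_NOT_FOUND' };\n    }\n    \n    const passwordValid = await..."
↓
Model is confused: the if-block is cut and the function is incomplete → the reply is partial or wrong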
```

### Sense of scale

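| Scenario | Impact |
|------|------|
| Conversation contains one 50-line function | Cut into 2-3 chunks; semantics are fragmented |
| User asks: "How does that function handle a wrong password?" | Only Chunk B matches; the surrounding context is missing |
| User asks: "What is the MFA flow?" | The middle of Chunk B matches, but the `if` structure of `handleUserLogin` is cut off |
| 10 questions of this kind | At least 7-8 replies show a clear drop in quality |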

### Prior art: how claude-context does it

`ast-splitter.ts` in `zilliztech/claude-context` already implements this idea. It parses the AST with tree-sitter and cuts chunks at syntax nodes such as `function_declaration`, `class_declaration`, and `method_definition`; unsupported languages fall back to the LangChain character splitter. The logic looks like this:

```typescript
// Core logic of claude-context's ast-splitter.ts
const SPLITTABLE_NODE_TYPES = {
  javascript: ['function_declaration', 'arrow_function', 'class_declaration', 'method_definition', 'export_statement'],
  typescript: ['function_declaration', 'arrow_function', 'class_declaration', 'method_definition', 'export_statement', 'interface_declaration', 'type_alias_declaration'],
  python: ['function_definition', 'class_definition', 'decorated_definition'],
  // ...
};
// Each declaration = 1 semantic chunk
// Over the size limit → sub-split within that declaration at statement level
```
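
To make the "syntax nodes as boundaries" idea concrete, here is a rough sketch of collecting splittable nodes from a parsed tree. It is not claude-context's actual code. `NodeLike` is a hypothetical, minimal node shape assumed for illustration (tree-sitter nodes carry similar information), and `collectSplittable` is a made-up name.

```typescript
// Hypothetical minimal node shape, assumed for this sketch only.
interface NodeLike {
  type: string;
  startIndex: number;
  endIndex: number;
  children: NodeLike[];
}

// Depth-first walk that returns the outermost nodes whose type is splittable.
// It does not descend into a match, so a nested function stays with its parent
// and an `export_statement` keeps the declaration it wraps.
function collectSplittable(root: NodeLike, types: ReadonlySet<string>): NodeLike[] {
  const found: NodeLike[] = [];
  const visit = (node: NodeLike): void => {
    if (types.has(node.type)) {
      found.push(node);
      return;
    }
    for (const child of node.children) visit(child);
  };
  visit(root);
  return found;
}
```

Each returned node's `[startIndex, endIndex)` range can then be sliced out of the source text as one candidate chunk.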

---

## Proposed Solution

Add an AST-aware branch to `chunker.ts`:

### Implementation direction

```typescript
// New: detectCodeLanguage(), decides whether the text is code and in which language
function detectCodeLanguage(text: string): string | null {
  // JS/TS: function/class/import/export/const + arrow function
  // Python: def / class / import / from
  // Go: func / package / import
  // Rust: fn / impl / let mut / pub fn
  const codePatterns = [
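    // Keywords are grouped so \b applies to every alternative; `=>` sits outside
    // the \b...\b pair because it is not a word, and `(` after print is escaped.
    { pattern: /\b(?:function|const|let|var|import|export)\b|=>/, lang: 'javascript' },
    { pattern: /\b(?:def |class |import |from )|\bprint\(/, lang: 'python' },
    { pattern: /\b(?:func |package |import )/, lang: 'go' },
    { pattern: /\b(?:fn |impl |pub fn |let mut )/, lang: 'rust' },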
  ];
  // ...
}

// New: astChunk(), a tree-sitter based split
export async function astChunk(
  code: string,
  language: string,
  config: ChunkerConfig
): Promise<AstChunkResult> {
  // 1. Parse AST with tree-sitter
  // 2. Walk top-level declarations (function_declaration, class_decl, etc.)
  // 3. Each declaration = 1 chunk (when its size is within the limit)
  // 4. A declaration over the limit → sub-split at statement level
  // 5. Fallback: languages that cannot be parsed → existing character split
}
```
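
Steps 2 to 4 above could be sketched roughly as follows. This is an illustration under assumptions, not the proposed implementation. It works on a hypothetical `NodeLike` shape instead of a real tree-sitter tree, `chunkTopLevel` and `packSiblings` are made-up names, and it treats a declaration's direct children as its "statements" (in a real tree they usually sit under a body node).

```typescript
// Hypothetical minimal node shape, assumed for this sketch only.
interface NodeLike {
  type: string;
  startIndex: number;
  endIndex: number;
  children: NodeLike[];
}

// Greedily pack consecutive sibling nodes into chunks of at most maxChunkSize
// characters. A single sibling larger than the limit is still emitted whole.
function packSiblings(source: string, nodes: NodeLike[], maxChunkSize: number): string[] {
  const chunks: string[] = [];
  let start = -1;
  let end = -1;
  for (const node of nodes) {
    if (start >= 0 && node.endIndex - start > maxChunkSize) {
      chunks.push(source.slice(start, end));
      start = -1;
    }
    if (start < 0) start = node.startIndex;
    end = node.endIndex;
  }
  if (start >= 0) chunks.push(source.slice(start, end));
  return chunks;
}

// One chunk per top-level declaration; oversized declarations are sub-split
// across their children instead of at an arbitrary character offset.
function chunkTopLevel(source: string, root: NodeLike, maxChunkSize: number): string[] {
  const chunks: string[] = [];
  for (const decl of root.children) {
    const size = decl.endIndex - decl.startIndex;
    if (size <= maxChunkSize || decl.children.length === 0) {
      chunks.push(source.slice(decl.startIndex, decl.endIndex));
    } else {
      chunks.push(...packSiblings(source, decl.children, maxChunkSize));
    }
  }
  return chunks;
}
```

One design question this leaves open is whether sub-split chunks should repeat the enclosing function signature, so that each piece still embeds with its context.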

### When it triggers

```typescript
smartChunk(text)
  → detectCodeLanguage(text)
    → null    → existing character split (natural language)
    → "python" → astChunk(text, "python")
    → "javascript" → astChunk(text, "javascript")
```
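
The routing above, including the fallback from step 5, might be wired up like this. It is a sketch only: `smartChunkSketch`, the `Chunker` type, and the injected `detect`, `astChunkers`, and `characterSplit` parameters are assumptions made to keep the example self-contained, not the existing `chunker.ts` API.

```typescript
// Hypothetical signature: any chunker takes text and resolves to chunks.
type Chunker = (text: string) => Promise<string[]>;

// Route code to an AST chunker when one exists for the detected language,
// and fall back to the character splitter in every other case.
async function smartChunkSketch(
  text: string,
  detect: (text: string) => string | null,
  astChunkers: Record<string, Chunker>,
  characterSplit: Chunker,
): Promise<string[]> {
  const lang = detect(text);
  const chunker = lang !== null ? astChunkers[lang] : undefined;
  if (!chunker) return characterSplit(text); // natural language or unsupported language
  try {
    const chunks = await chunker(text);
    // An empty result is treated like a failed parse.
    return chunks.length > 0 ? chunks : characterSplit(text);
  } catch {
    // Parse errors must never break chunking; degrade to the current behaviour.
    return characterSplit(text);
  }
}
```

Keeping the character splitter as the unconditional fallback means a wrong language guess costs chunk quality, not correctness.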

### Dependencies (Phase 1)

```json
{
  "dependencies": {
    "tree-sitter": "^0.21.0",
    "tree-sitter-javascript": "^0.21.0",
    "tree-sitter-python": "^0.21.0"
  }
}
```

### Phase 1 priorities

| Language | Priority | Reason |
|------|--------|------|
| JavaScript / TypeScript | P0 | Most common |
| Python | P0 | Common |
| Go | P1 | Commonly used |
| Rust | P1 | Commonly used |
| Others | P2 | Fall back to the existing logic |

---

## Impact

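| Dimension | Improvement |
|------|------|
| Code comprehension accuracy | Chunks keep whole functions, so replies no longer suffer from "fragment" confusion |
| Function-level context | Code from the same function stays in one chunk, avoiding breaks |
| Maintainability | AST splitting matches developer intuition better than character splitting |
| Vector quality | Each chunk's vector represents a complete semantic unit, so similarity scores are more accurate |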

---

## Questions for Maintainers

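1. Is the extra tree-sitter dependency worth it, or is there a lighter alternative (for example, detecting function boundaries with regex only)?
2. Should AST chunking be on by default, or gated behind a config switch?
3. For languages other than TS/JS/Python, is the existing semantic split already good enough?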

---

## References

- `zilliztech/claude-context` `ast-splitter.ts` (reviewed)
- tree-sitter official documentation: https://tree-sitter.github.io/tree-sitter/

