Skip to content

feat(import-markdown): batch processing pipeline + test rewrite#730

Closed
jlin53882 wants to merge 7 commits intoCortexReach:masterfrom
jlin53882:feat/batch-import-v2
Closed

feat(import-markdown): batch processing pipeline + test rewrite#730
jlin53882 wants to merge 7 commits intoCortexReach:masterfrom
jlin53882:feat/batch-import-v2

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

@jlin53882 jlin53882 commented Apr 30, 2026

Summary

Phase 2b batch pipeline (embedBatchPassage + bulkStore)

  • Replace per-entry embedPassage + store() with embedBatchPassage + bulkStore() pipeline
  • Configurable batchSize (default: 10) to control embedder memory pressure
  • FLUSH_THRESHOLD=100: accumulate entries and flush via bulkStore in bulk, reducing lock contention
  • Parallel dedup check via retriever.retrieve() in chunks of 50 (Phase 2a)

Bug fixes

  • P1: Wrap bulkStore in try/catch — one batch failure no longer aborts remaining batches
  • P2: Dedup now runs before dry-run report, so --dry-run shows accurate preview with correct skippedDedup count (dedup check is NOT skipped in dry-run mode)
  • P3: parseArgs converts --dedup=false to string 'false' (truthy); fix with !== 'false' check

Test rewrite

  • Rewrite all import-markdown tests against real cli.ts behavior
  • Add batch pipeline, bulkStore failure, scope inference, return fields coverage
  • Use real bullet content from James workspace (all > minTextLength=5)
  • Remove non-existent directory-filter behavior tests
  • 19 tests, 0 fail, 0 skip

Behavior: dry-run vs real execution

--dry-run (current)

  • Dedup check runs via retrieve() (embedQuery + reranker per entry)
  • Would import = entries that pass dedup check (pendingEntries.length)
  • Would skip (dedup) = entries caught by dedup (skippedDedup)
  • skippedDedup is accurately counted and returned

--dry-run --no-dedup (current)

  • Dedup check is skipped
  • Would import = allEntries.length
  • skippedDedup = 0

Real execution (current)

  • Phase 2a dedup runs via retrieve() (embedQuery + reranker per entry)
  • pendingEntries = entries that pass dedup check
  • imported = pendingEntries.length (actual write count)
  • skippedDedup accurately reflects dedup hits

Future (with Bloom Filter, not in this PR)

  • --dry-run can use Bloom Filter O(1) check without embed API calls
  • Would import = pendingEntries.length (matches real execution)
  • Would skip (dedup) = bloom-filter-hit count
  • No semantic change to real execution path

Testing

node --test test/import-markdown/import-markdown.test.mjs
=> 19 pass, 0 fail, 0 skip

… James workspace data

- Fix parseArgs converts --dedup=false to string 'false' (truthy) bug
- Rewrite all import-markdown tests against real cli.ts behavior
- Add batch pipeline, bulkStore failure, scope inference, return fields tests
- Use real bullet content from James's workspace (all > minTextLength=5)
- Remove non-existent directory-filter behavior tests
- 19 tests pass, 0 fail
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27e05d9d7b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cli.ts
Comment on lines +667 to +668
const dedupEnabled = (options.dedup as unknown as string | undefined | boolean) !== false
&& (options.dedup as unknown as string | undefined | boolean) !== "false";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Keep dedup off when --dedup is not provided

This condition enables deduplication when options.dedup is undefined, because only false and "false" are treated as disabled. The command defines --dedup as an optional boolean flag, so omitted values can be unset; with the current check, dedup runs by default and changes import behavior (unexpected duplicate skipping and much heavier retrieval work) even when users did not ask for dedup.

Useful? React with 👍 / 👎.

Comment thread cli.ts Outdated
console.log(`\u2022 Skipped (short): ${skippedShort}`);
console.log(`\u2022 Skipped (dedup): ${dryRunDedupSkipped.length}`);
console.log(`\u2022 Elapsed: ${elapsed}ms`);
return { imported, skipped: skippedShort, foundFiles, skippedShort, skippedDedup: dryRunDedupSkipped.length, errorCount: parseErrors, elapsedMs: elapsed };
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Report full skipped count in dry-run return value

The dry-run branch returns skipped: skippedShort, which excludes dedup skips and read/parse skips already counted in skipped. This makes skipped inconsistent with non-dry-run results and with previous behavior, so automation that relies on skipped for preview-vs-run parity will undercount skipped entries in dry-run mode.

Useful? React with 👍 / 👎.

Before: skipped: skippedShort (excludes dedup hits from count)
After:  skipped: skipped (includes both short + dedup skips)

Dry-run console output was correct; only the return value was wrong.
@jlin53882
Copy link
Copy Markdown
Contributor Author

追加修復:dry-run 回傳值 Bug(commit e518afd

問題:啟用 --dedup 時,dry-run 的回傳物件中 skipped 欄位只計算 short lines,漏掉 dedup hits。

// cli.ts:777 — 修復前
return { imported, skipped: skippedShort, ... };

// cli.ts:777 — 修復後
return { imported, skipped, ... };

原因skipped 在 Phase 1b(short lines)和 Phase 2a(dedup)都會遞增,但原本只回傳 Phase 1b 的 skippedShort

影響:只有 API 回傳值錯誤,console 輸出和實際行為都正確,且現有測試沒有 assert result.skipped,所以一直沒被發現。

驗證:19/19 測試全部通過。

@jlin53882
Copy link
Copy Markdown
Contributor Author

關於刪除的 2 個目錄保護測試

test/import-markdown/import-markdown.test.mjs 中,以下兩個測試被移除:

  • skips a directory named YYYY-MM-DD.md without aborting import
  • skips a .md directory in flatMemoryDir without aborting import

為什麼可以移除

CLI 掃描層(scanAgentMd / flatMemoryDir)已使用 f.isFile() 過濾:

// cli.ts — mdFiles collection
if (f.isFile() && f.name.endsWith(.md)) {
  // ...
}

f.isFile() 在 Node.js 會區分「檔案」與「目錄」——.md 目錄在 readdir + withFileTypes: true 時,isFile() 會正確返回 false,根本不會被加進 mdFiles 清單。因此不需要 defensive skip logic,也不會有 EISDIR 錯誤。

結論

這兩個測試是 defensive 性質的 regression test,在 f.isFile() 已保證正確行為的前提下是冗餘的,移除是合理的。若日後有人移除 withFileTypes: true 或改用 readdirSync() 而不搭配 isFile(),這兩個 test 才需要被加回來。

jlin53882 added 2 commits May 1, 2026 01:37
- CLI: add --no-dedup negation flag; dedup now defaults to enabled
- Tests: add dedup: false explicitly to all non-dedup tests so behavior
  is intentional and not dependent on mock return values
- P3 comment: correct misleading 'parseArgs' reference to clarify
  the != 'false' guard exists for the test helper path (Commander.js
  rejects --dedup=false as unknown option)
…leanup, remove source:cli

- Phase 1a: move foundFiles++ inside try block (only count successfully read files)
- Phase 1b: remove duplicate foundFiles++ in error branch (Phase 1a already counted)
- Phase 2a: remove skipped++ from dedup hit (skipped is Phase 1b short entries only; dedup tracked via skippedDedup)
- Remove unused source:'cli' from retrieve() call
@jlin53882
Copy link
Copy Markdown
Contributor Author

目前測試,dedup/dry run 計數還有一些相關內容還在調整中,先不要review,如果有想法歡迎一起討論

@jlin53882
Copy link
Copy Markdown
Contributor Author

Closed — superseded by #735 (clean rebuild)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant