feat(import-markdown): batch processing pipeline + test rewrite by jlin53882 · Pull Request #730 · CortexReach/memory-lancedb-pro

jlin53882 · 2026-04-30T16:00:31Z

Summary

Phase 2b batch pipeline (embedBatchPassage + bulkStore)

Replace per-entry embedPassage + store() with embedBatchPassage + bulkStore() pipeline
Configurable batchSize (default: 10) to control embedder memory pressure
FLUSH_THRESHOLD=100: accumulate entries and flush via bulkStore in bulk, reducing lock contention
Parallel dedup check via retriever.retrieve() in chunks of 50 (Phase 2a)

Bug fixes

P1: Wrap bulkStore in try/catch — one batch failure no longer aborts remaining batches
P2: Dedup now runs before dry-run report, so --dry-run shows accurate preview with correct skippedDedup count (dedup check is NOT skipped in dry-run mode)
P3: parseArgs converts --dedup=false to string 'false' (truthy); fix with !== 'false' check

Test rewrite

Rewrite all import-markdown tests against real cli.ts behavior
Add batch pipeline, bulkStore failure, scope inference, return fields coverage
Use real bullet content from James workspace (all > minTextLength=5)
Remove non-existent directory-filter behavior tests
19 tests, 0 fail, 0 skip

Behavior: dry-run vs real execution

--dry-run (current)

Dedup check runs via retrieve() (embedQuery + reranker per entry)
Would import = entries that pass dedup check (pendingEntries.length)
Would skip (dedup) = entries caught by dedup (skippedDedup)
skippedDedup is accurately counted and returned

--dry-run --no-dedup (current)

Dedup check is skipped
Would import = allEntries.length
skippedDedup = 0

Real execution (current)

Phase 2a dedup runs via retrieve() (embedQuery + reranker per entry)
pendingEntries = entries that pass dedup check
imported = pendingEntries.length (actual write count)
skippedDedup accurately reflects dedup hits

Future (with Bloom Filter, not in this PR)

--dry-run can use Bloom Filter O(1) check without embed API calls
Would import = pendingEntries.length (matches real execution)
Would skip (dedup) = bloom-filter-hit count
No semantic change to real execution path

Testing

node --test test/import-markdown/import-markdown.test.mjs
=> 19 pass, 0 fail, 0 skip

…+ bulkStore + parallel dedup

… James workspace data - Fix parseArgs converts --dedup=false to string 'false' (truthy) bug - Rewrite all import-markdown tests against real cli.ts behavior - Add batch pipeline, bulkStore failure, scope inference, return fields tests - Use real bullet content from James's workspace (all > minTextLength=5) - Remove non-existent directory-filter behavior tests - 19 tests pass, 0 fail

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27e05d9d7b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-30T16:06:06Z

+  const dedupEnabled = (options.dedup as unknown as string | undefined | boolean) !== false
+    && (options.dedup as unknown as string | undefined | boolean) !== "false";


Keep dedup off when --dedup is not provided

This condition enables deduplication when options.dedup is undefined, because only false and "false" are treated as disabled. The command defines --dedup as an optional boolean flag, so omitted values can be unset; with the current check, dedup runs by default and changes import behavior (unexpected duplicate skipping and much heavier retrieval work) even when users did not ask for dedup.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-04-30T16:06:06Z

+    console.log(`\u2022 Skipped (short): ${skippedShort}`);
+    console.log(`\u2022 Skipped (dedup): ${dryRunDedupSkipped.length}`);
+    console.log(`\u2022 Elapsed: ${elapsed}ms`);
+    return { imported, skipped: skippedShort, foundFiles, skippedShort, skippedDedup: dryRunDedupSkipped.length, errorCount: parseErrors, elapsedMs: elapsed };


Report full skipped count in dry-run return value

The dry-run branch returns skipped: skippedShort, which excludes dedup skips and read/parse skips already counted in skipped. This makes skipped inconsistent with non-dry-run results and with previous behavior, so automation that relies on skipped for preview-vs-run parity will undercount skipped entries in dry-run mode.

Useful? React with 👍 / 👎.

Before: skipped: skippedShort (excludes dedup hits from count) After: skipped: skipped (includes both short + dedup skips) Dry-run console output was correct; only the return value was wrong.

jlin53882 · 2026-04-30T16:45:53Z

追加修復：dry-run 回傳值 Bug（commit `e518afd`）

問題：啟用 --dedup 時，dry-run 的回傳物件中 skipped 欄位只計算 short lines，漏掉 dedup hits。

// cli.ts:777 — 修復前
return { imported, skipped: skippedShort, ... };

// cli.ts:777 — 修復後
return { imported, skipped, ... };

原因：skipped 在 Phase 1b（short lines）和 Phase 2a（dedup）都會遞增，但原本只回傳 Phase 1b 的 skippedShort。

影響：只有 API 回傳值錯誤，console 輸出和實際行為都正確，且現有測試沒有 assert result.skipped，所以一直沒被發現。

驗證：19/19 測試全部通過。

jlin53882 · 2026-04-30T17:17:54Z

關於刪除的 2 個目錄保護測試

在 test/import-markdown/import-markdown.test.mjs 中，以下兩個測試被移除：

skips a directory named YYYY-MM-DD.md without aborting import
skips a .md directory in flatMemoryDir without aborting import

為什麼可以移除

CLI 掃描層（scanAgentMd / flatMemoryDir）已使用 f.isFile() 過濾：

// cli.ts — mdFiles collection
if (f.isFile() && f.name.endsWith(.md)) {
  // ...
}

f.isFile() 在 Node.js 會區分「檔案」與「目錄」——.md 目錄在 readdir + withFileTypes: true 時，isFile() 會正確返回 false，根本不會被加進 mdFiles 清單。因此不需要 defensive skip logic，也不會有 EISDIR 錯誤。

結論

這兩個測試是 defensive 性質的 regression test，在 f.isFile() 已保證正確行為的前提下是冗餘的，移除是合理的。若日後有人移除 withFileTypes: true 或改用 readdirSync() 而不搭配 isFile()，這兩個 test 才需要被加回來。

- CLI: add --no-dedup negation flag; dedup now defaults to enabled - Tests: add dedup: false explicitly to all non-dedup tests so behavior is intentional and not dependent on mock return values - P3 comment: correct misleading 'parseArgs' reference to clarify the != 'false' guard exists for the test helper path (Commander.js rejects --dedup=false as unknown option)

…leanup, remove source:cli - Phase 1a: move foundFiles++ inside try block (only count successfully read files) - Phase 1b: remove duplicate foundFiles++ in error branch (Phase 1a already counted) - Phase 2a: remove skipped++ from dedup hit (skipped is Phase 1b short entries only; dedup tracked via skippedDedup) - Remove unused source:'cli' from retrieve() call

jlin53882 · 2026-04-30T19:02:54Z

目前測試，dedup/dry run 計數還有一些相關內容還在調整中，先不要review,如果有想法歡迎一起討論

jlin53882 · 2026-05-01T11:54:13Z

Closed — superseded by #735 (clean rebuild)

jlin53882 added 4 commits April 30, 2026 23:57

feat(import-markdown): batch processing pipeline — embedBatchPassage …

5fd95a2

…+ bulkStore + parallel dedup

chore: default batch-size=10, no cap; add pipeline comments

5a969c1

fix: P2 dry-run runs dedup first; P1 wrap bulkStore in try/catch

34472ca

chatgpt-codex-connector Bot reviewed Apr 30, 2026

View reviewed changes

fix(import-markdown): dry-run return value uses correct skipped count

e518afd

Before: skipped: skippedShort (excludes dedup hits from count) After: skipped: skipped (includes both short + dedup skips) Dry-run console output was correct; only the return value was wrong.

jlin53882 added 2 commits May 1, 2026 01:37

jlin53882 mentioned this pull request Apr 30, 2026

enhancement: import-markdown CLI 過程監控 — 預設顯示相關 log #715

Open

jlin53882 closed this May 1, 2026

jlin53882 mentioned this pull request May 1, 2026

fix(import-markdown): batch pipeline Phase 1+2a+2b + retrieve() dedup + PR719 log alignment (#730+#732+#719+#720) #735

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(import-markdown): batch processing pipeline + test rewrite#730

feat(import-markdown): batch processing pipeline + test rewrite#730
jlin53882 wants to merge 7 commits intoCortexReach:masterfrom
jlin53882:feat/batch-import-v2

jlin53882 commented Apr 30, 2026 •

edited

Loading

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Apr 30, 2026

Uh oh!

chatgpt-codex-connector Bot Apr 30, 2026

Uh oh!

jlin53882 commented Apr 30, 2026

Uh oh!

jlin53882 commented Apr 30, 2026

Uh oh!

jlin53882 commented Apr 30, 2026

Uh oh!

jlin53882 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		const dedupEnabled = (options.dedup as unknown as string \| undefined \| boolean) !== false
		&& (options.dedup as unknown as string \| undefined \| boolean) !== "false";

Conversation

jlin53882 commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Phase 2b batch pipeline (embedBatchPassage + bulkStore)

Bug fixes

Test rewrite

Behavior: dry-run vs real execution

--dry-run (current)

--dry-run --no-dedup (current)

Real execution (current)

Future (with Bloom Filter, not in this PR)

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

jlin53882 commented Apr 30, 2026

追加修復：dry-run 回傳值 Bug（commit e518afd）

Uh oh!

jlin53882 commented Apr 30, 2026

關於刪除的 2 個目錄保護測試

為什麼可以移除

結論

Uh oh!

jlin53882 commented Apr 30, 2026

Uh oh!

jlin53882 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jlin53882 commented Apr 30, 2026 •

edited

Loading

追加修復：dry-run 回傳值 Bug（commit `e518afd`）