fix(import-markdown): dedup by default + retrieve() + stats (#719+#720) by jlin53882 · Pull Request #733 · CortexReach/memory-lancedb-pro

jlin53882 · 2026-05-01T10:24:09Z

Summary

Combines #719 (dedup default + stats) and #720 (retrieve() instead of bm25Search) into one clean PR.

Changes

Feature	Detail
Dedup default	→ dedup ON by default, use to disable
Dedup method	instead of — hybrid search + rerank pipeline
Stats tracking	, , in return value and console output
CLI option	Changed from to (dedup now default-ON)

Why retrieve()?

runs the full hybrid search pipeline (BM25/vector + rerank), then exact-match on text. More accurate than the old which only checked rank-1.

Testing

14/14 import-markdown tests pass
Added to test mock for the interface

…ied summary - --dedup flag inverted: now --no-dedup disables dedup (default true) - Return type expanded: added skippedShort, skippedDedup, errorCount - Early return with no files now returns all 6 fields (TypeError fix) - Summary unified: single 'Memory Import Status:' format replaces dual dry-run/real modes - Log tags: [scan], [would-import], [skip] for consistent output - dedup skip log always shown (removed if(!dryRun) guard) - errorCount tracked separately; error log uses [skip] error: prefix - TotalEntries computed and reported in summary Based on PR CortexReach#717 intent (Issue CortexReach#715), rebuilt cleanly from upstream/master

…eve() + dedup stats - Replace ctx.store.bm25Search with ctx.retriever.retrieve() (from CortexReach#720) - Default dedup=TRUE (--no-dedup to disable) (from CortexReach#719) - Add skippedShort/skippedDedup/errorCount stats (from CortexReach#719) - Add mockRetriever to test mocks for retrieve() interface PR CortexReach#719: import-markdown stats + --no-dedup default PR CortexReach#720: retrieve() instead of bm25Search for dedup check

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1b15a7e98

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-01T10:28:44Z

+            scopeFilter: [effectiveScope],
+            source: "cli",
+          });
+          if (results.length > 0 && results[0].entry.text === text) {


Check all retrieved candidates for exact dedup match

The new dedup path only inspects results[0], but ctx.retriever.retrieve() runs hybrid fusion, reranking, and post-processing, so an exact existing entry can be present without being top-ranked; in that case this branch misses the duplicate and imports it again even though dedup is enabled by default. This silently reintroduces duplicate records for realistic inputs (e.g., a fresher semantically similar memory outranking the exact text), so the check should scan the retrieved set for any exact text match (or use a dedicated exact-match query) before importing.

Useful? React with 👍 / 👎.

jlin53882 · 2026-05-01T16:40:59Z

PR #733 已關閉 — 內容已被 PR #735 吸收

為什麼 PR #733 可以關閉

PR #733 是 PR #719（dedup stats + --no-dedup）和 PR #720（retrieve() 取代 bm25Search）的合併結果。

PR #735 的 pr730_clean branch 已經包含了 PR #719 + PR #720 的所有功能，並且架構更完整：

功能	PR #719	PR #720	PR #733	PR #735（最終版）
`ctx.retriever.retrieve()`	—	✅	✅	✅
`--no-dedup` flag	✅	—	✅	✅
`skippedShort`/`skippedDedup`/`errorCount`	✅	—	✅	✅
`skipped` 包含 dedup hits	✅	—	✅	✅ 對齊
Phase 1 平行讀檔	—	—	—	✅
Phase 2a 批次 dedup + `batchSize`	—	—	—	✅
Phase 2b 批次 embed + bulkStore	—	—	—	✅
Scan log（`[scan]` Phase 1a/1b）	✅	—	✅	✅ 對齊
Summary `Memory Import Status:` 格式	✅	—	✅	✅ 對齊
Backup dir `openclaw.json` 解析	—	—	—	✅

重要：skipped 語意對齊

PR #733 原本有一個 skipped 計數差異（dedup hits 只進 skippedDedup，不進 skipped）。PR #735 已對齊 PR #719 的原始語意：

// PR 735 Phase 2a — dedup hit 時：
skipped++;       // PR 719 語意：dedup 命中也算進 skipped
skippedDedup++;
console.log(`[skip] dedup [${e.effectiveScope}]: ...`);

結論

PR #733 的所有內容（PR #719 + PR #720）已經完整移植到 PR #735，並且在 PR #735 中有更完整的批次架構（Phase 1/2a/2b）和對齊後的 log 格式。不需要合併 PR #733，直接 review 和 merge PR #735 即可。

jlin53882 and others added 2 commits April 29, 2026 09:18

chatgpt-codex-connector Bot reviewed May 1, 2026

View reviewed changes

This was referenced May 1, 2026

fix(import-markdown): dedup uses retrieve() hybrid search (fixes rank-1 gap) #720

Closed

feat(import-markdown): --no-dedup by default, new return fields, unified summary (closes #715) #719

Closed

jlin53882 closed this May 1, 2026

This was referenced May 1, 2026

enhancement: import-markdown CLI 過程監控 — 預設顯示相關 log #715

Open

fix(import-markdown): batch pipeline Phase 1+2a+2b + retrieve() dedup + PR719 log alignment (#730+#732+#719+#720) #735

Open

jlin53882 reopened this May 1, 2026

jlin53882 closed this May 1, 2026

jlin53882 reopened this May 1, 2026

jlin53882 closed this May 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(import-markdown): dedup by default + retrieve() + stats (#719+#720)#733

fix(import-markdown): dedup by default + retrieve() + stats (#719+#720)#733
jlin53882 wants to merge 2 commits intoCortexReach:masterfrom
jlin53882:pr719

jlin53882 commented May 1, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Uh oh!

jlin53882 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jlin53882 commented May 1, 2026

Summary

Changes

Why retrieve()?

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

jlin53882 commented May 1, 2026

PR #733 已關閉 — 內容已被 PR #735 吸收

為什麼 PR #733 可以關閉

重要：skipped 語意對齊

結論

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant