Skip to content

fix(import-markdown): dedup by default + retrieve() + stats (#719+#720)#733

Closed
jlin53882 wants to merge 2 commits intoCortexReach:masterfrom
jlin53882:pr719
Closed

fix(import-markdown): dedup by default + retrieve() + stats (#719+#720)#733
jlin53882 wants to merge 2 commits intoCortexReach:masterfrom
jlin53882:pr719

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

Summary

Combines #719 (dedup default + stats) and #720 (retrieve() instead of bm25Search) into one clean PR.

Changes

Feature Detail
Dedup default → dedup ON by default, use to disable
Dedup method instead of — hybrid search + rerank pipeline
Stats tracking , , in return value and console output
CLI option Changed from to (dedup now default-ON)

Why retrieve()?

runs the full hybrid search pipeline (BM25/vector + rerank), then exact-match on text. More accurate than the old which only checked rank-1.

Testing

  • 14/14 import-markdown tests pass
  • Added to test mock for the interface

jlin53882 and others added 2 commits April 29, 2026 09:18
…ied summary

- --dedup flag inverted: now --no-dedup disables dedup (default true)
- Return type expanded: added skippedShort, skippedDedup, errorCount
- Early return with no files now returns all 6 fields (TypeError fix)
- Summary unified: single 'Memory Import Status:' format replaces dual dry-run/real modes
- Log tags: [scan], [would-import], [skip] for consistent output
- dedup skip log always shown (removed if(!dryRun) guard)
- errorCount tracked separately; error log uses [skip] error: prefix
- TotalEntries computed and reported in summary

Based on PR CortexReach#717 intent (Issue CortexReach#715), rebuilt cleanly from upstream/master
…eve() + dedup stats

- Replace ctx.store.bm25Search with ctx.retriever.retrieve() (from CortexReach#720)
- Default dedup=TRUE (--no-dedup to disable) (from CortexReach#719)
- Add skippedShort/skippedDedup/errorCount stats (from CortexReach#719)
- Add mockRetriever to test mocks for retrieve() interface

PR CortexReach#719: import-markdown stats + --no-dedup default
PR CortexReach#720: retrieve() instead of bm25Search for dedup check
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e1b15a7e98

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cli.ts
scopeFilter: [effectiveScope],
source: "cli",
});
if (results.length > 0 && results[0].entry.text === text) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Check all retrieved candidates for exact dedup match

The new dedup path only inspects results[0], but ctx.retriever.retrieve() runs hybrid fusion, reranking, and post-processing, so an exact existing entry can be present without being top-ranked; in that case this branch misses the duplicate and imports it again even though dedup is enabled by default. This silently reintroduces duplicate records for realistic inputs (e.g., a fresher semantically similar memory outranking the exact text), so the check should scan the retrieved set for any exact text match (or use a dedicated exact-match query) before importing.

Useful? React with 👍 / 👎.

@jlin53882
Copy link
Copy Markdown
Contributor Author

PR #733 已關閉 — 內容已被 PR #735 吸收

為什麼 PR #733 可以關閉

PR #733 是 PR #719(dedup stats + --no-dedup)和 PR #720retrieve() 取代 bm25Search)的合併結果。

PR #735pr730_clean branch 已經包含了 PR #719 + PR #720所有功能,並且架構更完整:

功能 PR #719 PR #720 PR #733 PR #735(最終版)
ctx.retriever.retrieve()
--no-dedup flag
skippedShort/skippedDedup/errorCount
skipped 包含 dedup hits ✅ 對齊
Phase 1 平行讀檔
Phase 2a 批次 dedup + batchSize
Phase 2b 批次 embed + bulkStore
Scan log([scan] Phase 1a/1b) ✅ 對齊
Summary Memory Import Status: 格式 ✅ 對齊
Backup dir openclaw.json 解析

重要:skipped 語意對齊

PR #733 原本有一個 skipped 計數差異(dedup hits 只進 skippedDedup,不進 skipped)。PR #735 已對齊 PR #719 的原始語意:

// PR 735 Phase 2a — dedup hit 時:
skipped++;       // PR 719 語意:dedup 命中也算進 skipped
skippedDedup++;
console.log(`[skip] dedup [${e.effectiveScope}]: ...`);

結論

PR #733 的所有內容(PR #719 + PR #720)已經完整移植到 PR #735,並且在 PR #735 中有更完整的批次架構(Phase 1/2a/2b)和對齊後的 log 格式。不需要合併 PR #733,直接 review 和 merge PR #735 即可。

@jlin53882 jlin53882 closed this May 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant