fix(import-markdown): dedup by default + retrieve() + stats (#719+#720)#733
fix(import-markdown): dedup by default + retrieve() + stats (#719+#720)#733jlin53882 wants to merge 2 commits intoCortexReach:masterfrom
Conversation
…ied summary - --dedup flag inverted: now --no-dedup disables dedup (default true) - Return type expanded: added skippedShort, skippedDedup, errorCount - Early return with no files now returns all 6 fields (TypeError fix) - Summary unified: single 'Memory Import Status:' format replaces dual dry-run/real modes - Log tags: [scan], [would-import], [skip] for consistent output - dedup skip log always shown (removed if(!dryRun) guard) - errorCount tracked separately; error log uses [skip] error: prefix - TotalEntries computed and reported in summary Based on PR CortexReach#717 intent (Issue CortexReach#715), rebuilt cleanly from upstream/master
…eve() + dedup stats - Replace ctx.store.bm25Search with ctx.retriever.retrieve() (from CortexReach#720) - Default dedup=TRUE (--no-dedup to disable) (from CortexReach#719) - Add skippedShort/skippedDedup/errorCount stats (from CortexReach#719) - Add mockRetriever to test mocks for retrieve() interface PR CortexReach#719: import-markdown stats + --no-dedup default PR CortexReach#720: retrieve() instead of bm25Search for dedup check
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e1b15a7e98
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| scopeFilter: [effectiveScope], | ||
| source: "cli", | ||
| }); | ||
| if (results.length > 0 && results[0].entry.text === text) { |
There was a problem hiding this comment.
Check all retrieved candidates for exact dedup match
The new dedup path only inspects results[0], but ctx.retriever.retrieve() runs hybrid fusion, reranking, and post-processing, so an exact existing entry can be present without being top-ranked; in that case this branch misses the duplicate and imports it again even though dedup is enabled by default. This silently reintroduces duplicate records for realistic inputs (e.g., a fresher semantically similar memory outranking the exact text), so the check should scan the retrieved set for any exact text match (or use a dedicated exact-match query) before importing.
Useful? React with 👍 / 👎.
PR #733 已關閉 — 內容已被 PR #735 吸收為什麼 PR #733 可以關閉PR #733 是 PR #719(dedup stats + PR #735 的
重要:skipped 語意對齊PR #733 原本有一個 // PR 735 Phase 2a — dedup hit 時:
skipped++; // PR 719 語意:dedup 命中也算進 skipped
skippedDedup++;
console.log(`[skip] dedup [${e.effectiveScope}]: ...`);結論PR #733 的所有內容(PR #719 + PR #720)已經完整移植到 PR #735,並且在 PR #735 中有更完整的批次架構(Phase 1/2a/2b)和對齊後的 log 格式。不需要合併 PR #733,直接 review 和 merge PR #735 即可。 |
Summary
Combines #719 (dedup default + stats) and #720 (retrieve() instead of bm25Search) into one clean PR.
Changes
Why retrieve()?
runs the full hybrid search pipeline (BM25/vector + rerank), then exact-match on text. More accurate than the old which only checked rank-1.
Testing