fix(import-markdown): dedup uses retrieve() hybrid search (fixes rank-1 gap) #720

Closed
jlin53882 wants to merge 1 commit into CortexReach:master from jlin53882:fix/import-markdown-dedup-bug

@jlin53882
Contributor

Problem

The dedup check in import-markdown used ctx.store.bm25Search(text, 5) and only compared against the top result (existing[0]). When an identical text existed in the same scope but was ranked 2–5 by BM25, it was not detected — causing duplicate imports.
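
The gap can be reproduced in isolation. The following is a minimal sketch, not code from the repository: the `SearchHit` shape, the `isDuplicateRank1` helper, and the stubbed scores are all assumptions made for illustration. It shows the old rank-1 comparison missing an exact duplicate that sits at rank 2.

```typescript
// Hypothetical shapes for illustration; the real store types may differ.
interface SearchHit {
  entry: { text: string };
  score: number;
}

// Mirrors the old check: only the top-ranked hit is compared.
function isDuplicateRank1(hits: SearchHit[], text: string): boolean {
  return hits.length > 0 && hits[0]?.entry.text === text;
}

// Stubbed search output (made-up scores): the exact duplicate exists
// in the result set but is ranked second.
const incomingText = "Install the CLI with npm";
const stubbedHits: SearchHit[] = [
  { entry: { text: "Install the CLI with npm or yarn" }, score: 4.2 },
  { entry: { text: "Install the CLI with npm" }, score: 4.1 },
];

// The duplicate at rank 2 is not detected, so the import would proceed.
const detectedByRank1 = isDuplicateRank1(stubbedHits, incomingText);
console.log(detectedByRank1); // false
```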

Fix

Replace bm25Search with ctx.retriever.retrieve() using limit: 20, then exact-match against results[0].entry.text. This runs the full hybrid pipeline (BM25 + vector + reranker), the same one used for normal retrieval, over 20 candidates before exact-matching the top result.

Before (Bug):

const existing = await ctx.store.bm25Search(text, 5, [effectiveScope]);
if (existing.length > 0 && existing[0].entry.text === text) { // only checked rank 1

After (Fix):

const results = await ctx.retriever.retrieve({
  query: text,
  limit: 20,
  scopeFilter: [effectiveScope],
  source: "cli",
});
if (results.length > 0 && results[0].entry.text === text) { // checks top of hybrid pipeline
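
Note that the check above still compares only results[0], so it relies on the hybrid pipeline ranking an exact duplicate first. A rank-independent variant could scan every returned candidate instead. The following is a minimal sketch under that assumption, not part of this PR: the `RetrievalResult` interface is modelled only on the `results[0].entry.text` access shown above, and `hasExactDuplicate` is a hypothetical helper.

```typescript
// Hypothetical result shape, modelled on the `results[0].entry.text` access.
interface RetrievalResult {
  entry: { text: string };
}

// Returns true if any candidate, at any rank, has exactly the same text.
function hasExactDuplicate(results: RetrievalResult[], text: string): boolean {
  return results.some((r) => r.entry.text === text);
}

// Stubbed candidates: the exact match sits at rank 2.
const candidates: RetrievalResult[] = [
  { entry: { text: "Release notes for v2" } },
  { entry: { text: "Release notes for v1" } },
];

const foundAtRank2 = hasExactDuplicate(candidates, "Release notes for v1");
const foundMissing = hasExactDuplicate(candidates, "Release notes for v3");
console.log(foundAtRank2, foundMissing); // true false
```

With this shape, the result of the dedup check no longer depends on how the retriever orders the candidates, only on whether the duplicate is among them.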

Scope

  • Single file changed: cli.ts
  • +10 / -3 lines
  • No changes to return type or CLI interface
  • No new tests required (exact-match logic unchanged, just more thorough search)

Refs

…bm25Search rank-1 only

Replaces ctx.store.bm25Search(text, 5) with ctx.retriever.retrieve() for the
dedup check in import-markdown. The old approach only checked the top BM25
result, missing duplicates when the true match was ranked 2-5. The new
approach runs the full hybrid pipeline (BM25 + vector + rerank) against 20
candidates, then exact-matches the top result — the same pipeline used for
normal retrieval, ensuring consistent dedup coverage.

Fixes: deduplication gap in import-markdown where identical texts in
scope were not detected when the BM25 rank of the true duplicate was > 1.

Refs: Issue CortexReach#715

@jlin53882
Contributor Author

Closed — superseded by #733 (combines #719 + #720)
