fix(regex-fallback): #675 bulkStore + batch/fallback dedup + mdMirror isolation + metadata failure isolation#723
Conversation
…tor entries Bug #3 (regression from PR CortexReach#678 / Issue CortexReach#675 fix): When the regex-fallback path captures multiple texts in a single batch and none of them exist in the database yet, the per-entry dedup pre-check passes for all of them. This allows near-duplicate texts (e.g. two different reformulations of the same fact) to both be written. Fix: before pushing to capturedEntries, compare the new embedding vector against all entries already accumulated in this batch using cosine similarity. Skip the entry if dot-product > 0.90 with any prior entry. Signed-off-by: James Signed-off-by: James <james@example.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2bde62de2b
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| ); | ||
|
|
||
| ); |
There was a problem hiding this comment.
Remove orphan
); statements to restore valid syntax
These two standalone ); tokens make index.ts unparsable, so the plugin fails before runtime logic executes. Running tsc --noEmit index.ts reports TS1128 at these lines, and parsing then cascades into further syntax errors. This is a release-blocking regression because it occurs whenever the module is loaded.
Useful? React with 👍 / 👎.
| let dot = 0; | ||
| for (let i = 0; i < vector.length; i++) dot += prev.vector[i] * vector[i]; | ||
| if (dot > 0.90) { duplicateInBatch = true; break; } |
There was a problem hiding this comment.
Use cosine similarity for in-batch dedup threshold
The new in-batch dedup compares vectors with a raw dot product (dot > 0.90), but the surrounding dedup path uses cosine-based scores (vectorSearch(...).score > 0.90). Dot product only matches cosine when vectors are unit-normalized; otherwise, norm differences can cause false duplicate drops or missed duplicates. This can corrupt capture behavior for providers/configs that do not guarantee normalized embeddings.
Useful? React with 👍 / 👎.
…exReach#723 review fixes) P0: Remove three orphaned ');' tokens introduced during bulkStore refactor that caused TS1128 (unparsable module syntax). P1: Replace raw dot-product threshold with cosine similarity for batch-internal dedup. The DB dedup path uses vectorSearch().score (cosine), so the in-batch dedup must also use cosine = dot/(|a||b|) to be consistent across providers/configs that don't guarantee unit-normalized embeddings. - Compute L2 norms of both vectors - Fall back to dot if either norm is zero (guard against zero vectors) - Keep threshold at 0.90 (same as DB dedup)
…ch#723 review) Align test with the same cosine-similarity fix applied to index.ts: - Replace raw dot-product threshold with explicit cosine = dot/(|a||b|) - Guard against zero-norm vectors (fall back to dot) - Update comments to reflect P1 fix rationale
… review 1. Fallback path now re-applies DB dedup before each store.store() call. Without this, a bulkStore failure would bypass the vectorSearch pre-check and write all capturedEntries (including near-duplicates) individually. 2. metadata construction wrapped in try-catch. If stringifySmartMetadata/buildSmartMetadata throws mid-loop, the exception no longer propagates and corrupts capturedEntries. The entry is skipped with a log warning instead.
…h+fallback interaction, cosine boundary (PR CortexReach#723)
本地測試結果 ✅測試方式: 手動觸發 regex fallback(說出包含 trigger keyword 的句子) Log 截錄: 結果:
環境: Windows native, Ollama CPU mode, extension master (b298adb + PR #723 patch) |
…ith production fallback (0.9 → 0.1) Fallback path in production (index.ts:3139) uses 0.1 as the fail-open pre-filter minScore. The NEW pattern test's DB dedup pre-check was using 0.9, creating an inconsistency — fixing to 0.1. OLD pattern test threshold (0.9) left unchanged; it intentionally simulates the buggy old behavior and is not part of this fix.
D7 Test Fix Applied:
|
Summary
Fixes Bug #3 — a regression introduced by PR #678 / Issue #675.
Problem
When the regex-fallback path captures multiple texts in a single batch and none exist in the database yet, the per-entry dedup pre-check passes for all of them. This allows near-duplicate texts (e.g. two reformulations of the same fact) to both be written to the database.
Fix
Before pushing each entry to capturedEntries, compare the new embedding vector against all entries already accumulated in this batch using cosine similarity. Skip the entry if cosine > 0.90 (strict) with any prior entry.
Additional fixes discovered during adversarial code review:
Changes
Test Coverage (16 tests, all passing)
Tests 8-16 are new in this PR revision, addressing gaps identified during adversarial code review.
Test results
tests 16
pass 16
fail 0
duration_ms ~4600
Run via: node --test test/regex-fallback-bulk-store.test.mjs
Or: npm test (includes this test in the full suite)
Related: #675, #678