Skip to content

fix(regex-fallback): #675 bulkStore + batch/fallback dedup + mdMirror isolation + metadata failure isolation#723

Open
jlin53882 wants to merge 9 commits intoCortexReach:masterfrom
jlin53882:fix/regex-fallback-batch-dedup-clean
Open

fix(regex-fallback): #675 bulkStore + batch/fallback dedup + mdMirror isolation + metadata failure isolation#723
jlin53882 wants to merge 9 commits intoCortexReach:masterfrom
jlin53882:fix/regex-fallback-batch-dedup-clean

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

@jlin53882 jlin53882 commented Apr 29, 2026

Summary

Fixes Bug #3 — a regression introduced by PR #678 / Issue #675.

Problem

When the regex-fallback path captures multiple texts in a single batch and none exist in the database yet, the per-entry dedup pre-check passes for all of them. This allows near-duplicate texts (e.g. two reformulations of the same fact) to both be written to the database.

Fix

Before pushing each entry to capturedEntries, compare the new embedding vector against all entries already accumulated in this batch using cosine similarity. Skip the entry if cosine > 0.90 (strict) with any prior entry.

Additional fixes discovered during adversarial code review:

  1. Fallback DB dedup: When bulkStore fails and the fallback path writes entries individually, DB dedup is now re-applied per entry (prevents near-duplicates from bypassing the check via fallback).
  2. Metadata failure isolation: stringifySmartMetadata/buildSmartMetadata is wrapped in try-catch; if it throws mid-loop, the entry is skipped without corrupting capturedEntries state.

Changes

  • index.ts: ~40 lines added to the regex-fallback bulkStore loop
  • test/regex-fallback-bulk-store.test.mjs: 16 test cases (was 7)
  • package.json: registered test in npm test script

Test Coverage (16 tests, all passing)

# Test Case Coverage缺口
1 OLD pattern: N texts = N store.store() calls (confirmed buggy) Baseline: old behavior
2 NEW pattern: N texts = 1 bulkStore() call (fixed) bulkStore batching
3 Single text: bulkStore called once (not store.store()) Edge case: single entry
4 Empty texts: no store or bulkStore called Edge case: empty input
5 Dedup skips dup-text, remaining batched in bulkStore DB dedup pre-check
6 Real MemoryStore: NEW pattern uses fewer locks (1 vs N) Performance verification
7 Batch-internal dedup: second near-duplicate skipped within same batch Core fix (Bug #3)
8 Fallback: bulkStore failure triggers individual store.store() calls Fallback path activation
9 Fallback: DB dedup skips entries already in DB during fallback loop Fallback DB dedup
10 Fallback: zero capturedEntries (all skipped by DB dedup) - no fallback Edge case: empty capturedEntries
11 Metadata failure: entry skipped, loop continues to next text Metadata failure isolation
12 Metadata failure: does not corrupt capturedEntries state State integrity
13 Batch+fallback: batch dedup first, then fallback dedup re-checks DB Two-layer dedup interaction
14 Batch dedup: different vector dimensions - cosine check skipped, entry still added Graceful dim mismatch handling
15 Cosine threshold: exactly 0.90 - NOT considered duplicate (strict >) Boundary condition
16 Cosine threshold: 0.91 - IS considered duplicate Boundary condition

Tests 8-16 are new in this PR revision, addressing gaps identified during adversarial code review.

Test results

tests 16

pass 16

fail 0

duration_ms ~4600

Run via: node --test test/regex-fallback-bulk-store.test.mjs
Or: npm test (includes this test in the full suite)

Related: #675, #678

…tor entries

Bug #3 (regression from PR CortexReach#678 / Issue CortexReach#675 fix):
When the regex-fallback path captures multiple texts in a single batch
and none of them exist in the database yet, the per-entry dedup pre-check
passes for all of them.  This allows near-duplicate texts (e.g. two
different reformulations of the same fact) to both be written.

Fix: before pushing to capturedEntries, compare the new embedding vector
against all entries already accumulated in this batch using cosine
similarity.  Skip the entry if dot-product > 0.90 with any prior entry.

Signed-off-by: James
Signed-off-by: James <james@example.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2bde62de2b

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread index.ts Outdated
Comment on lines +3018 to +3020
);

);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P0 Badge Remove orphan ); statements to restore valid syntax

These two standalone ); tokens make index.ts unparsable, so the plugin fails before runtime logic executes. Running tsc --noEmit index.ts reports TS1128 at these lines, and parsing then cascades into further syntax errors. This is a release-blocking regression because it occurs whenever the module is loaded.

Useful? React with 👍 / 👎.

Comment thread index.ts Outdated
Comment on lines +3064 to +3066
let dot = 0;
for (let i = 0; i < vector.length; i++) dot += prev.vector[i] * vector[i];
if (dot > 0.90) { duplicateInBatch = true; break; }
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Use cosine similarity for in-batch dedup threshold

The new in-batch dedup compares vectors with a raw dot product (dot > 0.90), but the surrounding dedup path uses cosine-based scores (vectorSearch(...).score > 0.90). Dot product only matches cosine when vectors are unit-normalized; otherwise, norm differences can cause false duplicate drops or missed duplicates. This can corrupt capture behavior for providers/configs that do not guarantee normalized embeddings.

Useful? React with 👍 / 👎.

…exReach#723 review fixes)

P0: Remove three orphaned ');' tokens introduced during bulkStore refactor
    that caused TS1128 (unparsable module syntax).

P1: Replace raw dot-product threshold with cosine similarity for batch-internal
    dedup.  The DB dedup path uses vectorSearch().score (cosine), so the
    in-batch dedup must also use cosine = dot/(|a||b|) to be consistent
    across providers/configs that don't guarantee unit-normalized embeddings.

    - Compute L2 norms of both vectors
    - Fall back to dot if either norm is zero (guard against zero vectors)
    - Keep threshold at 0.90 (same as DB dedup)
…ch#723 review)

Align test with the same cosine-similarity fix applied to index.ts:
- Replace raw dot-product threshold with explicit cosine = dot/(|a||b|)
- Guard against zero-norm vectors (fall back to dot)
- Update comments to reflect P1 fix rationale
… review

1. Fallback path now re-applies DB dedup before each store.store() call.
   Without this, a bulkStore failure would bypass the vectorSearch pre-check
   and write all capturedEntries (including near-duplicates) individually.

2. metadata construction wrapped in try-catch.
   If stringifySmartMetadata/buildSmartMetadata throws mid-loop,
   the exception no longer propagates and corrupts capturedEntries.
   The entry is skipped with a log warning instead.
@jlin53882 jlin53882 changed the title fix(regex-fallback): batch-internal dedup prevents near-duplicate vector entries fix(regex-fallback): batch-internal dedup + fallback dedup + metadata failure isolation Apr 29, 2026
@jlin53882
Copy link
Copy Markdown
Contributor Author

本地測試結果 ✅

測試方式: 手動觸發 regex fallback(說出包含 trigger keyword 的句子)

Log 截錄:

20:52:48 [plugins] memory-pro: smart-extractor: extracted 1 candidate(s)
20:53:02 [plugins] memory-pro: smart-extractor: skipped [cases] 升級程式死鎖...
20:53:02 [plugins] memory-lancedb-pro: smart extraction produced no persisted memories (created=0, merged=0, skipped=1); falling back to regex capture
20:53:02 [plugins] memory-lancedb-pro: regex fallback found 1 capturable text(s) for agent main
20:53:03 [plugins] memory-lancedb-pro: auto-captured 1 memories for agent main in scope agent:main (bulkStore)

結果:

環境: Windows native, Ollama CPU mode, extension master (b298adb + PR #723 patch)
PR #723 測試通過。

…ith production fallback (0.9 → 0.1)

Fallback path in production (index.ts:3139) uses 0.1 as the fail-open
pre-filter minScore. The NEW pattern test's DB dedup pre-check was
using 0.9, creating an inconsistency — fixing to 0.1.

OLD pattern test threshold (0.9) left unchanged; it intentionally
simulates the buggy old behavior and is not part of this fix.
@jlin53882 jlin53882 changed the title fix(regex-fallback): batch-internal dedup + fallback dedup + metadata failure isolation fix(regex-fallback): #675 bulkStore + batch/fallback dedup + mdMirror isolation + metadata failure isolation May 1, 2026
@jlin53882
Copy link
Copy Markdown
Contributor Author

D7 Test Fix Applied: 0.90.1 alignment

Commit: 962ead6 (fix(regex-fallback): align NEW pattern DB dedup pre-check threshold with production fallback (0.9 → 0.1))

What was fixed

In test/regex-fallback-bulk-store.test.mjs, the NEW pattern DB dedup pre-check was using 0.9 as the minScore parameter:

// Before (inconsistent)
try { existing = await store.vectorSearch(vector, 1, 0.9, [scope]); } catch { /* fail-open */ }

// After (aligned with production)
try { existing = await store.vectorSearch(vector, 1, 0.1, [scope]); } catch { /* fail-open */ }

Why

The production fallback path in index.ts:3139 uses 0.1 as the fail-open pre-filter minScore. The NEW pattern test's DB dedup pre-check was using 0.9, creating an inconsistency between test and production.

What was NOT changed

The OLD pattern test threshold (0.9 at line 76) is intentionally left unchanged — it simulates the buggy old behavior and serves as the baseline "confirmed buggy" reference. This is the correct test design.

PR title updated

Title changed from:

fix(regex-fallback): batch-internal dedup + fallback dedup + metadata failure isolation

To:

fix(regex-fallback): #675 bulkStore + batch/fallback dedup + mdMirror isolation + metadata failure isolation

Reflecting all four distinct fixes in this PR (#675 bulkStore, #Bug-1 mdMirror isolation, batch-internal dedup, metadata failure isolation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant