feat(extract): Phase 1 — extraction write validation with count-before/after (issue #693) by jlin53882 · Pull Request #747 · CortexReach/memory-lancedb-pro

jlin53882 · 2026-05-04T16:54:19Z

Issue #693 — Phase 1: Extraction Write Validation

Problem

After extractAndPersist() calls bulkStore(), there is no verification that writes actually succeeded. A SIGKILL/OOM during table.add() leaves silent data loss.

Solution (Phase 1)

Count-before/after verification after bulkStore() with an optional callback for mismatch notification.

Changes

File	Change
`src/memory-categories.ts`	`ExtractionValidation` type + `OnExtractionValidationFailed` callback interface
`src/smart-extractor.ts`	`countBefore`/`countAfter` validation after `bulkStore()`, callback invoked on mismatch
`test/extraction-validation.test.mjs`	6 scenarios (T1–T6), all passing
`test/dedup-false-alarm.test.mjs`	P0 regression — `bulkStore` does NOT deduplicate
`scripts/ci-test-manifest.mjs`	register both test files

Phase 2 (not included)

UUID-list-based concurrent-safe validation
Dead-letter queue for permanent mismatch alerting
Production alert integration

Note

Count-diff validation assumes single-process extraction (guaranteed by plugin architecture). Concurrent writes can inflate actualCreated — acceptable for Phase 1, will be addressed in Phase 2.

…issue CortexReach#692) Adds tree-sitter-based chunking to prevent splitting code mid-function. Changes: - detectCodeLanguage(): identify JS/TS/Python/Go/Rust from code content - astChunk(): split code at declaration boundaries (function/class/method) - smartChunk(): route through astChunk when file is detected as code - 20 unit tests covering all destructive split scenarios Test: node --test test/ast-code-chunking.test.mjs (20/20 pass) Config: astAwareCodeSplit defaults to true

…e/after Issue CortexReach#693 Phase 1 core implementation. Changes: - memory-categories.ts: add ExtractionValidation type and OnExtractionValidationFailed callback interface - smart-extractor.ts: add countBefore/countAfter validation after bulkStore, invoke optional callback on mismatch (extractionStats unchanged — private) - New test: extraction-validation.test.mjs (6 scenarios: T1-T6) - New test: dedup-false-alarm.test.mjs (P0 regression: bulkStore does NOT dedup) - CI manifest: register both new test files Note: count-diff validation assumes single-process extraction (plugin architecture guarantees this). Concurrent writes can inflate actualCreated and cause false negatives — acceptable for Phase 1, Phase 2 will add UUID-list-based concurrent-safe validation.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 601a1b415a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-04T16:58:48Z

+    pattern: /\b(def\s|class\s|import\s|from\s|async\s+def\s|print\()/,
+    lang: 'python',


Prioritize JS/TS imports over Python keyword matching

The Python detector currently matches import/from before the JavaScript pattern, so common JS/TS modules like import x from 'y' are classified as Python. In that path astChunk attempts a Python parse and usually falls back to chunkDocument, which silently disables the AST chunking fix for many real JS/TS files and reintroduces mid-function splits.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-04T16:58:48Z

+    const fullText = pendingNonDecl.length > 0 ? pendingNonDecl + '\n' + text : text;
+    pendingNonDecl = ''; // reset
+
+    if (fullText.length <= config.maxChunkSize) {
+      chunks.push(fullText);
+      metadatas.push({
+        startIndex: child.startIndex,
+        endIndex: child.endIndex,
+        length: fullText.length,


Preserve source offsets when prepending non-declaration text

When pendingNonDecl is prepended to a declaration, the emitted chunk contains text that starts earlier than child.startIndex, but metadata still reports startIndex: child.startIndex/endIndex: child.endIndex. This produces incorrect offsets for returned chunks, so any consumer that maps chunk metadata back to original source positions will get wrong spans.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-04T16:58:48Z

+        options.onExtractionValidationFailed?.({
+          expected: expectedCreated,
+          actual: actualCreated,
+          mismatch,
+          sessionKey,
+        });


Guard validation callback failures from aborting extraction

The mismatch callback is invoked directly without isolation, so if onExtractionValidationFailed throws (or an async callback rejects), extractAndPersist rejects even though bulkStore already completed. This turns an optional observability hook into a write-path failure mode and can cause callers to treat successful writes as failed extractions.

Useful? React with 👍 / 👎.

Missing mock method was causing: this.store.count is not a function (TypeError) at smart-extractor.ts:470 Affected tests: - smart-extractor-scope-filter.test.mjs - preference-slots.test.mjs (2 mock stores) - smart-extractor-batch-embed.test.mjs - smart-extractor-bulk-store.test.mjs - smart-extractor-bulk-store-edge-cases.test.mjs

jlin53882 added 2 commits May 5, 2026 00:33

jlin53882 mentioned this pull request May 4, 2026

fix: add LanceDB row-count validation after extraction to prevent poison state #693

Open

chatgpt-codex-connector Bot reviewed May 4, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(extract): Phase 1 — extraction write validation with count-before/after (issue #693)#747

feat(extract): Phase 1 — extraction write validation with count-before/after (issue #693)#747
jlin53882 wants to merge 3 commits intoCortexReach:masterfrom
jlin53882:issue/693-validation

jlin53882 commented May 4, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		pattern: /\b(def\s\|class\s\|import\s\|from\s\|async\s+def\s\|print\()/,
		lang: 'python',

Conversation

jlin53882 commented May 4, 2026

Issue #693 — Phase 1: Extraction Write Validation

Problem

Solution (Phase 1)

Changes

Phase 2 (not included)

Note

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant