Skip to content

feat(extract): Phase 1 — extraction write validation with count-before/after (issue #693)#747

Open
jlin53882 wants to merge 3 commits intoCortexReach:masterfrom
jlin53882:issue/693-validation
Open

feat(extract): Phase 1 — extraction write validation with count-before/after (issue #693)#747
jlin53882 wants to merge 3 commits intoCortexReach:masterfrom
jlin53882:issue/693-validation

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

Issue #693 — Phase 1: Extraction Write Validation

Problem

After extractAndPersist() calls bulkStore(), there is no verification that writes actually succeeded. A SIGKILL/OOM during table.add() leaves silent data loss.

Solution (Phase 1)

Count-before/after verification after bulkStore() with an optional callback for mismatch notification.

Changes

File Change
src/memory-categories.ts ExtractionValidation type + OnExtractionValidationFailed callback interface
src/smart-extractor.ts countBefore/countAfter validation after bulkStore(), callback invoked on mismatch
test/extraction-validation.test.mjs 6 scenarios (T1–T6), all passing
test/dedup-false-alarm.test.mjs P0 regression — bulkStore does NOT deduplicate
scripts/ci-test-manifest.mjs register both test files

Phase 2 (not included)

  • UUID-list-based concurrent-safe validation
  • Dead-letter queue for permanent mismatch alerting
  • Production alert integration

Note

Count-diff validation assumes single-process extraction (guaranteed by plugin architecture). Concurrent writes can inflate actualCreated — acceptable for Phase 1, will be addressed in Phase 2.

jlin53882 added 2 commits May 5, 2026 00:33
…issue CortexReach#692)

Adds tree-sitter-based chunking to prevent splitting code mid-function.

Changes:
- detectCodeLanguage(): identify JS/TS/Python/Go/Rust from code content
- astChunk(): split code at declaration boundaries (function/class/method)
- smartChunk(): route through astChunk when file is detected as code
- 20 unit tests covering all destructive split scenarios

Test: node --test test/ast-code-chunking.test.mjs (20/20 pass)
Config: astAwareCodeSplit defaults to true
…e/after

Issue CortexReach#693 Phase 1 core implementation.

Changes:
- memory-categories.ts: add ExtractionValidation type and
  OnExtractionValidationFailed callback interface
- smart-extractor.ts: add countBefore/countAfter validation after bulkStore,
  invoke optional callback on mismatch (extractionStats unchanged — private)
- New test: extraction-validation.test.mjs (6 scenarios: T1-T6)
- New test: dedup-false-alarm.test.mjs (P0 regression: bulkStore does NOT dedup)
- CI manifest: register both new test files

Note: count-diff validation assumes single-process extraction (plugin
architecture guarantees this). Concurrent writes can inflate actualCreated
and cause false negatives — acceptable for Phase 1, Phase 2 will add
UUID-list-based concurrent-safe validation.
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 601a1b415a

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/chunker.ts
Comment on lines +202 to +203
pattern: /\b(def\s|class\s|import\s|from\s|async\s+def\s|print\()/,
lang: 'python',
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Prioritize JS/TS imports over Python keyword matching

The Python detector currently matches import/from before the JavaScript pattern, so common JS/TS modules like import x from 'y' are classified as Python. In that path astChunk attempts a Python parse and usually falls back to chunkDocument, which silently disables the AST chunking fix for many real JS/TS files and reintroduces mid-function splits.

Useful? React with 👍 / 👎.

Comment thread src/chunker.ts
Comment on lines +377 to +385
const fullText = pendingNonDecl.length > 0 ? pendingNonDecl + '\n' + text : text;
pendingNonDecl = ''; // reset

if (fullText.length <= config.maxChunkSize) {
chunks.push(fullText);
metadatas.push({
startIndex: child.startIndex,
endIndex: child.endIndex,
length: fullText.length,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve source offsets when prepending non-declaration text

When pendingNonDecl is prepended to a declaration, the emitted chunk contains text that starts earlier than child.startIndex, but metadata still reports startIndex: child.startIndex/endIndex: child.endIndex. This produces incorrect offsets for returned chunks, so any consumer that maps chunk metadata back to original source positions will get wrong spans.

Useful? React with 👍 / 👎.

Comment thread src/smart-extractor.ts
Comment on lines +478 to +483
options.onExtractionValidationFailed?.({
expected: expectedCreated,
actual: actualCreated,
mismatch,
sessionKey,
});
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Guard validation callback failures from aborting extraction

The mismatch callback is invoked directly without isolation, so if onExtractionValidationFailed throws (or an async callback rejects), extractAndPersist rejects even though bulkStore already completed. This turns an optional observability hook into a write-path failure mode and can cause callers to treat successful writes as failed extractions.

Useful? React with 👍 / 👎.

Missing mock method was causing:
  this.store.count is not a function (TypeError)
  at smart-extractor.ts:470

Affected tests:
- smart-extractor-scope-filter.test.mjs
- preference-slots.test.mjs (2 mock stores)
- smart-extractor-batch-embed.test.mjs
- smart-extractor-bulk-store.test.mjs
- smart-extractor-bulk-store-edge-cases.test.mjs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant