feat(extract): Phase 1 — extraction write validation with count-before/after (issue #693)#747
feat(extract): Phase 1 — extraction write validation with count-before/after (issue #693)#747jlin53882 wants to merge 3 commits intoCortexReach:masterfrom
Conversation
…issue CortexReach#692) Adds tree-sitter-based chunking to prevent splitting code mid-function. Changes: - detectCodeLanguage(): identify JS/TS/Python/Go/Rust from code content - astChunk(): split code at declaration boundaries (function/class/method) - smartChunk(): route through astChunk when file is detected as code - 20 unit tests covering all destructive split scenarios Test: node --test test/ast-code-chunking.test.mjs (20/20 pass) Config: astAwareCodeSplit defaults to true
…e/after Issue CortexReach#693 Phase 1 core implementation. Changes: - memory-categories.ts: add ExtractionValidation type and OnExtractionValidationFailed callback interface - smart-extractor.ts: add countBefore/countAfter validation after bulkStore, invoke optional callback on mismatch (extractionStats unchanged — private) - New test: extraction-validation.test.mjs (6 scenarios: T1-T6) - New test: dedup-false-alarm.test.mjs (P0 regression: bulkStore does NOT dedup) - CI manifest: register both new test files Note: count-diff validation assumes single-process extraction (plugin architecture guarantees this). Concurrent writes can inflate actualCreated and cause false negatives — acceptable for Phase 1, Phase 2 will add UUID-list-based concurrent-safe validation.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 601a1b415a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| pattern: /\b(def\s|class\s|import\s|from\s|async\s+def\s|print\()/, | ||
| lang: 'python', |
There was a problem hiding this comment.
Prioritize JS/TS imports over Python keyword matching
The Python detector currently matches import/from before the JavaScript pattern, so common JS/TS modules like import x from 'y' are classified as Python. In that path astChunk attempts a Python parse and usually falls back to chunkDocument, which silently disables the AST chunking fix for many real JS/TS files and reintroduces mid-function splits.
Useful? React with 👍 / 👎.
| const fullText = pendingNonDecl.length > 0 ? pendingNonDecl + '\n' + text : text; | ||
| pendingNonDecl = ''; // reset | ||
|
|
||
| if (fullText.length <= config.maxChunkSize) { | ||
| chunks.push(fullText); | ||
| metadatas.push({ | ||
| startIndex: child.startIndex, | ||
| endIndex: child.endIndex, | ||
| length: fullText.length, |
There was a problem hiding this comment.
Preserve source offsets when prepending non-declaration text
When pendingNonDecl is prepended to a declaration, the emitted chunk contains text that starts earlier than child.startIndex, but metadata still reports startIndex: child.startIndex/endIndex: child.endIndex. This produces incorrect offsets for returned chunks, so any consumer that maps chunk metadata back to original source positions will get wrong spans.
Useful? React with 👍 / 👎.
| options.onExtractionValidationFailed?.({ | ||
| expected: expectedCreated, | ||
| actual: actualCreated, | ||
| mismatch, | ||
| sessionKey, | ||
| }); |
There was a problem hiding this comment.
Guard validation callback failures from aborting extraction
The mismatch callback is invoked directly without isolation, so if onExtractionValidationFailed throws (or an async callback rejects), extractAndPersist rejects even though bulkStore already completed. This turns an optional observability hook into a write-path failure mode and can cause callers to treat successful writes as failed extractions.
Useful? React with 👍 / 👎.
Missing mock method was causing: this.store.count is not a function (TypeError) at smart-extractor.ts:470 Affected tests: - smart-extractor-scope-filter.test.mjs - preference-slots.test.mjs (2 mock stores) - smart-extractor-batch-embed.test.mjs - smart-extractor-bulk-store.test.mjs - smart-extractor-bulk-store-edge-cases.test.mjs
Issue #693 — Phase 1: Extraction Write Validation
Problem
After
extractAndPersist()callsbulkStore(), there is no verification that writes actually succeeded. A SIGKILL/OOM duringtable.add()leaves silent data loss.Solution (Phase 1)
Count-before/after verification after
bulkStore()with an optional callback for mismatch notification.Changes
src/memory-categories.tsExtractionValidationtype +OnExtractionValidationFailedcallback interfacesrc/smart-extractor.tscountBefore/countAftervalidation afterbulkStore(), callback invoked on mismatchtest/extraction-validation.test.mjstest/dedup-false-alarm.test.mjsbulkStoredoes NOT deduplicatescripts/ci-test-manifest.mjsPhase 2 (not included)
Note
Count-diff validation assumes single-process extraction (guaranteed by plugin architecture). Concurrent writes can inflate
actualCreated— acceptable for Phase 1, will be addressed in Phase 2.