feat(import-markdown): batch processing pipeline + test rewrite #730
jlin53882 wants to merge 7 commits into CortexReach:master
…+ bulkStore + parallel dedup
… James workspace data
- Fix bug where parseArgs converts --dedup=false to the string 'false' (truthy)
- Rewrite all import-markdown tests against real cli.ts behavior
- Add tests for the batch pipeline, bulkStore failure, scope inference, and return fields
- Use real bullet content from James's workspace (all > minTextLength=5)
- Remove tests for non-existent directory-filter behavior
- 19 tests pass, 0 fail
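The batch pipeline named in this commit can be pictured roughly as follows. This is a minimal sketch, not the code in cli.ts: `embedBatchPassage` and `bulkStore` are names taken from the PR, but their signatures, the `Embedder` and `Store` interfaces, `importInBatches`, the default batch size, and the "a failed batch does not abort the run" policy are all assumptions made for illustration.

```typescript
// Hypothetical interfaces: only the two method names come from the PR.
interface Embedder {
  embedBatchPassage(texts: string[]): Promise<number[][]>;
}

interface StoreEntry {
  text: string;
  vector: number[];
}

interface Store {
  bulkStore(entries: StoreEntry[]): Promise<void>;
}

interface BatchResult {
  imported: number;
  errorCount: number;
}

// Split items into consecutive chunks of at most `size` elements.
function chunk<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new RangeError("size must be >= 1");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Embed and store entries batch by batch. A failing batch is counted as
// errors and the remaining batches still run (assumed policy).
async function importInBatches(
  texts: string[],
  embedder: Embedder,
  store: Store,
  batchSize: number = 32,
): Promise<BatchResult> {
  let imported = 0;
  let errorCount = 0;
  for (const batch of chunk(texts, batchSize)) {
    try {
      const vectors = await embedder.embedBatchPassage(batch);
      if (vectors.length !== batch.length) {
        throw new Error("embedder returned a different number of vectors");
      }
      const entries: StoreEntry[] = batch.map((text, i) => ({
        text,
        vector: vectors[i] as number[],
      }));
      await store.bulkStore(entries);
      imported += batch.length;
    } catch {
      errorCount += batch.length;
    }
  }
  return { imported, errorCount };
}
```

Injecting the embedder and the store as interfaces keeps the sketch testable with in-memory fakes, which is also how a bulkStore failure case can be exercised without a real backend.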
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 27e05d9d7b
const dedupEnabled = (options.dedup as unknown as string | undefined | boolean) !== false
  && (options.dedup as unknown as string | undefined | boolean) !== "false";
Keep dedup off when --dedup is not provided
This condition enables deduplication when options.dedup is undefined, because only false and "false" are treated as disabled. The command defines --dedup as an optional boolean flag, so omitted values can be unset; with the current check, dedup runs by default and changes import behavior (unexpected duplicate skipping and much heavier retrieval work) even when users did not ask for dedup.
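The two possible readings of an omitted flag can be made explicit with a small helper. This is only a sketch: `resolveDedup`, the `DedupOption` type, and the `defaultValue` parameter are hypothetical and do not appear in cli.ts. The string comparison mirrors the `"false"` guard in the condition quoted above.

```typescript
// The raw option may be a real boolean, a string from a test helper, or absent.
type DedupOption = boolean | string | undefined;

// Resolve a raw --dedup value to a boolean.
// `defaultValue` decides what an omitted flag means.
function resolveDedup(raw: DedupOption, defaultValue: boolean): boolean {
  if (raw === undefined) return defaultValue;
  if (typeof raw === "boolean") return raw;
  // Any non-empty string is truthy, so "false" must be compared explicitly.
  return raw !== "false";
}

// The reviewed condition behaves like defaultValue = true:
console.log(resolveDedup(undefined, true));  // true
// The behavior the review asks for corresponds to defaultValue = false:
console.log(resolveDedup(undefined, false)); // false
// The string form is treated as disabled under either default:
console.log(resolveDedup("false", true));    // false
```

Passing the default as a parameter keeps the choice between "dedup on unless disabled" and "dedup off unless requested" in one visible place instead of inside a compound condition.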
console.log(`\u2022 Skipped (short): ${skippedShort}`);
console.log(`\u2022 Skipped (dedup): ${dryRunDedupSkipped.length}`);
console.log(`\u2022 Elapsed: ${elapsed}ms`);
return { imported, skipped: skippedShort, foundFiles, skippedShort, skippedDedup: dryRunDedupSkipped.length, errorCount: parseErrors, elapsedMs: elapsed };
Report full skipped count in dry-run return value
The dry-run branch returns skipped: skippedShort, which excludes dedup skips and read/parse skips already counted in skipped. This makes skipped inconsistent with non-dry-run results and with previous behavior, so automation that relies on skipped for preview-vs-run parity will undercount skipped entries in dry-run mode.
Before: skipped: skippedShort (excludes dedup hits from count)
After: skipped: skipped (includes both short + dedup skips)
Dry-run console output was correct; only the return value was wrong.
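Follow-up fix: dry-run return value bug (commit e518afd)
Problem: when … is enabled
// cli.ts:777, before the fix
return { imported, skipped: skippedShort, ... };
// cli.ts:777, after the fix
return { imported, skipped, ... };
Cause: …
Impact: only the API return value was wrong; console output and actual behavior were both correct, and the existing tests did not assert …
Verification: all 19/19 tests pass.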
About the 2 removed directory-protection tests in …
Why they can be removed: the CLI scan layer ( …
// cli.ts, mdFiles collection
if (f.isFile() && f.name.endsWith(".md")) {
  // ...
}
Conclusion: these two tests are defensive regression tests; in …
- CLI: add --no-dedup negation flag; dedup now defaults to enabled
- Tests: add dedup: false explicitly to all non-dedup tests so behavior is intentional and not dependent on mock return values
- P3 comment: correct misleading 'parseArgs' reference to clarify the != 'false' guard exists for the test helper path (Commander.js rejects --dedup=false as unknown option)
…leanup, remove source:cli
- Phase 1a: move foundFiles++ inside try block (only count successfully read files)
- Phase 1b: remove duplicate foundFiles++ in error branch (Phase 1a already counted)
- Phase 2a: remove skipped++ from dedup hit (skipped is Phase 1b short entries only; dedup tracked via skippedDedup)
- Remove unused source:'cli' from retrieve() call
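The counting rules in this commit can be sketched as below. Only the counter names (`foundFiles`, `skipped`, `skippedDedup`, `errorCount`) and `minTextLength` come from the PR; `scanFiles`, the injected `read` and `isDuplicate` callbacks, the bullet regex, and the `<` comparison against `minTextLength` are hypothetical simplifications, not the logic in cli.ts.

```typescript
interface ScanCounters {
  foundFiles: number;   // files that were read successfully
  skipped: number;      // short entries only
  skippedDedup: number; // dedup hits, tracked separately from `skipped`
  errorCount: number;   // files that could not be read
  accepted: string[];   // entries that would go on to be imported
}

function scanFiles(
  paths: string[],
  read: (path: string) => string,          // may throw on unreadable files
  isDuplicate: (text: string) => boolean,  // stand-in for the dedup lookup
  minTextLength: number = 5,
): ScanCounters {
  const c: ScanCounters = {
    foundFiles: 0, skipped: 0, skippedDedup: 0, errorCount: 0, accepted: [],
  };
  for (const path of paths) {
    let content: string;
    try {
      content = read(path);
      c.foundFiles++; // inside the try block: counted only after a successful read
    } catch {
      c.errorCount++; // no foundFiles++ in the error branch
      continue;
    }
    for (const line of content.split("\n")) {
      const m = /^\s*[-*]\s+(.*)$/.exec(line);
      if (!m) continue; // not a bullet line
      const text = (m[1] ?? "").trim();
      if (text.length < minTextLength) {
        c.skipped++; // short entry
      } else if (isDuplicate(text)) {
        c.skippedDedup++; // dedup hit does not touch `skipped`
      } else {
        c.accepted.push(text);
      }
    }
  }
  return c;
}
```

Keeping each counter tied to exactly one branch means the totals can be checked independently: every bullet line ends up in exactly one of `skipped`, `skippedDedup`, or `accepted`.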
The tests and the dedup/dry-run counts are still being adjusted, so please hold off on reviewing for now. Ideas and discussion are welcome.
Closed: superseded by #735 (clean rebuild)
Summary
Phase 2b batch pipeline (embedBatchPassage + bulkStore)
Bug fixes
Test rewrite
Behavior: dry-run vs real execution
--dry-run (current)
--dry-run --no-dedup (current)
Real execution (current)
Future (with Bloom Filter, not in this PR)
Testing
node --test test/import-markdown/import-markdown.test.mjs
=> 19 pass, 0 fail, 0 skip