fix(import-markdown): batch import Phase 1+2a+2b + backup dir from openclaw.json#734
fix(import-markdown): batch import Phase 1+2a+2b + backup dir from openclaw.json#734jlin53882 wants to merge 8 commits intoCortexReach:masterfrom
Conversation
…+ bulkStore + parallel dedup
… James workspace data - Fix parseArgs converts --dedup=false to string 'false' (truthy) bug - Rewrite all import-markdown tests against real cli.ts behavior - Add batch pipeline, bulkStore failure, scope inference, return fields tests - Use real bullet content from James's workspace (all > minTextLength=5) - Remove non-existent directory-filter behavior tests - 19 tests pass, 0 fail
Before: skipped: skippedShort (excludes dedup hits from count) After: skipped: skipped (includes both short + dedup skips) Dry-run console output was correct; only the return value was wrong.
- CLI: add --no-dedup negation flag; dedup now defaults to enabled - Tests: add dedup: false explicitly to all non-dedup tests so behavior is intentional and not dependent on mock return values - P3 comment: correct misleading 'parseArgs' reference to clarify the != 'false' guard exists for the test helper path (Commander.js rejects --dedup=false as unknown option)
…leanup, remove source:cli - Phase 1a: move foundFiles++ inside try block (only count successfully read files) - Phase 1b: remove duplicate foundFiles++ in error branch (Phase 1a already counted) - Phase 2a: remove skipped++ from dedup hit (skipped is Phase 1b short entries only; dedup tracked via skippedDedup) - Remove unused source:'cli' from retrieve() call
Backup path was derived from resolvedDbPath/../backups, which crashed with ERR_INVALID_ARG_TYPE when dbPath was unset (undefined). Now reads openclaw.json directly to resolve openclawHome, respecting the config's 'home' field, with fallback to OPENCLAW_HOME/~/.openclaw/. This also fixes the path=undefined root cause and co-locates the backup directory next to openclaw.json regardless of dbPath config. Fixes: backup TypeError when path is undefined
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d691237a91
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| for (const { e, hits, ok } of results) { | ||
| if (!ok || hits.length === 0 || hits[0].entry.text !== e.text) { | ||
| pendingEntries.push(e); |
There was a problem hiding this comment.
Deduplicate entries within the current import batch
When dedup is enabled, this branch only checks each candidate against records already in the store before any new entries are written, so duplicate bullets inside the same import run are both kept in pendingEntries and later stored. A file containing the same memory line twice (or duplicates across files in the same run) will now import duplicates, whereas the previous sequential flow naturally skipped later duplicates after the first insert.
Useful? React with 👍 / 👎.
| const fileContents = await Promise.all( | ||
| mdFiles.map(async ({ filePath, scope: discoveredScope }) => { | ||
| try { |
There was a problem hiding this comment.
Throttle parallel file reads to avoid EMFILE failures
This launches readFile for every markdown file at once, which can exceed the process file-descriptor limit on large workspaces and produce EMFILE/read failures. Those failures are only counted as parse errors and the corresponding files are skipped, so imports silently lose data under load. A bounded-concurrency reader is needed to keep large imports reliable.
Useful? React with 👍 / 👎.
|
Closed — superseded by #735 (clean rebuild) |
Summary
Batch import pipeline (Phase 1 parallel read → Phase 2a chunk dedup → Phase 2b batch embed) with robust backup path resolution.
Phase breakdown
fs.readFilefor all markdown filesctx.retriever.retrieve()chunk dedup (CHUNK=50)embedBatchPassage()+bulkStore()with FLUSH_THRESHOLD=100Backup path fix (from #732)
/backupnow readsopenclaw.jsonfor thehomefield before falling back toOPENCLAW_HOMEor~/.openclaw/. FixesTypeError: path must be stringwhendbPathis undefined.Dedup behavior
ctx.retriever.retrieve()(hybrid vector+BM25 + rerank), exact-match on text--no-dedupboolean falseandstring "false"(Commander.js quirk)Testing
resolveOpenclawHomeFromConfig()for robustnessCloses #729, #732