fix(import-markdown): batch import Phase 1+2a+2b + backup dir from openclaw.json by jlin53882 · Pull Request #734 · CortexReach/memory-lancedb-pro

jlin53882 · 2026-05-01T10:31:48Z

Summary

Batch import pipeline (Phase 1 parallel read → Phase 2a chunk dedup → Phase 2b batch embed) with robust backup path resolution.

Phase breakdown

Phase	What it does
Phase 1	Parallel `fs.readFile` for all markdown files
Phase 2a	`ctx.retriever.retrieve()` chunk dedup (CHUNK=50)
Phase 2b	`embedBatchPassage()` + `bulkStore()` with FLUSH_THRESHOLD=100

Backup path fix (from #732)

/backup now reads openclaw.json for the home field before falling back to OPENCLAW_HOME or ~/.openclaw/. Fixes TypeError: path must be string when dbPath is undefined.

Dedup behavior

Default: ON — ctx.retriever.retrieve() (hybrid vector+BM25 + rerank), exact-match on text
Disable: --no-dedup
Handles both boolean false and string "false" (Commander.js quirk)

Testing

19/19 import-markdown tests pass
Backup path uses resolveOpenclawHomeFromConfig() for robustness

Closes #729, #732

…+ bulkStore + parallel dedup

… James workspace data - Fix parseArgs converts --dedup=false to string 'false' (truthy) bug - Rewrite all import-markdown tests against real cli.ts behavior - Add batch pipeline, bulkStore failure, scope inference, return fields tests - Use real bullet content from James's workspace (all > minTextLength=5) - Remove non-existent directory-filter behavior tests - 19 tests pass, 0 fail

Before: skipped: skippedShort (excludes dedup hits from count) After: skipped: skipped (includes both short + dedup skips) Dry-run console output was correct; only the return value was wrong.

- CLI: add --no-dedup negation flag; dedup now defaults to enabled - Tests: add dedup: false explicitly to all non-dedup tests so behavior is intentional and not dependent on mock return values - P3 comment: correct misleading 'parseArgs' reference to clarify the != 'false' guard exists for the test helper path (Commander.js rejects --dedup=false as unknown option)

…leanup, remove source:cli - Phase 1a: move foundFiles++ inside try block (only count successfully read files) - Phase 1b: remove duplicate foundFiles++ in error branch (Phase 1a already counted) - Phase 2a: remove skipped++ from dedup hit (skipped is Phase 1b short entries only; dedup tracked via skippedDedup) - Remove unused source:'cli' from retrieve() call

Backup path was derived from resolvedDbPath/../backups, which crashed with ERR_INVALID_ARG_TYPE when dbPath was unset (undefined). Now reads openclaw.json directly to resolve openclawHome, respecting the config's 'home' field, with fallback to OPENCLAW_HOME/~/.openclaw/. This also fixes the path=undefined root cause and co-locates the backup directory next to openclaw.json regardless of dbPath config. Fixes: backup TypeError when path is undefined

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d691237a91

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-01T10:36:27Z

+      for (const { e, hits, ok } of results) {
+        if (!ok || hits.length === 0 || hits[0].entry.text !== e.text) {
+          pendingEntries.push(e);


Deduplicate entries within the current import batch

When dedup is enabled, this branch only checks each candidate against records already in the store before any new entries are written, so duplicate bullets inside the same import run are both kept in pendingEntries and later stored. A file containing the same memory line twice (or duplicates across files in the same run) will now import duplicates, whereas the previous sequential flow naturally skipped later duplicates after the first insert.

Useful? React with 👍 / 👎.

chatgpt-codex-connector · 2026-05-01T10:36:27Z

+  const fileContents = await Promise.all(
+    mdFiles.map(async ({ filePath, scope: discoveredScope }) => {
+      try {


Throttle parallel file reads to avoid EMFILE failures

This launches readFile for every markdown file at once, which can exceed the process file-descriptor limit on large workspaces and produce EMFILE/read failures. Those failures are only counted as parse errors and the corresponding files are skipped, so imports silently lose data under load. A bounded-concurrency reader is needed to keep large imports reliable.

Useful? React with 👍 / 👎.

jlin53882 · 2026-05-01T11:54:18Z

Closed — superseded by #735 (clean rebuild)

jlin53882 and others added 8 commits April 30, 2026 23:57

feat(import-markdown): batch processing pipeline — embedBatchPassage …

5fd95a2

…+ bulkStore + parallel dedup

chore: default batch-size=10, no cap; add pipeline comments

5a969c1

fix: P2 dry-run runs dedup first; P1 wrap bulkStore in try/catch

34472ca

fix(import-markdown): dry-run return value uses correct skipped count

e518afd

Before: skipped: skippedShort (excludes dedup hits from count) After: skipped: skipped (includes both short + dedup skips) Dry-run console output was correct; only the return value was wrong.

This was referenced May 1, 2026

feat(import-markdown): batch processing pipeline #729

Closed

fix(backup): resolve backup dir from openclaw.json home field #732

Closed

chatgpt-codex-connector Bot reviewed May 1, 2026

View reviewed changes

jlin53882 closed this May 1, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(import-markdown): batch import Phase 1+2a+2b + backup dir from openclaw.json#734

fix(import-markdown): batch import Phase 1+2a+2b + backup dir from openclaw.json#734
jlin53882 wants to merge 8 commits intoCortexReach:masterfrom
jlin53882:pr730batch

jlin53882 commented May 1, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Uh oh!

jlin53882 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jlin53882 commented May 1, 2026

Summary

Phase breakdown

Backup path fix (from #732)

Dedup behavior

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

chatgpt-codex-connector Bot May 1, 2026

Choose a reason for hiding this comment

Uh oh!

jlin53882 commented May 1, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant