Skip to content

fix(import-markdown): batch import Phase 1+2a+2b + backup dir from openclaw.json#734

Closed
jlin53882 wants to merge 8 commits intoCortexReach:masterfrom
jlin53882:pr730batch
Closed

fix(import-markdown): batch import Phase 1+2a+2b + backup dir from openclaw.json#734
jlin53882 wants to merge 8 commits intoCortexReach:masterfrom
jlin53882:pr730batch

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

Summary

Batch import pipeline (Phase 1 parallel read → Phase 2a chunk dedup → Phase 2b batch embed) with robust backup path resolution.

Phase breakdown

Phase What it does
Phase 1 Parallel fs.readFile for all markdown files
Phase 2a ctx.retriever.retrieve() chunk dedup (CHUNK=50)
Phase 2b embedBatchPassage() + bulkStore() with FLUSH_THRESHOLD=100

Backup path fix (from #732)

/backup now reads openclaw.json for the home field before falling back to OPENCLAW_HOME or ~/.openclaw/. Fixes TypeError: path must be string when dbPath is undefined.

Dedup behavior

  • Default: ONctx.retriever.retrieve() (hybrid vector+BM25 + rerank), exact-match on text
  • Disable: --no-dedup
  • Handles both boolean false and string "false" (Commander.js quirk)

Testing

  • 19/19 import-markdown tests pass
  • Backup path uses resolveOpenclawHomeFromConfig() for robustness

Closes #729, #732

jlin53882 and others added 8 commits April 30, 2026 23:57
… James workspace data

- Fix parseArgs converts --dedup=false to string 'false' (truthy) bug
- Rewrite all import-markdown tests against real cli.ts behavior
- Add batch pipeline, bulkStore failure, scope inference, return fields tests
- Use real bullet content from James's workspace (all > minTextLength=5)
- Remove non-existent directory-filter behavior tests
- 19 tests pass, 0 fail
Before: skipped: skippedShort (excludes dedup hits from count)
After:  skipped: skipped (includes both short + dedup skips)

Dry-run console output was correct; only the return value was wrong.
- CLI: add --no-dedup negation flag; dedup now defaults to enabled
- Tests: add dedup: false explicitly to all non-dedup tests so behavior
  is intentional and not dependent on mock return values
- P3 comment: correct misleading 'parseArgs' reference to clarify
  the != 'false' guard exists for the test helper path (Commander.js
  rejects --dedup=false as unknown option)
…leanup, remove source:cli

- Phase 1a: move foundFiles++ inside try block (only count successfully read files)
- Phase 1b: remove duplicate foundFiles++ in error branch (Phase 1a already counted)
- Phase 2a: remove skipped++ from dedup hit (skipped is Phase 1b short entries only; dedup tracked via skippedDedup)
- Remove unused source:'cli' from retrieve() call
Backup path was derived from resolvedDbPath/../backups, which crashed
with ERR_INVALID_ARG_TYPE when dbPath was unset (undefined).

Now reads openclaw.json directly to resolve openclawHome, respecting
the config's 'home' field, with fallback to OPENCLAW_HOME/~/.openclaw/.
This also fixes the path=undefined root cause and co-locates the
backup directory next to openclaw.json regardless of dbPath config.

Fixes: backup TypeError when path is undefined
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d691237a91

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread cli.ts
Comment on lines +745 to +747
for (const { e, hits, ok } of results) {
if (!ok || hits.length === 0 || hits[0].entry.text !== e.text) {
pendingEntries.push(e);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Deduplicate entries within the current import batch

When dedup is enabled, this branch only checks each candidate against records already in the store before any new entries are written, so duplicate bullets inside the same import run are both kept in pendingEntries and later stored. A file containing the same memory line twice (or duplicates across files in the same run) will now import duplicates, whereas the previous sequential flow naturally skipped later duplicates after the first insert.

Useful? React with 👍 / 👎.

Comment thread cli.ts
Comment on lines +684 to +686
const fileContents = await Promise.all(
mdFiles.map(async ({ filePath, scope: discoveredScope }) => {
try {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Throttle parallel file reads to avoid EMFILE failures

This launches readFile for every markdown file at once, which can exceed the process file-descriptor limit on large workspaces and produce EMFILE/read failures. Those failures are only counted as parse errors and the corresponding files are skipped, so imports silently lose data under load. A bounded-concurrency reader is needed to keep large imports reliable.

Useful? React with 👍 / 👎.

@jlin53882
Copy link
Copy Markdown
Contributor Author

Closed — superseded by #735 (clean rebuild)

@jlin53882 jlin53882 closed this May 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant