feat(import-markdown): batch processing pipeline #729

Closed

jlin53882 wants to merge 3 commits into CortexReach:master from jlin53882:feat/batch-import

Conversation

@jlin53882 (Contributor)

Summary

Implement a batch processing pipeline for import-markdown that replaces per-entry embedPassage() + store() with:

  • Phase 1 — Parallel file reads (Promise.all) → flat entry list
  • Phase 2a — Chunk-based (50) dedup via retriever.retrieve() in parallel
  • Phase 2b — Batch embedding via embedBatchPassage() + bulkStore() flush every 100 entries
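
The three phases above can be sketched as follows. This is an illustrative outline only: `chunk`, `readFile`, `isDuplicate`, and `flush` are hypothetical stand-ins for the PR's actual helpers (`retriever.retrieve()`, `embedBatchPassage()`, `bulkStore()`), and the chunk/flush sizes (50, 100) are taken from the description.

```typescript
type Entry = { text: string };

// Split a list into fixed-size chunks (helper assumed, not from the PR).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function importBatch(
  files: string[],
  readFile: (f: string) => Promise<Entry[]>,    // stand-in for parallel file reads
  isDuplicate: (e: Entry) => Promise<boolean>,  // stand-in for retriever.retrieve() dedup
  flush: (batch: Entry[]) => Promise<void>,     // stand-in for embed + bulkStore()
): Promise<{ imported: number; skippedDedup: number }> {
  // Phase 1: parallel file reads -> flat entry list
  const entries = (await Promise.all(files.map(readFile))).flat();

  // Phase 2a: dedup in chunks of 50, each chunk checked in parallel
  const fresh: Entry[] = [];
  let skippedDedup = 0;
  for (const c of chunk(entries, 50)) {
    const flags = await Promise.all(c.map(isDuplicate));
    c.forEach((e, i) => (flags[i] ? skippedDedup++ : fresh.push(e)));
  }

  // Phase 2b: accumulate and flush to storage every 100 entries
  let imported = 0;
  const pending: Entry[] = [];
  for (const e of fresh) {
    pending.push(e);
    if (pending.length >= 100) await flush(pending.splice(0, pending.length));
    imported++;
  }
  if (pending.length > 0) await flush(pending.splice(0, pending.length));
  return { imported, skippedDedup };
}
```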

Key Changes

| Aspect | Before | After |
| --- | --- | --- |
| File reads | sequential `for` loop | `Promise.all` in parallel |
| Dedup | `ctx.store.bm25Search` per entry | `retriever.retrieve()` in chunks of 50 |
| Embedding | `embedPassage()` per entry | `embedBatchPassage()` per batch |
| Storage | `ctx.store.store()` per entry | `bulkStore()` every 100 entries |
| Return value | `{imported, skipped, foundFiles}` | adds `skippedShort`, `skippedDedup`, `errorCount`, `elapsedMs` |

New CLI Flag

--batch-size <n>   Embedding batch size (default: 32, max: 128)
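
A minimal sketch of how the flag's default (32) and cap (128) could be enforced; the function name and the exact clamping behavior (fall back to the default on invalid input, clamp at the max) are assumptions, not taken from the PR's code.

```typescript
// Hypothetical resolver for --batch-size; default/max values come from the
// PR description, the validation strategy is an assumption.
function resolveBatchSize(raw: string | undefined): number {
  const DEFAULT = 32;
  const MAX = 128;
  if (raw === undefined) return DEFAULT;
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n) || n < 1) return DEFAULT; // reject garbage or non-positive
  return Math.min(n, MAX);                          // clamp to the documented max
}
```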

API Dependencies

Requires embedBatchPassage() and bulkStore() APIs (already merged in PR #716).


@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 11c1da3fdb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: cli.ts (outdated)
Comment on lines +716 to +720

```typescript
if (options.dryRun) {
  for (const e of allEntries) {
    console.log(`  [dry-run] would import: ${e.text.slice(0, 80)}${e.text.length > 80 ? "..." : ""}`);
    imported++;
  }
}
```

P2: Perform dedup simulation before returning dry-run results

When --dry-run is set, this early return runs before the dedup phase, so duplicate entries are always counted as “would import” even if dedupEnabled is true. That makes dry-run output inaccurate for the exact mode users rely on to preview real import behavior (--dry-run --dedup), and the summary can claim dedup is enabled while not applying it.
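
One way to address this review comment is to run the dedup check before the dry-run listing, so only entries that would actually survive dedup are counted. A sketch under that assumption (`dryRunPreview` and `isDuplicate` are hypothetical names, not the PR's code):

```typescript
// Hypothetical fix: apply dedup simulation first, then list survivors,
// so --dry-run --dedup previews the same set a real import would store.
async function dryRunPreview(
  allEntries: { text: string }[],
  dedupEnabled: boolean,
  isDuplicate: (text: string) => Promise<boolean>,
): Promise<{ imported: number; skippedDedup: number }> {
  let imported = 0;
  let skippedDedup = 0;
  for (const e of allEntries) {
    if (dedupEnabled && (await isDuplicate(e.text))) {
      skippedDedup++; // counted as a dedup skip, not as "would import"
      continue;
    }
    console.log(`  [dry-run] would import: ${e.text.slice(0, 80)}${e.text.length > 80 ? "..." : ""}`);
    imported++;
  }
  return { imported, skippedDedup };
}
```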


Comment thread: cli.ts (outdated)

```typescript
async function flushPending(): Promise<void> {
  if (pendingFlush.length === 0) return;
  const batch = pendingFlush.splice(0, pendingFlush.length);
  await ctx.store.bulkStore(batch);
}
```

P1: Handle bulkStore failures per flush instead of aborting import

A single bulkStore error now throws out of runImportMarkdown and terminates the entire command, leaving the remaining parsed entries unprocessed. In the previous per-entry flow, store failures were counted and the import continued; this regression means transient DB write errors (or one bad batch) can fail a large import outright instead of skipping failed items and finishing.
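
The suggested fix could look like the sketch below: catch the failure per flush and count the batch toward `errorCount` (a field the PR already returns) instead of letting the exception escape `runImportMarkdown`. The try/catch placement and the `counters` shape are assumptions for illustration.

```typescript
// Hypothetical per-flush error handling: one bad batch is counted and
// skipped, and subsequent flushes still run.
async function flushPendingSafe(
  pendingFlush: unknown[],
  bulkStore: (batch: unknown[]) => Promise<void>,
  counters: { imported: number; errorCount: number },
): Promise<void> {
  if (pendingFlush.length === 0) return;
  const batch = pendingFlush.splice(0, pendingFlush.length);
  try {
    await bulkStore(batch);
    counters.imported += batch.length;
  } catch {
    // Transient DB write errors no longer abort the whole import.
    counters.errorCount += batch.length;
  }
}
```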


@jlin53882 (Contributor, Author)

Closed — superseded by #734 (Phase 1+2a+2b batch import)
