feat(import-markdown): batch processing pipeline #729

Closed

jlin53882 wants to merge 3 commits into CortexReach:master from jlin53882:feat/batch-import

Conversation

@jlin53882 (Contributor)

Summary

Implement a batch processing pipeline for import-markdown that replaces per-entry embedPassage() + store() with:

  • Phase 1 — Parallel file reads (Promise.all) → flat entry list
  • Phase 2a — Chunk-based (50) dedup via retriever.retrieve() in parallel
  • Phase 2b — Batch embedding via embedBatchPassage() + bulkStore() flush every 100 entries
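
The three phases above can be sketched as follows. This is an illustrative outline only: `chunk`, `readFile`, `isDuplicate`, and `flush` are hypothetical stand-ins for the PR's actual helpers (`retriever.retrieve()`, `embedBatchPassage()`, `bulkStore()`), and the chunk/flush sizes (50, 100) are taken from the description.

```typescript
type Entry = { text: string };

// Split a list into fixed-size chunks (helper assumed, not from the PR).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function importBatch(
  files: string[],
  readFile: (f: string) => Promise<Entry[]>,    // stand-in for parallel file reads
  isDuplicate: (e: Entry) => Promise<boolean>,  // stand-in for retriever.retrieve() dedup
  flush: (batch: Entry[]) => Promise<void>,     // stand-in for embed + bulkStore()
): Promise<{ imported: number; skippedDedup: number }> {
  // Phase 1: parallel file reads -> flat entry list
  const entries = (await Promise.all(files.map(readFile))).flat();

  // Phase 2a: dedup in chunks of 50, each chunk checked in parallel
  const fresh: Entry[] = [];
  let skippedDedup = 0;
  for (const c of chunk(entries, 50)) {
    const flags = await Promise.all(c.map(isDuplicate));
    c.forEach((e, i) => (flags[i] ? skippedDedup++ : fresh.push(e)));
  }

  // Phase 2b: accumulate and flush to storage every 100 entries
  let imported = 0;
  const pending: Entry[] = [];
  for (const e of fresh) {
    pending.push(e);
    if (pending.length >= 100) await flush(pending.splice(0, pending.length));
    imported++;
  }
  if (pending.length > 0) await flush(pending.splice(0, pending.length));
  return { imported, skippedDedup };
}
```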

Key Changes

| Aspect | Before | After |
| --- | --- | --- |
| File reads | sequential `for` loop | `Promise.all` in parallel |
| Dedup | `ctx.store.bm25Search` per entry | `retriever.retrieve()` in chunks of 50 |
| Embedding | `embedPassage()` per entry | `embedBatchPassage()` per batch |
| Storage | `ctx.store.store()` per entry | `bulkStore()` every 100 entries |
| Return value | `{imported, skipped, foundFiles}` | adds `skippedShort`, `skippedDedup`, `errorCount`, `elapsedMs` |

New CLI Flag

--batch-size <n>   Embedding batch size (default: 32, max: 128)
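
A minimal sketch of how the flag's default (32) and cap (128) could be enforced; the function name and the exact clamping behavior (fall back to the default on invalid input, clamp at the max) are assumptions, not taken from the PR's code.

```typescript
// Hypothetical resolver for --batch-size; default/max values come from the
// PR description, the validation strategy is an assumption.
function resolveBatchSize(raw: string | undefined): number {
  const DEFAULT = 32;
  const MAX = 128;
  if (raw === undefined) return DEFAULT;
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n) || n < 1) return DEFAULT; // reject garbage or non-positive
  return Math.min(n, MAX);                          // clamp to the documented max
}
```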

API Dependencies

Requires embedBatchPassage() and bulkStore() APIs (already merged in PR #716).


@chatgpt-codex-connector (Bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 11c1da3fdb

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread: cli.ts (outdated)
Comment on lines +716 to +720

```typescript
if (options.dryRun) {
  for (const e of allEntries) {
    console.log(`  [dry-run] would import: ${e.text.slice(0, 80)}${e.text.length > 80 ? "..." : ""}`);
    imported++;
  }
}
```

P2: Perform dedup simulation before returning dry-run results

When --dry-run is set, this early return runs before the dedup phase, so duplicate entries are always counted as “would import” even if dedupEnabled is true. That makes dry-run output inaccurate for the exact mode users rely on to preview real import behavior (--dry-run --dedup), and the summary can claim dedup is enabled while not applying it.
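
One way to address this review comment is to run the dedup check before the dry-run listing, so only entries that would actually survive dedup are counted. A sketch under that assumption (`dryRunPreview` and `isDuplicate` are hypothetical names, not the PR's code):

```typescript
// Hypothetical fix: apply dedup simulation first, then list survivors,
// so --dry-run --dedup previews the same set a real import would store.
async function dryRunPreview(
  allEntries: { text: string }[],
  dedupEnabled: boolean,
  isDuplicate: (text: string) => Promise<boolean>,
): Promise<{ imported: number; skippedDedup: number }> {
  let imported = 0;
  let skippedDedup = 0;
  for (const e of allEntries) {
    if (dedupEnabled && (await isDuplicate(e.text))) {
      skippedDedup++; // counted as a dedup skip, not as "would import"
      continue;
    }
    console.log(`  [dry-run] would import: ${e.text.slice(0, 80)}${e.text.length > 80 ? "..." : ""}`);
    imported++;
  }
  return { imported, skippedDedup };
}
```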


Comment thread: cli.ts (outdated)

```typescript
async function flushPending(): Promise<void> {
  if (pendingFlush.length === 0) return;
  const batch = pendingFlush.splice(0, pendingFlush.length);
  await ctx.store.bulkStore(batch);
}
```

P1: Handle bulkStore failures per flush instead of aborting import

A single bulkStore error now throws out of runImportMarkdown and terminates the entire command, leaving the remaining parsed entries unprocessed. In the previous per-entry flow, store failures were counted and the import continued; this regression means transient DB write errors (or one bad batch) can fail a large import outright instead of skipping failed items and finishing.
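
The suggested fix could look like the sketch below: catch the failure per flush and count the batch toward `errorCount` (a field the PR already returns) instead of letting the exception escape `runImportMarkdown`. The try/catch placement and the `counters` shape are assumptions for illustration.

```typescript
// Hypothetical per-flush error handling: one bad batch is counted and
// skipped, and subsequent flushes still run.
async function flushPendingSafe(
  pendingFlush: unknown[],
  bulkStore: (batch: unknown[]) => Promise<void>,
  counters: { imported: number; errorCount: number },
): Promise<void> {
  if (pendingFlush.length === 0) return;
  const batch = pendingFlush.splice(0, pendingFlush.length);
  try {
    await bulkStore(batch);
    counters.imported += batch.length;
  } catch {
    // Transient DB write errors no longer abort the whole import.
    counters.errorCount += batch.length;
  }
}
```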


@jlin53882 (Contributor, Author)

Closed — superseded by #734 (Phase 1+2a+2b batch import)
