feat(import-markdown): batch processing pipeline #729

jlin53882 wants to merge 3 commits into CortexReach:master
Conversation
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 11c1da3fdb
```ts
if (options.dryRun) {
  for (const e of allEntries) {
    console.log(`  [dry-run] would import: ${e.text.slice(0, 80)}${e.text.length > 80 ? "..." : ""}`);
    imported++;
  }
}
```
Perform dedup simulation before returning dry-run results
When --dry-run is set, this early return runs before the dedup phase, so duplicate entries are always counted as “would import” even if dedupEnabled is true. That makes dry-run output inaccurate for the exact mode users rely on to preview real import behavior (--dry-run --dedup), and the summary can claim dedup is enabled while not applying it.
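A minimal sketch of the suggested fix, assuming entry and counter shapes similar to the PR's; the `isDuplicate` helper is hypothetical, standing in for the retriever-based dedup check:

```typescript
type Entry = { text: string };

// Run the dedup check before counting dry-run imports, so
// `--dry-run --dedup` previews what a real import would actually store.
// (Names and shapes here are assumptions, not the PR's exact code.)
async function simulateDryRun(
  entries: Entry[],
  dedupEnabled: boolean,
  isDuplicate: (e: Entry) => Promise<boolean>,
): Promise<{ wouldImport: number; skippedDedup: number }> {
  let wouldImport = 0;
  let skippedDedup = 0;
  for (const e of entries) {
    if (dedupEnabled && (await isDuplicate(e))) {
      // Counted as skipped, matching real-import behavior.
      skippedDedup++;
      continue;
    }
    wouldImport++;
  }
  return { wouldImport, skippedDedup };
}
```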
```ts
async function flushPending(): Promise<void> {
  if (pendingFlush.length === 0) return;
  const batch = pendingFlush.splice(0, pendingFlush.length);
  await ctx.store.bulkStore(batch);
}
```
Handle bulkStore failures per flush instead of aborting import
A single bulkStore error now throws out of runImportMarkdown and terminates the entire command, leaving the remaining parsed entries unprocessed. In the previous per-entry flow, store failures were counted and the import continued; this regression means transient DB write errors (or one bad batch) can fail a large import outright instead of skipping failed items and finishing.
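A sketch of per-flush error handling under this suggestion, with store and counter shapes assumed for illustration:

```typescript
type Store = { bulkStore(batch: string[]): Promise<void> };

// Catch bulkStore failures per flush so one bad batch increments an
// error counter instead of aborting the whole import.
// (Shapes here are assumptions; the real code passes richer entry objects.)
async function flushPendingSafe(
  store: Store,
  pendingFlush: string[],
  counters: { errorCount: number },
): Promise<void> {
  if (pendingFlush.length === 0) return;
  const batch = pendingFlush.splice(0, pendingFlush.length);
  try {
    await store.bulkStore(batch);
  } catch {
    // Count the failed batch's entries and keep importing the rest.
    counters.errorCount += batch.length;
  }
}
```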
Closed — superseded by #734 (Phase 1+2a+2b batch import)
Summary

Implement a batch processing pipeline for `import-markdown` that replaces per-entry `embedPassage()` + `store()` with:

- parallel file parsing (`Promise.all`) → flat entry list
- `retriever.retrieve()` in parallel
- `embedBatchPassage()` + `bulkStore()` flush every 100 entries

Key Changes

| Before | After |
| --- | --- |
| sequential `for` loop | `Promise.all` parallel |
| `ctx.store.bm25Search` per entry | `retriever.retrieve()` in chunks of 50 |
| `embedPassage()` per entry | `embedBatchPassage()` per batch |
| `ctx.store.store()` per entry | `bulkStore()` every 100 entries |
| `{imported, skipped, foundFiles}` | adds `skippedShort`, `skippedDedup`, `errorCount`, `elapsedMs` |

New CLI Flag
API Dependencies

Requires the `embedBatchPassage()` and `bulkStore()` APIs (already merged in PR #716).
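The batch-flush loop described in the summary can be sketched as follows; the `embedBatchPassage`/`bulkStore` signatures here are assumptions for illustration, not the exact APIs from PR #716:

```typescript
type Embedded = { text: string; vector: number[] };

// Split a flat entry list into fixed-size chunks.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Embed and store in batches of `batchSize` (100 in the PR), so each
// batch costs one embedding call and one bulk write instead of N of each.
async function importInBatches(
  texts: string[],
  embedBatchPassage: (batch: string[]) => Promise<number[][]>,
  bulkStore: (batch: Embedded[]) => Promise<void>,
  batchSize = 100,
): Promise<number> {
  let imported = 0;
  for (const batch of chunk(texts, batchSize)) {
    const vectors = await embedBatchPassage(batch); // one call per batch
    await bulkStore(batch.map((text, i) => ({ text, vector: vectors[i] })));
    imported += batch.length;
  }
  return imported;
}
```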