fix(auto-recall): thread AbortSignal + BM25-only mode + fix post-timeout rejection#746
fix(auto-recall): thread AbortSignal + BM25-only mode + fix post-timeout rejection#746jlin53882 wants to merge 5 commits intoCortexReach:masterfrom
Conversation
…ejection When the AUTO_RECALL_TIMEOUT_MS timer wins Promise.race and aborts recallWork via recallAbort.abort(), the in-flight recallWork() can still reject after the timeout path has already resolved undefined. Without a .catch() handler, that rejection escapes as an unhandled process rejection. Fix: add a .catch() on the recallWork branch of Promise.race that only suppresses the rejection when OUR timeout controller won the race (recallAbort.signal.aborted). Other errors still propagate normally. Follows maintainer feedback on PR CortexReach#668.
Problem: the Promise.race-based timeout in the auto-recall before_prompt_build hook resolves the outer promise with undefined after AUTO_RECALL_TIMEOUT_MS, but it does not cancel the underlying embedding HTTP call. The underlying work keeps running, holding the per-agent memory-runtime mutex (and lancedb connection), so the next hook invocation serializes behind it. Combined with the gateway's session-write-lock max-hold of (timeoutMs + 120s grace, min 300s), a single slow embedding call can hold the session lock for several minutes before the watchdog force-releases — seen in production as "stuck session: state=processing age=219s queueDepth=0". Fix: - Create an AbortController in the hook and pass its signal into recallWork(signal). - recallWork threads the signal through retrieveWithRetry() into Retriever.retrieve() via a new RetrievalContext.signal field. - Retriever.retrieve() forwards the signal to the embedQuery() calls in both vectorOnlyRetrieval() and hybridRetrieval(). The embedder already accepts AbortSignal throughout (src/embedder.ts), so no downstream changes were needed. - When the setTimeout fires, the hook calls controller.abort() before resolving undefined. In-flight embedding fetch() requests terminate promptly; retriever.retrieve() rejects with AbortError which the retryWithRetry early-return handles (aborts skip the 75ms retry). - The catch branch downgrades AbortError to a debug log so we don't double-log the timeout warning the setTimeout path already emitted. Out of scope (follow-up): - Jina rerank HTTP calls (a separate code path in retriever.ts) should also accept the signal. Left for a focused follow-up since it touches the rerank client interface. - BM25 path does not call embedQuery and has no long-running HTTP call, so it does not need signal plumbing. Note: tsc --noEmit reports a pre-existing TS2353 error on retrieveWithRetry where callers pass `source: "auto-recall"` but the function's inline type did not include `source` (introduced by CortexReach#535, not this PR). Non-blocking — tsx runtime transpile does not check types. Can be fixed with a one-line addition to the param type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot P2 feedback on PR CortexReach#668. The catch block used `err.name === "AbortError" || /abort/i.test(err.message)` to decide whether to downgrade recall errors to debug. That was too broad: if the embedder's own internal timeout AbortController fired (src/embedder.ts withTimeout), its AbortError would propagate up and match the same pattern, silencing a real infrastructure failure. Key the debug-downgrade to `abortController.signal.aborted` instead. That flag is owned by this hook's own controller and is true only when our timeout callback fired. Aborts from any other controller (embedder, SDK client, upstream cancellation) now stay at warn. Also tighten the `RetrievalContext.signal` docstring: the embedder's withTimeout rejects with `externalSignal.reason ?? new Error("aborted")` when the signal is already aborted, which can be a plain Error rather than an AbortError. "rejects due to abort (often with AbortError or the signal's abort reason)" matches observed behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
memory-lancedb-pro/src/retriever.ts
Line 598 in b0d03ec
The retrieve() method now opens a try block but never follows it with catch or finally, so TypeScript parsing fails (TS1472) and the plugin cannot load at all. This is a hard runtime blocker for any environment that imports src/retriever.ts.
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| results = await this.hybridRetrieval( | ||
| query, safeLimit, scopeFilter, category, trace, signal, |
There was a problem hiding this comment.
Pass AbortSignal in the correct retrieval argument slot
This call passes signal as the 6th positional argument, but hybridRetrieval expects source and diagnostics before signal, so the abort signal never reaches embedder.embedQuery(...). In vector/hybrid paths that provide a signal, cancellation will not work and timeouts can still block while remote embedding requests run to completion.
Useful? React with 👍 / 👎.
| const target = join(params.workspaceDir, params2.relPath); | ||
| if (!target.startsWith(params.workspaceDir)) throw new Error(`memory-lancedb-pro: invalid relPath ${params2.relPath}`); |
There was a problem hiding this comment.
Normalize memory_get paths before workspace boundary check
The boundary check uses join(workspaceDir, relPath) plus startsWith(workspaceDir), which can be bypassed with paths like ../workspaceDir2/... that still share the same prefix string. In compat readFile, this allows reading files outside the configured workspace when relPath is user-controlled.
Useful? React with 👍 / 👎.
b0d03ec to
44445dd
Compare
…+ isCliRegistrationMode
44445dd to
96f7a61
Compare
Summary
Consolidates two closed/in-conflict PRs into a single reviewable branch:
AbortControllerthreaded throughrecallWork → retrieveWithRetry → embedder.embedQuery.Promise.racepost-timeout rejection now consumed via.catch()so it does not become an unhandled process rejection.Commits
dc7ec59fix(auto-recall): thread AbortSignal to cancel embedding on timeout9f4c86cfix(auto-recall): narrow abort classification to our controller3dc8644fix: restore cli compat and unblock auto-recall latencyb0d03ecfix(auto-recall): consume post-timeout rejection to avoid unhandled rejectionMaintainer feedback addressed
Rebased onto latest
origin/master(47b635d) — no conflicts.