Skip to content

fix(auto-recall): thread AbortSignal + BM25-only mode + fix post-timeout rejection#746

Open
jlin53882 wants to merge 5 commits intoCortexReach:masterfrom
jlin53882:james/handle-pr668-pr708
Open

fix(auto-recall): thread AbortSignal + BM25-only mode + fix post-timeout rejection#746
jlin53882 wants to merge 5 commits intoCortexReach:masterfrom
jlin53882:james/handle-pr668-pr708

Conversation

@jlin53882
Copy link
Copy Markdown
Contributor

Summary

Consolidates two closed/in-conflict PRs into a single reviewable branch:

Commits

  1. dc7ec59 fix(auto-recall): thread AbortSignal to cancel embedding on timeout
  2. 9f4c86c fix(auto-recall): narrow abort classification to our controller
  3. 3dc8644 fix: restore cli compat and unblock auto-recall latency
  4. b0d03ec fix(auto-recall): consume post-timeout rejection to avoid unhandled rejection

Maintainer feedback addressed

From Issue Status
PR #668 review Post-timeout rejection needs explicit handling Fixed in b0d03ec

Rebased onto latest origin/master (47b635d) — no conflicts.

jlin53882 and others added 4 commits May 5, 2026 00:12
…ejection

When the AUTO_RECALL_TIMEOUT_MS timer wins Promise.race and aborts
recallWork via recallAbort.abort(), the in-flight recallWork() can still
reject after the timeout path has already resolved undefined. Without
a .catch() handler, that rejection escapes as an unhandled process
rejection.

Fix: add a .catch() on the recallWork branch of Promise.race that only
suppresses the rejection when OUR timeout controller won the race
(recallAbort.signal.aborted). Other errors still propagate normally.

Follows maintainer feedback on PR CortexReach#668.
Problem: the Promise.race-based timeout in the auto-recall
before_prompt_build hook resolves the outer promise with undefined
after AUTO_RECALL_TIMEOUT_MS, but it does not cancel the underlying
embedding HTTP call. The underlying work keeps running, holding the
per-agent memory-runtime mutex (and lancedb connection), so the next
hook invocation serializes behind it. Combined with the gateway's
session-write-lock max-hold of (timeoutMs + 120s grace, min 300s),
a single slow embedding call can hold the session lock for several
minutes before the watchdog force-releases — seen in production as
"stuck session: state=processing age=219s queueDepth=0".

Fix:
- Create an AbortController in the hook and pass its signal into
  recallWork(signal).
- recallWork threads the signal through retrieveWithRetry() into
  Retriever.retrieve() via a new RetrievalContext.signal field.
- Retriever.retrieve() forwards the signal to the embedQuery() calls
  in both vectorOnlyRetrieval() and hybridRetrieval(). The embedder
  already accepts AbortSignal throughout (src/embedder.ts), so no
  downstream changes were needed.
- When the setTimeout fires, the hook calls controller.abort() before
  resolving undefined. In-flight embedding fetch() requests terminate
  promptly; retriever.retrieve() rejects with AbortError which the
  retryWithRetry early-return handles (aborts skip the 75ms retry).
- The catch branch downgrades AbortError to a debug log so we don't
  double-log the timeout warning the setTimeout path already emitted.

Out of scope (follow-up):
- Jina rerank HTTP calls (a separate code path in retriever.ts) should
  also accept the signal. Left for a focused follow-up since it
  touches the rerank client interface.
- BM25 path does not call embedQuery and has no long-running HTTP
  call, so it does not need signal plumbing.

Note: tsc --noEmit reports a pre-existing TS2353 error on
retrieveWithRetry where callers pass `source: "auto-recall"` but the
function's inline type did not include `source` (introduced by CortexReach#535,
not this PR). Non-blocking — tsx runtime transpile does not check
types. Can be fixed with a one-line addition to the param type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address Copilot P2 feedback on PR CortexReach#668.

The catch block used `err.name === "AbortError" || /abort/i.test(err.message)`
to decide whether to downgrade recall errors to debug. That was too
broad: if the embedder's own internal timeout AbortController fired
(src/embedder.ts withTimeout), its AbortError would propagate up and
match the same pattern, silencing a real infrastructure failure.

Key the debug-downgrade to `abortController.signal.aborted` instead.
That flag is owned by this hook's own controller and is true only
when our timeout callback fired. Aborts from any other controller
(embedder, SDK client, upstream cancellation) now stay at warn.

Also tighten the `RetrievalContext.signal` docstring: the embedder's
withTimeout rejects with `externalSignal.reason ?? new Error("aborted")`
when the signal is already aborted, which can be a plain Error rather
than an AbortError. "rejects due to abort (often with AbortError or
the signal's abort reason)" matches observed behavior.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review


P0 Badge Restore catch/finally for retrieve try block

The retrieve() method now opens a try block but never follows it with catch or finally, so TypeScript parsing fails (TS1472) and the plugin cannot load at all. This is a hard runtime blocker for any environment that imports src/retriever.ts.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/retriever.ts
Comment on lines +623 to +624
results = await this.hybridRetrieval(
query, safeLimit, scopeFilter, category, trace, signal,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Pass AbortSignal in the correct retrieval argument slot

This call passes signal as the 6th positional argument, but hybridRetrieval expects source and diagnostics before signal, so the abort signal never reaches embedder.embedQuery(...). In vector/hybrid paths that provide a signal, cancellation will not work and timeouts can still block while remote embedding requests run to completion.

Useful? React with 👍 / 👎.

Comment thread index.ts
Comment on lines +1813 to +1814
const target = join(params.workspaceDir, params2.relPath);
if (!target.startsWith(params.workspaceDir)) throw new Error(`memory-lancedb-pro: invalid relPath ${params2.relPath}`);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Normalize memory_get paths before workspace boundary check

The boundary check uses join(workspaceDir, relPath) plus startsWith(workspaceDir), which can be bypassed with paths like ../workspaceDir2/... that still share the same prefix string. In compat readFile, this allows reading files outside the configured workspace when relPath is user-controlled.

Useful? React with 👍 / 👎.

@jlin53882 jlin53882 force-pushed the james/handle-pr668-pr708 branch from b0d03ec to 44445dd Compare May 5, 2026 11:42
@jlin53882 jlin53882 force-pushed the james/handle-pr668-pr708 branch from 44445dd to 96f7a61 Compare May 5, 2026 12:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants