RLM as the default answer mode (RAG deprecated) #43
Merged
get_by_content_sha256 was querying without tenant/workspace filter, causing sources from other workspaces to be treated as idempotent hits. 3072-d ingests into -large workspaces were deduped against 1536-d data in the default workspace, leaving canon_chunk_vectors_3072 empty. Also explicitly pass Azure credentials in docker-compose environment so the worker correctly resolves them at startup.
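A minimal sketch of the scoped idempotency lookup this commit describes. The table name `canon_sources` appears elsewhere in this PR, but the column names (`tenant_id`, `workspace_id`, `content_sha256`), the SQLite backend, and the function signature are assumptions made for illustration, not flycanon's actual schema or repository code.

```python
import sqlite3
from typing import Optional


def get_by_content_sha256(
    conn: sqlite3.Connection,
    tenant_id: str,
    workspace_id: str,
    content_sha256: str,
) -> Optional[int]:
    """Return the id of a source already ingested in THIS workspace, else None.

    Filtering on tenant and workspace keeps an identical file in another
    workspace from being reported as an idempotent hit.
    """
    row = conn.execute(
        "SELECT id FROM canon_sources "
        "WHERE tenant_id = ? AND workspace_id = ? AND content_sha256 = ?",
        (tenant_id, workspace_id, content_sha256),
    ).fetchone()
    return row[0] if row else None


# Hypothetical data: the same file hash already exists in the default workspace.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE canon_sources ("
    "id INTEGER PRIMARY KEY, tenant_id TEXT, workspace_id TEXT, content_sha256 TEXT)"
)
conn.execute(
    "INSERT INTO canon_sources (tenant_id, workspace_id, content_sha256) "
    "VALUES ('acme', 'default', 'abc123')"
)
```

With the filter in place, a lookup for the same hash in a different workspace misses, so the ingest proceeds and the 3072-d vectors are written.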
…spec docs: RLM integration design spec
An absolute key escaped the LocalFs root because Path(root) / '/abs' discards the left operand; the S3 backend passed an absolute-style key straight through with an empty prefix. Add absolute-key rejection tests for both backends, and harden LocalFsObjectStore._path to reject leading-slash keys and verify the resolved path stays within the root.
With an empty prefix an absolute-style key was passed straight to S3 as the object key, breaking the shared tenant/workspace key layout. Apply the same leading-slash rejection as the LocalFs backend.
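The escape described in the two commits above comes from how pathlib joins paths: an absolute right-hand operand replaces the left one. The sketch below shows that behavior and one way to harden a `_path` helper. The class here is a hypothetical stand-in, not the actual `LocalFsObjectStore` implementation, and it needs Python 3.9+ for `Path.is_relative_to`.

```python
from pathlib import Path, PurePosixPath

# Joining with an absolute key silently discards the root.
assert PurePosixPath("/srv/store") / "/etc/passwd" == PurePosixPath("/etc/passwd")


class LocalFsObjectStore:
    """Hypothetical minimal stand-in, reduced to key-to-path resolution."""

    def __init__(self, root: str) -> None:
        self._root = Path(root).resolve()

    def _path(self, key: str) -> Path:
        # Reject empty and absolute-style keys before any join happens.
        if not key or key.startswith("/"):
            raise ValueError(f"object key must be relative: {key!r}")
        candidate = (self._root / key).resolve()
        # Defence in depth: '..' segments or symlinks must not leave the root.
        if not candidate.is_relative_to(self._root):
            raise ValueError(f"object key escapes the store root: {key!r}")
        return candidate
```

The S3 backend has no filesystem to resolve against, so there the leading-slash rejection alone is what keeps the shared tenant/workspace key layout intact.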
feat(storage): ObjectStore port with LocalFs + S3 backends
…tkey feat(sources): add canon_sources.object_store_key column + migration
feat(query): port the RLM engine (CodeAct REPL + LLM client) into flycanon
…inals feat(sources): persist original documents on ingest for RLM
feat(query): make the sandboxed subprocess the default RLM executor
Replace the bespoke fitz extract_pdf_pages with the loader registry's PdfLoader so the RLM corpus gains its Tesseract OCR fallback for scanned/image-only PDFs and stays consistent with the ingest path. _pages_for now regroups the loader's per-page sections into the per-page list the engine cites into: paginated sections group by 1-based page, non-paginated sections stay one-per-section. Delete extract_pdf_pages and the orphaned fitz import.
Replace the extract_pdf_pages monkeypatch (the function is gone) with a fake loader registered in the registry for SourceKind.pdf. Add coverage for per-page section grouping (2-page doc -> 2 strings), multiple sections joined on the same page, empty-page skipping, the non-paginated (page is None) per-section split, the raw_text single-page fallback, and a source with no extractable text being skipped.
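The regrouping rules listed in these two commits can be sketched as follows. The `Section` dataclass, the `pages_for` signature, the blank-line join, and the ordering of paginated before non-paginated sections are all assumptions for illustration; the real `_pages_for` and the loader's section type may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Section:
    """Hypothetical loader output: a text span with an optional 1-based page."""
    text: str
    page: Optional[int] = None


def pages_for(sections: List[Section], raw_text: str = "") -> List[str]:
    """Regroup loader sections into the per-page list the engine cites into."""
    by_page: Dict[int, List[str]] = {}
    unpaginated: List[str] = []
    for section in sections:
        text = section.text.strip()
        if not text:
            continue  # sections with no text contribute nothing
        if section.page is None:
            unpaginated.append(text)  # non-paginated: one entry per section
        else:
            by_page.setdefault(section.page, []).append(text)
    # Sections that share a page are joined into a single string.
    pages = ["\n\n".join(by_page[p]) for p in sorted(by_page)]
    pages.extend(unpaginated)
    # Fall back to the raw text as a single page when no section had text.
    if not pages and raw_text.strip():
        pages = [raw_text.strip()]
    return pages
```

An empty result corresponds to the case where a source has no extractable text and is skipped.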
…se-loader refactor(query): RLM corpus reuses the PdfLoader (gains OCR) instead of raw fitz
…tream-shared refactor(web): share the answer-SSE stream generator across both controllers
…ild-death to TERMINATED
…session death tests
The dead-sandbox degradation broke before appending a tool_result for the tool_use that just ran (and any later tool_use in the same turn), leaving a dangling tool_use block. The fallback chat_raw then sent an assistant tool_use with no matching tool_result in the next user message, which the real Messages API rejects with HTTP 400. Append an is_error tool_result for every unanswered tool_use so the transcript stays well-formed before falling through.
The terminated-degradation test only checked local control flow and masked the HTTP 400 dangling-tool_use bug. Add _assert_tool_uses_answered, which verifies every tool_use block in an assistant turn is answered by a matching tool_result in the next user message, and apply it to the terminated test plus a new multi-tool case where an earlier tool_use in the same turn would also be left dangling. Both assertions fail against the pre-fix engine and pass with it.
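A sketch of both halves of this fix, assuming transcripts are plain dicts in the Messages API shape (`tool_use` blocks carry an `id`, `tool_result` blocks carry a `tool_use_id`). The function names and signatures are hypothetical stand-ins; the actual engine code and the `_assert_tool_uses_answered` test helper may be structured differently.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]
Message = Dict[str, Any]


def close_dangling_tool_uses(
    assistant_blocks: List[Block], results: List[Block], reason: str
) -> List[Block]:
    """Return results plus an is_error tool_result for each unanswered tool_use."""
    answered = {r["tool_use_id"] for r in results}
    closed = list(results)
    for block in assistant_blocks:
        if block.get("type") == "tool_use" and block["id"] not in answered:
            closed.append({
                "type": "tool_result",
                "tool_use_id": block["id"],
                "content": reason,
                "is_error": True,
            })
    return closed


def assert_tool_uses_answered(messages: List[Message]) -> None:
    """Fail if any assistant tool_use lacks a tool_result in the next user turn."""
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or isinstance(msg["content"], str):
            continue
        pending = {b["id"] for b in msg["content"] if b.get("type") == "tool_use"}
        if not pending:
            continue
        answered = set()
        if i + 1 < len(messages):
            nxt = messages[i + 1]
            if nxt["role"] == "user" and not isinstance(nxt["content"], str):
                answered = {
                    b.get("tool_use_id")
                    for b in nxt["content"]
                    if b.get("type") == "tool_result"
                }
        missing = pending - answered
        assert not missing, f"dangling tool_use ids: {sorted(missing)}"
```

Checking the transcript shape, and not only local control flow, is what lets a test catch the HTTP 400 case without calling the real API.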
…tness fix(query): RLM sandbox inactivity timeout + graceful child-death handling
Final sandbox-mode run (rlm-final): Answer-Correctness 0.510 (> champion 0.497, > hybrid RAG 0.434), 0 sandbox failures, ~34s/query. Documents the ~84min/184-filing ingestion cost, which is embedding-dominated and unused by RLM (an RLM-only ingest could skip embedding).
docs: RLM FinanceBench 50/50 benchmark + ingestion time
Rename rlm-benchmark.md -> rlm-vs-rag-benchmark.md; match the experiments README format; RLM vs hybrid vector RAG across both datasets with retrieval + generation metrics, time (incl. the embedding-heavy vector ingest vs RLM's lazy no-ingest), and cost; PageIndex omitted (RLM-vs-RAG view).
…mark docs: RLM vs RAG benchmark (FinanceBench 50/50 + full)
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
Summary
Makes RLM (Recursive Language Model) the default answer mode in flycanon (FLYCANON_ANSWER_MODE=rlm); the legacy hybrid-RAG path is preserved as a deprecated opt-in (=rag, with a warning log + X-Flycanon-Deprecation header). RLM reasons over whole documents in a sandboxed CodeAct REPL instead of retrieving chunks. Delivered as ~25 small PRs onto this branch.

What's included
- CanonDocStore whole-doc corpus reusing flycanon's PdfLoader (OCR for scanned PDFs); full filter parity incl. knowledge_item_ids → source resolution.
- exec runs in a scrubbed-env, resource-limited child process with docs/llm/rlm/final as capability-RPC to the parent: no secrets/network/infra in the child; escape blast-radius = the in-scope corpus only. Includes an inactivity-timeout + graceful child-death fix.
- canon_cost_events.
- /query, /query/stream (per-turn status frames), agent endpoints; structured no-answer; richer page-faithful citations.
- 26.6.18.

Benchmark: FinanceBench 50/50 + full (doc: docs/rlm-vs-rag-benchmark.md)

RLM vs best RAG (azure-large-exp-sonnet, the top vector config by Answer-Correctness), head-to-head on both datasets:

RLM leads answer quality on both datasets with zero embedding ingest; RAG's higher Faithfulness is the snippet/abstention artifact (see doc).
Production-verified: re-running RLM 50/50 inside the shipped flycanon with the default subprocess sandbox reproduced the result (AC 0.510, median latency 32.5 s) with 0/81 sandbox failures — the security sandbox costs nothing in answer quality.
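For readers unfamiliar with the subprocess sandbox mentioned above, here is a deliberately small, POSIX-only sketch of two of its ingredients: a scrubbed environment and a resource limit on the child. It is not flycanon's executor. The function name, the 5-second CPU cap, and the `PATH` value are assumptions, and the capability-RPC channel and network isolation are not shown.

```python
import resource
import subprocess
import sys


def _limit_child() -> None:
    # Runs in the child between fork and exec: cap CPU time (hypothetical 5 s).
    resource.setrlimit(resource.RLIMIT_CPU, (5, 5))


def run_sandboxed(code: str, timeout: float = 10.0) -> subprocess.CompletedProcess:
    """Run code in a child interpreter that inherits none of the parent's env."""
    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* vars
        env={"PATH": "/usr/bin:/bin"},       # scrubbed: no parent secrets leak through
        preexec_fn=_limit_child,             # POSIX only; avoid in threaded parents
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

Because the child starts with an explicit environment, credentials held by the parent process are simply absent rather than filtered out after the fact.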
Ingestion time: ~84 min for 184 filings (~27.5 s/filing). It is embedding-dominated (azure:text-embedding-3-large), a cost that benefits hybrid RAG but which the RLM answer path never uses, so an RLM-only ingest could skip embedding for a large speedup (noted as future work).

Notes
main).