
RLM as the default answer mode (RAG deprecated) #43

Merged: miguelgfierro merged 178 commits into main from feat/rlm-integration on Jun 18, 2026

@miguelgfierro (Contributor) commented Jun 18, 2026

Summary

Makes RLM (Recursive Language Model) the default answer mode in flycanon (FLYCANON_ANSWER_MODE=rlm). The legacy hybrid-RAG path is preserved as a deprecated opt-in (FLYCANON_ANSWER_MODE=rag), which logs a warning and sets an X-Flycanon-Deprecation header. Instead of retrieving chunks, RLM reasons over whole documents in a sandboxed CodeAct REPL. Delivered as ~25 small PRs merged onto this branch.
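
For reviewers who want the mode switch in concrete terms, here is a minimal sketch of how the setting could be resolved. Only the FLYCANON_ANSWER_MODE variable, its rlm/rag values, the warning log, and the X-Flycanon-Deprecation header name come from this PR; the function name, logger name, and header value are assumptions for illustration and may not match the merged code.

```python
import logging
import os

logger = logging.getLogger("flycanon.answer_mode")  # hypothetical logger name

DEFAULT_MODE = "rlm"
DEPRECATED_MODES = {"rag"}
DEPRECATION_HEADER = "X-Flycanon-Deprecation"


def resolve_answer_mode(env=None):
    """Return (mode, extra_headers) for the configured answer mode.

    Hypothetical helper: defaults to RLM, and flags the legacy RAG mode
    with a warning log plus a deprecation header.
    """
    env = os.environ if env is None else env
    mode = env.get("FLYCANON_ANSWER_MODE", DEFAULT_MODE).strip().lower()
    if mode not in {DEFAULT_MODE, *DEPRECATED_MODES}:
        raise ValueError(f"unsupported FLYCANON_ANSWER_MODE: {mode!r}")

    headers = {}
    if mode in DEPRECATED_MODES:
        logger.warning("answer mode %r is deprecated; the default is %r",
                       mode, DEFAULT_MODE)
        # The header value below is an assumption; the PR only names the header.
        headers[DEPRECATION_HEADER] = f"answer mode '{mode}' is deprecated"
    return mode, headers
```

Returning the headers alongside the mode keeps the deprecation signal in one place, so each endpoint only has to merge them into its response.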

What's included

  • Engine + corpus: ported RLM engine; original documents persisted on ingest (ObjectStore: LocalFs/S3); CanonDocStore whole-doc corpus reusing flycanon's PdfLoader (OCR for scanned PDFs); full filter parity incl. knowledge_item_ids → source resolution.
  • Performance: lazy corpus load + tiered in-proc-LRU/Redis page cache + Anthropic prompt caching — query latency ~114 s → ~32–34 s.
  • Security sandbox (default): each turn's model-written exec runs in a scrubbed-env, resource-limited child process with docs/llm/rlm/final as capability-RPC to the parent — no secrets/network/infra in the child; escape blast-radius = the in-scope corpus only. Includes an inactivity-timeout + graceful child-death fix.
  • Observability: per-query token/cost recorded to canon_cost_events.
  • Parity + UX: /query, /query/stream (per-turn status frames), agent endpoints; structured no-answer; richer page-faithful citations.
  • Version 26.6.18.
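
The tiered page cache from the Performance item can be pictured roughly as below. This is a sketch under stated assumptions, not the shipped implementation: `TieredPageCache`, `DictRemote`, and `load_page` are hypothetical names, and a plain dict-backed object stands in for Redis so the example runs without a server.

```python
from collections import OrderedDict


class DictRemote:
    """In-memory stand-in for the shared (Redis) tier; hypothetical."""

    def __init__(self):
        self.data = {}

    def get(self, key):
        return self.data.get(key)

    def set(self, key, value):
        self.data[key] = value


class TieredPageCache:
    """In-process LRU in front of a shared tier, with lazy loading on miss."""

    def __init__(self, remote, load_page, max_local=256):
        self._local = OrderedDict()   # tier 1: per-process LRU
        self._remote = remote         # tier 2: shared cache with get/set
        self._load_page = load_page   # slow path: extract the page text
        self._max_local = max_local

    def get(self, key):
        if key in self._local:
            self._local.move_to_end(key)  # mark as most recently used
            return self._local[key]
        value = self._remote.get(key)
        if value is None:
            value = self._load_page(key)  # only runs on a miss in both tiers
            self._remote.set(key, value)
        self._local[key] = value
        while len(self._local) > self._max_local:
            self._local.popitem(last=False)  # evict least recently used
        return value
```

A page evicted from the local LRU is still served from the shared tier, so the expensive extraction runs at most once per page across workers.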

Benchmark — FinanceBench 50/50 + full (doc: docs/rlm-vs-rag-benchmark.md)

RLM vs best RAG (azure-large-exp-sonnet — the top vector config by Answer-Correctness), head-to-head on both datasets:

| Metric | 50/50: RLM | 50/50: best RAG | full: RLM | full: best RAG |
| --- | --- | --- | --- | --- |
| Answer Correctness (RAGAS) | 0.497 | 0.434 | 0.501 | 0.422 |
| Answer Relevancy (RAGAS) | 0.775 | 0.674 | 0.781 | 0.686 |
| Contains Answer (custom) | 0.774 | 0.703 | 0.811 | 0.689 |
| Faithfulness (RAGAS) | 0.202 | 0.305 | 0.190 | 0.315 |
| Ingest time | none (lazy) | ~1h 16m | none (lazy) | ~2h 36m |
| Est. cost / run | ~$14 | ~$5.0 | ~$25 | ~$6.5 |

RLM leads answer quality on both datasets with zero embedding ingest; RAG's higher Faithfulness is the snippet/abstention artifact (see doc).

Production-verified: re-running RLM 50/50 inside the shipped flycanon with the default subprocess sandbox reproduced the result (AC 0.510, median latency 32.5 s) with 0/81 sandbox failures — the security sandbox costs nothing in answer quality.
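
To make the subprocess sandbox more concrete, the sketch below runs a snippet in a child process with a scrubbed environment and resource limits. It is a simplified illustration and not the flycanon executor: `run_sandboxed` and its defaults are hypothetical, it is POSIX-only (it relies on the `resource` module), and it omits the docs/llm/rlm/final capability-RPC channel and the inactivity timeout described in the summary.

```python
import resource
import subprocess
import sys


def run_sandboxed(code, timeout_s=10.0, cpu_seconds=5, memory_bytes=1 << 30):
    """Run `code` in a child Python process with a scrubbed env and rlimits."""

    def limit_resources():
        # Runs in the child just before exec: cap CPU time and address space.
        wanted_limits = (
            (resource.RLIMIT_CPU, cpu_seconds),
            (resource.RLIMIT_AS, memory_bytes),
        )
        for kind, wanted in wanted_limits:
            _, hard = resource.getrlimit(kind)
            # Never request more than the existing hard limit allows.
            limit = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
            resource.setrlimit(kind, (limit, limit))

    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* vars
        env={"PATH": "/usr/bin:/bin"},       # nothing from the parent env leaks through
        capture_output=True,
        text=True,
        timeout=timeout_s,                   # raises subprocess.TimeoutExpired
        preexec_fn=limit_resources,
    )
```

Because the child receives an explicit `env`, credentials held in the parent's environment are not visible to model-written code; anything the child legitimately needs would have to be requested from the parent.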

Ingestion time: ~84 min for 184 filings (~27.5 s/filing). It is embedding-dominated (azure:text-embedding-3-large) — a cost that benefits hybrid RAG but which the RLM answer path never uses, so an RLM-only ingest could skip embedding for a large speedup (noted as future work).

Notes

  • All CI checks green. Awaiting maintainer approval — do not merge without sign-off (only the maintainer merges to main).

miguelgfierro and others added 30 commits June 12, 2026 14:29
get_by_content_sha256 was querying without tenant/workspace filter,
causing sources from other workspaces to be treated as idempotent
hits. 3072-d ingests into -large workspaces were deduped against 1536-d
data in the default workspace, leaving canon_chunk_vectors_3072 empty.

Also explicitly pass Azure credentials in docker-compose environment
so the worker correctly resolves them at startup.
An absolute key escaped the LocalFs root because Path(root) / '/abs'
discards the left operand; the S3 backend passed an absolute-style key
straight through with an empty prefix. Add absolute-key rejection tests
for both backends, and harden LocalFsObjectStore._path to reject
leading-slash keys and verify the resolved path stays within the root.
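
The escape described in this commit is easy to reproduce: on POSIX, `Path(root) / '/abs'` evaluates to `/abs`. A minimal sketch of the hardening follows; `resolve_key` is a hypothetical stand-in for `LocalFsObjectStore._path`, whose real signature is not shown in this PR.

```python
from pathlib import Path


def resolve_key(root, key):
    """Map an object key to a path that is guaranteed to stay under root."""
    if not key or key.startswith("/"):
        raise ValueError(f"object key must be relative: {key!r}")
    root_path = Path(root).resolve()
    candidate = (root_path / key).resolve()
    # Also catches '..' segments and symlinks that resolve outside the root.
    if not candidate.is_relative_to(root_path):  # Python 3.9+
        raise ValueError(f"object key escapes the store root: {key!r}")
    return candidate
```

The leading-slash check alone covers the reported bug; the containment check after `resolve()` is the second layer the commit message mentions.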
With an empty prefix an absolute-style key was passed straight to S3 as
the object key, breaking the shared tenant/workspace key layout. Apply
the same leading-slash rejection as the LocalFs backend.
feat(storage): ObjectStore port with LocalFs + S3 backends
…tkey

feat(sources): add canon_sources.object_store_key column + migration
feat(query): port the RLM engine (CodeAct REPL + LLM client) into flycanon
…inals

feat(sources): persist original documents on ingest for RLM
miguelgfierro and others added 22 commits June 18, 2026 13:17
feat(query): make the sandboxed subprocess the default RLM executor
Replace the bespoke fitz extract_pdf_pages with the loader registry's
PdfLoader so the RLM corpus gains its Tesseract OCR fallback for
scanned/image-only PDFs and stays consistent with the ingest path.
_pages_for now regroups the loader's per-page sections into the per-page
list the engine cites into: paginated sections group by 1-based page,
non-paginated sections stay one-per-section. Delete extract_pdf_pages
and the orphaned fitz import.
Replace the extract_pdf_pages monkeypatch (the function is gone) with a
fake loader registered in the registry for SourceKind.pdf. Add coverage
for per-page section grouping (2-page doc -> 2 strings), multiple
sections joined on the same page, empty-page skipping, the non-paginated
(page is None) per-section split, the raw_text single-page fallback, and
a source with no extractable text being skipped.
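
The regrouping rules covered by these two commits can be sketched as follows. The `Section` dataclass and `pages_for` function are illustrative stand-ins rather than the real `_pages_for`; in particular, how a mix of paginated and non-paginated sections is ordered is an assumption (document order is kept here).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Section:
    """Hypothetical stand-in for one loader section."""
    text: str
    page: Optional[int] = None  # 1-based page number, or None if not paginated


def pages_for(sections, raw_text=""):
    """Regroup loader sections into the per-page list the engine cites into."""
    pages = []          # each entry: the text fragments of one output page
    slot_for_page = {}  # page number -> index into `pages`
    for section in sections:
        text = section.text.strip()
        if not text:
            continue  # skip empty pages and sections
        if section.page is None:
            pages.append([text])  # non-paginated: one entry per section
        elif section.page in slot_for_page:
            pages[slot_for_page[section.page]].append(text)  # same page: join
        else:
            slot_for_page[section.page] = len(pages)
            pages.append([text])
    if pages:
        return ["\n\n".join(parts) for parts in pages]
    fallback = raw_text.strip()
    # raw_text becomes a single page; an empty list means "skip this source".
    return [fallback] if fallback else []
```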
…se-loader

refactor(query): RLM corpus reuses the PdfLoader (gains OCR) instead of raw fitz
…tream-shared

refactor(web): share the answer-SSE stream generator across both controllers
The dead-sandbox degradation broke before appending a tool_result for the
tool_use that just ran (and any later tool_use in the same turn), leaving a
dangling tool_use block. The fallback chat_raw then sent an assistant tool_use
with no matching tool_result in the next user message, which the real Messages
API rejects with HTTP 400. Append an is_error tool_result for every unanswered
tool_use so the transcript stays well-formed before falling through.
The terminated-degradation test only checked local control flow and masked the
HTTP 400 dangling-tool_use bug. Add _assert_tool_uses_answered, which verifies
every tool_use block in an assistant turn is answered by a matching tool_result
in the next user message, and apply it to the terminated test plus a new
multi-tool case where an earlier tool_use in the same turn would also be left
dangling. Both assertions fail against the pre-fix engine and pass with it.
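
The invariant behind this fix is that every tool_use block in an assistant turn must be answered by a tool_result in the next user message. The sketch below checks it and builds the is_error results; it assumes the transcript is a list of plain dicts in the Messages API block format, and both helper names are hypothetical (the PR's own test helper is `_assert_tool_uses_answered`, whose body is not shown here).

```python
def unanswered_tool_uses(messages):
    """Return ids of tool_use blocks with no tool_result in the next user turn."""
    missing = []
    for index, message in enumerate(messages):
        content = message["content"]
        if message["role"] != "assistant" or isinstance(content, str):
            continue
        used = [b["id"] for b in content if b.get("type") == "tool_use"]
        if not used:
            continue
        answered = set()
        if index + 1 < len(messages):
            following = messages[index + 1]
            if following["role"] == "user" and not isinstance(following["content"], str):
                answered = {
                    b["tool_use_id"]
                    for b in following["content"]
                    if b.get("type") == "tool_result"
                }
        missing.extend(tool_id for tool_id in used if tool_id not in answered)
    return missing


def error_results_for(tool_use_ids, reason):
    """Build is_error tool_result blocks so the transcript stays well-formed."""
    return [
        {"type": "tool_result", "tool_use_id": tool_id,
         "is_error": True, "content": reason}
        for tool_id in tool_use_ids
    ]
```

Running the check immediately before the fallback call is what turns the HTTP 400 into a local, testable failure.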
…tness

fix(query): RLM sandbox inactivity timeout + graceful child-death handling
Final sandbox-mode run (rlm-final): Answer-Correctness 0.510 (> champion
0.497, > hybrid RAG 0.434), 0 sandbox failures, ~34s/query. Documents the
~84min/184-filing ingestion cost, which is embedding-dominated and unused
by RLM (an RLM-only ingest could skip embedding).
miguelgfierro and others added 2 commits June 18, 2026 16:10
docs: RLM FinanceBench 50/50 benchmark + ingestion time
Rename rlm-benchmark.md -> rlm-vs-rag-benchmark.md; match the experiments
README format; RLM vs hybrid vector RAG across both datasets with retrieval
+ generation metrics, time (incl. the embedding-heavy vector ingest vs RLM's
lazy no-ingest), and cost; PageIndex omitted (RLM-vs-RAG view).
miguelgfierro and others added 2 commits June 18, 2026 16:16
…mark

docs: RLM vs RAG benchmark (FinanceBench 50/50 + full)
Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>
@miguelgfierro miguelgfierro merged commit 781e95e into main Jun 18, 2026
7 checks passed
@miguelgfierro miguelgfierro deleted the feat/rlm-integration branch June 18, 2026 14:58