RLM as the default answer mode (RAG deprecated) by miguelgfierro · Pull Request #43 · firefly-operationOS/flycanon

miguelgfierro · 2026-06-18T01:56:50Z

Summary

Makes RLM (Recursive Language Model) the default answer mode in flycanon (FLYCANON_ANSWER_MODE=rlm); the legacy hybrid-RAG path is preserved as a deprecated opt-in (=rag, with a warning log + X-Flycanon-Deprecation header). RLM reasons over whole documents in a sandboxed CodeAct REPL instead of retrieving chunks. Delivered as ~25 small PRs onto this branch.
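As a rough illustration of the switch described above, the sketch below resolves the answer mode from `FLYCANON_ANSWER_MODE`, defaulting to `rlm` and flagging `rag` as deprecated. Only the variable name, the two mode values, and the `X-Flycanon-Deprecation` header name come from this description; the function name `resolve_answer_mode`, the logger name, and the header value are assumptions for illustration, not flycanon's actual code.

```python
import logging
import os

logger = logging.getLogger("flycanon.answer_mode")  # hypothetical logger name

DEPRECATION_HEADER = "X-Flycanon-Deprecation"  # header name from the PR description


def resolve_answer_mode(env=None):
    """Return (mode, extra_response_headers) for a query.

    Hypothetical helper: defaults to "rlm"; "rag" still works but is
    logged as deprecated and tagged with a response header.
    """
    env = os.environ if env is None else env
    mode = env.get("FLYCANON_ANSWER_MODE", "rlm").strip().lower()
    if mode == "rlm":
        return mode, {}
    if mode == "rag":
        logger.warning("FLYCANON_ANSWER_MODE=rag is deprecated; prefer rlm")
        # The header value is an assumption; the source only names the header.
        return mode, {DEPRECATION_HEADER: "answer-mode=rag"}
    raise ValueError(f"unsupported FLYCANON_ANSWER_MODE: {mode!r}")
```

Rejecting unknown values instead of silently falling back is a design choice of this sketch; the PR does not say how flycanon treats them.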

What's included

Engine + corpus: ported RLM engine; original documents persisted on ingest (ObjectStore: LocalFs/S3); CanonDocStore whole-doc corpus reusing flycanon's PdfLoader (OCR for scanned PDFs); full filter parity incl. knowledge_item_ids → source resolution.
Performance: lazy corpus load + tiered in-proc-LRU/Redis page cache + Anthropic prompt caching — query latency ~114 s → ~32–34 s.
Security sandbox (default): each turn's model-written exec runs in a scrubbed-env, resource-limited child process with docs/llm/rlm/final as capability-RPC to the parent — no secrets/network/infra in the child; escape blast-radius = the in-scope corpus only. Includes an inactivity-timeout + graceful child-death fix.
Observability: per-query token/cost recorded to canon_cost_events.
Parity + UX: /query, /query/stream (per-turn status frames), agent endpoints; structured no-answer; richer page-faithful citations.
Version 26.6.18.
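To make the "scrubbed-env, resource-limited child process" item more concrete, here is a minimal sketch of the general technique using only the Python standard library on a POSIX host. It is not flycanon's executor: the names `run_in_scrubbed_child` and `_limit_child` and the specific limits are assumptions, and the capability-RPC channel (docs/llm/rlm/final), memory limits, and network isolation are deliberately left out.

```python
import resource  # POSIX only
import subprocess
import sys


def _limit_child() -> None:
    """Runs in the child just before exec: cap CPU seconds (soft limit only)."""
    soft_cap = 5  # illustrative value, not taken from the PR
    _, hard = resource.getrlimit(resource.RLIMIT_CPU)
    if hard != resource.RLIM_INFINITY:
        soft_cap = min(soft_cap, hard)
    resource.setrlimit(resource.RLIMIT_CPU, (soft_cap, hard))


def run_in_scrubbed_child(code: str, timeout_s: float = 10.0) -> subprocess.CompletedProcess:
    """Execute model-written code in a child that inherits no parent environment."""
    return subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores PYTHON* vars
        env={},              # scrubbed: parent API keys and DSNs are not passed on
        capture_output=True,
        text=True,
        timeout=timeout_s,   # wall-clock cap enforced by the parent
        preexec_fn=_limit_child,
    )
```

A secret set in the parent, for example `os.environ["FAKE_SECRET"]`, is not visible to code run this way. Note that scrubbing the environment does not by itself block network access; that would need additional OS-level isolation.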

Benchmark — FinanceBench 50/50 + full (doc: `docs/rlm-vs-rag-benchmark.md`)

RLM vs best RAG (azure-large-exp-sonnet — the top vector config by Answer-Correctness), head-to-head on both datasets:

| Metric | 50/50: RLM | 50/50: best RAG | full: RLM | full: best RAG |
| --- | --- | --- | --- | --- |
| Answer Correctness (RAGAS) | 0.497 | 0.434 | 0.501 | 0.422 |
| Answer Relevancy (RAGAS) | 0.775 | 0.674 | 0.781 | 0.686 |
| Contains Answer (custom) | 0.774 | 0.703 | 0.811 | 0.689 |
| Faithfulness (RAGAS) | 0.202 | 0.305 | 0.190 | 0.315 |
| Ingest time | none (lazy) | ~1h 16m | none (lazy) | ~2h 36m |
| Est. cost / run | ~$14 | ~$5.0 | ~$25 | ~$6.5 |

RLM leads answer quality on both datasets with zero embedding ingest; RAG's higher Faithfulness is the snippet/abstention artifact (see doc).
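For readers who want the trade-off as numbers, the short calculation below derives the relative Answer-Correctness gain and the cost ratio from the figures in the table. The values are copied from the table; the helper name `relative_gain` is only for this illustration.

```python
# (RLM, best RAG) pairs copied from the benchmark table.
answer_correctness = {"50/50": (0.497, 0.434), "full": (0.501, 0.422)}
est_cost_usd = {"50/50": (14.0, 5.0), "full": (25.0, 6.5)}


def relative_gain(rlm: float, rag: float) -> float:
    """Fractional improvement of RLM over RAG."""
    return (rlm - rag) / rag


for dataset, (rlm, rag) in answer_correctness.items():
    rlm_cost, rag_cost = est_cost_usd[dataset]
    print(
        f"{dataset}: AC gain {relative_gain(rlm, rag):+.1%}, "
        f"cost ratio {rlm_cost / rag_cost:.1f}x"
    )
```

This works out to roughly a 14.5% relative gain at about 2.8x the run cost on 50/50, and roughly 18.7% at about 3.8x on the full set, before accounting for the ingest time RAG requires.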

Production-verified: re-running RLM 50/50 inside the shipped flycanon with the default subprocess sandbox reproduced the result (AC 0.510, median latency 32.5 s) with 0/81 sandbox failures — the security sandbox costs nothing in answer quality.

Ingestion time: ~84 min for 184 filings (~27.5 s/filing). It is embedding-dominated (azure:text-embedding-3-large) — a cost that benefits hybrid RAG but which the RLM answer path never uses, so an RLM-only ingest could skip embedding for a large speedup (noted as future work).

Notes

All CI checks green. Awaiting maintainer approval — do not merge without sign-off (only the maintainer merges to main).

get_by_content_sha256 queried without a tenant/workspace filter, so sources from other workspaces were treated as idempotent hits. As a result, 3072-d ingests into -large workspaces were deduplicated against 1536-d data in the default workspace, leaving canon_chunk_vectors_3072 empty. Also pass Azure credentials explicitly in the docker-compose environment so the worker resolves them correctly at startup.
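The dedup bug is easiest to see with a toy in-memory lookup. The sketch below is not the repository code: the `Source` dataclass and the function signature are assumptions, and the real function presumably issues a database query. It only shows why the hash match has to be scoped to the tenant and workspace.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Source:  # hypothetical stand-in for a canon_sources row
    tenant: str
    workspace: str
    content_sha256: str


def get_by_content_sha256(
    sources: Iterable[Source], tenant: str, workspace: str, sha: str
) -> Optional[Source]:
    """Scoped lookup: a hash match only counts inside the same tenant/workspace."""
    for s in sources:
        if (s.tenant, s.workspace, s.content_sha256) == (tenant, workspace, sha):
            return s
    return None


existing = [Source("acme", "default", "abc123")]
# Same bytes, different workspace: not an idempotent hit, so the ingest proceeds.
assert get_by_content_sha256(existing, "acme", "large", "abc123") is None
# Same bytes, same workspace: a genuine duplicate.
assert get_by_content_sha256(existing, "acme", "default", "abc123") is existing[0]
```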

An absolute key escaped the LocalFs root because Path(root) / '/abs' discards the left operand; the S3 backend passed an absolute-style key straight through with an empty prefix. Add absolute-key rejection tests for both backends, and harden LocalFsObjectStore._path to reject leading-slash keys and verify the resolved path stays within the root.
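The pathlib behaviour behind this escape, and one way to guard against it, can be sketched as follows. `safe_path` is a hypothetical stand-in for `LocalFsObjectStore._path`; the actual method may differ in naming, error types, and details.

```python
from pathlib import Path, PurePosixPath

# The pitfall: joining with an absolute right-hand side discards the root.
assert PurePosixPath("/srv/store") / "/etc/passwd" == PurePosixPath("/etc/passwd")


def safe_path(root: Path, key: str) -> Path:
    """Map an object key to a path under root, rejecting escapes."""
    if key.startswith("/"):
        raise ValueError(f"absolute keys are not allowed: {key!r}")
    root = root.resolve()
    candidate = (root / key).resolve()
    # Second line of defence: catches "../" traversal as well.
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise ValueError(f"key escapes the store root: {key!r}")
    return candidate
```

Checking the resolved path in addition to rejecting the leading slash matters because a relative key such as `../other-tenant/file.pdf` would pass the first check but still leave the root.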

Replace the bespoke fitz extract_pdf_pages with the loader registry's PdfLoader so the RLM corpus gains its Tesseract OCR fallback for scanned/image-only PDFs and stays consistent with the ingest path. _pages_for now regroups the loader's per-page sections into the per-page list the engine cites into: paginated sections group by 1-based page, non-paginated sections stay one-per-section. Delete extract_pdf_pages and the orphaned fitz import.
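A minimal sketch of the regrouping described above, assuming each loader section carries its text and an optional 1-based page number. The `Section` dataclass, the newline join, and the ordering of paginated before non-paginated entries are assumptions for illustration; the real `_pages_for` may handle these differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Section:  # hypothetical stand-in for the loader's section type
    text: str
    page: Optional[int] = None  # 1-based page number, or None if not paginated


def pages_for(sections: List[Section]) -> List[str]:
    """Regroup loader sections into the per-page strings the engine cites into."""
    by_page: Dict[int, List[str]] = {}
    unpaged: List[str] = []
    for sec in sections:
        if not sec.text.strip():
            continue  # skip sections with no extractable text
        if sec.page is None:
            unpaged.append(sec.text)  # non-paginated: one entry per section
        else:
            by_page.setdefault(sec.page, []).append(sec.text)
    # Sections that share a page are joined into a single string for that page.
    paged = ["\n".join(by_page[p]) for p in sorted(by_page)]
    return paged + unpaged
```

For instance, two sections on page 1 and one on page 2 yield two strings, while two non-paginated sections yield one string each.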

Replace the extract_pdf_pages monkeypatch (the function is gone) with a fake loader registered in the registry for SourceKind.pdf. Add coverage for per-page section grouping (2-page doc -> 2 strings), multiple sections joined on the same page, empty-page skipping, the non-paginated (page is None) per-section split, the raw_text single-page fallback, and a source with no extractable text being skipped.

The dead-sandbox degradation broke before appending a tool_result for the tool_use that just ran (and any later tool_use in the same turn), leaving a dangling tool_use block. The fallback chat_raw then sent an assistant tool_use with no matching tool_result in the next user message, which the real Messages API rejects with HTTP 400. Append an is_error tool_result for every unanswered tool_use so the transcript stays well-formed before falling through.
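The repair can be sketched in a few lines against the Messages API content-block shapes (`tool_use` blocks with an `id`, `tool_result` blocks with a `tool_use_id` and optional `is_error`). The helper name `answer_dangling_tool_uses` and the error text are assumptions; this shows the idea, not the engine's actual code.

```python
from typing import Any, Dict, List

Block = Dict[str, Any]


def answer_dangling_tool_uses(
    assistant_blocks: List[Block],
    tool_results: List[Block],
    reason: str = "sandbox terminated",
) -> List[Block]:
    """Return tool_results plus an is_error result for each unanswered tool_use."""
    answered = {r["tool_use_id"] for r in tool_results}
    completed = list(tool_results)
    for block in assistant_blocks:
        if block.get("type") == "tool_use" and block["id"] not in answered:
            completed.append(
                {
                    "type": "tool_result",
                    "tool_use_id": block["id"],
                    "is_error": True,
                    "content": reason,
                }
            )
    return completed
```

The returned list becomes the content of the next user message, so every `tool_use` in the assistant turn has a matching `tool_result` before the fallback request is sent.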

The terminated-degradation test only checked local control flow and masked the HTTP 400 dangling-tool_use bug. Add _assert_tool_uses_answered, which verifies every tool_use block in an assistant turn is answered by a matching tool_result in the next user message, and apply it to the terminated test plus a new multi-tool case where an earlier tool_use in the same turn would also be left dangling. Both assertions fail against the pre-fix engine and pass with it.
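A transcript check along these lines might look like the sketch below. It is written from the description of `_assert_tool_uses_answered` rather than from the test file, so the name without the leading underscore, the message shapes, and the error messages are assumptions.

```python
from typing import Any, Dict, List

Message = Dict[str, Any]


def assert_tool_uses_answered(messages: List[Message]) -> None:
    """Every tool_use in an assistant turn needs a tool_result in the next user turn."""
    for i, msg in enumerate(messages):
        if msg["role"] != "assistant" or isinstance(msg["content"], str):
            continue
        wanted = {b["id"] for b in msg["content"] if b.get("type") == "tool_use"}
        if not wanted:
            continue
        nxt = messages[i + 1] if i + 1 < len(messages) else None
        assert nxt is not None and nxt["role"] == "user", (
            f"turn {i}: tool_use with no following user message"
        )
        blocks = nxt["content"] if isinstance(nxt["content"], list) else []
        got = {b["tool_use_id"] for b in blocks if b.get("type") == "tool_result"}
        missing = wanted - got
        assert not missing, f"turn {i}: unanswered tool_use ids {sorted(missing)}"
```

Running a check like this over the transcript the engine is about to send catches the dangling `tool_use` locally, without needing the real API to return the 400.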

Final sandbox-mode run (rlm-final): Answer-Correctness 0.510 (> champion 0.497, > hybrid RAG 0.434), 0 sandbox failures, ~34s/query. Documents the ~84min/184-filing ingestion cost, which is embedding-dominated and unused by RLM (an RLM-only ingest could skip embedding).

Rename rlm-benchmark.md -> rlm-vs-rag-benchmark.md; match the experiments README format; RLM vs hybrid vector RAG across both datasets with retrieval + generation metrics, time (incl. the embedding-heavy vector ingest vs RLM's lazy no-ingest), and cost; PageIndex omitted (RLM-vs-RAG view).

miguelgfierro and others added 30 commits June 12, 2026 14:29

docs: RLM integration design spec

c7ab775

Merge pull request #30 from firefly-operationOS/docs/rlm-integration-…

76951dc

…spec docs: RLM integration design spec

feat(storage): add ObjectStore port

70aeb68

feat(storage): add LocalFsObjectStore backend

cc2aa6b

feat(storage): add S3ObjectStore backend behind optional s3 extra

ee950e8

feat(storage): add object-store factory and FLYCANON_OBJECT_STORE_* s…

1c8d436

…ettings

style(storage): apply ruff format to s3 backend

b30ac5d

test(storage): unit tests for ObjectStore backends and factory

0f1a8df

Reject absolute keys in S3ObjectStore._full_key

96c0164

With an empty prefix an absolute-style key was passed straight to S3 as the object key, breaking the shared tenant/workspace key layout. Apply the same leading-slash rejection as the LocalFs backend.

Document absolute-key rejection in ObjectStore port contract

4cc6131

Merge pull request #31 from firefly-operationOS/feat/rlm-objectstore

6089262

feat(storage): ObjectStore port with LocalFs + S3 backends

feat(sources): add object_store_key column to SourceRow model

b30f0e0

feat(migrations): add 0015 adding canon_sources.object_store_key

bd4636c

test(sources): cover 0015 object_store_key migration + model default

d6b2f27

feat(config): add FLYCANON_RLM_* engine settings

b660d6f

feat(query): add synchronous Anthropic client for the RLM engine

be05b9e

feat(query): add RLMSession CodeAct REPL engine (corpus-agnostic)

98d54d0

Merge pull request #32 from firefly-operationOS/feat/rlm-source-objec…

201328e

…tkey feat(sources): add canon_sources.object_store_key column + migration

test(rlm): unit-test the RLM client and CodeAct REPL with mocked LLM

0d00ef2

Merge pull request #33 from firefly-operationOS/feat/rlm-engine

5c0eb0f

feat(query): port the RLM engine (CodeAct REPL + LLM client) into flycanon

feat(config): add FLYCANON_STORE_ORIGINALS setting

44725ef

feat(configuration): add object_store bean provider

eb7b6b0

feat(sources): persist original document bytes on submit/replace

8b91efb

test(intake): pass object_store to IntakeService constructor

fd4a1ec

test(intake): cover original-document persistence on submit/replace

b026b32

feat(query): CanonDocStore + async corpus builder for the RLM engine

5a0a391

test(rlm): cover CanonDocStore + builder with fake store/repo

bade92c

Merge pull request #35 from firefly-operationOS/feat/rlm-persist-orig…

1b404be

…inals feat(sources): persist original documents on ingest for RLM

miguelgfierro and others added 22 commits June 18, 2026 13:17

docs(architecture): add RLM execution sandbox subsection

b5173d3

Merge pull request #59 from firefly-operationOS/feat/rlm-sandbox-default

c6328c1

feat(query): make the sandboxed subprocess the default RLM executor

feat(web): add shared answer-SSE stream generator

df3e563

feat(web): include RLM per-turn bridge in shared answer-SSE generator

de6b89f

refactor(web): delegate user-tier stream route to shared generator

7a40342

refactor(web): delegate agent-tier stream route to shared generator

8a50026

test(web): cover shared answer-SSE generator directly (RLM/RAG/error)

aeda806

Merge pull request #60 from firefly-operationOS/refactor/rlm-corpus-u…

1a06b77

…se-loader refactor(query): RLM corpus reuses the PdfLoader (gains OCR) instead of raw fitz

refactor(web): drop unnecessary __all__ from answer_stream

9904af2

Merge pull request #61 from firefly-operationOS/refactor/sse-answer-s…

bd4b6ea

…tream-shared refactor(web): share the answer-SSE stream generator across both controllers

fix(query): make RLM sandbox timeout an inactivity timeout + route ch…

13d6062

…ild-death to TERMINATED

fix(query): RLM session degrades to plain-text answer when sandbox ch…

ad7a903

…ild dies

test(query): update sandbox tests for terminated outcome on child death

34b83bb

test(query): inactivity-timeout progress, silent-child kill, and mid-…

047ad8c

…session death tests

test(query): session degrades to plain-text answer when sandbox child…

31c87d0

… dies mid-query

style: ruff-format new sandbox test

f533da3

Merge pull request #62 from firefly-operationOS/fix/rlm-sandbox-robus…

825f3e2

…tness fix(query): RLM sandbox inactivity timeout + graceful child-death handling

miguelgfierro mentioned this pull request Jun 18, 2026

docs: RLM FinanceBench 50/50 benchmark + ingestion time #63

Merged

miguelgfierro and others added 2 commits June 18, 2026 16:10

Merge pull request #63 from firefly-operationOS/docs/rlm-benchmark

3f0e6e3

docs: RLM FinanceBench 50/50 benchmark + ingestion time

miguelgfierro mentioned this pull request Jun 18, 2026

docs: RLM vs RAG benchmark (FinanceBench 50/50 + full) #64

Merged

miguelgfierro and others added 2 commits June 18, 2026 16:16

Merge pull request #64 from firefly-operationOS/docs/rlm-vs-rag-bench…

8a7a76c

…mark docs: RLM vs RAG benchmark (FinanceBench 50/50 + full)

docs: add RLM-vs-best-RAG summary table to benchmark doc (#65)

6acb2d9

Co-authored-by: miguelgfierro <miguelgfierro@users.noreply.github.com>

miguelgfierro merged commit 781e95e into main Jun 18, 2026
7 checks passed

miguelgfierro deleted the feat/rlm-integration branch June 18, 2026 14:58

miguelgfierro merged 178 commits into main from feat/rlm-integration
