feat(ingestion): add swappable DocumentParser boundary with PARSER selection#25
Merged
Conversation
…undary (US-036–045, ADR-0007) Epic C of the Phase-2 PRD. Formalizes the implicit docling function seam into a one-method `DocumentParser` protocol + `build_parser` factory, mirroring the existing `Reranker` / `WebSearchProvider` patterns, and ships one real commercial adapter (LlamaParse) plus an authoring guide. - US-036: verify the docling seam is clean (I5) — docling confined to parsing.py, chunk_text consumes only str. Enforced by test_parsing_seam.py. - US-037: `DocumentParser` ABC — `parse(raw, filename, content_type) -> str`, normalized-markdown-string output contract pinned (v1). - US-038: `DoclingParser` default adapter wrapping today's docling + pypdfium2 fallback; behavior unchanged. - US-039: `PARSER=docling|unstructured|llamaparse` selection via `get_parser_name` + `build_parser`; resolved once at startup so a bad value / missing key fails closed at boot, not first ingest. main.py routes through get_selected_parser(). - US-040: real `LlamaParseParser` (upload → poll → fetch markdown over httpx, no SDK dep); errors surface as documents.error_message like docling. - US-041: live round-trip smoke test, opt-in by LLAMA_CLOUD_API_KEY, SKIPs (exit 0) without the key so keyless CI stays green. - US-042: parser output feeds chunk_text with no parser-specific branch — the markdown str is the only coupling. Proven by test_parser_chunker_contract.py. - US-043: docs/ingestion-parser-adapters.md "write your own adapter" guide. - US-044: F3 capability-matrix rows (OCR-not-default, markdown-string ceiling, unstructured-is-BYO), each citing ADR-0007. - US-045: future structured-output widening documented as a deliberate non-goal seam (ADR-0007 + CONTEXT + guide). Verified: 30 parser unit tests pass; keyless smoke test SKIPs; new code adds no mypy errors (the 2 pre-existing parsing.py errors are unchanged). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…fix poll/doc pointers
…maparse md:null join
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Contributor
Retrieval eval — PR vs
|
| Mode | recall@5 | MRR | nDCG@5 |
|---|---|---|---|
| vector | 0.860 (±0.000) | 0.772 (±0.000) | 0.779 (±0.000) |
| keyword | 0.110 (±0.000) | 0.120 (±0.000) | 0.112 (±0.000) |
| hybrid | 0.860 (±0.000) | 0.759 (±0.000) | 0.769 (±0.000) |
Per-category recall@5
| Mode | single_chunk | multi_hop | adversarial | paraphrase |
|---|---|---|---|---|
| vector | 0.900 (±0.000) | 0.933 (±0.000) | 0.600 (±0.000) | 1.000 (±0.000) |
| keyword | 0.250 (±0.000) | 0.033 (±0.000) | 0.000 (±0.000) | 0.000 (±0.000) |
| hybrid | 0.900 (±0.000) | 0.933 (±0.000) | 0.600 (±0.000) | 1.000 (±0.000) |
Comment is updated in place on each push by .github/workflows/retrieval-eval.yml (US-035). Comment-only — never blocks the build.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Intent
Working through a Phase 2 implementation PRD, the developer wanted to advance a sequence of related user stories (US-036 through US-039) that progressively turn the document-ingestion parsing seam into a clean, swappable parser boundary. For each story they first asked the assistant to explain it "to a beginner," then explicitly asked it to implement the story, and continued one story at a time. The stories aimed to: verify and formalize that docling is confined to a single module behind a plain-Markdown-string seam (US-036), define a DocumentParser abstract base class mirroring the existing Reranker/WebSearchProvider pattern (US-037), wrap the existing docling logic in a concrete DoclingParser adapter while keeping behavior identical (US-038), and add a PARSER environment variable with get_parser_name/build_parser factory functions wired into ingestion (US-039). The developer stressed constraints such as preserving existing behavior by default, failing loudly and early at startup on misconfiguration (never silently falling back to docling), and matching the codebase's house conventions for tests, ADRs, and documentation. They also asked the assistant to tick off the acceptance-criteria checkboxes in the PRD as each story was completed.
What Changed
DocumentParserABC (modeled on the existingReranker/WebSearchProviderpattern) and refactored the existing docling logic into aDoclingParseradapter, confining docling behind a plain-Markdown-string seam so the chunker stays string-only; default ingestion behavior is unchanged.PARSERenv var withget_parser_name/build_parserfactory functions wired into the ingest endpoint that fail loudly at startup on an unknown/misconfigured parser (no silent docling fallback), plus aLlamaParseParseradapter (LlamaParse Cloud API) selectable viaPARSER.asyncio.to_thread, serialized doclingconvert()under a lock, guarded warmup; added ADR-0007, adapter docs,.env.exampleentries, and parser protocol/selection/contract/adapter test suites with multi-format fixtures.Risk Assessment
✅ Low: The change is well-bounded and all prior-round concurrency/correctness concerns are correctly resolved in HEAD; the only remaining issue is a cosmetic stale documentation line pointer.
Testing
Ran the four story-level validation suites for the intent (US-036 seam, US-037 DocumentParser ABC, US-038 DoclingParser adapter, US-039 PARSER factory) plus the downstream boundary stories that shipped in the same commit (US-040 stubbed LlamaParse adapter, US-042 parser→chunker contract, US-041 keyless smoke-skip) — all green, with US-038 confirming the docling adapter produces byte-identical output to the legacy path across real PDF/DOCX/HTML/MD/TXT fixtures. To show the behavior the way an operator experiences it rather than as pass/fail counts, I also drove the public boundary entry points the app uses at startup and on ingest and captured a CLI transcript demonstrating the PARSER config switch, the fail-loud-on-misconfiguration (bogus value at the startup hook; missing key at build) with no silent docling fallback, and a real document flowing PARSER→factory→adapter→parse→chunk_text into non-empty markdown chunks. No UI surface is involved (backend ingestion boundary), so no screenshot/visual artifact applies; the reviewer-visible artifact is the transcript. The worktree stayed clean (only gitignored pycache touched), so no transient artifacts needed removal.
Evidence: Parser boundary end-to-end CLI transcript (operator experience)
1) Default (PARSER unset): get_parser_name()->'docling'; get_selected_parser()->DoclingParser; parse('sample.md')->206 chars markdown ('# Ingestion Boundary Sample'); chunk_text->1 str chunk 2) Default parser on real PDF -> 201 chars, contains 'Ingestion Boundary Sample': True (in-adapter pypdfium2 fallback) 3) PARSER=bogus -> get_selected_parser() ValueError: PARSER must be one of docling|unstructured|llamaparse, got 'bogus' (no silent docling fallback) 4) PARSER=llamaparse, no key -> build_parser ValueError: PARSER=llamaparse requires LLAMA_CLOUD_API_KEY (fails at build, not first ingest) 5) PARSER=llamaparse + key -> LlamaParseParser; is DoclingParser? False; is LlamaParseParser? True (config switch, never docling)Evidence: Demo script driving the public boundary entry points (get_selected_parser/build_parser/parse/chunk_text)
Pipeline
Updates from git push no-mistakes
✅ **intent** - passed
✅ No issues found.
✅ **Rebase** - passed
✅ No issues found.
backend/main.py:1589-get_selected_parser().parse(...)runs synchronously on the asyncio event loop inside the asyncingest_documenthandler, with noasyncio.to_thread/executor offload. ForPARSER=llamaparse,LlamaParseParser.parseissues blocking synchttpx.Clientcalls andtime.sleep(poll_interval)polling bounded bymax_polls*poll_interval(~300s by default, parsing.py:315). That blocks the whole event loop, freezing every other request (health checks, other ingests, chat) for the entire parse duration whenever llamaparse is selected. Offload the blocking parse to a worker thread, e.g.text = await asyncio.to_thread(get_selected_parser().parse, raw, filename=..., content_type=...).backend/parsing.py:474-warmup()unconditionally builds the doclingDocumentConverterregardless ofPARSER. WhenPARSER=llamaparse, startup still pays docling's converter/backend init (and the updated main.py:382 comment now claims it front-loads "the document parser" generically, which is inaccurate — it only warms docling). Guard warmup to the docling-selected path, or delegate to the selected parser, so a non-docling deployment doesn't initialize an unused heavy converter.backend/parsing.py:320-_await_terminalchecks status at the top of each loop iteration, so the status fetched on the final iteration (line 320) is never evaluated before the timeoutraiseat line 321. It performsmax_pollsGETs but only inspectsmax_pollsstatuses counting the upload's — the last fetch is wasted, and a job that reaches a success state on the final poll is reported as a spurious timeout. Restructure to check status immediately after each fetch.CONTEXT.md:36- Stale code pointer: the new text referencesparse_document(...) -> stratbackend/parsing.py:112, but after this refactorparse_documentis at line 428 (line 112 is now inside_make_converter'sformat_options). Update the line reference (or drop it) so the navigation pointer matches.backend/parsing.py:364- The accepted parser names are spelled out three times: theParserNameLiteral (line 353), the validation tuple inget_parser_name(line 364), and the error-message string (line 366). Deriving the tuple fromtyping.get_args(ParserName)would keep them in sync. Minor DRY/simplification; the duplicates are co-located so drift risk is low.🔧 Fix: offload llamaparse parse, guard docling warmup, fix poll/doc pointers
2 issues (1 warning, 1 info) still open:
backend/main.py:1591- The Round-1 fix wrapsget_selected_parser().parseinasyncio.to_thread. This also changes docling's concurrency model: docling parses were previously serialized by the single-threaded event loop, but now multiple concurrent ingests run_get_converter().convert()(parsing.py:190) on the SAME module-level_convertersingleton (parsing.py:120-127) from different worker threads. The singleton comment ('sharing across requests is safe', parsing.py:118-119) was written under the old serial assumption; docling's DocumentConverter/pipeline holds per-format model state and is not documented as thread-safe for concurrentconvert(). Startupwarmup()pre-builds the converter so the lazy-init race is mostly moot, but concurrent uploads can still race on shared pipeline state, risking corrupted output or crashes. Confirm docling's concurrency guarantees, or serialize docling parses (e.g. a Semaphore/lock around the docling convert).backend/parsing.py:340- In_fetch_markdown's page-join fallback,p.get("md", "")returns the""default only when the key is absent. If a page payload carries an explicit"md": null, the value isNone, and"\n\n".join(...)raisesTypeErrorinstead of the intended boundary-contractValueError('not a string'). Harden to(p.get("md") or "").🔧 Fix: serialize docling convert under lock, harden llamaparse md:null join
1 info still open:
CONTEXT.md:36- Stale code pointer re-introduced by the round-3 commit (5f59af2). CONTEXT.md:36 referencesparse_document(...) -> stratbackend/parsing.py:430, butparse_documentis now defined at line 439 — line 430 isreturn _selected_parserinsideget_selected_parser. The round-3 changes added ~9 lines aboveparse_document(thethreadingimport, the expanded_convert_locksingleton comment, the_convert_lockdeclaration, and thewith _convert_lock:wrapper), shifting it from 430 to 439 without updating the pointer. The siblingchunk_textpointer (backend/chunking.py:109) is still correct. Update 430 → 439.✅ **Test** - passed
✅ No issues found.
python3 -m backend.test_parsing_seam(US-036: docling confined to parsing.py; chunker is str-only) — 3 checks passpython3 -m backend.test_parsing_protocol(US-037: DocumentParser ABC abstract; parse(bytes,str,str|None)->str) — 4 checks passpython3 -m backend.test_docling_parser(US-038: DoclingParser parses real PDF/DOCX/HTML/MD/TXT fixtures byte-identical to legacy parse_document) — 6 checks passpython3 -m backend.test_parser_selection(US-039: PARSER env + build_parser factory, fail-loud, no silent docling fallback) — 7 checks passpython3 -m backend.test_llamaparse_parser(US-040 stubbed adapter) — 7 checks passpython3 -m backend.test_parser_chunker_contract(US-042 parser→chunker single coupling) — 3 checks passenv -u LLAMA_CLOUD_API_KEY PARSER=llamaparse python3 -m backend.test_llamaparse_smoke(US-041 keyless skip) — SKIP, exit 0Manual end-to-end transcript via the public boundary entry points (get_selected_parser → .parse → chunk_text): default→docling on real .md/.pdf, PARSER=bogus and keyless llamaparse fail loud, keyed llamaparse builds LlamaParseParser (never docling)✅ **Document** - passed
✅ No issues found.
✅ **Lint** - passed
✅ No issues found.
✅ **Push** - passed
✅ No issues found.