Skip to content

feat(ingestion): add swappable DocumentParser boundary with PARSER selection#25

Merged
hcho22 merged 4 commits into
mainfrom
feat/ingestion-parser-boundary
Jun 22, 2026
Merged

feat(ingestion): add swappable DocumentParser boundary with PARSER selection#25
hcho22 merged 4 commits into
mainfrom
feat/ingestion-parser-boundary

Conversation

@hcho22

@hcho22 hcho22 commented Jun 22, 2026

Copy link
Copy Markdown
Owner

Intent

Working through a Phase 2 implementation PRD, the developer wanted to advance a sequence of related user stories (US-036 through US-039) that progressively turn the document-ingestion parsing seam into a clean, swappable parser boundary. For each story they first asked the assistant to explain it "to a beginner," then explicitly asked it to implement the story, and continued one story at a time. The stories aimed to: verify and formalize that docling is confined to a single module behind a plain-Markdown-string seam (US-036), define a DocumentParser abstract base class mirroring the existing Reranker/WebSearchProvider pattern (US-037), wrap the existing docling logic in a concrete DoclingParser adapter while keeping behavior identical (US-038), and add a PARSER environment variable with get_parser_name/build_parser factory functions wired into ingestion (US-039). The developer stressed constraints such as preserving existing behavior by default, failing loudly and early at startup on misconfiguration (never silently falling back to docling), and matching the codebase's house conventions for tests, ADRs, and documentation. They also asked the assistant to tick off the acceptance-criteria checkboxes in the PRD as each story was completed.

What Changed

  • Introduced a DocumentParser ABC (modeled on the existing Reranker/WebSearchProvider pattern) and refactored the existing docling logic into a DoclingParser adapter, confining docling behind a plain-Markdown-string seam so the chunker stays string-only; default ingestion behavior is unchanged.
  • Added a PARSER env var with get_parser_name/build_parser factory functions wired into the ingest endpoint that fail loudly at startup on an unknown/misconfigured parser (no silent docling fallback), plus a LlamaParseParser adapter (LlamaParse Cloud API) selectable via PARSER.
  • Hardened the seam after review: offloaded the blocking parse off the asyncio event loop via asyncio.to_thread, serialized docling convert() under a lock, guarded warmup; added ADR-0007, adapter docs, .env.example entries, and parser protocol/selection/contract/adapter test suites with multi-format fixtures.

Risk Assessment

✅ Low: The change is well-bounded and all prior-round concurrency/correctness concerns are correctly resolved in HEAD; the only remaining issue is a cosmetic stale documentation line pointer.

Testing

Ran the four story-level validation suites for the intent (US-036 seam, US-037 DocumentParser ABC, US-038 DoclingParser adapter, US-039 PARSER factory) plus the downstream boundary stories that shipped in the same commit (US-040 stubbed LlamaParse adapter, US-042 parser→chunker contract, US-041 keyless smoke-skip) — all green, with US-038 confirming the docling adapter produces byte-identical output to the legacy path across real PDF/DOCX/HTML/MD/TXT fixtures. To show the behavior the way an operator experiences it rather than as pass/fail counts, I also drove the public boundary entry points the app uses at startup and on ingest and captured a CLI transcript demonstrating the PARSER config switch, the fail-loud-on-misconfiguration (bogus value at the startup hook; missing key at build) with no silent docling fallback, and a real document flowing PARSER→factory→adapter→parse→chunk_text into non-empty markdown chunks. No UI surface is involved (backend ingestion boundary), so no screenshot/visual artifact applies; the reviewer-visible artifact is the transcript. The worktree stayed clean (only gitignored pycache touched), so no transient artifacts needed removal.

Evidence: Parser boundary end-to-end CLI transcript (operator experience)

1) Default (PARSER unset): get_parser_name()->'docling'; get_selected_parser()->DoclingParser; parse('sample.md')->206 chars markdown ('# Ingestion Boundary Sample'); chunk_text->1 str chunk 2) Default parser on real PDF -> 201 chars, contains 'Ingestion Boundary Sample': True (in-adapter pypdfium2 fallback) 3) PARSER=bogus -> get_selected_parser() ValueError: PARSER must be one of docling|unstructured|llamaparse, got 'bogus' (no silent docling fallback) 4) PARSER=llamaparse, no key -> build_parser ValueError: PARSER=llamaparse requires LLAMA_CLOUD_API_KEY (fails at build, not first ingest) 5) PARSER=llamaparse + key -> LlamaParseParser; is DoclingParser? False; is LlamaParseParser? True (config switch, never docling)


======================================================================
1) Default (PARSER unset): operator gets docling; real doc -> markdown -> chunks
======================================================================
   get_parser_name()        -> 'docling'
   get_selected_parser()    -> DoclingParser(name='docling')
   parse('sample.md')       -> 206 chars of markdown, first line: '# Ingestion Boundary Sample'
   chunk_text(markdown)     -> 1 chunk(s), all str: True

======================================================================
2) Default parser on a real PDF -> non-empty markdown (in-adapter PDF fallback)
======================================================================
   parse('sample.pdf')      -> 201 chars; contains 'Ingestion Boundary Sample': True

======================================================================
3) PARSER=bogus -> startup hook FAILS LOUD (no silent docling fallback)
======================================================================
   get_selected_parser()    -> ValueError: PARSER must be one of docling|unstructured|llamaparse, got 'bogus'

======================================================================
4) PARSER=llamaparse, no key -> FAILS LOUD at build (not deferred to ingest)
======================================================================
   get_parser_name()        -> 'llamaparse' (accepted name)
   build_parser('llamaparse') -> ValueError: PARSER=llamaparse requires LLAMA_CLOUD_API_KEY

======================================================================
5) PARSER=llamaparse + key -> LlamaParseParser (config switch, never docling)
======================================================================
   build_parser('llamaparse') -> LlamaParseParser(name='llamaparse')
   is a DoclingParser?        -> False  (must be False)
   is a LlamaParseParser?     -> True

======================================================================
DONE — PARSER env -> build_parser factory -> adapter -> parse -> chunk_text, fail-loud on misconfig
======================================================================
Evidence: Demo script driving the public boundary entry points (get_selected_parser/build_parser/parse/chunk_text)
"""End-to-end demonstration of the ingestion parser boundary (US-036–US-039).

Drives the *public* boundary entry points exactly as the app does:

  * startup hook  -> get_selected_parser()      (main.py:_on_startup)
  * ingest path   -> get_selected_parser().parse(raw, ...) -> chunk_text(text)
                                                (main.py:ingest_document)

so the transcript reflects the real operator experience of flipping the
`PARSER` env var, not a bespoke test harness.

Run:  python3 /…/parser_boundary_demo.py
"""

from __future__ import annotations

import os
import sys
from pathlib import Path

REPO = Path("/Users/hcho/.no-mistakes/worktrees/9d29eaea5bc3/01KVRGDE26DNTTR94PFXFRFZ65")
sys.path.insert(0, str(REPO / "backend"))

from chunking import chunk_text  # noqa: E402
from parsing import (  # noqa: E402
    DoclingParser,
    LlamaParseParser,
    build_parser,
    get_parser_name,
    get_selected_parser,
    reset_selected_parser,
)

FIX = REPO / "backend" / "test-fixtures" / "us018"


def banner(t: str) -> None:
    print("\n" + "=" * 70)
    print(t)
    print("=" * 70)


def set_parser(**env: str) -> None:
    for k in ("PARSER", "LLAMA_CLOUD_API_KEY"):
        os.environ.pop(k, None)
    for k, v in env.items():
        os.environ[k] = v
    reset_selected_parser()


# 1. DEFAULT: PARSER unset -> docling. A real .md upload flows through the
#    SELECTED parser -> markdown -> chunk_text, exactly like the ingest path.
banner("1) Default (PARSER unset): operator gets docling; real doc -> markdown -> chunks")
set_parser()
print(f"   get_parser_name()        -> {get_parser_name()!r}")
parser = get_selected_parser()  # this is the literal startup-hook call
print(f"   get_selected_parser()    -> {type(parser).__name__}(name={parser.name!r})")
raw = (FIX / "sample.md").read_bytes()
text = parser.parse(raw, "sample.md", "text/markdown")  # ingest-path call
chunks = chunk_text(text)  # the ONLY coupling: a markdown str
print(f"   parse('sample.md')       -> {len(text)} chars of markdown, first line: {text.splitlines()[0]!r}")
print(f"   chunk_text(markdown)     -> {len(chunks)} chunk(s), all str: {all(isinstance(c, str) for c in chunks)}")

# 2. A real PDF through the same default parser (exercises the in-adapter
#    pypdfium2 fallback when docling's torch pipeline is unavailable).
banner("2) Default parser on a real PDF -> non-empty markdown (in-adapter PDF fallback)")
raw = (FIX / "sample.pdf").read_bytes()
pdf_text = get_selected_parser().parse(raw, "sample.pdf", "application/pdf")
print(f"   parse('sample.pdf')      -> {len(pdf_text)} chars; contains 'Ingestion Boundary Sample': "
      f"{'Ingestion Boundary Sample' in pdf_text}")

# 3. FAIL LOUD: an unknown PARSER value never silently falls back to docling.
banner("3) PARSER=bogus -> startup hook FAILS LOUD (no silent docling fallback)")
set_parser(PARSER="bogus")
try:
    get_selected_parser()
    print("   !! ERROR: expected a loud failure, got a parser")
except ValueError as e:
    print(f"   get_selected_parser()    -> ValueError: {e}")

# 4. FAIL LOUD AT BUILD: commercial adapter selected with no API key.
banner("4) PARSER=llamaparse, no key -> FAILS LOUD at build (not deferred to ingest)")
set_parser(PARSER="llamaparse")
print(f"   get_parser_name()        -> {get_parser_name()!r} (accepted name)")
try:
    build_parser(get_parser_name())
    print("   !! ERROR: expected a loud failure, got a parser")
except ValueError as e:
    print(f"   build_parser('llamaparse') -> ValueError: {e}")

# 5. CONFIG SWITCH: a keyed llamaparse selection builds the real adapter,
#    never docling — swapping the parser is a config switch + a key.
banner("5) PARSER=llamaparse + key -> LlamaParseParser (config switch, never docling)")
set_parser(PARSER="llamaparse", LLAMA_CLOUD_API_KEY="llx-demo-key")
built = build_parser(get_parser_name())
print(f"   build_parser('llamaparse') -> {type(built).__name__}(name={built.name!r})")
print(f"   is a DoclingParser?        -> {isinstance(built, DoclingParser)}  (must be False)")
print(f"   is a LlamaParseParser?     -> {isinstance(built, LlamaParseParser)}")

set_parser()  # restore default
banner("DONE — PARSER env -> build_parser factory -> adapter -> parse -> chunk_text, fail-loud on misconfig")

Pipeline

Updates from git push no-mistakes

✅ **intent** - passed

✅ No issues found.

✅ **Rebase** - passed

✅ No issues found.

⚠️ **Review** - 1 info
  • ⚠️ backend/main.py:1589 - get_selected_parser().parse(...) runs synchronously on the asyncio event loop inside the async ingest_document handler, with no asyncio.to_thread/executor offload. For PARSER=llamaparse, LlamaParseParser.parse issues blocking sync httpx.Client calls and time.sleep(poll_interval) polling bounded by max_polls*poll_interval (~300s by default, parsing.py:315). That blocks the whole event loop, freezing every other request (health checks, other ingests, chat) for the entire parse duration whenever llamaparse is selected. Offload the blocking parse to a worker thread, e.g. text = await asyncio.to_thread(get_selected_parser().parse, raw, filename=..., content_type=...).
  • ℹ️ backend/parsing.py:474 - warmup() unconditionally builds the docling DocumentConverter regardless of PARSER. When PARSER=llamaparse, startup still pays docling's converter/backend init (and the updated main.py:382 comment now claims it front-loads "the document parser" generically, which is inaccurate — it only warms docling). Guard warmup to the docling-selected path, or delegate to the selected parser, so a non-docling deployment doesn't initialize an unused heavy converter.
  • ℹ️ backend/parsing.py:320 - _await_terminal checks status at the top of each loop iteration, so the status fetched on the final iteration (line 320) is never evaluated before the timeout raise at line 321. It performs max_polls GETs but only inspects max_polls statuses counting the upload's — the last fetch is wasted, and a job that reaches a success state on the final poll is reported as a spurious timeout. Restructure to check status immediately after each fetch.
  • ℹ️ CONTEXT.md:36 - Stale code pointer: the new text references parse_document(...) -> str at backend/parsing.py:112, but after this refactor parse_document is at line 428 (line 112 is now inside _make_converter's format_options). Update the line reference (or drop it) so the navigation pointer matches.
  • ℹ️ backend/parsing.py:364 - The accepted parser names are spelled out three times: the ParserName Literal (line 353), the validation tuple in get_parser_name (line 364), and the error-message string (line 366). Deriving the tuple from typing.get_args(ParserName) would keep them in sync. Minor DRY/simplification; the duplicates are co-located so drift risk is low.

🔧 Fix: offload llamaparse parse, guard docling warmup, fix poll/doc pointers
2 issues (1 warning, 1 info) still open:

  • ⚠️ backend/main.py:1591 - The Round-1 fix wraps get_selected_parser().parse in asyncio.to_thread. This also changes docling's concurrency model: docling parses were previously serialized by the single-threaded event loop, but now multiple concurrent ingests run _get_converter().convert() (parsing.py:190) on the SAME module-level _converter singleton (parsing.py:120-127) from different worker threads. The singleton comment ('sharing across requests is safe', parsing.py:118-119) was written under the old serial assumption; docling's DocumentConverter/pipeline holds per-format model state and is not documented as thread-safe for concurrent convert(). Startup warmup() pre-builds the converter so the lazy-init race is mostly moot, but concurrent uploads can still race on shared pipeline state, risking corrupted output or crashes. Confirm docling's concurrency guarantees, or serialize docling parses (e.g. a Semaphore/lock around the docling convert).
  • ℹ️ backend/parsing.py:340 - In _fetch_markdown's page-join fallback, p.get("md", "") returns the "" default only when the key is absent. If a page payload carries an explicit "md": null, the value is None, and "\n\n".join(...) raises TypeError instead of the intended boundary-contract ValueError ('not a string'). Harden to (p.get("md") or "").

🔧 Fix: serialize docling convert under lock, harden llamaparse md:null join
1 info still open:

  • ℹ️ CONTEXT.md:36 - Stale code pointer re-introduced by the round-3 commit (5f59af2). CONTEXT.md:36 references parse_document(...) -> str at backend/parsing.py:430, but parse_document is now defined at line 439 — line 430 is return _selected_parser inside get_selected_parser. The round-3 changes added ~9 lines above parse_document (the threading import, the expanded _convert_lock singleton comment, the _convert_lock declaration, and the with _convert_lock: wrapper), shifting it from 430 to 439 without updating the pointer. The sibling chunk_text pointer (backend/chunking.py:109) is still correct. Update 430 → 439.
✅ **Test** - passed

✅ No issues found.

  • python3 -m backend.test_parsing_seam (US-036: docling confined to parsing.py; chunker is str-only) — 3 checks pass
  • python3 -m backend.test_parsing_protocol (US-037: DocumentParser ABC abstract; parse(bytes,str,str|None)->str) — 4 checks pass
  • python3 -m backend.test_docling_parser (US-038: DoclingParser parses real PDF/DOCX/HTML/MD/TXT fixtures byte-identical to legacy parse_document) — 6 checks pass
  • python3 -m backend.test_parser_selection (US-039: PARSER env + build_parser factory, fail-loud, no silent docling fallback) — 7 checks pass
  • python3 -m backend.test_llamaparse_parser (US-040 stubbed adapter) — 7 checks pass
  • python3 -m backend.test_parser_chunker_contract (US-042 parser→chunker single coupling) — 3 checks pass
  • env -u LLAMA_CLOUD_API_KEY PARSER=llamaparse python3 -m backend.test_llamaparse_smoke (US-041 keyless skip) — SKIP, exit 0
  • Manual end-to-end transcript via the public boundary entry points (get_selected_parser → .parse → chunk_text): default→docling on real .md/.pdf, PARSER=bogus and keyless llamaparse fail loud, keyed llamaparse builds LlamaParseParser (never docling)
✅ **Document** - passed

✅ No issues found.

✅ **Lint** - passed

✅ No issues found.

✅ **Push** - passed

✅ No issues found.

hcho22 and others added 4 commits June 22, 2026 13:25
…undary (US-036–045, ADR-0007)

Epic C of the Phase-2 PRD. Formalizes the implicit docling function seam into a
one-method `DocumentParser` protocol + `build_parser` factory, mirroring the
existing `Reranker` / `WebSearchProvider` patterns, and ships one real
commercial adapter (LlamaParse) plus an authoring guide.

- US-036: verify the docling seam is clean (I5) — docling confined to
  parsing.py, chunk_text consumes only str. Enforced by test_parsing_seam.py.
- US-037: `DocumentParser` ABC — `parse(raw, filename, content_type) -> str`,
  normalized-markdown-string output contract pinned (v1).
- US-038: `DoclingParser` default adapter wrapping today's docling +
  pypdfium2 fallback; behavior unchanged.
- US-039: `PARSER=docling|unstructured|llamaparse` selection via `get_parser_name`
  + `build_parser`; resolved once at startup so a bad value / missing key fails
  closed at boot, not first ingest. main.py routes through get_selected_parser().
- US-040: real `LlamaParseParser` (upload → poll → fetch markdown over httpx,
  no SDK dep); errors surface as documents.error_message like docling.
- US-041: live round-trip smoke test, opt-in by LLAMA_CLOUD_API_KEY, SKIPs
  (exit 0) without the key so keyless CI stays green.
- US-042: parser output feeds chunk_text with no parser-specific branch —
  the markdown str is the only coupling. Proven by test_parser_chunker_contract.py.
- US-043: docs/ingestion-parser-adapters.md "write your own adapter" guide.
- US-044: F3 capability-matrix rows (OCR-not-default, markdown-string ceiling,
  unstructured-is-BYO), each citing ADR-0007.
- US-045: future structured-output widening documented as a deliberate
  non-goal seam (ADR-0007 + CONTEXT + guide).

Verified: 30 parser unit tests pass; keyless smoke test SKIPs; new code adds no
mypy errors (the 2 pre-existing parsing.py errors are unchanged).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@vercel

vercel Bot commented Jun 22, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agentic-rag Ready Ready Preview, Comment Jun 22, 2026 9:14pm

@github-actions

Copy link
Copy Markdown
Contributor

Retrieval eval — PR vs main

n = 50 questions × 3 modes (vector, keyword, hybrid) on a 14-chunk corpus. PR ran in 92.28s; main in 69.99s.

Headline (each cell: PR value, Δ vs main)

Mode recall@5 MRR nDCG@5
vector 0.860 (±0.000) 0.772 (±0.000) 0.779 (±0.000)
keyword 0.110 (±0.000) 0.120 (±0.000) 0.112 (±0.000)
hybrid 0.860 (±0.000) 0.759 (±0.000) 0.769 (±0.000)

Per-category recall@5

Mode single_chunk multi_hop adversarial paraphrase
vector 0.900 (±0.000) 0.933 (±0.000) 0.600 (±0.000) 1.000 (±0.000)
keyword 0.250 (±0.000) 0.033 (±0.000) 0.000 (±0.000) 0.000 (±0.000)
hybrid 0.900 (±0.000) 0.933 (±0.000) 0.600 (±0.000) 1.000 (±0.000)

Comment is updated in place on each push by .github/workflows/retrieval-eval.yml (US-035). Comment-only — never blocks the build.

@hcho22 hcho22 merged commit 706d602 into main Jun 22, 2026
3 checks passed
@hcho22 hcho22 deleted the feat/ingestion-parser-boundary branch June 22, 2026 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant