Summary
wiki_capture_source would be more robust with a source-type-aware extraction pipeline. Today, its fallback behavior can write non-markdown content into extracted.md, including raw PDF bytes or raw XML.
Problem
The packet contract says:
raw/sources/SRC-*/extracted.md — normalized markdown text
But current behavior can produce:
- raw PDF bytes if MarkItDown times out and the curl fallback succeeds,
- raw XML for local .xml files,
- potentially raw HTML or other text-like formats without normalization.
Suggested design
Use a typed extraction pipeline:
- Identify source type:
  - URL extension
  - Content-Type
  - file extension
  - magic bytes
- Route to appropriate extractor:
  - PDF → download original, run MarkItDown/PDF extractor
  - HTML → readability/markdown extraction
  - XML → XML-to-markdown or project-provided converter
  - Markdown/text → copy as-is
  - binary/unknown → write clear extraction failure message
- Record extraction metadata in manifest:
  - extractor
  - extraction_status: success | failed | timeout | unsupported
  - content_type
  - original_file
  - optional error message
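To make the three steps above concrete, here is a rough Python sketch of how detection, routing, and manifest metadata could fit together. It is only an illustration under stated assumptions: the names detect_source_type, run_extraction, ExtractionRecord, and the lookup tables are hypothetical and do not reflect the current wiki_capture_source code, and the signal order (magic bytes, then Content-Type, then extension) is one possible choice.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Callable, Dict, Optional, Tuple
from urllib.parse import urlparse

# Illustrative lookup tables (assumptions); a real pipeline would cover more types.
MAGIC_PREFIXES = [(b"%PDF-", "pdf"), (b"<?xml", "xml")]
CONTENT_TYPES = {
    "application/pdf": "pdf", "text/html": "html", "application/xml": "xml",
    "text/xml": "xml", "text/markdown": "markdown", "text/plain": "text",
}
EXTENSIONS = {
    ".pdf": "pdf", ".html": "html", ".htm": "html",
    ".xml": "xml", ".md": "markdown", ".txt": "text",
}


def detect_source_type(location: str, content_type: Optional[str], head: bytes) -> str:
    """Classify a source, trying magic bytes, then Content-Type, then extension."""
    stripped = head.lstrip()
    for prefix, kind in MAGIC_PREFIXES:
        if stripped.startswith(prefix):
            return kind
    if content_type:
        # Drop parameters such as "; charset=utf-8" before the lookup.
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type in CONTENT_TYPES:
            return CONTENT_TYPES[media_type]
    # Works for both URL paths and local file names.
    suffix = PurePosixPath(urlparse(location).path).suffix.lower()
    return EXTENSIONS.get(suffix, "unknown")


@dataclass
class ExtractionRecord:
    """Manifest fields proposed above; `error` is the optional message."""
    extractor: str
    extraction_status: str  # success | failed | timeout | unsupported
    content_type: Optional[str]
    original_file: str
    error: Optional[str] = None


# Maps a source type to (extractor name, callable turning bytes into markdown).
Extractors = Dict[str, Tuple[str, Callable[[bytes], str]]]


def run_extraction(kind: str, data: bytes, content_type: Optional[str],
                   original_file: str, extractors: Extractors):
    """Route to the extractor for `kind`; return (markdown or None, record)."""
    entry = extractors.get(kind)
    if entry is None:
        error = f"no extractor for source type {kind!r}"
        return None, ExtractionRecord("none", "unsupported", content_type, original_file, error)
    name, extract = entry
    try:
        markdown = extract(data)
    except TimeoutError as exc:
        return None, ExtractionRecord(name, "timeout", content_type, original_file, str(exc))
    except Exception as exc:  # any other extractor error is recorded, not raised
        return None, ExtractionRecord(name, "failed", content_type, original_file, str(exc))
    return markdown, ExtractionRecord(name, "success", content_type, original_file)
```

With something like this, a PDF fetched by the curl fallback would still be recognized from its %PDF- header and routed to a PDF extractor, and a timeout would surface as extraction_status: timeout in the manifest rather than as raw bytes in extracted.md.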
Expected behavior
extracted.md should always be human-readable markdown/text, or a clear extraction failure note. It should never contain binary bytes.
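One way to enforce this guarantee is a final guard applied just before extracted.md is written. The sketch below is a heuristic only, and the function names readable_text_or_note and failure_note, as well as the wording of the note, are hypothetical. It treats invalid UTF-8 or unexpected control characters as a sign of binary content and writes a failure note instead.

```python
def failure_note(source: str, reason: str) -> str:
    """Build the human-readable note written in place of unreadable content."""
    return f"# Extraction failed\n\n- source: {source}\n- reason: {reason}\n"


def readable_text_or_note(payload: bytes, source: str) -> str:
    """Return the payload as text, or a failure note if it looks binary."""
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return failure_note(source, "content is not valid UTF-8 text")
    # Control characters other than tab, newline, and carriage return
    # are treated as a sign of binary data (a heuristic, not a proof).
    if any(ord(ch) < 32 and ch not in "\t\n\r" for ch in text):
        return failure_note(source, "content contains binary control characters")
    return text
```

A guard like this would not normalize raw XML or HTML on its own; it only ensures that whatever reaches extracted.md is readable text or an explicit failure note.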
Optional extension point
A pluggable extractor interface would let projects register domain-specific converters for structured XML or other specialized source formats.
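As a rough illustration of what such an interface might look like, the sketch below uses a small decorator-based registry. Everything here is an assumption for discussion: register_extractor, get_extractor, and the sample xml_to_markdown converter are hypothetical names, and the converter is deliberately simplistic (it lists leaf elements as bullets).

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional

Extractor = Callable[[bytes], str]
_REGISTRY: Dict[str, Extractor] = {}


def register_extractor(source_type: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers a converter for one source type."""
    def decorator(func: Extractor) -> Extractor:
        _REGISTRY[source_type] = func
        return func
    return decorator


def get_extractor(source_type: str) -> Optional[Extractor]:
    """Look up the converter for a source type, or None if none is registered."""
    return _REGISTRY.get(source_type)


@register_extractor("xml")
def xml_to_markdown(data: bytes) -> str:
    """Example project converter: render each leaf element as a bullet."""
    # For untrusted input, a hardened parser such as defusedxml may be preferable.
    root = ET.fromstring(data)
    lines = [f"# {root.tag}", ""]
    for element in root.iter():
        text = (element.text or "").strip()
        if text and len(element) == 0:
            lines.append(f"- {element.tag}: {text}")
    return "\n".join(lines) + "\n"
```

A project could then register its own converter for a structured XML format under the same key, and the pipeline would only need to call get_extractor with the detected source type.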