Add source-type aware extraction and avoid binary/text fallbacks in extracted.md #4

@jfraser

Summary

wiki_capture_source would be more robust with a source-type-aware extraction pipeline. Today, the fallback behavior can write non-markdown content into extracted.md, including raw PDF bytes or raw XML.

Problem

The packet contract says:

raw/sources/SRC-*/extracted.md — normalized markdown text

But current behavior can produce:

  • raw PDF bytes, if MarkItDown times out and the curl fallback succeeds,
  • raw XML, for local .xml files,
  • potentially raw HTML or other text-like formats, written without normalization.

Suggested design

Use a typed extraction pipeline:

  1. Identify source type:
    • URL extension
    • Content-Type
    • file extension
    • magic bytes
  2. Route to appropriate extractor:
    • PDF → download original, run MarkItDown/PDF extractor
    • HTML → readability/markdown extraction
    • XML → XML-to-markdown or project-provided converter
    • Markdown/text → copy as-is
    • binary/unknown → write clear extraction failure message
  3. Record extraction metadata in manifest:
    • extractor
    • extraction_status: success | failed | timeout | unsupported
    • content_type
    • original_file
    • optional error message
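
The three steps above could be sketched roughly as follows. This is a minimal illustration, not the project's actual code: detect_source_type, ExtractionRecord, EXTRACTORS, and extract are hypothetical names, the lookup tables are deliberately small, and the precedence (magic bytes, then Content-Type, then extension) is an assumption. Only the pass-through route is wired up; the PDF, HTML, and XML converters are left out so that those types fall through to a failure note.

```python
from __future__ import annotations

import os
from dataclasses import asdict, dataclass
from typing import Callable, Optional
from urllib.parse import urlparse

# Hypothetical lookup tables; a real pipeline would cover more types.
CONTENT_TYPES = {
    "application/pdf": "pdf",
    "text/html": "html",
    "application/xml": "xml",
    "text/xml": "xml",
    "text/markdown": "text",
    "text/plain": "text",
}
EXTENSIONS = {".pdf": "pdf", ".html": "html", ".htm": "html",
              ".xml": "xml", ".md": "text", ".txt": "text"}


def detect_source_type(location: str, head: bytes = b"",
                       content_type: Optional[str] = None) -> str:
    """Step 1: return 'pdf', 'html', 'xml', 'text', or 'unknown'."""
    if head.startswith(b"%PDF-"):  # PDF files begin with this signature
        return "pdf"
    if content_type:
        mime = content_type.split(";", 1)[0].strip().lower()
        if mime in CONTENT_TYPES:
            return CONTENT_TYPES[mime]
    # urlparse drops any query string; local paths pass through unchanged.
    ext = os.path.splitext(urlparse(location).path)[1].lower()
    return EXTENSIONS.get(ext, "unknown")


@dataclass
class ExtractionRecord:
    """Step 3: the manifest fields proposed in this issue."""
    extractor: str
    extraction_status: str  # success | failed | timeout | unsupported
    content_type: Optional[str]
    original_file: str
    error: Optional[str] = None


# Step 2: routing table (assumption: one callable per source type).
EXTRACTORS: dict[str, Callable[[bytes], str]] = {
    "text": lambda data: data.decode("utf-8"),
}


def extract(location: str, data: bytes,
            content_type: Optional[str] = None) -> tuple[str, dict]:
    """Return (text for extracted.md, manifest entry)."""
    kind = detect_source_type(location, data[:8], content_type)
    extractor = EXTRACTORS.get(kind)
    if extractor is None:
        status, error = "unsupported", f"no extractor for source type '{kind}'"
    else:
        try:
            text = extractor(data)
            return text, asdict(ExtractionRecord(kind, "success", content_type, location))
        except TimeoutError as exc:
            status, error = "timeout", str(exc) or "extractor timed out"
        except Exception as exc:
            status, error = "failed", f"{type(exc).__name__}: {exc}"
    # Never fall back to the raw bytes: write a readable note instead.
    note = f"> Extraction {status}: {error}\n"
    return note, asdict(ExtractionRecord(kind, status, content_type, location, error))
```

With this shape, a PDF whose converter is missing or times out yields a short note plus a manifest entry with a non-success status, rather than raw bytes in extracted.md.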

Expected behavior

extracted.md should always be human-readable markdown/text, or a clear extraction failure note. It should never contain binary bytes.
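
One way to enforce the "never binary" rule is a final guard applied just before extracted.md is written. The sketch below is an assumption about how such a guard might look; to_safe_text and failure_note are hypothetical names, and the heuristic (PDF signature, NUL bytes, invalid UTF-8) is intentionally conservative. It does not address un-normalized XML or HTML, which are valid text and need a real converter.

```python
def failure_note(source_name: str, reason: str) -> str:
    """Build the human-readable note written in place of unusable content."""
    return f"> **Extraction failed** for `{source_name}`: {reason}\n"


def to_safe_text(data: bytes, source_name: str) -> str:
    """Return decoded text, or a failure note if the bytes look binary."""
    if data.startswith(b"%PDF-"):
        return failure_note(source_name, "content is an unconverted PDF")
    if b"\x00" in data:
        return failure_note(source_name, "content contains NUL bytes and looks binary")
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return failure_note(source_name, "content is not valid UTF-8 text")
```

Because the guard runs on the final bytes, it would also catch the case where a curl fallback succeeds but returns the original PDF.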

Optional extension point

A pluggable extractor interface would let projects register domain-specific converters for structured XML or other specialized source formats.
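
A pluggable interface could be as small as a registry keyed by source type. The following is only a sketch under that assumption: register_extractor, get_extractor, and xml_outline are hypothetical names, and the XML converter is a toy that renders element tags and text as a nested list, standing in for whatever domain-specific converter a project would supply.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Optional

Extractor = Callable[[bytes], str]
_REGISTRY: Dict[str, Extractor] = {}


def register_extractor(source_type: str) -> Callable[[Extractor], Extractor]:
    """Decorator that registers a converter for one source type."""
    def decorator(fn: Extractor) -> Extractor:
        _REGISTRY[source_type] = fn
        return fn
    return decorator


def get_extractor(source_type: str) -> Optional[Extractor]:
    """Look up a converter; None means the type is unsupported."""
    return _REGISTRY.get(source_type)


@register_extractor("xml")
def xml_outline(data: bytes) -> str:
    """Toy converter: element tags and their text as a nested markdown list."""
    lines = []

    def walk(element: ET.Element, depth: int) -> None:
        text = (element.text or "").strip()
        suffix = f": {text}" if text else ""
        lines.append("  " * depth + f"- **{element.tag}**{suffix}")
        for child in element:
            walk(child, depth + 1)

    walk(ET.fromstring(data), 0)
    return "\n".join(lines) + "\n"
```

A project would then register its own converter under the same key to replace the toy one, and the core pipeline would only ever call get_extractor.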
