
Improve ingestion pipeline #49

@shredEngineer


We could streamline the ingestion pipeline, which is currently implemented in archive_agent/data/FileData and the /archive_agent/data/loaders subpackage.

Options:

  • Markitdown
  • marker

Goals:

  • Support more file types
  • Improve ingestion quality, e.g. retain PDF hierarchy (could also tweak currently used prompt)
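To make the "more file types" goal concrete, here is a minimal sketch of an extension-based loader registry. All names (`LOADERS`, `register_loader`, `load_file`, `load_plaintext`) are hypothetical and not taken from the existing loaders subpackage; it only illustrates how a new file type could be added without touching the dispatch code.

```python
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: maps a lowercase file extension to a loader function.
LOADERS: Dict[str, Callable[[Path], str]] = {}


def register_loader(*extensions: str):
    """Decorator that registers a loader for one or more file extensions."""
    def decorator(func: Callable[[Path], str]) -> Callable[[Path], str]:
        for ext in extensions:
            LOADERS[ext.lower()] = func
        return func
    return decorator


@register_loader(".txt", ".md")
def load_plaintext(path: Path) -> str:
    """Illustrative loader: read the file as UTF-8 text."""
    return path.read_text(encoding="utf-8")


def load_file(path: Path) -> str:
    """Dispatch to the loader registered for the file's extension."""
    loader = LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported file type: {path.suffix!r}")
    return loader(path)
```

With this shape, supporting another format would mean adding one decorated function; whether that fits the current FileData design still needs checking.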

For Markitdown there are some points to consider:

  • PDF extraction seems to use pdfminer rather than pymupdf; pdfminer has been reported to be slower.
  • The image description feature seems very basic compared to combined OCR and entity extraction, so Archive Agent's native method as currently implemented seems to be superior.
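The speed difference could be checked on our own documents rather than relying on reports. Below is a small, library-agnostic timing helper as a sketch; `time_extractor` is a hypothetical name, and the extractor callables (for example thin wrappers around pdfminer and pymupdf text extraction) would have to be supplied separately.

```python
import time
from statistics import median
from typing import Callable


def time_extractor(extract: Callable[[str], str], path: str, repeats: int = 3) -> float:
    """Return the median wall-clock seconds for extract(path) over several runs."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        extract(path)  # result is discarded; only the duration matters here
        durations.append(time.perf_counter() - start)
    return median(durations)
```

Running it with both extractors on the same set of PDFs would give a rough comparison; it measures wall-clock time only and says nothing about extraction quality.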

For marker, I have to do similar research.


ESSENTIAL:

Make sure the new module is thread-safe.

The currently used pymupdf module is not thread-safe, which is a performance hit.
PDFs are currently handled differently in IngestionManager due to this.
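For reference, one generic way to keep a loader that is not thread-safe inside a threaded pipeline is to serialize its calls behind a lock. This is only a sketch under that assumption, not how IngestionManager currently handles PDFs; `SerializedLoader` and `ingest_all` are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


class SerializedLoader:
    """Wrap a loader that is not thread-safe so only one call runs at a time."""

    def __init__(self, loader: Callable[[str], str]) -> None:
        self._loader = loader
        self._lock = threading.Lock()

    def __call__(self, path: str) -> str:
        # The lock guarantees that calls into the wrapped loader never overlap.
        with self._lock:
            return self._loader(path)


def ingest_all(paths: Iterable[str], loader: Callable[[str], str],
               max_workers: int = 4) -> List[str]:
    """Run the loader over all paths in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(loader, paths))
```

The lock keeps things correct but removes parallelism for that loader, which is exactly the performance hit described above. A thread-safe replacement module would not need the wrapper at all.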


Also check out MinerU: https://github.com/opendatalab/MinerU

Labels: enhancement