Open
Labels: enhancement (New feature or request)
Description
We could streamline the ingestion pipeline, which is currently implemented in archive_agent/data/FileData and the /archive_agent/data/loaders subpackage.
Options:
- Markitdown
- marker
Goals:
- Support more file types
- Improve ingestion quality, e.g. retain PDF hierarchy (could also tweak currently used prompt)
For Markitdown there are some points to consider:
- PDF extraction seems to use pdfminer instead of pymupdf; pdfminer has been reported to be slower.
- The image description feature seems very basic compared to combined OCR and entity extraction, so Archive Agent's native method as currently implemented seems to be superior.
For marker, I have to do similar research.
ESSENTIAL:
Make sure the new module is thread-safe.
The currently used pymupdf module is not thread-safe, which is a performance hit.
Because of this, PDFs are currently handled differently from other file types in IngestionManager.
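To illustrate the thread-safety constraint, here is a minimal sketch of one possible workaround: calls into a loader that is not thread-safe are serialized behind a lock, while all other files are loaded in parallel. This is an assumption for discussion, not the actual IngestionManager logic; `ingest_all`, `load_pdf_serialized`, and the loader parameters are hypothetical names.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

# Hypothetical global lock guarding every call into the non-thread-safe loader.
_pdf_lock = threading.Lock()


def load_pdf_serialized(path: str, loader: Callable[[str], str]) -> str:
    """Run a non-thread-safe loader so that only one call is active at a time."""
    with _pdf_lock:
        return loader(path)


def ingest_all(
    paths: List[str],
    thread_safe_loader: Callable[[str], str],
    unsafe_pdf_loader: Callable[[str], str],
    max_workers: int = 4,
) -> Dict[str, str]:
    """Load all files in a thread pool, serializing only the PDF calls."""
    def ingest_one(path: str) -> str:
        if path.lower().endswith(".pdf"):
            return load_pdf_serialized(path, unsafe_pdf_loader)
        return thread_safe_loader(path)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        results = list(pool.map(ingest_one, paths))
    return dict(zip(paths, results))
```

A thread-safe replacement module would make the lock and the PDF special case unnecessary, which is the point of the requirement above.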
Also check out MinerU: https://github.com/opendatalab/MinerU