Overview
While we do support query execution over PDFs, we currently rely on the `pypdf` library for parsing PDF documents into raw text. As we've learned from folks in industry (and academia), this simple parsing rarely works well for documents containing tables and figures. More advanced PDF processing libraries exist (e.g. Marker), and PZ should be able to explore the tradeoff between more complex PDF processing (which yields better outputs at higher latency) and simpler PDF processing (e.g. `pypdf`, which yields simple outputs at low latency).
Acceptance Criteria
- Modify the `pz.PDFFileDataset` class so that it no longer extracts the raw text from the PDF; this logic will move into the scan operator instead
- Create multiple physical operators for scanning PDF file(s), one for each PDF processor (start with `pypdf` and `marker-pdf`)
- Enable Abacus to optimize over the choice of PDF processor depending on whether the user is optimizing for quality or latency
Speak with @mdr223 before implementing the Abacus support, as he will have tips / suggestions for what needs to be done here.
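
To make the criteria above concrete, here is a rough sketch of what "one scan operator per PDF processor" plus a quality/latency choice could look like. It is not PZ's or Abacus's actual operator interface: `PDFScanOp`, `CANDIDATES`, `choose_scan_op`, and the quality/latency numbers are all hypothetical placeholders. The `pypdf` call uses `PdfReader` and `extract_text()`; the Marker call follows the usage shown in the marker-pdf 1.x README as best I recall it and may differ between versions, so treat it as an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def parse_with_pypdf(path: str) -> str:
    """Cheap, low-latency parse: concatenate the raw text of each page."""
    from pypdf import PdfReader  # imported lazily so the sketch loads without pypdf

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_with_marker(path: str) -> str:
    """Slower, layout-aware parse. ASSUMPTION: marker-pdf 1.x README-style API."""
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())
    rendered = converter(path)
    text, _, _ = text_from_rendered(rendered)
    return text


@dataclass(frozen=True)
class PDFScanOp:
    """Hypothetical physical scan operator: one instance per PDF processor."""
    name: str
    parse: Callable[[str], str]
    est_quality: float    # placeholder estimate, not a measured value
    est_latency_s: float  # placeholder estimate, not a measured value


# Placeholder estimates that only encode the ordering described in the overview:
# Marker is assumed to give better output at higher latency than pypdf.
CANDIDATES = [
    PDFScanOp("pypdf", parse_with_pypdf, est_quality=0.5, est_latency_s=0.1),
    PDFScanOp("marker-pdf", parse_with_marker, est_quality=0.9, est_latency_s=5.0),
]


def choose_scan_op(objective: str, candidates: Sequence[PDFScanOp] = CANDIDATES) -> PDFScanOp:
    """Pick the scan operator that best matches the user's optimization objective."""
    if objective == "quality":
        return max(candidates, key=lambda op: op.est_quality)
    if objective == "latency":
        return min(candidates, key=lambda op: op.est_latency_s)
    raise ValueError(f"unsupported objective: {objective!r}")
```

In this sketch the dataset would hand the scan operator a file path rather than extracted text, and the chosen operator's `parse` would do the extraction. In practice the optimizer would presumably estimate quality and latency from sampled executions rather than hard-coded constants.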