Improve Support for Query Execution on PDFs #267

@mdr223

Overview

While we do support query execution over PDFs, we currently rely on the pypdf library for parsing PDF documents into raw text. As we've learned from folks in industry (and academia), this simple parsing rarely works well for documents containing tables and figures. More advanced PDF processing libraries exist (e.g. Marker), and PZ should be able to explore the tradeoff between more complex PDF processing (which yields better outputs at higher latency) and simpler PDF processing (e.g. pypdf, which yields simple outputs at low latency).

Acceptance Criteria

  • Modify the pz.PDFFileDataset class so that it no longer extracts the raw text from the PDF; we will move this logic into the scan operator instead
  • Create multiple physical operators for scanning PDF file(s), one for each PDF processor (start with pypdf and marker-pdf)
  • Enable Abacus to optimize over the choice of PDF processor depending on whether the user is optimizing for quality or latency
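
To make the criteria above concrete, here is a minimal sketch of what per-processor scan operators and an objective-driven choice between them could look like. Everything in it is hypothetical and for discussion only: the class names (`PDFScanOp`, `PyPDFScanOp`, `MarkerScanOp`), the `OperatorEstimate` fields, the placeholder estimate values, and `choose_scan_op` are not PZ's or Abacus's actual API. The marker-pdf entry points are assumed from that project's documentation and should be checked against the installed version.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatorEstimate:
    """Hypothetical per-document estimates; the values below are placeholders, not measurements."""
    quality: float          # relative score in [0, 1], higher is better
    latency_seconds: float  # lower is better


class PDFScanOp(ABC):
    """Hypothetical physical operator that turns a PDF file path into text."""
    estimate: OperatorEstimate

    @abstractmethod
    def scan(self, path: str) -> str:
        ...


class PyPDFScanOp(PDFScanOp):
    estimate = OperatorEstimate(quality=0.5, latency_seconds=0.1)  # placeholder

    def scan(self, path: str) -> str:
        # Imported lazily so this module loads even if pypdf is not installed.
        from pypdf import PdfReader

        reader = PdfReader(path)
        return "\n".join(page.extract_text() or "" for page in reader.pages)


class MarkerScanOp(PDFScanOp):
    estimate = OperatorEstimate(quality=0.9, latency_seconds=10.0)  # placeholder

    def scan(self, path: str) -> str:
        # Assumed marker-pdf entry points; verify against the installed version.
        from marker.converters.pdf import PdfConverter
        from marker.models import create_model_dict
        from marker.output import text_from_rendered

        converter = PdfConverter(artifact_dict=create_model_dict())
        text, _, _ = text_from_rendered(converter(path))
        return text


def choose_scan_op(objective: str, candidates=None) -> PDFScanOp:
    """Pick the operator that best matches the objective ('quality' or 'latency')."""
    ops = candidates if candidates is not None else [PyPDFScanOp(), MarkerScanOp()]
    if objective == "quality":
        return max(ops, key=lambda op: op.estimate.quality)
    if objective == "latency":
        return min(ops, key=lambda op: op.estimate.latency_seconds)
    raise ValueError(f"unknown objective: {objective!r}")
```

In this sketch the dataset would only hand file paths to the chosen operator, and text extraction happens inside `scan`. A real optimizer would presumably replace the hard-coded estimates with sampled measurements.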

Speak with @mdr223 before implementing the Abacus support, as he will have tips and suggestions for what needs to be done here.

Labels

enhancement: New feature or request
