Improve Support for Query Execution on PDFs #267

## Overview
While we do support query execution over PDFs, we currently rely on the `pypdf` library for parsing PDF documents into raw text. As we've learned from folks in industry (and academia), this simple parsing rarely works well for documents containing tables and figures. More advanced PDF processing libraries exist (e.g. [Marker](https://github.com/datalab-to/marker)), and PZ should be able to explore the tradeoff between more complex PDF processing (which yields better outputs at higher latency) and simpler PDF processing (e.g. `pypdf`, which yields simple outputs at low latency).

## Acceptance Criteria
- Modify the `pz.PDFFileDataset` class so that it no longer extracts the raw text from the PDF; this logic will move into the scan operator instead
- Create multiple physical operators for scanning PDF file(s), one for each PDF processor (start with `pypdf` and `marker-pdf`)
- Enable Abacus to optimize over the choice of PDF processor depending on whether the user is optimizing for quality or latency

Speak with @mdr223 before implementing the Abacus support, as he will have tips / suggestions for what needs to be done here.
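
To make the acceptance criteria concrete, here is a rough sketch of one possible shape for the per-processor scan operators. It is not PZ's actual operator API: `PDFScanOp`, `PDF_SCAN_OPS`, and `choose_pdf_scan_op` are hypothetical names, the quality and latency numbers are placeholders rather than measurements, and the simple max/min selection only stands in for whatever cost-based choice Abacus ends up making. The `marker-pdf` call pattern follows the Marker README and should be checked against the installed version.

```python
from dataclasses import dataclass
from typing import Callable, Dict


def parse_with_pypdf(path: str) -> str:
    """Fast, text-only extraction. pypdf is imported lazily so the
    module loads even when the library is not installed."""
    from pypdf import PdfReader

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_with_marker(path: str) -> str:
    """Slower, layout-aware extraction. Call pattern taken from the
    Marker README (assumption: verify against the installed version)."""
    from marker.converters.pdf import PdfConverter
    from marker.models import create_model_dict
    from marker.output import text_from_rendered

    converter = PdfConverter(artifact_dict=create_model_dict())
    text, _, _ = text_from_rendered(converter(path))
    return text


@dataclass(frozen=True)
class PDFScanOp:
    """Hypothetical physical scan operator, one instance per PDF processor."""
    name: str
    parse: Callable[[str], str]
    est_quality: float    # placeholder estimate in [0, 1], not measured
    est_latency_s: float  # placeholder seconds per document, not measured

    def __call__(self, path: str) -> dict:
        # The dataset hands over only the file path; parsing happens here.
        return {"filename": path, "text": self.parse(path)}


# Hypothetical registry the optimizer could enumerate over.
PDF_SCAN_OPS: Dict[str, PDFScanOp] = {
    "pypdf": PDFScanOp("pypdf", parse_with_pypdf, 0.5, 0.1),
    "marker-pdf": PDFScanOp("marker-pdf", parse_with_marker, 0.9, 10.0),
}


def choose_pdf_scan_op(objective: str, ops: Dict[str, PDFScanOp] = PDF_SCAN_OPS) -> PDFScanOp:
    """Stand-in for the optimizer's choice: best operator for one objective."""
    if objective == "quality":
        return max(ops.values(), key=lambda op: op.est_quality)
    if objective == "latency":
        return min(ops.values(), key=lambda op: op.est_latency_s)
    raise ValueError(f"unknown objective: {objective!r}")
```

With these placeholder estimates, `choose_pdf_scan_op("latency")` selects the `pypdf` operator and `choose_pdf_scan_op("quality")` selects the `marker-pdf` one. In a real implementation the estimates would presumably come from the optimizer's own cost model rather than hard-coded constants.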
