Simplify processors #231

Draft
AZOGOAT wants to merge 16 commits into EPFLiGHT:master from AZOGOAT:task-simplify-processors

Conversation

AZOGOAT commented Feb 24, 2026

Member

Closes #79
Closes #191

JCHAVEROT commented Apr 24, 2026

Collaborator

I'll solve the merge conflicts

Update: done

@fabnemEPFL

Collaborator

@perrin-arthur

JCHAVEROT force-pushed the task-simplify-processors branch from 383501f to b47a6bf on April 27, 2026 16:58
Introduces a new optional PDF processor based on pymupdf4llm, opt-in via
`file_type_processors: { .pdf: PyMuPDF4LLMProcessor }`. Faster than the
default marker-based PDFProcessor and produces LLM-friendly markdown,
without GPU OCR. PDFProcessor remains the default.

Adds an opt-in `pymupdf4llm` backend to PDFProcessor, configured via the
`processors:` block (same style as MediaProcessor). Faster than the
default marker-based extraction and produces LLM-friendly markdown,
without GPU OCR. The marker backend remains the default.

Usage:
  processors:
    PDFProcessor:
      backend: pymupdf4llm
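
As a rough illustration of how the `backend` key in this usage snippet might be resolved, here is a minimal sketch. Only the `processors: PDFProcessor: backend` layout, the three backend names, and the `marker` default come from the commit messages in this PR; the helper `resolve_pdf_backend`, the `SUPPORTED_BACKENDS` tuple, and the plain-dict config shape are assumptions made for the example, not code from this branch.

```python
from typing import Any, Mapping

# Backend names mentioned in this PR; "marker" is described as the default.
SUPPORTED_BACKENDS = ("marker", "pymupdf4llm", "mistral")
DEFAULT_BACKEND = "marker"


def resolve_pdf_backend(config: Mapping[str, Any]) -> str:
    """Return the PDF backend named in a parsed config (hypothetical helper)."""
    # Tolerate a missing or empty `processors:` / `PDFProcessor:` block.
    processors = config.get("processors") or {}
    pdf_options = processors.get("PDFProcessor") or {}
    backend = pdf_options.get("backend", DEFAULT_BACKEND)
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unknown PDFProcessor backend {backend!r}; "
            f"expected one of {', '.join(SUPPORTED_BACKENDS)}"
        )
    return backend


# Equivalent of the YAML usage snippet once parsed into a dict.
config = {"processors": {"PDFProcessor": {"backend": "pymupdf4llm"}}}
print(resolve_pdf_backend(config))  # pymupdf4llm
print(resolve_pdf_backend({}))      # marker (no backend configured)
```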
Adds a third PDF extraction backend, `mistral`, exposed under the
`processors:` block alongside `marker` (default) and `pymupdf4llm`.
Adapts the CloudPDFProcessor work from upstream PR (issue EPFLiGHT#15) and
integrates it as a backend of PDFProcessor for a unified config UX.

- Mistral OCR call via the official `mistralai` SDK with PDF base64
  encoding and base64 image decoding for extracted figures.
- Asyncio rate limiter (`mistral_max_calls_per_second`) to maximise
  throughput without hitting Mistral's 429 - Too Many Requests, plus
  exponential backoff retries on transient failures.
- API key read from MISTRAL_API_KEY (loaded via python-dotenv in
  run_process) — no key in the YAML config.
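
The rate limiting and retry behaviour described in the list above could look roughly like the following sketch. It uses only the standard library and does not call the `mistralai` SDK: `RateLimiter`, `TransientError`, `call_with_retries`, and `fake_ocr` are hypothetical names invented for this example, and `fake_ocr` merely stands in for the real OCR request. The `max_calls_per_second` argument plays the role of the `mistral_max_calls_per_second` setting named in the commit message.

```python
import asyncio
import random
import time


class RateLimiter:
    """Space out calls so consecutive starts are at least 1/rate seconds apart."""

    def __init__(self, max_calls_per_second: float) -> None:
        self._interval = 1.0 / max_calls_per_second
        self._lock = asyncio.Lock()
        self._next_slot = 0.0

    async def wait(self) -> None:
        # The lock serialises callers, so each one waits for its own slot.
        async with self._lock:
            delay = self._next_slot - time.monotonic()
            if delay > 0:
                await asyncio.sleep(delay)
            self._next_slot = time.monotonic() + self._interval


class TransientError(Exception):
    """Stand-in for a retryable failure such as HTTP 429 (hypothetical)."""


async def call_with_retries(func, limiter, max_retries=3, base_delay=0.5):
    """Run `func` under the limiter, retrying transient failures with backoff."""
    for attempt in range(max_retries + 1):
        await limiter.wait()
        try:
            return await func()
        except TransientError:
            if attempt == max_retries:
                raise
            # Exponential backoff plus a little jitter to avoid synchronized retries.
            backoff = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            await asyncio.sleep(backoff)


async def main() -> list:
    limiter = RateLimiter(max_calls_per_second=50)
    failures_left = {"page-0": 1}  # make the first attempt for one item fail

    async def fake_ocr(name: str) -> str:
        if failures_left.get(name, 0) > 0:
            failures_left[name] -= 1
            raise TransientError(name)
        return f"markdown for {name}"

    names = ("page-0", "page-1", "page-2")
    tasks = [
        call_with_retries(lambda n=n: fake_ocr(n), limiter, base_delay=0.01)
        for n in names
    ]
    # gather() returns results in the same order as the submitted tasks.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(main()))
```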
perrin-arthur force-pushed the task-simplify-processors branch from 1b215e3 to b35da8a on May 8, 2026 13:14
fabnemEPFL mentioned this pull request on May 8, 2026

Development

Successfully merging this pull request may close these issues.

Allow having several processors for a given file type and letting the user choose in the processor config
Simplify processors

4 participants