This repository was archived by the owner on Jul 11, 2023. It is now read-only.

PDF Document Loader #38

@johnnysecond

For PDFs:

https://github.com/kartik1998/pdf-images
https://github.com/naptha/tesseract.js#tesseractjs

I spent many hours experimenting with the best way to extract text from PDFs. I tried a couple of different libraries, and they all had problems preserving whitespace. This became a real problem when I went to query embeddings of the text: the incorrect formatting was preserved in the answers, which won't do.

The best solution in practice turned out to be converting the PDFs to images and then using OCR to extract text from the images.
I have this implemented in Python for now but will be rewriting it in TS for the production app, so I can contribute that code in the future if someone else doesn't pick it up first.
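
For anyone who wants to try the approach in the meantime, here is a rough Python sketch of the image-then-OCR pipeline. It is not the implementation mentioned above. The function name `pdf_to_page_texts` and its `rasterize`/`ocr` parameters are made up for illustration, and the defaults assume `pdf2image` (which needs Poppler installed) and `pytesseract` (which needs the Tesseract binary) rather than the JS libraries linked at the top.

```python
from typing import Any, Callable, Iterable, List, Optional


def _default_rasterize(pdf_path: str) -> Iterable[Any]:
    # Assumption: pdf2image renders each PDF page to a PIL image.
    # Imported lazily so the module loads even if it is not installed.
    from pdf2image import convert_from_path

    return convert_from_path(pdf_path, dpi=300)


def _default_ocr(image: Any) -> str:
    # Assumption: pytesseract wraps the Tesseract OCR engine.
    import pytesseract

    return pytesseract.image_to_string(image)


def pdf_to_page_texts(
    pdf_path: str,
    rasterize: Optional[Callable[[str], Iterable[Any]]] = None,
    ocr: Optional[Callable[[Any], str]] = None,
) -> List[str]:
    """Return one OCR'd text string per page of the PDF.

    Both steps are injectable (hypothetical parameters) so a different
    renderer or OCR engine can be swapped in without touching the loop.
    """
    rasterize = rasterize or _default_rasterize
    ocr = ocr or _default_ocr
    # Only trailing whitespace is trimmed; spacing inside the page is kept.
    return [ocr(image).rstrip() for image in rasterize(pdf_path)]
```

Calling `pdf_to_page_texts("some.pdf")` would use the default renderer and OCR engine. Keeping one string per page leaves it up to the caller how to chunk the text before embedding.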
