ocr_text option for pdf files gets poor text results #2068
mark-faster-outcomes started this conversation in General
Replies: 2 comments · 3 replies
-
Hey! Thanks for trying FSCrawler.
What do you mean? Did you configure as well the …
-
I pulled out one page that has an issue extracting the text when I don't use the `ocr_only` strategy. The actual document has over 400 pages, so full tesseract mode is very slow.
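For context, the strategy being compared here is selected under `fs.ocr` in the job's `_settings.yml`. A minimal sketch, assuming the field names documented for FSCrawler 2.x (the job name and path are placeholders, not taken from this thread):

```yaml
---
name: "pdf_job"            # placeholder job name
fs:
  url: "/path/to/pdfs"     # placeholder source directory
  ocr:
    enabled: true
    language: "eng"
    # Documented values: "no_ocr", "auto", "ocr_and_text" (the default) and "ocr_only".
    # "ocr_only" is the "full tesseract mode" mentioned above: every page is rasterized
    # and run through Tesseract, which is why it is slow on a 400-page document.
    pdf_strategy: "ocr_and_text"
```

With the default `ocr_and_text`, Tika reads the embedded text layer and only runs OCR on embedded images, so it is typically much faster than `ocr_only`, but it inherits whatever defects the PDF's text layer has.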
-
I'm on 2.10-SNAPSHOT. I have large pdf files (>25 MB) that I need to index. The only `pdf_strategy` that seems to work properly is `ocr_only`. The other options often miss words in the text and/or fail to preserve spacing (I am using `preserve_interword_spacing = true` and `page_seg_mode = 6`). I used `pdftotext` from poppler-utils and it extracts the text properly on my sample document. The problem with using `ocr_only` is that it can be very slow (>20 mins) on a large pdf file. Often the pdfs only contain text, but I don't want to run the risk of missing some images that might contain text.

Here is my `_settings.yml`.

BTW, I might consider just using `pdftotext` to get the text out of the pdfs, but it's not clear to me how I could override the pdf file handler in FSCrawler to use something that I cook up.
Thanks!
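Since the `_settings.yml` referenced above did not come through in this capture, the sketch below only illustrates where the knobs mentioned in the post would plausibly sit, assuming they are exposed under `fs.ocr`. The exact key names for the Tesseract tuning options, the job name, and the path are assumptions to be checked against the docs, not the poster's actual values:

```yaml
---
name: "pdf_index"                  # placeholder job name
fs:
  url: "/data/pdfs"                # placeholder source directory
  ocr:
    enabled: true
    language: "eng"
    pdf_strategy: "ocr_only"       # gives reliable text here, but >20 minutes on large files
    # Tesseract tuning mentioned in the post; placement under fs.ocr is an assumption:
    preserve_interword_spacing: true
    page_seg_mode: 6               # Tesseract PSM 6: assume a single uniform block of text
```

Page segmentation mode 6 ("assume a single uniform block of text") and interword-spacing preservation are Tesseract options, so they only take effect on the OCR path, not on plain text-layer extraction.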