ocr_text option for pdf files gets poor text results #2068
mark-faster-outcomes started this conversation in General
Replies: 2 comments · 3 replies
-
Hey! Thanks for trying FSCrawler.
What do you mean? Did you configure as well the …
-
I pulled out one page that has an issue extracting the text when I don't use the `ocr_only` strategy. The actual document has over 400 pages, so full tesseract mode is very slow.
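For context, the strategy being compared here is selected under `fs.ocr` in the job's `_settings.yml`. A minimal sketch, assuming the field names documented for FSCrawler 2.x (the job name and path are placeholders, not taken from this thread):

```yaml
---
name: "pdf_job"            # placeholder job name
fs:
  url: "/path/to/pdfs"     # placeholder source directory
  ocr:
    enabled: true
    language: "eng"
    # Documented values: "no_ocr", "auto", "ocr_and_text" (the default) and "ocr_only".
    # "ocr_only" is the "full tesseract mode" mentioned above: every page is rasterized
    # and run through Tesseract, which is why it is slow on a 400-page document.
    pdf_strategy: "ocr_and_text"
```

With the default `ocr_and_text`, Tika reads the embedded text layer and only runs OCR on embedded images, so it is typically much faster than `ocr_only`, but it inherits whatever defects the PDF's text layer has.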
-
I'm on 2.10-SNAPSHOT. I have large pdf files (>25 MB) that I need to index. The only `pdf_strategy` that seems to work properly is `ocr_only`. The other options often miss words in the text and/or fail to preserve spacing (I am using `preserve_interword_spacing = true` and `page_seg_mode = 6`). I used `pdftotext` from poppler-utils and it extracts the text properly on my sample document. The problem with using `ocr_only` is that it can be very slow (>20 mins) on a large pdf file. Often the pdfs only contain text, but I don't want to run the risk of missing some images that might contain text.

Here is my `_settings.yml`.

BTW, I might consider just using `pdftotext` to get the text out of the pdfs, but it's not clear to me how I could override the pdf file handler in FSCrawler to use something that I cook up.
Thanks!
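Since the `_settings.yml` referenced above did not come through in this capture, the sketch below only illustrates where the knobs mentioned in the post would plausibly sit, assuming they are exposed under `fs.ocr`. The exact key names for the Tesseract tuning options, the job name, and the path are assumptions to be checked against the docs, not the poster's actual values:

```yaml
---
name: "pdf_index"                  # placeholder job name
fs:
  url: "/data/pdfs"                # placeholder source directory
  ocr:
    enabled: true
    language: "eng"
    pdf_strategy: "ocr_only"       # gives reliable text here, but >20 minutes on large files
    # Tesseract tuning mentioned in the post; placement under fs.ocr is an assumption:
    preserve_interword_spacing: true
    page_seg_mode: 6               # Tesseract PSM 6: assume a single uniform block of text
```

Page segmentation mode 6 ("assume a single uniform block of text") and interword-spacing preservation are Tesseract options, so they only take effect on the OCR path, not on plain text-layer extraction.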