I am trying to generate a searchable pdf from a jpeg file and a hocr file with the help of hocr-pdf.
I have both files in the same folder. hocr-pdf . > out.pdf generates a pdf but I cannot search inside.
Pdf reader (evince) says "some font thing failed" when displaying the file (I can see the image).
When I extract the text from the pdf
$ pdf2txt out.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data
and out.txt contains (excerpt)
(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)
(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)
(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)
(cid:0)(cid:0)(cid:0)(cid:0)
My hocr file is generated by kraken.
I read from kraken documentation
hOCR output is slightly different from hOCR files produced by ocropus. Each ocr_line span contains not only the bounding box of the line but also character boxes (x_bboxes attribute) indicating the coordinates of each character. In each line alternating sequences of alphanumeric and non-alphanumeric (in the unicode sense) characters are put into ocrx_word spans. Both have bounding boxes as attributes and the recognition confidence for each character in the x_conf attribute.
Paragraph detection has been removed as it was deemed to be unduly dependent on certain typographic features which may not be valid for your input.
So I also tried with an ALTO file (still generated by Kraken), which I convert to hocr format with the help of ocr-fileformat. Same result.