Corrupted data when generating a searchable PDF with hocr-pdf

I am trying to generate a searchable PDF from a JPEG file and an hOCR file using hocr-pdf.

Both files are in the same folder. Running `hocr-pdf . > out.pdf` generates a PDF, but I cannot search its text.

The PDF reader (evince) says "some font thing failed" when displaying the file, although I can see the image.

When I extract the text from the PDF, I get this warning:

```
$ pdf2txt out.pdf -o out.txt
WARNING:pdfminer.pdftypes:Data-loss while decompressing corrupted data
```

and `out.txt` contains only placeholders (excerpt):

```
(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0) (cid:0)(cid:0)(cid:0)(cid:0)

(cid:0)(cid:0)(cid:0)(cid:0)
```
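
As far as I understand, pdfminer writes `(cid:N)` when it cannot map a glyph to a Unicode character, so the output above suggests that no readable text was recovered at all. A minimal sketch for quantifying that on an extracted text file; `cid_summary` is a hypothetical helper name, and the sample string is a shortened copy of the excerpt above:

```python
import re

# Matches the "(cid:N)" placeholders seen in the extracted text.
CID_RE = re.compile(r"\(cid:(\d+)\)")


def cid_summary(text):
    """Return (number of cid placeholders, number of other non-space characters)."""
    cids = CID_RE.findall(text)
    remaining = CID_RE.sub("", text)
    readable = sum(1 for ch in remaining if not ch.isspace())
    return len(cids), readable


# Shortened copy of the excerpt from out.txt.
excerpt = "(cid:0)(cid:0)\n\n(cid:0)(cid:0)(cid:0)\n"
placeholders, readable = cid_summary(excerpt)
print(f"{placeholders} placeholders, {readable} readable characters")
```

To run it on the real output, read the file first, for example `cid_summary(open("out.txt", encoding="utf-8").read())`.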

My hOCR file is generated by kraken.

The [kraken documentation](https://kraken.re/4.0/advanced.html) says:

>hOCR output is slightly different from hOCR files produced by ocropus. Each ocr_line span contains not only the bounding box of the line but also character boxes (x_bboxes attribute) indicating the coordinates of each character. In each line alternating sequences of alphanumeric and non-alphanumeric (in the unicode sense) characters are put into ocrx_word spans. Both have bounding boxes as attributes and the recognition confidence for each character in the x_conf attribute.

>Paragraph detection has been removed as it was deemed to be unduly dependent on certain typographic features which may not be valid for your input.

So I also tried with an ALTO file (also generated by kraken), which I converted to hOCR with [ocr-fileformat](https://github.com/UB-Mannheim/ocr-fileformat). The result is the same.
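
To compare the two hOCR inputs, it may help to count which hOCR element classes each file actually contains (for example `ocr_line` and `ocrx_word`). The following is only a sketch using the Python standard library; `HocrClassCounter` and `count_hocr_classes` are hypothetical names, and the sample markup is made up for illustration, not real kraken or ocr-fileformat output:

```python
from collections import Counter
from html.parser import HTMLParser


class HocrClassCounter(HTMLParser):
    """Count elements by hOCR class (ocr_page, ocr_line, ocrx_word, ...)."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # An element may carry several space-separated classes.
        for name in (dict(attrs).get("class") or "").split():
            if name.startswith(("ocr_", "ocrx_")):
                self.counts[name] += 1


def count_hocr_classes(hocr_text):
    parser = HocrClassCounter()
    parser.feed(hocr_text)
    return parser.counts


# Made-up sample markup, for illustration only.
sample = """
<div class="ocr_page" title="bbox 0 0 100 50">
  <span class="ocr_line" title="bbox 5 5 95 20">
    <span class="ocrx_word" title="bbox 5 5 40 20">Hello</span>
    <span class="ocrx_word" title="bbox 45 5 95 20">world</span>
  </span>
</div>
"""
counts = count_hocr_classes(sample)
print(dict(counts))
```

Running this on both the kraken hOCR file and the converted ALTO file would show whether they differ in the elements present, which may help narrow down what hocr-pdf is reading.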
