Support array /Contents and the PostScript dup operator by peterhoneder · Pull Request #9 · dslipak/pdf

peterhoneder · 2026-06-22T19:52:53Z

I came across this issue when extracting text from multi-page PDFs: the extraction ended up hanging indefinitely.

GetPlainText hangs and CMap parsing panics on common real-world PDFs because of two unrelated gaps in the content/PostScript handling.

1. A page's /Contents may be an array of streams that are to be treated as a single stream (PDF 32000-1:2008, 7.8.2). Page.Content() already handles this, but Value.Reader() does not: for an array it returns an errorReadCloser, whose non-EOF "stream not present" error makes the lexer's reload/readByte loop spin forever (GetPlainText never returns). Handle the array case in Value.Reader() by concatenating the stream elements, separated by a newline so tokens cannot merge across a stream boundary. Non-stream elements (a null or a dangling reference) are skipped, since an errorReadCloser in the middle of the concatenation would spin the lexer just the same. This fixes every Reader() caller, GetPlainText included.
2. The Interpret PostScript machine never implemented dup. ToUnicode CMaps routinely use the "N dict dup begin ... end def" idiom (e.g. for /CIDSystemInfo), so the missing dup underflows the stack and the trailing def panics with "def of non-name". Implement dup.
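To make the dup failure mode concrete, here is a minimal sketch of the operand-stack behaviour behind the "dict dup begin ... end def" idiom. It is not the library's code: the `stack` and `name` types and the `runIdiom` helper are hypothetical names for illustration, and the sketch returns an error where the real interpreter panics.

```go
package main

import (
	"errors"
	"fmt"
)

// name and stack are hypothetical stand-ins for the interpreter's
// name objects and operand stack; they are not the library's types.
type name string

type stack []any

var errUnderflow = errors.New("stack underflow")

func (s *stack) push(v any) { *s = append(*s, v) }

func (s *stack) pop() (any, error) {
	if len(*s) == 0 {
		return nil, errUnderflow
	}
	v := (*s)[len(*s)-1]
	*s = (*s)[:len(*s)-1]
	return v, nil
}

// dup pushes a second copy of the top operand.
func (s *stack) dup() error {
	v, err := s.pop()
	if err != nil {
		return err
	}
	s.push(v)
	s.push(v)
	return nil
}

// runIdiom replays "/CIDSystemInfo 3 dict [dup] begin ... end def"
// and returns the key that def would bind.
func runIdiom(withDup bool) (string, error) {
	var s stack
	s.push(name("CIDSystemInfo")) // /CIDSystemInfo
	s.push(map[string]any{})      // result of "3 dict"
	if withDup {
		if err := s.dup(); err != nil {
			return "", err
		}
	}
	// begin consumes one dictionary; end leaves the operand stack alone.
	if _, err := s.pop(); err != nil {
		return "", err
	}
	// def pops the value first, then the key.
	if _, err := s.pop(); err != nil {
		return "", err
	}
	key, err := s.pop()
	if err != nil {
		return "", err
	}
	n, ok := key.(name)
	if !ok {
		return "", errors.New("def of non-name")
	}
	return string(n), nil
}

func main() {
	fmt.Println(runIdiom(true))  // dup leaves a dict for def to bind
	fmt.Println(runIdiom(false)) // without dup, def runs out of operands
}
```

Without dup, begin consumes the only dictionary, so def takes the name as its value and finds nothing left to use as the key.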

Tests build minimal PDFs exercising each path: Test_Reader_ArrayOfStreams asserts Value.Reader concatenates an array of streams; the SkipsNonStream test asserts a null array element does not hang; and the DupCmap test extracts Type0/Identity-H text end to end. All three fail on the parent commit (hang or panic) and pass here.


peterhoneder wants to merge 1 commit into dslipak:master from peterhoneder:fix/array-contents-and-dup-operator
