Support array /Contents and the PostScript dup operator #9

Open
peterhoneder wants to merge 1 commit into dslipak:master from peterhoneder:fix/array-contents-and-dup-operator

@peterhoneder

I came across this issue when extracting text from multi-page PDFs: the extraction hung indefinitely.

GetPlainText hangs and CMap parsing panics on common real-world PDFs because of two unrelated gaps in the content/PostScript handling.

  1. A page's /Contents may be an array of streams that are to be treated as a single stream (PDF 32000-1:2008, 7.8.2). Page.Content() already handles this, but Value.Reader() does not: for an array it returns an errorReadCloser, whose non-EOF "stream not present" error makes the lexer's reload/readByte loop spin forever (GetPlainText never returns). Handle the array case in Value.Reader() by concatenating the stream elements, separated by a newline so tokens cannot merge across a stream boundary. Non-stream elements (a null or a dangling reference) are skipped, since an errorReadCloser in the middle of the concatenation would spin the lexer just the same. This fixes every Reader() caller, GetPlainText included.

  2. The Interpret PostScript machine never implemented dup. ToUnicode CMaps routinely use the "N dict dup begin ... end def" idiom (e.g. for /CIDSystemInfo), so the missing dup underflows the stack and the trailing def panics with "def of non-name". Implement dup.

Tests build minimal PDFs exercising each path: Test_Reader_ArrayOfStreams asserts Value.Reader concatenates an array of streams; the SkipsNonStream test asserts a null array element does not hang; and the DupCmap test extracts Type0/Identity-H text end to end. All three fail on the parent commit (hang or panic) and pass here.
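
For readers unfamiliar with the operator, the effect of dup on the operand stack can be sketched like this. The `stack` type and its methods are hypothetical and much simpler than the library's interpreter; the sketch only shows why "N dict dup begin ... end def" needs two copies of the dictionary, one consumed by begin and one left as the value for def.

```go
package main

import "errors"

// stack is a hypothetical operand stack, not the library's own type.
type stack struct {
	vals []interface{}
}

// push places v on top of the stack.
func (s *stack) push(v interface{}) { s.vals = append(s.vals, v) }

// pop removes and returns the top operand, or an error on underflow.
func (s *stack) pop() (interface{}, error) {
	if len(s.vals) == 0 {
		return nil, errors.New("stack underflow")
	}
	v := s.vals[len(s.vals)-1]
	s.vals = s.vals[:len(s.vals)-1]
	return v, nil
}

// dup pops the top operand and pushes it back twice. For a dictionary,
// both entries refer to the same underlying map, so entries defined
// between begin and end are visible in the copy that def later stores.
func (s *stack) dup() error {
	v, err := s.pop()
	if err != nil {
		return err
	}
	s.push(v)
	s.push(v)
	return nil
}
```

With a name and a dictionary on the stack, dup leaves three operands. After begin pops one dictionary, def still finds a value and a name in the expected order.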
