Add ParseBench integration for non-OCR PDF evaluation #16
Draft
ThomAub wants to merge 3 commits into
Introduces a drop-in ParseBench PARSE provider that shells out to the local OfficeMD CLI (`cargo run -p officemd_cli -- stream --output-format json --pretty`), normalizes the per-page PDF payload into ParseBench's `ParseOutput`, and registers a new `officemd_local` pipeline. Pages are mapped from 1-based `pdf.pages[].number` to 0-based `PageIR.page_index`, document markdown is the blank-line-joined page markdown, and the full OfficeMD JSON is preserved in `raw_output` for analysis. `layout_pages` is left empty in v1. Also ships a non-OCR slicing script that invokes `officemd inspect --output-format json` to select PDFs classified `TextBased` with an empty `pages_needing_ocr`, emitting a JSONL report and a plain-text manifest rather than editing upstream dataset files. Unit tests stub `subprocess.run` so they exercise normalization, page ordering, error mapping, and binary/cargo modes without needing cargo or a live OfficeMD build.
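The normalization described above can be sketched roughly as follows. This is a minimal illustration, not the provider's actual code: `PageSketch` and `ParseOutputSketch` are hypothetical stand-ins for ParseBench's `PageIR` and `ParseOutput`, and only the fields named in this description are modeled.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PageSketch:
    """Hypothetical stand-in for ParseBench's PageIR (two fields only)."""
    page_index: int
    markdown: str


@dataclass
class ParseOutputSketch:
    """Hypothetical stand-in for ParseBench's ParseOutput."""
    markdown: str
    pages: list
    layout_pages: list = field(default_factory=list)  # left empty in v1
    raw_output: dict = field(default_factory=dict)


def normalize(document: dict) -> ParseOutputSketch:
    """Map an OfficeMD-style JSON document onto the sketch types above."""
    try:
        raw_pages = document["pdf"]["pages"]
    except (KeyError, TypeError) as exc:
        raise ValueError("OfficeMD output is missing pdf.pages") from exc

    # Keep the order OfficeMD emitted; only shift 1-based numbers to 0-based.
    pages = [
        PageSketch(page_index=p["number"] - 1, markdown=p["markdown"])
        for p in raw_pages
    ]
    return ParseOutputSketch(
        markdown="\n\n".join(p.markdown for p in pages),  # blank-line join
        pages=pages,
        raw_output={"document": document},  # full payload kept for analysis
    )


doc: Any = {"pdf": {"pages": [
    {"number": 1, "markdown": "# Title"},
    {"number": 2, "markdown": "Body"},
]}}
out = normalize(doc)
print([p.page_index for p in out.pages])  # [0, 1]
```

Raising a single `ValueError` for a missing `pdf.pages` is one possible error mapping; the real provider may use its own exception types.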
PDF 32000-1 §8.4.2 lists the text state, including Tf (font name) and Tfs (font size), as part of the graphics state that `q` pushes and `Q` pops. OfficeMD's stack only saved the CTM, fill color, and text rendering mode, so a `/Fx Tf` inside a nested `q ... Q` block leaked out.

Concretely, in PDFs that render body prose in a TrueType/WinAnsi font (F1) and punctuation like the en-dash via a Type0/Identity-H font (F6) sharing the same BaseFont, the sequence

    /F1 Tf ... q /F6 Tf [<00B2>]TJ Q BT [(Fundamentals)]TJ

left `current_font=F6` when the `(Fundamentals)` literal was decoded. The bytes got routed through F6's 2-byte ToUnicode CMap, where the range `<0044><005D> -> <0061>` maps 0x46 -> 'c', so "Fundamentals" came out as "cundamentals". The same mechanism yielded "Open" -> "lpen", "Markup" -> "jarkup", and "Reference" -> "oence", which all showed up as ParseBench `missing_specific_word` failures on OpenXML_WhitePaper.pdf.

Fix: extend the `q`/`Q` save/restore to carry the current font name and size alongside the CTM.

The snapshot diff on OpenXML_WhitePaper.pdf is pure gain: ligatures decode correctly, section titles join with their bodies, and numeric "165 pages / 125 pages / 466 pages" fragments stop appearing as "1SR pages / 1OR pages / 4SS pages".

ParseBench smoke run on the whitepaper+sample fixture:

    officemd_local:  10/12 -> 11/12 (83% -> 92%)
    pypdf_baseline:  12/12 (unchanged)
    pymupdf_text:    12/12 (unchanged)

One remaining failure, `missing_specific_sentence "Part 1 – Fundamentals"`, is a separate space-preservation issue across the F6/F1 font switch and is not addressed here.
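To make the leak concrete, here is a small Python model of the `q`/`Q` stack (OfficeMD itself is built with cargo; this is only an illustration). `GraphicsState`, `fonts_at_show_ops`, and `bfrange_lookup` are hypothetical names, the 12.0 font size is an assumed value, and the model ignores everything except the font and the stack. How the decoder arrives at the code 0x46 is not modeled; the last function only reproduces the range arithmetic quoted above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class GraphicsState:
    """Only the fields this sketch needs; a real state carries more."""
    ctm: tuple = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    font_name: Optional[str] = None
    font_size: float = 0.0


def fonts_at_show_ops(ops, save_font=True):
    """Return the font in effect at each TJ operator.

    With save_font=False the font survives `Q`, imitating the old behavior.
    """
    state = GraphicsState()
    stack = []
    seen = []
    for op, operands in ops:
        if op == "q":
            stack.append(state)
        elif op == "Q" and stack:
            saved = stack.pop()
            if not save_font:
                # Old behavior: the font set inside q ... Q leaks out.
                saved = replace(saved, font_name=state.font_name,
                                font_size=state.font_size)
            state = saved
        elif op == "Tf":
            name, size = operands
            state = replace(state, font_name=name, font_size=size)
        elif op == "TJ":
            seen.append(state.font_name)
    return seen


def bfrange_lookup(code, lo, hi, dst):
    """Map `code` through one bfrange entry <lo> <hi> <dst>, else None."""
    if lo <= code <= hi:
        return chr(dst + (code - lo))
    return None


ops = [
    ("Tf", ("F1", 12.0)),        # body font (size is an assumed value)
    ("q", ()),
    ("Tf", ("F6", 12.0)),        # punctuation font inside the nested block
    ("TJ", ("<00B2>",)),
    ("Q", ()),
    ("BT", ()),
    ("TJ", ("(Fundamentals)",)),
]
print(fonts_at_show_ops(ops, save_font=False))  # ['F6', 'F6']
print(fonts_at_show_ops(ops, save_font=True))   # ['F6', 'F1']
print(bfrange_lookup(ord("F"), 0x0044, 0x005D, 0x0061))  # c
```

The same lookup gives 'l' for "O" and 'j' for "M", matching the "lpen" and "jarkup" outputs reported above.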
…ted widths

Some Type0/CID fonts ship only a default width (DW) and no W array, so every glyph, including narrow punctuation like an en-dash, reports the same wide advance. That inflates `end_x` in the line-merge pass and hides real whitespace between items placed via their own Tm operators, producing joins like `Part 1 –Fundamentals` or `four forces –extremely`.

When the overlap is a substantial fraction of the prior item's reported width, fall back to a conservative per-character estimate for the "honest" end point, so a visible shift to the next item still reads as a word boundary. Tight letter kerning (small negative gap) is still treated as a no-space continuation.

https://claude.ai/code/session_01PJH2eef2vqet1qC1EsXpkn
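One way to picture the heuristic is the Python sketch below. It is not the code from this commit: `TextItem`, `needs_space`, and all three thresholds are hypothetical, chosen only so the example works, and the real line-merge pass may weigh other signals.

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    text: str
    x: float          # start x of the item
    width: float      # width as reported by the font metrics
    font_size: float


# Illustrative guesses, not OfficeMD's actual values.
OVERLAP_FRACTION = 0.25  # overlap / reported width that counts as substantial
CHAR_WIDTH_EM = 0.5      # conservative per-character advance, in ems
SPACE_GAP_EM = 0.2       # gap, in ems, that reads as a word boundary


def needs_space(prev: TextItem, nxt: TextItem) -> bool:
    """Decide whether a space belongs between two adjacent items on a line."""
    end_x = prev.x + prev.width
    gap = nxt.x - end_x
    overlap = -gap
    if overlap > 0 and prev.width > 0 and overlap >= OVERLAP_FRACTION * prev.width:
        # The reported width looks inflated (e.g. DW-only font): estimate
        # the end point from the character count instead.
        estimate = prev.x + len(prev.text) * CHAR_WIDTH_EM * prev.font_size
        end_x = min(end_x, estimate)
        gap = nxt.x - end_x
    # A small negative gap (tight kerning) falls through as "no space".
    return gap >= SPACE_GAP_EM * prev.font_size


dash = TextItem("–", x=100.0, width=12.0, font_size=12.0)   # full-em DW width
word = TextItem("Fundamentals", x=109.0, width=70.0, font_size=12.0)
print(needs_space(dash, word))  # True

a = TextItem("A", x=0.0, width=8.0, font_size=12.0)
v = TextItem("V", x=7.5, width=8.0, font_size=12.0)
print(needs_space(a, v))  # False
```

Taking the minimum of the reported and estimated end points keeps the estimate from ever widening an item, so it can only reveal a gap, never hide one.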
Summary

- Adds `integrations/parsebench/`, a drop-in ParseBench `PARSE` provider (`officemd_local`) that shells out to the local OfficeMD CLI via `cargo run -p officemd_cli -- stream <pdf> --output-format json --pretty` and normalizes the JSON into `ParseOutput`.
- Registers the `officemd_local` pipeline through a tiny patch module (`register_officemd_pipelines(register_fn)`) so ParseBench picks it up alongside its built-in parse pipelines.
- Adds `scripts/classify_non_ocr_pdfs.py`, which uses `officemd inspect --output-format json` to materialize the non-OCR slice (classification `TextBased`, empty `pages_needing_ocr`) as a JSONL report + plain-text manifest, with no upstream dataset edits.
- Unit tests stub `subprocess.run` and cover valid multi-page normalization, page-order preservation, invalid JSON / non-zero exit / missing `pdf.pages` error paths, non-PDF inputs, `cargo`-vs-`binary` modes, and config validation.
- The `pdf.pages[]` + diagnostics payload is sufficient.

Normalization rules

- `pdf.pages[].number` → `PageIR.page_index = number - 1`
- `pdf.pages[].markdown` → `PageIR.markdown`
- Document `markdown` = page markdown joined by a blank line
- `layout_pages` left empty in v1
- Full OfficeMD JSON preserved in `raw_output["document"]` for later analysis

Test plan

- `uv pip install -e integrations/parsebench[dev]`
- `uv run pytest integrations/parsebench/tests -q` (provider unit tests)
- Patch `inference/pipelines/parse.py` to call `register_officemd_pipelines`, then `uv run parse-bench pipelines` lists `officemd_local`
- `uv run parse-bench run officemd_local --test --group text_content` completes end-to-end
- `uv run parse-bench run officemd_local --test --group text_formatting` completes end-to-end
- Run `scripts/classify_non_ocr_pdfs.py` over the dataset to produce the non-OCR manifest, then run `officemd_local`, `pypdf_baseline`, and `pymupdf_text` on the same slice and compare reports for `text_content` / `text_formatting` / `table`.

Notes / follow-ups

- `chart` category skipped in the first pass: OfficeMD exposes text markdown but no chart-specific structured payload yet.
- Layout (`ParseLayoutPageIR`) is not wired for v1.

https://claude.ai/code/session_01PJH2eef2vqet1qC1EsXpkn
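The non-OCR slice selection from the summary might look roughly like this. It is a sketch under assumptions: the helper names are hypothetical, and it assumes `classification` and `pages_needing_ocr` sit at the top level of the `officemd inspect` JSON, which this description does not confirm. The CLI call itself is left out; the functions take already-parsed payloads.

```python
import json


def is_non_ocr(report: dict) -> bool:
    """True when a PDF is text-based and no page is flagged for OCR.

    Assumes both fields are top-level keys of the inspect payload.
    A missing `pages_needing_ocr` is treated as "unknown", not as empty.
    """
    return (
        report.get("classification") == "TextBased"
        and report.get("pages_needing_ocr") == []
    )


def build_slice(reports):
    """Turn (path, inspect_payload) pairs into (jsonl_text, manifest_text).

    The JSONL report keeps every PDF with its decision; the manifest lists
    only the selected paths, one per line.
    """
    jsonl_lines = []
    manifest_lines = []
    for path, report in reports:
        selected = is_non_ocr(report)
        record = {"path": path, "selected": selected, "inspect": report}
        jsonl_lines.append(json.dumps(record, sort_keys=True))
        if selected:
            manifest_lines.append(path)
    jsonl_text = "".join(line + "\n" for line in jsonl_lines)
    manifest_text = "".join(line + "\n" for line in manifest_lines)
    return jsonl_text, manifest_text


# Made-up payloads for illustration; "Scanned" is an invented label.
reports = [
    ("a.pdf", {"classification": "TextBased", "pages_needing_ocr": []}),
    ("b.pdf", {"classification": "TextBased", "pages_needing_ocr": [3]}),
    ("c.pdf", {"classification": "Scanned", "pages_needing_ocr": [1, 2]}),
]
jsonl_text, manifest_text = build_slice(reports)
print(manifest_text, end="")  # a.pdf
```

Writing both outputs as new files, instead of editing the dataset, keeps the slice reproducible and leaves the upstream files untouched.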