
Add ParseBench integration for non-OCR PDF evaluation #16

Draft
ThomAub wants to merge 3 commits into main from claude/integrate-officemd-parsebench-BgmL5


@ThomAub ThomAub commented Apr 19, 2026

Summary

  • Adds integrations/parsebench/, a drop-in ParseBench PARSE provider (officemd_local) that shells out to the local OfficeMD CLI via cargo run -p officemd_cli -- stream <pdf> --output-format json --pretty and normalizes the JSON into ParseOutput.
  • Registers a new officemd_local pipeline through a tiny patch module (register_officemd_pipelines(register_fn)) so ParseBench picks it up alongside its built-in parse pipelines.
  • Ships scripts/classify_non_ocr_pdfs.py, which uses officemd inspect --output-format json to materialize the non-OCR slice (classification TextBased, empty pages_needing_ocr) as a JSONL report + plain-text manifest — no upstream dataset edits.
  • Unit tests stub subprocess.run and cover valid multi-page normalization, page-order preservation, invalid JSON / non-zero exit / missing pdf.pages error paths, non-PDF inputs, cargo-vs-binary modes, and config validation.
  • OfficeMD CLI is intentionally not modified for v1 — current pdf.pages[] + diagnostics payload is sufficient.
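
The non-OCR selection rule from the third bullet can be sketched as a small predicate plus a report/manifest split. This is a minimal sketch, assuming the `officemd inspect` JSON exposes `classification` and `pages_needing_ocr` as top-level keys; the payload shape, the output record fields, and the helper names `is_non_ocr` and `split_reports` are hypothetical and not taken from `scripts/classify_non_ocr_pdfs.py`.

```python
import json
from typing import Any, Iterable


def is_non_ocr(report: dict[str, Any]) -> bool:
    """True when a PDF is classified TextBased and lists no pages needing OCR.

    Assumes top-level `classification` and `pages_needing_ocr` keys
    (an assumption about the inspect payload, not a confirmed schema).
    """
    return (
        report.get("classification") == "TextBased"
        and report.get("pages_needing_ocr") == []
    )


def split_reports(
    reports: Iterable[tuple[str, dict[str, Any]]],
) -> tuple[list[str], list[str]]:
    """Return (JSONL report lines, manifest paths) for (path, inspect JSON) pairs."""
    jsonl: list[str] = []
    manifest: list[str] = []
    for path, report in reports:
        keep = is_non_ocr(report)
        # One JSONL record per PDF, kept or not, so the slice is auditable.
        jsonl.append(json.dumps({"path": path, "non_ocr": keep, "inspect": report}))
        if keep:
            manifest.append(path)
    return jsonl, manifest
```

Writing both outputs as side files keeps the upstream dataset untouched, which matches the "no upstream dataset edits" constraint above.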

Normalization rules

  • pdf.pages[].number → PageIR.page_index = number - 1
  • pdf.pages[].markdown → PageIR.markdown
  • Document markdown = page markdown joined by a blank line
  • layout_pages left empty in v1
  • Full OfficeMD JSON kept in raw_output["document"] for later analysis
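
The rules above can be sketched as a single function. This is an illustration only: the `Page` dataclass and the returned dict are stand-ins for ParseBench's `PageIR` and `ParseOutput`, whose real constructors are not shown in this PR description, and the `normalize` name is hypothetical.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class Page:
    """Stand-in for ParseBench's PageIR (hypothetical shape)."""
    page_index: int
    markdown: str


def normalize(document: dict[str, Any]) -> dict[str, Any]:
    """Map OfficeMD JSON to a ParseOutput-like dict following the rules above."""
    try:
        raw_pages = document["pdf"]["pages"]
    except (KeyError, TypeError) as exc:
        raise ValueError("OfficeMD JSON is missing pdf.pages") from exc

    # Keep the order OfficeMD emitted; only shift 1-based numbers to 0-based.
    pages = [
        Page(page_index=p["number"] - 1, markdown=p["markdown"])
        for p in raw_pages
    ]
    return {
        "pages": pages,
        # Document markdown is the page markdown joined by a blank line.
        "markdown": "\n\n".join(page.markdown for page in pages),
        "layout_pages": [],  # left empty in v1
        "raw_output": {"document": document},  # full payload kept for analysis
    }
```

For a two-page payload with markdown `# A` and `B`, this yields page indices 0 and 1 and a document markdown of `# A`, a blank line, then `B`.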

Test plan

  • uv pip install -e integrations/parsebench[dev]
  • uv run pytest integrations/parsebench/tests -q — provider unit tests
  • From a ParseBench checkout: patch inference/pipelines/parse.py to call register_officemd_pipelines, then
    • uv run parse-bench pipelines lists officemd_local
    • uv run parse-bench run officemd_local --test --group text_content completes end-to-end
    • uv run parse-bench run officemd_local --test --group text_formatting completes end-to-end
  • Run scripts/classify_non_ocr_pdfs.py over the dataset to produce the non-OCR manifest, then run officemd_local, pypdf_baseline, and pymupdf_text on the same slice and compare reports for text_content / text_formatting / table.
  • Produce a short gap report grouping failures by category (reading order, list/header handling, table markdown, formatting loss).

Notes / follow-ups

  • chart category skipped in the first pass: OfficeMD exposes text markdown but no chart-specific structured payload yet.
  • If the comparison surfaces gaps that need cleaner page-boundary/formatting semantics, the follow-up belongs in the OfficeMD PDF JSON payload rather than ParseBench-side post-processing.
  • Layout attribution (ParseLayoutPageIR) is not wired for v1.

https://claude.ai/code/session_01PJH2eef2vqet1qC1EsXpkn

claude added 3 commits April 19, 2026 17:09
Introduces a drop-in ParseBench PARSE provider that shells out to the
local OfficeMD CLI (`cargo run -p officemd_cli -- stream --output-format
json --pretty`), normalizes the per-page PDF payload into ParseBench's
`ParseOutput`, and registers a new `officemd_local` pipeline. Pages are
mapped from 1-based `pdf.pages[].number` to 0-based `PageIR.page_index`,
document markdown is the blank-line-joined page markdown, and the full
OfficeMD JSON is preserved in `raw_output` for analysis. `layout_pages`
is left empty in v1.

Also ships a non-OCR slicing script that invokes `officemd inspect
--output-format json` to select PDFs classified `TextBased` with an
empty `pages_needing_ocr`, emitting a JSONL report and a plain-text
manifest rather than editing upstream dataset files.

Unit tests stub `subprocess.run` so they exercise normalization, page
ordering, error mapping, and binary/cargo modes without needing cargo or
a live OfficeMD build.
PDF 32000-1 §8.4.2 lists the text state — including Tf (font name) and
Tfs (font size) — as part of the graphics state that `q` pushes and `Q`
pops. OfficeMD's stack only saved the CTM, fill color, and text rendering
mode, so a `/Fx Tf` inside a nested `q ... Q` block leaked out.

Concretely, in PDFs that render body prose in a TrueType/WinAnsi font
(F1) and punctuation like the en-dash via a Type0/Identity-H font (F6)
sharing the same BaseFont, the sequence
  /F1 Tf ... q /F6 Tf [<00B2>]TJ Q BT [(Fundamentals)]TJ
left `current_font=F6` when the `(Fundamentals)` literal was decoded.
The bytes got routed through F6's 2-byte ToUnicode CMap, where the range
`<0044><005D> -> <0061>` maps 0x46 -> 'c', so "Fundamentals" came out as
"cundamentals". Same mechanism yielded "Open" -> "lpen", "Markup" ->
"jarkup", "Reference" -> "oence", which all showed up as ParseBench
`missing_specific_word` failures on OpenXML_WhitePaper.pdf.
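
The arithmetic behind those substitutions can be checked with a short sketch. It applies the single range quoted above (`<0044><005D> -> <0061>`) to each byte of a literal as if it were its own character code. That simplification reproduces the reported symptoms for these words, but it is an assumption for illustration and not a description of how OfficeMD actually walks the CMap; the function names are hypothetical.

```python
from typing import Optional


def map_bfrange(code: int, lo: int, hi: int, dst_start: int) -> Optional[str]:
    """Apply one ToUnicode range entry <lo> <hi> <dst_start> to a character code."""
    if lo <= code <= hi:
        return chr(dst_start + (code - lo))
    return None


def misdecode(text: str) -> str:
    """Decode each character through the F6 range; leave unmapped codes as-is."""
    return "".join(
        map_bfrange(ord(ch), 0x0044, 0x005D, 0x0061) or ch for ch in text
    )


# 0x46 ('F') is offset 2 into the range, so it lands on 0x61 + 2 = 'c'.
print(misdecode("Fundamentals"))  # cundamentals
print(misdecode("Open"))          # lpen
print(misdecode("Markup"))        # jarkup
```

Only the uppercase initials fall inside `0x44..0x5D`, which is why the rest of each word survives unchanged.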

Fix: extend the `q`/`Q` save/restore to carry the current font name and
size alongside the CTM. The snapshot diff on OpenXML_WhitePaper.pdf is
pure gain — ligatures decode correctly, section titles join with their
bodies, and numeric "165 pages / 125 pages / 466 pages" fragments stop
appearing as "1SR pages / 1OR pages / 4SS pages".
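
The shape of the fix can be illustrated with a toy interpreter. OfficeMD itself is a Rust codebase and its real state struct is not shown here, so the Python below is only a sketch of the technique: the `GraphicsState` and `Interpreter` names and fields are hypothetical, and the font sizes in the usage lines are made up.

```python
from dataclasses import dataclass, field, replace
from typing import Optional


@dataclass(frozen=True)
class GraphicsState:
    """Snapshot saved by `q` and restored by `Q` (illustrative subset)."""
    ctm: tuple = (1.0, 0.0, 0.0, 1.0, 0.0, 0.0)
    font_name: Optional[str] = None  # now part of the saved state
    font_size: float = 0.0           # now part of the saved state


@dataclass
class Interpreter:
    state: GraphicsState = field(default_factory=GraphicsState)
    stack: list = field(default_factory=list)

    def q(self) -> None:
        # Push the whole snapshot, font included, not just the CTM.
        self.stack.append(self.state)

    def Q(self) -> None:
        # Restore the snapshot; an unbalanced Q is ignored in this sketch.
        if self.stack:
            self.state = self.stack.pop()

    def Tf(self, name: str, size: float) -> None:
        self.state = replace(self.state, font_name=name, font_size=size)


interp = Interpreter()
interp.Tf("F1", 11.0)   # body font
interp.q()
interp.Tf("F6", 11.0)   # en-dash font, set inside q ... Q
interp.Q()
print(interp.state.font_name)  # F1
```

Because the state is an immutable snapshot, pushing it is a cheap reference copy and a `Tf` inside the nested block cannot mutate what was saved.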

ParseBench smoke run on the whitepaper+sample fixture:
  officemd_local: 10/12 -> 11/12 (83% -> 92%)
  pypdf_baseline: 12/12 (unchanged)
  pymupdf_text:   12/12 (unchanged)
One remaining failure — `missing_specific_sentence "Part 1 – Fundamentals"`
— is a separate space-preservation issue across the F6/F1 font switch and
is not addressed here.
…ted widths

Some Type0/CID fonts ship only a default width (DW) and no W array, so every
glyph — including narrow punctuation like an en-dash — reports the same wide
advance. That inflates `end_x` in the line-merge pass and hides real whitespace
between items placed via their own Tm operators, producing joins like
`Part 1 –Fundamentals` or `four forces –extremely`.

When the overlap is a substantial fraction of the prior item's reported width,
fall back to a conservative per-character estimate for the "honest" end point,
so a visible shift to the next item still reads as a word boundary. Tight
letter kerning (small negative gap) is still treated as a no-space continuation.
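
A rough sketch of that decision is below. The commit message does not state the thresholds or the exact estimate, so `overlap_fraction`, `est_char_width_em`, and `space_gap_em` are placeholder values chosen for illustration, and `needs_space` is a hypothetical name rather than the function in OfficeMD's line-merge pass.

```python
def needs_space(
    prev_start_x: float,
    prev_end_x: float,
    prev_text: str,
    next_start_x: float,
    font_size: float,
    overlap_fraction: float = 0.2,   # assumed: what counts as "substantial"
    est_char_width_em: float = 0.5,  # assumed: conservative per-character width
    space_gap_em: float = 0.2,       # assumed: minimum gap read as a space
) -> bool:
    """Decide whether to insert a space between two adjacent text items."""
    min_gap = space_gap_em * font_size
    gap = next_start_x - prev_end_x
    if gap >= 0:
        return gap >= min_gap

    reported_width = prev_end_x - prev_start_x
    overlap = -gap
    if reported_width <= 0 or overlap < overlap_fraction * reported_width:
        # Small negative gap: tight kerning, treat as a continuation.
        return False

    # Large overlap suggests an inflated advance (e.g. DW-only CID font):
    # fall back to a per-character estimate for the end point.
    estimated_end = prev_start_x + len(prev_text) * est_char_width_em * font_size
    honest_end = min(prev_end_x, estimated_end)
    return next_start_x - honest_end >= min_gap
```

With these placeholder values, an en-dash item spanning x=100..112 at font size 12 followed by an item starting at x=109 is treated as a word boundary, while a five-letter item spanning 100..130 followed by one at 129.5 is still joined without a space.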

https://claude.ai/code/session_01PJH2eef2vqet1qC1EsXpkn