Add ParseBench integration for non-OCR PDF evaluation by ThomAub · Pull Request #16 · ThomAub/officemd

ThomAub · 2026-04-19T17:09:59Z

Summary

- Adds `integrations/parsebench/`, a drop-in ParseBench PARSE provider (`officemd_local`) that shells out to the local OfficeMD CLI via `cargo run -p officemd_cli -- stream <pdf> --output-format json --pretty` and normalizes the JSON into `ParseOutput`.
- Registers a new `officemd_local` pipeline through a tiny patch module (`register_officemd_pipelines(register_fn)`) so ParseBench picks it up alongside its built-in parse pipelines.
- Ships `scripts/classify_non_ocr_pdfs.py`, which uses `officemd inspect --output-format json` to materialize the non-OCR slice (classification `TextBased`, empty `pages_needing_ocr`) as a JSONL report plus a plain-text manifest, with no upstream dataset edits.
- Unit tests stub `subprocess.run` and cover valid multi-page normalization, page-order preservation, the invalid JSON / non-zero exit / missing `pdf.pages` error paths, non-PDF inputs, cargo-vs-binary modes, and config validation.
- OfficeMD CLI is intentionally not modified for v1: the current `pdf.pages[]` + diagnostics payload is sufficient.
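
The non-OCR selection rule described above can be sketched as follows. This is only an illustration, not the contents of `scripts/classify_non_ocr_pdfs.py`: it assumes `classification` and `pages_needing_ocr` sit at the top level of the inspect JSON, and the function names and output file names are hypothetical.

```python
import json
from pathlib import Path


def is_non_ocr(report: dict) -> bool:
    """True when a PDF is classified TextBased and no page needs OCR.

    Assumption: both fields are top-level keys of the inspect payload.
    """
    return (
        report.get("classification") == "TextBased"
        and report.get("pages_needing_ocr") == []
    )


def write_slice(reports: dict[str, dict], out_dir: Path) -> list[str]:
    """Write a JSONL row per PDF and a manifest listing the selected paths."""
    out_dir.mkdir(parents=True, exist_ok=True)
    selected = sorted(path for path, rep in reports.items() if is_non_ocr(rep))
    # Hypothetical file names; the real script may use different ones.
    with (out_dir / "report.jsonl").open("w", encoding="utf-8") as fh:
        for path in sorted(reports):
            row = {"path": path, "non_ocr": path in selected, **reports[path]}
            fh.write(json.dumps(row) + "\n")
    manifest = "".join(path + "\n" for path in selected)
    (out_dir / "manifest.txt").write_text(manifest, encoding="utf-8")
    return selected
```

Keeping the selection in derived files like these is what lets the upstream dataset stay untouched.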

Normalization rules

- `pdf.pages[].number` → `PageIR.page_index = number - 1`
- `pdf.pages[].markdown` → `PageIR.markdown`
- Document markdown = page markdown joined by a blank line
- `layout_pages` left empty in v1
- Full OfficeMD JSON kept in `raw_output["document"]` for later analysis
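
A minimal sketch of these rules, for illustration only. The `PageIR` and `ParseOutput` dataclasses below are hypothetical stand-ins with just the fields named above; ParseBench's real types and the provider's actual code may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PageIR:
    """Hypothetical stand-in for ParseBench's PageIR."""
    page_index: int
    markdown: str


@dataclass
class ParseOutput:
    """Hypothetical stand-in for ParseBench's ParseOutput."""
    pages: list[PageIR]
    markdown: str
    layout_pages: list = field(default_factory=list)  # left empty in v1
    raw_output: dict = field(default_factory=dict)


def normalize(document: dict) -> ParseOutput:
    """Map OfficeMD's pdf.pages[] payload onto the stand-in types."""
    try:
        raw_pages = document["pdf"]["pages"]
    except (KeyError, TypeError) as exc:
        raise ValueError("OfficeMD output has no pdf.pages") from exc
    # Page order is taken as emitted; only the 1-based number becomes 0-based.
    pages = [
        PageIR(page_index=page["number"] - 1, markdown=page["markdown"])
        for page in raw_pages
    ]
    return ParseOutput(
        pages=pages,
        markdown="\n\n".join(page.markdown for page in pages),
        raw_output={"document": document},
    )
```

For example, a two-page payload with markdown `# A` and `B` would yield page indices 0 and 1 and a document markdown of `# A`, a blank line, then `B`.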

Test plan

- `uv pip install -e integrations/parsebench[dev]`
- `uv run pytest integrations/parsebench/tests -q` (provider unit tests)
- From a ParseBench checkout: patch `inference/pipelines/parse.py` to call `register_officemd_pipelines`, then
  - `uv run parse-bench pipelines` lists `officemd_local`
  - `uv run parse-bench run officemd_local --test --group text_content` completes end-to-end
  - `uv run parse-bench run officemd_local --test --group text_formatting` completes end-to-end
- Run `scripts/classify_non_ocr_pdfs.py` over the dataset to produce the non-OCR manifest, then run `officemd_local`, `pypdf_baseline`, and `pymupdf_text` on the same slice and compare reports for `text_content` / `text_formatting` / `table`.
- Produce a short gap report grouping failures by category (reading order, list/header handling, table markdown, formatting loss).

Notes / follow-ups

- `chart` category skipped in the first pass: OfficeMD exposes text markdown but no chart-specific structured payload yet.
- If the comparison surfaces gaps that need cleaner page-boundary/formatting semantics, the follow-up belongs in the OfficeMD PDF JSON payload rather than ParseBench-side post-processing.
- Layout attribution (`ParseLayoutPageIR`) is not wired for v1.

https://claude.ai/code/session_01PJH2eef2vqet1qC1EsXpkn

Introduces a drop-in ParseBench PARSE provider that shells out to the local OfficeMD CLI (`cargo run -p officemd_cli -- stream <pdf> --output-format json --pretty`), normalizes the per-page PDF payload into ParseBench's `ParseOutput`, and registers a new `officemd_local` pipeline. Pages are mapped from 1-based `pdf.pages[].number` to 0-based `PageIR.page_index`, document markdown is the blank-line-joined page markdown, and the full OfficeMD JSON is preserved in `raw_output` for analysis. `layout_pages` is left empty in v1.

Also ships a non-OCR slicing script that invokes `officemd inspect --output-format json` to select PDFs classified `TextBased` with an empty `pages_needing_ocr`, emitting a JSONL report and a plain-text manifest rather than editing upstream dataset files.

Unit tests stub `subprocess.run` so they exercise normalization, page ordering, error mapping, and binary/cargo modes without needing cargo or a live OfficeMD build.
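
The stubbing approach can be sketched like this. `run_officemd` and `fake_run` are hypothetical names used for illustration, and the error types are assumptions; only the CLI arguments come from the description above.

```python
import json
import subprocess
from unittest import mock


def run_officemd(pdf_path: str) -> dict:
    """Hypothetical runner: invoke the OfficeMD CLI and parse its JSON output."""
    cmd = ["cargo", "run", "-p", "officemd_cli", "--", "stream", pdf_path,
           "--output-format", "json", "--pretty"]
    proc = subprocess.run(cmd, capture_output=True, text=True, check=False)
    if proc.returncode != 0:
        raise RuntimeError(
            f"officemd exited with {proc.returncode}: {proc.stderr.strip()}")
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError as exc:
        raise RuntimeError("officemd produced invalid JSON") from exc


def fake_run(stdout: str = "", returncode: int = 0, stderr: str = "") -> mock.Mock:
    """Build a stub that returns a canned CompletedProcess instead of spawning cargo."""
    completed = subprocess.CompletedProcess(
        args=[], returncode=returncode, stdout=stdout, stderr=stderr)
    return mock.Mock(return_value=completed)


def demo() -> dict:
    """Run the hypothetical runner against a stub; no cargo or OfficeMD build needed."""
    payload = {"pdf": {"pages": [{"number": 1, "markdown": "# Title"}]}}
    with mock.patch("subprocess.run", fake_run(json.dumps(payload))) as stub:
        result = run_officemd("sample.pdf")
    # The stub records the command line, so the invocation can be checked too.
    assert stub.call_args.args[0][:2] == ["cargo", "run"]
    return result
```

The same stub covers the error paths by passing a non-zero `returncode` or a `stdout` that is not valid JSON.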

PDF 32000-1 §8.4.2 lists the text state, including Tf (font name) and Tfs (font size), as part of the graphics state that `q` pushes and `Q` pops. OfficeMD's stack only saved the CTM, fill color, and text rendering mode, so a `/Fx Tf` inside a nested `q ... Q` block leaked out.

Concretely, in PDFs that render body prose in a TrueType/WinAnsi font (F1) and punctuation like the en-dash via a Type0/Identity-H font (F6) sharing the same BaseFont, the sequence

    /F1 Tf ... q /F6 Tf [<00B2>]TJ Q BT [(Fundamentals)]TJ

left `current_font=F6` when the `(Fundamentals)` literal was decoded. The bytes got routed through F6's 2-byte ToUnicode CMap, where the range `<0044><005D> -> <0061>` maps 0x46 -> 'c', so "Fundamentals" came out as "cundamentals". Same mechanism yielded "Open" -> "lpen", "Markup" -> "jarkup", "Reference" -> "oence", which all showed up as ParseBench `missing_specific_word` failures on OpenXML_WhitePaper.pdf.

Fix: extend the `q`/`Q` save/restore to carry the current font name and size alongside the CTM.

The snapshot diff on OpenXML_WhitePaper.pdf is pure gain: ligatures decode correctly, section titles join with their bodies, and numeric "165 pages / 125 pages / 466 pages" fragments stop appearing as "1SR pages / 1OR pages / 4SS pages".

ParseBench smoke run on the whitepaper+sample fixture:

    officemd_local: 10/12 -> 11/12 (83% -> 92%)
    pypdf_baseline: 12/12 (unchanged)
    pymupdf_text:   12/12 (unchanged)

One remaining failure, `missing_specific_sentence "Part 1 – Fundamentals"`, is a separate space-preservation issue across the F6/F1 font switch and is not addressed here.
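
The leak and the fix can be reproduced with a toy model. This is a Python sketch for illustration, not OfficeMD's Rust code: the interpreter, its `save_font` switch, and the per-byte decoder (range applied to each byte, everything else passed through) are simplifying assumptions, and the en-dash `TJ` inside the nested block is omitted.

```python
from dataclasses import dataclass, field

# Range taken from the description: <0044><005D> -> <0061>.
F6_RANGE = (0x44, 0x5D, 0x61)


def decode(font: str, raw: bytes) -> str:
    """Toy decoder: F6 applies the range per byte, any other font is Latin-1."""
    if font != "F6":
        return raw.decode("latin-1")
    lo, hi, dst = F6_RANGE
    return "".join(chr(dst + b - lo) if lo <= b <= hi else chr(b) for b in raw)


@dataclass
class Interpreter:
    save_font: bool  # True models the fix, False the old behaviour
    font: str = ""
    size: float = 0.0
    stack: list = field(default_factory=list)
    out: list = field(default_factory=list)

    def run(self, ops) -> str:
        for op, *args in ops:
            if op == "Tf":
                self.font, self.size = args
            elif op == "q":
                # Push the text state that should survive the nested block.
                self.stack.append((self.font, self.size))
            elif op == "Q":
                font, size = self.stack.pop()
                if self.save_font:
                    self.font, self.size = font, size
            elif op == "TJ":
                self.out.append(decode(self.font, args[0]))
        return "".join(self.out)


OPS = [("Tf", "F1", 12.0), ("q",), ("Tf", "F6", 12.0), ("Q",),
       ("TJ", b"Fundamentals")]
leaky = Interpreter(save_font=False).run(OPS)
fixed = Interpreter(save_font=True).run(OPS)
```

Under these assumptions `leaky` reproduces the "cundamentals" symptom (0x46 lands at 0x61 + 2), while `fixed` decodes through F1 again.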

…ted widths

Some Type0/CID fonts ship only a default width (DW) and no W array, so every glyph, including narrow punctuation like an en-dash, reports the same wide advance. That inflates `end_x` in the line-merge pass and hides real whitespace between items placed via their own Tm operators, producing joins like `Part 1 –Fundamentals` or `four forces –extremely`.

When the overlap is a substantial fraction of the prior item's reported width, fall back to a conservative per-character estimate for the "honest" end point, so a visible shift to the next item still reads as a word boundary. Tight letter kerning (small negative gap) is still treated as a no-space continuation.

https://claude.ai/code/session_01PJH2eef2vqet1qC1EsXpkn
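
One way to picture the heuristic, as a hedged Python sketch rather than the actual OfficeMD implementation. Every threshold below is an assumed value chosen for illustration, and `needs_space` is a hypothetical name.

```python
# All three constants are assumptions, not values from OfficeMD.
OVERLAP_FRACTION = 0.2  # overlap beyond 20% of the reported width is suspicious
CHAR_WIDTH_EM = 0.5     # conservative advance per character, in ems
SPACE_GAP_EM = 0.2      # minimum gap, in ems, that reads as a word boundary


def needs_space(prev_x: float, prev_width: float, prev_text: str,
                next_x: float, font_size: float) -> bool:
    """Decide whether to insert a space between two adjacent text items."""
    gap = next_x - (prev_x + prev_width)
    overlap = -gap
    if overlap >= OVERLAP_FRACTION * prev_width:
        # The reported width is probably inflated (DW only, no W array):
        # estimate the end point from the character count instead.
        honest_end = prev_x + len(prev_text) * CHAR_WIDTH_EM * font_size
        honest_end = min(honest_end, prev_x + prev_width)
        gap = next_x - honest_end
    # A small negative gap (tight kerning) never reaches the threshold.
    return gap > SPACE_GAP_EM * font_size
```

With these assumed numbers, an en-dash at x=100 reporting a 12-unit advance followed by an item at x=109 gets a space, while a letter kerned 0.4 units into its neighbour does not.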

claude added 3 commits April 19, 2026 17:09

ThomAub wants to merge 3 commits into main from claude/integrate-officemd-parsebench-BgmL5
