From 36be34f7f9f3e22a34e151a5153210eb8a03c462 Mon Sep 17 00:00:00 2001
From: Claude
Date: Sun, 19 Apr 2026 17:09:25 +0000
Subject: [PATCH 1/3] Add ParseBench integration for non-OCR PDF evaluation

Introduces a drop-in ParseBench PARSE provider that shells out to the local OfficeMD CLI (`cargo run -p officemd_cli -- stream --output-format json --pretty`), normalizes the per-page PDF payload into ParseBench's `ParseOutput`, and registers a new `officemd_local` pipeline. Pages are mapped from 1-based `pdf.pages[].number` to 0-based `PageIR.page_index`, document markdown is the blank-line-joined page markdown, and the full OfficeMD JSON is preserved in `raw_output` for analysis. `layout_pages` is left empty in v1.

Also ships a non-OCR slicing script that invokes `officemd inspect --output-format json` to select PDFs classified `TextBased` with an empty `pages_needing_ocr`, emitting a JSONL report and a plain-text manifest rather than editing upstream dataset files.

Unit tests stub `subprocess.run` so they exercise normalization, page ordering, error mapping, and binary/cargo modes without needing cargo or a live OfficeMD build.
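As an illustration of the page mapping described above, the following minimal sketch shows the 1-based to 0-based conversion and the blank-line join. It is not the provider's code: `Page` and `normalize_pages` are hypothetical stand-ins for ParseBench's `PageIR` and the provider's normalization step, and the sample payload is made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for ParseBench's PageIR (illustration only)."""

    page_index: int
    markdown: str


def normalize_pages(pages_payload: list[dict]) -> tuple[list[Page], str]:
    """Map 1-based `number` to 0-based `page_index`, sort, and join markdown."""
    pages = [
        # A null or missing `markdown` is treated as an empty page here.
        Page(page_index=entry["number"] - 1, markdown=entry.get("markdown") or "")
        for entry in pages_payload
    ]
    # The CLI payload is not assumed to be ordered, so sort by page index.
    pages.sort(key=lambda page: page.page_index)
    # Document markdown is the page markdown joined by a single blank line.
    return pages, "\n\n".join(page.markdown for page in pages)


# Made-up payload with pages deliberately out of order.
pages, document_markdown = normalize_pages(
    [
        {"number": 2, "markdown": "# Page 2\n\nBody"},
        {"number": 1, "markdown": "# Page 1"},
    ]
)
```

Sorting before joining keeps the document markdown in reading order even if the CLI were to emit pages out of sequence.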
--- CLAUDE.md | 2 + integrations/parsebench/README.md | 170 +++++++++ integrations/parsebench/pyproject.toml | 26 ++ .../scripts/classify_non_ocr_pdfs.py | 238 +++++++++++++ .../inference/pipelines/officemd_pipelines.py | 49 +++ .../providers/parse/officemd_local.py | 300 ++++++++++++++++ .../parsebench/tests/test_officemd_local.py | 331 ++++++++++++++++++ 7 files changed, 1116 insertions(+) create mode 100644 integrations/parsebench/README.md create mode 100644 integrations/parsebench/pyproject.toml create mode 100755 integrations/parsebench/scripts/classify_non_ocr_pdfs.py create mode 100644 integrations/parsebench/src/parse_bench/inference/pipelines/officemd_pipelines.py create mode 100644 integrations/parsebench/src/parse_bench/inference/providers/parse/officemd_local.py create mode 100644 integrations/parsebench/tests/test_officemd_local.py diff --git a/CLAUDE.md b/CLAUDE.md index 7f019f3..c88052e 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -83,3 +83,5 @@ uv run pytest ../../crates/tests/snapshots/ --force-regen -q - Pre-commit hooks via prek: `cargo fmt`, `cargo clippy`, `oxlint`, `ruff` - Snapshot tests live in `crates/tests/rust_snapshots/` (Rust) and `crates/tests/snapshots/` (Python) - Fixtures in `examples/data/` +- External benchmark integrations live in `integrations/` (e.g. + `integrations/parsebench/` for the run-llama/ParseBench PARSE provider) diff --git a/integrations/parsebench/README.md b/integrations/parsebench/README.md new file mode 100644 index 0000000..baba0f6 --- /dev/null +++ b/integrations/parsebench/README.md @@ -0,0 +1,170 @@ +# OfficeMD ↔ ParseBench integration + +A drop-in [ParseBench][parsebench] `PARSE` provider that invokes the local +OfficeMD CLI on a PDF and normalizes the JSON document into ParseBench's +`ParseOutput`. Shipped with a helper script to materialize a non-OCR slice of +a document corpus using OfficeMD's own classifier. 
+ +The first benchmark pass targets **text-based, non-OCR PDFs** so results are +comparable with ParseBench's local baselines (`pypdf_baseline`, +`pymupdf_text`). + +[parsebench]: https://github.com/run-llama/ParseBench + +## Layout + +``` +integrations/parsebench/ +├── pyproject.toml +├── src/parse_bench/ +│ └── inference/ +│ ├── providers/parse/officemd_local.py # Provider (drop-in module path) +│ └── pipelines/officemd_pipelines.py # register_officemd_pipelines(register_fn) +├── scripts/classify_non_ocr_pdfs.py # Materialize the non-OCR slice +└── tests/test_officemd_local.py # Provider unit tests +``` + +The provider module lives under the `parse_bench.inference.providers.parse` +package path so it can be imported as-is from a ParseBench checkout. The +package is distributed as `parse-bench-officemd` and cohabits the +`parse_bench` namespace via a Hatch wheel. + +## Installation + +Install into the same virtualenv as your ParseBench checkout. From the +OfficeMD repo root: + +```sh +uv pip install -e integrations/parsebench +# or: +pip install -e integrations/parsebench +``` + +Then patch the ParseBench pipeline registry to register the new pipeline. +Edit `src/parse_bench/inference/pipelines/parse.py` (in the ParseBench +checkout) to call `register_officemd_pipelines`: + +```python +from parse_bench.inference.pipelines.officemd_pipelines import ( + register_officemd_pipelines, +) + +def register_parse_pipelines(register_fn): + # ...existing registrations... + register_officemd_pipelines(register_fn) +``` + +No upstream changes are required beyond this single call; the provider +auto-registers via `@register_provider("officemd_local")` when the pipeline +module is imported. 
+ +## Running + +Point the provider at an OfficeMD checkout (either via env var or pipeline +config) and run the pipeline: + +```sh +export OFFICEMD_REPO_ROOT=/path/to/officemd +uv run parse-bench pipelines # confirms officemd_local is listed +uv run parse-bench run officemd_local --test --group text_content +uv run parse-bench run officemd_local --test --group text_formatting +``` + +By default the provider invokes `cargo run --release -p officemd_cli -- +stream --output-format json --pretty` from `repo_root`. To use a +prebuilt binary instead: + +```python +PipelineSpec( + pipeline_name="officemd_local_binary", + provider_name="officemd_local", + product_type=ProductType.PARSE, + config={ + "cargo_run": False, + "binary": "/path/to/target/release/officemd", + "extra_args": ["--no-headers-footers"], + }, +) +``` + +### Provider config reference + +| Key | Type | Default | Description | +|---|---|---|---| +| `cargo_run` | bool | `True` | Invoke `cargo run` from `repo_root` | +| `repo_root` | str | `$OFFICEMD_REPO_ROOT` | Absolute path to the workspace root (required when `cargo_run=True`) | +| `cargo_profile` | str | `"release"` | `release`, `dev`, or a custom profile name | +| `binary` | str | `None` | Prebuilt binary path (required when `cargo_run=False`) | +| `extra_args` | list[str] | `[]` | Extra CLI flags appended after the input path | +| `timeout_seconds` | float | `600` | Per-file subprocess timeout | + +## Normalization rules + +- `pdf.pages[].number` → `PageIR.page_index = number - 1` +- `pdf.pages[].markdown` → `PageIR.markdown` +- Document `markdown` is page markdown joined with a single blank line +- `layout_pages` is intentionally left empty in v1; layout attribution is + not wired through OfficeMD yet +- The full OfficeMD JSON document (including `pdf.diagnostics`) is preserved + in `raw_output["document"]` for downstream analysis + +## Slicing the non-OCR benchmark subset + +After the dataset documents are present, run the classifier to select PDFs 
+that OfficeMD considers `TextBased` with no pages requiring OCR: + +```sh +uv run integrations/parsebench/scripts/classify_non_ocr_pdfs.py \ + --input-dir ~/parsebench/data/documents \ + --report-jsonl non_ocr_report.jsonl \ + --manifest non_ocr_manifest.txt +``` + +Outputs: + +- `non_ocr_report.jsonl`: one row per PDF with classification, confidence, + page count, `pages_needing_ocr`, and a `non_ocr` boolean. Keep this + alongside the benchmark run for later drill-down. +- `non_ocr_manifest.txt`: the filtered list of qualifying PDFs; feed it to + ParseBench (or a small wrapper) to restrict the run rather than editing + the upstream dataset files. + +A PDF is selected when the classification is `TextBased` **and** +`pages_needing_ocr` is empty, matching OfficeMD's own definition of +"pure text-based". + +## Comparison workflow + +Run the same non-OCR slice through OfficeMD and the local baselines: + +```sh +uv run parse-bench run officemd_local --test --group text_content +uv run parse-bench run pypdf_baseline --test --group text_content +uv run parse-bench run pymupdf_text --test --group text_content +``` + +Repeat for `text_formatting` and `table`. `chart` is intentionally skipped +in the first pass: OfficeMD emits text markdown but no chart-specific +structured payload yet, so a chart comparison would degenerate into a text +comparison. Revisit once chart metadata is part of the PDF JSON payload. + +## Testing + +```sh +uv pip install -e "integrations/parsebench[dev]" +uv run pytest integrations/parsebench/tests -q +``` + +The unit tests stub `subprocess.run` and do not require `cargo`, a real PDF, +or a live OfficeMD checkout. They do require `parse_bench` to be installed +in the environment; otherwise they are skipped via `pytest.importorskip`. + +## Assumptions and non-goals + +- Invocation uses `cargo run` by default so benchmark runs always exercise + the current tree. Switch to `binary` mode for stable runs. 
+- ParseBench layout attribution is **not** wired up in v1; the goal is to + improve non-OCR parse quality, not overlay reconstruction. +- Cleaner page-boundary or formatting semantics than the current CLI JSON + exposes should be added to the OfficeMD PDF payload, **not** papered over + with ParseBench-specific post-processing. diff --git a/integrations/parsebench/pyproject.toml b/integrations/parsebench/pyproject.toml new file mode 100644 index 0000000..3705020 --- /dev/null +++ b/integrations/parsebench/pyproject.toml @@ -0,0 +1,26 @@ +[project] +name = "parse-bench-officemd" +version = "0.1.0" +description = "OfficeMD local provider for run-llama/ParseBench (PARSE task)" +readme = "README.md" +requires-python = ">=3.10" +license = { text = "MIT" } + +# `parse_bench` is assumed to be installed alongside this package (editable +# install of the ParseBench checkout or a released wheel). It is intentionally +# left out of `dependencies` so this package can be added to an existing +# ParseBench environment without fighting version pins. +dependencies = [] + +[project.optional-dependencies] +dev = ["pytest>=7"] + +[build-system] +requires = ["hatchling>=1.18"] +build-backend = "hatchling.build" + +[tool.hatch.build.targets.wheel] +packages = ["src/parse_bench"] + +[tool.pytest.ini_options] +testpaths = ["tests"] diff --git a/integrations/parsebench/scripts/classify_non_ocr_pdfs.py b/integrations/parsebench/scripts/classify_non_ocr_pdfs.py new file mode 100755 index 0000000..656dbd2 --- /dev/null +++ b/integrations/parsebench/scripts/classify_non_ocr_pdfs.py @@ -0,0 +1,238 @@ +#!/usr/bin/env python3 +"""Materialize a non-OCR PDF slice using OfficeMD's classifier. 
+ +Runs `officemd inspect --output-format json` against a directory of PDFs (or a +list of paths) and emits: + +- a JSONL file with one row per PDF listing classification, page count, OCR + requirement, encoding issues, and pass/fail for the "non-OCR" slice, and +- a plain-text manifest containing only the paths that qualify. + +A PDF qualifies for the non-OCR slice when: + +- the OfficeMD classification is `TextBased`, AND +- `pages_needing_ocr` is empty. + +Usage from the OfficeMD repo root:: + + uv run integrations/parsebench/scripts/classify_non_ocr_pdfs.py \\ + --input-dir /path/to/pdfs \\ + --report-jsonl non_ocr_report.jsonl \\ + --manifest non_ocr_manifest.txt + +Or explicitly listing files:: + + uv run .../classify_non_ocr_pdfs.py file1.pdf file2.pdf \\ + --report-jsonl report.jsonl --manifest manifest.txt + +By default, the script uses `cargo run -p officemd_cli --release`. Pass +`--binary /path/to/officemd` to use a prebuilt binary instead. +""" + +from __future__ import annotations + +import argparse +import json +import os +import shutil +import subprocess +import sys +from collections.abc import Iterable +from pathlib import Path + + +def _collect_pdfs( + input_dir: Path | None, explicit_paths: list[Path] +) -> list[Path]: + pdfs: list[Path] = list(explicit_paths) + if input_dir is not None: + pdfs.extend( + sorted(p for p in input_dir.rglob("*.pdf") if p.is_file()) + ) + + # Keep stable order, drop duplicates. 
+ seen: set[Path] = set() + deduped: list[Path] = [] + for path in pdfs: + resolved = path.resolve() + if resolved in seen: + continue + seen.add(resolved) + deduped.append(resolved) + return deduped + + +def _inspect_pdf( + pdf: Path, + *, + cargo_run: bool, + repo_root: Path, + binary: Path | None, + cargo_profile: str, +) -> dict[str, object]: + if cargo_run: + argv = ["cargo", "run", "--quiet", "-p", "officemd_cli"] + profile = cargo_profile.lower() + if profile == "release": + argv.append("--release") + elif profile not in ("dev", "debug", ""): + argv.extend(["--profile", cargo_profile]) + argv.extend(["--", "inspect", str(pdf), "--output-format", "json"]) + cwd: str | None = str(repo_root) + else: + assert binary is not None + argv = [str(binary), "inspect", str(pdf), "--output-format", "json"] + cwd = None + + completed = subprocess.run( # noqa: S603 - argv constructed internally + argv, + cwd=cwd, + capture_output=True, + text=True, + check=False, + ) + if completed.returncode != 0: + return { + "error": "inspect_failed", + "returncode": completed.returncode, + "stderr_tail": (completed.stderr or "").strip().splitlines()[-5:], + } + + try: + payload = json.loads(completed.stdout or "") + except json.JSONDecodeError as exc: + return {"error": f"invalid_json: {exc}"} + + pdf_info = payload.get("pdf") if isinstance(payload, dict) else None + if not isinstance(pdf_info, dict): + return {"error": "missing_pdf_info"} + return pdf_info + + +def _is_non_ocr(pdf_info: dict[str, object]) -> bool: + classification = pdf_info.get("classification") + pages_needing_ocr = pdf_info.get("pages_needing_ocr") + return ( + classification == "TextBased" + and isinstance(pages_needing_ocr, list) + and len(pages_needing_ocr) == 0 + ) + + +def _default_repo_root() -> Path: + env = os.environ.get("OFFICEMD_REPO_ROOT") + if env: + return Path(env).resolve() + # Walk upwards from this script until we find a Cargo.toml. 
+ here = Path(__file__).resolve() + for candidate in here.parents: + if (candidate / "Cargo.toml").is_file(): + return candidate + return Path.cwd().resolve() + + +def main(argv: Iterable[str] | None = None) -> int: + parser = argparse.ArgumentParser(description=__doc__) + parser.add_argument( + "pdfs", + nargs="*", + type=Path, + help="Explicit PDF paths to classify (combined with --input-dir).", + ) + parser.add_argument( + "--input-dir", + type=Path, + default=None, + help="Directory to scan recursively for *.pdf files.", + ) + parser.add_argument( + "--report-jsonl", + type=Path, + required=True, + help="Path to write the per-PDF classification report (JSONL).", + ) + parser.add_argument( + "--manifest", + type=Path, + required=True, + help="Path to write the plain-text list of non-OCR PDFs (one per line).", + ) + parser.add_argument( + "--repo-root", + type=Path, + default=_default_repo_root(), + help="OfficeMD workspace root used when invoking via `cargo run`.", + ) + parser.add_argument( + "--binary", + type=Path, + default=None, + help="Path to a prebuilt officemd binary. When set, `cargo run` is not used.", + ) + parser.add_argument( + "--cargo-profile", + default="release", + help="Cargo profile for `cargo run`. 
Default: release.", + ) + parser.add_argument( + "--no-cargo-run", + dest="cargo_run", + action="store_false", + help="Require --binary instead of invoking cargo.", + ) + parser.set_defaults(cargo_run=True) + + args = parser.parse_args(list(argv) if argv is not None else None) + + cargo_run = args.cargo_run and args.binary is None + if cargo_run and shutil.which("cargo") is None: + print("ERROR: cargo not found on PATH.", file=sys.stderr) + return 2 + if not cargo_run and args.binary is None: + print("ERROR: must provide --binary when --no-cargo-run is used.", file=sys.stderr) + return 2 + if not cargo_run and not args.binary.is_file(): + print(f"ERROR: binary {args.binary} not found.", file=sys.stderr) + return 2 + + pdfs = _collect_pdfs(args.input_dir, args.pdfs) + if not pdfs: + print("ERROR: no PDFs provided (use positional args or --input-dir).", file=sys.stderr) + return 2 + + args.report_jsonl.parent.mkdir(parents=True, exist_ok=True) + args.manifest.parent.mkdir(parents=True, exist_ok=True) + + selected: list[Path] = [] + with args.report_jsonl.open("w", encoding="utf-8") as report_fh: + for pdf in pdfs: + pdf_info = _inspect_pdf( + pdf, + cargo_run=cargo_run, + repo_root=args.repo_root, + binary=args.binary, + cargo_profile=args.cargo_profile, + ) + non_ocr = "error" not in pdf_info and _is_non_ocr(pdf_info) + record = { + "path": str(pdf), + "non_ocr": non_ocr, + "pdf": pdf_info, + } + report_fh.write(json.dumps(record) + "\n") + if non_ocr: + selected.append(pdf) + + with args.manifest.open("w", encoding="utf-8") as manifest_fh: + for pdf in selected: + manifest_fh.write(f"{pdf}\n") + + print( + f"Classified {len(pdfs)} PDFs; {len(selected)} in non-OCR slice.", + file=sys.stderr, + ) + return 0 + + +if __name__ == "__main__": + raise SystemExit(main()) diff --git a/integrations/parsebench/src/parse_bench/inference/pipelines/officemd_pipelines.py b/integrations/parsebench/src/parse_bench/inference/pipelines/officemd_pipelines.py new file mode 100644 index 
0000000..65b02fb --- /dev/null +++ b/integrations/parsebench/src/parse_bench/inference/pipelines/officemd_pipelines.py @@ -0,0 +1,49 @@ +"""Pipeline definitions for the OfficeMD local provider. + +To register these pipelines inside a ParseBench checkout, import this module +from `parse_bench.inference.pipelines.parse.register_parse_pipelines` (or call +`register_officemd_pipelines(register_fn)` directly), passing the same +`register_fn` used for built-in pipelines. + +Example patch applied to `src/parse_bench/inference/pipelines/parse.py`: + + from parse_bench.inference.pipelines.officemd_pipelines import ( + register_officemd_pipelines, + ) + + def register_parse_pipelines(register_fn): + # ...existing registrations... + register_officemd_pipelines(register_fn) +""" + +from __future__ import annotations + +from collections.abc import Callable + +from parse_bench.schemas.pipeline import PipelineSpec +from parse_bench.schemas.product import ProductType + +# Importing the provider module ensures `@register_provider("officemd_local")` +# fires before any pipeline referencing it is instantiated. +import parse_bench.inference.providers.parse.officemd_local # noqa: F401 + + +def register_officemd_pipelines(register_fn: Callable[[PipelineSpec], None]) -> None: + """Register all OfficeMD-backed pipelines with ParseBench. + + The default pipeline invokes the CLI from a local OfficeMD checkout via + `cargo run`. Set `OFFICEMD_REPO_ROOT` in the environment, or override + `repo_root` on the pipeline config, to point at the checkout. 
+ """ + register_fn( + PipelineSpec( + pipeline_name="officemd_local", + provider_name="officemd_local", + product_type=ProductType.PARSE, + config={ + "cargo_run": True, + "cargo_profile": "release", + "extra_args": [], + }, + ) + ) diff --git a/integrations/parsebench/src/parse_bench/inference/providers/parse/officemd_local.py b/integrations/parsebench/src/parse_bench/inference/providers/parse/officemd_local.py new file mode 100644 index 0000000..d27a393 --- /dev/null +++ b/integrations/parsebench/src/parse_bench/inference/providers/parse/officemd_local.py @@ -0,0 +1,300 @@ +"""Provider that invokes the local OfficeMD CLI for PARSE tasks. + +The provider shells out to `cargo run -p officemd_cli -- stream +--output-format json --pretty` from the OfficeMD workspace root, parses the +resulting JSON document, and normalizes the per-page PDF payload into a +ParseBench `ParseOutput`. + +Config keys (read from `PipelineSpec.config` via `base_config`): + +- `repo_root` (str, required unless `OFFICEMD_REPO_ROOT` env var is set): + absolute path to the OfficeMD workspace root (the directory containing the + top-level `Cargo.toml`). +- `cargo_run` (bool, default True): when True, invoke via `cargo run`. When + False, invoke the `binary` path directly. +- `binary` (str, optional): absolute path to a prebuilt `officemd` binary. + Only used when `cargo_run` is False. +- `cargo_profile` (str, default "release"): cargo profile flag passed as + `--release` when "release", otherwise passed as `--profile` followed by + the configured name. Set to `"dev"` to drop the flag entirely. +- `extra_args` (list[str], optional): additional CLI flags appended after + the input path (for example `["--no-headers-footers"]`). +- `timeout_seconds` (float | None, default 600): subprocess timeout. + +Normalization rules: + +- `pdf.pages[].number` maps to `PageIR.page_index = number - 1`. +- `pdf.pages[].markdown` becomes per-page markdown. +- Document markdown is page markdown joined by a single blank line. 
+- `layout_pages` is left empty in v1. +- The full OfficeMD JSON document is preserved in the raw output for + downstream analysis. +""" + +from __future__ import annotations + +import json +import os +import shutil +import subprocess +from datetime import datetime +from pathlib import Path +from typing import Any + +from parse_bench.inference.providers.base import ( + Provider, + ProviderConfigError, + ProviderPermanentError, + ProviderTransientError, +) +from parse_bench.inference.providers.registry import register_provider +from parse_bench.schemas.parse_output import PageIR, ParseOutput +from parse_bench.schemas.pipeline import PipelineSpec +from parse_bench.schemas.pipeline_io import ( + InferenceRequest, + InferenceResult, + RawInferenceResult, +) +from parse_bench.schemas.product import ProductType + +_DEFAULT_TIMEOUT_SECONDS = 600.0 + + +@register_provider("officemd_local") +class OfficeMDLocalProvider(Provider): + """Run the local OfficeMD CLI as a PARSE provider.""" + + def __init__(self, provider_name: str, base_config: dict[str, Any] | None = None): + super().__init__(provider_name, base_config) + config = self._base_config + + self._cargo_run: bool = bool(config.get("cargo_run", True)) + self._cargo_profile: str = str(config.get("cargo_profile", "release")) + self._extra_args: list[str] = list(config.get("extra_args") or []) + self._timeout_seconds: float | None = config.get( + "timeout_seconds", _DEFAULT_TIMEOUT_SECONDS + ) + + repo_root = config.get("repo_root") or os.environ.get("OFFICEMD_REPO_ROOT") + self._repo_root: Path | None = Path(repo_root).resolve() if repo_root else None + + binary = config.get("binary") + self._binary: Path | None = Path(binary).resolve() if binary else None + + # Validate mode eagerly so misconfiguration surfaces before first call. 
+ self._resolve_command_prefix() + + def _resolve_command_prefix(self) -> list[str]: + """Return the argv prefix through `stream` (excluding input path).""" + if self._cargo_run: + if self._repo_root is None: + raise ProviderConfigError( + "officemd_local: cargo_run mode requires `repo_root` in config " + "or the OFFICEMD_REPO_ROOT environment variable." + ) + if not (self._repo_root / "Cargo.toml").is_file(): + raise ProviderConfigError( + f"officemd_local: repo_root {self._repo_root} does not contain Cargo.toml." + ) + if shutil.which("cargo") is None: + raise ProviderConfigError( + "officemd_local: `cargo` not found on PATH; install Rust or set " + "`cargo_run=False` and provide `binary`." + ) + argv = ["cargo", "run", "--quiet", "-p", "officemd_cli"] + profile = self._cargo_profile.lower() + if profile == "release": + argv.append("--release") + elif profile not in ("dev", "debug", ""): + argv.extend(["--profile", self._cargo_profile]) + argv.extend(["--", "stream"]) + return argv + + if self._binary is None: + raise ProviderConfigError( + "officemd_local: cargo_run is False but no `binary` path configured." + ) + if not self._binary.is_file(): + raise ProviderConfigError( + f"officemd_local: configured binary {self._binary} does not exist." 
+ ) + return [str(self._binary), "stream"] + + def _working_directory(self) -> Path | None: + if self._cargo_run: + return self._repo_root + return None + + def _build_argv(self, pdf_path: Path) -> list[str]: + argv = self._resolve_command_prefix() + argv.append(str(pdf_path)) + argv.extend(["--output-format", "json", "--pretty"]) + argv.extend(self._extra_args) + return argv + + def _invoke_cli(self, pdf_path: Path) -> tuple[dict[str, Any], str]: + argv = self._build_argv(pdf_path) + cwd = self._working_directory() + try: + completed = subprocess.run( # noqa: S603 - argv is constructed internally + argv, + cwd=str(cwd) if cwd is not None else None, + capture_output=True, + text=True, + timeout=self._timeout_seconds, + check=False, + ) + except FileNotFoundError as exc: + raise ProviderConfigError( + f"officemd_local: failed to launch CLI ({exc}). " + "Check `cargo` / `binary` configuration." + ) from exc + except subprocess.TimeoutExpired as exc: + raise ProviderTransientError( + f"officemd_local: CLI exceeded timeout of {self._timeout_seconds}s" + ) from exc + + if completed.returncode != 0: + stderr_tail = (completed.stderr or "").strip().splitlines()[-20:] + raise ProviderPermanentError( + "officemd_local: CLI exited with code " + f"{completed.returncode}. 
stderr (tail):\n" + + "\n".join(stderr_tail), + debug_payload={ + "argv": argv, + "returncode": completed.returncode, + "stderr": completed.stderr, + }, + ) + + stdout = completed.stdout or "" + try: + document = json.loads(stdout) + except json.JSONDecodeError as exc: + raise ProviderPermanentError( + f"officemd_local: failed to parse CLI JSON output: {exc}", + debug_payload={"argv": argv, "stdout_head": stdout[:2000]}, + ) from exc + + if not isinstance(document, dict): + raise ProviderPermanentError( + "officemd_local: CLI JSON output is not a JSON object.", + debug_payload={"argv": argv}, + ) + + return document, " ".join(argv) + + def run_inference( + self, pipeline: PipelineSpec, request: InferenceRequest + ) -> RawInferenceResult: + if request.product_type != ProductType.PARSE: + raise ProviderPermanentError( + "officemd_local only supports PARSE product type, got " + f"{request.product_type}" + ) + + pdf_path = Path(request.source_file_path) + if pdf_path.suffix.lower() != ".pdf": + raise ProviderPermanentError( + f"officemd_local only supports .pdf files, got {pdf_path.suffix}" + ) + if not pdf_path.exists(): + raise ProviderPermanentError(f"PDF file not found: {pdf_path}") + + started_at = datetime.now() + document, argv_display = self._invoke_cli(pdf_path) + completed_at = datetime.now() + latency_ms = int((completed_at - started_at).total_seconds() * 1000) + + raw_output: dict[str, Any] = { + "document": document, + "argv": argv_display, + "source_file_path": str(pdf_path), + } + + return RawInferenceResult( + request=request, + pipeline=pipeline, + pipeline_name=pipeline.pipeline_name, + product_type=request.product_type, + raw_output=raw_output, + started_at=started_at, + completed_at=completed_at, + latency_in_ms=latency_ms, + ) + + def normalize(self, raw_result: RawInferenceResult) -> InferenceResult: + if raw_result.product_type != ProductType.PARSE: + raise ProviderPermanentError( + "officemd_local only supports PARSE product type, got " + 
f"{raw_result.product_type}" + ) + + document = raw_result.raw_output.get("document") + if not isinstance(document, dict): + raise ProviderPermanentError( + "officemd_local: raw_output missing `document` object." + ) + + pdf_payload = document.get("pdf") + if not isinstance(pdf_payload, dict): + raise ProviderPermanentError( + "officemd_local: CLI output missing `pdf` payload; " + "is the input file a PDF?" + ) + + pages_payload = pdf_payload.get("pages") + if not isinstance(pages_payload, list): + raise ProviderPermanentError( + "officemd_local: `pdf.pages` is missing or not a list." + ) + + pages = _normalize_pages(pages_payload) + document_markdown = "\n\n".join(page.markdown for page in pages) + + output = ParseOutput( + task_type="parse", + example_id=raw_result.request.example_id, + pipeline_name=raw_result.pipeline_name, + pages=pages, + markdown=document_markdown, + ) + + return InferenceResult( + request=raw_result.request, + pipeline_name=raw_result.pipeline_name, + product_type=raw_result.product_type, + raw_output=raw_result.raw_output, + output=output, + started_at=raw_result.started_at, + completed_at=raw_result.completed_at, + latency_in_ms=raw_result.latency_in_ms, + ) + + +def _normalize_pages(pages_payload: list[Any]) -> list[PageIR]: + """Map OfficeMD `pdf.pages` entries to sorted PageIR (0-indexed).""" + pages: list[PageIR] = [] + for entry in pages_payload: + if not isinstance(entry, dict): + raise ProviderPermanentError( + "officemd_local: `pdf.pages` entry is not an object." + ) + raw_number = entry.get("number") + if not isinstance(raw_number, int) or raw_number < 1: + raise ProviderPermanentError( + "officemd_local: `pdf.pages[].number` must be a positive integer, " + f"got {raw_number!r}." + ) + markdown = entry.get("markdown") + if markdown is None: + markdown = "" + elif not isinstance(markdown, str): + raise ProviderPermanentError( + "officemd_local: `pdf.pages[].markdown` must be a string or null, " + f"got {type(markdown).__name__}." 
+ ) + pages.append(PageIR(page_index=raw_number - 1, markdown=markdown)) + + pages.sort(key=lambda p: p.page_index) + return pages diff --git a/integrations/parsebench/tests/test_officemd_local.py b/integrations/parsebench/tests/test_officemd_local.py new file mode 100644 index 0000000..d84ebfa --- /dev/null +++ b/integrations/parsebench/tests/test_officemd_local.py @@ -0,0 +1,331 @@ +"""Unit tests for the `officemd_local` ParseBench provider. + +These tests stub subprocess so they run without invoking `cargo` or +ParseBench-driven execution. They assume `parse_bench` is installed in the +test environment; when it is not, the tests are skipped. +""" + +from __future__ import annotations + +import json +import subprocess +from datetime import datetime +from pathlib import Path +from typing import Any + +import pytest + +pytest.importorskip("parse_bench") + +from parse_bench.inference.providers.base import ( # noqa: E402 + ProviderConfigError, + ProviderPermanentError, +) +from parse_bench.schemas.pipeline import PipelineSpec # noqa: E402 +from parse_bench.schemas.pipeline_io import InferenceRequest # noqa: E402 +from parse_bench.schemas.product import ProductType # noqa: E402 + +from parse_bench.inference.providers.parse.officemd_local import ( # noqa: E402 + OfficeMDLocalProvider, +) + + +@pytest.fixture() +def repo_root(tmp_path: Path) -> Path: + root = tmp_path / "officemd" + root.mkdir() + (root / "Cargo.toml").write_text("[workspace]\n") + return root + + +@pytest.fixture() +def pdf_path(tmp_path: Path) -> Path: + path = tmp_path / "sample.pdf" + path.write_bytes(b"%PDF-1.4\n%stub\n") + return path + + +def _pipeline(repo_root: Path) -> PipelineSpec: + return PipelineSpec( + pipeline_name="officemd_local", + provider_name="officemd_local", + product_type=ProductType.PARSE, + config={ + "cargo_run": True, + "repo_root": str(repo_root), + "cargo_profile": "dev", + }, + ) + + +def _request(pdf_path: Path) -> InferenceRequest: + return InferenceRequest( + 
example_id="ex-1", + source_file_path=str(pdf_path), + product_type=ProductType.PARSE, + ) + + +def _stub_cli( + monkeypatch: pytest.MonkeyPatch, + *, + stdout: str, + returncode: int = 0, + stderr: str = "", +) -> list[list[str]]: + captured: list[list[str]] = [] + + def fake_run(argv: list[str], *args: Any, **kwargs: Any) -> subprocess.CompletedProcess[str]: + captured.append(list(argv)) + return subprocess.CompletedProcess( + args=argv, + returncode=returncode, + stdout=stdout, + stderr=stderr, + ) + + monkeypatch.setattr(subprocess, "run", fake_run) + return captured + + +def _stub_which(monkeypatch: pytest.MonkeyPatch) -> None: + import shutil + + monkeypatch.setattr(shutil, "which", lambda _: "/usr/bin/cargo") + + +def test_normalizes_valid_output( + monkeypatch: pytest.MonkeyPatch, repo_root: Path, pdf_path: Path +) -> None: + _stub_which(monkeypatch) + stdout = json.dumps( + { + "kind": "Pdf", + "pdf": { + "pages": [ + {"number": 2, "markdown": "# Page 2\n\nBody"}, + {"number": 1, "markdown": "# Page 1"}, + ], + "diagnostics": { + "classification": "TextBased", + "confidence": 0.9, + "page_count": 2, + "pages_needing_ocr": [], + "has_encoding_issues": False, + }, + }, + } + ) + captured = _stub_cli(monkeypatch, stdout=stdout) + + provider = OfficeMDLocalProvider( + "officemd_local", _pipeline(repo_root).config + ) + result = provider.run_inference_normalized( + _pipeline(repo_root), _request(pdf_path) + ) + + assert [p.page_index for p in result.output.pages] == [0, 1] + assert result.output.pages[0].markdown == "# Page 1" + assert result.output.pages[1].markdown == "# Page 2\n\nBody" + assert result.output.markdown == "# Page 1\n\n# Page 2\n\nBody" + assert result.output.layout_pages == [] + assert result.raw_output["document"]["pdf"]["diagnostics"]["classification"] == "TextBased" + + argv = captured[0] + assert argv[:5] == ["cargo", "run", "--quiet", "-p", "officemd_cli"] + assert "stream" in argv + assert "--output-format" in argv and "json" in argv + 
assert "--pretty" in argv + assert str(pdf_path) in argv + + +def test_preserves_page_order_multipage( + monkeypatch: pytest.MonkeyPatch, repo_root: Path, pdf_path: Path +) -> None: + _stub_which(monkeypatch) + stdout = json.dumps( + { + "pdf": { + "pages": [ + {"number": i, "markdown": f"page{i}"} for i in range(1, 6) + ], + "diagnostics": { + "classification": "TextBased", + "confidence": 1.0, + "page_count": 5, + "pages_needing_ocr": [], + "has_encoding_issues": False, + }, + }, + } + ) + _stub_cli(monkeypatch, stdout=stdout) + + provider = OfficeMDLocalProvider( + "officemd_local", _pipeline(repo_root).config + ) + result = provider.run_inference_normalized( + _pipeline(repo_root), _request(pdf_path) + ) + + assert [p.page_index for p in result.output.pages] == [0, 1, 2, 3, 4] + assert result.output.markdown.split("\n\n") == [ + "page1", + "page2", + "page3", + "page4", + "page5", + ] + + +def test_invalid_json_raises_permanent( + monkeypatch: pytest.MonkeyPatch, repo_root: Path, pdf_path: Path +) -> None: + _stub_which(monkeypatch) + _stub_cli(monkeypatch, stdout="not json at all") + + provider = OfficeMDLocalProvider( + "officemd_local", _pipeline(repo_root).config + ) + with pytest.raises(ProviderPermanentError): + provider.run_inference(_pipeline(repo_root), _request(pdf_path)) + + +def test_nonzero_exit_raises_permanent( + monkeypatch: pytest.MonkeyPatch, repo_root: Path, pdf_path: Path +) -> None: + _stub_which(monkeypatch) + _stub_cli( + monkeypatch, + stdout="", + returncode=3, + stderr="Error: something broke\n", + ) + + provider = OfficeMDLocalProvider( + "officemd_local", _pipeline(repo_root).config + ) + with pytest.raises(ProviderPermanentError): + provider.run_inference(_pipeline(repo_root), _request(pdf_path)) + + +def test_missing_pdf_payload_fails_normalize( + monkeypatch: pytest.MonkeyPatch, repo_root: Path, pdf_path: Path +) -> None: + _stub_which(monkeypatch) + _stub_cli(monkeypatch, stdout=json.dumps({"kind": "Docx"})) + + provider = 
OfficeMDLocalProvider( + "officemd_local", _pipeline(repo_root).config + ) + raw = provider.run_inference(_pipeline(repo_root), _request(pdf_path)) + with pytest.raises(ProviderPermanentError): + provider.normalize(raw) + + +def test_rejects_non_pdf_input( + monkeypatch: pytest.MonkeyPatch, repo_root: Path, tmp_path: Path +) -> None: + _stub_which(monkeypatch) + docx = tmp_path / "file.docx" + docx.write_bytes(b"PK\x03\x04") + provider = OfficeMDLocalProvider( + "officemd_local", _pipeline(repo_root).config + ) + request = InferenceRequest( + example_id="ex-1", + source_file_path=str(docx), + product_type=ProductType.PARSE, + ) + with pytest.raises(ProviderPermanentError): + provider.run_inference(_pipeline(repo_root), request) + + +def test_requires_repo_root_when_cargo_run( + monkeypatch: pytest.MonkeyPatch, +) -> None: + monkeypatch.delenv("OFFICEMD_REPO_ROOT", raising=False) + with pytest.raises(ProviderConfigError): + OfficeMDLocalProvider( + "officemd_local", + {"cargo_run": True}, + ) + + +def test_binary_mode_requires_existing_binary() -> None: + with pytest.raises(ProviderConfigError): + OfficeMDLocalProvider( + "officemd_local", + {"cargo_run": False, "binary": "/nonexistent/officemd"}, + ) + + +def test_binary_mode_builds_argv_from_binary( + monkeypatch: pytest.MonkeyPatch, tmp_path: Path, pdf_path: Path +) -> None: + binary = tmp_path / "officemd" + binary.write_text("#!/bin/sh\n") + binary.chmod(0o755) + + stdout = json.dumps( + { + "pdf": { + "pages": [{"number": 1, "markdown": "hi"}], + "diagnostics": { + "classification": "TextBased", + "confidence": 1.0, + "page_count": 1, + "pages_needing_ocr": [], + "has_encoding_issues": False, + }, + } + } + ) + captured = _stub_cli(monkeypatch, stdout=stdout) + + provider = OfficeMDLocalProvider( + "officemd_local", + {"cargo_run": False, "binary": str(binary)}, + ) + pipeline = PipelineSpec( + pipeline_name="officemd_local", + provider_name="officemd_local", + product_type=ProductType.PARSE, + 
config={"cargo_run": False, "binary": str(binary)}, + ) + provider.run_inference_normalized(pipeline, _request(pdf_path)) + + argv = captured[0] + assert argv[0] == str(binary) + assert argv[1] == "stream" + assert str(pdf_path) in argv + + +def test_timing_and_latency( + monkeypatch: pytest.MonkeyPatch, repo_root: Path, pdf_path: Path +) -> None: + _stub_which(monkeypatch) + stdout = json.dumps( + { + "pdf": { + "pages": [{"number": 1, "markdown": "x"}], + "diagnostics": { + "classification": "TextBased", + "confidence": 1.0, + "page_count": 1, + "pages_needing_ocr": [], + "has_encoding_issues": False, + }, + } + } + ) + _stub_cli(monkeypatch, stdout=stdout) + + provider = OfficeMDLocalProvider( + "officemd_local", _pipeline(repo_root).config + ) + result = provider.run_inference(_pipeline(repo_root), _request(pdf_path)) + assert isinstance(result.started_at, datetime) + assert isinstance(result.completed_at, datetime) + assert result.latency_in_ms >= 0 From 6fe0ab8124182e4c6da66c29cbf9d78fe3bfcbf5 Mon Sep 17 00:00:00 2001 From: Claude Date: Sun, 19 Apr 2026 19:16:02 +0000 Subject: [PATCH 2/3] fix(pdf): save/restore text font on q/Q graphics-state stack MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit PDF 32000-1 §8.4.2 lists the text state — including Tf (font name) and Tfs (font size) — as part of the graphics state that `q` pushes and `Q` pops. OfficeMD's stack only saved the CTM, fill color, and text rendering mode, so a `/Fx Tf` inside a nested `q ... Q` block leaked out. Concretely, in PDFs that render body prose in a TrueType/WinAnsi font (F1) and punctuation like the en-dash via a Type0/Identity-H font (F6) sharing the same BaseFont, the sequence /F1 Tf ... q /F6 Tf [<00B2>]TJ Q BT [(Fundamentals)]TJ left `current_font=F6` when the `(Fundamentals)` literal was decoded. 
The bytes got routed through F6's 2-byte ToUnicode CMap, where the range `<0044><005D> -> <0061>` maps 0x46 -> 'c', so "Fundamentals" came out as "cundamentals". Same mechanism yielded "Open" -> "lpen", "Markup" -> "jarkup", "Reference" -> "oence", which all showed up as ParseBench `missing_specific_word` failures on OpenXML_WhitePaper.pdf. Fix: extend the `q`/`Q` save/restore to carry the current font name and size alongside the CTM. The snapshot diff on OpenXML_WhitePaper.pdf is pure gain — ligatures decode correctly, section titles join with their bodies, and numeric "165 pages / 125 pages / 466 pages" fragments stop appearing as "1SR pages / 1OR pages / 4SS pages". ParseBench smoke run on the whitepaper+sample fixture: officemd_local: 10/12 -> 11/12 (83% -> 92%) pypdf_baseline: 12/12 (unchanged) pymupdf_text: 12/12 (unchanged) One remaining failure — `missing_specific_sentence "Part 1 – Fundamentals"` — is a separate space-preservation issue across the F6/F1 font switch and is not addressed here. --- .../pdf_inspector/extractor/interpreter.rs | 39 +++++++++---- .../snapshots/pdf__openxml_whitepaper_ir.snap | 6 +- .../pdf__openxml_whitepaper_markdown.snap | 58 +++++-------------- 3 files changed, 45 insertions(+), 58 deletions(-) diff --git a/crates/officemd_pdf/src/pdf_inspector/extractor/interpreter.rs b/crates/officemd_pdf/src/pdf_inspector/extractor/interpreter.rs index 6a7f111..397502b 100644 --- a/crates/officemd_pdf/src/pdf_inspector/extractor/interpreter.rs +++ b/crates/officemd_pdf/src/pdf_inspector/extractor/interpreter.rs @@ -27,12 +27,27 @@ pub(crate) struct ExtractionSink { pub(crate) rects: Vec, } +/// One entry on the q/Q graphics-state stack. 
Per PDF 32000-1 §8.4.2, the +/// saved state includes the text state (Tc, Tw, Th, TL, Tf, Tfs, Tr, Trise), +/// so we capture the font name + size alongside the CTM so that `Q` restores +/// the font that was active before the matching `q` — otherwise a `/Fx Tf` +/// inside a nested graphics block leaks out and misroutes later decoding +/// through the wrong font's ToUnicode CMap. +#[derive(Debug, Clone)] +pub(crate) struct GraphicsStackEntry { + ctm: [f32; 6], + fill_is_white: bool, + text_rendering_mode: i32, + current_font: String, + current_font_size: f32, +} + #[derive(Debug)] pub(crate) struct GraphicsState { pub(crate) ctm: [f32; 6], fill_is_white: bool, text_rendering_mode: i32, - stack: Vec<([f32; 6], bool, i32)>, + stack: Vec<GraphicsStackEntry>, } impl GraphicsState { @@ -583,17 +598,21 @@ where trace!("{} {:?}", op.operator, op.operands); match op.operator.as_str() { "q" => { - graphics_state.stack.push(( - graphics_state.ctm, - graphics_state.fill_is_white, - graphics_state.text_rendering_mode, - )); + graphics_state.stack.push(GraphicsStackEntry { + ctm: graphics_state.ctm, + fill_is_white: graphics_state.fill_is_white, + text_rendering_mode: graphics_state.text_rendering_mode, + current_font: text_state.current_font.clone(), + current_font_size: text_state.current_font_size, + }); } "Q" => { - if let Some((saved_ctm, saved_fill, saved_tr)) = graphics_state.stack.pop() { - graphics_state.ctm = saved_ctm; - graphics_state.fill_is_white = saved_fill; - graphics_state.text_rendering_mode = saved_tr; + if let Some(saved) = graphics_state.stack.pop() { + graphics_state.ctm = saved.ctm; + graphics_state.fill_is_white = saved.fill_is_white; + graphics_state.text_rendering_mode = saved.text_rendering_mode; + text_state.current_font = saved.current_font; + text_state.current_font_size = saved.current_font_size; } } "cm" => { diff --git a/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_ir.snap 
b/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_ir.snap index ca26949..8473982 100644 --- a/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_ir.snap +++ b/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_ir.snap @@ -14,15 +14,15 @@ expression: canonical_json(&json) }, "pages": [ { - "markdown": "# OFFICE OPEN XML OVERVIEW\n\n## ECMA TC45 TOM NGO (NEXTPAGE), EDITOR\n\nOffice Open XML (OpenXML) is a proposed open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on multiple platforms. Its publication benefits organizations that intend to implement applications capable of using the format, commercial and governmental entities that procure such software, and educators or authors who teach the format. Ultimately, all users enjoy the benefits of an XML standard for their documents, including stability, preservation, interoperability, and ongoing evolution.\n\nThe work to standardize OpenXML has been carried out by Ecma International via its Technical Committee 45 (TC45), which includes representatives from Apple, Barclays Capital, BP, The British Library, Essilor, Intel, Microsoft, NextPage, Novell, Statoil, Toshiba, and the United States Library of Congress (1).\n\n- Understand the purposes of OpenXML and structure of its Specification\n- Know its properties: how it addresses backward compatibility, preservation, extensibility, custom schemas, subsetting, multiple platforms, internationalization, and accessibility\n- Learn how to follow the high-level structure of any OpenXML file, and navigate quickly to any portion of the Specification from which you require further detail\nOpenXML was designed from the start to be capable of faithfully representing the pre-existing corpus of word-processing documents, presentations, and spreadsheets that are encoded in binary formats defined by Microsoft Corporation. 
The standardization process consisted of mirroring in XML the capabilities required to represent the existing corpus, extending them, providing detailed documentation, and enabling interoperability. At the time of writing, more than 400 million users generate documents in the binary formats, with estimates exceeding 40 billion documents and billions more being created each year.\n\nThe original binary formats for these files were created in an era when space was precious and parsing time severely impacted user experience. They were based on direct serialization of in-memory data structures used by Microsoft® Office® applications. Modern hardware, network, and standards infrastructure (especially XML) permit a new design that favors implementation by multiple vendors on multiple platforms and allows for evolution.\n\nConcurrently with those technological advances, markets have diversified to include a new range of applications not originally contemplated in the simple world of document editing programs. These new applications include ones that:\n\n- generate documents automatically from business data;", + "markdown": "# OFFICE OPEN XML OVERVIEW\n\n## ECMA TC45 TOM NGO (NEXTPAGE), EDITOR\n\nOffice Open XML (OpenXML) is a proposed open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on multiple platforms. Its publication benefits organizations that intend to implement applications capable of using the format, commercial and governmental entities that procure such software, and educators or authors who teach the format. 
Ultimately, all users enjoy the benefits of an XML standard for their documents, including stability, preservation, interoperability, and ongoing evolution.\n\nThe work to standardize OpenXML has been carried out by Ecma International via its Technical Committee 45 (TC45), which includes representatives from Apple, Barclays Capital, BP, The British Library, Essilor, Intel, Microsoft, NextPage, Novell, Statoil, Toshiba, and the United States Library of Congress (1).\n\n- Understand the purposes of OpenXML and structure of its Specification\n- Know its properties: how it addresses backward compatibility, preservation, extensibility, custom schemas, subsetting, multiple platforms, internationalization, and accessibility\n- Learn how to follow the high-level structure of any OpenXML file, and navigate quickly to any portion of the Specification from which you require further detail\nOpenXML was designed from the start to be capable of faithfully representing the pre-existing corpus of word-processing documents, presentations, and spreadsheets that are encoded in binary formats defined by Microsoft Corporation. The standardization process consisted of mirroring in XML the capabilities required to represent the existing corpus, extending them, providing detailed documentation, and enabling interoperability. At the time of writing, more than 400 million users generate documents in the binary formats, with estimates exceeding 40 billion documents and billions more being created each year.\n\nThe original binary formats for these files were created in an era when space was precious and parsing time severely impacted user experience. They were based on direct serialization of in-memory data structures used by Microsoft® Office® applications. 
Modern hardware, network, and standards infrastructure (especially XML) permit a new design that favors implementation by multiple vendors on multiple platforms and allows for evolution.\n\nConcurrently with those technological advances, markets have diversified to include a new range of applications not originally contemplated in the simple world of document editing programs. These new applications include ones that:\n\n- generate documents automatically from business data;\n\n| | every feature in the binary formats. | • extract business data from documents and feed those data into business applications; • perform restricted tasks that operate on a small subset of a document, yet preserve editability; • provide accessibility for user populations with specialized needs, such as the blind; or • run on a variety of hardware, including mobile devices. Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority. The emergence of these four forces –extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation –have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. 
Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process. Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. To the best of our knowledge, it is the only XML document format that supports |\n| ------------------------------------------------------------- | ------------------------------------ | -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| supporting markup languages. the reader’s understanding but | | OpenXML defines formats for word-processing, presentation, and spreadsheet documents. Each type of document is specified through a primary markup language: WordprocessingML, PresentationML, or SpreadsheetML. Embedding mechanisms permit a document of any one of these three types to contain material in the other primary markup languages and in a number of The Specification contains both normative material (material that defines OpenXML) and informative material (material that aids is not prescriptive). It is structured in Parts to meet the needs of varying audiences. |\n| Part 1 –Fundamentals 165 pages | | - Defines vocabulary, notational conventions, and abbreviations. - Summarizes the three primary markup languages and the supporting markup languages. - Establishes conditions for conformance and provides interoperability guidelines. - Describes the constraints within the Open Packaging Conventions that apply to each document type. |\n| Part 2 –Open Packaging Conventions 125 pages | | - Defines the Open Packaging Conventions (OPC). 
Every OpenXML file comprises a collection of byte streams called parts, combined into a container called a package. The packaging format is defined by the OPC. - Describes a recommended physical implementation of the OPC that uses the Zip file format. - Declares the XML schemas for the OPC as XML Schema Definitions (XSD) (2), in an annex that is issued only in electronic form. The annex also includes non-normative representations of the schemas using RELAX NG (ISO/IEC 19757-2) (3). |\n| Part 3 –Primer 466 pages | | - Introduces the features of each markup language, providing context and illustrating elements through examples and diagrams. This Part is informative (non-normative). - Describes the facility for storing custom XML data within a package to support integration with business data. |\n| | | 2 |", "number": 1 }, { - "markdown": "- extract business data from documents and feed those data into business applications;\n- perform restricted tasks that operate on a small subset of a document, yet preserve editability;\n- provide accessibility for user populations with specialized needs, such as the blind; or\n- run on a variety of hardware, including mobile devices.\nPerhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. 
Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority.\n\nThe emergence of these four forces –extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation –have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process.\n\nVarious document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. To the best of our knowledge, it is the only XML document format that supports every feature in the binary formats.\n\n| OpenXML defines formats for word-processing, presentation, and spreadsheet documents. Each type of document is specified | | |\n| ------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- |\n| through a primary markup language: WordprocessingML, PresentationML, or SpreadsheetML. 
Embedding mechanisms permit a | | |\n| document of any one of these three types to contain material in the other primary markup languages and in a number of | | |\n| supporting markup languages. | | |\n| The Specification contains both normative material (material that defines OpenXML) and informative material (material that aids | | |\n| the reader’s understanding but | is not prescriptive). It is structured in Parts to meet the needs of varying audiences. | |\n| Part 1 –cundamentals | - Defines vocabulary, notational conventions, and abbreviations. | |\n| 1SR pages | - Summarizes the-Establishes conditions for conformance and provides inter-Describes | the supporting markup languages. operability guidelines. |\n| | type. | |\n| Part 2 –lpen Packaging | - Defines | . Every OpenXML |\n| Conventions | of byte streams called parts, combined into a container called a package. The packaging format | |\n| 1OR pages | is defined by the OPC. - Describes a recommended physical implementation of-Declares the XML schemas that is issued only in electronic form. the schemas using RELAX NG (ISO/IEC 19757 | the OPC ema Definitions - 2) (3) . |\n| Part 3 –Primer | - Introduces the features of each markup language, providing context and illustrating elements | |\n| 4SS pages | through examples and diagrams. This Part is informative (non-Describes the facility for storing custom XML data within a pac business data. | - normative). |\n\nthree primary markup languages\n\nthe constraints within the Open Packaging Conventions that apply to each document\n\nthe Open Packaging Conventions file comprises a collection\n\nThe annex also includes non- normative representations of\n\nkage to support integration with", + "markdown": "", "number": 2 }, { - "markdown": "Part 4 –jarkup- Defines every element and attribute, the hierarchy of parent/child relationships for elements, Language oence and additional semantics as appropriate. 
This Part is intended for use as a reference whenever RTRS pages complete detail about an element or attribute is required.\n\n- Defines the facility for storing custom XML data.\n- Declares the XML schemas for the markup languages (2), in an annex that is issued on in electronic form. The annex also expresses them- normatively RELAX NG\n- 2) (3).\nPart 5 –jarkup- Describes facilities for extension of OpenXML documents.\n\n- Spec ifies elements and attributes by which applications with different extensions can\nbxtensibility interoperate. 34 Pages\n\n- Expresses extensibility rules using NVDL (ISO/IEC 19757- 4) (4).\nIn order to ease reading and navigation through these documents, t he electronic versions have many internal active links. In particular, Part 4 has links to parent and child elements throughout.\n\nThis section prepares you to investigate OpenXML by describing some of its high-level properties. Each subsection describes one of these properties and refers to specific features within OpenXML.\n\n- “Interoperability” describes how OpenXML is independent of proprietary formats, features, and run-time environment, allowing developers a broad range of choices.\n- “Internationalization” mentions a few representative ways in which OpenXML supports every major language group.\n- “Low Barrier to Developer Adoption”, “Compactness”,and “Modularity” list specific ways in which OpenXML avoids or removes practical impediments to implementation by diverse parties: learning curve, minimum feature set, and performance.\n- “High Fidelity Migration” describes how OpenXML meets the over-arching goal to preserve the information, including the original creator’s full intent, in existing and new documents.\n- “Integration with Business Data” describes how OpenXML incorporates business information in custom schemas to enable integration and reuse of information between productivity applications and information systems.\n- “Room for Innovation” describes how OpenXML prepares 
for the future by defining further extensibility mechanisms and providing for interoperability between applications with differing feature sets.\nThe remainder of this document, including this section, is a topical guide to OpenXML. References to the Specification are all of the form §Part:section.subsection; for example, §1:2.5 refers to Part 1, Section 2.5 of the Specification. References to other headings within this paper are by name.\n\n**4.1** Developers can write applications that consume and produce OpenXML on multiple platforms. Foremost, the interoperability of OpenXML has been accomplished through extensive contributions, modification, and review of the Specification by members of the Ecma TC45 committee (1) with diverse backgrounds and corporate interests. Representation included:\n- Vendors (Apple, Intel, Microsoft, NextPage, Novell, and Toshiba) with multiple operating systems (Linux, MacOS, and Windows) and multiple intended uses of OpenXML", + "markdown": "Part 4 –Markup- Defines every element and attribute, the hierarchy of parent/child relationships for elements, Language Reference and additional semantics as appropriate. This Part is intended for use as a reference whenever 5756 pages complete detail about an element or attribute is required.\n\n- Defines the facility for storing custom XML data.\n- Declares the XML schemas for the markup languages as XSD (2), in an annex that is issued only in electronic form. The annex also expresses them non-normatively using RELAX NG (ISO/IEC 19757-2) (3).\nPart 5 –Markup- Describes facilities for extension of OpenXML documents. Compatibility and- Specifies elements and attributes by which applications with different extensions can Extensibility interoperate. 34 Pages\n\n- Expresses extensibility rules using NVDL (ISO/IEC 19757-4) (4).\nIn order to ease reading and navigation through these documents, the electronic versions have many internal active links. 
In particular, Part 4 has links to parent and child elements throughout.\n\nThis section prepares you to investigate OpenXML by describing some of its high-level properties. Each subsection describes one of these properties and refers to specific features within OpenXML.\n\n- “Interoperability” describes how OpenXML is independent of proprietary formats, features, and run-time environment, allowing developers a broad range of choices.\n- “Internationalization” mentions a few representative ways in which OpenXML supports every major language group.\n- “Low Barrier to Developer Adoption”, “Compactness”,and “Modularity” list specific ways in which OpenXML avoids or removes practical impediments to implementation by diverse parties: learning curve, minimum feature set, and performance.\n- “High Fidelity Migration” describes how OpenXML meets the over-arching goal to preserve the information, including the original creator’s full intent, in existing and new documents.\n- “Integration with Business Data” describes how OpenXML incorporates business information in custom schemas to enable integration and reuse of information between productivity applications and information systems.\n- “Room for Innovation” describes how OpenXML prepares for the future by defining further extensibility mechanisms and providing for interoperability between applications with differing feature sets.\nThe remainder of this document, including this section, is a topical guide to OpenXML. References to the Specification are all of the form §Part:section.subsection; for example, §1:2.5 refers to Part 1, Section 2.5 of the Specification. References to other headings within this paper are by name.\n\n**4.1** Developers can write applications that consume and produce OpenXML on multiple platforms. 
Foremost, the interoperability of OpenXML has been accomplished through extensive contributions, modification, and review of the Specification by members of the Ecma TC45 committee (1) with diverse backgrounds and corporate interests. Representation included:\n- Vendors (Apple, Intel, Microsoft, NextPage, Novell, and Toshiba) with multiple operating systems (Linux, MacOS, and Windows) and multiple intended uses of OpenXML", "number": 3 }, { diff --git a/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_markdown.snap b/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_markdown.snap index 4e8bea9..c72d040 100644 --- a/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_markdown.snap +++ b/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_markdown.snap @@ -25,58 +25,26 @@ Concurrently with those technological advances, markets have diversified to incl - generate documents automatically from business data; -## Page: 2 - -- extract business data from documents and feed those data into business applications; -- perform restricted tasks that operate on a small subset of a document, yet preserve editability; -- provide accessibility for user populations with specialized needs, such as the blind; or -- run on a variety of hardware, including mobile devices. -Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority. 
- -The emergence of these four forces –extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation –have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process. - -Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. To the best of our knowledge, it is the only XML document format that supports every feature in the binary formats. - -| OpenXML defines formats for word-processing, presentation, and spreadsheet documents. Each type of document is specified | | | -| ------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------------------------------------------------------- | -| through a primary markup language: WordprocessingML, PresentationML, or SpreadsheetML. Embedding mechanisms permit a | | | -| document of any one of these three types to contain material in the other primary markup languages and in a number of | | | -| supporting markup languages. 
| | | -| The Specification contains both normative material (material that defines OpenXML) and informative material (material that aids | | | -| the reader’s understanding but | is not prescriptive). It is structured in Parts to meet the needs of varying audiences. | | -| Part 1 –cundamentals | - Defines vocabulary, notational conventions, and abbreviations. | | -| 1SR pages | - Summarizes the-Establishes conditions for conformance and provides inter-Describes | the supporting markup languages. operability guidelines. | -| | type. | | -| Part 2 –lpen Packaging | - Defines | . Every OpenXML | -| Conventions | of byte streams called parts, combined into a container called a package. The packaging format | | -| 1OR pages | is defined by the OPC. - Describes a recommended physical implementation of-Declares the XML schemas that is issued only in electronic form. the schemas using RELAX NG (ISO/IEC 19757 | the OPC ema Definitions - 2) (3) . | -| Part 3 –Primer | - Introduces the features of each markup language, providing context and illustrating elements | | -| 4SS pages | through examples and diagrams. This Part is informative (non-Describes the facility for storing custom XML data within a pac business data. | - normative). | - -three primary markup languages +| | every feature in the binary formats. | • extract business data from documents and feed those data into business applications; • perform restricted tasks that operate on a small subset of a document, yet preserve editability; • provide accessibility for user populations with specialized needs, such as the blind; or • run on a variety of hardware, including mobile devices. Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. 
Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority. The emergence of these four forces –extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation –have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process. Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. 
To the best of our knowledge, it is the only XML document format that supports | +| ------------------------------------------------------------- | ------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------- | +| supporting markup languages. the reader’s understanding but | | OpenXML defines formats for word-processing, presentation, and spreadsheet documents. Each type of document is specified through a primary markup language: WordprocessingML, PresentationML, or SpreadsheetML. Embedding mechanisms permit a document of any one of these three types to contain material in the other primary markup languages and in a number of The Specification contains both normative material (material that defines OpenXML) and informative material (material that aids is not prescriptive). It is structured in Parts to meet the needs of varying audiences. | +| Part 1 –Fundamentals 165 pages | | - Defines vocabulary, notational conventions, and abbreviations. - Summarizes the three primary markup languages and the supporting markup languages. - Establishes conditions for conformance and provides interoperability guidelines. - Describes the constraints within the Open Packaging Conventions that apply to each document type. | +| Part 2 –Open Packaging Conventions 125 pages | | - Defines the Open Packaging Conventions (OPC). Every OpenXML file comprises a collection of byte streams called parts, combined into a container called a package. The packaging format is defined by the OPC. - Describes a recommended physical implementation of the OPC that uses the Zip file format. - Declares the XML schemas for the OPC as XML Schema Definitions (XSD) (2), in an annex that is issued only in electronic form. The annex also includes non-normative representations of the schemas using RELAX NG (ISO/IEC 19757-2) (3). | +| Part 3 –Primer 466 pages | | - Introduces the features of each markup language, providing context and illustrating elements through examples and diagrams. This Part is informative (non-normative). 
- Describes the facility for storing custom XML data within a package to support integration with business data. | +| | | 2 | -the constraints within the Open Packaging Conventions that apply to each document - -the Open Packaging Conventions file comprises a collection - -The annex also includes non- normative representations of - -kage to support integration with +## Page: 2 ## Page: 3 -Part 4 –jarkup- Defines every element and attribute, the hierarchy of parent/child relationships for elements, Language oence and additional semantics as appropriate. This Part is intended for use as a reference whenever RTRS pages complete detail about an element or attribute is required. +Part 4 –Markup- Defines every element and attribute, the hierarchy of parent/child relationships for elements, Language Reference and additional semantics as appropriate. This Part is intended for use as a reference whenever 5756 pages complete detail about an element or attribute is required. - Defines the facility for storing custom XML data. -- Declares the XML schemas for the markup languages (2), in an annex that is issued on in electronic form. The annex also expresses them- normatively RELAX NG -- 2) (3). -Part 5 –jarkup- Describes facilities for extension of OpenXML documents. - -- Spec ifies elements and attributes by which applications with different extensions can -bxtensibility interoperate. 34 Pages +- Declares the XML schemas for the markup languages as XSD (2), in an annex that is issued only in electronic form. The annex also expresses them non-normatively using RELAX NG (ISO/IEC 19757-2) (3). +Part 5 –Markup- Describes facilities for extension of OpenXML documents. Compatibility and- Specifies elements and attributes by which applications with different extensions can Extensibility interoperate. 34 Pages -- Expresses extensibility rules using NVDL (ISO/IEC 19757- 4) (4). 
-In order to ease reading and navigation through these documents, t he electronic versions have many internal active links. In particular, Part 4 has links to parent and child elements throughout. +- Expresses extensibility rules using NVDL (ISO/IEC 19757-4) (4). +In order to ease reading and navigation through these documents, the electronic versions have many internal active links. In particular, Part 4 has links to parent and child elements throughout. This section prepares you to investigate OpenXML by describing some of its high-level properties. Each subsection describes one of these properties and refers to specific features within OpenXML. From 8a2b2658d1db13b626c23a505acc33a4fe40e0d5 Mon Sep 17 00:00:00 2001 From: Claude Date: Wed, 22 Apr 2026 13:30:49 +0000 Subject: [PATCH 3/3] fix(pdf): preserve word-boundary spaces when a CID font reports inflated widths MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Some Type0/CID fonts ship only a default width (DW) and no W array, so every glyph — including narrow punctuation like an en-dash — reports the same wide advance. That inflates `end_x` in the line-merge pass and hides real whitespace between items placed via their own Tm operators, producing joins like `Part 1 –Fundamentals` or `four forces –extremely`. When the overlap is a substantial fraction of the prior item's reported width, fall back to a conservative per-character estimate for the "honest" end point, so a visible shift to the next item still reads as a word boundary. Tight letter kerning (small negative gap) is still treated as a no-space continuation. 
https://claude.ai/code/session_01PJH2eef2vqet1qC1EsXpkn --- .../src/pdf_inspector/extractor/mod.rs | 22 +++++++- .../snapshots/pdf__openxml_whitepaper_ir.snap | 10 ++-- .../pdf__openxml_whitepaper_markdown.snap | 56 +++++++++---------- 3 files changed, 54 insertions(+), 34 deletions(-) diff --git a/crates/officemd_pdf/src/pdf_inspector/extractor/mod.rs b/crates/officemd_pdf/src/pdf_inspector/extractor/mod.rs index 226a943..79c9c9d 100644 --- a/crates/officemd_pdf/src/pdf_inspector/extractor/mod.rs +++ b/crates/officemd_pdf/src/pdf_inspector/extractor/mod.rs @@ -270,6 +270,7 @@ pub(crate) fn merge_text_items(items: Vec) -> Vec { let mut text = first.text.clone(); let mut end_x = first.x + first.width; let x_gap_max = first.font_size * 0.75; + let mut prev = first; let mut j = i + 1; while j < group.len() { @@ -281,7 +282,25 @@ pub(crate) fn merge_text_items(items: Vec) -> Vec { if (next.font_size - first.font_size).abs() > first.font_size * 0.20 { break; } - let gap = next.x - end_x; + let raw_gap = next.x - end_x; + // Some CID fonts declare only a default width (DW) with no W array, so every + // glyph — even narrow punctuation like an en-dash — reports the same wide + // advance. That inflates `end_x` and can mask real whitespace between items + // placed via their own `Tm` operator. Only trust that diagnosis when the + // overlap is a substantial fraction of the reported width; tight letter + // kerning produces a tiny negative gap that we still want to treat as a + // no-space continuation. 
+ let gap = if raw_gap < 0.0 && -raw_gap > prev.width * 0.25 { + let prev_chars = prev.text.chars().count().max(1) as f32; + let honest_end = prev.x + prev_chars * prev.font_size * 0.5; + if next.x > honest_end { + next.x - honest_end + } else { + raw_gap + } + } else { + raw_gap + }; if gap > x_gap_max { break; } @@ -294,6 +313,7 @@ pub(crate) fn merge_text_items(items: Vec) -> Vec { } text.push_str(&next.text); end_x = next.x + next.width; + prev = next; j += 1; } diff --git a/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_ir.snap b/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_ir.snap index 8473982..9a7df4e 100644 --- a/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_ir.snap +++ b/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_ir.snap @@ -14,7 +14,7 @@ expression: canonical_json(&json) }, "pages": [ { - "markdown": "# OFFICE OPEN XML OVERVIEW\n\n## ECMA TC45 TOM NGO (NEXTPAGE), EDITOR\n\nOffice Open XML (OpenXML) is a proposed open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on multiple platforms. Its publication benefits organizations that intend to implement applications capable of using the format, commercial and governmental entities that procure such software, and educators or authors who teach the format. 
Ultimately, all users enjoy the benefits of an XML standard for their documents, including stability, preservation, interoperability, and ongoing evolution.\n\nThe work to standardize OpenXML has been carried out by Ecma International via its Technical Committee 45 (TC45), which includes representatives from Apple, Barclays Capital, BP, The British Library, Essilor, Intel, Microsoft, NextPage, Novell, Statoil, Toshiba, and the United States Library of Congress (1).\n\n- Understand the purposes of OpenXML and structure of its Specification\n- Know its properties: how it addresses backward compatibility, preservation, extensibility, custom schemas, subsetting, multiple platforms, internationalization, and accessibility\n- Learn how to follow the high-level structure of any OpenXML file, and navigate quickly to any portion of the Specification from which you require further detail\nOpenXML was designed from the start to be capable of faithfully representing the pre-existing corpus of word-processing documents, presentations, and spreadsheets that are encoded in binary formats defined by Microsoft Corporation. The standardization process consisted of mirroring in XML the capabilities required to represent the existing corpus, extending them, providing detailed documentation, and enabling interoperability. At the time of writing, more than 400 million users generate documents in the binary formats, with estimates exceeding 40 billion documents and billions more being created each year.\n\nThe original binary formats for these files were created in an era when space was precious and parsing time severely impacted user experience. They were based on direct serialization of in-memory data structures used by Microsoft® Office® applications. 
Modern hardware, network, and standards infrastructure (especially XML) permit a new design that favors implementation by multiple vendors on multiple platforms and allows for evolution.\n\nConcurrently with those technological advances, markets have diversified to include a new range of applications not originally contemplated in the simple world of document editing programs. These new applications include ones that:\n\n- generate documents automatically from business data;\n\n| | every feature in the binary formats. | • extract business data from documents and feed those data into business applications; • perform restricted tasks that operate on a small subset of a document, yet preserve editability; • provide accessibility for user populations with specialized needs, such as the blind; or • run on a variety of hardware, including mobile devices. Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority. The emergence of these four forces –extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation –have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. 
Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process. Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. To the best of our knowledge, it is the only XML document format that supports |\n| ------------------------------------------------------------- | ------------------------------------ | -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| supporting markup languages. the reader’s understanding but | | OpenXML defines formats for word-processing, presentation, and spreadsheet documents. Each type of document is specified through a primary markup language: WordprocessingML, PresentationML, or SpreadsheetML. Embedding mechanisms permit a document of any one of these three types to contain material in the other primary markup languages and in a number of The Specification contains both normative material (material that defines OpenXML) and informative material (material that aids is not prescriptive). It is structured in Parts to meet the needs of varying audiences. |\n| Part 1 –Fundamentals 165 pages | | - Defines vocabulary, notational conventions, and abbreviations. - Summarizes the three primary markup languages and the supporting markup languages. - Establishes conditions for conformance and provides interoperability guidelines. - Describes the constraints within the Open Packaging Conventions that apply to each document type. |\n| Part 2 –Open Packaging Conventions 125 pages | | - Defines the Open Packaging Conventions (OPC). 
Every OpenXML file comprises a collection of byte streams called parts, combined into a container called a package. The packaging format is defined by the OPC. - Describes a recommended physical implementation of the OPC that uses the Zip file format. - Declares the XML schemas for the OPC as XML Schema Definitions (XSD) (2), in an annex that is issued only in electronic form. The annex also includes non-normative representations of the schemas using RELAX NG (ISO/IEC 19757-2) (3). |\n| Part 3 –Primer 466 pages | | - Introduces the features of each markup language, providing context and illustrating elements through examples and diagrams. This Part is informative (non-normative). - Describes the facility for storing custom XML data within a package to support integration with business data. |\n| | | 2 |", + "markdown": "# OFFICE OPEN XML OVERVIEW\n\n## ECMA TC45 TOM NGO (NEXTPAGE), EDITOR\n\nOffice Open XML (OpenXML) is a proposed open standard for word-processing documents, presentations, and spreadsheets that can be freely implemented by multiple applications on multiple platforms. Its publication benefits organizations that intend to implement applications capable of using the format, commercial and governmental entities that procure such software, and educators or authors who teach the format. 
Ultimately, all users enjoy the benefits of an XML standard for their documents, including stability, preservation, interoperability, and ongoing evolution.\n\nThe work to standardize OpenXML has been carried out by Ecma International via its Technical Committee 45 (TC45), which includes representatives from Apple, Barclays Capital, BP, The British Library, Essilor, Intel, Microsoft, NextPage, Novell, Statoil, Toshiba, and the United States Library of Congress (1).\n\n- Understand the purposes of OpenXML and structure of its Specification\n- Know its properties: how it addresses backward compatibility, preservation, extensibility, custom schemas, subsetting, multiple platforms, internationalization, and accessibility\n- Learn how to follow the high-level structure of any OpenXML file, and navigate quickly to any portion of the Specification from which you require further detail\nOpenXML was designed from the start to be capable of faithfully representing the pre-existing corpus of word-processing documents, presentations, and spreadsheets that are encoded in binary formats defined by Microsoft Corporation. The standardization process consisted of mirroring in XML the capabilities required to represent the existing corpus, extending them, providing detailed documentation, and enabling interoperability. At the time of writing, more than 400 million users generate documents in the binary formats, with estimates exceeding 40 billion documents and billions more being created each year.\n\nThe original binary formats for these files were created in an era when space was precious and parsing time severely impacted user experience. They were based on direct serialization of in-memory data structures used by Microsoft® Office® applications. 
Modern hardware, network, and standards infrastructure (especially XML) permit a new design that favors implementation by multiple vendors on multiple platforms and allows for evolution.\n\nConcurrently with those technological advances, markets have diversified to include a new range of applications not originally contemplated in the simple world of document editing programs. These new applications include ones that:\n\n- generate documents automatically from business data;\n\n| | every feature in the binary formats. | • extract business data from documents and feed those data into business applications; • perform restricted tasks that operate on a small subset of a document, yet preserve editability; • provide accessibility for user populations with specialized needs, such as the blind; or • run on a variety of hardware, including mobile devices. Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority. The emergence of these four forces – extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation – have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. 
Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process. Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. To the best of our knowledge, it is the only XML document format that supports |\n| ------------------------------------------------------------- | ------------------------------------ | -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| supporting markup languages. the reader’s understanding but | | OpenXML defines formats for word-processing, presentation, and spreadsheet documents. Each type of document is specified through a primary markup language: WordprocessingML, PresentationML, or SpreadsheetML. Embedding mechanisms permit a document of any one of these three types to contain material in the other primary markup languages and in a number of The Specification contains both normative material (material that defines OpenXML) and informative material (material that aids is not prescriptive). It is structured in Parts to meet the needs of varying audiences. |\n| Part 1 – Fundamentals 165 pages | | - Defines vocabulary, notational conventions, and abbreviations. - Summarizes the three primary markup languages and the supporting markup languages. - Establishes conditions for conformance and provides interoperability guidelines. - Describes the constraints within the Open Packaging Conventions that apply to each document type. |\n| Part 2 – Open Packaging Conventions 125 pages | | - Defines the Open Packaging Conventions (OPC). 
Every OpenXML file comprises a collection of byte streams called parts, combined into a container called a package. The packaging format is defined by the OPC. - Describes a recommended physical implementation of the OPC that uses the Zip file format. - Declares the XML schemas for the OPC as XML Schema Definitions (XSD) (2), in an annex that is issued only in electronic form. The annex also includes non-normative representations of the schemas using RELAX NG (ISO/IEC 19757-2) (3). |\n| Part 3 – Primer 466 pages | | - Introduces the features of each markup language, providing context and illustrating elements through examples and diagrams. This Part is informative (non-normative). - Describes the facility for storing custom XML data within a package to support integration with business data. |\n| | | 2 |", "number": 1 }, { @@ -22,7 +22,7 @@ expression: canonical_json(&json) "number": 2 }, { - "markdown": "Part 4 –Markup- Defines every element and attribute, the hierarchy of parent/child relationships for elements, Language Reference and additional semantics as appropriate. This Part is intended for use as a reference whenever 5756 pages complete detail about an element or attribute is required.\n\n- Defines the facility for storing custom XML data.\n- Declares the XML schemas for the markup languages as XSD (2), in an annex that is issued only in electronic form. The annex also expresses them non-normatively using RELAX NG (ISO/IEC 19757-2) (3).\nPart 5 –Markup- Describes facilities for extension of OpenXML documents. Compatibility and- Specifies elements and attributes by which applications with different extensions can Extensibility interoperate. 34 Pages\n\n- Expresses extensibility rules using NVDL (ISO/IEC 19757-4) (4).\nIn order to ease reading and navigation through these documents, the electronic versions have many internal active links. 
In particular, Part 4 has links to parent and child elements throughout.\n\nThis section prepares you to investigate OpenXML by describing some of its high-level properties. Each subsection describes one of these properties and refers to specific features within OpenXML.\n\n- “Interoperability” describes how OpenXML is independent of proprietary formats, features, and run-time environment, allowing developers a broad range of choices.\n- “Internationalization” mentions a few representative ways in which OpenXML supports every major language group.\n- “Low Barrier to Developer Adoption”, “Compactness”,and “Modularity” list specific ways in which OpenXML avoids or removes practical impediments to implementation by diverse parties: learning curve, minimum feature set, and performance.\n- “High Fidelity Migration” describes how OpenXML meets the over-arching goal to preserve the information, including the original creator’s full intent, in existing and new documents.\n- “Integration with Business Data” describes how OpenXML incorporates business information in custom schemas to enable integration and reuse of information between productivity applications and information systems.\n- “Room for Innovation” describes how OpenXML prepares for the future by defining further extensibility mechanisms and providing for interoperability between applications with differing feature sets.\nThe remainder of this document, including this section, is a topical guide to OpenXML. References to the Specification are all of the form §Part:section.subsection; for example, §1:2.5 refers to Part 1, Section 2.5 of the Specification. References to other headings within this paper are by name.\n\n**4.1** Developers can write applications that consume and produce OpenXML on multiple platforms. 
Foremost, the interoperability of OpenXML has been accomplished through extensive contributions, modification, and review of the Specification by members of the Ecma TC45 committee (1) with diverse backgrounds and corporate interests. Representation included:\n- Vendors (Apple, Intel, Microsoft, NextPage, Novell, and Toshiba) with multiple operating systems (Linux, MacOS, and Windows) and multiple intended uses of OpenXML", + "markdown": "Part 4 – Markup- Defines every element and attribute, the hierarchy of parent/child relationships for elements, Language Reference and additional semantics as appropriate. This Part is intended for use as a reference whenever 5756 pages complete detail about an element or attribute is required.\n\n- Defines the facility for storing custom XML data.\n- Declares the XML schemas for the markup languages as XSD (2), in an annex that is issued only in electronic form. The annex also expresses them non-normatively using RELAX NG (ISO/IEC 19757-2) (3).\nPart 5 – Markup- Describes facilities for extension of OpenXML documents. Compatibility and- Specifies elements and attributes by which applications with different extensions can Extensibility interoperate. 34 Pages\n\n- Expresses extensibility rules using NVDL (ISO/IEC 19757-4) (4).\nIn order to ease reading and navigation through these documents, the electronic versions have many internal active links. In particular, Part 4 has links to parent and child elements throughout.\n\nThis section prepares you to investigate OpenXML by describing some of its high-level properties. 
Each subsection describes one of these properties and refers to specific features within OpenXML.\n\n- “Interoperability” describes how OpenXML is independent of proprietary formats, features, and run-time environment, allowing developers a broad range of choices.\n- “Internationalization” mentions a few representative ways in which OpenXML supports every major language group.\n- “Low Barrier to Developer Adoption”, “Compactness”,and “Modularity” list specific ways in which OpenXML avoids or removes practical impediments to implementation by diverse parties: learning curve, minimum feature set, and performance.\n- “High Fidelity Migration” describes how OpenXML meets the over-arching goal to preserve the information, including the original creator’s full intent, in existing and new documents.\n- “Integration with Business Data” describes how OpenXML incorporates business information in custom schemas to enable integration and reuse of information between productivity applications and information systems.\n- “Room for Innovation” describes how OpenXML prepares for the future by defining further extensibility mechanisms and providing for interoperability between applications with differing feature sets.\nThe remainder of this document, including this section, is a topical guide to OpenXML. References to the Specification are all of the form §Part:section.subsection; for example, §1:2.5 refers to Part 1, Section 2.5 of the Specification. References to other headings within this paper are by name.\n\n**4.1** Developers can write applications that consume and produce OpenXML on multiple platforms. Foremost, the interoperability of OpenXML has been accomplished through extensive contributions, modification, and review of the Specification by members of the Ecma TC45 committee (1) with diverse backgrounds and corporate interests. 
Representation included:\n- Vendors (Apple, Intel, Microsoft, NextPage, Novell, and Toshiba) with multiple operating systems (Linux, MacOS, and Windows) and multiple intended uses of OpenXML", "number": 3 }, { @@ -46,7 +46,7 @@ expression: canonical_json(&json) "number": 8 }, { - "markdown": "Second, the custom data are embedded in any OpenXML document in a Custom XML part (§3.7.3) and can be described using a Custom XML Data Properties part (§4:7.5). By separating these custom data from presentation, OpenXML enables clean data integration, while enabling end-user presentation and manipulation within a wide variety of contexts, including documents, forms, slides, and spreadsheets. Interoperability can thus be achieved at a more fundamental and semantically accurate level.\n\n**4.8 ROOM FOR INNOVATION** OpenXML is designed to encourage developers to create new applications that were not contemplated when the binary formats were defined, or even when OpenXML was defined. First, we discuss extensibility mechanisms that work together to allow interoperability between applications with differing feature sets. Consider an up-level application (one that contains a new feature not documented in OpenXML) and a down-level application (one that does not understand that feature). The three primary goals of extensibility are:\n- Visual fidelity: the ability for the down-level application to display what the up-level application would display. This inherently requires that a file store multiple representations of the same data.\n- Editability: the ability to edit one or more of the representations.\n- Privacy: the ability to ensure that old versions of one representation do not remain after editing another representation, unexpectedly leaving information that a user believes is deleted or modified. 
An application can achieve this by eliminating or synchronizing representations.\nA developer wishing to extend the OpenXML feature set has two main options:\n\n- Alternate content blocks: An alternate content block (§3:2.18.4 and §5:9.2) stores multiple representations of the same content, each within its own choice block. A down-level application reads one choice block that it is capable of reading. Upon editing, it writes as many choice blocks as it is capable of writing.\n- Extension lists: An extension list (§3:2.6) stores arbitrary custom XML without a visual representation.\nDevelopers have room to innovate outside of those extensibility mechanisms.\n\n- Alternative interaction paradigms. OpenXML specifies more than document syntax but less than application behavior. As described in the Conformance statement, it focuses on semantics (§1:2.2, §1.2.3). Consequently, a conformant application is free to communicate with an end user through a variety of means, or not communicate with an end user at all –as long as it respects the specified semantics.\n- Novel computing environments. The Conformance statement admits applications that have low capacity, so that they can run on small devices, and applications that implement only a subset of OpenXML (§1:2.6). The Additional Characteristics mechanism permits a producing application to communicate its capacity limits (§3:8.1).\nAs indicated in the previous subsection, some of the most substantial opportunities for innovation do not involve rendering documents for direct user interaction. Instead, they involve machine-to-machine processing using XML message formats, e.g., via XML Web Services (9). 
Although such applications have no user-visible behavior other than their operations on data contained within OpenXML documents, they are subject to document conformance (§1:2.4) and application conformance (§1:2.5), which are purely syntactic, and interoperability guidelines (§1:2.6), which incorporate semantics.\n\nWhile it is impossible to enumerate all possible use cases for customized XML processing, one may anticipate XML-centric services that process OpenXML documents for automatic extraction and insertion of custom data, custom security services such as XML Digital Signature (10) or XML Encryption (11), or even arbitrary XSLT transformations (12) that convert to and from other XML formats. OpenXML places no prohibitions or limitations on such processing.", + "markdown": "Second, the custom data are embedded in any OpenXML document in a Custom XML part (§3.7.3) and can be described using a Custom XML Data Properties part (§4:7.5). By separating these custom data from presentation, OpenXML enables clean data integration, while enabling end-user presentation and manipulation within a wide variety of contexts, including documents, forms, slides, and spreadsheets. Interoperability can thus be achieved at a more fundamental and semantically accurate level.\n\n**4.8 ROOM FOR INNOVATION** OpenXML is designed to encourage developers to create new applications that were not contemplated when the binary formats were defined, or even when OpenXML was defined. First, we discuss extensibility mechanisms that work together to allow interoperability between applications with differing feature sets. Consider an up-level application (one that contains a new feature not documented in OpenXML) and a down-level application (one that does not understand that feature). The three primary goals of extensibility are:\n- Visual fidelity: the ability for the down-level application to display what the up-level application would display. 
This inherently requires that a file store multiple representations of the same data.\n- Editability: the ability to edit one or more of the representations.\n- Privacy: the ability to ensure that old versions of one representation do not remain after editing another representation, unexpectedly leaving information that a user believes is deleted or modified. An application can achieve this by eliminating or synchronizing representations.\nA developer wishing to extend the OpenXML feature set has two main options:\n\n- Alternate content blocks: An alternate content block (§3:2.18.4 and §5:9.2) stores multiple representations of the same content, each within its own choice block. A down-level application reads one choice block that it is capable of reading. Upon editing, it writes as many choice blocks as it is capable of writing.\n- Extension lists: An extension list (§3:2.6) stores arbitrary custom XML without a visual representation.\nDevelopers have room to innovate outside of those extensibility mechanisms.\n\n- Alternative interaction paradigms. OpenXML specifies more than document syntax but less than application behavior. As described in the Conformance statement, it focuses on semantics (§1:2.2, §1.2.3). Consequently, a conformant application is free to communicate with an end user through a variety of means, or not communicate with an end user at all – as long as it respects the specified semantics.\n- Novel computing environments. The Conformance statement admits applications that have low capacity, so that they can run on small devices, and applications that implement only a subset of OpenXML (§1:2.6). The Additional Characteristics mechanism permits a producing application to communicate its capacity limits (§3:8.1).\nAs indicated in the previous subsection, some of the most substantial opportunities for innovation do not involve rendering documents for direct user interaction. 
Instead, they involve machine-to-machine processing using XML message formats, e.g., via XML Web Services (9). Although such applications have no user-visible behavior other than their operations on data contained within OpenXML documents, they are subject to document conformance (§1:2.4) and application conformance (§1:2.5), which are purely syntactic, and interoperability guidelines (§1:2.6), which incorporate semantics.\n\nWhile it is impossible to enumerate all possible use cases for customized XML processing, one may anticipate XML-centric services that process OpenXML documents for automatic extraction and insertion of custom data, custom security services such as XML Digital Signature (10) or XML Encryption (11), or even arbitrary XSLT transformations (12) that convert to and from other XML formats. OpenXML places no prohibitions or limitations on such processing.", "number": 9 }, { @@ -54,11 +54,11 @@ expression: canonical_json(&json) "number": 10 }, { - "markdown": "- document –the root element of the main document (§3:2.3).\n- body –body (§3:2.7.1). Can contain multiple paragraphs. Can also contain section properties specified in a sectPr element.\n- p –paragraph (§3:2.4.1). Can contain one or more runs. Can also contain paragraph properties specified in a pPr element, which in turn can contain default run properties (also referred to as character properties) specified in a rPr element (§3:2.4.4).\n- r –run (§3:2.4.2). Can contain multiple types of run content, primarily text ranges. Can also contain run properties (rPr). The run is a fundamental concept within OpenXML. A run is a contiguous piece of text with identical properties; a run contains no additional text markup. For example, if a sentence were to contain the words “this is **three runs”**, then it would be represented by at least three runs: “thisis ”, “**three”, and “ runs”. 
In this respect,** OpenXML differs significantly from formats that allow for arbitrary nesting of properties, such as HTML.\n- t –text range (§3:2.4.3.1). Contains an arbitrary amount of text with no formatting, line breaks, tables, graphics, or other non-text material. The formatting for the text is inherited from the run properties and the paragraph properties.\nIn this subsection, we have touched upon direct formatting of text by specifying paragraph and run properties. Direct formatting falls at the end of an order of application that also includes character, paragraph, numbering, and table styles, as well as document defaults (§3:2.8.10). Those styles are themselves organized into inheritance hierarchies (§3:2.8.9).\n\nThe subsection “Minimal WordprocessingML Document”below lists a WordprocessingML document in full.\n\n**5.3** A PresentationML document is described by a presentation part. The presentation part is the target of the package relationship whose type is: [http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument) The presentation refers to these primary constructs (§3:4.2), which we list from top to bottom in the default hierarchy:\n- slide masters, notes masters, and handout masters (§3:4.2.2), all of which inherit properties from presentation;\n- slide layouts (§3:4.2.5), which inherit properties from slide master; and\n- slides (§3:4.2.3) and notes pages (§3:4.2.4), which inherit properties from slide layouts and notes masters respectively.\nEach master, layout, and slide is stored in its own part. The name of each part is specified in the relationship part for the presentation part. Each of the six parts other than presentation is structured in essentially the same way. 
A typical path from root to leaf in the XML tree would comprise these XML elements (§3:2.2):\n\n- sld, sldLayout, sldMaster, notes, notesMaster, or handoutMaster –the root element.\n- cSld –slide (§4:4.4.1.15). Can contain DrawingML elements (as described in the next two bullets) and other structural elements (as described below).\n- spTree –shape tree (§4:4.4.1.42). Can contain group shape properties in a grpSpPr element (§4:4.4.1.20) and non- visual group shape properties in an nvGrpSpPr element (§4:4.4.1.28). This node and its descendants are all DrawingML elements. We list some DrawingML elements here because of their pivotal role in PresentationML.\n- sp –shape (§4:4.4.1.40). Can contain shape properties in a spPr element (§4:4.4.1.41) and non-visual shape properties in an nvSpPr element (§4:4.4.1.31).\nIn addition to the DrawingML shape content, a cSld can contain other structural elements, depending on the root element in which it resides, as summarized in this table:", + "markdown": "- document – the root element of the main document (§3:2.3).\n- body – body (§3:2.7.1). Can contain multiple paragraphs. Can also contain section properties specified in a sectPr element.\n- p – paragraph (§3:2.4.1). Can contain one or more runs. Can also contain paragraph properties specified in a pPr element, which in turn can contain default run properties (also referred to as character properties) specified in a rPr element (§3:2.4.4).\n- r – run (§3:2.4.2). Can contain multiple types of run content, primarily text ranges. Can also contain run properties (rPr). The run is a fundamental concept within OpenXML. A run is a contiguous piece of text with identical properties; a run contains no additional text markup. For example, if a sentence were to contain the words “this is **three runs”**, then it would be represented by at least three runs: “thisis ”, “**three”, and “ runs”. 
In this respect,** OpenXML differs significantly from formats that allow for arbitrary nesting of properties, such as HTML.\n- t – text range (§3:2.4.3.1). Contains an arbitrary amount of text with no formatting, line breaks, tables, graphics, or other non-text material. The formatting for the text is inherited from the run properties and the paragraph properties.\nIn this subsection, we have touched upon direct formatting of text by specifying paragraph and run properties. Direct formatting falls at the end of an order of application that also includes character, paragraph, numbering, and table styles, as well as document defaults (§3:2.8.10). Those styles are themselves organized into inheritance hierarchies (§3:2.8.9).\n\nThe subsection “Minimal WordprocessingML Document”below lists a WordprocessingML document in full.\n\n**5.3** A PresentationML document is described by a presentation part. The presentation part is the target of the package relationship whose type is: [http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument) The presentation refers to these primary constructs (§3:4.2), which we list from top to bottom in the default hierarchy:\n- slide masters, notes masters, and handout masters (§3:4.2.2), all of which inherit properties from presentation;\n- slide layouts (§3:4.2.5), which inherit properties from slide master; and\n- slides (§3:4.2.3) and notes pages (§3:4.2.4), which inherit properties from slide layouts and notes masters respectively.\nEach master, layout, and slide is stored in its own part. The name of each part is specified in the relationship part for the presentation part. Each of the six parts other than presentation is structured in essentially the same way. 
A typical path from root to leaf in the XML tree would comprise these XML elements (§3:2.2):\n\n- sld, sldLayout, sldMaster, notes, notesMaster, or handoutMaster – the root element.\n- cSld – slide (§4:4.4.1.15). Can contain DrawingML elements (as described in the next two bullets) and other structural elements (as described below).\n- spTree – shape tree (§4:4.4.1.42). Can contain group shape properties in a grpSpPr element (§4:4.4.1.20) and non- visual group shape properties in an nvGrpSpPr element (§4:4.4.1.28). This node and its descendants are all DrawingML elements. We list some DrawingML elements here because of their pivotal role in PresentationML.\n- sp – shape (§4:4.4.1.40). Can contain shape properties in a spPr element (§4:4.4.1.41) and non-visual shape properties in an nvSpPr element (§4:4.4.1.31).\nIn addition to the DrawingML shape content, a cSld can contain other structural elements, depending on the root element in which it resides, as summarized in this table:", "number": 11 }, { - "markdown": "
Layout Master Master Master Page
\n\nTransition X X X Timing X X X Headers and Footers X X X X Matching Name X Layout Type X Preserve X X Layout List X Text Style X\n\nProperties specified by objects lower in the default hierarchy (slide master, slide layout, slide) override the corresponding properties specified by objects higher in the hierarchy. For example, if a transition is not specified for a slide, then it is taken from the slide layout; if it is not specified there, then it is taken from the slide master.\n\n**5.4** A SpreadsheetML document is described at the top level by a workbook part. The workbook part is the target of the package relationship whose type is: [http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument) The workbook part stores information about the workbook and its structure, such as file version, creating application, and password to modify. Logically, the workbook contains one or more sheets (§3:3.2); physically, each sheet is stored in its own part and is referenced in the usual manner from the workbook part. Each sheet can be a worksheet, a chart sheet, or a dialog sheet. We will discuss only the worksheet, which is the most common type. Within a worksheet object, a typical path from root to leaf in the XML tree would comprise these XML elements:\n- worksheet –the root element in a worksheet (§3:3.2).\n- sheetData –the cell table, which represents every non-empty cell in the worksheet (§3:3.2.4).\n- c –one cell (§3:3.2.9). The r attribute indicates the cell’s location using A1-style coordinates. The cell can also have a style identifier (attribute s) and a data type (attribute t).\n- v and f –the value (§3:3.2.9.1) and optional formula (§3:3.2.9.2) of the cell. 
If a cell has a formula, then the value is the result of the most recent calculation.\nBoth strings and formulas are stored in shared tables (§3:3.3 and §3:3.2.9.2.1) to avoid redundant storage and speed loads and saves.\n\n**5.5 SUPPORTING MARKUP LANGUAGE S** Several supporting markup languages can also be used to describe the content of an OpenXML document.\n- DrawingML (§3:5) –used to represent shapes and other graphically rendered objects within a document.\n- VML (§3:6) –a format for vector graphics that is included for backwards compatibility and will eventually be replaced by DrawingML.\n- Shared MLs: Math (§3:7.1), Metadata (§3:7.2), Custom XML (§3:7.3), and Bibliography (§3:7.4).\n\n| 5.6 | | | | MINIMAL WORDPROCESSINGML DOCUMENT | |\n| --- | ---------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | ------------------------------------------------------------------------ | 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| | The content-type part “ The package-relationship part “ | The document part, in this case “ (§1:13.2), and SpreadsheetML (§1:12.2). | Hello, world. | This subsection contains a minimal WordprocessingML document that comprises three parts. describes the content types of the two other required parts. ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/> ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"/> /_rels/.rels” describes the relationship between the package and the main document part. Type=\"[http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\"](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\") /document.xml” , contains the document content. The Specification provides minimal documents and additional detail for WordprocessingML (§1:11.2), PresentationML OpenXML is the product of substantial effort by representatives from many industry and public institutions with diverse backgrounds and organizational interests. It covers the full set of features used in the existing document corpus, as well as the internationalization needs inherent in all of the major language groups worldwide. 
As a result of the standardization work by Ecma TC45 (1) and contributions via public comment, OpenXML has enabled a high level of interoperability and platform independence; and its documentation has become both complete (through extensive reference material) and accessible (through non-normative descriptions). It also includes enough information for assistive technology products to properly process documents. OpenXML implementations can be very small and provide focused functionality, or they can encompass the full feature set. Extensibility mechanisms built into the format guarantee room for innovation. Standardizing the format specification and maintaining it over time ensure that multiple parties can safely rely on it, confident that further evolution will enjoy the checks and balances afforded by an open standards process. The compelling need exists for an open document-format standard that is capable of preserving the billions of documents that have been created in the pre- existing binary formats, and the billions that continue to be created each year. Technological advances in hardware, networking, and a standards-based software infrastructure make it possible. The explosive diversification in market demand –including significant existing investments in mission critical business systems –makes it essential. 13 |", + "markdown": "
Layout Master Master Master Page
\n\nTransition X X X Timing X X X Headers and Footers X X X X Matching Name X Layout Type X Preserve X X Layout List X Text Style X\n\nProperties specified by objects lower in the default hierarchy (slide master, slide layout, slide) override the corresponding properties specified by objects higher in the hierarchy. For example, if a transition is not specified for a slide, then it is taken from the slide layout; if it is not specified there, then it is taken from the slide master.\n\n**5.4** A SpreadsheetML document is described at the top level by a workbook part. The workbook part is the target of the package relationship whose type is: [http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument) The workbook part stores information about the workbook and its structure, such as file version, creating application, and password to modify. Logically, the workbook contains one or more sheets (§3:3.2); physically, each sheet is stored in its own part and is referenced in the usual manner from the workbook part. Each sheet can be a worksheet, a chart sheet, or a dialog sheet. We will discuss only the worksheet, which is the most common type. Within a worksheet object, a typical path from root to leaf in the XML tree would comprise these XML elements:\n- worksheet – the root element in a worksheet (§3:3.2).\n- sheetData – the cell table, which represents every non-empty cell in the worksheet (§3:3.2.4).\n- c – one cell (§3:3.2.9). The r attribute indicates the cell’s location using A1-style coordinates. The cell can also have a style identifier (attribute s) and a data type (attribute t).\n- v and f – the value (§3:3.2.9.1) and optional formula (§3:3.2.9.2) of the cell. 
If a cell has a formula, then the value is the result of the most recent calculation.\nBoth strings and formulas are stored in shared tables (§3:3.3 and §3:3.2.9.2.1) to avoid redundant storage and speed loads and saves.\n\n**5.5 SUPPORTING MARKUP LANGUAGE S** Several supporting markup languages can also be used to describe the content of an OpenXML document.\n- DrawingML (§3:5) – used to represent shapes and other graphically rendered objects within a document.\n- VML (§3:6) – a format for vector graphics that is included for backwards compatibility and will eventually be replaced by DrawingML.\n- Shared MLs: Math (§3:7.1), Metadata (§3:7.2), Custom XML (§3:7.3), and Bibliography (§3:7.4).\n\n| 5.6 | | | | MINIMAL WORDPROCESSINGML DOCUMENT |\n| --- | ---------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |\n| | The content-type part “
The package-relationship part “ | The document part, in this case “ (§1:13.2), and SpreadsheetML (§1:12.2). | ContentType=\"application/vnd.openxmlformats-package.relationships+xml\"/> ContentType=\"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml\"/> /_rels/.rels” describes the relationship between the package and the main document part. Type=\"[http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\"](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument\") Target=\"document.xml\"/> /document.xml” , contains the document content. Hello, world. The Specification provides minimal documents and additional detail for WordprocessingML (§1:11.2), PresentationML OpenXML is the product of substantial effort by representatives from many industry and public institutions with diverse backgrounds and organizational interests. It covers the full set of features used in the existing document corpus, as well as the internationalization needs inherent in all of the major language groups worldwide. As a result of the standardization work by Ecma TC45 (1) and contributions via public comment, OpenXML has enabled a high level of interoperability and platform independence; and its documentation has become both complete (through extensive reference material) and accessible (through non-normative descriptions). It also includes enough information for assistive technology products to properly process documents. OpenXML implementations can be very small and provide focused functionality, or they can encompass the full feature set. Extensibility mechanisms built into the format guarantee room for innovation. Standardizing the format specification and maintaining it over time ensure that multiple parties can safely rely on it, confident that further evolution will enjoy the checks and balances afforded by an open standards process. 
The compelling need exists for an open document-format standard that is capable of preserving the billions of documents that have been created in the pre- existing binary formats, and the billions that continue to be created each year. Technological advances in hardware, networking, and a standards-based software infrastructure make it possible. The explosive diversification in market demand – including significant existing investments in mission critical business systems – makes it essential. 13 |", "number": 12 }, { diff --git a/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_markdown.snap b/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_markdown.snap index c72d040..25f3afd 100644 --- a/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_markdown.snap +++ b/crates/tests/rust_snapshots/tests/snapshots/pdf__openxml_whitepaper_markdown.snap @@ -25,23 +25,23 @@ Concurrently with those technological advances, markets have diversified to incl - generate documents automatically from business data; -| | every feature in the binary formats. | • extract business data from documents and feed those data into business applications; • perform restricted tasks that operate on a small subset of a document, yet preserve editability; • provide accessibility for user populations with specialized needs, such as the blind; or • run on a variety of hardware, including mobile devices. Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority. 
The emergence of these four forces –extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation –have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process. Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. 
To the best of our knowledge, it is the only XML document format that supports | -| ------------------------------------------------------------- | ------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------- | -| supporting markup languages. the reader’s understanding but | | OpenXML defines formats for word-processing, presentation, and spreadsheet documents. Each type of document is specified through a primary markup language: WordprocessingML, PresentationML, or SpreadsheetML. Embedding mechanisms permit a document of any one of these three types to contain material in the other primary markup languages and in a number of The Specification contains both normative material (material that defines OpenXML) and informative material (material that aids is not prescriptive). It is structured in Parts to meet the needs of varying audiences. | -| Part 1 –Fundamentals 165 pages | | - Defines vocabulary, notational conventions, and abbreviations. - Summarizes the three primary markup languages and the supporting markup languages. - Establishes conditions for conformance and provides interoperability guidelines. - Describes the constraints within the Open Packaging Conventions that apply to each document type. | -| Part 2 –Open Packaging Conventions 125 pages | | - Defines the Open Packaging Conventions (OPC). Every OpenXML file comprises a collection of byte streams called parts, combined into a container called a package. The packaging format is defined by the OPC. - Describes a recommended physical implementation of the OPC that uses the Zip file format. - Declares the XML schemas for the OPC as XML Schema Definitions (XSD) (2), in an annex that is issued only in electronic form. The annex also includes non-normative representations of the schemas using RELAX NG (ISO/IEC 19757-2) (3). | -| Part 3 –Primer 466 pages | | - Introduces the features of each markup language, providing context and illustrating elements through examples and diagrams. This Part is informative (non-normative). 
- Describes the facility for storing custom XML data within a package to support integration with business data. | -| | | 2 | +| | every feature in the binary formats. | • extract business data from documents and feed those data into business applications; • perform restricted tasks that operate on a small subset of a document, yet preserve editability; • provide accessibility for user populations with specialized needs, such as the blind; or • run on a variety of hardware, including mobile devices. Perhaps the most profound issue is one of long-term preservation. We have learned to create exponentially increasing amounts of information. Yet we have been encoding that information using digital representations that are so deeply coupled with the programs that created them that after a decade or two, they routinely become extremely difficult to read without significant loss. Preserving the financial and intellectual investment in those documents (both existing and new) has become a pressing priority. The emergence of these four forces – extremely broad adoption of the binary formats, technological advances, market forces that demand diverse applications, and the increasing difficulty of long-term preservation – have created an imperative to define an open XML format and migrate the billions of documents to it with as little loss as possible. Further, standardizing that open XML format and maintaining it over time create an environment in which any organization can safely rely on the ongoing stability of the specification, confident that further evolution will enjoy the checks and balances afforded by a standards process. Various document standards and specifications exist; these include HTML, XHTML, PDF and its subsets, ODF, DocBook, DITA, and RTF. Like the numerous standards that represent bitmapped images, including TIFF/IT, TIFF/EP, JPEG 2000, and PNG, each was created for a different set of purposes. 
OpenXML addresses the need for a standard that covers the features represented in the existing document corpus. To the best of our knowledge, it is the only XML document format that supports | +| ------------------------------------------------------------- | ------------------------------------ | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| supporting markup languages. the reader’s understanding but | | OpenXML defines formats for word-processing, presentation, and spreadsheet documents. Each type of document is specified through a primary markup language: WordprocessingML, PresentationML, or SpreadsheetML. Embedding mechanisms permit a document of any one of these three types to contain material in the other primary markup languages and in a number of The Specification contains both normative material (material that defines OpenXML) and informative material (material that aids is not prescriptive). It is structured in Parts to meet the needs of varying audiences. | +| Part 1 – Fundamentals 165 pages | | - Defines vocabulary, notational conventions, and abbreviations. - Summarizes the three primary markup languages and the supporting markup languages. - Establishes conditions for conformance and provides interoperability guidelines. - Describes the constraints within the Open Packaging Conventions that apply to each document type. | +| Part 2 – Open Packaging Conventions 125 pages | | - Defines the Open Packaging Conventions (OPC). Every OpenXML file comprises a collection of byte streams called parts, combined into a container called a package. The packaging format is defined by the OPC. - Describes a recommended physical implementation of the OPC that uses the Zip file format. - Declares the XML schemas for the OPC as XML Schema Definitions (XSD) (2), in an annex that is issued only in electronic form. The annex also includes non-normative representations of the schemas using RELAX NG (ISO/IEC 19757-2) (3). 
| +| Part 3 – Primer 466 pages | | - Introduces the features of each markup language, providing context and illustrating elements through examples and diagrams. This Part is informative (non-normative). - Describes the facility for storing custom XML data within a package to support integration with business data. | +| | | 2 | ## Page: 2 ## Page: 3 -Part 4 –Markup- Defines every element and attribute, the hierarchy of parent/child relationships for elements, Language Reference and additional semantics as appropriate. This Part is intended for use as a reference whenever 5756 pages complete detail about an element or attribute is required. +Part 4 – Markup- Defines every element and attribute, the hierarchy of parent/child relationships for elements, Language Reference and additional semantics as appropriate. This Part is intended for use as a reference whenever 5756 pages complete detail about an element or attribute is required. - Defines the facility for storing custom XML data. - Declares the XML schemas for the markup languages as XSD (2), in an annex that is issued only in electronic form. The annex also expresses them non-normatively using RELAX NG (ISO/IEC 19757-2) (3). -Part 5 –Markup- Describes facilities for extension of OpenXML documents. Compatibility and- Specifies elements and attributes by which applications with different extensions can Extensibility interoperate. 34 Pages +Part 5 – Markup- Describes facilities for extension of OpenXML documents. Compatibility and- Specifies elements and attributes by which applications with different extensions can Extensibility interoperate. 34 Pages - Expresses extensibility rules using NVDL (ISO/IEC 19757-4) (4). In order to ease reading and navigation through these documents, the electronic versions have many internal active links. In particular, Part 4 has links to parent and child elements throughout. 
@@ -165,7 +165,7 @@ A developer wishing to extend the OpenXML feature set has two main options: - Extension lists: An extension list (§3:2.6) stores arbitrary custom XML without a visual representation. Developers have room to innovate outside of those extensibility mechanisms. -- Alternative interaction paradigms. OpenXML specifies more than document syntax but less than application behavior. As described in the Conformance statement, it focuses on semantics (§1:2.2, §1.2.3). Consequently, a conformant application is free to communicate with an end user through a variety of means, or not communicate with an end user at all –as long as it respects the specified semantics. +- Alternative interaction paradigms. OpenXML specifies more than document syntax but less than application behavior. As described in the Conformance statement, it focuses on semantics (§1:2.2, §1.2.3). Consequently, a conformant application is free to communicate with an end user through a variety of means, or not communicate with an end user at all – as long as it respects the specified semantics. - Novel computing environments. The Conformance statement admits applications that have low capacity, so that they can run on small devices, and applications that implement only a subset of OpenXML (§1:2.6). The Additional Characteristics mechanism permits a producing application to communicate its capacity limits (§3:8.1). As indicated in the previous subsection, some of the most substantial opportunities for innovation do not involve rendering documents for direct user interaction. Instead, they involve machine-to-machine processing using XML message formats, e.g., via XML Web Services (9). Although such applications have no user-visible behavior other than their operations on data contained within OpenXML documents, they are subject to document conformance (§1:2.4) and application conformance (§1:2.5), which are purely syntactic, and interoperability guidelines (§1:2.6), which incorporate semantics. 
@@ -180,11 +180,11 @@ A primary objective of this white paper is to enable the reader to follow the hi ## Page: 11 -- document –the root element of the main document (§3:2.3). -- body –body (§3:2.7.1). Can contain multiple paragraphs. Can also contain section properties specified in a sectPr element. -- p –paragraph (§3:2.4.1). Can contain one or more runs. Can also contain paragraph properties specified in a pPr element, which in turn can contain default run properties (also referred to as character properties) specified in a rPr element (§3:2.4.4). -- r –run (§3:2.4.2). Can contain multiple types of run content, primarily text ranges. Can also contain run properties (rPr). The run is a fundamental concept within OpenXML. A run is a contiguous piece of text with identical properties; a run contains no additional text markup. For example, if a sentence were to contain the words “this is **three runs”**, then it would be represented by at least three runs: “thisis ”, “**three”, and “ runs”. In this respect,** OpenXML differs significantly from formats that allow for arbitrary nesting of properties, such as HTML. -- t –text range (§3:2.4.3.1). Contains an arbitrary amount of text with no formatting, line breaks, tables, graphics, or other non-text material. The formatting for the text is inherited from the run properties and the paragraph properties. +- document – the root element of the main document (§3:2.3). +- body – body (§3:2.7.1). Can contain multiple paragraphs. Can also contain section properties specified in a sectPr element. +- p – paragraph (§3:2.4.1). Can contain one or more runs. Can also contain paragraph properties specified in a pPr element, which in turn can contain default run properties (also referred to as character properties) specified in a rPr element (§3:2.4.4). +- r – run (§3:2.4.2). Can contain multiple types of run content, primarily text ranges. Can also contain run properties (rPr). The run is a fundamental concept within OpenXML. 
A run is a contiguous piece of text with identical properties; a run contains no additional text markup. For example, if a sentence were to contain the words “this is **three runs”**, then it would be represented by at least three runs: “thisis ”, “**three”, and “ runs”. In this respect,** OpenXML differs significantly from formats that allow for arbitrary nesting of properties, such as HTML. +- t – text range (§3:2.4.3.1). Contains an arbitrary amount of text with no formatting, line breaks, tables, graphics, or other non-text material. The formatting for the text is inherited from the run properties and the paragraph properties. In this subsection, we have touched upon direct formatting of text by specifying paragraph and run properties. Direct formatting falls at the end of an order of application that also includes character, paragraph, numbering, and table styles, as well as document defaults (§3:2.8.10). Those styles are themselves organized into inheritance hierarchies (§3:2.8.9). The subsection “Minimal WordprocessingML Document”below lists a WordprocessingML document in full. @@ -195,10 +195,10 @@ The subsection “Minimal WordprocessingML Document”below lists a Wordprocessi - slides (§3:4.2.3) and notes pages (§3:4.2.4), which inherit properties from slide layouts and notes masters respectively. Each master, layout, and slide is stored in its own part. The name of each part is specified in the relationship part for the presentation part. Each of the six parts other than presentation is structured in essentially the same way. A typical path from root to leaf in the XML tree would comprise these XML elements (§3:2.2): -- sld, sldLayout, sldMaster, notes, notesMaster, or handoutMaster –the root element. -- cSld –slide (§4:4.4.1.15). Can contain DrawingML elements (as described in the next two bullets) and other structural elements (as described below). -- spTree –shape tree (§4:4.4.1.42). 
Can contain group shape properties in a grpSpPr element (§4:4.4.1.20) and non- visual group shape properties in an nvGrpSpPr element (§4:4.4.1.28). This node and its descendants are all DrawingML elements. We list some DrawingML elements here because of their pivotal role in PresentationML. -- sp –shape (§4:4.4.1.40). Can contain shape properties in a spPr element (§4:4.4.1.41) and non-visual shape properties in an nvSpPr element (§4:4.4.1.31). +- sld, sldLayout, sldMaster, notes, notesMaster, or handoutMaster – the root element. +- cSld – slide (§4:4.4.1.15). Can contain DrawingML elements (as described in the next two bullets) and other structural elements (as described below). +- spTree – shape tree (§4:4.4.1.42). Can contain group shape properties in a grpSpPr element (§4:4.4.1.20) and non- visual group shape properties in an nvGrpSpPr element (§4:4.4.1.28). This node and its descendants are all DrawingML elements. We list some DrawingML elements here because of their pivotal role in PresentationML. +- sp – shape (§4:4.4.1.40). Can contain shape properties in a spPr element (§4:4.4.1.41) and non-visual shape properties in an nvSpPr element (§4:4.4.1.31). In addition to the DrawingML shape content, a cSld can contain other structural elements, depending on the root element in which it resides, as summarized in this table: ## Page: 12 @@ -210,20 +210,20 @@ Transition X X X Timing X X X Headers and Footers X X X X Matching Name X Layout Properties specified by objects lower in the default hierarchy (slide master, slide layout, slide) override the corresponding properties specified by objects higher in the hierarchy. For example, if a transition is not specified for a slide, then it is taken from the slide layout; if it is not specified there, then it is taken from the slide master. **5.4** A SpreadsheetML document is described at the top level by a workbook part. 
The workbook part is the target of the package relationship whose type is: [http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument) The workbook part stores information about the workbook and its structure, such as file version, creating application, and password to modify. Logically, the workbook contains one or more sheets (§3:3.2); physically, each sheet is stored in its own part and is referenced in the usual manner from the workbook part. Each sheet can be a worksheet, a chart sheet, or a dialog sheet. We will discuss only the worksheet, which is the most common type. Within a worksheet object, a typical path from root to leaf in the XML tree would comprise these XML elements: -- worksheet –the root element in a worksheet (§3:3.2). -- sheetData –the cell table, which represents every non-empty cell in the worksheet (§3:3.2.4). -- c –one cell (§3:3.2.9). The r attribute indicates the cell’s location using A1-style coordinates. The cell can also have a style identifier (attribute s) and a data type (attribute t). -- v and f –the value (§3:3.2.9.1) and optional formula (§3:3.2.9.2) of the cell. If a cell has a formula, then the value is the result of the most recent calculation. +- worksheet – the root element in a worksheet (§3:3.2). +- sheetData – the cell table, which represents every non-empty cell in the worksheet (§3:3.2.4). +- c – one cell (§3:3.2.9). The r attribute indicates the cell’s location using A1-style coordinates. The cell can also have a style identifier (attribute s) and a data type (attribute t). +- v and f – the value (§3:3.2.9.1) and optional formula (§3:3.2.9.2) of the cell. If a cell has a formula, then the value is the result of the most recent calculation. Both strings and formulas are stored in shared tables (§3:3.3 and §3:3.2.9.2.1) to avoid redundant storage and speed loads and saves. 
**5.5 SUPPORTING MARKUP LANGUAGE S** Several supporting markup languages can also be used to describe the content of an OpenXML document. -- DrawingML (§3:5) –used to represent shapes and other graphically rendered objects within a document. -- VML (§3:6) –a format for vector graphics that is included for backwards compatibility and will eventually be replaced by DrawingML. +- DrawingML (§3:5) – used to represent shapes and other graphically rendered objects within a document. +- VML (§3:6) – a format for vector graphics that is included for backwards compatibility and will eventually be replaced by DrawingML. - Shared MLs: Math (§3:7.1), Metadata (§3:7.2), Custom XML (§3:7.3), and Bibliography (§3:7.4). -| 5.6 | | | | MINIMAL WORDPROCESSINGML DOCUMENT | | -| --- | ---------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | ------------------------------------------------------------------------ | 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| | The content-type part “ The package-relationship part “ | The document part, in this case “ (§1:13.2), and SpreadsheetML (§1:12.2). | Hello, world. | This subsection contains a minimal WordprocessingML document that comprises three parts. describes the content types of the two other required parts. ContentType="application/vnd.openxmlformats-package.relationships+xml"/> ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/> /_rels/.rels” describes the relationship between the package and the main document part. Type="[http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument") /document.xml” , contains the document content. The Specification provides minimal documents and additional detail for WordprocessingML (§1:11.2), PresentationML OpenXML is the product of substantial effort by representatives from many industry and public institutions with diverse backgrounds and organizational interests. It covers the full set of features used in the existing document corpus, as well as the internationalization needs inherent in all of the major language groups worldwide. 
As a result of the standardization work by Ecma TC45 (1) and contributions via public comment, OpenXML has enabled a high level of interoperability and platform independence; and its documentation has become both complete (through extensive reference material) and accessible (through non-normative descriptions). It also includes enough information for assistive technology products to properly process documents. OpenXML implementations can be very small and provide focused functionality, or they can encompass the full feature set. Extensibility mechanisms built into the format guarantee room for innovation. Standardizing the format specification and maintaining it over time ensure that multiple parties can safely rely on it, confident that further evolution will enjoy the checks and balances afforded by an open standards process. The compelling need exists for an open document-format standard that is capable of preserving the billions of documents that have been created in the pre- existing binary formats, and the billions that continue to be created each year. Technological advances in hardware, networking, and a standards-based software infrastructure make it possible. The explosive diversification in market demand –including significant existing investments in mission critical business systems –makes it essential. 
13 | +| 5.6 | | | | MINIMAL WORDPROCESSINGML DOCUMENT | +| --- | ---------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------- | -------------------------------------------------------------------------- | -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | +| | The content-type part “ The package-relationship part “ | The document part, in this case “ (§1:13.2), and SpreadsheetML (§1:12.2). | ContentType="application/vnd.openxmlformats-package.relationships+xml"/> ContentType="application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/> /_rels/.rels” describes the relationship between the package and the main document part. Type="[http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"](http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument") Target="document.xml"/> /document.xml” , contains the document content. Hello, world. The Specification provides minimal documents and additional detail for WordprocessingML (§1:11.2), PresentationML OpenXML is the product of substantial effort by representatives from many industry and public institutions with diverse backgrounds and organizational interests. 
It covers the full set of features used in the existing document corpus, as well as the internationalization needs inherent in all of the major language groups worldwide. As a result of the standardization work by Ecma TC45 (1) and contributions via public comment, OpenXML has enabled a high level of interoperability and platform independence; and its documentation has become both complete (through extensive reference material) and accessible (through non-normative descriptions). It also includes enough information for assistive technology products to properly process documents. OpenXML implementations can be very small and provide focused functionality, or they can encompass the full feature set. Extensibility mechanisms built into the format guarantee room for innovation. Standardizing the format specification and maintaining it over time ensure that multiple parties can safely rely on it, confident that further evolution will enjoy the checks and balances afforded by an open standards process. The compelling need exists for an open document-format standard that is capable of preserving the billions of documents that have been created in the pre- existing binary formats, and the billions that continue to be created each year. Technological advances in hardware, networking, and a standards-based software infrastructure make it possible. The explosive diversification in market demand – including significant existing investments in mission critical business systems – makes it essential. 13 | ## Page: 13