perf(extract): remux all-CCITT sources instead of decoding, finer chunks by P4suta · Pull Request #35 · P4suta/my-pdf-tools-java

P4suta · 2026-06-10T09:55:31Z

Stacked on #28 (needs its benchPipeline harness for the A/B numbers and perf-baseline.md update).

Why

pdfimages -tiff decodes every embedded G4 image into an uncompressed TIFF (~2.2 MB per 600-dpi page — ~434 MB of transient intermediates for a 200-page book), even though the typical self-scanned source is CCITT G4 end to end. The originally planned -tiffcompression g4 flag turned out not to exist on pdfimages (it's a pdftoppm option), which forced a better design.

What

CCITT remux mode. One pdfimages -list pass picks the extractor's mode: when every embedded image is 1-bpp CCITT, each chunk dumps the raw embedded G4 streams (-ccitt) and CcittTiffs wraps them verbatim into single-strip CCITT-G4 TIFFs — a pure remux:

- no decode/re-encode anywhere in extraction,
- intermediates drop ~60× (34 KB vs 2.2 MB per page),
- the image's true ppi is stamped instead of pdfimages' default 72 dpi.
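
The mode decision can be sketched as below. This is a minimal illustration, not the PR's extractor: it assumes the column order that poppler's pdfimages -list prints (page, num, type, width, height, color, comp, bpc, enc, ...), and the class name, enum, and column indices are hypothetical.

```java
import java.util.List;

final class ExtractModeSketch {
    enum Mode { REMUX_CCITT, DECODE_TIFF }

    // Assumed positions of the "bpc" and "enc" columns in a pdfimages -list data row.
    private static final int BPC_COLUMN = 7;
    private static final int ENC_COLUMN = 8;

    /** Picks the whole-run mode from the lines of one `pdfimages -list` pass. */
    static Mode pick(List<String> listOutput) {
        int images = 0;
        for (String line : listOutput) {
            String[] col = line.trim().split("\\s+");
            // Skip the header, the dashed rule, and anything that is not a data row.
            if (col.length <= ENC_COLUMN || !col[0].matches("\\d+")) {
                continue;
            }
            images++;
            boolean oneBit = col[BPC_COLUMN].equals("1");
            boolean ccitt = col[ENC_COLUMN].equals("ccitt");
            if (!oneBit || !ccitt) {
                // A single non-CCITT image switches the whole run to decoded extraction.
                return Mode.DECODE_TIFF;
            }
        }
        // With no images listed there is nothing to remux, so stay on the decoded path.
        return images > 0 ? Mode.REMUX_CCITT : Mode.DECODE_TIFF;
    }
}
```

Treating an empty listing as the decoded mode is a choice made for this sketch; the PR does not say how the extractor handles that case.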

Defense in depth. PDF's EncodedByteAlign never reaches the dumped .params file (verified against poppler 24.02 source), so every wrapped page is decoded back once through Leptonica; a chunk that deviates in any way (params shape, dump count, or a wrap that fails to decode to the listed dimensions) is deleted and re-extracted decoded (-tiff) — which is also the whole-run mode for any non-CCITT source. The photometric mapping (-B → WhiteIsZero, -W → BlackIsZero) is pinned empirically by a pixel-identical round-trip test (the first mapping attempt was inverted; the test caught it).
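
For orientation, a single-strip wrap of the kind this verification guards can be sketched as follows. It is not the PR's CcittTiffs: the class and method names are hypothetical, the tag layout is a plain baseline TIFF 6.0 IFD, and the polarity flag mapping is the one stated above (-B to WhiteIsZero, -W to BlackIsZero). The sketch does no decode-back check of its own.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class G4WrapSketch {
    private static final int SHORT = 3, LONG = 4, RATIONAL = 5; // TIFF field types
    private static final int ENTRIES = 12;
    private static final int IFD_AT = 8;                               // right after the 8-byte header
    private static final int XRES_AT = IFD_AT + 2 + ENTRIES * 12 + 4;  // after the IFD
    private static final int YRES_AT = XRES_AT + 8;
    private static final int STRIP_AT = YRES_AT + 8;                   // raw G4 data starts here

    /** Wraps a raw G4 stream verbatim into a little-endian, single-strip TIFF. */
    static byte[] wrap(byte[] g4, int width, int height, int ppi, String polarityFlag) {
        int photometric;
        if (polarityFlag.equals("-B")) {
            photometric = 0; // WhiteIsZero
        } else if (polarityFlag.equals("-W")) {
            photometric = 1; // BlackIsZero
        } else {
            throw new IllegalArgumentException("unexpected polarity flag: " + polarityFlag);
        }
        ByteBuffer b = ByteBuffer.allocate(STRIP_AT + g4.length).order(ByteOrder.LITTLE_ENDIAN);
        b.put((byte) 'I').put((byte) 'I').putShort((short) 42).putInt(IFD_AT);
        b.putShort((short) ENTRIES);
        entry(b, 256, LONG, width);        // ImageWidth
        entry(b, 257, LONG, height);       // ImageLength
        entry(b, 258, SHORT, 1);           // BitsPerSample
        entry(b, 259, SHORT, 4);           // Compression: CCITT T.6 (Group 4)
        entry(b, 262, SHORT, photometric); // PhotometricInterpretation
        entry(b, 273, LONG, STRIP_AT);     // StripOffsets
        entry(b, 277, SHORT, 1);           // SamplesPerPixel
        entry(b, 278, LONG, height);       // RowsPerStrip: the whole image is one strip
        entry(b, 279, LONG, g4.length);    // StripByteCounts
        entry(b, 282, RATIONAL, XRES_AT);  // XResolution (value stored out of line)
        entry(b, 283, RATIONAL, YRES_AT);  // YResolution (value stored out of line)
        entry(b, 296, SHORT, 2);           // ResolutionUnit: inch
        b.putInt(0);                       // no further IFD
        b.putInt(ppi).putInt(1);           // XResolution = ppi/1
        b.putInt(ppi).putInt(1);           // YResolution = ppi/1
        b.put(g4);                         // the embedded stream, byte for byte
        return b.array();
    }

    // Writes one 12-byte IFD entry with count 1; SHORT values are left-justified in the value field.
    private static void entry(ByteBuffer b, int tag, int type, int value) {
        b.putShort((short) tag).putShort((short) type).putInt(1);
        if (type == SHORT) {
            b.putShort((short) value).putShort((short) 0);
        } else {
            b.putInt(value);
        }
    }
}
```

A TIFF G4 strip has no field for PDF's EncodedByteAlign, which is why a wrap like this needs the decode-back check described above before it can be trusted.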

Finer chunks. Extraction chunks shrink from total/jobs to ~12 pages (capped at 4×jobs): fast finishers free their pool slot early, and a future streaming source can consume pages chunk by chunk.
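
One way to read the chunk rule is sketched below, under the assumption that "capped at 4×jobs" limits the number of chunks rather than their size. The class name, the even split of the remainder, and the 1-based page ranges are illustrative choices, not details from the PR.

```java
import java.util.ArrayList;
import java.util.List;

final class ChunkPlanSketch {
    static final int TARGET_PAGES = 12; // target chunk size named in the PR

    /** Returns inclusive, 1-based {firstPage, lastPage} ranges covering all pages. */
    static List<int[]> plan(int totalPages, int jobs) {
        if (totalPages <= 0 || jobs <= 0) {
            throw new IllegalArgumentException("totalPages and jobs must be positive");
        }
        int wanted = (totalPages + TARGET_PAGES - 1) / TARGET_PAGES; // ceil(total / 12)
        int count = Math.min(wanted, 4 * jobs);                      // assumed meaning of the cap
        List<int[]> chunks = new ArrayList<>(count);
        int first = 1;
        for (int i = 0; i < count; i++) {
            // Spread the remainder over the leading chunks so sizes differ by at most one page.
            int size = totalPages / count + (i < totalPages % count ? 1 : 0);
            chunks.add(new int[] {first, first + size - 1});
            first += size;
        }
        return chunks;
    }
}
```

Under this reading, the 200-page fixture at -j8 splits into 17 chunks of 11 or 12 pages instead of 8 chunks of 25, and at -j1 the cap yields 4 chunks of 50 pages.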

Measurements (200-page fixture, warm median of 3, vs #28 baseline)

| Metric | Before | After |
| --- | --- | --- |
| extract (-j8) | 1.15s | 0.46s (−60%) |
| extract (-j1) | 4.57s | 0.88s (−81%) |
| conv (-j1) | 49.85s | 45.98s (−7.8%) |
| conv (-j8) | 14.48s | 14.23s (−1.7%) |
| intermediates | ~434 MB | ~7 MB |

Meets the acceptance rule on both prongs: ≥5% wall at -j1 and an explicit disk win.

Verification

- Full ./gradlew check green (all modules; ArchUnit, Error Prone, NullAway).
- New tests: params parser, single-strip wrap round-trip (pixel-identical via Leptonica), extractor-level remux (per-page .tif only, stamped 200 ppi, no .ccitt/.params residue, non-inverted ink).
- PipelineFlowTest e2e exercises the new path through despeckle → register → spread.
- Output validated: qpdf --check clean, 100 spreads, linearized.

🤖 Generated with Claude Code

Note: re-filed #29 — GitHub closed the original instead of retargeting when its base branch was deleted on the #28 merge.

Nothing in the pipeline measured where a run's time went: ProgressEvents carry no timestamps and the stage logs no durations, so optimization work had no baseline to argue against. This adds the measurement layer:

- --timings: a StageTimingSink (composed in the CLI shell) prints a stable, machine-parseable per-stage breakdown to stderr when a run ends ("timing: <stage> = <seconds>s (<percent>%)"), including the still-open stage on failure.
- PipelineRunner logs each stage directory's byte total, making the intermediate I/O of every stage visible.
- benchPipeline: a Gradle task driving the installDist launcher with --timings (PipelineBenchmark, test sources), measuring E2E wall, the per-stage medians, peak RSS via /proc VmHWM, and output size, over a -Pjobs sweep; writes pipeline/docs/perf-baseline.md.
- createSampleScan: a deterministic synthetic 600-dpi A5 scan book (specks for despeckle, ±0.5° skew for deskew) so the benchmark needs no copyrighted input and stays comparable across machines.

Baseline on the 200-page fixture (8 CPUs): conv 14.48s at -j8 (despeckle 68%, register 22.6%, extract 7.9%, spread 1.5%) and a 3.44x scale-up from -j1, recorded in pipeline/docs/perf-baseline.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
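
The timing line format quoted in that commit message lends itself to a small parser. The sketch below assumes plain decimal numbers for the seconds and percent fields, which the message does not spell out; the class and field names are hypothetical and are not PipelineBenchmark's code.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TimingLineSketch {
    /** One parsed "timing: <stage> = <seconds>s (<percent>%)" line. */
    static final class Timing {
        final String stage;
        final double seconds;
        final double percent;

        Timing(String stage, double seconds, double percent) {
            this.stage = stage;
            this.seconds = seconds;
            this.percent = percent;
        }
    }

    // Assumes decimal notation such as "9.85" and "68.0" for the two numeric fields.
    private static final Pattern LINE = Pattern.compile(
            "timing: (.+?) = ([0-9]+(?:\\.[0-9]+)?)s \\(([0-9]+(?:\\.[0-9]+)?)%\\)");

    /** Returns the parsed timing, or empty for any stderr line that is not a timing line. */
    static Optional<Timing> parse(String stderrLine) {
        Matcher m = LINE.matcher(stderrLine.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Timing(
                m.group(1),
                Double.parseDouble(m.group(2)),
                Double.parseDouble(m.group(3))));
    }
}
```

Returning an empty Optional for non-matching lines lets a caller feed it all of stderr and keep only the per-stage entries.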

pdfimages -tiff decodes every embedded G4 image into an uncompressed TIFF (~2.2 MB per 600-dpi page; ~434 MB of transient intermediates for a 200-page book) even though the typical self-scanned source is CCITT G4 end to end. (The originally planned `-tiffcompression g4` flag does not exist on pdfimages; it is a pdftoppm option.)

The extractor now picks its mode from one pdfimages -list pass: when every embedded image is 1-bpp CCITT, each chunk dumps the raw G4 streams (-ccitt) and CcittTiffs wraps them verbatim into single-strip CCITT-G4 TIFFs. This is a pure remux: no decode/re-encode, intermediates drop ~60x, and the image's true ppi is stamped instead of pdfimages' default 72 dpi.

Because PDF's EncodedByteAlign never reaches the dumped .params file, every wrapped page is decoded back once through Leptonica as verification; a chunk that deviates in any way (params shape, count, or a wrap that fails to decode) is re-extracted decoded, which is also the whole-run mode for any non-CCITT source. The photometric mapping (-B -> WhiteIsZero, -W -> BlackIsZero) is pinned empirically by a pixel-identical round trip test.

Extraction chunks also shrink from total/jobs to ~12 pages (capped at 4*jobs): fast finishers free their pool slot early, and a future streaming source can consume pages chunk by chunk.

Benchmark (200-page fixture, warm median of 3, vs the PR #28 baseline): extract 1.15s -> 0.46s at -j8 (4.57s -> 0.88s at -j1), conv 49.85s -> 45.98s (-7.8%) at -j1, intermediates ~434 MB -> ~7 MB. Output validated with qpdf --check (100 spreads, linearized, no errors).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The CI spell check flags PDFBox's COSName.DECODE_PARMS (the PDF spec's own key name, which the remux test must name verbatim) and a hyphenated coinage in the extractor's javadoc. Allowlist the spec identifier, following the same precedent as the veraPDF en-GB names, and replace the coinage with plain words.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Squash merges orphan the stack's ancestry, so the benchmark documents (regenerated on every bench run) collide as add/add between this branch and main. Align them to main's version; the round's closing PR commits the final regenerated baselines, so no information is lost from the final state. The measured numbers this PR contributed remain in its commit message and PR description.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

P4suta and others added 4 commits June 10, 2026 13:47

P4suta merged commit 3b6fcdd into main Jun 10, 2026
20 checks passed

P4suta deleted the perf/ccitt-passthrough-extract branch June 10, 2026 10:26
