perf(imaging): route morphology through Leptonica's DWA kernels, exactly by P4suta · Pull Request #33 · P4suta/my-pdf-tools-java

P4suta · 2026-06-10T06:51:48Z

Stacked on #32. The main course of the despeckle round.

Why

The cleaner's two heaviest morphology ops ran on Leptonica's generic rasterop bricks: the 43×43 isolated-dust dilation (37.5ms/page — the single hottest op in the committed baseline) and the 7×7 fill-holes opening (12.2ms). Leptonica ships word-accelerated DWA kernels for both; they were unbound.

The trap the equality sweep caught

Binding them surfaced exactly what the planned pixel-identity gate existed for: pixDilateBrickDwa silently diverges from the generic brick for sel sizes missing from the generated table — empirically every prime above 15, including the production 43 — while pixOpenBrickDwa is exact at every size and dilate is exact up to 15. (The composite pix*CompBrickDwa variants diverge too; diagnosed per-size in-container, table in the test rationale.)

The shipped routing (exact by construction)

- dilated(): a single DWA pass for sizes ≤ 15; larger sizes are composed from safe-size DWA passes. The composition is exact by Minkowski sum (brick(a) ⊕ brick(b) = brick(a+b−1); per-pass clipping changes nothing inside the image rectangle because L∞ paths between in-bounds points stay in bounds) and version-robust (only the always-present small sels are used). 43×43 takes three passes of 15: 15+15−1 = 29, then 29+15−1 = 43.
- opened(): DWA for sizes ≤ 15 (production is 7×7), generic beyond (an opening does not compose from smaller passes).
- Generic variants stay as package-private oracles; PixTest pins pixel-identity across radii 0–31 (sel 1–63) on border-touching ink plus degenerate (tiny/all-black/all-white) pages.
- Rider: pixCountPixels reuses one process-lifetime popcount table (hygiene, <0.1%).
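
The composition argument can be checked on a toy model. The sketch below is not the project's Pix code and does not call Leptonica: `BrickCompose`, `passRadii`, `dilate`, and `dilateComposed` are hypothetical names, the bitmap is a plain `boolean[][]`, and sizes are expressed as radii (sel = 2r+1, so the safe limit 15 is radius 7). It compares a clipped single-pass dilation against the same dilation composed from small passes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: square-brick dilation composed from small passes. */
final class BrickCompose {
    /** Largest radius routed as one pass here: radius 7 is sel size 15. */
    static final int MAX_SAFE_RADIUS = 7;

    /** Splits a radius into pass radii, each at most MAX_SAFE_RADIUS, that sum to it. */
    static List<Integer> passRadii(int radius) {
        List<Integer> passes = new ArrayList<>();
        int remaining = radius;
        while (remaining > MAX_SAFE_RADIUS) {
            passes.add(MAX_SAFE_RADIUS);
            remaining -= MAX_SAFE_RADIUS;
        }
        passes.add(remaining); // radius 0 yields a single identity pass
        return passes;
    }

    /** Reference dilation by a (2r+1)x(2r+1) brick, clipped to the image rectangle. */
    static boolean[][] dilate(boolean[][] img, int r) {
        int h = img.length, w = img[0].length;
        boolean[][] out = new boolean[h][w];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!img[y][x]) continue;
                for (int yy = Math.max(0, y - r); yy <= Math.min(h - 1, y + r); yy++) {
                    for (int xx = Math.max(0, x - r); xx <= Math.min(w - 1, x + r); xx++) {
                        out[yy][xx] = true;
                    }
                }
            }
        }
        return out;
    }

    /** Same dilation, built from passes of radius <= MAX_SAFE_RADIUS. */
    static boolean[][] dilateComposed(boolean[][] img, int radius) {
        boolean[][] current = img;
        for (int r : passRadii(radius)) {
            current = dilate(current, r);
        }
        return current;
    }

    public static void main(String[] args) {
        boolean[][] page = new boolean[40][50];
        page[0][0] = true;   // border-touching ink
        page[39][49] = true; // opposite corner
        page[20][3] = true;
        System.out.println(passRadii(21)); // prints [7, 7, 7]
        System.out.println(Arrays.deepEquals(dilate(page, 21), dilateComposed(page, 21))); // prints true
    }
}
```

Working in radii keeps every pass odd-sized, so the sel origin stays centred and the pass radii simply add up.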

Measured

metric	before	after
dilate 43×43	37.5ms	14.1ms (−62%)
open 7×7	12.2ms	4.0ms (−67%)
clean() pipeline path	139.5ms	107.6ms (−38.5% cumulative vs the #31 baseline)
despeckle stage (-j8)	9.57s	5.47s (−43%) — bandwidth relief compounds across workers
conv (-j8)	13.61s	9.51s (−30%; −34% vs the original #28 baseline)

selectBySize is now ~70% of the remaining clean() — the measured gate for the follow-up selection restructuring (De Morgan flip + single-labeling fusion) is met.

Verification

Full ./gradlew check green; ArchUnit FFM confinement untouched (new bindings in Leptonica, routing in Pix).
The golden cleaner tests pass unchanged (they exercise both DWA paths).
Rollback = revert the dilated/opened internals (~10 lines); bindings and sweep stay harmless.

🤖 Generated with Claude Code

Nothing in the pipeline measured where a run's time went: ProgressEvents carry no timestamps and the stage logs no durations, so optimization work had no baseline to argue against. This adds the measurement layer:

- --timings: a StageTimingSink (composed in the CLI shell) prints a stable, machine-parseable per-stage breakdown to stderr when a run ends ("timing: <stage> = <seconds>s (<percent>%)"), including the still-open stage on failure.
- PipelineRunner logs each stage directory's byte total, making the intermediate I/O of every stage visible.
- benchPipeline: a Gradle task driving the installDist launcher with --timings (PipelineBenchmark, test sources), measuring E2E wall, the per-stage medians, peak RSS via /proc VmHWM, and output size, over a -Pjobs sweep; writes pipeline/docs/perf-baseline.md.
- createSampleScan: a deterministic synthetic 600-dpi A5 scan book (specks for despeckle, ±0.5° skew for deskew) so the benchmark needs no copyrighted input and stays comparable across machines.

Baseline on the 200-page fixture (8 CPUs): conv 14.48s at -j8 (despeckle 68%, register 22.6%, extract 7.9%, spread 1.5%) and a 3.44x scale-up from -j1, recorded in pipeline/docs/perf-baseline.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
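
A consumer of the --timings output could parse the quoted line shape roughly as follows. This is a minimal sketch, not the project's PipelineBenchmark: `TimingLines` and `StageTiming` are hypothetical names, and it assumes seconds and percent are printed as plain decimals with a dot, which the commit message does not state.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for lines shaped like "timing: <stage> = <seconds>s (<percent>%)". */
final class TimingLines {
    /** One parsed line; field names are illustrative. */
    static final class StageTiming {
        final String stage;
        final double seconds;
        final double percent;

        StageTiming(String stage, double seconds, double percent) {
            this.stage = stage;
            this.seconds = seconds;
            this.percent = percent;
        }
    }

    // Assumption: numbers are plain decimals such as 1.50 or 75.0.
    private static final Pattern LINE = Pattern.compile(
            "^timing: (.+?) = (\\d+(?:\\.\\d+)?)s \\((\\d+(?:\\.\\d+)?)%\\)$");

    /** Returns the parsed timing, or empty for any line that is not a timing line. */
    static Optional<StageTiming> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new StageTiming(
                m.group(1),
                Double.parseDouble(m.group(2)),
                Double.parseDouble(m.group(3))));
    }

    public static void main(String[] args) {
        // Made-up sample line, only to exercise the parser.
        parse("timing: despeckle = 1.50s (75.0%)")
                .ifPresent(t -> System.out.println(t.stage + " " + t.seconds + " " + t.percent));
    }
}
```

Returning an empty Optional for non-matching lines lets the caller feed all of stderr through the parser without filtering first.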

pdfimages -tiff decodes every embedded G4 image into an uncompressed TIFF (~2.2 MB per 600-dpi page; ~434 MB of transient intermediates for a 200-page book) even though the typical self-scanned source is CCITT G4 end to end. (The originally planned `-tiffcompression g4` flag does not exist on pdfimages; it is a pdftoppm option.)

The extractor now picks its mode from one pdfimages -list pass: when every embedded image is 1-bpp CCITT, each chunk dumps the raw G4 streams (-ccitt) and CcittTiffs wraps them verbatim into single-strip CCITT-G4 TIFFs. This is a pure remux: no decode/re-encode, intermediates drop ~60x, and the image's true ppi is stamped instead of pdfimages' default 72 dpi. Because PDF's EncodedByteAlign never reaches the dumped .params file, every wrapped page is decoded back once through Leptonica as verification; a chunk that deviates in any way (params shape, count, or a wrap that fails to decode) is re-extracted decoded, which is also the whole-run mode for any non-CCITT source. The photometric mapping (-B -> WhiteIsZero, -W -> BlackIsZero) is pinned empirically by a pixel-identical round-trip test.

Extraction chunks also shrink from total/jobs to ~12 pages (capped at 4*jobs): fast finishers free their pool slot early, and a future streaming source can consume pages chunk by chunk.

Benchmark (200-page fixture, warm median of 3, vs the PR #28 baseline): extract 1.15s -> 0.46s at -j8 (4.57s -> 0.88s at -j1), conv 49.85s -> 45.98s (-7.8%) at -j1, intermediates ~434 MB -> ~7 MB. Output validated with qpdf --check (100 spreads, linearized, no errors).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
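
The mode decision described above (remux only when every embedded image is 1-bpp CCITT) might look roughly like this. It is a sketch under assumptions, not the project's extractor: `ExtractMode`, `Mode`, and `choose` are hypothetical names, and the column positions (bpc as the 8th and enc as the 9th whitespace-separated field of a data row) are an assumption about poppler's `pdfimages -list` layout that should be checked against the installed version.

```java
import java.util.List;

/** Hypothetical sketch: pick the extraction mode from `pdfimages -list` rows. */
final class ExtractMode {
    enum Mode { CCITT_REMUX, DECODED }

    // Assumed field positions in a data row (0-based, whitespace-separated).
    private static final int BPC_FIELD = 7;
    private static final int ENC_FIELD = 8;

    /** Remux only if at least one image is listed and all are 1-bpp CCITT. */
    static Mode choose(List<String> listOutputLines) {
        boolean sawImage = false;
        for (String line : listOutputLines) {
            String[] fields = line.trim().split("\\s+");
            // Header and separator rows do not start with a page number; skip them.
            if (fields.length <= ENC_FIELD || !fields[0].matches("\\d+")) {
                continue;
            }
            sawImage = true;
            boolean oneBit = fields[BPC_FIELD].equals("1");
            boolean ccitt = fields[ENC_FIELD].equals("ccitt");
            if (!(oneBit && ccitt)) {
                return Mode.DECODED; // one deviating image switches the whole run
            }
        }
        return sawImage ? Mode.CCITT_REMUX : Mode.DECODED;
    }
}
```

Treating an empty listing as DECODED keeps the conservative path as the default; the per-chunk verification and fallback the commit describes would sit on top of this choice.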

The shared Tasks.awaitAll (and its two private copies in despeckle/register) had three real defects:

- No fail-fast: invokeAll waits for every page even after the first failure, so a corrupt page at position 1 of 600 still ran the other 599.
- Error kinds were destroyed at the pool boundary: a worker's domain exception (DespeckleException or RegisterException, both RuntimeExceptions carrying their ErrorKind) was wrapped into a generic IOException, so ExceptionMapper saw INTERNAL instead of the real kind: wrong exit code, wrong RunFailed token in the webapp's SSE stream.
- ProcessRunner leaked the child on interrupt: an InterruptedException from waitFor propagated without destroyForcibly, leaving the drainer close() hanging on a live child's pipes.

Tasks is rewritten around a batch-owned executor (try-with-resources) with StructuredTaskScope.join() semantics built only on final (non-preview) Java features: fail-fast with sibling interruption plus a gate that stops queued tasks a freed worker dequeues in the shutdownNow race window, quiescence before the failure propagates (so finally-deletes never race in-flight writers), and exception identity preserved (IOException/RuntimeException unchanged, UncheckedIOException unwrapped).

Two scheduling modes: Workers.platform(jobs) for CPU-bound Leptonica/FFM work (long downcalls pin virtual threads' carriers) and Workers.virtual(maxInFlight) for subprocess waits (pdfimages chunks, per-page jbig2), semaphore-bounded. ItemProgress now fires on the orchestrating thread in completion order, so per-page progress events are strictly ordered and the AtomicInteger counters at every call site are gone.

All seven call sites migrate (G4EncodeStage, DespeckleService, RegistrationService both passes, PdfExtractSource, the shared extractor/assembler, Jbig2PackService, and both PdfPipelineServices, whose redundant outer pools, idle through the whole middle step, are removed); the ExecutorService parameter disappears from the PdfImageExtractor/Jbig2Assembler ports; the dead NativeTools.awaitAll copies (zero callers) are deleted.

pdfbook's Main also gains a shutdown hook that interrupts the run and waits (bounded 15s) for it to unwind, so Ctrl-C now kills the children, quiesces the workers, and deletes the p4suta-pipeline- work area instead of leaking it. Verified in-container: SIGINT to the process group aborts the run (exit 130), no work dir remains.

Benchmark: no regression (conv 14.29s vs 14.23s at -j8 on the 200-page fixture, within noise).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
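
The fail-fast semantics listed above can be approximated with standard java.util.concurrent pieces. The following is a simplified sketch, not the project's Tasks class: `FailFastBatch` and its `awaitAll` signature are hypothetical, it uses a fixed platform-thread pool only, and it returns results in completion order rather than submission order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletionService;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorCompletionService;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.IntConsumer;

/** Hypothetical sketch of a fail-fast batch; not the project's Tasks class. */
final class FailFastBatch {
    static <T> List<T> awaitAll(int workers, List<Callable<T>> tasks, IntConsumer onItemDone)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(workers); // owned by this batch
        CompletionService<T> completions = new ExecutorCompletionService<>(pool);
        AtomicBoolean stopped = new AtomicBoolean(false);
        List<T> results = new ArrayList<>(); // completion order, not submission order
        Throwable failure = null;
        try {
            for (Callable<T> task : tasks) {
                completions.submit(() -> {
                    // Gate: a task dequeued while shutdown is racing must not start.
                    if (stopped.get()) {
                        throw new CancellationException("batch stopped");
                    }
                    return task.call();
                });
            }
            for (int i = 0; i < tasks.size() && failure == null; i++) {
                try {
                    results.add(completions.take().get());
                    onItemDone.accept(results.size()); // fires on the calling thread
                } catch (ExecutionException e) {
                    failure = e.getCause(); // keep the worker's own exception
                }
            }
        } finally {
            stopped.set(true);
            pool.shutdownNow(); // interrupt running siblings, drop queued tasks
            while (!pool.awaitTermination(1, TimeUnit.SECONDS)) {
                // Wait for quiescence so the caller's cleanup cannot race live workers.
            }
        }
        if (failure instanceof Error) {
            throw (Error) failure;
        }
        if (failure != null) {
            throw (Exception) failure; // same instance the worker threw
        }
        return results;
    }
}
```

Because progress is reported from the loop that drains completions, the callback needs no synchronization, which is the property that made the per-call-site AtomicInteger counters unnecessary.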

despeckle dominates pdfbook conversions (~72% of conv), but nothing measured WHERE inside clean() the ~50ms/page/core goes. benchCleaner times each Leptonica primitive the cleaner composes: read, the four selectBySize shapes (incl. the inverted-page variant whose giant background component is rendered back), the 43x43 dilate, the 7x7 open, the boolean ops, both counting passes, and the G4 write. It also times clean() end-to-end, all on a deterministic synthetic 600-dpi A5 page, and writes the table to despeckle/docs/cleaner-baseline.md.

The committed baseline (174.9ms clean(), sigma row covers 92.5%): dilate 43x43 = 21.5%, selectBySize on the inverted page x2 = 25.3% (22.2ms vs 15.2ms on the normal page; this is the measured background-component re-render penalty), metrics-only countConnComp x2 = 13.5%, page selectBySize x2 = 17.4%, open 7x7 = 7.0%. Every following optimization is judged against this table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

clean() counted 8-connected components before and after every page (two full connected-component labelings, 13.5% of the page's time per the committed baseline), yet the counts feed nothing but the HTML report's charts and the summary log line. The pdfbook pipeline never sets a report dir, so its hot path paid for numbers nobody read.

ProcessResult's component counts become OptionalInt (a counted 4-int convenience constructor keeps every existing construction compiling unchanged; componentsRemoved() stays present-or-zero int, documented). ProcessOptions grows collectComponentStats (default true via the delegating 5-knob constructor) plus a withoutComponentStats() wither. DespeckleService turns counting off exactly when no reportDir is set; the report path is byte-identical, including its charts and totals. The black-pixel counts still run unconditionally: they feed the 3% over-removal guard. Summary logs switch to always-measured black-pixel terms when counting was skipped.

Measured: clean() 161.8ms -> 139.5ms single-threaded (-13.8%, matching the baseline's countConnComp share); pipeline at -j8 despeckle stage 10.19s -> 9.57s (-6.1%, the saving partially absorbed by memory-bandwidth saturation), conv -4.4%. The benchmark gains a "clean() without component stats" row pinning the pipeline-path number.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
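
The OptionalInt change can be pictured with a small value class. This is an illustrative sketch only: `ProcessResultSketch` and its field names are hypothetical, and which four ints the real ProcessResult carries is an assumption (two black-pixel counts and two component counts are guessed here).

```java
import java.util.OptionalInt;

/** Hypothetical sketch of a result whose component counts may be absent. */
final class ProcessResultSketch {
    private final int blackBefore;
    private final int blackAfter;
    private final OptionalInt componentsBefore;
    private final OptionalInt componentsAfter;

    /** Canonical constructor: component counts are empty when counting was skipped. */
    ProcessResultSketch(int blackBefore, int blackAfter,
                        OptionalInt componentsBefore, OptionalInt componentsAfter) {
        this.blackBefore = blackBefore;
        this.blackAfter = blackAfter;
        this.componentsBefore = componentsBefore;
        this.componentsAfter = componentsAfter;
    }

    /** Counted convenience constructor: existing four-int call sites keep compiling. */
    ProcessResultSketch(int blackBefore, int blackAfter,
                        int componentsBefore, int componentsAfter) {
        this(blackBefore, blackAfter,
                OptionalInt.of(componentsBefore), OptionalInt.of(componentsAfter));
    }

    OptionalInt componentsBefore() {
        return componentsBefore;
    }

    OptionalInt componentsAfter() {
        return componentsAfter;
    }

    /** Always measured, so callers such as an over-removal guard can rely on it. */
    int blackRemoved() {
        return blackBefore - blackAfter;
    }

    /** Present-or-zero: returns 0 when either component count was not collected. */
    int componentsRemoved() {
        if (componentsBefore.isPresent() && componentsAfter.isPresent()) {
            return componentsBefore.getAsInt() - componentsAfter.getAsInt();
        }
        return 0;
    }
}
```

Keeping the int-returning accessor means report code that only sums removals does not need to change, while code that cares about absence can inspect the OptionalInt values directly.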

The cleaner's two heaviest morphology ops ran on the generic rasterop bricks: the 43x43 isolated-dust dilation (37.5ms/page, the single hottest op in the baseline) and the 7x7 fill-holes opening (12.2ms). Leptonica ships word-accelerated DWA kernels for both, unbound.

Binding them surfaced a real trap the planned equality sweep was built to catch: pixDilateBrickDwa silently diverges from the generic brick for sel sizes missing from the generated table (every prime above 15, including the production 43), while pixOpenBrickDwa is exact at every size and dilate is exact up to 15. The shipped routing is therefore:

- dilated(): a single DWA pass up to size 15, larger sizes composed from safe-size DWA passes. This is exact by Minkowski sum (brick(a) then brick(b) equals brick(a+b-1); clipping per pass changes nothing inside the image rectangle since L-inf paths between in-bounds points stay in bounds) and version-robust (only the always-present small sels are used).
- opened(): DWA up to size 15 (production is 7x7), generic beyond (an opening does not compose from smaller passes).

The generic variants stay as package-private oracles, and PixTest pins pixel-identity across radii 0..31 (sel 1..63) on border-touching ink plus degenerate pages. Rider: pixCountPixels now reuses one process-lifetime popcount table instead of rebuilding it per call.

Measured: dilate 43x43 37.5 -> 14.1ms (-62%), open 7x7 12.2 -> 4.0ms (-67%); clean() on the pipeline path 139.5 -> 107.6ms; despeckle stage at -j8 9.57 -> 5.47s (-43%, the bandwidth relief compounds across workers), conv 13.61 -> 9.51s (-30%; -34% vs the original baseline). selectBySize is now ~70% of the remaining clean(), which meets the measured gate for the follow-up selection restructuring.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The CI spell check flags PDFBox's COSName.DECODE_PARMS (the PDF spec's own key name, which the remux test must name verbatim) and a hyphenated coinage in the extractor's javadoc. Allowlist the spec identifier — the same precedent as the veraPDF en-GB names — and use plain words. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…failfast

The test released its gate from inside the ok tasks, leaving a window where the failing task's completion could overtake an ok task's still-being-enqueued completion (countDown happens inside call(), the queue add after it returns) — the recorded sequence then read [1] instead of [1, 2]. The gate now opens from the progress callback on the orchestrating thread, so both successes are provably consumed before the failure is thrown. Also renames `failer` to `failing` for the spell check. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…rphology

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

…rphology

Squash merges orphan the stack's ancestry, so the benchmark documents (regenerated on every bench run) collide as add/add between this branch and main. Align them to main's version; the round's closing PR commits the final regenerated baselines, so no information is lost from the final state. The measured numbers this PR contributed remain in its commit message and PR description. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

P4suta · 2026-06-10T10:58:06Z

Re-filed as a fresh branch off main (stacked-squash ancestry orphaning); content identical.

P4suta and others added 6 commits June 10, 2026 13:47

P4suta mentioned this pull request Jun 10, 2026

docs(perf): close out the despeckle optimization round with final baselines #34

Closed

P4suta and others added 11 commits June 10, 2026 16:06

Merge branch 'perf/ccitt-passthrough-extract' into concurrency/tasks-…

fbe6417

…failfast

Merge branch 'concurrency/tasks-failfast' into perf/despeckle-bench

6512ac0

Merge branch 'perf/despeckle-bench' into perf/despeckle-skip-metrics

a2d54a2

Merge branch 'perf/despeckle-skip-metrics' into perf/despeckle-dwa-mo…

5301445

…rphology

style: rewrap the line the failing-task rename pushed past 100 columns

efac0ad

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Merge branch 'concurrency/tasks-failfast' into perf/despeckle-bench

65bb8ca

Merge branch 'perf/despeckle-bench' into perf/despeckle-skip-metrics

cda1e87

Merge branch 'perf/despeckle-skip-metrics' into perf/despeckle-dwa-mo…

0daa41b

…rphology

P4suta changed the base branch from perf/despeckle-skip-metrics to main June 10, 2026 10:16

P4suta closed this Jun 10, 2026

This was referenced Jun 10, 2026

perf(imaging): route morphology through Leptonica's DWA kernels, exactly #39

Merged

docs(perf): close out the despeckle optimization round with final baselines #40

Merged

P4suta deleted the perf/despeckle-dwa-morphology branch June 10, 2026 11:09

P4suta wants to merge 17 commits into main from perf/despeckle-dwa-morphology
