test(despeckle): add an op-level micro-benchmark for the page cleaner by P4suta · Pull Request #37 · P4suta/my-pdf-tools-java

P4suta · 2026-06-10T10:33:09Z

Stacked on #30. First PR of the despeckle-algorithm optimization round (despeckle = 71.6% of conv per pipeline/docs/perf-baseline.md).

Why

Nothing yet measures where inside clean() the ~50ms/page/core goes, and the optimization candidates (DWA morphology, metrics-pass skipping, selection restructuring) target different ops. This PR adds the op-level measurement first, so that every subsequent claim is judged against a committed table.

What

CleanerBenchmark (test sources, mirroring PipelineBenchmark) times each Leptonica primitive the cleaner composes, plus clean() end-to-end, on a deterministic synthetic 600-dpi A5 page (3496×4961 px; glyph columns, dust, isolated blots, and pin-holes, so all three passes have real work). It writes the results to despeckle/docs/cleaner-baseline.md. Task: ./gradlew :despeckle:infrastructure:benchCleaner (-Preps=N).
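The per-op timing described above can be sketched as a small median-of-N harness. This is a minimal illustration, not the actual CleanerBenchmark code: the class name `MedianTimer`, the single warm-up run, the default of 5 reps, and the placeholder workload are all assumptions, and how `-Preps=N` reaches the real benchmark is not shown in this PR.

```java
import java.util.Arrays;

// Hypothetical sketch of a median-of-N op timer; not the project's CleanerBenchmark.
final class MedianTimer {

    /** Runs op once untimed (warm-up), then reps timed runs; returns the median in milliseconds. */
    static double medianMillis(int reps, Runnable op) {
        if (reps < 1) {
            throw new IllegalArgumentException("reps must be >= 1");
        }
        op.run(); // warm-up run, excluded from the samples
        long[] nanos = new long[reps];
        for (int i = 0; i < reps; i++) {
            long t0 = System.nanoTime();
            op.run();
            nanos[i] = System.nanoTime() - t0;
        }
        return median(nanos) / 1e6;
    }

    /** Median of the samples; for an even count, the mean of the two middle values. */
    static double median(long[] samples) {
        long[] s = samples.clone();
        Arrays.sort(s);
        int mid = s.length / 2;
        return (s.length % 2 == 1) ? s[mid] : (s[mid - 1] + s[mid]) / 2.0;
    }

    public static void main(String[] args) {
        // Assumption: reps comes from the first argument, defaulting to 5.
        int reps = args.length > 0 ? Integer.parseInt(args[0]) : 5;

        // Placeholder workload standing in for one Leptonica primitive.
        int[] data = new int[1 << 20];
        Arrays.fill(data, 1);
        long[] sink = new long[1]; // keeps the result observable so the loop is not optimized away

        double ms = medianMillis(reps, () -> {
            long sum = 0;
            for (int v : data) {
                sum += v;
            }
            sink[0] = sum;
        });
        System.out.printf("placeholder op: median %.3f ms over %d reps (sink=%d)%n", ms, reps, sink[0]);
    }
}
```

Reporting the median rather than the mean keeps a single slow run (GC pause, cold cache) from skewing a row.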

The baseline (clean() = 174.9ms single-threaded; Σ row covers 92.5%)

| op | median | calls | share |
| --- | --- | --- | --- |
| dilate 43×43 | 37.5ms | 1 | 21.5% |
| selectBySize k=6 (inverted) | 22.2ms | 2 | 25.3% |
| countConnComp (metrics-only) | 11.8ms | 2 | 13.5% |
| selectBySize (page) | 15.2ms | 2 | 17.4% |
| open 7×7 | 12.2ms | 1 | 7.0% |
| write/read G4 + booleans | | | ~7% |
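The share column appears to follow from the other columns as median × calls ÷ clean() time. The sketch below recomputes it from the listed rows under that assumption; the class and method names are hypothetical. Because the published medians are already rounded to 0.1ms, the recomputed shares can differ from the table by about 0.1 percentage points.

```java
// Hypothetical check of the share column: share = median * calls / clean() total.
final class ShareCheck {

    /** Percentage of totalMs spent in an op that runs `calls` times at `medianMs` each. */
    static double sharePercent(double medianMs, int calls, double totalMs) {
        return 100.0 * medianMs * calls / totalMs;
    }

    public static void main(String[] args) {
        double cleanMs = 174.9; // clean() end-to-end, from the baseline heading

        String[] ops = {
            "dilate 43x43",
            "selectBySize k=6 (inverted)",
            "countConnComp (metrics-only)",
            "selectBySize (page)",
            "open 7x7",
        };
        double[] medians = {37.5, 22.2, 11.8, 15.2, 12.2};
        int[] calls = {1, 2, 2, 2, 1};

        for (int i = 0; i < ops.length; i++) {
            System.out.printf("%-30s %5.1f%%%n", ops[i], sharePercent(medians[i], calls[i], cleanMs));
        }
    }
}
```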

Notable: the inverted-page selectBySize costs 22.2ms per call vs 15.2ms on the normal page, so the giant-background-component re-render penalty is now a measured fact (it motivates the planned De Morgan flip). The two metrics-only counting passes together account for 13.5% of clean().
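To see why inverting a page produces one giant component, consider a toy bitmap. The sketch below is a plain BFS flood fill with 4-connectivity, written only for illustration. It is not Leptonica's implementation, and the class name, the 5×5 demo page, and the connectivity choice are all assumptions.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: component sizes on a tiny page and on its inverse.
final class ComponentSizes {

    /** Sizes of the 4-connected components of 'true' pixels, in raster-scan order. */
    static List<Integer> sizes(boolean[][] img) {
        int h = img.length, w = img[0].length;
        boolean[][] seen = new boolean[h][w];
        int[] dy = {1, -1, 0, 0};
        int[] dx = {0, 0, 1, -1};
        List<Integer> out = new ArrayList<>();
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!img[y][x] || seen[y][x]) {
                    continue;
                }
                int size = 0;
                ArrayDeque<int[]> queue = new ArrayDeque<>();
                queue.add(new int[]{y, x});
                seen[y][x] = true;
                while (!queue.isEmpty()) {
                    int[] p = queue.poll();
                    size++;
                    for (int d = 0; d < 4; d++) {
                        int ny = p[0] + dy[d], nx = p[1] + dx[d];
                        if (ny >= 0 && ny < h && nx >= 0 && nx < w && img[ny][nx] && !seen[ny][nx]) {
                            seen[ny][nx] = true;
                            queue.add(new int[]{ny, nx});
                        }
                    }
                }
                out.add(size);
            }
        }
        return out;
    }

    /** Returns a new bitmap with every pixel flipped. */
    static boolean[][] invert(boolean[][] img) {
        boolean[][] out = new boolean[img.length][img[0].length];
        for (int y = 0; y < img.length; y++) {
            for (int x = 0; x < img[0].length; x++) {
                out[y][x] = !img[y][x];
            }
        }
        return out;
    }

    /** A 5x5 page with three isolated one-pixel specks. */
    static boolean[][] demoPage() {
        boolean[][] page = new boolean[5][5];
        page[1][1] = true;
        page[1][3] = true;
        page[3][2] = true;
        return page;
    }

    public static void main(String[] args) {
        System.out.println("normal:   " + sizes(demoPage()));         // [1, 1, 1]
        System.out.println("inverted: " + sizes(invert(demoPage()))); // [22]
    }
}
```

On the normal page every component is tiny. On the inverted page the background becomes a single component covering almost the whole bitmap, which is the component the inverted-page selectBySize has to render back.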

🤖 Generated with Claude Code

Re-file of #31 (same commit, cherry-picked onto main).

despeckle dominates pdfbook conversions (~72% of conv), but nothing measured WHERE inside clean() the ~50ms/page/core goes.

benchCleaner times each Leptonica primitive the cleaner composes: read, the four selectBySize shapes (incl. the inverted-page variant whose giant background component is rendered back), the 43x43 dilate, the 7x7 open, the boolean ops, both counting passes, and the G4 write. It also times clean() end-to-end, all on a deterministic synthetic 600-dpi A5 page, and writes the table to despeckle/docs/cleaner-baseline.md.

The committed baseline (174.9ms clean(), sigma row covers 92.5%): dilate 43x43 = 21.5%, selectBySize on the inverted page x2 = 25.3% (22.2ms vs 15.2ms on the normal page: the background-component re-render penalty, measured), metrics-only countConnComp x2 = 13.5%, page selectBySize x2 = 17.4%, open 7x7 = 7.0%.

Every following optimization is judged against this table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

P4suta and others added 2 commits June 10, 2026 19:33

Merge branch 'main' into perf/despeckle-bench-2

1a88956

P4suta merged commit 25e7809 into main Jun 10, 2026
20 checks passed

P4suta deleted the perf/despeckle-bench-2 branch June 10, 2026 10:51
