Skip to content

test(despeckle): add an op-level micro-benchmark for the page cleaner#31

Closed
P4suta wants to merge 11 commits into
mainfrom
perf/despeckle-bench
Closed

test(despeckle): add an op-level micro-benchmark for the page cleaner#31
P4suta wants to merge 11 commits into
mainfrom
perf/despeckle-bench

Conversation

@P4suta

@P4suta P4suta commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Stacked on #30. First PR of the despeckle-algorithm optimization round (despeckle = 71.6% of conv per pipeline/docs/perf-baseline.md).

Why

Nothing measured where inside clean() the ~50ms/page/core goes — and the optimization candidates (DWA morphology, metrics-pass skipping, selection restructuring) target different ops. This adds the op-level measurement first, so every following claim is judged against a committed table.

What

CleanerBenchmark (test sources, mirroring PipelineBenchmark): times each Leptonica primitive the cleaner composes plus clean() end-to-end on a deterministic synthetic 600-dpi A5 page (3496×4961, glyph columns + dust + isolated blots + pin-holes so all three passes have real work), writing despeckle/docs/cleaner-baseline.md. Task: ./gradlew :despeckle:infrastructure:benchCleaner (-Preps=N).

The baseline (clean() = 174.9ms single-threaded; Σ row covers 92.5%)

op median calls share
dilate 43×43 37.5ms 1 21.5%
selectBySize k=6 (inverted) 22.2ms 2 25.3%
countConnComp (metrics-only) 11.8ms 2 13.5%
selectBySize (page) 15.2ms 2 17.4%
open 7×7 12.2ms 1 7.0%
write/read G4 + booleans ~7%

Notable: the inverted-page selectBySize costs 22.2ms vs 15.2ms on the normal page — the giant-background-component re-render penalty is now a measured fact (it motivates the planned De Morgan flip), and the two metrics-only counting passes are 13.5% of the page.

🤖 Generated with Claude Code

P4suta and others added 4 commits June 10, 2026 13:47
Nothing in the pipeline measured where a run's time went: ProgressEvents
carry no timestamps and the stage logs no durations, so optimization
work had no baseline to argue against. This adds the measurement layer:

- --timings: a StageTimingSink (composed in the CLI shell) prints a
  stable, machine-parseable per-stage breakdown to stderr when a run
  ends ("timing: <stage> = <seconds>s (<percent>%)"), including the
  still-open stage on failure.
- PipelineRunner logs each stage directory's byte total, making the
  intermediate I/O of every stage visible.
- benchPipeline: a Gradle task driving the installDist launcher with
  --timings (PipelineBenchmark, test sources), measuring E2E wall, the
  per-stage medians, peak RSS via /proc VmHWM, and output size, over a
  -Pjobs sweep; writes pipeline/docs/perf-baseline.md.
- createSampleScan: a deterministic synthetic 600-dpi A5 scan book
  (specks for despeckle, ±0.5° skew for deskew) so the benchmark needs
  no copyrighted input and stays comparable across machines.

Baseline on the 200-page fixture (8 CPUs): conv 14.48s at -j8 —
despeckle 68%, register 22.6%, extract 7.9%, spread 1.5% — and a
3.44x scale-up from -j1, recorded in pipeline/docs/perf-baseline.md.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
pdfimages -tiff decodes every embedded G4 image into an uncompressed
TIFF (~2.2 MB per 600-dpi page; ~434 MB of transient intermediates for
a 200-page book) even though the typical self-scanned source is CCITT
G4 end to end. (The originally planned `-tiffcompression g4` flag does
not exist on pdfimages — it is a pdftoppm option.)

The extractor now picks its mode from one pdfimages -list pass: when
every embedded image is 1-bpp CCITT, each chunk dumps the raw G4
streams (-ccitt) and CcittTiffs wraps them verbatim into single-strip
CCITT-G4 TIFFs — a pure remux: no decode/re-encode, intermediates drop
~60x, and the image's true ppi is stamped instead of pdfimages' default
72 dpi. Because PDF's EncodedByteAlign never reaches the dumped .params
file, every wrapped page is decoded back once through Leptonica as
verification; a chunk that deviates in any way (params shape, count,
or a wrap that fails to decode) is re-extracted decoded, which is also
the whole-run mode for any non-CCITT source. The photometric mapping
(-B -> WhiteIsZero, -W -> BlackIsZero) is pinned empirically by a
pixel-identical round trip test.

Extraction chunks also shrink from total/jobs to ~12 pages (capped at
4*jobs): fast finishers free their pool slot early, and a future
streaming source can consume pages chunk by chunk.

Benchmark (200-page fixture, warm median of 3, vs the PR #28 baseline):
extract 1.15s -> 0.46s at -j8 (4.57s -> 0.88s at -j1), conv 49.85s ->
45.98s (-7.8%) at -j1, intermediates ~434 MB -> ~7 MB. Output validated
with qpdf --check (100 spreads, linearized, no errors).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The shared Tasks.awaitAll (and its two private copies in despeckle/
register) had three real defects:

- No fail-fast: invokeAll waits for every page even after the first
  failure — a corrupt page at position 1 of 600 still ran the other 599.
- Error kinds were destroyed at the pool boundary: a worker's domain
  exception (DespeckleException, RegisterException — RuntimeExceptions
  carrying their ErrorKind) was wrapped into a generic IOException, so
  ExceptionMapper saw INTERNAL instead of the real kind: wrong exit
  code, wrong RunFailed token in the webapp's SSE stream.
- ProcessRunner leaked the child on interrupt: an InterruptedException
  from waitFor propagated without destroyForcibly, leaving the drainer
  close() hanging on a live child's pipes.

Tasks is rewritten around a batch-owned executor (try-with-resources)
with StructuredTaskScope.join() semantics on final-Java features:
fail-fast with sibling interruption plus a gate that stops queued tasks
a freed worker dequeues in the shutdownNow race window, quiescence
before the failure propagates (so finally-deletes never race in-flight
writers), and exception identity preserved (IOException/
RuntimeException unchanged, UncheckedIOException unwrapped). Two
scheduling modes: Workers.platform(jobs) for CPU-bound Leptonica/FFM
work (long downcalls pin virtual threads' carriers) and
Workers.virtual(maxInFlight) for subprocess waits (pdfimages chunks,
per-page jbig2), semaphore-bounded. ItemProgress now fires on the
orchestrating thread in completion order, so per-page progress events
are strictly ordered and the AtomicInteger counters at every call site
are gone.

All seven call sites migrate (G4EncodeStage, DespeckleService,
RegistrationService both passes, PdfExtractSource, the shared
extractor/assembler, Jbig2PackService, both PdfPipelineServices — whose
redundant outer pools, idle through the whole middle step, are
removed); the ExecutorService parameter disappears from the
PdfImageExtractor/Jbig2Assembler ports; the dead NativeTools.awaitAll
copies (zero callers) are deleted.

pdfbook's Main also gains a shutdown hook that interrupts the run and
waits (bounded 15s) for it to unwind, so Ctrl-C now kills the children,
quiesces the workers, and deletes the p4suta-pipeline- work area
instead of leaking it. Verified in-container: SIGINT to the process
group aborts the run (exit 130), no work dir remains.

Benchmark: no regression (conv 14.29s vs 14.23s at -j8 on the 200-page
fixture, within noise).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
despeckle dominates pdfbook conversions (~72% of conv), but nothing
measured WHERE inside clean() the ~50ms/page/core goes. benchCleaner
times each Leptonica primitive the cleaner composes — read, the four
selectBySize shapes (incl. the inverted-page variant whose giant
background component is rendered back), the 43x43 dilate, the 7x7
open, the boolean ops, both counting passes, the G4 write — plus
clean() end-to-end, on a deterministic synthetic 600-dpi A5 page, and
writes the table to despeckle/docs/cleaner-baseline.md.

The committed baseline (174.9ms clean(), sigma row covers 92.5%):
dilate 43x43 = 21.5%, selectBySize on the inverted page x2 = 25.3%
(22.2ms vs 15.2ms on the normal page — the background-component
re-render penalty, measured), metrics-only countConnComp x2 = 13.5%,
page selectBySize x2 = 17.4%, open 7x7 = 7.0%. Every following
optimization is judged against this table.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
P4suta and others added 7 commits June 10, 2026 16:06
The CI spell check flags PDFBox's COSName.DECODE_PARMS (the PDF spec's
own key name, which the remux test must name verbatim) and a hyphenated
coinage in the extractor's javadoc. Allowlist the spec identifier — the
same precedent as the veraPDF en-GB names — and use plain words.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The test released its gate from inside the ok tasks, leaving a window
where the failing task's completion could overtake an ok task's
still-being-enqueued completion (countDown happens inside call(), the
queue add after it returns) — the recorded sequence then read [1]
instead of [1, 2]. The gate now opens from the progress callback on
the orchestrating thread, so both successes are provably consumed
before the failure is thrown. Also renames `failer` to `failing` for
the spell check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Squash merges orphan the stack's ancestry, so the benchmark documents
(regenerated on every bench run) collide as add/add between this branch
and main. Align them to main's version; the round's closing PR commits
the final regenerated baselines, so no information is lost from the
final state. The measured numbers this PR contributed remain in its
commit message and PR description.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@P4suta P4suta changed the base branch from concurrency/tasks-failfast to main June 10, 2026 10:16
@P4suta

P4suta commented Jun 10, 2026

Copy link
Copy Markdown
Owner Author

Re-filed as a fresh branch off main (stacked-squash ancestry orphaning); content identical.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant