Skip to content

fix(concurrency): fail-fast page fan-out that keeps exception identity#36

Merged
P4suta merged 3 commits into
mainfrom
concurrency/tasks-failfast-2
Jun 10, 2026
Merged

fix(concurrency): fail-fast page fan-out that keeps exception identity#36
P4suta merged 3 commits into
mainfrom
concurrency/tasks-failfast-2

Conversation

@P4suta

@P4suta P4suta commented Jun 10, 2026

Copy link
Copy Markdown
Owner

Stacked on #29. Combines the substrate rewrite and the 7-site migration in one PR: the build compiles with -Xlint:deprecation -Werror, so a deprecated migration window is not possible.

The defects (all verified in code)

  1. No fail-fast. Tasks.awaitAll (and its two private copies in despeckle/register) used invokeAll, which waits for every page even after the first failure — a corrupt page at position 1 of 600 still ran the other 599.
  2. Error kinds destroyed at the pool boundary. A worker's domain exception (DespeckleException/RegisterExceptionRuntimeExceptions carrying their ErrorKind) was wrapped into a generic IOException, so ExceptionMapper saw INTERNAL instead of the real kind: wrong exit code, wrong RunFailed token in the webapp's SSE stream. (despeckle's copy even wrapped IOExceptions, losing the page name.)
  3. ProcessRunner leaked the child on interrupt: InterruptedException from waitFor propagated without destroyForcibly, leaving the drainer close() hanging on a live child's pipes.
  4. Ctrl-C leaked the work area: no finally ran on SIGINT, so the whole p4suta-pipeline- temp tree of intermediates remained.

The new substrate (shared/process Tasks)

Batch-owned executor (try-with-resources) with StructuredTaskScope.join() semantics on final-Java features (preview stays off; a future swap to JEP 505 is internal-only):

  • Fail-fast + sibling interruption, plus a failure gate that stops queued tasks a freed worker dequeues in the shutdownNow race window (caught by the test that pinned this race).
  • Quiescence before the failure propagatesclose() joins every worker, so the callers' finally-deletes (PipelineRunner, RegistrationService scratch) can never race in-flight writers.
  • Exception identity preserved: IOException/RuntimeException unchanged, UncheckedIOException unwrapped — kinds and exit codes are now correct on page failures.
  • Two scheduling modes: Workers.platform(jobs) for CPU-bound Leptonica/FFM work (long downcalls pin virtual threads' carriers — virtual would silently cap -j at core count), Workers.virtual(maxInFlight) for subprocess waits (pdfimages chunks, per-page jbig2), semaphore-bounded.
  • ItemProgress fires on the orchestrating thread in completion order — per-page progress is now strictly ordered, and the AtomicInteger counter at every call site is gone.

Migration (all 7 sites)

G4EncodeStage, DespeckleService, RegistrationService (both passes), PdfExtractSource, shared extractor/assembler (ports lose the ExecutorService parameter), Jbig2PackService, both PdfPipelineServices — whose redundant outer pools that sat idle through the whole middle step are removed. The dead NativeTools.awaitAll copies (zero callers) are deleted.

Ctrl-C cleanup

Main installs a shutdown hook that interrupts the run and waits (bounded, 15s) for it to unwind; a latch makes the normal-exit path return immediately. Verified in-container: SIGINT to the process group aborts the run (exit 130, Error[INTERNAL] reported), children killed, no work dir remains. (Note for testers: signaling a non-job-control background launcher doesn't deliver SIGINT — the JVM honors an inherited SIG_IGN; a real terminal Ctrl-C delivers to the foreground process group.)

Tests

  • TasksTest: 12 deterministic, latch-coordinated tests — submission order, orchestrator-thread ordered progress, exception identity (assertSame) for IO/Runtime/UncheckedIO, label-wrapped unknown checked, fail-fast + sibling interrupt + queued-skip, quiescence (in-flight counter is 0 when the throw reaches the caller), caller interruption (flag restored, workers joined), virtual-mode in-flight bound, limit validation.
  • ProcessRunnerTest: interrupted waiter kills the child and unwinds promptly (bounded join is the proof; pre-fix this hung on the drainers).
  • Two existing tests that pinned the old wrap-everything behavior updated to pin the new identity-preserving contract.
  • Full ./gradlew check green; benchmark shows no regression (conv 14.29s vs 14.23s at -j8).

🤖 Generated with Claude Code

Re-file of #30 (same commits, cherry-picked onto main; the original was orphaned by stacked squash merges).

P4suta and others added 3 commits June 10, 2026 19:32
The shared Tasks.awaitAll (and its two private copies in despeckle/
register) had three real defects:

- No fail-fast: invokeAll waits for every page even after the first
  failure — a corrupt page at position 1 of 600 still ran the other 599.
- Error kinds were destroyed at the pool boundary: a worker's domain
  exception (DespeckleException, RegisterException — RuntimeExceptions
  carrying their ErrorKind) was wrapped into a generic IOException, so
  ExceptionMapper saw INTERNAL instead of the real kind: wrong exit
  code, wrong RunFailed token in the webapp's SSE stream.
- ProcessRunner leaked the child on interrupt: an InterruptedException
  from waitFor propagated without destroyForcibly, leaving the drainer
  close() hanging on a live child's pipes.

Tasks is rewritten around a batch-owned executor (try-with-resources)
with StructuredTaskScope.join() semantics on final-Java features:
fail-fast with sibling interruption plus a gate that stops queued tasks
a freed worker dequeues in the shutdownNow race window, quiescence
before the failure propagates (so finally-deletes never race in-flight
writers), and exception identity preserved (IOException/
RuntimeException unchanged, UncheckedIOException unwrapped). Two
scheduling modes: Workers.platform(jobs) for CPU-bound Leptonica/FFM
work (long downcalls pin virtual threads' carriers) and
Workers.virtual(maxInFlight) for subprocess waits (pdfimages chunks,
per-page jbig2), semaphore-bounded. ItemProgress now fires on the
orchestrating thread in completion order, so per-page progress events
are strictly ordered and the AtomicInteger counters at every call site
are gone.

All seven call sites migrate (G4EncodeStage, DespeckleService,
RegistrationService both passes, PdfExtractSource, the shared
extractor/assembler, Jbig2PackService, both PdfPipelineServices — whose
redundant outer pools, idle through the whole middle step, are
removed); the ExecutorService parameter disappears from the
PdfImageExtractor/Jbig2Assembler ports; the dead NativeTools.awaitAll
copies (zero callers) are deleted.

pdfbook's Main also gains a shutdown hook that interrupts the run and
waits (bounded 15s) for it to unwind, so Ctrl-C now kills the children,
quiesces the workers, and deletes the p4suta-pipeline- work area
instead of leaking it. Verified in-container: SIGINT to the process
group aborts the run (exit 130), no work dir remains.

Benchmark: no regression (conv 14.29s vs 14.23s at -j8 on the 200-page
fixture, within noise).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The test released its gate from inside the ok tasks, leaving a window
where the failing task's completion could overtake an ok task's
still-being-enqueued completion (countDown happens inside call(), the
queue add after it returns) — the recorded sequence then read [1]
instead of [1, 2]. The gate now opens from the progress callback on
the orchestrating thread, so both successes are provably consumed
before the failure is thrown. Also renames `failer` to `failing` for
the spell check.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@P4suta P4suta merged commit 2ca87cf into main Jun 10, 2026
20 checks passed
@P4suta P4suta deleted the concurrency/tasks-failfast-2 branch June 10, 2026 10:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant