
sync: recover stale and blocked jobs more reliably #126

Open
thanosipsis wants to merge 1 commit into DamianB-BitFlipper:main from thanosipsis:upstream-job-recovery-fixes

thanosipsis commented Mar 6, 2026

Problem

Some sync jobs can get stuck in BLOCKED, or the worker pool can become effectively wedged after long-running or hung operations. In practice, recovering from this can require manual intervention even when the underlying condition is transient. Note that I have only run this app in Docker (where I encountered these issues), so these changes are untested outside Docker.

Changes

  • recover stale RUNNING_PID self-locks caused by PID reuse
  • add execution timeouts around long-running job operations
  • reclaim stale in-memory worker slots when a task never settles
  • periodically recover blocked jobs in the processor loop
  • requeue jobs blocked by transient auth failures, after a cooldown
  • requeue jobs blocked by recoverable remote "not found" races when the local path still exists
  • mark blocked jobs whose local path no longer exists as skipped/synced
  • treat create-file name conflicts as idempotent success to avoid endless retries
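
One plausible way PID reuse produces a stale self-lock in Docker: the app process tends to get the same small PID (often 1) on every container start, so a lock left behind by a previous run looks exactly like one the current process holds. The following is a minimal sketch of one way to tell the two apart, assuming each lock also records a random token generated once at process startup. `BOOT_TOKEN`, `JobLock`, `pid_alive`, and `is_stale` are hypothetical names for illustration, not this project's code.

```python
import os
import uuid
from dataclasses import dataclass
from typing import Optional

# Hypothetical: a random token generated once per process start. In Docker the
# same PID is often reused across restarts, so the PID alone cannot tell
# "a lock I hold" apart from "a lock left behind by a previous run".
BOOT_TOKEN = uuid.uuid4().hex


@dataclass
class JobLock:
    """Hypothetical lock record stored alongside a RUNNING_PID job."""
    pid: int
    boot_token: str


def pid_alive(pid: int) -> bool:
    """Return True if a process with this PID exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs error checking only, sends nothing
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def is_stale(lock: Optional[JobLock], my_pid: int, my_token: str) -> bool:
    """Decide whether a RUNNING_PID lock can safely be reclaimed."""
    if lock is None:
        return True
    if lock.pid == my_pid:
        # Same PID: the lock is only ours if this process instance wrote it.
        return lock.boot_token != my_token
    # Different PID: stale if that process no longer exists.
    return not pid_alive(lock.pid)
```

The `os.kill(pid, 0)` probe is POSIX-only: on Windows, 0 is `signal.CTRL_C_EVENT`, not a no-op. A live foreign PID may also belong to an unrelated process, which is why the per-process token is the stronger signal here.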

Validation

Tested against real sync workloads where jobs became blocked or workers stalled, including:

  • transient "Invalid access token" failures
  • remote file or folder not found races
  • immutable-file create conflicts
  • stale worker slots after long-running operations
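
The blocked-job scenarios above line up with the recovery rules listed under Changes. Here is a minimal sketch of how a periodic recovery pass might decide what to do with each blocked job. It assumes upload-style jobs whose source is a local path, and it classifies by error-message substrings. `BlockedJob`, `Action`, `recover`, `AUTH_COOLDOWN_S`, and the matched strings are all hypothetical, not taken from this PR.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Callable


class Action(Enum):
    REQUEUE = auto()       # put the job back in the queue for another attempt
    MARK_SYNCED = auto()   # nothing left to do: mark skipped/synced
    KEEP_BLOCKED = auto()  # leave as BLOCKED (cooling down or needs a human)


@dataclass
class BlockedJob:
    local_path: str    # local file the job uploads (assumed upload-style job)
    error: str         # last error message recorded on the job
    blocked_at: float  # epoch seconds when the job entered BLOCKED


# Hypothetical cooldown before retrying a job blocked by an auth failure.
AUTH_COOLDOWN_S = 300.0


def recover(job: BlockedJob, now: float,
            local_exists: Callable[[str], bool]) -> Action:
    """Decide what a periodic recovery pass should do with one blocked job."""
    msg = job.error.lower()
    if not local_exists(job.local_path):
        # Local source is gone, so there is nothing left to upload.
        return Action.MARK_SYNCED
    if "already exists" in msg:
        # Create-file name conflict: treat as idempotent success, not a retry.
        return Action.MARK_SYNCED
    if "invalid access token" in msg:
        # Transient auth failure: requeue only once the cooldown has elapsed.
        if now - job.blocked_at >= AUTH_COOLDOWN_S:
            return Action.REQUEUE
        return Action.KEEP_BLOCKED
    if "not found" in msg:
        # Remote file/folder raced away, but the local copy still exists.
        return Action.REQUEUE
    return Action.KEEP_BLOCKED
```

Substring matching is the crudest option. If the client exposes typed errors or status codes, classifying on those would be less brittle. The missing-local check runs first because it only makes sense for jobs that push a local file; a delete job would need its own branch.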
