
feat(foreman-dispatch-bridge): retry failed workloads (bounded) #776

Merged
joryirving merged 1 commit into main from feat/bridge-retry-failed-workloads on Jul 1, 2026
@joryirving

Summary

  • Each bridge tick, before claiming new work, retry Failed bridge Workloads: delete + recreate at attempt+1 (up to RETRY_MAX_ATTEMPTS, default 3) so the re-run picks up the current config (gateProfile, agent refs). At the cap, leave the Workload as a Failed tombstone (issue stays claimed → groomer won't re-serve → no loop) for human triage.
  • Workloads now carry foreman.llmkube.dev/{attempt,issue-id,agent-name} annotations; ClaimedItem gains issue_id.
  • New retry.py is pure + dependency-injected (list/create/delete closures) → unit-tested without a cluster. VERSION 0.2.0 → 0.3.0.

Why

Workloads are terminal on Failed with no Foreman-side retry, so a failed coder run (transient infra, or — like now — a fix that shipped after the run) stayed dead with no way to re-run except manual kubectl. This makes retry a standing capability.

Deploy note

Needs list/get/delete added to the bridge's workloads RBAC (currently create-only) — separate home-ops PR. Bump the home-ops HelmRelease to 0.3.0 once the image builds.

Tests

  • pytest -q → 30 passed (8 new): attempt_of, item_from_workload, retry-below-max (delete+recreate at attempt+1 with current gateProfile), give-up-at-max (no-op tombstone), annotation stamping.

Foreman makes Workloads terminal on Failed with no built-in retry, so a
failed coder run (transient infra, or a fix that landed after the run) sat
dead. Each bridge tick now reconciles failures before claiming new work:

- Failed bridge Workloads under RETRY_MAX_ATTEMPTS (default 3) are deleted
  and recreated at attempt+1, so the re-run picks up the CURRENT config
  (gateProfile, agent refs). The name is deterministic, so the retry reuses
  the same branch; the delete blocks (foreground propagation + poll) so the
  recreate doesn't race the cascade.
- At the cap, the Workload is left as a Failed tombstone (the issue stays
  claimed, so the groomer won't re-serve it into a loop) for human triage.

Workloads carry attempt/issue-id/agent-name annotations to drive this.
ClaimedItem gains issue_id (traceability). New retry.py is pure + injected
(list/create/delete) so it's unit-tested without a cluster. VERSION 0.2.0->0.3.0.
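
The reconcile step described above could be sketched roughly as follows. This is not the code from retry.py: the function signatures, the dict-shaped Workload manifests, and the shape of the injected closures are assumptions made for illustration. Only the annotation key, the default cap of 3, and the delete-then-recreate-at-attempt+1 behaviour come from the description.

```python
from typing import Callable, Iterable, List

# Annotation key taken from the PR description.
ATTEMPT_ANNOTATION = "foreman.llmkube.dev/attempt"


def attempt_of(workload: dict) -> int:
    """Read the attempt annotation; missing or unparseable values count as 1."""
    annotations = workload.get("metadata", {}).get("annotations") or {}
    try:
        return int(annotations.get(ATTEMPT_ANNOTATION, 1))
    except (TypeError, ValueError):
        return 1


def reconcile_failures(
    list_failed: Callable[[], Iterable[dict]],
    delete_workload: Callable[[str], None],
    create_workload: Callable[[dict, int], None],
    max_attempts: int = 3,
) -> List[str]:
    """Retry Failed Workloads below the cap; leave the rest as tombstones.

    The three closures are hypothetical stand-ins for the cluster calls.
    Returns the names of the Workloads that were retried.
    """
    retried: List[str] = []
    for workload in list_failed():
        attempt = attempt_of(workload)
        if attempt >= max_attempts:
            # At the cap: do nothing, so the Failed Workload stays in place
            # and the issue stays claimed.
            continue
        name = workload["metadata"]["name"]
        delete_workload(name)                   # assumed to block until gone
        create_workload(workload, attempt + 1)  # recreate at the next attempt
        retried.append(name)
    return retried
```

Because every cluster interaction is passed in, a test can drive this with plain lists and lambdas, which is the property the description attributes to retry.py.
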
@smurf-bot

smurf-bot Bot commented Jul 1, 2026


📦 App Size Analysis

Note

Comparing ghcr.io/joryirving/foreman-dispatch-bridge:rolling with ghcr.io/joryirving/foreman-dispatch-bridge:sandbox

📈 Size Comparison Table

| OS/Platform | Previous | Current | Change | Trend |
| --- | --- | --- | --- | --- |
| linux/amd64 | 48 MB | 48 MB | +2.47 kB (+0.01%) | 🔼 |
| linux/arm64 | 47.37 MB | 47.37 MB | +1.8 kB (+0%) | 🔼 |

@its-saffron its-saffron Bot left a comment


AI Automated Review

Full PR review.

Analysis engine: MiniMax-M2.7@https://litellm.jory.dev/v1 (anthropic) — routed smart (risk match: db_or_migration_changes)

Summary

Approve — This PR adds bounded retry logic for Failed foreman-dispatch-bridge Workloads (delete + recreate at attempt+1, up to 3 total by default). The changes are well-structured, dependency-injected, and comprehensively unit-tested.

Classification note: The PR is tagged db_or_migration_changes, but the diff contains no database schema changes. It manages Kubernetes Custom Resource lifecycle only. The two mandatory migration checks are therefore not applicable:

  • Migration data loss risk review → not applicable (no DB migrations)
  • Test migration on production schema copy → not applicable (no DB migrations)

Change Analysis

bridge/retry.py (new, 85 lines)

  • Pure function reconcile_failures with injected list_failed, create_workload, delete_workload — cleanly unit-testable
  • attempt_of(): graceful handling of missing/garbage annotations (defaults to 1)
  • item_from_workload(): reconstructs ClaimedItem from Workload manifest for current-config retry
  • Bounded: attempt >= max_attempts leaves Workload as a Failed tombstone; issue stays claimed → groomer won't loop
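
As an illustration of the reconstruction step, here is a minimal sketch of what item_from_workload might look like. The real ClaimedItem almost certainly has more fields; the two-field dataclass and the dict-shaped manifest are assumptions. The annotation keys are the ones named in the PR description.

```python
from dataclasses import dataclass
from typing import Optional

# Annotation prefix taken from the PR description.
ANNOTATION_PREFIX = "foreman.llmkube.dev/"


@dataclass(frozen=True)
class ClaimedItem:
    """Hypothetical subset of the bridge's ClaimedItem model."""
    issue_id: Optional[str]
    agent_name: Optional[str]


def item_from_workload(workload: dict) -> ClaimedItem:
    """Recover the claim identity from a Workload's annotations.

    Only identity is read back. Config such as gateProfile is deliberately
    not copied from the old manifest in this sketch, so a rebuild would use
    whatever the bridge is configured with at retry time.
    """
    annotations = workload.get("metadata", {}).get("annotations") or {}
    return ClaimedItem(
        issue_id=annotations.get(ANNOTATION_PREFIX + "issue-id"),
        agent_name=annotations.get(ANNOTATION_PREFIX + "agent-name"),
    )
```

Reading identity but not config from the failed Workload is one way to get the "current-config retry" behaviour the review describes; whether retry.py does exactly this is not shown in the page.
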

bridge/main.py (+48 lines)

  • delete_workload(): uses propagation_policy="Foreground" and polls for a 404 for up to 60s, so cascade deletion (including owned AgenticTasks) completes before the retry recreates the Workload
  • list_failed_workloads(): filters status.phase=="Failed" + created-by=dispatch-bridge
  • Retry runs before new claim work, so re-runs use current gateProfile/agent refs
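
The blocking delete can be sketched without a cluster by injecting the two API calls. In this sketch `delete` stands in for the foreground-propagation delete request and `exists` for a GET that returns False once the API answers 404; both closures, the parameter names, and the 1s poll interval are assumptions, while the 60s cap and the TimeoutError come from the review.

```python
import time
from typing import Callable


def delete_and_wait(
    delete: Callable[[], None],
    exists: Callable[[], bool],
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Issue the delete, then poll until the object is gone or time runs out."""
    delete()
    deadline = clock() + timeout_s
    while exists():
        if clock() >= deadline:
            # Still terminating after the cap: surface it instead of waiting forever.
            raise TimeoutError("workload still present after %.0fs" % timeout_s)
        sleep(interval_s)
```

Injecting `sleep` and `clock` keeps the timeout path testable in milliseconds, since a test can advance a fake clock instead of waiting.
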

bridge/workload.py (+20 lines)

  • build_workload gains agent_name and attempt parameters; stamps foreman.llmkube.dev/{attempt,issue-id,agent-name} annotations
  • attempt defaults to 1 (backward compatible)
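
A minimal sketch of the annotation stamping, assuming a helper that build_workload could call. The helper name and its validation are hypothetical; the three annotation keys and the default of attempt=1 are from the PR. Kubernetes annotation values must be strings, so the attempt is stringified.

```python
from typing import Dict

# Annotation prefix taken from the PR description.
ANNOTATION_PREFIX = "foreman.llmkube.dev/"


def retry_annotations(issue_id: str, agent_name: str, attempt: int = 1) -> Dict[str, str]:
    """Build the annotations that let a later tick drive the retry."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    return {
        ANNOTATION_PREFIX + "attempt": str(attempt),  # values must be strings
        ANNOTATION_PREFIX + "issue-id": str(issue_id),
        ANNOTATION_PREFIX + "agent-name": agent_name,
    }
```

Defaulting attempt to 1 means existing callers that never pass it still produce a first-attempt Workload, which matches the backward-compatibility note above.
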

bridge/models.py, bridge/claim.py

  • ClaimedItem.issue_id field added; populated from dispatch API issueId/id
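
One plausible reading of "issueId/id" is a lookup that prefers issueId and falls back to id. The sketch below assumes that precedence and a dict-shaped dispatch payload; neither is confirmed by the page, and the function name is hypothetical.

```python
from typing import Any, Mapping, Optional


def issue_id_from_payload(payload: Mapping[str, Any]) -> Optional[str]:
    """Return the issue id as a string, preferring 'issueId' over 'id'.

    The precedence is an assumption; the PR only says the field is
    populated from the dispatch API's issueId/id.
    """
    for key in ("issueId", "id"):
        value = payload.get(key)
        if value is not None and value != "":
            return str(value)
    return None
```

Normalising to a string up front keeps the value usable as an annotation without further conversion.
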

Tests

  • 8 new tests in test_retry.py covering: attempt_of defaults/parsing, item_from_workload reconstruction, retry-below-max (delete+recreate at attempt+1), give-up-at-max (tombstone, no action), first-attempt-missing annotation, empty list no-op
  • test_workload.py: annotation stamping verification
  • test_claim.py: updated for new issue_id field
  • CI: all checks passed (30 tests passing per PR description)

Standards Compliance

  • Conventional commits: ✅ feat(foreman-dispatch-bridge): retry failed workloads (bounded)
  • Dependency-injected, unit-testable logic: ✅ retry.py uses closures
  • No secrets in code: ✅ Only env vars (RETRY_MAX_ATTEMPTS)
  • Version bump: ✅ 0.2.0 → 0.3.0 in docker-bake.hcl
  • No Dockerfile changes: ✅ container image unchanged

Linked Issue Fit

No linked issues. The PR description clearly defines the problem (Failed Workloads have no Foreman-side retry), the solution (bounded delete+recreate), and the escape hatch (tombstone at cap).


Evidence Provider Findings

No evidence providers configured.


Tool Harness Findings

No tool calls issued; reviewing corpus directly.


Risks and Mitigations

| Risk | Assessment |
| --- | --- |
| Cascade deletion timeout (>60s) | Raises TimeoutError, logged for operator alert; acceptable |
| Race between list and retry | Within single tick; no concurrent modification risk |
| Same-name recreation conflicts | 409 handled idempotently (existing code in create_workload) |
| Workload stuck in terminating | Polling for 404 prevents this; 60s cap prevents infinite wait |

Unknowns / Needs Verification

  1. RBAC out-of-band: The PR notes that list/get/delete RBAC on workloads must be added via a separate home-ops PR. This is a known deployment dependency, not a code blocker.

  2. Production schema copy testing: Not applicable — no database migrations present.

@joryirving joryirving merged commit 3ea9f8a into main Jul 1, 2026
14 checks passed
@joryirving joryirving deleted the feat/bridge-retry-failed-workloads branch July 1, 2026 18:29