feat(foreman-dispatch-bridge): retry failed workloads (bounded) by joryirving · Pull Request #776 · joryirving/containers

joryirving · 2026-07-01T18:21:34Z

Summary

Each bridge tick, before claiming new work, retry Failed bridge Workloads: delete + recreate at attempt+1 (up to RETRY_MAX_ATTEMPTS, default 3) so the re-run picks up the current config (gateProfile, agent refs). At the cap, leave the Workload as a Failed tombstone (issue stays claimed → groomer won't re-serve → no loop) for human triage.
Workloads now carry foreman.llmkube.dev/{attempt,issue-id,agent-name} annotations; ClaimedItem gains issue_id.
New retry.py is pure + dependency-injected (list/create/delete closures) → unit-tested without a cluster. VERSION 0.2.0 → 0.3.0.
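For readers without the diff open, the reconcile step described above could look roughly like the sketch below. It is an illustration under assumptions, not the code in retry.py: the closure signatures, the return value, and the inline attempt parsing are hypothetical; only the annotation key, the default cap of 3, and the delete-then-recreate order come from this PR.

```python
from typing import Callable, Iterable

# Annotation key named in this PR.
ATTEMPT_KEY = "foreman.llmkube.dev/attempt"


def reconcile_failures(
    list_failed: Callable[[], Iterable[dict]],
    create_workload: Callable[[dict, int], None],  # assumed signature
    delete_workload: Callable[[str], None],        # assumed signature
    max_attempts: int = 3,
) -> list:
    """Retry each Failed Workload below the cap; return the names retried."""
    retried = []
    for workload in list_failed():
        meta = workload.get("metadata") or {}
        annotations = meta.get("annotations") or {}
        try:
            attempt = int(annotations.get(ATTEMPT_KEY, 1))
        except (TypeError, ValueError):
            attempt = 1  # missing or garbage annotation counts as attempt 1
        if attempt >= max_attempts:
            continue  # at the cap: leave the Failed Workload as a tombstone
        name = meta["name"]
        delete_workload(name)                   # delete first...
        create_workload(workload, attempt + 1)  # ...then recreate at attempt+1
        retried.append(name)
    return retried
```

Because every side effect goes through an injected callable, a test can pass plain lambdas that record calls, which is what makes the logic testable without a cluster.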

Why

Workloads are terminal on Failed with no Foreman-side retry, so a failed coder run (transient infra, or — like now — a fix that shipped after the run) stayed dead with no way to re-run except manual kubectl. This makes retry a standing capability.

Deploy note

Needs list/get/delete added to the bridge's workloads RBAC (currently create-only) — separate home-ops PR. Bump the home-ops HelmRelease to 0.3.0 once the image builds.

Tests

pytest -q → 30 passed (8 new): attempt_of, item_from_workload, retry-below-max (delete+recreate at attempt+1 with current gateProfile), give-up-at-max (no-op tombstone), annotation stamping.

Foreman makes Workloads terminal on Failed with no built-in retry, so a failed coder run (transient infra, or a fix that landed after the run) sat dead. Each bridge tick now reconciles failures before claiming new work:

- Failed bridge Workloads under RETRY_MAX_ATTEMPTS (default 3) are deleted and recreated at attempt+1, so the re-run picks up the CURRENT config (gateProfile, agent refs). Deterministic name -> same branch; the delete blocks (foreground + poll) so the recreate doesn't race the cascade.
- At the cap, the Workload is left as a Failed tombstone (the issue stays claimed, so the groomer won't re-serve it into a loop) for human triage.

Workloads carry attempt/issue-id/agent-name annotations to drive this. ClaimedItem gains issue_id (traceability). New retry.py is pure + injected (list/create/delete) so it's unit-tested without a cluster. VERSION 0.2.0->0.3.0.
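The "delete blocks (foreground + poll)" behaviour can be pictured with the sketch below. It is a minimal illustration, assuming the delete request and the existence check are passed in as callables; the function name, parameters, and injected clock are hypothetical. The 60-second bound and the TimeoutError match what the review notes later on this page.

```python
import time
from typing import Callable


def delete_and_wait(
    delete: Callable[[], None],   # issues the foreground delete (assumed callable)
    exists: Callable[[], bool],   # False once a GET would return 404 (assumed callable)
    timeout_s: float = 60.0,
    interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Issue the delete, then block until the object is gone or the timeout hits."""
    delete()
    deadline = clock() + timeout_s
    while exists():
        if clock() >= deadline:
            raise TimeoutError(f"object still present after {timeout_s}s")
        sleep(interval_s)
```

Waiting for the object to disappear matters because the retry reuses the same deterministic name: a create issued while the old object is still terminating would conflict with it instead of starting a fresh run.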

smurf-bot · 2026-07-01T18:24:06Z

📦 App Size Analysis

Note

Comparing ghcr.io/joryirving/foreman-dispatch-bridge:rolling ➔ ghcr.io/joryirving/foreman-dispatch-bridge:sandbox

📈 Size Comparison Table

OS/Platform	Previous	Current	Change	Trend
linux/amd64	48 MB	48 MB	+2.47 kB (+0.01%)	🔼
linux/arm64	47.37 MB	47.37 MB	+1.8 kB (+0%)	🔼

its-saffron

AI Automated Review

Full PR review.

Analysis engine: MiniMax-M2.7@https://litellm.jory.dev/v1 (anthropic) — routed smart (risk match: db_or_migration_changes)

Summary

Approve — This PR adds bounded retry logic for Failed foreman-dispatch-bridge Workloads (delete + recreate at attempt+1, up to 3 total by default). The changes are well-structured, dependency-injected, and comprehensively unit-tested.

Classification note: The PR is tagged db_or_migration_changes, but the diff contains no database schema changes. It manages Kubernetes Custom Resource lifecycle only. The two mandatory migration checks are therefore not applicable:

Migration data loss risk review → not applicable (no DB migrations)
Test migration on production schema copy → not applicable (no DB migrations)

Change Analysis

`bridge/retry.py` (new, 85 lines)

Pure function reconcile_failures with injected list_failed, create_workload, delete_workload — cleanly unit-testable
attempt_of(): graceful handling of missing/garbage annotations (defaults to 1)
item_from_workload(): reconstructs ClaimedItem from Workload manifest for current-config retry
Bounded: attempt >= max_attempts leaves Workload as a Failed tombstone; issue stays claimed → groomer won't loop
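A rough sketch of the two helpers named above, for illustration only. The annotation keys are the ones this PR introduces; `RetryItem` is a hypothetical stand-in for the real ClaimedItem (whose full field list is not shown on this page), and treating values below 1 as garbage is an assumption.

```python
from typing import NamedTuple

PREFIX = "foreman.llmkube.dev"


class RetryItem(NamedTuple):
    """Hypothetical stand-in for ClaimedItem; the real model has more fields."""
    issue_id: str
    agent_name: str


def _annotations(workload: dict) -> dict:
    # Tolerate a missing metadata block or a null annotations map.
    return (workload.get("metadata") or {}).get("annotations") or {}


def attempt_of(workload: dict) -> int:
    """Return the attempt number, defaulting to 1 for missing or garbage values."""
    raw = _annotations(workload).get(f"{PREFIX}/attempt")
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return 1
    return value if value >= 1 else 1  # assumption: values below 1 count as garbage


def item_from_workload(workload: dict) -> RetryItem:
    """Rebuild the claim identity from the annotations stamped at creation."""
    ann = _annotations(workload)
    return RetryItem(
        issue_id=ann.get(f"{PREFIX}/issue-id", ""),
        agent_name=ann.get(f"{PREFIX}/agent-name", ""),
    )
```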

`bridge/main.py` (+48 lines)

delete_workload(): uses propagation_policy="Foreground" + 60s polling for 404 — ensures cascade deletion (including owned AgenticTasks) completes before retry creates
list_failed_workloads(): filters status.phase=="Failed" + created-by=dispatch-bridge
Retry runs before new claim work, so re-runs use current gateProfile/agent refs
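The filter in list_failed_workloads() can be read as a simple predicate over each Workload object. The sketch below assumes `created-by` is a label whose key is exactly that string; the review only gives the `created-by=dispatch-bridge` pair, so the key and its location (label rather than annotation) are assumptions.

```python
# Assumed label key; the review gives only "created-by=dispatch-bridge".
CREATED_BY_LABEL = "created-by"


def is_failed_bridge_workload(workload: dict) -> bool:
    """True for Workloads the bridge created that have reached phase Failed."""
    labels = (workload.get("metadata") or {}).get("labels") or {}
    phase = (workload.get("status") or {}).get("phase")
    return phase == "Failed" and labels.get(CREATED_BY_LABEL) == "dispatch-bridge"


def filter_failed(workloads: list) -> list:
    """Keep only the Workloads eligible for the retry pass."""
    return [w for w in workloads if is_failed_bridge_workload(w)]
```

Restricting the pass to bridge-created objects keeps the retry loop from deleting Workloads that something else owns.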

`bridge/workload.py` (+20 lines)

build_workload gains agent_name and attempt parameters; stamps foreman.llmkube.dev/{attempt,issue-id,agent-name} annotations
attempt defaults to 1 (backward compatible)
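The annotation stamping might look something like this. The three keys are taken from the PR; the helper name and its signature are hypothetical, and the real build_workload takes more parameters than shown here.

```python
PREFIX = "foreman.llmkube.dev"


def retry_annotations(issue_id: str, agent_name: str, attempt: int = 1) -> dict:
    """Build the annotations that let a later tick reconstruct and retry the run."""
    return {
        # Kubernetes annotation values must be strings, so the attempt is stringified.
        f"{PREFIX}/attempt": str(attempt),
        f"{PREFIX}/issue-id": issue_id,
        f"{PREFIX}/agent-name": agent_name,
    }
```

Defaulting `attempt` to 1 means existing callers that never pass it still produce a valid first-attempt Workload.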

`bridge/models.py`, `bridge/claim.py`

ClaimedItem.issue_id field added; populated from dispatch API issueId/id
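One plausible reading of "populated from dispatch API issueId/id" is a preference for `issueId` with a fallback to `id`. The sketch below assumes that order and a string-typed field; the `title` field and the `claimed_item_from` helper are placeholders for illustration, not the actual model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClaimedItem:
    """Trimmed-down illustration; the real model carries more fields."""
    issue_id: str
    title: str  # hypothetical field, present only to make the example concrete


def claimed_item_from(payload: dict) -> ClaimedItem:
    """Build a ClaimedItem, preferring `issueId` and falling back to `id`."""
    raw = payload.get("issueId")
    if raw is None:
        raw = payload.get("id")  # assumed fallback order
    return ClaimedItem(
        issue_id="" if raw is None else str(raw),
        title=str(payload.get("title", "")),
    )
```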

Tests

8 new tests in test_retry.py covering: attempt_of defaults/parsing, item_from_workload reconstruction, retry-below-max (delete+recreate at attempt+1), give-up-at-max (tombstone, no action), first-attempt-missing annotation, empty list no-op
test_workload.py: annotation stamping verification
test_claim.py: updated for new issue_id field
CI: all checks passed (30 tests passing per PR description)

Standards Compliance

Conventional commits: ✅ feat(foreman-dispatch-bridge): retry failed workloads (bounded)
Dependency-injected, unit-testable logic: ✅ retry.py uses closures
No secrets in code: ✅ Only env vars (RETRY_MAX_ATTEMPTS)
Version bump: ✅ 0.2.0 → 0.3.0 in docker-bake.hcl
No Dockerfile changes: ✅ container image unchanged

Linked Issue Fit

No linked issues. The PR description clearly defines the problem (Failed Workloads have no Foreman-side retry), the solution (bounded delete+recreate), and the escape hatch (tombstone at cap).

Evidence Provider Findings

No evidence providers configured.

Tool Harness Findings

No tool calls issued; reviewing corpus directly.

Risks and Mitigations

Risk	Assessment
Cascade deletion timeout (>60s)	Raises `TimeoutError` — logged for operator alert; acceptable
Race between list and retry	Within single tick; no concurrent modification risk
Same-name recreation conflicts	409 handled idempotently (existing code in `create_workload`)
Workload stuck in terminating	Polling for 404 prevents this; 60s cap prevents infinite wait
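The "409 handled idempotently" row refers to existing code not shown in this PR. As a rough picture of the pattern, assuming an exception type that exposes the HTTP status (the `ApiError` class below is a local stand-in, mirroring the `status` attribute on the Kubernetes Python client's `ApiException`):

```python
from typing import Callable


class ApiError(Exception):
    """Local stand-in for an API exception that carries an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"API error {status}")
        self.status = status


def create_idempotent(create: Callable[[], None]) -> bool:
    """Run `create`; return True if created, False if it already existed (409)."""
    try:
        create()
    except ApiError as exc:
        if exc.status == 409:
            return False  # Conflict: an object with this name already exists
        raise  # anything else is a real failure
    return True
```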

Unknowns / Needs Verification

RBAC out-of-band: The PR notes that list/get/delete RBAC on workloads must be added via a separate home-ops PR. This is a known deployment dependency, not a code blocker.
Production schema copy testing: Not applicable — no database migrations present.

smurf-bot Bot added area/apps app/foreman-dispatch-bridge labels Jul 1, 2026

its-saffron Bot approved these changes Jul 1, 2026

joryirving merged commit 3ea9f8a into main Jul 1, 2026
14 checks passed

joryirving deleted the feat/bridge-retry-failed-workloads branch July 1, 2026 18:29

joryirving merged 1 commit into main from feat/bridge-retry-failed-workloads
