feat(foreman-dispatch-bridge): retry failed workloads (bounded)#776
Conversation
Foreman makes Workloads terminal on Failed with no built-in retry, so a failed coder run (transient infra, or a fix that landed after the run) sat dead. Each bridge tick now reconciles failures before claiming new work: - Failed bridge Workloads under RETRY_MAX_ATTEMPTS (default 3) are deleted and recreated at attempt+1, so the re-run picks up the CURRENT config (gateProfile, agent refs). Deterministic name -> same branch; the delete blocks (foreground + poll) so the recreate doesn't race the cascade. - At the cap, the Workload is left as a Failed tombstone (the issue stays claimed, so the groomer won't re-serve it into a loop) for human triage. Workloads carry attempt/issue-id/agent-name annotations to drive this. ClaimedItem gains issue_id (traceability). New retry.py is pure + injected (list/create/delete) so it's unit-tested without a cluster. VERSION 0.2.0->0.3.0.
📦 App Size AnalysisNote Comparing 📈 Size Comparison Table
|
There was a problem hiding this comment.
AI Automated Review
Full PR review.
Analysis engine: MiniMax-M2.7@https://litellm.jory.dev/v1 (anthropic) — routed smart (risk match: db_or_migration_changes)
Summary
Approve — This PR adds bounded retry logic for Failed foreman-dispatch-bridge Workloads (delete + recreate at attempt+1, up to 3 total by default). The changes are well-structured, dependency-injected, and comprehensively unit-tested.
Classification note: The PR is tagged db_or_migration_changes, but the diff contains no database schema changes. It manages Kubernetes Custom Resource lifecycle only. The two mandatory migration checks are therefore not applicable:
- Migration data loss risk review → not applicable (no DB migrations)
- Test migration on production schema copy → not applicable (no DB migrations)
Change Analysis
bridge/retry.py (new, 85 lines)
- Pure function
reconcile_failureswith injectedlist_failed,create_workload,delete_workload— cleanly unit-testable attempt_of(): graceful handling of missing/garbage annotations (defaults to 1)item_from_workload(): reconstructsClaimedItemfrom Workload manifest for current-config retry- Bounded:
attempt >= max_attemptsleaves Workload as a Failed tombstone; issue stays claimed → groomer won't loop
bridge/main.py (+48 lines)
delete_workload(): usespropagation_policy="Foreground"+ 60s polling for 404 — ensures cascade deletion (including owned AgenticTasks) completes before retry createslist_failed_workloads(): filtersstatus.phase=="Failed"+created-by=dispatch-bridge- Retry runs before new claim work, so re-runs use current gateProfile/agent refs
bridge/workload.py (+20 lines)
build_workloadgainsagent_nameandattemptparameters; stampsforeman.llmkube.dev/{attempt,issue-id,agent-name}annotationsattemptdefaults to 1 (backward compatible)
bridge/models.py, bridge/claim.py
ClaimedItem.issue_idfield added; populated from dispatch APIissueId/id
Tests
- 8 new tests in
test_retry.pycovering:attempt_ofdefaults/parsing,item_from_workloadreconstruction, retry-below-max (delete+recreate at attempt+1), give-up-at-max (tombstone, no action), first-attempt-missing annotation, empty list no-op test_workload.py: annotation stamping verificationtest_claim.py: updated for newissue_idfield- CI: all checks passed (30 tests passing per PR description)
Standards Compliance
- Conventional commits: ✅
feat(foreman-dispatch-bridge): retry failed workloads (bounded) - Dependency-injected, unit-testable logic: ✅
retry.pyuses closures - No secrets in code: ✅ Only env vars (
RETRY_MAX_ATTEMPTS) - Version bump: ✅
0.2.0 → 0.3.0indocker-bake.hcl - No Dockerfile changes: ✅ container image unchanged
Linked Issue Fit
No linked issues. The PR description clearly defines the problem (Failed Workloads have no Foreman-side retry), the solution (bounded delete+recreate), and the escape hatch (tombstone at cap).
Evidence Provider Findings
No evidence providers configured.
Tool Harness Findings
No tool calls issued; reviewing corpus directly.
Risks and Mitigations
| Risk | Assessment |
|---|---|
| Cascade deletion timeout (>60s) | Raises TimeoutError — logged for operator alert; acceptable |
| Race between list and retry | Within single tick; no concurrent modification risk |
| Same-name recreation conflicts | 409 handled idempotently (existing code in create_workload) |
| Workload stuck in terminating | Polling for 404 prevents this; 60s cap prevents infinite wait |
Unknowns / Needs Verification
-
RBAC out-of-band: The PR notes that
list/get/deleteRBAC on workloads must be added via a separate home-ops PR. This is a known deployment dependency, not a code blocker. -
Production schema copy testing: Not applicable — no database migrations present.
Summary
Failedbridge Workloads: delete + recreate atattempt+1(up toRETRY_MAX_ATTEMPTS, default 3) so the re-run picks up the current config (gateProfile, agent refs). At the cap, leave the Workload as aFailedtombstone (issue stays claimed → groomer won't re-serve → no loop) for human triage.foreman.llmkube.dev/{attempt,issue-id,agent-name}annotations;ClaimedItemgainsissue_id.retry.pyis pure + dependency-injected (list/create/delete closures) → unit-tested without a cluster.VERSION0.2.0 → 0.3.0.Why
Workloads are terminal on
Failedwith no Foreman-side retry, so a failed coder run (transient infra, or — like now — a fix that shipped after the run) stayed dead with no way to re-run except manualkubectl. This makes retry a standing capability.Deploy note
Needs
list/get/deleteadded to the bridge'sworkloadsRBAC (currently create-only) — separate home-ops PR. Bump the home-ops HelmRelease to 0.3.0 once the image builds.Tests
pytest -q→ 30 passed (8 new):attempt_of,item_from_workload, retry-below-max (delete+recreate at attempt+1 with current gateProfile), give-up-at-max (no-op tombstone), annotation stamping.