
feat(foreman): coder-tier escalation, re-dispatch a failed coder to a larger model #964

Merged: Defilan merged 9 commits into defilantech:main from Defilan:feat/coder-escalation-tier on Jul 3, 2026

@Defilan Defilan commented Jul 3, 2026

Member

What

Adds an opt-in coder escalation tier to Foreman's Workload controller. When a base coder task fails at its model's ceiling, the WorkloadReconciler re-dispatches that issue once to a larger coder model, carrying the failed model's own diagnosis as a prompt hint. Off by default; enabled per-Workload via a new escalationCoderAgentRef. This is the coder-side twin of the existing escalation-reviewer tier (#546).

Why

Fixes #963

A single coder model is a fixed capability ceiling. In a four-issue batch on one MoE coder, the model cleanly resolved the tractable bug but failed three others at its ceiling in distinct ways. The clearest case: the model produced a correct diagnosis it could not itself execute and honestly returned NO-GO. That is exactly the case a larger, denser model should take over.

Density is not a blanket upgrade, though. It helps the reasoning-limited failures but hurts convergence-heavy refactors, where a slower dense model exhausts its turn budget sooner. So this is a cascade, not a swap: keep the cheap, fast model for the tractable majority and pay for the larger model only on the specific failure modes where its extra reasoning is worth the cost.

How

  • New WorkloadSpec.EscalationCoderAgentRef (singular, unlike the plural escalationReviewerAgentRefs): coders are sequential, so N parallel escalation coders would produce N competing branches. One tier. Unset means the feature is off, fully backward compatible.
  • Trigger is capability failures only. Escalate iff the terminal base coder is verdict NO-GO with top-level outcome MODEL-DECIDED, or nested modelExtra.outcome == CODER-GATE-FAILED. Do not escalate on a model-decided INCOMPLETE (the model gave up / ran out of turns), STUCK-LOOP-DETECTED, a trivial no-changes NO-GO, or ERROR: those are scope/harness failures a larger, slower model will not fix and may worsen by timing out.
    • The discriminator is the nested modelExtra.outcome, not the top-level extra.outcome, which is MODEL-DECIDED for the NO-GO, gate-fail, and gave-up cases alike. A naive top-level read would wrongly escalate the gave-up case. The unit truth table pins this against three real batch outcomes: a NO-GO / MODEL-DECIDED (escalate), an INCOMPLETE with nested CODER-GATE-FAILED (escalate), and an INCOMPLETE with no gate outcome (do not escalate).
  • Context is fresh branch plus diagnosis. On a triggering failure the reconciler emits a code-<N>-esc + verify-<N>-esc pair at the escalation Agent on a fresh branch foreman/<w>/issue-<N>-esc, carrying the prior model's summary. A fresh branch (not a restore of the failed attempt) keeps the high-signal insight without anchoring the strong model on the weak model's broken code, and makes the feature independent of the branch-restore machinery that is still in flight elsewhere.
  • The hint reaches the coder via a new AgenticTaskPayload.PromptPrefix, rendered before the fetched issue body, so the escalated coder sees both the diagnosis and the issue's acceptance criteria (leaving Prompt empty so the issue-body fetch still runs).
  • Second-pass emission hook emitCoderEscalations, a structural twin of emitEscalations, wired into Reconcile before reviewer escalation (a coder that did not GO has no branch to review). One tier deep (-esc tasks are never re-scanned as base tasks), idempotent (permanent step names, skip when the -esc child already exists), MaxTasks- and sovereignty-gated.
  • The controller does not manage serving. The escalation Agent must point at an already-reachable model, exactly as escalation reviewers do. The recommended deployment is dual-box with both models hot so escalation is a routing hop, not a cold model load. Documented in the Foreman runbook.
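As a rough illustration of the trigger rule in the list above, the sketch below encodes the stated condition (verdict NO-GO with top-level MODEL-DECIDED, or nested CODER-GATE-FAILED) and contrasts it with a naive top-level read. The `coderResult` type, its field names, and both function names are hypothetical, chosen for this example only; the real envelope shape and the outcome fields of an ERROR result are not shown in this PR description.

```go
package main

import "fmt"

// coderResult is a hypothetical, flattened view of a terminal base coder task.
// Field names are illustrative, not the Foreman API.
type coderResult struct {
	Verdict      string // typed verdict, e.g. "NO-GO", "INCOMPLETE", "ERROR"
	TopOutcome   string // top-level extra.outcome
	ModelOutcome string // nested modelExtra.outcome; empty when absent
}

// shouldEscalate applies the rule as stated in the PR description:
// escalate on a model-decided NO-GO, or on a nested CODER-GATE-FAILED.
func shouldEscalate(r coderResult) bool {
	modelNoGo := r.Verdict == "NO-GO" && r.TopOutcome == "MODEL-DECIDED"
	gateFailed := r.ModelOutcome == "CODER-GATE-FAILED"
	return modelNoGo || gateFailed
}

// naiveTopLevel shows the trap: the top-level outcome alone cannot
// distinguish a NO-GO from a gave-up or stuck-loop run.
func naiveTopLevel(r coderResult) bool {
	return r.TopOutcome == "MODEL-DECIDED"
}

func main() {
	cases := []coderResult{
		{"NO-GO", "MODEL-DECIDED", ""},                         // honest NO-GO
		{"INCOMPLETE", "MODEL-DECIDED", "CODER-GATE-FAILED"},   // gate failure
		{"INCOMPLETE", "MODEL-DECIDED", ""},                    // gave up / out of turns
		{"INCOMPLETE", "MODEL-DECIDED", "STUCK-LOOP-DETECTED"}, // stuck loop
		{"NO-GO", "NO-CHANGES", ""},                            // trivial no-changes NO-GO
		{"ERROR", "", ""},                                      // outcome fields left empty for illustration
	}
	for _, c := range cases {
		fmt.Printf("%-10s %-13s %-19s escalate=%-5t naive=%t\n",
			c.Verdict, c.TopOutcome, c.ModelOutcome, shouldEscalate(c), naiveTopLevel(c))
	}
}
```

The third and fourth rows are the ones a top-level-only check would get wrong: the naive read says escalate, while the combined verdict-plus-nested-outcome check does not.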

AI-assisted contribution (band 3). Implemented with AI assistance under human review; human-accountable. Commits are kept clean per the project's no-attribution-in-commits convention, with the disclosure here in the PR body per CONTRIBUTING.md.

Checklist

  • Tests added/updated (unit trigger truth table + step synthesis; envtest for emission, non-escalation of gave-up/stuck, backward-compat unset, idempotency)
  • make test passes locally (new envtest green; controller coverage 80.2%)
  • make lint passes locally (0 issues; also GOOS=linux golangci-lint run ./... 0 issues)
  • Commit messages follow conventional commits
  • All commits are signed off (git commit -s) per DCO
  • AI assistance disclosed above, per CONTRIBUTING.md
  • Documentation updated (docs/site/foreman/README.md: coder escalation tier + dual-box serving)

codecov Bot commented Jul 3, 2026


Codecov Report

❌ Patch coverage is 69.89796% with 59 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...al/foreman/controller/workload_coder_escalation.go | 72.15% | 33 Missing and 16 partials ⚠️ |
| api/foreman/v1alpha1/zz_generated.deepcopy.go | 0.00% | 8 Missing ⚠️ |
| internal/foreman/controller/workload_controller.go | 33.33% | 1 Missing and 1 partial ⚠️ |


@joryirving joryirving left a comment

Collaborator

Reviewing as the operator running the outer-layer equivalent of this in production (our dispatch-bridge re-lanes attempt-exhausted issues to a frontier coder), so I checked the trigger design against our live fleet's task envelopes rather than just the diff.

Field validation of the truth table — it holds on real CRs:

  • wl-misospace-dispatch-525-code-525: verdict INCOMPLETE, top-level extra.outcome: MODEL-DECIDED, nested modelExtra.outcome: STUCK-LOOP-DETECTED. This is a live instance of exactly the trap the PR calls out — a naive top-level read escalates it; your discriminator correctly refuses.
  • wl-misospace-dispatch-528-code-528: NO-GO with top NO-CHANGES — correctly excluded.
  • wl-misospace-kubetix-152-code-152 and ...-action-365-code-365: NO-GO / MODEL-DECIDED — correctly escalate. The coderResultEnvelope shape matches production status.result verbatim.

Also confirmed no interaction hazard with #959: its iteration fires on a review NO-GO (coder GO'd), yours on a coder-terminal NO-GO (reviews never ran) — mutually exclusive by construction, and the code-<n>-r<k> step names can't collide with your exact-match code-<n> lookup.

Blocking question — who reviews the -esc branch? The tier emits code-<N>-esc + verify-<N>-esc only. On a Workload with reviewers (our fleet's default shape), the escalated coder can GO and gate-pass, but no review-<N>-esc exists — so no reviewer verdict, and post-#956 no openPullRequest carrier either. The escalated fix ends as a pushed, unreviewed branch with no PR, on the path where the operator has explicitly declared they want reviews. Related: what does rollup make of that issue — the base review-<N> tasks presumably cascade-failed when code-<N> failed, so does the Workload still roll up Failed even after a successful escalation? If skipping reviews on the esc tier is intentional scoping, a doc note + follow-up issue works for me and I'll approve; if not, mirroring the base review steps (with the #937 stamp) onto the esc branch looks mechanical.

Non-blocking, from the same field data: 2 of our 3 real NO-GO/MODEL-DECIDED cases were "this issue is already resolved by " honest bails. Under this design both burn an escalation run for the larger model to re-confirm staleness. Not this PR's problem — but it suggests a future dedicated outcome (e.g. ALREADY-RESOLVED) so the honest-bail class can split into "couldn't do it" (escalate) vs "nothing to do" (close/flag). Happy to file it upstream with the CR evidence if useful.

We'll also adopt this taxonomy in our bridge's outer tier — attempt-counting escalated our PUSH-FAILED/harness failures to the big model, which your trigger correctly never would. The two layers compose nicely: in-Workload capability escalation first, lane-level re-dispatch as the outer loop.

Defilan added 9 commits July 3, 2026 12:58
Groundwork for the coder escalation tier: an opt-in larger-model coder
that re-attempts an issue when the base coder fails, and a prompt-prefix
field to carry the prior model's diagnosis alongside the issue body.

Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
Escalate only capability failures (model NO-GO or CODER-GATE-FAILED),
never a model-decided INCOMPLETE / stuck-loop / ERROR. Reads the typed
verdict plus the nested modelExtra.outcome, since the top-level
extra.outcome is MODEL-DECIDED for all model-terminated runs alike.

Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
… failure

Second-pass hook mirroring emitEscalations, wired before reviewer
escalation. Emits code-<N>-esc/verify-<N>-esc at EscalationCoderAgentRef
with the prior model's diagnosis as a prompt prefix; bounded to one
tier, idempotent, MaxTasks- and sovereignty-gated.

Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
The topOutcome param is kept configurable to mirror the executor
envelope shape even though current cases all pass MODEL-DECIDED,
matching the existing pressure_test.go convention. Clears the
make lint gate.

Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
A base coder capability-failure re-dispatched to the escalation tier
emitted only code-<N>-esc and verify-<N>-esc, so the escalated coder
could GO and gate-pass but nothing produced a reviewer verdict or the
openPullRequest carrier: the fix ended as a pushed, unreviewed branch
with no PR.

coderEscalationSteps now also appends one review-<N>-esc-<i> per
ReviewerAgentRef, dependent on verify-<N>-esc, carrying the esc branch
and mirroring the base review step's openPullRequest computation. A
Workload with no reviewers is unchanged (code+verify only).

Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
…ists

When a base code-<N> capability-failure was escalated, the base
verify-<N>/review-<N>-<i> that cascade-failed stayed in the rollup
slice, so a SUCCESSFUL escalation still rolled the Workload up Failed:
the issue was judged by the dead base attempt, not the esc attempt.

activeChildren now drops the base-attempt synthesized steps for an
issue once a code-<N>-esc task exists, alongside the existing
fix-iteration supersession. The esc steps themselves never parse as
base issue steps, so the escalation attempt is never dropped by its
own rule; unescalated issues are untouched. emitCoderEscalations
labels its in-flight placeholders so the supersession applies in the
same reconcile pass, matching emitReviewIterations.

The envtest drives a base NO-GO plus its cascade-failed verify/review
through a full escalation to on-target success and asserts the
Workload reaches Completed rather than Failed.

Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
…on scan

The coder-escalation tier emits review-<N>-esc-<i> steps, which carry the
review-<N>- prefix that the reviewer-escalation tier (defilantech#546) scans as base
reviews. When both tiers are enabled on a Workload, a NO-GO on an escalated
review would fan escalate-<N>-<j> steps against the base branch instead of
the escalated one. Exclude review-<N>-esc-<i> from the base-review scan so
the two tiers compose.

Refs defilantech#963

Signed-off-by: Christopher Maher <chris@mahercode.io>
@Defilan Defilan force-pushed the feat/coder-escalation-tier branch from b8bc040 to b5a9c82 Compare July 3, 2026 20:24
Defilan commented Jul 3, 2026

Member Author

Thanks for the field-validation against live CRs, that is exactly the confidence I wanted on the trigger, and the #959 mutual-exclusivity check is reassuring.

Addressed your blocking point (full mirror, not scoped out) plus one interaction I found along the way. Rebased onto current main (your #959/#967/#956 all landed).

Reviewers on the escalated branch. coderEscalationSteps now emits review-<N>-esc-<i> per ReviewerAgentRef, DependsOn: verify-<N>-esc, on the esc branch, with the openPullRequest stamp computed the same way the base does. No reviewers configured means code+verify only, unchanged.
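A minimal sketch of the step shape described above, for readers following along. The `step` type and `escSteps` function are hypothetical stand-ins, not the actual `coderEscalationSteps`. The sketch assumes that `verify-<N>-esc` depends on `code-<N>-esc` and that review indices start at 0, and it omits the `openPullRequest` stamp entirely.

```go
package main

import "fmt"

// step is a hypothetical, simplified stand-in for a synthesized Workload step.
type step struct {
	Name      string
	DependsOn []string
	Branch    string
}

// escSteps builds the escalation steps for one issue: code + verify,
// plus one review per configured reviewer, all on the fresh -esc branch.
func escSteps(workload string, issue int, reviewers []string) []step {
	branch := fmt.Sprintf("foreman/%s/issue-%d-esc", workload, issue)
	code := fmt.Sprintf("code-%d-esc", issue)
	verify := fmt.Sprintf("verify-%d-esc", issue)

	steps := []step{
		{Name: code, Branch: branch},
		// Assumption: verify waits on the escalated coder.
		{Name: verify, DependsOn: []string{code}, Branch: branch},
	}
	// With no reviewers this loop is a no-op, leaving code+verify only.
	for i := range reviewers {
		steps = append(steps, step{
			Name:      fmt.Sprintf("review-%d-esc-%d", issue, i), // zero-based index is an assumption
			DependsOn: []string{verify},
			Branch:    branch,
		})
	}
	return steps
}

func main() {
	for _, s := range escSteps("demo", 42, []string{"reviewer-a", "reviewer-b"}) {
		fmt.Printf("%-18s dependsOn=%v branch=%s\n", s.Name, s.DependsOn, s.Branch)
	}
}
```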

Rollup. You were right to ask. I verified the semantics rather than assume: a base code-<N> NO-GO ends Succeeded/NoGo (counts incomplete), and the cascade-failed verify-<N>/review-<N>-<i> count failed, so without intervention a successful escalation still rolls up Failed. Extended activeChildren so that once code-<N>-esc exists, the base attempt for N is superseded and dropped from the rollup slice. The envtest proves it: with the fix the escalated issue reaches Completed, and reverting the supersession forces Failed (leftover base incomplete+failed).

Cross-tier guard (found while wiring the above). escalationSteps scans review-<N>- by prefix, which also matches my new review-<N>-esc-<i>. With both EscalationReviewerAgentRefs and EscalationCoderAgentRef set (your fleet's shape), a NO-GO on an escalated review would have fanned escalate-<N>-<j> against the base branch. Added an exclusion so the esc reviews are skipped by the base scan; unit-tested.
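The prefix overlap is easy to reproduce in isolation. The sketch below is illustrative only: `classifyReview` is a hypothetical helper, not the real `escalationSteps` scan. It shows why the more specific `review-<N>-esc-` prefix has to be tested before the general `review-<N>-` prefix.

```go
package main

import (
	"fmt"
	"strings"
)

// classifyReview sorts a step name into "esc", "base", or "other" for one issue.
// The esc case must come first: every esc review name also matches the base prefix.
func classifyReview(name string, issue int) string {
	basePrefix := fmt.Sprintf("review-%d-", issue)
	escPrefix := basePrefix + "esc-"
	switch {
	case strings.HasPrefix(name, escPrefix):
		return "esc" // skipped by the base reviewer-escalation scan
	case strings.HasPrefix(name, basePrefix):
		return "base" // still counted as a base review
	default:
		return "other"
	}
}

func main() {
	for _, name := range []string{"review-7-0", "review-7-esc-0", "verify-7-esc", "review-70-0"} {
		fmt.Printf("%-16s -> %s\n", name, classifyReview(name, 7))
	}
}
```

Swapping the two cases would classify `review-7-esc-0` as a base review, which is the bug the exclusion guards against.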

All green: unit + envtest, make test, lint both arches, no CRD drift.

On the non-blocking ALREADY-RESOLVED idea: agreed, the "nothing to do" honest-bail is a distinct class from "couldn't do it," and splitting it would stop escalation burning a run to re-confirm staleness. Please do file it upstream with the CR evidence, that is a clean follow-up and your live data makes the case. And yes, the two-layer composition (in-Workload capability escalation, then your lane-level re-dispatch) is exactly the shape I was hoping for.

Ready for another look.

@Defilan Defilan requested a review from joryirving July 3, 2026 20:24

@joryirving joryirving left a comment

Collaborator

Approve. Traced the three changes against the parsers rather than the descriptions; all correct.

Reviewer fan-out. review-<N>-esc-<i> per ref, DependsOn: verify-<N>-esc, openPullRequest computed identically to the base — resolves the unreviewed-branch blocker. Code+verify-only when no reviewers, unchanged.

Supersession. Verified the matcher does exactly what's needed and nothing more: issueStepIteration matches every base shape (code-N/verify-N via the step==base arm, review-N-<i> via the digit-checked review parser) so escalated[n] drops the base coder's NO-GO and its cascade-failed verify/review — but rejects all three -esc shapes (verify-N-esc misses the -r prefix; review-N-esc-<i> fails reviewIterationOf's digit loop on the e of esc), so the escalation attempt's own verify/review stay in the rollup. The same-pass labeled placeholders keep the Workload in-flight while esc runs, so no premature Completed. Confirmed the placeholder labelStep (TrimPrefix(name, w.Name+"-")) equals the real task's step.Name under absoluteTaskName.
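For readers who want the supersession behaviour in concrete form, here is a simplified sketch. It is not the real `issueStepIteration` or `activeChildren`: the helper names are hypothetical, fix-iteration (`-r<k>`) shapes are ignored, and steps are reduced to bare names. It only demonstrates the property traced above: base shapes are dropped once `code-<N>-esc` exists, while the `-esc` shapes never match the base matcher.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// allDigits reports whether s is a non-empty run of ASCII digits.
func allDigits(s string) bool {
	if s == "" {
		return false
	}
	for _, c := range s {
		if c < '0' || c > '9' {
			return false
		}
	}
	return true
}

// isBaseStep matches code-N, verify-N, and review-N-<digits> only.
// review-N-esc-<i> fails the digit check on the 'e' of "esc".
func isBaseStep(name string, issue int) bool {
	n := strconv.Itoa(issue)
	if name == "code-"+n || name == "verify-"+n {
		return true
	}
	rest, ok := strings.CutPrefix(name, "review-"+n+"-")
	return ok && allDigits(rest)
}

// dropSuperseded removes the base attempt for an issue once code-N-esc exists.
func dropSuperseded(steps []string, issue int) []string {
	escName := "code-" + strconv.Itoa(issue) + "-esc"
	escalated := false
	for _, s := range steps {
		if s == escName {
			escalated = true
			break
		}
	}
	if !escalated {
		return steps // unescalated issues are untouched
	}
	kept := make([]string, 0, len(steps))
	for _, s := range steps {
		if !isBaseStep(s, issue) {
			kept = append(kept, s)
		}
	}
	return kept
}

func main() {
	steps := []string{
		"code-3", "verify-3", "review-3-0",
		"code-3-esc", "verify-3-esc", "review-3-esc-0",
		"code-4", "verify-4",
	}
	fmt.Println(dropSuperseded(steps, 3))
}
```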

Cross-tier guard. Switch order puts review-N-esc- before review-N-, so esc reviews are skipped by the base reviewer-escalation scan while review-N-<i> still counts. Correct.

One non-blocking edge (inherited, not introduced here). coderEscalationSteps gates the whole issue on existingEsc[code-<N>-esc], so if a reconcile creates code-<N>-esc but a transient API error aborts before review-<N>-esc-<i> (renderAndCreate creates in order, returns on the first non-AlreadyExists error), the next pass sees code-<N>-esc exists → continue → the missing review-esc is never re-proposed, and the branch lands unreviewed again — the exact class this commit closes. Low probability, and it's the same issue-level dedup escalationSteps (#546) uses. Since renderAndCreate already skips AlreadyExists, keying the skip per-step (or emitting all esc steps every pass and letting the create dedup) would make it self-healing. Fine as a follow-up.

Filing the ALREADY-RESOLVED outcome issue upstream with the CR evidence as agreed.

@Defilan Defilan merged commit ce8f655 into defilantech:main Jul 3, 2026
24 checks passed
@github-actions github-actions Bot mentioned this pull request Jul 3, 2026
eleboucher pushed a commit to eleboucher/homelab that referenced this pull request Jul 4, 2026
…27 ➔ 0.8.28) (#1405)

This PR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [ghcr.io/defilantech/charts/llmkube](https://github.com/defilantech/LLMKube) | patch | `0.8.27` → `0.8.28` |

---

### Release Notes

<details>
<summary>defilantech/LLMKube (ghcr.io/defilantech/charts/llmkube)</summary>

### [`v0.8.28`](https://github.com/defilantech/LLMKube/blob/HEAD/CHANGELOG.md#0828-2026-07-04)

[Compare Source](defilantech/LLMKube@v0.8.27...v0.8.28)

##### Features

- **foreman:** bounded fix iteration on reviewer NO-GO instead of terminal failure ([#959](defilantech/LLMKube#959)) ([d820fff](defilantech/LLMKube@d820fff))
- **foreman:** coder-tier escalation, re-dispatch a failed coder to a larger model ([#964](defilantech/LLMKube#964)) ([ce8f655](defilantech/LLMKube@ce8f655))
- **foreman:** executor-owned revise-from-branch restore for revision tasks ([#967](defilantech/LLMKube#967)) ([b76051c](defilantech/LLMKube@b76051c))
- **foreman:** open the pull request on review GO ([#956](defilantech/LLMKube#956)) ([fd852e1](defilantech/LLMKube@fd852e1))
- **inference:** add spec.modelCache.claimName for user-owned cache PVCs ([#960](defilantech/LLMKube#960)) ([aab5a58](defilantech/LLMKube@aab5a58))

##### Bug Fixes

- **foreman:** accept workspace-internal absolute paths in resolveInside ([#957](defilantech/LLMKube#957)) ([34b126c](defilantech/LLMKube@34b126c))
- **foreman:** defer generic self-gate when runtime is missing from coder image ([#958](defilantech/LLMKube#958)) ([df185ec](defilantech/LLMKube@df185ec))
- **foreman:** reject no-op str\_replace where old\_string equals new\_string ([#969](defilantech/LLMKube#969)) ([c71f38b](defilantech/LLMKube@c71f38b))
- **foreman:** scope-overlap check catches Go files in new directories ([#962](defilantech/LLMKube#962)) ([486a944](defilantech/LLMKube@486a944))
- **inference:** warn when modelCache.claimName is silently ignored ([#966](defilantech/LLMKube#966)) ([d49cd22](defilantech/LLMKube@d49cd22))

##### Documentation

- register the Karpenter GPU autoscaling guide in nav.yaml ([#954](defilantech/LLMKube#954)) ([88b9c7d](defilantech/LLMKube@88b9c7d))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).

Reviewed-on: https://git.erwanleboucher.dev/eleboucher/homelab/pulls/1405

Development

Successfully merging this pull request may close these issues.

[FEATURE] Foreman: coder-tier escalation — re-dispatch a failed coder task to a larger model

2 participants