feat(loops): configurable failure policy — fail-fast vs continue-on-error (#1167)#1339
feat(loops): configurable failure policy — fail-fast vs continue-on-error (#1167)#1339dolho wants to merge 2 commits into
Conversation
…rror (#1167) Sequential agent loops (#740) were fail-fast: the first failed iteration aborted the whole loop. Add a per-loop policy to tolerate failures and keep going, bounded so a fully-broken agent still terminates. - config: `on_failure` ('abort' default = current fail-fast, backward compatible; 'continue' tolerates a failed iteration) + `max_consecutive_failures` (default 3, range 1–100). Both plumbed end to end (Invariant #13). - runner (loop_service.py): both failure surfaces gated — raised exception AND non-success TaskExecutionResult. Continue mode finalizes the failed agent_loop_runs row, increments failed_runs/consecutive_failures, and proceeds; a success resets the streak. Reaching max_runs (or stop-signal) with tolerated failures finalizes as `completed_with_errors`; hitting the consecutive cap finalizes `failed`/`stop_reason=max_consecutive_failures`. {{previous_response}} keeps the last *successful* response (a failed iteration never overwrites it). - schema/migration: agent_loops gains on_failure, max_consecutive_failures, failed_runs (schema.py + tables.py Core + versioned migration). New terminal status `completed_with_errors`. - api: POST /loops accepts on_failure/max_consecutive_failures; LoopStatusResponse surfaces failed_runs/on_failure/max_consecutive_failures. - mcp: run_agent_loop gains the two params (tools/loops.ts + client.ts). - ui: LoopsPanel failure-policy controls + failed_runs/completed_with_errors surfacing. - tests: abort unchanged, continue past a failed run (both surfaces), consecutive cutoff, streak-reset; schema-parity + migrations green. 42 passed. Related to #1167 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
dolho
left a comment
There was a problem hiding this comment.
Sequential review — #1167 loop failure policy
Reviewed the diff directly (correctness, backward-compat, plumbing). No blocking issues. One low-severity consistency nit and a few intentional-semantics notes.
Finding (low) — exception surface skips the inter-run delay in continue mode
In services/loop_service.py, the raised-exception failure path ends with continue, which jumps straight to the next iteration and skips the delay_seconds inter-run pause. The non-success-TaskExecutionResult path instead falls through to the delay block before the next run. So a loop whose iterations raise (vs. return a failure result) will retry back-to-back with no pacing.
- Impact is bounded by
max_consecutive_failures, so it can't spin forever, but a flapping agent could hammple the backend faster than intended whendelay_secondsis set. - Suggest applying the delay before
continue(or restructuring both surfaces to share one tail) for parity. Optional — not blocking.
Notes (intentional, documenting)
completed_with_errors+ stop-signal: if the stop-signal matches on a success after earlier tolerated failures, status stayscompleted→ promoted tocompleted_with_errors. Defensible (the run did have failures); flagging the semantic.runs_completedcounts attempted iterations including tolerated failures, paired with the newfailed_runschip in the UI — consistent with the prior non-success path.max_consecutive_failures > max_runsis allowed (cutoff then never fires); harmless — the loop simply ends atmax_runsascompleted_with_errors.
Verified good
- Both failure surfaces gated; backward-compat preserved (default
abort; existingtest_iteration_exception_aborts_loopstill green). {{previous_response}}keeps the last successful response on a tolerated failure (no poisoning of the next prompt).- Schema dual-definition correct (
schema.pyDDL andtables.pyCore object) — schema-parity + migrations suites green. on_failurevalidated viaLiteral;max_consecutive_failuresbounded 1–100.- Minimal scope:
stores/loops.jscorrectly left untouched (passes payload through); abort-mode requests don't even send the new fields. - Cooperative stop still honored each iteration in continue mode (stop check is at loop top).
Verdict: ship-ready; the delay-on-exception nit is the only thing worth a follow-up commit if you want surface parity.
…1167 review) Continue mode skipped delay_seconds when an iteration *raised* (the exception path `continue`d past the delay block), while a non-success result honored it. Extract the inter-run delay (with its cooperative-stop check) into a helper and call it on both surfaces so continue-mode pacing is consistent. Add a test asserting the delay fires after a raised iteration.
|
✅ Addressed the review finding (delay-on-exception parity) in ed9dce2. The inter-run delay (with its cooperative-stop check) is now extracted into a single The three semantic notes were intentional and left as-is. |
vybe
left a comment
There was a problem hiding this comment.
Validated via /validate-pr — blocking on Invariant #3 (dual-track migrations).
This adds 3 columns to agent_loops (on_failure, max_consecutive_failures, failed_runs) with a SQLite migration (_migrate_agent_loops_failure_policy in db/migrations.py) and updates db/schema.py + db/tables.py — but there's no Alembic revision under src/backend/migrations/versions/. On a PostgreSQL backend, init_database() runs alembic_runner.upgrade_to_head(), which won't add these columns, so loop_service crashes reading them. schema-parity is green because it only validates the SQLite track, not Alembic.
Fix: add a new Alembic revision adding the 3 agent_loops columns. The failure-policy logic, MCP surface, tests, and docs all look complete otherwise — this is the only blocker.
— posted via /validate-pr
Summary
Sequential agent loops (#740) were fail-fast: the first failed iteration aborted the whole loop. This adds a per-loop failure policy so a batch-style loop can tolerate a bad iteration and keep going — bounded by a consecutive-failure circuit so a fully-broken agent still terminates.
theme-reliability.Config (default preserves current behavior)
on_failure:abort(default — fail-fast, backward compatible) orcontinue.max_consecutive_failures: default 3, range 1–100. In continue mode, the loop aborts asfailed(stop_reason=max_consecutive_failures) once that many iterations fail in a row; a success resets the streak.Behavior
execute_taskAND a non-successTaskExecutionResult(TIMEOUT/AGENT_ERROR/CIRCUIT_OPEN/AUTH). Each failed iteration finalizes itsagent_loop_runsrow asfailed, then continue mode proceeds to the next run.max_runs(or matching the stop-signal) with ≥1 tolerated failure → newcompleted_with_errorsstatus;failed_runscount surfaced on the loop row + API + UI.{{previous_response}}carries the last successful response — a failed iteration never overwrites it (documented semantics).Surfaces (Invariant #13)
agent_loopsgainson_failure,max_consecutive_failures,failed_runs(schema.py+tables.pyCore object + versioned migration).POST /api/agents/{name}/loopsaccepts the two fields;LoopStatusResponsesurfacesfailed_runs/on_failure/max_consecutive_failures.run_agent_loopgains the two params (tools/loops.ts+client.ts).LoopsPanel.vuefailure-policy controls +failed_runs/completed_with_errorssurfacing.Acceptance criteria
on_failure: abort|continue, defaultabort(backward compatible)max_consecutive_failures(cutoff aborts asfailed)failed, then proceedscompleted_with_errors+failed_runs){{previous_response}}on a continued iteration defined (last successful)Testing
42 passed— newTestFailurePolicy(abort default, continue past failure, consecutive cutoff, raised-exception surface, streak-reset) + existing loop suite (backward-compattest_iteration_exception_aborts_loopstill green) + schema-parity + migrations.Related to #1167
🤖 Generated with Claude Code