feat(loops): configurable failure policy — fail-fast vs continue-on-error (#1167) by dolho · Pull Request #1339 · Abilityai/trinity

dolho · 2026-06-24T12:14:39Z

Summary

Sequential agent loops (#740) were fail-fast: the first failed iteration aborted the whole loop. This adds a per-loop failure policy so a batch-style loop can tolerate a bad iteration and keep going — bounded by a consecutive-failure circuit so a fully-broken agent still terminates. theme-reliability.

Config (default preserves current behavior)

on_failure: abort (default — fail-fast, backward compatible) or continue.
max_consecutive_failures: default 3, range 1–100. In continue mode, the loop aborts as failed (stop_reason=max_consecutive_failures) once that many iterations fail in a row; a success resets the streak.
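The two settings above can be illustrated with a small validation sketch. This is not the PR's request model: the `FailurePolicy` class is hypothetical and uses only the standard library, with the defaults and bounds taken from the description (`abort` by default, `max_consecutive_failures` of 3 within 1–100).

```python
from dataclasses import dataclass
from typing import Literal, get_args

# Allowed policies, as listed in the PR description.
OnFailure = Literal["abort", "continue"]


@dataclass(frozen=True)
class FailurePolicy:
    """Hypothetical container for the two per-loop settings."""

    on_failure: OnFailure = "abort"      # default keeps fail-fast behavior
    max_consecutive_failures: int = 3    # default from the PR description

    def __post_init__(self) -> None:
        # Literal is not enforced at runtime, so check membership explicitly.
        if self.on_failure not in get_args(OnFailure):
            raise ValueError(f"on_failure must be one of {get_args(OnFailure)}")
        if not 1 <= self.max_consecutive_failures <= 100:
            raise ValueError("max_consecutive_failures must be in the range 1-100")
```

Omitting both fields yields the backward-compatible policy, so existing callers would not need to change.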

Behavior

Both failure surfaces gated: a raised exception from execute_task AND a non-success TaskExecutionResult (TIMEOUT/AGENT_ERROR/CIRCUIT_OPEN/AUTH). Each failed iteration finalizes its agent_loop_runs row as failed, then continue mode proceeds to the next run.
Terminal status: reaching max_runs (or matching the stop-signal) with ≥1 tolerated failure → new completed_with_errors status; failed_runs count surfaced on the loop row + API + UI.
{{previous_response}} carries the last successful response — a failed iteration never overwrites it (documented semantics).
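The behavior described above can be sketched as a simplified, synchronous runner. This is an illustration under stated assumptions, not the code in loop_service.py: `Result` stands in for TaskExecutionResult, `run_loop` and `LoopOutcome` are hypothetical names, the `iteration_failed` and `max_runs` stop reasons are placeholders, and stop-signal matching, run-row persistence, and the inter-run delay are left out.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    """Stand-in for TaskExecutionResult: ok=False models TIMEOUT, AGENT_ERROR, etc."""
    ok: bool
    response: str = ""


@dataclass
class LoopOutcome:
    status: str
    stop_reason: Optional[str]
    runs_completed: int
    failed_runs: int
    previous_response: Optional[str]


def run_loop(
    execute_task: Callable[[int, Optional[str]], Result],
    max_runs: int,
    on_failure: str = "abort",
    max_consecutive_failures: int = 3,
) -> LoopOutcome:
    runs_completed = failed_runs = streak = 0
    previous_response: Optional[str] = None

    def outcome(status: str, reason: str) -> LoopOutcome:
        return LoopOutcome(status, reason, runs_completed, failed_runs, previous_response)

    for i in range(max_runs):
        # Both failure surfaces are gated: a raised exception and a non-success result.
        try:
            result = execute_task(i, previous_response)
            failed = not result.ok
        except Exception:
            failed = True
        runs_completed += 1  # attempted iterations, including tolerated failures

        if not failed:
            streak = 0                            # a success resets the streak
            previous_response = result.response   # only successes update it
            continue

        failed_runs += 1
        streak += 1
        if on_failure == "abort":
            return outcome("failed", "iteration_failed")  # placeholder stop reason
        if streak >= max_consecutive_failures:
            return outcome("failed", "max_consecutive_failures")

    status = "completed_with_errors" if failed_runs else "completed"
    return outcome(status, "max_runs")  # placeholder stop reason
```

In this sketch a failed iteration never touches `previous_response`, which matches the documented semantics that the next prompt sees the last successful response.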

Surfaces (Invariant #13)

schema/migration: agent_loops gains on_failure, max_consecutive_failures, failed_runs (schema.py + tables.py Core object + versioned migration).
api: POST /api/agents/{name}/loops accepts the two fields; LoopStatusResponse surfaces failed_runs/on_failure/max_consecutive_failures.
mcp: run_agent_loop gains the two params (tools/loops.ts + client.ts).
ui: LoopsPanel.vue failure-policy controls + failed_runs/completed_with_errors surfacing.
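For the schema/migration surface, a minimal sketch of an additive, idempotent SQLite migration might look like the following. The function name, column types, and the `failed_runs` default of 0 are assumptions for illustration; only the three column names and the `abort` / 3 defaults come from the description.

```python
import sqlite3

# Assumed DDL fragments. NOT NULL with a constant default lets SQLite
# backfill existing rows when the column is added.
NEW_COLUMNS = {
    "on_failure": "TEXT NOT NULL DEFAULT 'abort'",
    "max_consecutive_failures": "INTEGER NOT NULL DEFAULT 3",
    "failed_runs": "INTEGER NOT NULL DEFAULT 0",
}


def migrate_agent_loops_failure_policy(conn: sqlite3.Connection) -> None:
    """Add the failure-policy columns to agent_loops if they are missing."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute("PRAGMA table_info(agent_loops)")}
    for name, ddl in NEW_COLUMNS.items():
        if name not in existing:  # skip columns already present so reruns are safe
            conn.execute(f"ALTER TABLE agent_loops ADD COLUMN {name} {ddl}")
    conn.commit()
```

Because existing rows receive `on_failure = 'abort'`, loops created before the migration keep their fail-fast behavior.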

Acceptance criteria

on_failure: abort|continue, default abort (backward compatible)
Continue bounded by max_consecutive_failures (cutoff aborts as failed)
Failed iteration finalizes its run row as failed, then proceeds
Both failure surfaces (exception + non-success result)
Partial-success terminal status documented (completed_with_errors + failed_runs)
{{previous_response}} on a continued iteration defined (last successful)
Plumbed through schema/migration, API, MCP, UI
Tests: abort unchanged, continue past a failed run, consecutive cutoff

Testing

42 passed — new TestFailurePolicy (abort default, continue past failure, consecutive cutoff, raised-exception surface, streak-reset) + existing loop suite (backward-compat test_iteration_exception_aborts_loop still green) + schema-parity + migrations.

Related to #1167

🤖 Generated with Claude Code

…rror (#1167)

Sequential agent loops (#740) were fail-fast: the first failed iteration aborted the whole loop. Add a per-loop policy to tolerate failures and keep going, bounded so a fully-broken agent still terminates.

- config: `on_failure` ('abort' default = current fail-fast, backward compatible; 'continue' tolerates a failed iteration) + `max_consecutive_failures` (default 3, range 1–100). Both plumbed end to end (Invariant #13).
- runner (loop_service.py): both failure surfaces gated: raised exception AND non-success TaskExecutionResult. Continue mode finalizes the failed agent_loop_runs row, increments failed_runs/consecutive_failures, and proceeds; a success resets the streak. Reaching max_runs (or stop-signal) with tolerated failures finalizes as `completed_with_errors`; hitting the consecutive cap finalizes `failed`/`stop_reason=max_consecutive_failures`. {{previous_response}} keeps the last *successful* response (a failed iteration never overwrites it).
- schema/migration: agent_loops gains on_failure, max_consecutive_failures, failed_runs (schema.py + tables.py Core + versioned migration). New terminal status `completed_with_errors`.
- api: POST /loops accepts on_failure/max_consecutive_failures; LoopStatusResponse surfaces failed_runs/on_failure/max_consecutive_failures.
- mcp: run_agent_loop gains the two params (tools/loops.ts + client.ts).
- ui: LoopsPanel failure-policy controls + failed_runs/completed_with_errors surfacing.
- tests: abort unchanged, continue past a failed run (both surfaces), consecutive cutoff, streak-reset; schema-parity + migrations green. 42 passed.

Related to #1167

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dolho

Sequential review — #1167 loop failure policy

Reviewed the diff directly (correctness, backward-compat, plumbing). No blocking issues. One low-severity consistency nit and a few intentional-semantics notes.

Finding (low) — exception surface skips the inter-run delay in continue mode

In services/loop_service.py, the raised-exception failure path ends with continue, which jumps straight to the next iteration and skips the delay_seconds inter-run pause. The non-success-TaskExecutionResult path instead falls through to the delay block before the next run. So a loop whose iterations raise (vs. return a failure result) will retry back-to-back with no pacing.

Impact is bounded by max_consecutive_failures, so it can't spin forever, but a flapping agent could hammer the backend faster than intended when delay_seconds is set.
Suggest applying the delay before continue (or restructuring both surfaces to share one tail) for parity. Optional — not blocking.

Notes (intentional, documenting)

completed_with_errors + stop-signal: if the stop-signal matches on a successful iteration after earlier tolerated failures, the status is promoted from completed to completed_with_errors. Defensible (the run did have failures); flagging the semantic.
runs_completed counts attempted iterations including tolerated failures, paired with the new failed_runs chip in the UI — consistent with the prior non-success path.
max_consecutive_failures > max_runs is allowed (cutoff then never fires); harmless — the loop simply ends at max_runs as completed_with_errors.

Verified good

Both failure surfaces gated; backward-compat preserved (default abort; existing test_iteration_exception_aborts_loop still green).
{{previous_response}} keeps the last successful response on a tolerated failure (no poisoning of the next prompt).
Schema dual-definition correct (schema.py DDL and tables.py Core object) — schema-parity + migrations suites green.
on_failure validated via Literal; max_consecutive_failures bounded 1–100.
Minimal scope: stores/loops.js correctly left untouched (passes payload through); abort-mode requests don't even send the new fields.
Cooperative stop still honored each iteration in continue mode (stop check is at loop top).

Verdict: ship-ready; the delay-on-exception nit is the only thing worth a follow-up commit if you want surface parity.

…1167 review)

Continue mode skipped delay_seconds when an iteration *raised* (the exception path `continue`d past the delay block), while a non-success result honored it. Extract the inter-run delay (with its cooperative-stop check) into a helper and call it on both surfaces so continue-mode pacing is consistent. Add a test asserting the delay fires after a raised iteration.

dolho · 2026-06-24T12:30:52Z

✅ Addressed the review finding (delay-on-exception parity) in ed9dce2.

The inter-run delay (with its cooperative-stop check) is now extracted into a single _inter_run_delay() helper and applied on both failure surfaces, so a continue-mode loop whose iterations raise paces with delay_seconds exactly like the non-success-result path. Added a test asserting the delay fires after a raised iteration. 22 passed.
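The shape of that fix can be sketched as follows. This is a synchronous illustration, not the helper from ed9dce2: the signature, the injected `sleep`, the `tick` slicing, and the `run_with_pacing` driver are all assumptions, and the real service may be asynchronous.

```python
import time
from typing import Callable, Iterable


def inter_run_delay(
    delay_seconds: float,
    stop_requested: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
    tick: float = 1.0,
) -> bool:
    """Pause between runs in slices; return True if a stop was requested."""
    remaining = delay_seconds
    while remaining > 0:
        if stop_requested():  # cooperative stop, checked before each slice
            return True
        step = min(tick, remaining)
        sleep(step)
        remaining -= step
    return stop_requested()


def run_with_pacing(
    tasks: Iterable[Callable[[], bool]],
    delay_seconds: float,
    stop_requested: Callable[[], bool],
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run tasks in continue mode; every outcome reaches the same delay tail."""
    attempted = 0
    for task in tasks:
        try:
            task()   # a False return models the non-success-result surface
        except Exception:
            pass     # raised-exception surface: no early `continue` past the delay
        attempted += 1
        # Shared tail: success, non-success result, and exception all pace here.
        if inter_run_delay(delay_seconds, stop_requested, sleep):
            break
    return attempted
```

Routing all three outcomes through one tail removes the branch where an early `continue` could bypass the pause.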

The three semantic notes were intentional and left as-is.

vybe

Validated via /validate-pr — blocking on Invariant #3 (dual-track migrations).

This adds 3 columns to agent_loops (on_failure, max_consecutive_failures, failed_runs) with a SQLite migration (_migrate_agent_loops_failure_policy in db/migrations.py) and updates db/schema.py + db/tables.py — but there's no Alembic revision under src/backend/migrations/versions/. On a PostgreSQL backend, init_database() runs alembic_runner.upgrade_to_head(), which won't add these columns, so loop_service crashes reading them. schema-parity is green because it only validates the SQLite track, not Alembic.

Fix: add a new Alembic revision adding the 3 agent_loops columns. The failure-policy logic, MCP surface, tests, and docs all look complete otherwise — this is the only blocker.

— posted via /validate-pr

dolho requested a review from vybe June 24, 2026 13:40

vybe requested changes Jun 24, 2026

vybe mentioned this pull request Jun 24, 2026

feat(sharing): per-agent custom instructions for public & channel chats (#1205) #1337

dolho wants to merge 2 commits into dev from feature/1167-loop-failure-policy
