bug: createFixTask inherits steward as assignee → fix tasks stall permanently #105

@komoreka

Summary

When the merge steward creates a fix task (test failure or merge conflict), createFixTask inherits assignedAgent from the parent task's orchestrator metadata as the new fix task's assignee. This is correct in the common case where assignedAgent is the worker who completed the parent task. It is broken when the steward dispatch paths have overwritten assignedAgent with the steward's own ID — the fix task lands on the steward, and stewards have no mechanism to act on OPEN-status fix tasks, so it sits permanently unactioned.

Root cause chain

  1. Steward dispatch overwrites assignedAgent. Two paths in packages/smithy/src/services/dispatch-daemon.ts write the steward's ID into the parent task's orchestrator metadata:
    • Line 3359 (merge steward spawn, mergeStatus: 'testing')
    • Line 3579 (recovery steward spawn)
  2. createFixTask reads back the polluted value. packages/smithy/src/services/merge-steward-service.ts:931 does:
    assignee: orchestratorMeta?.assignedAgent,
    Plus copies it into the new fix task's own metadata at line 938.
  3. Stewards do not pick up OPEN tasks. Stewards act on REVIEW-status tasks with mergeStatus: pending|testing. An OPEN-status fix task assigned to a steward has no consumer.
  4. Result: the fix task is created, notified to the steward (lines 952-979), and never progresses. The director does not auto-dispatch a worker because the task already has an assignee.
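The chain above can be modeled in a few lines. This is a simplified sketch, not the real smithy code: every type and function name below is hypothetical, and the dispatch rules are reduced to the conditions described in steps 1 to 4.

```typescript
// Simplified, hypothetical model of the four steps; not the real smithy types.
type TaskStatus = 'OPEN' | 'REVIEW';

interface OrchestratorMeta {
  assignedAgent?: string;
  mergeStatus?: 'pending' | 'testing';
}

interface Task {
  id: string;
  status: TaskStatus;
  assignee?: string;
  orchestrator: OrchestratorMeta;
}

// Step 1: steward dispatch overwrites assignedAgent with the steward's ID.
function dispatchMergeStewardModel(task: Task, stewardId: string): void {
  task.orchestrator = {
    ...task.orchestrator,
    assignedAgent: stewardId,
    mergeStatus: 'testing',
  };
}

// Step 2: the fix task inherits whatever assignedAgent currently holds.
function createFixTaskModel(parent: Task): Task {
  const inherited = parent.orchestrator.assignedAgent;
  return {
    id: `${parent.id}-fix`,
    status: 'OPEN',
    assignee: inherited,
    orchestrator: { assignedAgent: inherited },
  };
}

// Step 3: stewards only consume REVIEW tasks with mergeStatus pending|testing.
function stewardWillAct(task: Task): boolean {
  const s = task.orchestrator.mergeStatus;
  return task.status === 'REVIEW' && (s === 'pending' || s === 'testing');
}

// Step 4: auto-dispatch skips any task that already has an assignee.
function daemonWillDispatch(task: Task): boolean {
  return task.status === 'OPEN' && task.assignee === undefined;
}

function simulateStall(): { fixTask: Task; stalled: boolean } {
  const parent: Task = {
    id: 'task-1',
    status: 'REVIEW',
    assignee: 'worker-1',
    orchestrator: { assignedAgent: 'worker-1', mergeStatus: 'pending' },
  };
  dispatchMergeStewardModel(parent, 'steward-1');
  const fixTask = createFixTaskModel(parent);
  // Neither consumer will touch the task, so it never progresses.
  const stalled = !stewardWillAct(fixTask) && !daemonWillDispatch(fixTask);
  return { fixTask, stalled };
}
```

In this model the fix task ends up OPEN and assigned to `steward-1`, which matches neither consumer's predicate.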

Reproduction

  1. Run the daemon with orphanRecoveryEnabled: true (or any path that triggers a merge steward or recovery steward dispatch on a task).
  2. Confirm the parent task's metadata.orchestrator.assignedAgent is now the steward's ID (e.g. via sf task show <id> --format json or by inspecting the SQLite cache).
  3. Force a fix-task creation: simulate a test failure or merge conflict in the steward's worktree, or call the daemon path that invokes MergeStewardService.createFixTask with that parent task's metadata.
  4. Inspect the new fix task: assignee and metadata.orchestrator.assignedAgent both equal the steward's entity ID.
  5. The fix task remains in OPEN status indefinitely. The dispatch daemon does not auto-route it to a worker because it already has an assignee.
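To confirm steps 4 and 5 across many tasks at once, a small filter over exported task records may help. This is a sketch under assumptions: the `TaskSnapshot` shape is a guess at what the JSON output contains, the status comparison is made case-insensitive because the serialized casing is not stated here, and the set of steward IDs has to be supplied by the caller.

```typescript
// Hypothetical shape of one exported task record; the real JSON may differ.
interface TaskSnapshot {
  id: string;
  status: string;
  assignee?: string;
  metadata?: { orchestrator?: { assignedAgent?: string } };
}

// Returns OPEN tasks whose assignee and orchestrator.assignedAgent are both
// the same known steward ID, i.e. the stalled state described above.
function findStewardAssignedOpenTasks(
  tasks: TaskSnapshot[],
  stewardIds: ReadonlySet<string>,
): TaskSnapshot[] {
  return tasks.filter((t) => {
    if (t.status.toUpperCase() !== 'OPEN') return false;
    if (t.assignee === undefined || !stewardIds.has(t.assignee)) return false;
    return t.metadata?.orchestrator?.assignedAgent === t.assignee;
  });
}
```

Any task this returns is a candidate for manual reassignment until a fix lands.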

Two fix options

Option A — Minimal patch in createFixTask

Filter out steward agents when inheriting:

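// merge-steward-service.ts:931
// Inherit the parent's assignedAgent only when it resolves to a non-steward agent.
const inheritedAssignee = orchestratorMeta?.assignedAgent;
const inheritedAgent = inheritedAssignee
  ? await this.agentRegistry.getAgent(inheritedAssignee)
  : undefined;
// Unknown agents and stewards both fall through to undefined (unassigned).
// The metadata copy at line 938 would presumably need this same filtered value.
const assignee =
  inheritedAgent && inheritedAgent.agentRole !== 'steward'
    ? inheritedAssignee
    : undefined;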

When assignee is undefined, the dispatch daemon will pick up the OPEN fix task and route it to a worker via the normal dispatch flow.

Pros: small change, low risk, immediately unblocks. Cons: patches the symptom; the underlying invariant violation remains.
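The Option A filter can be pulled into a helper that is testable in isolation. The sketch below is illustrative only: `AgentRecord`, `AgentLookup`, `pickFixTaskAssignee`, and `resolveFixTaskAssignee` are hypothetical names, and the real agent registry may expose a different interface than the `getAgent` call assumed here.

```typescript
// Hypothetical stand-ins for the real agent record and registry.
interface AgentRecord {
  id: string;
  agentRole: string;
}

interface AgentLookup {
  getAgent(id: string): Promise<AgentRecord | undefined>;
}

// Pure decision: keep the inherited assignee only if it resolved to a
// known agent that is not a steward.
function pickFixTaskAssignee(
  inheritedAssignee: string | undefined,
  inheritedAgent: AgentRecord | undefined,
): string | undefined {
  if (!inheritedAssignee || !inheritedAgent) return undefined;
  return inheritedAgent.agentRole !== 'steward' ? inheritedAssignee : undefined;
}

// Async wrapper that performs the registry lookup, then applies the decision.
async function resolveFixTaskAssignee(
  registry: AgentLookup,
  inheritedAssignee: string | undefined,
): Promise<string | undefined> {
  const agent = inheritedAssignee
    ? await registry.getAgent(inheritedAssignee)
    : undefined;
  return pickFixTaskAssignee(inheritedAssignee, agent);
}
```

Splitting the pure decision from the lookup keeps the rule unit-testable without a registry, and the same resolved value can feed both the `assignee` field and the fix task's own orchestrator metadata.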

Option B — Stop polluting assignedAgent with steward IDs

assignedAgent should always mean "the agent responsible for forward progress" — which by the system's own contract is always a worker, never a steward. Steward dispatch should record steward identity in a separate metadata field (e.g. recoveringStewardId, mergeStewardId, or in the existing sessionHistory), leaving assignedAgent as the canonical worker reference.

This means:

  • dispatch-daemon.ts:3359 and :3579 stop writing assignedAgent: stewardId
  • A new dedicated field captures steward-of-record for the active session
  • All readers of assignedAgent now reliably mean "the worker"

Pros: removes a class of bugs; aligns the data model with the documented role contract. Cons: requires audit of all assignedAgent readers; bigger surface area.
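One possible shape for Option B is sketched below. The field names `mergeStewardId` and `recoveringStewardId` are taken from the examples above, but the interface and helper functions are hypothetical and the real orchestrator metadata likely carries more fields.

```typescript
// Hypothetical metadata shape: assignedAgent is reserved for the worker,
// and steward identity lives in dedicated fields.
interface OrchestratorMetaV2 {
  assignedAgent?: string;
  mergeStewardId?: string;
  recoveringStewardId?: string;
  mergeStatus?: 'pending' | 'testing';
}

// Merge steward spawn: records the steward without touching assignedAgent.
function recordMergeSteward(
  meta: OrchestratorMetaV2,
  stewardId: string,
): OrchestratorMetaV2 {
  return { ...meta, mergeStewardId: stewardId, mergeStatus: 'testing' };
}

// Recovery steward spawn: same idea, separate field.
function recordRecoverySteward(
  meta: OrchestratorMetaV2,
  stewardId: string,
): OrchestratorMetaV2 {
  return { ...meta, recoveringStewardId: stewardId };
}
```

With this shape, a reader such as createFixTask could keep inheriting `assignedAgent` unchanged, since steward dispatch would no longer overwrite it.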

Recommendation

Ship Option A as the v1 fix to unblock users immediately, and track Option B as a follow-up architectural cleanup. Option A is a 5-10 line change and fully addresses the user-visible symptom.

Why this matters

The merge steward's fix-task path is the system's only automatic recovery from test failures and merge conflicts during the review/merge workflow. When this path mis-assigns, work silently stalls. There is no log warning, no UI banner, no escalation. The operator only notices when they manually inspect why a task has not moved.

Environment

  • stoneforge: master @ 0a7052a
  • Reproduced in a real orchestration session
