
coding-loop can select unexpected CLI route and obscure auth failures #90

Description

@danshapiro

Summary

Running kilroy run coding-loop from a Codex-based workflow unexpectedly selected the Claude CLI route. The selected stage command also modified the environment by unsetting the Anthropic API key before invoking claude.

When the local Claude CLI session was expired or otherwise invalid, Kilroy did not clearly tell the user that the selected agent was misconfigured. One run failed after repeated auth errors; another appeared to hang in the first agent stage for several minutes until manually terminated.

No secrets, local paths, run IDs, or account-specific details are included here.

What Happened

  1. User was operating from Codex and invoked Kilroy coding-loop.

  2. Kilroy selected Claude CLI for the workflow's default policy classes. This was unexpected because the active agent/workflow context was Codex.

  3. The generated stage command used this shape:

    env -u ANTHROPIC_API_KEY claude --dangerously-skip-permissions --print ...

  4. That env -u ... behavior meant Kilroy deliberately removed a potentially valid environment credential and relied on the Claude CLI's stored auth/session.

  5. The Claude CLI auth/session was invalid. A direct repro of that auth surface returned a 401 authentication error.

  6. Kilroy reported the failure only in generic terms, e.g. "agent exited with code 1", or classified it as a deterministic failure cycle, with no mention of auth. In another run, the stage started and emitted init output but never wrote a valid stage status; because no stage timeout was set, the run sat for minutes until it was manually terminated.

  7. Rerouting the workflow to the Codex CLI path allowed the same coding-loop workflow to complete successfully.
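
To make the effect of the env -u prefix in step 3 concrete, here is a minimal Python sketch of the same idea: the child process is started with one variable stripped from its environment, so any credential held there is invisible to it. The helper name run_without and the placeholder key value are hypothetical, and this is not Kilroy's actual launch code.

```python
import os
import subprocess
import sys


def run_without(var, parent_env):
    """Run a child process with `var` removed, similar to `env -u VAR cmd`."""
    child_env = {k: v for k, v in parent_env.items() if k != var}
    # The child only reports whether it can see the variable, never its value.
    code = f"import os; print('set' if {var!r} in os.environ else 'unset')"
    result = subprocess.run(
        [sys.executable, "-c", code],
        env=child_env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


# Placeholder value for illustration only; not a real credential.
parent = dict(os.environ, ANTHROPIC_API_KEY="placeholder-not-a-real-key")
print(run_without("ANTHROPIC_API_KEY", parent))  # unset
```

Even though the parent environment holds a key, the child sees none, which is why the run depended entirely on the CLI's stored session.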

Expected Behavior

  • If Kilroy is launched from a Codex workflow/session, it should not silently choose Claude unless the policy/config clearly says that is the intended route.
  • Kilroy should avoid surprising environment manipulation. If it needs to isolate credentials, that should be explicit and visible in the run plan/check output.
  • If the selected agent is not configured correctly, Kilroy should fail clearly and immediately with an actionable message, e.g. "Claude CLI auth failed; run the CLI login flow or choose another policy/driver."
  • kilroy check and kilroy auth check should not report a route as healthy solely because a CLI binary/session exists if a non-interactive prompt would fail with 401.
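
One way the "fail clearly" expectation could look is sketched below. It assumes the runner can see the agent's exit code, its stderr, and whether a stage status was written. StageResult, describe_failure, and the marker patterns are all hypothetical; the real strings an agent CLI emits on auth failure would need to be confirmed.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumed auth-failure markers; the real CLI output may differ.
AUTH_MARKERS = re.compile(
    r"\b401\b|unauthorized|authentication[_ ]error|invalid api key",
    re.IGNORECASE,
)


@dataclass
class StageResult:
    driver: str         # e.g. "Claude CLI"
    exit_code: int
    stderr: str
    wrote_status: bool  # whether a valid stage status artifact exists


def describe_failure(result: StageResult) -> Optional[str]:
    """Return an actionable message, or None if the stage succeeded."""
    if result.exit_code == 0 and result.wrote_status:
        return None
    if AUTH_MARKERS.search(result.stderr):
        return (f"{result.driver} auth failed; run the CLI login flow "
                "or choose another policy/driver.")
    if not result.wrote_status:
        return (f"{result.driver} exited with code {result.exit_code} "
                "without writing a valid stage status.")
    return f"{result.driver} exited with code {result.exit_code}."


failed = StageResult("Claude CLI", 1, "API Error: 401 authentication_error", False)
print(describe_failure(failed))
```

Checking for auth markers before the generic branches keeps a 401 from being folded into "agent exited with code 1".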

Actual Behavior

  • kilroy check coding-loop reported the workflow route as OK.
  • kilroy auth check reported the CLI auth chain as OK.
  • The real non-interactive CLI invocation failed with 401 auth errors.
  • One coding-loop run failed with a deterministic provider/auth cycle instead of a clear agent-misconfigured message.
  • Another run looked hung in the first agent stage for several minutes because:
    • stage_timeout_ms was 0
    • stall_timeout_ms was very long
    • the failed/misconfigured agent stage did not produce a useful status artifact
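
The interaction between the two timeouts can be sketched as follows. The sketch assumes that a value of 0 disables a timeout and that the stall timer restarts at the last output; both are inferred from the behavior described here, not confirmed from Kilroy's source. The function name and the numeric values are hypothetical.

```python
from typing import Optional


def kill_deadline_ms(stage_timeout_ms: int,
                     stall_timeout_ms: int,
                     last_output_at_ms: int) -> Optional[int]:
    """Elapsed time (ms) at which a silent stage would be terminated.

    Assumes 0 means "disabled" for either timeout. Returns None if
    nothing would ever terminate the stage.
    """
    deadlines = []
    if stage_timeout_ms > 0:
        deadlines.append(stage_timeout_ms)
    if stall_timeout_ms > 0:
        deadlines.append(last_output_at_ms + stall_timeout_ms)
    return min(deadlines) if deadlines else None


# Hypothetical numbers: init output at 2 s, a 30-minute stall timeout.
print(kill_deadline_ms(0, 1_800_000, 2_000))        # 1802000
print(kill_deadline_ms(120_000, 1_800_000, 2_000))  # 120000
```

Under these assumptions, a stage that goes silent right after its init output is bounded only by the stall timeout, which matches the apparent hang.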

Suggested Fixes

  • Make check perform, or optionally perform, a cheap non-interactive agent auth probe for selected CLI routes.
  • Distinguish "provider/auth misconfigured" from generic deterministic failures.
  • Fail faster when an agent stage emits an auth failure or exits without a valid stage status.
  • Surface the exact selected driver/tool before launch, especially when it differs from the operator's current agent context.
  • Avoid unsetting credential environment variables unless explicitly configured; if this is intentional, show it in check/run output.
  • Consider a default stage timeout or a shorter auth-failure watchdog so a misconfigured agent does not look like a workflow hang.
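
As a rough illustration of the first and last suggestions, a check-time probe could run the selected route's command non-interactively under a short timeout and report the outcome. This is a sketch under assumptions: probe_route is a hypothetical helper, the caller supplies the probe command, and the demo uses Python one-liners as stand-ins for a real agent CLI.

```python
import subprocess
import sys


def probe_route(command, timeout_s=15.0):
    """Run a cheap non-interactive command and summarize the result."""
    try:
        proc = subprocess.run(
            command,
            capture_output=True,
            text=True,
            timeout=timeout_s,
            stdin=subprocess.DEVNULL,  # never wait on interactive input
        )
    except FileNotFoundError:
        return "missing: binary not found"
    except subprocess.TimeoutExpired:
        return f"timeout: no result after {timeout_s}s"
    if proc.returncode != 0:
        output = (proc.stderr or proc.stdout).strip()
        last_line = output.splitlines()[-1] if output else ""
        return f"failed: exit {proc.returncode}: {last_line}"
    return "ok"


# Stand-in commands that imitate an auth failure and a healthy route.
fake_auth_failure = [
    sys.executable, "-c",
    "import sys; sys.stderr.write('401 authentication error\\n'); sys.exit(1)",
]
print(probe_route(fake_auth_failure))  # failed: exit 1: 401 authentication error
print(probe_route([sys.executable, "-c", "print('hello')"]))  # ok
```

Because the probe has its own timeout, a route that hangs is reported as unhealthy at check time instead of surfacing later as a stalled stage.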
