coding-loop can select unexpected CLI route and obscure auth failures

## Summary

Running `kilroy run coding-loop` from a Codex-based workflow unexpectedly selected the Claude CLI route. The selected stage command also modified the environment by unsetting the Anthropic API key before invoking `claude`.

When the local Claude CLI session was expired or otherwise invalid, Kilroy did not clearly tell the user that the selected agent was misconfigured. One run failed after repeated auth errors; another appeared to hang in the first agent stage for several minutes until manually terminated.

No secrets, local paths, run IDs, or account-specific details are included here.

## What Happened

1. The user was operating from Codex and invoked Kilroy's `coding-loop` workflow.
2. Kilroy selected Claude CLI for the workflow's default policy classes. This was unexpected because the active agent/workflow context was Codex.
3. The generated stage command used this shape:

   ```sh
   env -u ANTHROPIC_API_KEY claude --dangerously-skip-permissions --print ...
   ```

4. The `env -u ANTHROPIC_API_KEY` prefix meant Kilroy deliberately removed a potentially valid credential from the environment and relied on the Claude CLI's stored auth/session instead.
5. The Claude CLI auth/session was invalid. A direct repro of that auth surface returned a 401 authentication error.
6. Kilroy reported the failure only in generic terms, e.g. "agent exited with code 1" or a deterministic failure cycle, without naming auth as the cause. In another run, the stage started and emitted init output but did not write a valid stage status; with no stage timeout, the run sat idle for several minutes until manually terminated.
7. Rerouting the workflow to the Codex CLI path allowed the same coding-loop workflow to complete successfully.
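
To make the effect of the `env -u` prefix in step 3 concrete, here is a minimal sketch. It does not invoke `claude`; the `sh -c` child is a hypothetical stand-in for the agent CLI, and the key value is a placeholder, not a real credential.

```sh
# Hypothetical stand-in for the agent CLI: prints the key it can see,
# or "unset" if the variable is absent from its environment.
show_key() {
  env -u ANTHROPIC_API_KEY sh -c 'echo "${ANTHROPIC_API_KEY:-unset}"'
}

# Placeholder value for illustration only.
ANTHROPIC_API_KEY=placeholder-value
export ANTHROPIC_API_KEY

show_key   # prints "unset": the child never sees the exported key
```

The parent shell still has the variable set, so nothing in the operator's own session hints that the agent was started without it.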

## Expected Behavior

- If Kilroy is launched from a Codex workflow/session, it should not silently choose Claude unless the policy/config clearly says that is the intended route.
- Kilroy should avoid surprising environment manipulation. If it needs to isolate credentials, that should be explicit and visible in the run plan/check output.
- If the selected agent is not configured correctly, Kilroy should fail clearly and immediately with an actionable message, e.g. "Claude CLI auth failed; run the CLI login flow or choose another policy/driver."
- `kilroy check` and `kilroy auth check` should not report a route as healthy solely because a CLI binary/session exists if a non-interactive prompt would fail with 401.
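
One way to get the clearer failure message described above is to classify the agent's exit code and output before reporting. The following is only a sketch under assumptions: `classify_agent_failure` and its result labels are hypothetical names, not part of Kilroy, and the string matching is deliberately crude.

```sh
# classify_agent_failure EXIT_CODE OUTPUT_TEXT
# Hypothetical helper: prints "ok", "auth_misconfigured", or "generic_failure".
classify_agent_failure() {
  exit_code=$1
  output=$2
  if [ "$exit_code" -eq 0 ]; then
    echo "ok"
    return 0
  fi
  # Crude substring match; a real check would parse structured errors.
  case $output in
    *401*|*[Aa]uthentication*|*[Uu]nauthorized*)
      echo "auth_misconfigured" ;;
    *)
      echo "generic_failure" ;;
  esac
}

classify_agent_failure 1 "401 authentication error"   # prints "auth_misconfigured"
```

A runner could map `auth_misconfigured` to an actionable message such as the one quoted above, and keep `generic_failure` for everything else.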

## Actual Behavior

- `kilroy check coding-loop` reported the workflow route as OK.
- `kilroy auth check` reported the CLI auth chain as OK.
- The real non-interactive CLI invocation failed with 401 auth errors.
- One `coding-loop` run failed with a deterministic provider/auth cycle instead of a clear agent-misconfigured message.
- Another run looked hung in the first agent stage for several minutes because:
  - `stage_timeout_ms` was `0`
  - `stall_timeout_ms` was very long
  - the failed/misconfigured agent stage did not produce a useful status artifact

## Suggested Fixes

- Make `check` perform, or optionally perform, a cheap non-interactive agent auth probe for selected CLI routes.
- Distinguish "provider/auth misconfigured" from generic deterministic failures.
- Fail faster when an agent stage emits an auth failure or exits without a valid stage status.
- Surface the exact selected driver/tool before launch, especially when it differs from the operator's current agent context.
- Avoid unsetting credential environment variables unless explicitly configured; if this is intentional, show it in `check`/`run` output.
- Consider a default stage timeout or a shorter auth-failure watchdog so a misconfigured agent does not look like a workflow hang.
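
The fail-fast and timeout suggestions could be combined along these lines. This is a sketch, not Kilroy's implementation: `run_stage_with_deadline`, the status-file argument, and the result labels are hypothetical, and it assumes the GNU coreutils `timeout` command, which exits with code 124 when the deadline is hit.

```sh
# run_stage_with_deadline SECONDS STATUS_FILE COMMAND [ARGS...]
# Hypothetical wrapper: runs a stage command under a deadline and reports
# whether it timed out, exited without a status file, or reported status.
run_stage_with_deadline() {
  deadline=$1
  status_file=$2
  shift 2
  rc=0
  # Stage output goes to stderr so stdout carries only the verdict.
  timeout "$deadline" "$@" 1>&2 || rc=$?
  if [ "$rc" -eq 124 ]; then
    echo "stage_timed_out"
  elif [ ! -s "$status_file" ]; then
    # Missing or empty status artifact: treat as a failed stage.
    echo "missing_stage_status"
  else
    echo "stage_reported_status"
  fi
}
```

With a wrapper like this, a stage that exits without writing its status artifact is reported straight away, and one that never exits is cut off at the deadline instead of waiting on a long stall timeout.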

