agent_judge runs on the case engine, breaking openai-protocol engines (codex) with anthropic judge models

## Summary

`agent_judge` is always executed on the **eval's case engine**, ignoring the provider implied by the judge's own `model`. When the case engine speaks a different protocol than the judge model requires, the judge call fails.

## Root cause

`internal/evaluator/evaluator.go` `resolveJudgeAgent` builds the judge agent with the eval engine name:

```go
judgeParams := credential.ResolveJudgeInitParams(e.evalCfg.Engine.Name, judgeCfg, ...)
judgeAgent, err = agentDetectWithInitParams(e.evalCfg.Engine.Name, judgeParams, ...)
//                                           ^^^^^^^^^^^^^^^^^^^^ always the case engine
```

`ResolveJudgeInitParams` parses `judge.model` into `(provider, model)` (e.g. `anthropic/claude-sonnet-4-6` → provider `anthropic`), but the **engine** stays the case engine. So a `codex` (OpenAI-protocol) eval with an `anthropic/...` judge model asks codex to drive an Anthropic model → `codex run failed`.

## Reproduction

A case whose `judge.model: anthropic/claude-sonnet-4-6`, run under different engines:

| case engine | judge | result |
| --- | --- | --- |
| `claude_code` | anthropic model on claude engine | ✅ native |
| `qodercli` | qoder ignores configured model | ✅ runs |
| `codex` | anthropic model on **openai** protocol | ❌ `codex run failed` |

Verified locally: the identical codex eval **passes end-to-end** when the judge model is switched to an OpenAI-protocol model (`openai/gpt-5.5`), and fails with an `anthropic/...` judge model.

## Expected behavior

`agent_judge` should select its engine from the **judge model's provider** (e.g. `anthropic/*` → `claude_code`, `openai/*` → `codex`), not inherit the case engine. Semantically the judge is meant to be a fixed, strong model independent of the agent under test.

## Impact

Skills whose evals pin an anthropic judge (the common case) cannot be evaluated under codex. Discovered while dogfooding the Agent Skill Eval action across engines (see the skill-eval-self workflow).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

agent_judge runs on the case engine, breaking openai-protocol engines (codex) with anthropic judge models #104

Summary

Root cause

Reproduction

Expected behavior

Impact

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

case engine	judge	result
`claude_code`	anthropic model on claude engine	✅ native
`qodercli`	qoder ignores configured model	✅ runs
`codex`	anthropic model on openai protocol	❌ `codex run failed`

Uh oh!

agent_judge runs on the case engine, breaking openai-protocol engines (codex) with anthropic judge models #104

Description

Summary

Root cause

Reproduction

Expected behavior

Impact

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions