Summary
agent_judge is always executed on the eval's case engine, ignoring the provider implied by the judge's own model. When the case engine speaks a different protocol than the judge model requires, the judge call fails.
Root cause
internal/evaluator/evaluator.go resolveJudgeAgent builds the judge agent with the eval engine name:
judgeParams := credential.ResolveJudgeInitParams(e.evalCfg.Engine.Name, judgeCfg, ...)
judgeAgent, err = agentDetectWithInitParams(e.evalCfg.Engine.Name, judgeParams, ...)
// ^^^^^^^^^^^^^^^^^^^^ always the case engine
ResolveJudgeInitParams parses judge.model into (provider, model) (e.g. anthropic/claude-sonnet-4-6 → provider anthropic), but the engine stays the case engine. So a codex (OpenAI-protocol) eval with an anthropic/... judge model asks codex to drive an Anthropic model → codex run failed.
Reproduction
A case whose judge.model: anthropic/claude-sonnet-4-6, run under different engines:
| case engine |
judge |
result |
claude_code |
anthropic model on claude engine |
✅ native |
qodercli |
qoder ignores configured model |
✅ runs |
codex |
anthropic model on openai protocol |
❌ codex run failed |
Verified locally: the identical codex eval passes end-to-end when the judge model is switched to an OpenAI-protocol model (openai/gpt-5.5), and fails with an anthropic/... judge model.
Expected behavior
agent_judge should select its engine from the judge model's provider (e.g. anthropic/* → claude_code, openai/* → codex), not inherit the case engine. Semantically the judge is meant to be a fixed, strong model independent of the agent under test.
Impact
Skills whose evals pin an anthropic judge (the common case) cannot be evaluated under codex. Discovered while dogfooding the Agent Skill Eval action across engines (see the skill-eval-self workflow).
Summary
agent_judgeis always executed on the eval's case engine, ignoring the provider implied by the judge's ownmodel. When the case engine speaks a different protocol than the judge model requires, the judge call fails.Root cause
internal/evaluator/evaluator.goresolveJudgeAgentbuilds the judge agent with the eval engine name:ResolveJudgeInitParamsparsesjudge.modelinto(provider, model)(e.g.anthropic/claude-sonnet-4-6→ provideranthropic), but the engine stays the case engine. So acodex(OpenAI-protocol) eval with ananthropic/...judge model asks codex to drive an Anthropic model →codex run failed.Reproduction
A case whose
judge.model: anthropic/claude-sonnet-4-6, run under different engines:claude_codeqoderclicodexcodex run failedVerified locally: the identical codex eval passes end-to-end when the judge model is switched to an OpenAI-protocol model (
openai/gpt-5.5), and fails with ananthropic/...judge model.Expected behavior
agent_judgeshould select its engine from the judge model's provider (e.g.anthropic/*→claude_code,openai/*→codex), not inherit the case engine. Semantically the judge is meant to be a fixed, strong model independent of the agent under test.Impact
Skills whose evals pin an anthropic judge (the common case) cannot be evaluated under codex. Discovered while dogfooding the Agent Skill Eval action across engines (see the skill-eval-self workflow).