Goal
Add benchmark tasks that help evaluate and improve prompt framing for each supported agent role in CLI-only benchmark runs.
This issue is about making role prompts easier to compare and tune. It should not add TUI behavior.
Scope
- Identify the agent roles that should participate in CLI benchmark runs.
- Define a small set of framing checks per role, such as instruction clarity, task boundary handling, tool-use restraint, output format consistency, and refusal or approval behavior where relevant.
- Add benchmark task metadata that records which role prompt behavior is being exercised.
- Produce result fields that make per-role prompt comparisons easier to inspect after a CLI benchmark run.
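One possible shape for the task metadata and result fields listed above, sketched with the standard library only. Every name here (`AgentRole`, `FramingCheck`, `FramingTaskMeta`, `FramingResult`, `prompt_version`) and the three role variants are hypothetical placeholders, not types from the existing workspace. The real role set should come from the workspace's own role definitions.

```rust
/// Hypothetical role set; replace with the roles the workspace actually defines.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum AgentRole {
    Planner,
    Coder,
    Reviewer,
}

/// The framing checks named in the Scope list.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash, PartialOrd, Ord)]
pub enum FramingCheck {
    InstructionClarity,
    TaskBoundaryHandling,
    ToolUseRestraint,
    OutputFormatConsistency,
    RefusalOrApproval,
}

/// Metadata attached to a benchmark task (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FramingTaskMeta {
    pub task_id: String,
    pub role: AgentRole,
    pub check: FramingCheck,
    /// Plain-language statement of the role behavior the task exercises.
    pub intended_behavior: String,
}

/// One result row per task per run (assumed shape).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FramingResult {
    pub task_id: String,
    pub role: AgentRole,
    pub check: FramingCheck,
    /// Label for the role prompt revision under test, so runs can be compared.
    pub prompt_version: String,
    pub passed: bool,
}

impl FramingTaskMeta {
    /// Builds a result row that carries the task's role and check forward,
    /// so per-role comparisons do not need to re-join against task metadata.
    pub fn result(&self, prompt_version: &str, passed: bool) -> FramingResult {
        FramingResult {
            task_id: self.task_id.clone(),
            role: self.role,
            check: self.check,
            prompt_version: prompt_version.to_string(),
            passed,
        }
    }
}
```

Copying `role` and `check` into each result row is a design choice made for this sketch: it keeps the CLI output self-describing at the cost of some duplication.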
Suggested first pass
- Search for the existing role definitions or role-selection logic in the Rust workspace.
- Pick 2-3 roles for the initial benchmark framing pass.
- Add task metadata that states the intended role behavior for each task.
- Document how maintainers should use benchmark results to tune role prompts.
Acceptance criteria
- CLI benchmark tasks can be grouped or filtered by agent role.
- Each seed framing task states what role behavior it is testing.
- Benchmark output includes enough metadata to compare role prompt behavior across runs.
- Documentation explains that this is a prompt-optimization aid, not a production prompt auto-tuner.
- No TUI changes are required.
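The grouping and filtering criteria above could be met with helpers along these lines. This is a minimal sketch, assuming results are available in memory as a slice. `TaskResult`, `filter_by_role`, and `pass_counts_by_role` are hypothetical names, and the role is kept as a plain string here only to keep the example short.

```rust
use std::collections::BTreeMap;

/// Minimal result row for illustration (assumed shape, not an existing type).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TaskResult {
    pub task_id: String,
    pub role: String,
    pub passed: bool,
}

/// Returns only the results recorded for the given role.
pub fn filter_by_role<'a>(results: &'a [TaskResult], role: &str) -> Vec<&'a TaskResult> {
    results.iter().filter(|r| r.role == role).collect()
}

/// Groups results by role and returns (passed, total) counts per role.
/// A BTreeMap keeps the roles in a stable, sorted order for printing.
pub fn pass_counts_by_role(results: &[TaskResult]) -> BTreeMap<String, (usize, usize)> {
    let mut counts: BTreeMap<String, (usize, usize)> = BTreeMap::new();
    for r in results {
        let entry = counts.entry(r.role.clone()).or_insert((0, 0));
        if r.passed {
            entry.0 += 1;
        }
        entry.1 += 1;
    }
    counts
}
```

Running `pass_counts_by_role` on the output of two benchmark runs and comparing the counts side by side gives a simple per-role view of whether a prompt change helped.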
Non-goals
- No automatic prompt rewriting.
- No hidden eval service.
- No changes that weaken sandboxing or approval behavior.