Problem
Before we can evolve prompts (#353), we need to know what "good" means. There's no eval framework for measuring whether one system prompt is better than another for a given role.
Proposal
Run a swarm of agents (architect + reviewer + safety_auditor) to design an eval framework for prompt selection. The agents should debate and converge on answers to the questions below.
Questions to answer
- What metrics per role?
  - finder: recall (did it find all relevant files?), precision (false positives?)
  - reviewer: true positive rate on known bugs, false alarm rate
  - fixer: does the fix compile? does it pass tests? is the diff minimal?
  - safety_auditor: detection rate on seeded UAF/double-free bugs
  - test_writer: do generated tests catch the bug? do they leak-check?
- What benchmark tasks?
  - Seeded bug suites (inject known bugs, measure detection/fix rate)
  - Historical issues from this repo (ground truth from merged PRs)
  - Cross-repo generalization (does a prompt that works here work elsewhere?)
- How to compare prompt variants?
  - Elo rating from head-to-head comparisons?
  - Multi-objective Pareto front (quality vs cost vs speed)?
  - Statistical significance: how many runs are needed to distinguish prompts?
- How to avoid overfitting?
  - Train/test split on benchmark tasks
  - Held-out codebase for generalization testing
  - Diversity pressure to prevent prompt collapse
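To make a few of these options concrete, here is a minimal Python sketch of three candidate building blocks: precision/recall for the finder role, a standard Elo update for head-to-head comparisons, and a Pareto front over quality, cost, and latency. All names (`precision_recall`, `elo_update`, `VariantResult`, `pareto_front`) and the K-factor of 32 are illustrative assumptions, not part of any existing code in this repo, and the swarm may well settle on something different.

```python
from dataclasses import dataclass


def precision_recall(found: set[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of a finder's reported files against ground truth."""
    true_positives = len(found & relevant)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update. score_a is 1.0 (A wins), 0.5 (draw), or 0.0 (A loses).

    The K-factor of 32 is an assumed default, not a tuned value.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


@dataclass(frozen=True)
class VariantResult:
    """Hypothetical aggregate result for one prompt variant."""
    name: str
    quality: float  # higher is better
    cost: float     # lower is better
    latency: float  # lower is better


def dominates(a: VariantResult, b: VariantResult) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.quality > b.quality or a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(results: list[VariantResult]) -> list[VariantResult]:
    """Keep only the variants that no other variant dominates."""
    return [r for r in results if not any(dominates(o, r) for o in results)]
```

A Pareto front avoids collapsing quality, cost, and speed into a single number too early, while Elo gives a single ranking but needs many pairwise comparisons before ratings stabilize. Which of the two fits better is one of the things the agents would need to decide.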
Deliverable
A concrete eval spec that #353 can implement as the fitness function for evolutionary prompt optimization.
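As a rough illustration of the shape such a fitness function could take, the sketch below scores a prompt variant on a deterministic train/held-out split of benchmark tasks. Everything here is an assumption made for illustration: the `score_task` callback, the string task IDs, the 20% held-out fraction, and the function names are hypothetical and are not taken from #353.

```python
import hashlib
from statistics import mean
from typing import Callable


def is_held_out(task_id: str, held_out_fraction: float = 0.2) -> bool:
    """Assign a task to the held-out set based on a hash of its ID.

    Hashing keeps the split stable across runs and machines, so a task
    never moves between the train and held-out sets.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < held_out_fraction


def fitness(score_task: Callable[[str], float], task_ids: list[str],
            held_out_fraction: float = 0.2) -> dict[str, float]:
    """Mean score of one prompt variant on train and held-out tasks.

    score_task is a hypothetical callback that runs the variant on one
    task and returns a score in [0, 1]. Only the "train" value should
    drive selection; "held_out" is reported to detect overfitting.
    """
    train, held_out = [], []
    for task_id in task_ids:
        bucket = held_out if is_held_out(task_id, held_out_fraction) else train
        bucket.append(score_task(task_id))
    return {
        "train": mean(train) if train else 0.0,
        "held_out": mean(held_out) if held_out else 0.0,
    }
```

A large gap between the `train` and `held_out` scores would suggest that a prompt has overfit to the benchmark tasks used for selection.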
Connects to