research: multi-agent eval framework — what metrics select the best role prompts? #354

## Problem

Before we can evolve prompts (#353), we need to know what "good" means. There's no eval framework for measuring whether one system prompt is better than another for a given role.

## Proposal

Run a swarm of agents (architect + reviewer + safety_auditor) to design an eval framework for prompt selection. The agents should debate and converge on answers to the questions below.

### Questions to answer

1. **What metrics per role?**
   - finder: recall (did it find all relevant files?), precision (how many of the reported files are false positives?)
   - reviewer: true positive rate on known bugs, false alarm rate
   - fixer: does the fix compile? does it pass tests? is the diff minimal?
   - safety_auditor: detection rate on seeded use-after-free (UAF) and double-free bugs
   - test_writer: do generated tests catch the bug? do they leak-check?

2. **What benchmark tasks?**
   - Seeded bug suites (inject known bugs, measure detection/fix rate)
   - Historical issues from this repo (ground truth from merged PRs)
   - Cross-repo generalization (does a prompt that works here work elsewhere?)

3. **How to compare prompt variants?**
   - Elo rating from head-to-head comparisons?
   - Multi-objective Pareto front (quality vs cost vs speed)?
   - Statistical significance: how many runs are needed to distinguish prompts?

4. **How to avoid overfitting?**
   - Train/test split on benchmark tasks
   - Held-out codebase for generalization testing
   - Diversity pressure to prevent prompt collapse
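
To make question 3 concrete, here is a minimal sketch of two of the candidate comparison schemes: a standard Elo update for head-to-head results, and a Pareto filter over quality, cost, and speed. All names (`elo_update`, `VariantResult`, `pareto_front`) and the K-factor of 32 are assumptions for illustration, not part of any existing code in this repo.

```python
from __future__ import annotations

from dataclasses import dataclass


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> tuple[float, float]:
    """Return updated (rating_a, rating_b) after one head-to-head comparison.

    score_a is 1.0 if variant A won, 0.0 if it lost, 0.5 for a draw.
    k = 32 is an assumed default, not a tuned value.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


@dataclass(frozen=True)
class VariantResult:
    """Hypothetical per-variant summary; field choices are assumptions."""
    name: str
    quality: float  # higher is better, e.g. detection rate
    cost: float     # lower is better, e.g. tokens spent
    seconds: float  # lower is better, wall-clock time


def dominates(a: VariantResult, b: VariantResult) -> bool:
    """True if a is no worse than b on every axis and strictly better on one."""
    no_worse = (a.quality >= b.quality and a.cost <= b.cost
                and a.seconds <= b.seconds)
    strictly_better = (a.quality > b.quality or a.cost < b.cost
                       or a.seconds < b.seconds)
    return no_worse and strictly_better


def pareto_front(results: list[VariantResult]) -> list[VariantResult]:
    """Keep only the variants that no other variant dominates."""
    return [r for r in results
            if not any(dominates(other, r) for other in results)]
```

Elo collapses everything into one number and needs a judge for each pairing, while the Pareto filter keeps the trade-offs visible but does not produce a single ranking. Which one suits #353 is exactly what the agents would need to settle.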

### Deliverable

A concrete eval spec that #353 can implement as the fitness function for evolutionary prompt optimization.
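
As a rough illustration of what such a fitness function could look like for the finder role, the sketch below scores a prompt by mean F1 over benchmark tasks and keeps a deterministic held-out split for overfitting checks. Everything here is an assumption: the names `fitness`, `is_held_out`, and `run_prompt`, the 20% held-out fraction, and the choice of F1 as the way to combine precision and recall.

```python
import hashlib
from statistics import mean


def precision_recall(predicted: set, relevant: set) -> tuple:
    """Precision and recall of a predicted file set against ground truth."""
    true_positives = len(predicted & relevant)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def is_held_out(task_id: str, test_fraction: float = 0.2) -> bool:
    """Assign a task to the held-out split by hashing its id.

    Hashing keeps the split stable across runs without storing a list.
    """
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    return digest[0] / 256.0 < test_fraction


def fitness(run_prompt, tasks: dict, held_out: bool = False) -> float:
    """Mean F1 of a prompt over one split of the benchmark.

    run_prompt: hypothetical callable mapping a task id to the set of
        files the agent reported for that task.
    tasks: maps task id to the ground-truth set of relevant files.
    held_out: score the held-out split instead of the training split.
    """
    scores = []
    for task_id, relevant in tasks.items():
        if is_held_out(task_id) != held_out:
            continue
        predicted = run_prompt(task_id)
        scores.append(f1(*precision_recall(predicted, relevant)))
    return mean(scores) if scores else 0.0
```

The optimizer in #353 would call `fitness(run_prompt, tasks)` during evolution, and `fitness(run_prompt, tasks, held_out=True)` only when reporting, so that a large gap between the two scores flags overfitting.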

### Connects to

- #353 (evolutionary prompt optimization — this issue defines the fitness function)
- #274 (evolutionary grid tuning — shared eval infrastructure)
