research: multi-agent eval framework — what metrics select the best role prompts? #354

@justrach

Problem

Before we can evolve prompts (#353), we need to know what "good" means. There's no eval framework for measuring whether one system prompt is better than another for a given role.

Proposal

Run a swarm of agents (architect + reviewer + safety_auditor) to design an eval framework for prompt selection. The agents should debate and converge on answers to the questions below.

Questions to answer

  1. What metrics per role?

    • finder: recall (did it find all relevant files?), precision (how many of the files it returned were actually relevant?)
    • reviewer: true positive rate on known bugs, false alarm rate
    • fixer: does the fix compile? does it pass tests? is the diff minimal?
    • safety_auditor: detection rate on seeded UAF/double-free bugs
    • test_writer: do generated tests catch the bug? do they leak-check?
  2. What benchmark tasks?

    • Seeded bug suites (inject known bugs, measure detection/fix rate)
    • Historical issues from this repo (ground truth from merged PRs)
    • Cross-repo generalization (does a prompt that works here work elsewhere?)
  3. How to compare prompt variants?

    • Elo rating from head-to-head comparisons?
    • Multi-objective Pareto front (quality vs cost vs speed)?
    • Statistical significance — how many runs needed to distinguish prompts?
  4. How to avoid overfitting?

    • Train/test split on benchmark tasks
    • Held-out codebase for generalization testing
    • Diversity pressure to prevent prompt collapse
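
Two of the options listed above have standard formulations, so a rough sketch may help anchor the debate: precision/recall for the finder role, and the Elo update for head-to-head prompt comparisons. This is only an illustration, not a proposed spec. The function names, the empty-set convention, and the K-factor of 32 are assumptions that do not come from this issue.

```python
from typing import Iterable, Tuple


def precision_recall(found: Iterable[str], relevant: Iterable[str]) -> Tuple[float, float]:
    """Precision and recall of a finder's output against a ground-truth file set."""
    found_set, relevant_set = set(found), set(relevant)
    true_positives = len(found_set & relevant_set)
    # Assumed convention: an empty denominator scores 0.0 instead of raising.
    precision = true_positives / len(found_set) if found_set else 0.0
    recall = true_positives / len(relevant_set) if relevant_set else 0.0
    return precision, recall


def elo_update(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Standard Elo update for one comparison between prompt A and prompt B.

    score_a is 1.0 if A wins, 0.5 for a draw, 0.0 if A loses.
    k (assumed 32 here) controls how far a single result moves the ratings.
    """
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Hypothetical file names: the finder returned 3 files, 2 of the 4 relevant ones.
    print(precision_recall(["a", "b", "c"], ["a", "b", "d", "e"]))
    # Two equally rated prompts; A wins one head-to-head comparison.
    print(elo_update(1500, 1500, 1.0))  # (1516.0, 1484.0)
```

How a "win" is decided in a head-to-head comparison (per-task metric, judge model, test pass rate) is exactly the open question for the swarm. The Elo arithmetic only aggregates those outcomes.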

Deliverable

A concrete eval spec that #353 can implement as the fitness function for evolutionary prompt optimization.
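
If the multi-objective option is chosen, the selection step of such a fitness function could look roughly like the sketch below: keep every prompt variant that no other variant beats on quality, cost, and latency at once. The `VariantScore` type, its field names, and the example numbers are hypothetical and exist only for illustration. Nothing here is taken from #353.

```python
from typing import List, NamedTuple


class VariantScore(NamedTuple):
    """Aggregated scores for one prompt variant (hypothetical structure)."""
    name: str
    quality: float   # higher is better, e.g. a mean task score
    cost: float      # lower is better, e.g. tokens or dollars per task
    latency: float   # lower is better, e.g. seconds per task


def dominates(a: VariantScore, b: VariantScore) -> bool:
    """True if a is at least as good as b on every objective and better on one."""
    no_worse = a.quality >= b.quality and a.cost <= b.cost and a.latency <= b.latency
    strictly_better = a.quality > b.quality or a.cost < b.cost or a.latency < b.latency
    return no_worse and strictly_better


def pareto_front(variants: List[VariantScore]) -> List[VariantScore]:
    """Return the variants that no other variant dominates, in input order."""
    return [v for v in variants if not any(dominates(o, v) for o in variants)]


if __name__ == "__main__":
    # Made-up numbers purely to exercise the function.
    scores = [
        VariantScore("prompt_a", quality=0.9, cost=5.0, latency=10.0),
        VariantScore("prompt_b", quality=0.8, cost=2.0, latency=8.0),
        VariantScore("prompt_c", quality=0.7, cost=3.0, latency=9.0),
    ]
    # prompt_c is dominated by prompt_b; prompt_a and prompt_b trade off.
    print([v.name for v in pareto_front(scores)])  # ['prompt_a', 'prompt_b']
```

A Pareto front leaves the final quality-versus-cost trade-off to whoever consumes it, whereas a single scalar fitness (for example an Elo rating) bakes that trade-off in. Which one the evolutionary loop needs is part of what the eval spec should decide.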

Connects to
