Regression detection should use primary criterion, not just composite

## Problem

When a `primary` criterion is set, regression detection still compares composites. An iteration that drops the primary criterion score (e.g., precision 7.5 → 5) while gains on other criteria offset the loss can keep the same composite (~6.7), so the system doesn't flag it as a regression. The generator then builds on the worse version instead of rolling back to the best-on-primary iteration.

## Evidence

From the Noospheric Orrery extraction spec simmering run:
- Iter 3: precision 7.5 (best!) — composite 6.7
- Iter 4: precision 5 (regressed!) — composite 6.7
- `regressed=False` on iter 4 because composite didn't drop
- Generator built on iter 4 instead of rolling back to iter 3
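
A minimal sketch of why the flag stays `False`, using only the precision and composite values from the run above. The `Iteration` record and both check functions are hypothetical names for illustration, not the project's actual code, and comparing against the best-so-far iteration is an assumption about how the check is wired.

```python
from dataclasses import dataclass


@dataclass
class Iteration:
    number: int
    primary: float    # score on the primary criterion (precision in this run)
    composite: float  # composite score across all criteria


def regressed_on_composite(best: Iteration, curr: Iteration) -> bool:
    """Composite-only check: the behaviour described in this issue."""
    return curr.composite < best.composite


def regressed_on_primary(best: Iteration, curr: Iteration) -> bool:
    """Primary-first check: composite is consulted only when primary ties."""
    if curr.primary != best.primary:
        return curr.primary < best.primary
    return curr.composite < best.composite


iter3 = Iteration(number=3, primary=7.5, composite=6.7)
iter4 = Iteration(number=4, primary=5.0, composite=6.7)

print(regressed_on_composite(iter3, iter4))  # composite is unchanged
print(regressed_on_primary(iter3, iter4))    # precision dropped
```

With these values the composite-only check reports no regression, while the primary-first check flags iter 4.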

## Root Cause

The reflect agent writes "Best candidate" based on composite. `_find_best_from_trajectory()` in `reflect.py` reads that line. When `primary` is set, the best-so-far should be determined by the primary criterion first, with composite as the tiebreaker, matching the skill spec.

The Python utility `find_best()` in `reflect.py` handles primary correctly, but the LLM-based reflect agent may not be weighting primary properly in its trajectory table output.
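
For reference, primary-first ranking with composite as tiebreaker can be expressed as a tuple sort key. This is a sketch under assumptions, not the actual `find_best()` from `reflect.py`: the function name `pick_best_primary_first` and the dictionary layout (`iteration`, `scores`, `composite`) are hypothetical.

```python
from typing import Optional


def pick_best_primary_first(iterations: list, primary: Optional[str] = None) -> dict:
    """Return the best iteration record.

    With `primary` set, rank by that criterion's score first and use the
    composite only to break ties. Without it, rank by composite alone.
    `max` keeps the first maximal element, so the earliest iteration wins
    a full tie.
    """
    def key(it: dict) -> tuple:
        if primary is None:
            return (it["composite"],)
        return (it["scores"][primary], it["composite"])

    return max(iterations, key=key)


# Values taken from the run described under Evidence.
trajectory = [
    {"iteration": 3, "scores": {"precision": 7.5}, "composite": 6.7},
    {"iteration": 4, "scores": {"precision": 5.0}, "composite": 6.7},
]

best = pick_best_primary_first(trajectory, primary="precision")
print(best["iteration"])
```

Tuples compare element by element, so the composite only matters when two iterations have identical primary scores.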

## Fix

Either:
1. Strengthen the reflect prompt to explicitly say "When PRIMARY is set, Best candidate is the iteration with the highest PRIMARY score, composite as tiebreaker"
2. Post-process: after reading trajectory.md, use Python `find_best()` to verify the "Best candidate" line matches primary-weighted ranking
3. Both — LLM writes it correctly, Python verifies
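
Option 2 could look roughly like the sketch below. It assumes the "Best candidate" line in `trajectory.md` has the form `Best candidate: iteration N`, which is a guess at the format, and it uses an inline primary-first ranking rather than the real `find_best()`. The name `verify_best_candidate` and the record layout are hypothetical.

```python
import re

# Assumed line format; adjust the pattern to whatever the reflect agent writes.
BEST_LINE = re.compile(r"Best candidate:\s*iteration\s+(\d+)", re.IGNORECASE)


def verify_best_candidate(trajectory_text: str, iterations: list, primary: str):
    """Compare the LLM-written best candidate with a primary-first ranking.

    Returns (expected_iteration, matches), where `matches` is False when the
    line is missing or names a different iteration.
    """
    match = BEST_LINE.search(trajectory_text)
    claimed = int(match.group(1)) if match else None
    expected = max(
        iterations,
        key=lambda it: (it["scores"][primary], it["composite"]),
    )["iteration"]
    return expected, claimed == expected


# Values taken from the run described under Evidence.
iterations = [
    {"iteration": 3, "scores": {"precision": 7.5}, "composite": 6.7},
    {"iteration": 4, "scores": {"precision": 5.0}, "composite": 6.7},
]

expected, ok = verify_best_candidate(
    "Best candidate: iteration 4\n", iterations, primary="precision"
)
print(expected, ok)
```

When `ok` is `False`, the caller can override the LLM-written line with `expected` before the generator picks a base iteration.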

## Impact

Affects runs with `primary` set where individual criteria oscillate while composite stays flat. Most visible in extraction spec simmering where precision and recall trade off.

Regression detection should use primary criterion, not just composite #1
