Problem
When a primary criterion is set, regression detection still compares composites. An iteration that drops the primary criterion score (e.g., precision 7.5 → 5) but holds other criteria steady can maintain the same composite (~6.7), so the system doesn't flag it as a regression. The generator then builds on the worse version instead of rolling back to the best-on-primary iteration.
Evidence
From the Noospheric Orrery extraction spec simmering run:
- Iter 3: precision 7.5 (best!) — composite 6.7
- Iter 4: precision 5 (regressed!) — composite 6.7
- regressed=False on iter 4 because composite didn't drop
- Generator built on iter 4 instead of rolling back to iter 3
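The miss can be reproduced with the two iterations above. This is a minimal sketch, not the project's actual regression check: the function name is_regression and the dict layout are assumptions made for illustration, and only the precision and composite values come from the run.

```python
from typing import Optional


def is_regression(best: dict, curr: dict, primary: Optional[str] = None) -> bool:
    """Return True if `curr` is worse than the best-so-far iteration.

    Hypothetical helper: without a primary criterion only composites are
    compared; with one, the primary score decides and the composite is
    consulted only when the primary scores are equal.
    """
    if primary is not None:
        best_p = best["scores"][primary]
        curr_p = curr["scores"][primary]
        if curr_p != best_p:
            return curr_p < best_p
    return curr["composite"] < best["composite"]


# Values taken from the run above; the dict shape is assumed.
iter3 = {"iteration": 3, "scores": {"precision": 7.5}, "composite": 6.7}
iter4 = {"iteration": 4, "scores": {"precision": 5.0}, "composite": 6.7}

print(is_regression(iter3, iter4))                       # False: composite-only check misses it
print(is_regression(iter3, iter4, primary="precision"))  # True: primary-aware check flags it
```

The composite-only call mirrors the regressed=False result seen on iter 4, while the primary-aware call would have triggered a rollback to iter 3.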
Root Cause
The reflect agent writes "Best candidate" based on composite. _find_best_from_trajectory() in reflect.py reads that line. When primary is set, the best-so-far should be determined by primary criterion first, composite as tiebreaker — matching the skill spec.
The Python utility find_best() in reflect.py handles primary correctly, but the LLM-based reflect agent may not be weighting primary properly in its trajectory table output.
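The intended ordering, primary criterion first with composite as tiebreaker, can be expressed as a tuple sort key. The sketch below is an assumption about what such a ranking looks like, not the body of find_best() in reflect.py; pick_best and the dict layout are hypothetical names.

```python
from typing import Optional


def pick_best(iterations: list, primary: Optional[str] = None) -> dict:
    """Pick the best iteration (hypothetical stand-in for the real utility).

    With a primary criterion, rank by (primary score, composite); tuples
    compare element by element, so the composite only matters when the
    primary scores tie. Without a primary, rank by composite alone.
    """
    def rank(it: dict) -> tuple:
        if primary is None:
            return (it["composite"],)
        return (it["scores"][primary], it["composite"])

    return max(iterations, key=rank)


# Values taken from the evidence above; the dict shape is assumed.
history = [
    {"iteration": 3, "scores": {"precision": 7.5}, "composite": 6.7},
    {"iteration": 4, "scores": {"precision": 5.0}, "composite": 6.7},
]

best = pick_best(history, primary="precision")
print(best["iteration"])  # 3
```

With this ordering, iter 3 stays the best candidate even though both iterations share the same composite.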
Fix
Options:
- Strengthen the reflect prompt to explicitly say "When PRIMARY is set, Best candidate is the iteration with the highest PRIMARY score, composite as tiebreaker"
- Post-process: after reading trajectory.md, use Python find_best() to verify the "Best candidate" line matches primary-weighted ranking
- Both — LLM writes it correctly, Python verifies
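The post-process option could look roughly like the sketch below. It assumes trajectory.md contains a line of the form "Best candidate: iteration N"; that format, the regex, and the names claimed_best and verify_best are all assumptions for illustration, and the real file layout and find_best() signature may differ.

```python
import re
from typing import Optional, Tuple

# Assumed line format in trajectory.md: "Best candidate: iteration 4"
BEST_LINE = re.compile(r"^Best candidate:\s*iteration\s+(\d+)",
                       re.IGNORECASE | re.MULTILINE)


def claimed_best(trajectory_text: str) -> Optional[int]:
    """Extract the iteration number the reflect agent named, if any."""
    match = BEST_LINE.search(trajectory_text)
    return int(match.group(1)) if match else None


def verify_best(trajectory_text: str, iterations: list,
                primary: str) -> Tuple[int, bool]:
    """Return (primary-weighted best iteration, whether the LLM line agrees)."""
    expected = max(
        iterations,
        key=lambda it: (it["scores"][primary], it["composite"]),
    )["iteration"]
    return expected, claimed_best(trajectory_text) == expected


# Values from the evidence above; the trajectory text is a made-up stand-in.
history = [
    {"iteration": 3, "scores": {"precision": 7.5}, "composite": 6.7},
    {"iteration": 4, "scores": {"precision": 5.0}, "composite": 6.7},
]
text = "Best candidate: iteration 4\n"

expected, agrees = verify_best(text, history, primary="precision")
print(expected, agrees)  # 3 False
```

When the check disagrees, the caller can override the LLM's choice with the expected iteration, so a wrong "Best candidate" line no longer steers the generator.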
Impact
Affects runs with primary set where individual criteria oscillate while composite stays flat. Most visible in extraction spec simmering where precision and recall trade off.