
Regression detection should use primary criterion, not just composite #1

@sugi-chan


Problem

When a primary criterion is set, regression detection still compares composites. An iteration that drops the primary criterion score (e.g., precision 7.5 → 5) but holds other criteria steady can maintain the same composite (~6.7), so the system doesn't flag it as a regression. The generator then builds on the worse version instead of rolling back to the best-on-primary iteration.

Evidence

From the Noospheric Orrery extraction spec simmering run:

  • Iter 3: precision 7.5 (best!) — composite 6.7
  • Iter 4: precision 5 (regressed!) — composite 6.7
  • regressed=False on iter 4 because composite didn't drop
  • Generator built on iter 4 instead of rolling back to iter 3
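The numbers above can be replayed in a minimal sketch of a primary-aware regression check. `IterationScore` and `is_regression` are hypothetical names used only for illustration, not the project's actual API, and the record layout is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class IterationScore:
    # Hypothetical record; the real trajectory structure may differ.
    iteration: int
    composite: float
    criteria: Dict[str, float]


def is_regression(
    best: IterationScore,
    current: IterationScore,
    primary: Optional[str] = None,
) -> bool:
    """Return True if `current` is worse than the best-so-far iteration.

    With a primary criterion, compare on it first and fall back to the
    composite only when the primary scores are equal.
    """
    if primary is not None:
        best_p = best.criteria[primary]
        cur_p = current.criteria[primary]
        if cur_p != best_p:
            return cur_p < best_p
    return current.composite < best.composite


# Values taken from the run described above.
iter3 = IterationScore(3, 6.7, {"precision": 7.5})
iter4 = IterationScore(4, 6.7, {"precision": 5.0})

print(is_regression(iter3, iter4))                       # False (composite only)
print(is_regression(iter3, iter4, primary="precision"))  # True (primary first)
```

The composite-only call reproduces the `regressed=False` result, while the primary-first call flags iter 4.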

Root Cause

The reflect agent writes "Best candidate" based on composite, and _find_best_from_trajectory() in reflect.py reads that line. When primary is set, the best-so-far should be determined by the primary criterion first, with composite as tiebreaker, matching the skill spec.

The Python utility find_best() in reflect.py handles primary correctly, but the LLM-based reflect agent may not be weighting primary properly in its trajectory table output.
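For reference, the intended ordering (primary first, composite as tiebreaker) can be expressed as a tuple sort key. This is a sketch under assumptions: `pick_best` and the dict layout are hypothetical and do not reproduce the actual signature of `find_best()` in reflect.py.

```python
from typing import Dict, List, Optional, Tuple


def rank_key(entry: Dict, primary: Optional[str]) -> Tuple[float, float]:
    """Sort key: (primary score, composite) when primary is set, else composite."""
    if primary is None:
        return (entry["composite"], 0.0)
    return (entry["scores"][primary], entry["composite"])


def pick_best(trajectory: List[Dict], primary: Optional[str] = None) -> Dict:
    """Return the highest-ranked entry (hypothetical helper, not find_best())."""
    return max(trajectory, key=lambda entry: rank_key(entry, primary))


# Assumed record layout; values from the Evidence section.
trajectory = [
    {"iteration": 3, "composite": 6.7, "scores": {"precision": 7.5}},
    {"iteration": 4, "composite": 6.7, "scores": {"precision": 5.0}},
]

print(pick_best(trajectory, primary="precision")["iteration"])  # 3
```

Because Python compares tuples element by element, the composite only matters when two iterations have the same primary score.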

Fix

Either:

  1. Strengthen the reflect prompt to explicitly say "When PRIMARY is set, Best candidate is the iteration with the highest PRIMARY score, composite as tiebreaker"
  2. Post-process: after reading trajectory.md, use Python find_best() to verify the "Best candidate" line matches primary-weighted ranking
  3. Both — LLM writes it correctly, Python verifies
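Option 2 could look roughly like the sketch below. It assumes the "Best candidate" line in trajectory.md contains the iteration number as its first integer; that format, the `scores` layout, and the names `parse_best_candidate` and `verify_best_candidate` are all assumptions for illustration, not the existing reflect.py code.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed format: "Best candidate: iteration 4" (first integer after the label).
BEST_LINE = re.compile(r"Best candidate\D*(\d+)", re.IGNORECASE)


def parse_best_candidate(trajectory_text: str) -> Optional[int]:
    """Extract the iteration number the reflect agent claimed as best."""
    match = BEST_LINE.search(trajectory_text)
    return int(match.group(1)) if match else None


def verify_best_candidate(
    trajectory_text: str,
    scores: List[Dict],
    primary: str,
) -> Tuple[int, bool]:
    """Return (primary-ranked best iteration, whether the LLM's claim agrees)."""
    claimed = parse_best_candidate(trajectory_text)
    expected = max(
        scores,
        key=lambda s: (s["scores"][primary], s["composite"]),
    )["iteration"]
    return expected, claimed == expected


text = "Best candidate: iteration 4\n"
scores = [
    {"iteration": 3, "composite": 6.7, "scores": {"precision": 7.5}},
    {"iteration": 4, "composite": 6.7, "scores": {"precision": 5.0}},
]

best, agrees = verify_best_candidate(text, scores, primary="precision")
print(best, agrees)  # 3 False
```

When the check disagrees, the caller could override the "Best candidate" line with the primary-ranked iteration, which keeps the LLM output advisory and the Python ranking authoritative.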

Impact

Affects runs with primary set where individual criteria oscillate while composite stays flat. Most visible in extraction spec simmering where precision and recall trade off.
