Problem
When the extraction spec phase produces identical composite scores across multiple iterations, simmer-sdk runs all configured iterations anyway. In a real run, 6 iterations all scored 6.7/10 with identical composites — the generator made changes, judges evaluated them differently on individual criteria, but the composite always averaged to the same value.
Related to #1 — if regression detection compared per-criterion instead of just composite, some of these iterations would have been flagged as regressions and the best candidate preserved.
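A per-criterion comparison could be as simple as the sketch below. This is a hypothetical helper, not simmer-sdk's API: `prev` and `curr` stand in for whatever per-criterion score mapping the judges produce, and `tolerance` is an assumed knob.

```python
def has_regression(prev: dict[str, float], curr: dict[str, float],
                   tolerance: float = 0.0) -> bool:
    """Flag a regression if any shared criterion dropped, even when the
    composite (mean) is unchanged because other criteria rose to offset it."""
    return any(curr[c] < prev[c] - tolerance for c in prev.keys() & curr.keys())
```

For example, `{"precision": 7.5, "coverage": 6.0, "format_compliance": 6.6}` followed by `{"precision": 5.0, "coverage": 8.5, "format_compliance": 6.6}` averages to 6.7 both times, yet the precision drop would be flagged.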
Observed behavior
extraction_spec iter 0: composite=6.7 (seed)
extraction_spec iter 1: composite=6.7
extraction_spec iter 2: composite=6.7
extraction_spec iter 3: composite=6.7
extraction_spec iter 4: composite=6.7
extraction_spec iter 5: composite=6.7
Individual criterion scores varied (precision went 6 → 6.5 → 7.5 → 5 → 6), but the composite was flat because coverage and format_compliance offset the changes.
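The offsetting effect is easy to reproduce. Only the precision values below come from the run log; the coverage and format_compliance numbers are invented to illustrate how the mean stays flat.

```python
from statistics import mean

# Illustrative iterations: precision moves, the other criteria offset it.
iters = [
    {"precision": 6.0, "coverage": 7.5, "format_compliance": 6.6},
    {"precision": 6.5, "coverage": 7.0, "format_compliance": 6.6},
    {"precision": 7.5, "coverage": 6.0, "format_compliance": 6.6},
]
composites = [round(mean(s.values()), 1) for s in iters]
# Every composite comes out to 6.7 even though precision moved 6 -> 6.5 -> 7.5.
```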
Expected behavior
After 3 consecutive iterations with identical composite scores (configurable), trigger on_plateau callback and stop early. The current run wasted 3 iterations of compute (~15 minutes of Claude CLI time) producing no improvement.
Suggested fix
```python
# In refine loop
if len(trajectory) >= 3:
    last_3 = [t.composite for t in trajectory[-3:]]
    if len(set(last_3)) == 1:  # all identical
        if on_plateau:
            await on_plateau(trajectory)
        break  # stop early
```
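Factored out with the configurable window, the check might look like this. `Step` is a stand-in for whatever per-iteration record simmer-sdk keeps in the trajectory, and `plateau_window` is the proposed configurable threshold (default 3); neither name is from the SDK.

```python
from dataclasses import dataclass

@dataclass
class Step:
    composite: float

def is_plateau(trajectory: list[Step], plateau_window: int = 3) -> bool:
    """True once the last `plateau_window` composites are all identical."""
    if len(trajectory) < plateau_window:
        return False
    recent = [s.composite for s in trajectory[-plateau_window:]]
    return len(set(recent)) == 1
```

The refine loop would then call `is_plateau(trajectory)` after each scoring pass, fire `on_plateau` if configured, and break; exact-equality on composites is deliberate here, since the observed plateau was bit-identical scores rather than near-identical ones.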