Plateau detection should stop early after N identical composite scores #3

@sugi-chan

Description

Problem

When the extraction spec phase produces identical composite scores across multiple iterations, simmer-sdk runs all configured iterations anyway. In a real run, 6 iterations all scored 6.7/10 with identical composites — the generator made changes, judges evaluated them differently on individual criteria, but the composite always averaged to the same value.

Related to #1 — if regression detection compared per-criterion instead of just composite, some of these iterations would have been flagged as regressions and the best candidate preserved.
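To make the overlap with #1 concrete, here is a minimal sketch of a per-criterion regression check. The function name and the dict-of-scores shape are hypothetical, not simmer-sdk's actual API; the idea is just that a drop in any single criterion gets flagged even when the composite is unchanged.

```python
# Hypothetical sketch for issue #1: flag an iteration if any single
# criterion's score dropped, regardless of what the composite did.
def regressed_criteria(prev: dict, curr: dict) -> list:
    """Return the names of criteria whose score dropped from prev to curr."""
    return [name for name in prev if curr.get(name, 0) < prev[name]]

# Example: precision drops while the other criteria offset it.
prev = {"precision": 7.5, "coverage": 6.0, "format_compliance": 6.6}
curr = {"precision": 5.0, "coverage": 7.6, "format_compliance": 7.5}
print(regressed_criteria(prev, curr))  # ['precision']
```

With a check like this, iterations 3 and 4 of the run above would have been flagged as regressions on precision rather than treated as neutral.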

Observed behavior

extraction_spec iter 0: composite=6.7 (seed)
extraction_spec iter 1: composite=6.7
extraction_spec iter 2: composite=6.7
extraction_spec iter 3: composite=6.7
extraction_spec iter 4: composite=6.7
extraction_spec iter 5: composite=6.7

Individual criterion scores varied (precision went 6 → 6.5 → 7.5 → 5 → 6), but the composite was flat because coverage and format_compliance offset the changes.
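The offsetting effect is easy to reproduce. In this sketch the precision values are the ones from the run above, but the coverage and format_compliance values are hypothetical, chosen so each row sums to the same total under an equal-weight mean (the actual weighting in simmer-sdk may differ):

```python
# Illustrative only: precision scores are from the run; coverage and
# format_compliance values are made up to keep each row's sum at 20.1,
# so an equal-weight mean of three criteria stays flat at 6.7.
iterations = [
    {"precision": 6.0, "coverage": 7.0, "format_compliance": 7.1},
    {"precision": 6.5, "coverage": 6.6, "format_compliance": 7.0},
    {"precision": 7.5, "coverage": 6.0, "format_compliance": 6.6},
    {"precision": 5.0, "coverage": 7.6, "format_compliance": 7.5},
    {"precision": 6.0, "coverage": 7.0, "format_compliance": 7.1},
]

composites = [round(sum(s.values()) / len(s), 1) for s in iterations]
print(composites)  # every composite rounds to 6.7
```

Any plateau check that looks only at the composite is blind to all of this movement.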

Expected behavior

After 3 consecutive iterations with identical composite scores (the threshold should be configurable), trigger the on_plateau callback and stop early. The current run wasted 3 iterations of compute (~15 minutes of Claude CLI time) while producing no improvement.

Suggested fix

# In refine loop; plateau_window defaults to 3 but should be configurable
if len(trajectory) >= plateau_window:
    recent = [t.composite for t in trajectory[-plateau_window:]]
    if len(set(recent)) == 1:  # all composites identical
        if on_plateau:
            await on_plateau(trajectory)
        break  # stop early

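The check above can also be pulled out into a small standalone helper, which makes the stopping condition easy to unit-test in isolation. This is a sketch under assumed names (`hit_plateau`, a plain list of composite floats), not simmer-sdk's actual trajectory type:

```python
# Sketch of the plateau check as a standalone, testable helper.
def hit_plateau(composites: list, window: int = 3) -> bool:
    """True if the last `window` composite scores are all identical."""
    if len(composites) < window:
        return False
    return len(set(composites[-window:])) == 1

# Replaying the run above: six identical composites.
scores = [6.7, 6.7, 6.7, 6.7, 6.7, 6.7]
for i in range(len(scores)):
    if hit_plateau(scores[: i + 1]):
        print(f"plateau detected at iter {i}")  # iter 2
        break
```

Against the observed run, this would have stopped after iteration 2 and saved the three wasted iterations.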