feat(metrics): add imp@baseline to catch slow multi-week regressions #1

Open
nanookclaw wants to merge 1 commit into choutos:main from nanookclaw:feat/imp-at-baseline

@nanookclaw

Problem

imp@week compares each week to the immediately prior week. This misses slow, monotonic regressions: if an agent degrades 0.05/week, each consecutive delta reads "stable" while cumulative drift grows silently.

Example: Ada drops from 4.5 → 4.2 → 3.9 → 3.6 over four weeks. Each imp@week is −0.3, which is clearly visible. If the drop were 0.05/week instead, every weekly delta would report "stable", yet four such weeks add up to a cumulative −0.2 loss.
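To make the blind spot concrete, here is a minimal sketch that walks a hypothetical score series losing 0.05 per week and prints both deltas side by side. The ±0.10 "stable" band, the score values, and the helper names `delta` and `classify` are assumptions for illustration only; the real thresholds in compute-impk.sh may differ.

```shell
#!/usr/bin/env bash
# Illustration only: THRESHOLD and the scores below are assumed values,
# not taken from compute-impk.sh.
export LC_ALL=C          # keep "." as the decimal separator in awk
THRESHOLD=0.10

# delta A B -> prints A - B rounded to two decimals
delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a - b }'
}

# classify DELTA -> prints improving | stable | regressing
classify() {
  awk -v d="$1" -v t="$THRESHOLD" 'BEGIN {
    d = sprintf("%.2f", d) + 0        # round first so band edges compare cleanly
    if (d >= t)       print "improving"
    else if (d <= -t) print "regressing"
    else              print "stable"
  }'
}

scores=(4.50 4.45 4.40 4.35 4.30)     # anchor week, then four weeks of -0.05
anchor=${scores[0]}

for ((i = 1; i < ${#scores[@]}; i++)); do
  week=$(delta "${scores[i]}" "${scores[i-1]}")   # consecutive delta
  base=$(delta "${scores[i]}" "$anchor")          # anchored delta
  printf 'week %d  imp@week %s (%s)  imp@baseline %s (%s)\n' \
    "$i" "$week" "$(classify "$week")" "$base" "$(classify "$base")"
done
```

Under these assumptions every consecutive delta stays inside the stable band, while the anchored delta leaves it as the weeks accumulate.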

Solution

Add --baseline <WEEK> flag to compute-impk.sh that prints an additional imp@baseline section anchored to a fixed reference week:

imp@baseline = avg_score(current_week) - avg_score(baseline_week)

The baseline report runs alongside the existing imp@week report. Omitting --baseline preserves full backward compatibility.
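One way the optional flag could be handled is sketched below. This is not the script's actual parsing code: `parse_args`, `CURRENT_WEEK`, and `BASELINE_WEEK` are hypothetical names, and defaulting to the current ISO week when no week is given is an assumption. The point is that an omitted --baseline leaves the baseline empty, so the extra report is simply skipped.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of the argument handling; the real compute-impk.sh may differ.

# parse_args ARGS... -> sets CURRENT_WEEK and BASELINE_WEEK
# (BASELINE_WEEK stays empty when --baseline is omitted)
parse_args() {
  CURRENT_WEEK=""
  BASELINE_WEEK=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --baseline)
        if [ "$#" -lt 2 ]; then
          echo "error: --baseline requires a week such as 2026-W12" >&2
          return 1
        fi
        BASELINE_WEEK="$2"
        shift 2
        ;;
      *)
        CURRENT_WEEK="$1"
        shift
        ;;
    esac
  done
  # Assumption: fall back to the current ISO week when none is given.
  if [ -z "$CURRENT_WEEK" ]; then
    CURRENT_WEEK="$(date +%G-W%V)"
  fi
}

parse_args "$@"
echo "imp@week report for $CURRENT_WEEK"
if [ -n "$BASELINE_WEEK" ]; then
  # Only reached when the flag was passed, so existing callers see no change.
  echo "imp@baseline report: $CURRENT_WEEK vs $BASELINE_WEEK"
fi
```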

Usage

# Both imp@week (W16 vs W15) and imp@baseline (W16 vs W12) in one run
./metrics/scripts/compute-impk.sh 2026-W16 --baseline 2026-W12

Example output:

imp@week report
Current week : 2026-W16
Previous week: 2026-W15

Agent        Current   Previous  imp@week    Trend
------------ --------  --------  ----------  ----------
Ada          3.44      3.56      -0.12       regressing

imp@baseline report
Baseline week: 2026-W12
Current week : 2026-W16

Agent        Current   Baseline  imp@baseline  Trend
------------ --------  --------  ------------  ----------
Ada          3.44      4.67      -1.23         regressing

imp@week shows a small regression this week; imp@baseline reveals −1.23 cumulative drift since W12 — the real story.

Changes

  • metrics/scripts/compute-impk.sh: --baseline flag, new imp@baseline output section, declare -A current_avgs to avoid double-computing agent scores
  • evals/METHODOLOGY.md: New imp@baseline subsection with rationale + example output
  • metrics/README.md: Table updated, compute-impk.sh docs extended with --baseline usage
  • README.md: Core Metrics table includes imp@baseline, Trend Classification section notes the blind spot
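The caching idea behind declare -A current_avgs can be sketched as follows: compute each agent's current-week average once, then reuse it for both deltas. This is a rough illustration, not the code in the diff. `avg_score` is a hypothetical stand-in for however the script actually reads scores, its values are copied from the example output above, and `delta` is an assumed helper. Associative arrays require bash 4 or later.

```shell
#!/usr/bin/env bash
# Sketch only: avg_score is a hypothetical lookup, hard-coded with the
# numbers from the example output. Requires bash 4+ for declare -A.
export LC_ALL=C          # keep "." as the decimal separator in awk

avg_score() {            # avg_score AGENT WEEK -> prints the average score
  case "$1:$2" in
    Ada:2026-W16) echo 3.44 ;;
    Ada:2026-W15) echo 3.56 ;;
    Ada:2026-W12) echo 4.67 ;;
  esac
}

# delta A B -> prints A - B rounded to two decimals
delta() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a - b }'; }

declare -A current_avgs
for agent in Ada; do
  # Computed once per agent, then shared by both reports.
  current_avgs[$agent]=$(avg_score "$agent" 2026-W16)
done

for agent in "${!current_avgs[@]}"; do
  cur=${current_avgs[$agent]}
  printf '%-12s imp@week %s  imp@baseline %s\n' "$agent" \
    "$(delta "$cur" "$(avg_score "$agent" 2026-W15)")" \
    "$(delta "$cur" "$(avg_score "$agent" 2026-W12)")"
done
```

With the example figures this reproduces Ada's −0.12 and −1.23 from a single cached current-week value.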

Background

This follows naturally from the METHODOLOGY.md principle: "Always look at the delta, not the absolute." Two deltas, consecutive and anchored, give you the full picture: weekly volatility plus cumulative drift.

imp@week compares each week to the immediately prior week. This misses
slow monotonic regressions: if an agent degrades 0.05/week, each
consecutive delta reads 'stable' while cumulative drift grows silently.

Add --baseline <WEEK> flag to compute-impk.sh that also prints an
imp@baseline section anchored to a fixed reference week:

  imp@baseline = avg_score(current) - avg_score(anchor)

Usage:
  ./compute-impk.sh --baseline 2026-W12
  ./compute-impk.sh 2026-W16 --baseline 2026-W12

The baseline report runs alongside the existing imp@week report.
Omitting --baseline preserves full backward compatibility.

Also document imp@baseline in:
- evals/METHODOLOGY.md (new subsection with example output)
- metrics/README.md (table + updated compute-impk.sh docs)
- README.md (Core Metrics table + Trend Classification note)