feat(metrics): add imp@baseline to catch slow multi-week regressions by nanookclaw · Pull Request #1 · choutos/agent-reliability-engineering

nanookclaw · 2026-03-29T15:05:11Z

Problem

imp@week compares each week to the immediately prior week. This misses slow, monotonic regressions: if an agent degrades 0.05/week, each consecutive delta reads "stable" while cumulative drift grows silently.

Example: Ada's weekly average goes 4.5 → 4.2 → 3.9 → 3.6. Each imp@week is −0.3, which is clearly visible. If the drop were 0.05/week instead, four consecutive weekly deltas would each report "stable" despite a cumulative −0.2 loss.
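The blind spot can be reproduced in a few lines of shell. This is a minimal sketch, not code from this PR: the `delta` and `classify` helpers, the sample scores, and the ±0.1 "stable" band are all assumptions made for illustration, since the threshold compute-impk.sh actually uses is not stated here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: difference of two scores, rounded to two decimals.
delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a - b }'
}

# Hypothetical helper: label a delta using an ASSUMED +/-0.1 "stable" band.
classify() {
  awk -v d="$1" 'BEGIN {
    if (d > 0.1) print "improving"
    else if (d < -0.1) print "regressing"
    else print "stable"
  }'
}

# Illustrative scores: a baseline week, then four weeks losing 0.05 each.
scores="4.50 4.45 4.40 4.35 4.30"

first=""
prev=""
for s in $scores; do
  if [ -z "$first" ]; then
    first="$s"
    prev="$s"
    continue
  fi
  d=$(delta "$s" "$prev")
  echo "imp@week     $d  $(classify "$d")"   # consecutive delta
  prev="$s"
done

d=$(delta "$prev" "$first")
echo "imp@baseline $d  $(classify "$d")"     # anchored delta
```

Under these assumptions every consecutive delta stays inside the band, while the anchored delta falls outside it.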

Solution

Add a --baseline <WEEK> flag to compute-impk.sh that prints an additional imp@baseline section anchored to a fixed reference week:

imp@baseline = avg_score(current_week) - avg_score(baseline_week)

The baseline report runs alongside the existing imp@week report. Omitting --baseline preserves full backward compatibility.
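One way an optional flag like this could be parsed without disturbing existing invocations is sketched below. The `parse_args` function and the `week` and `baseline` variable names are hypothetical and are not taken from compute-impk.sh; the sketch only shows that leaving `baseline` empty lets the script skip the new section.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of argument handling; not the PR's actual code.
parse_args() {
  week=""       # positional week such as 2026-W16 (empty: fall back to the script default)
  baseline=""   # empty: no imp@baseline section is printed
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --baseline)
        if [ "$#" -lt 2 ]; then
          echo "error: --baseline requires a week such as 2026-W12" >&2
          return 1
        fi
        baseline="$2"
        shift 2
        ;;
      *)
        week="$1"
        shift
        ;;
    esac
  done
}

parse_args 2026-W16 --baseline 2026-W12
echo "week=$week baseline=$baseline"

parse_args 2026-W16
echo "week=$week baseline=${baseline:-<none>}"
```

Because the flag is consumed together with its value, the positional week can appear before or after it.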

Usage

# Both imp@week (W16 vs W15) and imp@baseline (W16 vs W12) in one run
./metrics/scripts/compute-impk.sh 2026-W16 --baseline 2026-W12

Example output:

imp@week report
Current week : 2026-W16
Previous week: 2026-W15

Agent        Current   Previous  imp@week    Trend
------------ --------  --------  ----------  ----------
Ada          3.44      3.56      -0.12       regressing

imp@baseline report
Baseline week: 2026-W12
Current week : 2026-W16

Agent        Current   Baseline  imp@baseline  Trend
------------ --------  --------  ------------  ----------
Ada          3.44      4.67      -1.23         regressing

imp@week shows a small regression this week; imp@baseline reveals a cumulative drift of −1.23 since W12, which is the real story.
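The deltas in both tables follow directly from the formula: 3.44 − 3.56 = −0.12 and 3.44 − 4.67 = −1.23. The sketch below recomputes them and prints rows in the same column layout. The `report_row` helper is hypothetical, and the ±0.1 band used for the Trend column is an assumption that happens to agree with the example output; it is not confirmed to be the script's actual rule.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one report row from two weekly averages.
# Arguments: agent name, current average, reference average.
report_row() {
  awk -v agent="$1" -v cur="$2" -v ref="$3" 'BEGIN {
    d = cur - ref                      # imp = avg_score(current) - avg_score(reference)
    trend = "stable"                   # ASSUMED +/-0.1 band, for illustration only
    if (d > 0.1) trend = "improving"
    else if (d < -0.1) trend = "regressing"
    printf "%-12s %-8.2f  %-8.2f  %-12.2f  %s\n", agent, cur, ref, d, trend
  }'
}

report_row Ada 3.44 3.56   # imp@week:     W16 vs W15
report_row Ada 3.44 4.67   # imp@baseline: W16 vs W12
```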

Changes

metrics/scripts/compute-impk.sh — --baseline flag, new imp@baseline output section, declare -A current_avgs to avoid double-computing agent scores
evals/METHODOLOGY.md — New imp@baseline subsection with rationale + example output
metrics/README.md — Table updated, compute-impk.sh docs extended with --baseline usage
README.md — Core Metrics table includes imp@baseline, Trend Classification section notes the blind spot
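The `declare -A current_avgs` change mentioned above suggests caching each agent's current-week average so both reports can reuse it. A minimal sketch of that pattern follows, assuming bash 4 or newer. Only the `current_avgs` name comes from this PR; `compute_avg`, `cache_current_avg`, the `compute_calls` counter, and the sample scores are hypothetical stand-ins for whatever the script really does.

```bash
#!/usr/bin/env bash
# Associative arrays require bash 4+.
declare -A current_avgs   # agent name -> cached current-week average
compute_calls=0           # illustration only: counts real computations

# Hypothetical stand-in for the script's per-agent averaging logic.
# Takes a space-separated list of scores and prints their mean.
compute_avg() {
  awk -v s="$1" 'BEGIN {
    n = split(s, a, " ")
    for (i = 1; i <= n; i++) total += a[i]
    printf "%.2f\n", total / n
  }'
}

# Compute the average on first use; later calls reuse the cached value.
cache_current_avg() {
  local agent="$1" scores="$2"
  if [[ -z "${current_avgs[$agent]+set}" ]]; then
    current_avgs[$agent]=$(compute_avg "$scores")
    compute_calls=$((compute_calls + 1))
  fi
}

cache_current_avg Ada "3 4 3.5"   # made-up scores, not real data
cache_current_avg Ada "3 4 3.5"   # cache hit: nothing is recomputed
echo "Ada current avg: ${current_avgs[Ada]} (computed ${compute_calls}x)"
```

With the value cached, an imp@week section and an imp@baseline section could both read `current_avgs` instead of averaging the current week twice.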

Background

This follows naturally from the METHODOLOGY.md principle: "Always look at the delta, not the absolute." Two deltas, consecutive and anchored, give you the full picture: weekly volatility plus cumulative drift.

imp@week compares each week to the immediately prior week. This misses slow monotonic regressions: if an agent degrades 0.05/week, each consecutive delta reads 'stable' while cumulative drift grows silently.

Add --baseline <WEEK> flag to compute-impk.sh that also prints an imp@baseline section anchored to a fixed reference week:

imp@baseline = avg_score(current) - avg_score(anchor)

Usage:

./compute-impk.sh --baseline 2026-W12
./compute-impk.sh 2026-W16 --baseline 2026-W12

The baseline report runs alongside the existing imp@week report. Omitting --baseline preserves full backward compatibility.

Also document imp@baseline in:

- evals/METHODOLOGY.md (new subsection with example output)
- metrics/README.md (table + updated compute-impk.sh docs)
- README.md (Core Metrics table + Trend Classification note)

nanookclaw wants to merge 1 commit into choutos:main from nanookclaw:feat/imp-at-baseline
