feat(analysis): RunTrendAnalyzer — detect performance regression across self-modification iterations #20

@nanookclaw

Description

Context

HyperAgents tracks performance across self-modification iterations via archive.jsonl + plot_progress.py. The analysis pipeline computes best/avg scores (analysis_utils.py) with bootstrap CI and significance testing between methods — solid statistical grounding for *snapshot* evaluation.

What's missing is longitudinal trend detection across sequential runs: detecting when the self-modification process begins to regress, not just whether method A beats method B in aggregate.

The Gap

Currently plot_progress.py computes:

  • best_scores — monotonically non-decreasing (always ≥ the previous best)
  • avg_scores — cumulative average (smooths out volatility)

Neither surfaces the rate of directional change or the onset of regression. In a self-modifying system where agents mutate their own code across generations, the question isn't just *did we improve?* but *are we getting less reliable?*

A concrete example: if generations 0–7 show steady improvement, then generations 8–15 show increasing variance and declining scores, the current pipeline renders this as a flat-ish avg_scores line (because the cumulative average is anchored to the early low values) while best_scores stays flat (it never decreases). The regression signal is invisible.
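To make the masking concrete, here is a small sketch on synthetic scores (the numbers are illustrative, not from any real run): the cumulative average barely moves and the running best is flat, while a windowed slope over recent generations makes the decline explicit.

```python
import numpy as np

# Synthetic per-generation scores: generations 0-7 improve, 8-15 regress.
scores = np.array([0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.68, 0.70,
                   0.66, 0.62, 0.58, 0.55, 0.50, 0.48, 0.45, 0.42])

# What the current pipeline surfaces:
best_scores = np.maximum.accumulate(scores)                      # never decreases
avg_scores = np.cumsum(scores) / np.arange(1, len(scores) + 1)   # anchored to early values

# best_scores is flat from generation 7 onward -- no regression signal,
# and avg_scores drifts by under 2% while raw scores fall by ~40%.

# A slope fit over a trailing window surfaces the regression directly.
window = scores[-8:]
slope = np.polyfit(np.arange(len(window)), window, 1)[0]  # clearly negative
```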

Proposed: RunTrendAnalyzer

A lightweight addition to analysis/ that fits alongside the existing statistical toolkit:

```python
# analysis/trend_analyzer.py
from dataclasses import dataclass
from typing import Literal

@dataclass
class TrendReport:
    genid: str
    iteration: int
    score: float
    ema_score: float  # exponentially weighted moving average
    delta: float  # score change from the prior iteration
    is_regression: bool  # score < ema

@dataclass
class TrendSummary:
    slope: float  # OLS slope over last N iterations (score per iteration)
    r_squared: float  # goodness of fit (near 1.0 = consistent linear trend, low = noisy)
    regression_count: int  # consecutive regressions in recent window
    trend_class: Literal['improving', 'stable', 'degrading', 'unstable']
    inflection_points: list[int]  # iteration indices where trend direction changed
```
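A minimal sketch of the per-iteration computation under these definitions might look like the following (the EMA smoothing factor `alpha` and its default are assumptions, not part of the proposal; `TrendReport` is repeated so the snippet is self-contained):

```python
from dataclasses import dataclass

@dataclass
class TrendReport:
    genid: str
    iteration: int
    score: float
    ema_score: float
    delta: float
    is_regression: bool

def build_reports(scores, genids=None, alpha=0.3):
    """Fold a score sequence into per-iteration TrendReports.

    alpha is the EMA smoothing factor (assumed default; higher = more reactive).
    """
    reports = []
    ema = None
    prev = None
    for i, score in enumerate(scores):
        # Standard EMA update; seeded with the first observation.
        ema = score if ema is None else alpha * score + (1 - alpha) * ema
        delta = 0.0 if prev is None else score - prev
        reports.append(TrendReport(
            genid=genids[i] if genids else str(i),
            iteration=i,
            score=score,
            ema_score=ema,
            delta=delta,
            is_regression=score < ema,  # current score below its own smoothed trend
        ))
        prev = score
    return reports
```

Note that with this update order, `score < ema` is equivalent to the score falling below the *previous* EMA, which is the intended "below trend" signal.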

Key design decisions:

  1. OLS slope over fixed windows (default last 10 iterations) rather than all-time — recent trend matters more than historical average for detecting onset
  2. Inflection point detection — use CUSUM or simple sign-change counting on delta to identify where improvement plateaus/reverses, since this is where the self-modification policy may need adjustment
  3. Trend classification maps to actionable states:
    • improving: slope > 0, r_squared > 0.5, 0 consecutive regressions
    • stable: |slope| ≈ 0 (< threshold), no degradation
    • degrading: slope < 0, r² > 0.3 (consistent decline)
    • unstable: r² < 0.3 (high variance, direction unclear)
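Decisions 1–3 above could be sketched as follows. The thresholds come from the classification rules listed; the window size, flat-slope threshold, and the simple sign-change inflection detector (rather than full CUSUM) are assumed simplifications:

```python
import numpy as np

def classify_trend(scores, window=10, flat_threshold=0.005):
    """Classify the recent trend per the rules above.

    window and flat_threshold are assumed defaults, not project constants.
    Returns (trend_class, slope, r_squared).
    """
    recent = np.asarray(scores[-window:], dtype=float)
    x = np.arange(len(recent))
    slope, intercept = np.polyfit(x, recent, 1)
    fitted = slope * x + intercept
    ss_res = np.sum((recent - fitted) ** 2)
    ss_tot = np.sum((recent - recent.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    if abs(slope) < flat_threshold:
        return 'stable', slope, r2
    if r2 < 0.3:
        return 'unstable', slope, r2       # high variance, direction unclear
    if slope > 0 and r2 > 0.5:
        return 'improving', slope, r2
    if slope < 0:
        return 'degrading', slope, r2      # consistent decline
    return 'unstable', slope, r2           # fallback: positive but weak fit

def inflection_points(scores):
    """Iteration indices where the sign of the per-step delta flips."""
    signs = np.sign(np.diff(scores))
    return [i + 1 for i in range(1, len(signs))
            if signs[i] != 0 and signs[i] == -signs[i - 1]]
```

The sketch omits the `regression_count` bookkeeping for brevity; in practice it would be derived from the per-iteration `is_regression` flags over the same window.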

Where it Integrates

  • New file: analysis/trend_analyzer.py (pure functions, no new deps — numpy already used in analysis_utils.py)
  • CLI entry point: python -m analysis.trend_analyzer --archive archive.jsonl --scores scores.npy
  • Plot integration: plot_progress.py could overlay a trend line and mark inflection points on the existing progress charts
  • No breaking changes — additive to the existing compute_bootstrap_ci() and save_significance_tests() utilities
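A hypothetical shape for the CLI entry point, matching the flags in the bullet above (the output format, defaults, and everything beyond the flag names are assumptions for illustration):

```python
import argparse
import json

import numpy as np

def main(argv=None):
    parser = argparse.ArgumentParser(prog='analysis.trend_analyzer')
    parser.add_argument('--archive', help='path to archive.jsonl (genid metadata; unused in this sketch)')
    parser.add_argument('--scores', required=True, help='path to scores.npy')
    parser.add_argument('--window', type=int, default=10,
                        help='iterations in the trailing OLS window (assumed default)')
    args = parser.parse_args(argv)

    scores = np.load(args.scores)
    recent = scores[-args.window:]
    # Trailing-window OLS slope, printed as machine-readable JSON.
    slope = np.polyfit(np.arange(len(recent)), recent, 1)[0]
    print(json.dumps({'slope': float(slope), 'n': int(len(scores))}))

if __name__ == '__main__':
    main()
```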

Why This Matters for Self-Modifying Agents

The core thesis of HyperAgents is that agents can improve themselves through iterative self-modification. But if the self-modification process is producing monotonically worse or increasingly unreliable agents, the current metrics don't flag it until a human notices the chart pattern.

A TrendAnalyzer that runs alongside each experiment would:

  1. Alert when the trend turns negative (before a human notices)
  2. Identify the iteration where improvement stops (useful for early termination)
  3. Provide a quantitative metric for comparing self-modification strategies (the slope and trend_class become first-class experiment outcomes)

This is also relevant for the broader agent evaluation ecosystem — I've now filed similar proposals across 125+ repos, and this gap appears universally: eval systems track *current* performance but not *rate of change* in performance. For self-modifying systems where the system's own output becomes the input, this is the missing feedback signal.

Happy to discuss design or submit a PR if this aligns with the project direction.
