feat(analysis): RunTrendAnalyzer — detect performance regression across self-modification iterations #20

@nanookclaw

Description

Context

HyperAgents tracks performance across self-modification iterations via archive.jsonl + plot_progress.py. The analysis pipeline computes best/avg scores (analysis_utils.py) with bootstrap CI and significance testing between methods — solid statistical grounding for *snapshot* evaluation.

What's missing is longitudinal trend detection across sequential runs: detecting when the self-modification process begins to regress, not just whether method A beats method B in aggregate.

The Gap

Currently plot_progress.py computes:

  • best_scores — monotonically non-decreasing (always ≥ the previous best)
  • avg_scores — cumulative average (smooths out volatility)

Neither surfaces the rate of directional change or the onset of regression. In a self-modifying system where agents mutate their own code across generations, the question isn't just *did we improve?* but *are we getting less reliable?*

A concrete example: if generations 0–7 show steady improvement, then generations 8–15 show increasing variance and declining scores, the current pipeline renders this as a flat-ish avg_scores line (because the cumulative average is anchored to the early low values) while best_scores stays flat (it never decreases). The regression signal is invisible.
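To make the masking concrete, here is a small sketch on synthetic scores (the numbers are illustrative, not from any real run): the cumulative average barely moves and the running best is flat, while a windowed slope over recent generations makes the decline explicit.

```python
import numpy as np

# Synthetic per-generation scores: generations 0-7 improve, 8-15 regress.
scores = np.array([0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.68, 0.70,
                   0.66, 0.62, 0.58, 0.55, 0.50, 0.48, 0.45, 0.42])

# What the current pipeline surfaces:
best_scores = np.maximum.accumulate(scores)                      # never decreases
avg_scores = np.cumsum(scores) / np.arange(1, len(scores) + 1)   # anchored to early values

# best_scores is flat from generation 7 onward -- no regression signal,
# and avg_scores drifts by under 2% while raw scores fall by ~40%.

# A slope fit over a trailing window surfaces the regression directly.
window = scores[-8:]
slope = np.polyfit(np.arange(len(window)), window, 1)[0]  # clearly negative
```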

Proposed: RunTrendAnalyzer

A lightweight addition to analysis/ that fits alongside the existing statistical toolkit:

```python
# analysis/trend_analyzer.py
from dataclasses import dataclass
from typing import Literal

@dataclass
class TrendReport:
    genid: str
    iteration: int
    score: float
    ema_score: float  # exponentially weighted moving average
    delta: float  # score change from the prior iteration
    is_regression: bool  # score < ema

@dataclass
class TrendSummary:
    slope: float  # OLS slope over last N iterations (score per iteration)
    r_squared: float  # goodness of fit (near 1.0 = consistent linear trend, low = noisy)
    regression_count: int  # consecutive regressions in recent window
    trend_class: Literal['improving', 'stable', 'degrading', 'unstable']
    inflection_points: list[int]  # iteration indices where trend direction changed
```
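A minimal sketch of the per-iteration computation under these definitions might look like the following (the EMA smoothing factor `alpha` and its default are assumptions, not part of the proposal; `TrendReport` is repeated so the snippet is self-contained):

```python
from dataclasses import dataclass

@dataclass
class TrendReport:
    genid: str
    iteration: int
    score: float
    ema_score: float
    delta: float
    is_regression: bool

def build_reports(scores, genids=None, alpha=0.3):
    """Fold a score sequence into per-iteration TrendReports.

    alpha is the EMA smoothing factor (assumed default; higher = more reactive).
    """
    reports = []
    ema = None
    prev = None
    for i, score in enumerate(scores):
        # Standard EMA update; seeded with the first observation.
        ema = score if ema is None else alpha * score + (1 - alpha) * ema
        delta = 0.0 if prev is None else score - prev
        reports.append(TrendReport(
            genid=genids[i] if genids else str(i),
            iteration=i,
            score=score,
            ema_score=ema,
            delta=delta,
            is_regression=score < ema,  # current score below its own smoothed trend
        ))
        prev = score
    return reports
```

Note that with this update order, `score < ema` is equivalent to the score falling below the *previous* EMA, which is the intended "below trend" signal.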

Key design decisions:

  1. OLS slope over fixed windows (default last 10 iterations) rather than all-time — recent trend matters more than historical average for detecting onset
  2. Inflection point detection — use CUSUM or simple sign-change counting on delta to identify where improvement plateaus/reverses, since this is where the self-modification policy may need adjustment
  3. Trend classification maps to actionable states:
    • improving: slope > 0, r_squared > 0.5, 0 consecutive regressions
    • stable: |slope| ≈ 0 (< threshold), no degradation
    • degrading: slope < 0, r² > 0.3 (consistent decline)
    • unstable: r² < 0.3 (high variance, direction unclear)
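Decisions 1–3 above could be sketched as follows. The thresholds come from the classification rules listed; the window size, flat-slope threshold, and the simple sign-change inflection detector (rather than full CUSUM) are assumed simplifications:

```python
import numpy as np

def classify_trend(scores, window=10, flat_threshold=0.005):
    """Classify the recent trend per the rules above.

    window and flat_threshold are assumed defaults, not project constants.
    Returns (trend_class, slope, r_squared).
    """
    recent = np.asarray(scores[-window:], dtype=float)
    x = np.arange(len(recent))
    slope, intercept = np.polyfit(x, recent, 1)
    fitted = slope * x + intercept
    ss_res = np.sum((recent - fitted) ** 2)
    ss_tot = np.sum((recent - recent.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0
    if abs(slope) < flat_threshold:
        return 'stable', slope, r2
    if r2 < 0.3:
        return 'unstable', slope, r2       # high variance, direction unclear
    if slope > 0 and r2 > 0.5:
        return 'improving', slope, r2
    if slope < 0:
        return 'degrading', slope, r2      # consistent decline
    return 'unstable', slope, r2           # fallback: positive but weak fit

def inflection_points(scores):
    """Iteration indices where the sign of the per-step delta flips."""
    signs = np.sign(np.diff(scores))
    return [i + 1 for i in range(1, len(signs))
            if signs[i] != 0 and signs[i] == -signs[i - 1]]
```

The sketch omits the `regression_count` bookkeeping for brevity; in practice it would be derived from the per-iteration `is_regression` flags over the same window.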

Where it Integrates

  • New file: analysis/trend_analyzer.py (pure functions, no new deps — numpy already used in analysis_utils.py)
  • CLI entry point: python -m analysis.trend_analyzer --archive archive.jsonl --scores scores.npy
  • Plot integration: plot_progress.py could overlay a trend line and mark inflection points on the existing progress charts
  • No breaking changes — additive to the existing compute_bootstrap_ci() and save_significance_tests() utilities
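A hypothetical shape for the CLI entry point, matching the flags in the bullet above (the output format, defaults, and everything beyond the flag names are assumptions for illustration):

```python
import argparse
import json

import numpy as np

def main(argv=None):
    parser = argparse.ArgumentParser(prog='analysis.trend_analyzer')
    parser.add_argument('--archive', help='path to archive.jsonl (genid metadata; unused in this sketch)')
    parser.add_argument('--scores', required=True, help='path to scores.npy')
    parser.add_argument('--window', type=int, default=10,
                        help='iterations in the trailing OLS window (assumed default)')
    args = parser.parse_args(argv)

    scores = np.load(args.scores)
    recent = scores[-args.window:]
    # Trailing-window OLS slope, printed as machine-readable JSON.
    slope = np.polyfit(np.arange(len(recent)), recent, 1)[0]
    print(json.dumps({'slope': float(slope), 'n': int(len(scores))}))

if __name__ == '__main__':
    main()
```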

Why This Matters for Self-Modifying Agents

The core thesis of HyperAgents is that agents can improve themselves through iterative self-modification. But if the self-modification process is producing monotonically worse or increasingly unreliable agents, the current metrics don't flag it until a human notices the chart pattern.

A TrendAnalyzer that runs alongside each experiment would:

  1. Alert when the trend turns negative (before a human notices)
  2. Identify the iteration where improvement stops (useful for early termination)
  3. Provide a quantitative metric for comparing self-modification strategies (the slope and trend_class become first-class experiment outcomes)

This is also relevant for the broader agent evaluation ecosystem — I've now filed similar proposals across 125+ repos, and this gap appears universally: eval systems track *current* performance but not *rate of change* in performance. For self-modifying systems where the system's own output becomes the input, this is the missing feedback signal.

Happy to discuss design or submit a PR if this aligns with the project direction.
