Context
HyperAgents tracks performance across self-modification iterations via archive.jsonl + plot_progress.py. The analysis pipeline computes best/avg scores (analysis_utils.py) with bootstrap CI and significance testing between methods — solid statistical grounding for *snapshot* evaluation.
What's missing is longitudinal trend detection across sequential runs: detecting when the self-modification process begins to regress, not just whether method A beats method B in aggregate.
The Gap
Currently plot_progress.py computes:
- best_scores — monotonically increasing (always ≥ the previous best)
- avg_scores — cumulative average (smooths out volatility)
Neither surfaces the rate of directional change or the onset of regression. In a self-modifying system where agents mutate their own code across generations, the question isn't just *did we improve?* but *are we getting less reliable?*
A concrete example: if generations 0→7 show steady improvement and then 8→15 show increasing variance and declining per-generation scores, the current pipeline renders this as a flat-ish line in avg_scores (because the cumulative average is anchored to early low values) while best_scores stays flat (it never decreases, by construction). The regression signal is invisible.
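A minimal sketch of this masking effect, using hypothetical scores (the numbers and the 4-generation window are illustrative, not from the project):

```python
# Hypothetical per-generation scores: 0-7 improve steadily, 8-15 regress.
scores = [0.2, 0.3, 0.4, 0.5, 0.55, 0.6, 0.65, 0.7,
          0.6, 0.55, 0.5, 0.45, 0.5, 0.4, 0.35, 0.3]

# Cumulative average (what avg_scores reports): anchored to early values.
cum_avg = [sum(scores[:i + 1]) / (i + 1) for i in range(len(scores))]

# Windowed average over the last 4 generations: reacts to the downturn.
window = 4
win_avg = [sum(scores[max(0, i - window + 1):i + 1])
           / len(scores[max(0, i - window + 1):i + 1])
           for i in range(len(scores))]

print(cum_avg[-1] - cum_avg[7])  # barely moves across the regression
print(win_avg[-1] - win_avg[7])  # drops sharply, exposing the regression
```

The cumulative average changes by under 0.02 across the entire regression phase, while the windowed average falls by more than 0.2.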
Proposed: RunTrendAnalyzer
A lightweight addition to analysis/ that fits alongside the existing statistical toolkit:
```python
# analysis/trend_analyzer.py
from dataclasses import dataclass
from typing import Literal

@dataclass
class TrendReport:
    genid: str
    iteration: int
    score: float
    ema_score: float      # exponentially weighted moving average
    delta: float          # score change from the prior iteration
    is_regression: bool   # score < ema_score

@dataclass
class TrendSummary:
    slope: float                  # OLS slope over the last N iterations (score per iteration)
    r_squared: float              # goodness of fit (near 1.0 = consistent trend, near 0 = noisy)
    regression_count: int         # consecutive regressions in the recent window
    trend_class: Literal['improving', 'stable', 'degrading', 'unstable']
    inflection_points: list[int]  # iteration indices where the trend direction changed
```
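One way the per-iteration report stream could be populated. The EMA decay `alpha` and the helper `build_reports` are assumptions for illustration, not part of the proposal; `TrendReport` is re-declared so the sketch runs on its own:

```python
from dataclasses import dataclass

@dataclass
class TrendReport:  # re-declared here so the sketch is self-contained
    genid: str
    iteration: int
    score: float
    ema_score: float
    delta: float
    is_regression: bool

def build_reports(genids: list[str], scores: list[float],
                  alpha: float = 0.3) -> list[TrendReport]:
    """Build the per-iteration report stream; `alpha` is the EMA decay."""
    reports: list[TrendReport] = []
    ema = scores[0]
    for i, (gid, s) in enumerate(zip(genids, scores)):
        # The EMA is seeded with the first score, then decays toward recent values.
        ema = s if i == 0 else alpha * s + (1 - alpha) * ema
        reports.append(TrendReport(
            genid=gid,
            iteration=i,
            score=s,
            ema_score=ema,
            delta=0.0 if i == 0 else s - scores[i - 1],
            is_regression=s < ema,
        ))
    return reports
```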
Key design decisions:
- OLS slope over fixed windows (default last 10 iterations) rather than all-time — recent trend matters more than historical average for detecting onset
- Inflection point detection — use CUSUM or simple sign-change counting on delta to identify where improvement plateaus/reverses, since this is where the self-modification policy may need adjustment
- Trend classification maps to actionable states:
  - improving: slope > 0, r_squared > 0.5, 0 consecutive regressions
  - stable: |slope| ≈ 0 (below threshold), no degradation
  - degrading: slope < 0, r_squared > 0.3 (consistent decline)
  - unstable: r_squared < 0.3 (high variance, direction unclear)
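The classification rules above could be sketched as follows. `classify_trend` is an illustrative name, the thresholds come from the list, and combinations the list doesn't cover (e.g. positive slope with middling fit) default to 'unstable' here; the OLS fit uses numpy, which the analysis code already depends on:

```python
import numpy as np

def classify_trend(scores: list[float], window: int = 10,
                   slope_eps: float = 1e-3) -> tuple[float, float, str]:
    """Fit an OLS line to the last `window` scores and classify the trend."""
    recent = np.asarray(scores[-window:], dtype=float)
    x = np.arange(len(recent))
    slope, intercept = np.polyfit(x, recent, deg=1)

    # R^2: fraction of variance in the recent scores explained by the line.
    fitted = slope * x + intercept
    ss_res = float(np.sum((recent - fitted) ** 2))
    ss_tot = float(np.sum((recent - recent.mean()) ** 2))
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 1.0

    if r_squared < 0.3:
        trend_class = 'unstable'
    elif abs(slope) < slope_eps:
        trend_class = 'stable'
    elif slope > 0 and r_squared > 0.5:
        trend_class = 'improving'
    elif slope < 0:
        trend_class = 'degrading'
    else:
        trend_class = 'unstable'  # uncovered combinations stay unclassified
    return float(slope), r_squared, trend_class
```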
Where it Integrates
- New file: analysis/trend_analyzer.py (pure functions, no new deps — numpy is already used in analysis_utils.py)
- CLI entry point: python -m analysis.trend_analyzer --archive archive.jsonl --scores scores.npy
- Plot integration: plot_progress.py could overlay a trend line and mark inflection points on the existing progress charts
- No breaking changes — additive to the existing compute_bootstrap_ci() and save_significance_tests() utilities
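The CLI entry point could be a thin argparse wrapper. The flag names follow the invocation above; the `--window` option, `build_parser` helper, and everything else here are assumptions, not existing code:

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="analysis.trend_analyzer",
        description="Longitudinal trend analysis over sequential run scores")
    parser.add_argument("--archive", required=True,
                        help="path to archive.jsonl (one record per iteration)")
    parser.add_argument("--scores", required=True,
                        help="path to scores.npy")
    parser.add_argument("--window", type=int, default=10,
                        help="recent iterations used for the OLS fit")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"analyzing {args.archive} with window={args.window}")
    # (loading the scores and fitting the windowed trend would go here)
```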
Why This Matters for Self-Modifying Agents
The core thesis of HyperAgents is that agents can improve themselves through iterative self-modification. But if the self-modification process is producing monotonically worse or increasingly unreliable agents, the current metrics don't flag it until a human notices the chart pattern.
A TrendAnalyzer that runs alongside each experiment would:
- Alert when the trend turns negative (before a human notices)
- Identify the iteration where improvement stops (useful for early termination)
- Provide a quantitative metric for comparing self-modification strategies (the slope and trend_class become first-class experiment outcomes)
This is also relevant for the broader agent evaluation ecosystem — I've now filed similar proposals across 125+ repos, and this gap appears universally: eval systems track *current* performance but not *rate of change* in performance. For self-modifying systems where the system's own output becomes the input, this is the missing feedback signal.
Happy to discuss design or submit a PR if this aligns with the project direction.