
fix(agent): OpenRouter transient error retry + heartbeat retry #16

Merged
rz1989s merged 3 commits into main from fix/openrouter-retry-mitigation
Feb 20, 2026

@rz1989s rz1989s commented Feb 20, 2026

Summary

Mitigates the Feb 19 incident where a transient OpenRouter 401 ("User not found") caused an empty report to be committed (0 narratives despite 17 anomalies detected).

Three-layered defense:

  • Transient retry in OpenRouter: 401/408/429 now retry the same model up to 2x with 3s/6s backoff before falling through to the next model in the chain. Previously these threw immediately, bypassing the fallback chain entirely.

  • Pipeline result reporting: runPipeline() returns { narratives, anomalies } counts so the heartbeat daemon can detect empty reports (the LLM failure was silently caught by clustering's error handler).

  • Heartbeat retry mechanism, with two retry paths:

    1. Pipeline crash → retry after 30min instead of waiting 24h (max 2 retries/day)
    2. Empty report safety net → if anomalies > 0 but narratives = 0, skip commit and retry after 30min
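
For reviewers who want the shape of the first layer without opening the diff, here is a minimal sketch of "retry the same model, then fall through the chain". It is not the code from this PR: `HttpError`, `callWithFallback`, and the injected `sleep` are hypothetical names, and the handling of non-transient errors (move straight to the next model) is an assumption. Only the status codes, the retry count, and the 3s/6s delays come from the description above.

```typescript
// Sketch only. Names are illustrative; status codes and delays are from the PR text.
const TRANSIENT_STATUSES = new Set<number>([401, 408, 429]);
const BACKOFF_MS: number[] = [3000, 6000]; // two retries: wait 3s, then 6s

// Hypothetical error type carrying the HTTP status of a failed request.
class HttpError extends Error {
  readonly status: number;
  constructor(status: number, message: string) {
    super(message);
    this.status = status;
  }
}

type ModelCall = (model: string) => Promise<string>;
type Sleep = (ms: number) => Promise<void>;

const defaultSleep: Sleep = (ms) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function callWithFallback(
  models: string[],
  call: ModelCall,
  sleep: Sleep = defaultSleep,
): Promise<string> {
  let lastError: unknown = new Error("no models configured");
  for (const model of models) {
    // One initial attempt plus one retry per backoff entry.
    for (let attempt = 0; attempt <= BACKOFF_MS.length; attempt++) {
      try {
        return await call(model);
      } catch (err) {
        lastError = err;
        const transient =
          err instanceof HttpError && TRANSIENT_STATUSES.has(err.status);
        // Assumption: non-transient errors skip retries and go to the next model.
        if (!transient || attempt === BACKOFF_MS.length) break;
        await sleep(BACKOFF_MS[attempt]);
      }
    }
  }
  // Every model in the chain failed; surface the most recent error.
  throw lastError;
}
```

Passing `sleep` as a parameter is a design choice for testability: unit tests can record the requested delays instead of actually waiting 9 seconds per model.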

Test plan

  • 403 tests pass (180 web + 223 agent, +11 new)
  • TypeScript strict mode clean
  • Deploy agent to VPS and verify next pipeline run produces report with narratives
  • Verify retry behavior if OpenRouter returns transient error (can't easily simulate in prod)

401/408/429 now retry same model (up to 2x with backoff) before falling
to next model in the chain. Previously 401 threw immediately, causing
empty reports on transient OpenRouter "User not found" errors.

runPipeline() now returns { narratives, anomalies } counts so callers
can detect empty reports (0 narratives with anomalies present) and
decide whether to commit or retry.

Two retry mechanisms:
1. Pipeline crash: retry after 30min instead of waiting 24h (max 2/day)
2. Empty report safety net: if anomalies detected but 0 narratives
   (LLM failure was silently caught), skip commit and retry after 30min

Prevents both the Feb 19 scenario (transient 401 → empty report committed)
and general pipeline failures from wasting an entire day.
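
The heartbeat decision described in the commit messages above can be sketched as a small pure function. This is an illustration under stated assumptions, not the merged code: `decideNext`, `Outcome`, and `Decision` are hypothetical names, and two behaviours are assumed because the PR text does not spell them out: the 2-per-day retry cap is applied to both retry paths, and an empty report is still not committed once retries are exhausted. The `{ narratives, anomalies }` shape, the 30min delay, the 24h interval, and the cap of 2 come from the description.

```typescript
// Sketch only. Shape of PipelineResult follows the PR text; everything else is illustrative.
interface PipelineResult {
  narratives: number;
  anomalies: number;
}

type Outcome = { kind: "ok"; result: PipelineResult } | { kind: "crash" };

interface Decision {
  commit: boolean;     // whether to commit the generated report
  nextRunInMs: number; // delay before the next pipeline run
  retriesUsed: number; // retry counter to carry into the next run
}

const RETRY_DELAY_MS = 30 * 60 * 1000;          // 30min
const NORMAL_INTERVAL_MS = 24 * 60 * 60 * 1000; // 24h
const MAX_RETRIES_PER_DAY = 2;

function decideNext(outcome: Outcome, retriesUsed: number): Decision {
  // Empty report: anomalies were detected but no narratives were produced.
  const failed =
    outcome.kind === "crash" ||
    (outcome.result.anomalies > 0 && outcome.result.narratives === 0);

  if (!failed) {
    return { commit: true, nextRunInMs: NORMAL_INTERVAL_MS, retriesUsed: 0 };
  }
  if (retriesUsed < MAX_RETRIES_PER_DAY) {
    return { commit: false, nextRunInMs: RETRY_DELAY_MS, retriesUsed: retriesUsed + 1 };
  }
  // Assumption: retries exhausted, so skip the commit and wait for the next daily run.
  return { commit: false, nextRunInMs: NORMAL_INTERVAL_MS, retriesUsed: 0 };
}
```

Note that a run with 0 anomalies and 0 narratives counts as a legitimate quiet day in this sketch and is committed normally.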
@rz1989s rz1989s merged commit 026ee3d into main Feb 20, 2026
1 check passed
@rz1989s rz1989s deleted the fix/openrouter-retry-mitigation branch February 20, 2026 06:18