Improve AI broker retry logs for rate-limit fixed wait handling #836

Open
GVCUTV wants to merge 1 commit into issue665-iterative-correction from codex/enhance-retry-logging-in-agentbrokerimpl

GVCUTV (Contributor) commented Feb 19, 2026

Motivation

  • Improve observability and incident response for AI provider retries by making retry logs fully structured and by clearly distinguishing normal retry backoffs from the special fixed 60s 429 window-reset wait used for GEMMA.
  • Provide explicit fields that SRE/incident responders can rely on (provider, attempt, correlation_id, error_code, backoff_ms) so automated tooling and runbooks can surface and filter relevant events.
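As an illustration of how tooling might filter on these fields, here is a minimal sketch. It assumes the log lines are emitted as space-separated key=value pairs; the class name, method names, and sample values are hypothetical and do not come from this PR.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Hypothetical helper for incident tooling; not part of AgentBrokerImpl. */
final class RetryLogFilterSketch {

    /** Splits a line of space-separated key=value tokens into a field map (assumed format). */
    static Map<String, String> parseFields(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq > 0) { // ignore tokens that are not key=value
                fields.put(token.substring(0, eq), token.substring(eq + 1));
            }
        }
        return fields;
    }

    /** Keeps only the lines whose provider and error_code match the given values. */
    static List<String> filterByProviderAndErrorCode(
            List<String> lines, String provider, String errorCode) {
        return lines.stream()
                .filter(line -> {
                    Map<String, String> f = parseFields(line);
                    return provider.equals(f.get("provider"))
                            && errorCode.equals(f.get("error_code"));
                })
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Sample lines are invented for illustration only.
        List<String> lines = List.of(
                "provider=GEMMA attempt=2 correlation_id=abc error_code=http_429 backoff_ms=60000",
                "provider=OTHER attempt=1 correlation_id=def error_code=http_500 backoff_ms=1000");
        filterByProviderAndErrorCode(lines, "GEMMA", "http_429").forEach(System.out::println);
    }
}
```

Filtering on parsed fields rather than on raw substrings keeps a query for one provider from accidentally matching the same text inside another field.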

Description

  • Updated src/main/java/org/cswteams/ms3/ai/broker/AgentBrokerImpl.java to include provider in ai_broker_attempt_failed logs and to consistently emit structured fields for provider, attempt, correlation_id, and error_code across retry paths.
  • Added a dedicated event, event=ai_broker_rate_limit_fixed_wait (with wait_reason=rate_limit_window_reset_429), for the forced 60s wait on a GEMMA 429, kept separate from normal retry backoff events.
  • Split retry logging into clear categories: event=ai_broker_retry_backoff, with wait_reason=normal_retry_backoff or wait_reason=rate_limit_backoff, for ordinary backoffs, and the dedicated ai_broker_rate_limit_fixed_wait for forced 429 waits. Both events now include an explicit backoff_ms field.
  • Introduced a getErrorCode(...) helper that normalizes extraction of the error_code string for logs, and updated the ai_broker_rate_limit_retry_skipped log to include provider and error_code.
  • Updated the runbook at docs/AI_powered_rescheduling/sprint_4/story_5.md to document the new event taxonomy and to give incident-handling guidance on distinguishing a normal retry backoff from a forced 429 window-reset wait.
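The event taxonomy described above could be sketched roughly as follows. This is not the code in AgentBrokerImpl.java: the ProviderException type, the "http_<status>" error_code format, the key=value layout, and the rule that a 429 from GEMMA triggers the fixed wait are all assumptions made for illustration. Only the event names, wait_reason values, field names, and the 60s wait come from the description.

```java
/** Illustrative sketch only; not the actual AgentBrokerImpl code. */
final class RetryLogSketch {

    /** Fixed wait for the GEMMA 429 window reset (60s per the PR description). */
    static final long FIXED_WAIT_MS = 60_000L;

    /** Hypothetical stand-in for whatever exception type the broker really uses. */
    static final class ProviderException extends RuntimeException {
        final Integer httpStatus; // null when the failure carried no HTTP status

        ProviderException(String message, Integer httpStatus) {
            super(message);
            this.httpStatus = httpStatus;
        }
    }

    /** Normalizes an error into an error_code string; the output format is an assumption. */
    static String getErrorCode(Throwable error) {
        if (error == null) {
            return "unknown";
        }
        if (error instanceof ProviderException) {
            ProviderException pe = (ProviderException) error;
            if (pe.httpStatus != null) {
                return "http_" + pe.httpStatus;
            }
        }
        return error.getClass().getSimpleName();
    }

    /** Builds one structured retry log line, choosing the event and wait_reason. */
    static String retryLogLine(String provider, int attempt, String correlationId,
                               Throwable error, long backoffMs) {
        String errorCode = getErrorCode(error);
        boolean rateLimited = "http_429".equals(errorCode);
        // Assumed rule: only a 429 from GEMMA takes the forced fixed wait.
        boolean fixedWait = rateLimited && "GEMMA".equalsIgnoreCase(provider);

        String event = fixedWait ? "ai_broker_rate_limit_fixed_wait" : "ai_broker_retry_backoff";
        String waitReason = fixedWait ? "rate_limit_window_reset_429"
                : rateLimited ? "rate_limit_backoff" : "normal_retry_backoff";
        long waitMs = fixedWait ? FIXED_WAIT_MS : backoffMs;

        return String.join(" ",
                "event=" + event,
                "wait_reason=" + waitReason,
                "provider=" + provider,
                "attempt=" + attempt,
                "correlation_id=" + correlationId,
                "error_code=" + errorCode,
                "backoff_ms=" + waitMs);
    }

    public static void main(String[] args) {
        // Provider names other than GEMMA and all sample values are invented.
        System.out.println(retryLogLine("GEMMA", 2, "c-1",
                new ProviderException("rate limited", 429), 1_500L));
        System.out.println(retryLogLine("OTHER", 1, "c-2",
                new IllegalStateException("timeout"), 500L));
    }
}
```

Keeping the classification in one place means every retry path emits the same field set, which is what lets a runbook tell the forced 60s wait apart from an ordinary backoff by event name alone.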

Testing

  • Attempted to run the broker unit tests with mvn -q -Dtest=AgentBrokerImplTest test, but the build failed in this environment before any tests ran: Maven could not resolve the parent POM from Maven Central (HTTP 403).
  • No other automated tests were run in this environment, because external artifact resolution is blocked.
