feat: harden HTTP resiliency with connection retries, request timeout, and docs #2148

Merged
iMicknl merged 7 commits into main from fix/retry-connection-failure-all-requests on Jun 28, 2026


@iMicknl iMicknl commented Jun 22, 2026

Owner

Summary

Fixes #2147.

Hardens the client's behaviour on transient network failures and documents the full resiliency strategy. Builds on the original connection-retry centralization with a request timeout, smarter ServerDisconnectedError handling, and user docs.

Changes

Centralize connection retry (original fix)

Previously only fetch_events was decorated with @retry_on_connection_failure, so a transient TimeoutError / aiohttp.ClientConnectorError raised from any other method — including the command path (execute_action_group -> _execute_action_group_direct) and all setup/state/refresh calls — propagated raw on the very first occurrence instead of being retried.

This surfaced in Home Assistant (home-assistant/core#173155): a ConnectionTimeoutError (subclass of the builtin TimeoutError) raised from a cover close command escaped as an unhandled traceback.

The retry is now centralized on the _get/_post/_put/_delete request helpers so every request gets uniform transient-connection retry, with connection retry sitting innermost (closest to the request). The budget is intentionally fail-fast: 3 tries within ~30s.
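
To illustrate the centralization described above, here is a minimal stdlib-only sketch, not the library's actual code. `SketchClient`, `FakeConnectorError`, the path string, and the decorator body are all hypothetical stand-ins; the real client may well build its retry on a third-party backoff helper instead. The point is only that the decorator sits on the low-level request helper, so a public method such as `execute_action_group` inherits the retry without being decorated itself.

```python
import asyncio
import functools
import random
import time


class FakeConnectorError(OSError):
    """Hypothetical stand-in for aiohttp.ClientConnectorError (keeps the sketch dependency-free)."""


# asyncio.TimeoutError is an alias of TimeoutError on Python 3.11+; both are listed for older versions.
TRANSIENT_ERRORS = (TimeoutError, asyncio.TimeoutError, FakeConnectorError)


def retry_on_connection_failure(max_tries=3, max_time=30.0, base_delay=1.0):
    """Illustrative retry decorator: retry an async callable on transient connection errors."""

    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            started = time.monotonic()
            for attempt in range(1, max_tries + 1):
                try:
                    return await func(*args, **kwargs)
                except TRANSIENT_ERRORS:
                    elapsed = time.monotonic() - started
                    if attempt == max_tries or elapsed >= max_time:
                        raise  # budget exhausted: surface the original error
                    # Sleep a random time up to an exponential ceiling (full jitter).
                    ceiling = base_delay * 2 ** (attempt - 1)
                    await asyncio.sleep(random.uniform(0, ceiling))

        return wrapper

    return decorator


class SketchClient:
    """Hypothetical client: every public method funnels through one decorated helper."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    @retry_on_connection_failure(base_delay=0.01)  # tiny delay so the demo runs quickly
    async def _post(self, path, payload):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise FakeConnectorError("connection refused")
        return {"path": path, "payload": payload}

    async def execute_action_group(self, actions):
        # No decorator here: the retry lives on the request helper.
        return await self._post("hypothetical/path", actions)
```

With two simulated failures, `asyncio.run(SketchClient(2).execute_action_group(["close"]))` succeeds on the third call; with more failures than the budget allows, the original error propagates after three calls.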

Default request timeout

The client-owned session now applies a default per-request timeout (total=15s, sock_connect=10s), kept below the connection-retry budget so a hung socket fails fast as a TimeoutError and is retried, instead of blocking on aiohttp's 300s default. max_time only bounds the scheduling of retries between attempts; it cannot interrupt an in-flight request, so the timeout is what makes the fail-fast budget meaningful at runtime.
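
The interaction between the retry budget and the request timeout can be sketched with the standard library alone. This is an assumption-laden illustration: `hung_request` and `fetch_with_budget` are hypothetical names, and `asyncio.wait_for` stands in for the session-level timeout the PR configures. It shows that the retry loop only gets control between attempts, so without a per-request timeout a hung request would never return to it.

```python
import asyncio


async def hung_request():
    """Simulates a socket that never answers."""
    await asyncio.sleep(3600)


async def fetch_with_budget(request, timeout, max_tries, attempts_log):
    """Run `request` up to `max_tries` times, bounding each attempt by `timeout` seconds."""
    for attempt in range(1, max_tries + 1):
        attempts_log.append(attempt)
        try:
            # The timeout is the only thing that can interrupt the in-flight request.
            return await asyncio.wait_for(request(), timeout=timeout)
        except asyncio.TimeoutError:
            # The retry budget is consulted here, between attempts, and nowhere else.
            if attempt == max_tries:
                raise
```

Called with a short timeout, for example `fetch_with_budget(hung_request, 0.05, 3, log)`, the sketch makes three bounded attempts and then raises a timeout error instead of hanging on the first one.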

Callers passing their own ClientSession own its timeout; pyOverkiz does not override it.

ServerDisconnectedError no longer forces a relogin

ServerDisconnectedError (a dropped keep-alive socket) was bucketed with NotAuthenticatedError and triggered a full login() + event-listener re-register on every occurrence. It is now treated as a transient transport failure and simply retried, with no relogin. A genuine session expiry is still reported by the server as a Not authenticated response, which raises NotAuthenticatedError and escalates to relogin via the separate auth policy.
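
The two-layer policy described above could look roughly like the following. This is a sketch under stated assumptions, not the code in client.py: `ServerDisconnected`, `NotAuthenticated`, and `SketchClient` are hypothetical stand-ins for the aiohttp and pyOverkiz classes, responses are scripted, and the sleep between transport retries is omitted for brevity.

```python
import asyncio


class ServerDisconnected(Exception):
    """Hypothetical stand-in for aiohttp.ServerDisconnectedError."""


class NotAuthenticated(Exception):
    """Hypothetical stand-in for pyOverkiz's NotAuthenticatedError."""


class SketchClient:
    def __init__(self, responses):
        self.responses = list(responses)  # scripted outcomes, consumed in order
        self.login_calls = 0

    async def login(self):
        self.login_calls += 1

    async def _request_once(self):
        outcome = self.responses.pop(0)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome

    async def _request_with_transport_retry(self, max_tries=3):
        # Inner policy: a dropped socket is retried as-is, with no relogin.
        for attempt in range(1, max_tries + 1):
            try:
                return await self._request_once()
            except ServerDisconnected:
                if attempt == max_tries:
                    raise

    async def request(self):
        # Outer policy: only an explicit auth failure triggers a relogin.
        try:
            return await self._request_with_transport_retry()
        except NotAuthenticated:
            await self.login()
            return await self._request_with_transport_retry()
```

A scripted sequence of one disconnect followed by a success completes with `login_calls == 0`. If the retry after a disconnect surfaces an auth failure, the outer layer logs in once and retries, which mirrors the two scenarios named in the test list.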

Docs

Adds docs/resiliency.md explaining what the library retries (with budgets and on-retry actions), the fail-fast rationale, exponential backoff with full jitter, request-timeout behaviour and how to configure a custom session, and which exceptions consumers should handle themselves. Linked from the nav and cross-referenced from troubleshooting.md.
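
For readers unfamiliar with the term, exponential backoff with full jitter can be sketched as below. The function name and the `base`, `factor`, and `cap` defaults are assumptions chosen for illustration and are not taken from the library.

```python
import random


def full_jitter_delay(attempt, base=1.0, factor=2.0, cap=30.0, rng=random):
    """Return a sleep time in seconds for the given retry number (0-based).

    The ceiling grows exponentially with the retry number; the actual delay is
    drawn uniformly between zero and that ceiling, so delays are randomized
    rather than following a fixed ramp.
    """
    ceiling = min(cap, base * factor ** attempt)
    return rng.uniform(0.0, ceiling)
```

Randomizing the whole interval, rather than adding a small offset to a fixed delay, spreads out retries from many clients that failed at the same moment.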

Tests

  • test_backoff_retries_command_on_connection_failure / test_backoff_gives_up_after_max_tries_on_connection_failure — command/GET path connection retry and give-up.
  • test_backoff_retries_on_server_disconnected_without_relogin — a disconnect is retried without calling login().
  • test_server_disconnected_escalates_to_relogin_when_auth_expired — a retry that surfaces a real auth expiry still triggers relogin.
  • test_default_session_has_request_timeout — the client-owned session applies a default timeout within the retry budget.

Verification

  • 552 tests pass
  • ruff clean
  • mypy clean on client.py

Previously only fetch_events was decorated with
@retry_on_connection_failure, so a transient TimeoutError or
ClientConnectorError raised from any other method — including the
command path (execute_action_group) and all setup/state/refresh
calls — propagated raw on the first occurrence.

Centralize the retry on the _get/_post/_put/_delete helpers so every
request gets uniform transient-connection retry, and drop the now
redundant decorator from fetch_events.

Fixes #2147
@iMicknl iMicknl requested a review from tetienne as a code owner June 22, 2026 14:54
@github-actions github-actions Bot added the bug Something isn't working label Jun 22, 2026
iMicknl added 2 commits June 22, 2026 14:57
Tighten retry_on_connection_failure from 5 tries / 120s to 3 tries /
30s so a flaky connection gives up faster (~3s worst-case sleep)
instead of blocking a command or poll for up to ~15s. Add a test
covering the give-up-after-max-tries path.

The give-up-after-max-tries test already exercises _get through the
decorator, so the GET retry-once test added no coverage beyond the
command-path (_post) regression test.
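
The ~3s and ~15s worst-case sleep figures quoted in the first commit message above are consistent with exponential ceilings of 1s, 2s, 4s, 8s. That schedule is an assumption inferred from the numbers, not something the PR states. Under it, the arithmetic works out as follows.

```python
def worst_case_sleep(max_tries, base=1.0, factor=2.0):
    """Upper bound on total sleep time: N tries means N - 1 sleeps,
    and with full jitter each sleep is at most its exponential ceiling."""
    return sum(base * factor ** n for n in range(max_tries - 1))


print(worst_case_sleep(5))  # 1 + 2 + 4 + 8 -> 15.0
print(worst_case_sleep(3))  # 1 + 2 -> 3.0
```

This bound covers sleep time only; time spent inside each request attempt is limited separately by the request timeout.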
@iMicknl iMicknl changed the title fix: retry transient connection errors on all HTTP requests fix: Retry transient connection errors on all HTTP requests Jun 22, 2026
iMicknl added 4 commits June 22, 2026 15:03
- Treat ServerDisconnectedError as a transient transport failure (retry
  fast, no relogin) instead of an auth error. A genuine session expiry
  still escalates to relogin via the outer auth decorator.
- Add a default per-request ClientTimeout (total=15s, sock_connect=10s)
  to the client-owned session so a hung socket fails fast within the
  connection-retry budget instead of blocking on aiohttp's 300s default.
- Add docs/resiliency.md explaining the retry strategy, timeouts, and
  which exceptions consumers should handle themselves.
- Remove the '(see docs)' comments from client.py now that the strategy
  is fully documented.
- Document that retry delays use exponential ceilings with full jitter,
  so they are randomized rather than a fixed ramp.
@iMicknl iMicknl changed the title fix: Retry transient connection errors on all HTTP requests feat: harden HTTP resiliency with connection retries, request timeout, and docs Jun 28, 2026
@github-actions github-actions Bot added the feature New feature or capability label Jun 28, 2026
@iMicknl iMicknl removed the bug Something isn't working label Jun 28, 2026
@iMicknl iMicknl merged commit 493a3c3 into main Jun 28, 2026
15 checks passed
@iMicknl iMicknl deleted the fix/retry-connection-failure-all-requests branch June 28, 2026 10:03
Labels

feature New feature or capability

Development

Successfully merging this pull request may close these issues.

Transient connection errors (TimeoutError/ClientConnectorError) are only retried in fetch_events