feat: harden HTTP resiliency with connection retries, request timeout, and docs#2148
Merged
Merged
Conversation
Previously only fetch_events was decorated with @retry_on_connection_failure, so a transient TimeoutError or ClientConnectorError raised from any other method — including the command path (execute_action_group) and all setup/state/refresh calls — propagated raw on the first occurrence. Centralize the retry on the _get/_post/_put/_delete helpers so every request gets uniform transient-connection retry, and drop the now redundant decorator from fetch_events. Fixes #2147
Tighten retry_on_connection_failure from 5 tries / 120s to 3 tries / 30s so a flaky connection gives up faster (~3s worst-case sleep) instead of blocking a command or poll for up to ~15s. Add a test covering the give-up-after-max-tries path.
The give-up-after-max-tries test already exercises _get through the decorator, so the GET retry-once test added no coverage beyond the command-path (_post) regression test.
- Treat ServerDisconnectedError as a transient transport failure (retry fast, no relogin) instead of an auth error. A genuine session expiry still escalates to relogin via the outer auth decorator. - Add a default per-request ClientTimeout (total=15s, sock_connect=10s) to the client-owned session so a hung socket fails fast within the connection-retry budget instead of blocking on aiohttp's 300s default. - Add docs/resiliency.md explaining the retry strategy, timeouts, and which exceptions consumers should handle themselves.
- Remove the '(see docs)' comments from client.py now that the strategy is fully documented. - Document that retry delays use exponential ceilings with full jitter, so they are randomized rather than a fixed ramp.
21 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #2147.
Hardens the client's behaviour on transient network failures and documents the full resiliency strategy. Builds on the original connection-retry centralization with a request timeout, smarter
ServerDisconnectedErrorhandling, and user docs.Changes
Centralize connection retry (original fix)
Previously only
fetch_eventswas decorated with@retry_on_connection_failure, so a transientTimeoutError/aiohttp.ClientConnectorErrorraised from any other method — including the command path (execute_action_group->_execute_action_group_direct) and all setup/state/refresh calls — propagated raw on the very first occurrence instead of being retried.This surfaced in Home Assistant (home-assistant/core#173155): a
ConnectionTimeoutError(subclass of the builtinTimeoutError) raised from a coverclosecommand escaped as an unhandled traceback.The retry is now centralized on the
_get/_post/_put/_deleterequest helpers so every request gets uniform transient-connection retry, with connection retry sitting innermost (closest to the request). The budget is intentionally fail-fast: 3 tries within ~30s.Default request timeout
The client-owned session now applies a default per-request timeout (
total=15s,sock_connect=10s), kept below the connection-retry budget so a hung socket fails fast as aTimeoutErrorand is retried, instead of blocking on aiohttp's 300s default.max_timeonly bounds the scheduling of retries between attempts; it cannot interrupt an in-flight request, so the timeout is what makes the fail-fast budget meaningful at runtime.Callers passing their own
ClientSessionown its timeout; pyOverkiz does not override it.ServerDisconnectedErrorno longer forces a reloginServerDisconnectedError(a dropped keep-alive socket) was bucketed withNotAuthenticatedErrorand triggered a fulllogin()+ event-listener re-register on every occurrence. It is now treated as a transient transport failure and simply retried, with no relogin. A genuine session expiry is still reported by the server as aNot authenticatedresponse, which raisesNotAuthenticatedErrorand escalates to relogin via the separate auth policy.Docs
Adds
docs/resiliency.mdexplaining what the library retries (with budgets and on-retry actions), the fail-fast rationale, exponential backoff with full jitter, request-timeout behaviour and how to configure a custom session, and which exceptions consumers should handle themselves. Linked from the nav and cross-referenced fromtroubleshooting.md.Tests
test_backoff_retries_command_on_connection_failure/test_backoff_gives_up_after_max_tries_on_connection_failure— command/GET path connection retry and give-up.test_backoff_retries_on_server_disconnected_without_relogin— a disconnect is retried without callinglogin().test_server_disconnected_escalates_to_relogin_when_auth_expired— a retry that surfaces a real auth expiry still triggers relogin.test_default_session_has_request_timeout— the client-owned session applies a default timeout within the retry budget.Verification
client.py