feat(sdk): debounce ICE blips + emit iceStateChange / networkOnline / networkOffline#1117
Open
tfrere wants to merge 5 commits into
Open
feat(sdk): debounce ICE blips + emit iceStateChange / networkOnline / networkOffline#1117tfrere wants to merge 5 commits into
tfrere wants to merge 5 commits into
Conversation
WHY
───
Routine mobile/laptop network noise (Wi-Fi blip, AP roam, brief
screen-off, BT route change on iOS) would tear down sessions even
when the underlying WebRTC link recovered on its own a moment
later. The pre-existing handler reacted on the FIRST
`iceConnectionState` transition into `disconnected` / `failed`,
firing an `error` event (and rejecting the session promise) before
the spec-defined recovery window had a chance to play out.
Consumer-side symptoms:
- Mobile shell: full-screen "session ended unexpectedly" on a 2 s
Wi-Fi flicker.
- Embedded Spaces / dev consoles: red banner during a normal
rapid app-switch on iOS.
- The mobile app's own OpenAI bridge already debounced ICE blips
on its peer (`ICE_DISCONNECT_GRACE_MS = 5000`); the robot peer
didn't — so the AI side survived while Reachy dropped.
WHAT CHANGES
────────────
1. `oniceconnectionstatechange` is rewritten to debounce:
- `disconnected` → 3 s grace (deferred to `visibilitychange`
if the tab is hidden, since timers are throttled there).
- `failed` → 1 s grace before rejecting the session.
- `connected` / `completed` → cancel any pending grace.
The original `error` payload shape is preserved so existing
consumers that already listen for `error` keep working.
2. A new public event `iceStateChange` fires on EVERY transition.
Consumers that want a finer-grained "Reconnecting…" badge or
their own restart logic can subscribe instead of attaching a
custom handler to `_pc`.
3. Two new public events `networkOnline` / `networkOffline` forward
`window.online` / `window.offline` and (on supporting browsers)
`navigator.connection.change`. Listeners are scoped to the
live session — installed in `startSession()`, removed in
`stopSession()` / `disconnect()`. Lets every consumer (mobile
shell, iframed Spaces, standalone Spaces) react to transport
swaps without re-implementing the plumbing.
4. `stopSession()` and `disconnect()` symmetrically tear down the
grace timer + visibility handler + network listeners BEFORE
closing `_pc`, so a queued callback can't dereference a dead
peer connection a couple of ticks later.
NON-GOALS (deferred)
────────────────────
This SDK is the SDP answerer (the daemon's GStreamer `webrtcsink`
is the offerer). A true end-to-end ICE restart therefore
requires the daemon to emit a new offer with `iceRestart=true`,
which means a new signaling-protocol message + GStreamer-side
wiring. Tracked as a separate follow-up; this PR is the
client-side debounce + observability layer it will eventually
plug into.
Replaces the abandoned WIP commit 5e35bec on the
`test---sdh-improvement` branch, which only stripped the
`disconnected` error event without adding the grace window or
the visibility-aware deferral.
Co-authored-by: Cursor <cursoragent@cursor.com>
Polish pass on the prior commit. No behaviour change for the existing event payloads; one new event is added. 1. Extract `_initResilienceState()` from the constructor. The constructor was growing a 25-line resilience-fields block that obscured the rest of the bring-up. Move it to a dedicated private method with its own JSDoc explaining what each field backs and which sites mutate / reset it. Net: -22 lines in the constructor, +1 method whose intent is obvious from its name. 2. Split `networkChange` out of `networkOnline`. `window.online` and `navigator.connection.change` are semantically different: the first flips on connectivity recovery, the second on transport swaps (Wi-Fi → 4G, AP roam) that don't necessarily go through `offline`. Aliasing `connection.change` to `networkOnline` (as in the prior commit) would surprise consumers that expect a symmetric `offline` → `online` flip. `networkChange` carries the NetworkInformation snapshot (`effectiveType`, `downlink`, `rtt`, `saveData`) so consumers can react finely (e.g. cap bitrate on `effectiveType === '2g'`) without re-reading `navigator.connection` themselves. 3. Fix the coalesce comment in `_scheduleIceGrace`. The prior text only described one direction (`disconnected → failed`) but the code is symmetric (the reverse can fire on some Android WebViews). Clarify that "latest signal wins". 4. Trim redundant prose in module-top constants, in `startSession()` where the resilience listeners install, and in the cleanup blocks of `stopSession()` / `disconnect()`. Keeps the comment-to-code ratio readable without losing the WHY (which stays in the dedicated JSDoc blocks on the helpers). Co-authored-by: Cursor <cursoragent@cursor.com>
Adds an explicit JSDoc section explaining what the grace + network awareness pass does NOT cover, so consumers don't design their reconnect UX around primitives that aren't shipping yet: - Spec-compliant ICE restart cannot be initiated from the SDK alone. `pc.restartIce()` only flags the local agent; an *ICE-restart SDP offer* must originate from the offerer side (the daemon's `webrtcsink`) and reach us via the central signaling relay. Until that wiring exists, the SDK can only debounce + report - it can't recover an actually-broken transport. - Full peer-connection recreation (LiveKit-style "full reconnect" under the same session/peer id) is also out of scope: the daemon has to be told "rebuild a fresh offer for this same session" when it sees the local PC drop, otherwise it treats the disconnect as terminal and releases the slot. Both items are tracked separately (P2 / P3). What this SDK ships today is the foundation they'll build on - stable transient classification, scoped network listeners, deterministic teardown. Consumers (mobile shell, embedded conversation app, sub-app Spaces) should design their reconnect UX against this surface so it stays source-compatible when the offerer-side restart lands. Pure docstring; no behaviour change. Co-authored-by: Cursor <cursoragent@cursor.com>
…nection fix/drop hopeless reconnection
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
Routine mobile/laptop network noise (Wi-Fi blip, AP roam, brief screen-off, BT route change on iOS) was tearing down WebRTC sessions even when the underlying ICE link recovered a moment later. The previous handler reacted on the FIRST transition into
disconnected/failed, firingerror(and rejecting the session promise) before the spec-defined recovery window had a chance to play out.Concrete consumer-side symptoms today:
This PR rewrites the SDK side using the same pattern the OpenAI bridge uses, and adds two public events that let every consumer (mobile shell, iframed Spaces, standalone Spaces) react to transport changes without re-implementing the plumbing.
What changes
1.
oniceconnectionstatechangeis debouncedconnected/completeddisconnectedvisibilitychange(background timers are throttled, so a foreground timer would fire unpredictably). On grace expiry: if stilldisconnected, emiterrorwith the same payload shape as before.failederror. We've observed realfailed → connectedtransitions on rapid AP roams and BT route changes; 1 s is enough to absorb those without delaying a real failure noticeably.Existing consumers listening for
errorkeep working.2. New event `iceStateChange`
Fires on every `iceConnectionState` transition with `{ state: RTCIceConnectionState }`. Lets consumers render finer-grained UX (e.g. a transient "Reconnecting…" badge during `disconnected`) without attaching their own handler to `_pc`.
3. New events `networkOnline` / `networkOffline`
Forwarded from `window.online` / `window.offline` and (on browsers that ship it: Chrome, Android WebView) `navigator.connection.change`. Listeners are scoped to the live session - installed in `startSession()`, removed in `stopSession()` / `disconnect()`. Lets the mobile shell / iframed Spaces freeze motion + show a badge on offline and probe the data channel on online, without each consumer wiring its own browser listeners.
4. Symmetric cleanup
`stopSession()` and `disconnect()` both call `_clearIceGrace()` and `_uninstallNetworkListeners()` BEFORE closing `_pc`, so a queued grace callback can't dereference a dead peer connection a couple of ticks later.
Non-goals (deferred)
This SDK is the SDP answerer (the daemon's GStreamer `webrtcsink` is the offerer). A true end-to-end ICE restart therefore requires the daemon to emit a new offer with `iceRestart=true`, which means a new signaling-protocol message + GStreamer-side wiring on `reachy_mini_central` and the daemon. That's a separate follow-up; this PR is the client-side debounce + observability layer it will eventually plug into.
Also out of scope: TURN servers (no STUN-only configuration changed here), full PC recreation as a final fallback.
Replaces
Supersedes the abandoned WIP commit `5e35bec2` on the `test---sdh-improvement` branch, which only stripped the `disconnected` error event without adding the grace window or the visibility-aware deferral.
Test plan
Made with Cursor