Skip to content

feat(sdk): debounce ICE blips + emit iceStateChange / networkOnline / networkOffline#1117

Open
tfrere wants to merge 5 commits into
mainfrom
feat/sdk-resilience-grace-and-net-events
Open

feat(sdk): debounce ICE blips + emit iceStateChange / networkOnline / networkOffline#1117
tfrere wants to merge 5 commits into
mainfrom
feat/sdk-resilience-grace-and-net-events

Conversation

@tfrere
Copy link
Copy Markdown
Contributor

@tfrere tfrere commented May 12, 2026

Why

Routine mobile/laptop network noise (Wi-Fi blip, AP roam, brief screen-off, BT route change on iOS) was tearing down WebRTC sessions even when the underlying ICE link recovered a moment later. The previous handler reacted on the FIRST transition into disconnected / failed, firing error (and rejecting the session promise) before the spec-defined recovery window had a chance to play out.

Concrete consumer-side symptoms today:

  • Mobile shell: full-screen "session ended unexpectedly" on a 2 s Wi-Fi flicker.
  • Embedded Spaces / dev consoles: red banner during a normal rapid app-switch on iOS.
  • The mobile app's own OpenAI bridge already debounces ICE blips on its peer (5 s grace). The robot peer didn't, so the AI side survived while Reachy dropped.

This PR rewrites the SDK side using the same pattern the OpenAI bridge uses, and adds two public events that let every consumer (mobile shell, iframed Spaces, standalone Spaces) react to transport changes without re-implementing the plumbing.

What changes

1. oniceconnectionstatechange is debounced

State Behaviour
connected / completed Clears any pending grace, marks ICE up, proceeds with session-ready check (unchanged).
disconnected Schedules a 3 s grace. If the tab is hidden, defers the grace to the next visibilitychange (background timers are throttled, so a foreground timer would fire unpredictably). On grace expiry: if still disconnected, emit error with the same payload shape as before.
failed Schedules a 1 s grace before rejecting the session promise + emitting error. We've observed real failed → connected transitions on rapid AP roams and BT route changes; 1 s is enough to absorb those without delaying a real failure noticeably.

Existing consumers listening for error keep working.

2. New event `iceStateChange`

Fires on every `iceConnectionState` transition with `{ state: RTCIceConnectionState }`. Lets consumers render finer-grained UX (e.g. a transient "Reconnecting…" badge during `disconnected`) without attaching their own handler to `_pc`.

3. New events `networkOnline` / `networkOffline`

Forwarded from `window.online` / `window.offline` and (on browsers that ship it: Chrome, Android WebView) `navigator.connection.change`. Listeners are scoped to the live session - installed in `startSession()`, removed in `stopSession()` / `disconnect()`. Lets the mobile shell / iframed Spaces freeze motion + show a badge on offline and probe the data channel on online, without each consumer wiring its own browser listeners.

4. Symmetric cleanup

`stopSession()` and `disconnect()` both call `_clearIceGrace()` and `_uninstallNetworkListeners()` BEFORE closing `_pc`, so a queued grace callback can't dereference a dead peer connection a couple of ticks later.

Non-goals (deferred)

This SDK is the SDP answerer (the daemon's GStreamer `webrtcsink` is the offerer). A true end-to-end ICE restart therefore requires the daemon to emit a new offer with `iceRestart=true`, which means a new signaling-protocol message + GStreamer-side wiring on `reachy_mini_central` and the daemon. That's a separate follow-up; this PR is the client-side debounce + observability layer it will eventually plug into.

Also out of scope: TURN servers (no STUN-only configuration changed here), full PC recreation as a final fallback.

Replaces

Supersedes the abandoned WIP commit `5e35bec2` on the `test---sdh-improvement` branch, which only stripped the `disconnected` error event without adding the grace window or the visibility-aware deferral.

Test plan

  • Healthy session: connect, talk, stop. `iceStateChange` visible in DevTools (`checking` → `connected`). No regression.
  • Wi-Fi flicker (toggle AP off ~2 s): session survives, `iceStateChange` shows `disconnected → connected`, no `error`.
  • Airplane mode 5 s then off: `networkOffline` fires, then `networkOnline`, ICE recovers, session survives.
  • Hard network kill (10 s+): grace expires, `error` is emitted with the spec-compliant payload. Session can be restarted cleanly.
  • iOS Safari WKWebView: lock screen 30 s during a session, unlock. Visibility-aware deferral fires correctly on return.
  • Mobile-app vendored SDK swap: re-vendor `vendor/reachy-mini.js` once merged.

Made with Cursor

tfrere and others added 5 commits May 12, 2026 23:19
WHY
───
Routine mobile/laptop network noise (Wi-Fi blip, AP roam, brief
screen-off, BT route change on iOS) would tear down sessions even
when the underlying WebRTC link recovered on its own a moment
later.  The pre-existing handler reacted on the FIRST
`iceConnectionState` transition into `disconnected` / `failed`,
firing an `error` event (and rejecting the session promise) before
the spec-defined recovery window had a chance to play out.

Consumer-side symptoms:
  - Mobile shell: full-screen "session ended unexpectedly" on a 2 s
    Wi-Fi flicker.
  - Embedded Spaces / dev consoles: red banner during a normal
    rapid app-switch on iOS.
  - The mobile app's own OpenAI bridge already debounced ICE blips
    on its peer (`ICE_DISCONNECT_GRACE_MS = 5000`); the robot peer
    didn't — so the AI side survived while Reachy dropped.

WHAT CHANGES
────────────
1. `oniceconnectionstatechange` is rewritten to debounce:
     - `disconnected` → 3 s grace (deferred to `visibilitychange`
       if the tab is hidden, since timers are throttled there).
     - `failed`       → 1 s grace before rejecting the session.
     - `connected` / `completed` → cancel any pending grace.
   The original `error` payload shape is preserved so existing
   consumers that already listen for `error` keep working.

2. A new public event `iceStateChange` fires on EVERY transition.
   Consumers that want a finer-grained "Reconnecting…" badge or
   their own restart logic can subscribe instead of attaching a
   custom handler to `_pc`.

3. Two new public events `networkOnline` / `networkOffline` forward
   `window.online` / `window.offline` and (on supporting browsers)
   `navigator.connection.change`.  Listeners are scoped to the
   live session — installed in `startSession()`, removed in
   `stopSession()` / `disconnect()`.  Lets every consumer (mobile
   shell, iframed Spaces, standalone Spaces) react to transport
   swaps without re-implementing the plumbing.

4. `stopSession()` and `disconnect()` symmetrically tear down the
   grace timer + visibility handler + network listeners BEFORE
   closing `_pc`, so a queued callback can't dereference a dead
   peer connection a couple of ticks later.

NON-GOALS (deferred)
────────────────────
This SDK is the SDP answerer (the daemon's GStreamer `webrtcsink`
is the offerer).  A true end-to-end ICE restart therefore
requires the daemon to emit a new offer with `iceRestart=true`,
which means a new signaling-protocol message + GStreamer-side
wiring.  Tracked as a separate follow-up; this PR is the
client-side debounce + observability layer it will eventually
plug into.

Replaces the abandoned WIP commit 5e35bec on the
`test---sdh-improvement` branch, which only stripped the
`disconnected` error event without adding the grace window or
the visibility-aware deferral.

Co-authored-by: Cursor <cursoragent@cursor.com>
Polish pass on the prior commit. No behaviour change for the existing
event payloads; one new event is added.

1. Extract `_initResilienceState()` from the constructor.
   The constructor was growing a 25-line resilience-fields block
   that obscured the rest of the bring-up. Move it to a dedicated
   private method with its own JSDoc explaining what each field
   backs and which sites mutate / reset it. Net: -22 lines in the
   constructor, +1 method whose intent is obvious from its name.

2. Split `networkChange` out of `networkOnline`.
   `window.online` and `navigator.connection.change` are
   semantically different: the first flips on connectivity
   recovery, the second on transport swaps (Wi-Fi → 4G, AP roam)
   that don't necessarily go through `offline`. Aliasing
   `connection.change` to `networkOnline` (as in the prior
   commit) would surprise consumers that expect a symmetric
   `offline` → `online` flip.

   `networkChange` carries the NetworkInformation snapshot
   (`effectiveType`, `downlink`, `rtt`, `saveData`) so consumers
   can react finely (e.g. cap bitrate on `effectiveType === '2g'`)
   without re-reading `navigator.connection` themselves.

3. Fix the coalesce comment in `_scheduleIceGrace`.
   The prior text only described one direction (`disconnected →
   failed`) but the code is symmetric (the reverse can fire on
   some Android WebViews). Clarify that "latest signal wins".

4. Trim redundant prose in module-top constants, in
   `startSession()` where the resilience listeners install, and
   in the cleanup blocks of `stopSession()` / `disconnect()`.
   Keeps the comment-to-code ratio readable without losing the
   WHY (which stays in the dedicated JSDoc blocks on the helpers).

Co-authored-by: Cursor <cursoragent@cursor.com>
Adds an explicit JSDoc section explaining what the grace + network
awareness pass does NOT cover, so consumers don't design their
reconnect UX around primitives that aren't shipping yet:

- Spec-compliant ICE restart cannot be initiated from the SDK alone.
  `pc.restartIce()` only flags the local agent; an *ICE-restart SDP
  offer* must originate from the offerer side (the daemon's
  `webrtcsink`) and reach us via the central signaling relay. Until
  that wiring exists, the SDK can only debounce + report - it can't
  recover an actually-broken transport.
- Full peer-connection recreation (LiveKit-style "full reconnect"
  under the same session/peer id) is also out of scope: the daemon
  has to be told "rebuild a fresh offer for this same session" when
  it sees the local PC drop, otherwise it treats the disconnect as
  terminal and releases the slot.

Both items are tracked separately (P2 / P3). What this SDK ships
today is the foundation they'll build on - stable transient
classification, scoped network listeners, deterministic teardown.
Consumers (mobile shell, embedded conversation app, sub-app
Spaces) should design their reconnect UX against this surface so
it stays source-compatible when the offerer-side restart lands.

Pure docstring; no behaviour change.

Co-authored-by: Cursor <cursoragent@cursor.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants