
Update libsrt to v1.5.5 and harden Mux SRT streaming #1087

Merged
fusion2004 merged 3 commits into main from update-srt-version
May 21, 2026

@fusion2004
Owner

Summary

  • Pin libsrt build to v1.5.5 (was a 3-year-old default from @eyevinn/srt)
  • Switch Renovate to config:best-practices + group:allNonMajor
  • Make Mux SRT streaming survive transient "srt_sendmsg2: Connection was broken" drops by reconnecting in-place, with stats logging for diagnosis

The streaming changes

Production hit "Connection was broken" mid-party twice, which killed the stream both times. Three coordinated changes make these recoverable:

  • Mux livestream: set reconnect_window: 30. Low Latency mode defaults this to 0 (rejects any reconnect), so we have to opt back in.
  • SRT caller: poll srt_bstats every 2s and log RTT, retrans/loss counters, send-rate, flight size, and send-buffer depth. The libsrt error string doesn't tell us why the connection broke; this gives us ground truth for future incidents.
  • streamPacketFile: on "Connection was broken", tear down the broken caller and reopen against the same livestream creds, retrying the failed write. Bounded to 5 attempts per file invocation. packetCtx is preserved across the gap so PCR/PTS stay continuous. Disconnect/reconnect messages route through debugWarn/debugInfo so they reach the Discord debug channel.
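A minimal sketch of the bounded reconnect-and-retry loop described in the third bullet. The `Caller` interface, `open` callback, and `writeWithReconnect` name are hypothetical stand-ins, not the @eyevinn/srt API or this repo's actual code; only the 5-attempt bound and the "Connection was broken" match come from the description above.

```typescript
// Hypothetical caller shape; the real SRT binding will look different.
interface Caller {
  write(chunk: Uint8Array): Promise<void>;
  close(): Promise<void>;
}

const MAX_RECONNECTS = 5; // bound per file invocation, as stated in the PR

function isBrokenConnection(err: unknown): boolean {
  return err instanceof Error && err.message.includes("Connection was broken");
}

// Writes every chunk in order. When a write fails with a broken connection,
// the caller is torn down, reopened, and the same chunk is written again.
// Returns the number of reconnects that were needed.
async function writeWithReconnect(
  chunks: Uint8Array[],
  open: () => Promise<Caller>,
): Promise<number> {
  let caller = await open();
  let reconnects = 0;
  try {
    for (const chunk of chunks) {
      for (;;) {
        try {
          // Copy per attempt so a buffer consumed by a failed write is never reused.
          await caller.write(chunk.slice());
          break;
        } catch (err) {
          if (!isBrokenConnection(err) || reconnects >= MAX_RECONNECTS) throw err;
          reconnects += 1;
          await caller.close().catch(() => undefined); // best-effort teardown
          caller = await open(); // same credentials, new connection
        }
      }
    }
  } finally {
    await caller.close().catch(() => undefined);
  }
  return reconnects;
}
```

Any state that must stay continuous across the gap (packetCtx in the PR) would live outside this function so a reconnect never resets it.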

Trade-off worth knowing: a reconnected write re-sends the chunk that failed, which can produce a CC duplicate on Mux's side if libsrt had partially delivered. Acceptable vs. a dead party; the new stats logging will tell us if it's actually a problem.
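To check whether the trade-off above shows up in practice, a captured transport stream can be scanned for continuity counter anomalies. This sketch assumes "CC" refers to the 4-bit MPEG-TS continuity counter and that the stream is plain 188-byte TS packets; `scanContinuity` is a hypothetical helper, not something in this PR.

```typescript
const TS_PACKET_SIZE = 188;
const NULL_PID = 0x1fff;

interface CcReport {
  duplicates: number;      // same counter value seen twice in a row on a PID
  discontinuities: number; // counter jumped to anything other than prev + 1
}

// Walks 188-byte TS packets and tracks the continuity counter per PID.
function scanContinuity(data: Uint8Array): CcReport {
  const lastCc = new Map<number, number>();
  const report: CcReport = { duplicates: 0, discontinuities: 0 };
  for (let off = 0; off + TS_PACKET_SIZE <= data.length; off += TS_PACKET_SIZE) {
    if (data[off] !== 0x47) continue; // not aligned on a sync byte; skip
    const pid = ((data[off + 1] & 0x1f) << 8) | data[off + 2];
    const hasPayload = (data[off + 3] & 0x10) !== 0;
    const cc = data[off + 3] & 0x0f;
    // The counter only advances on packets that carry payload; null packets are ignored.
    if (pid === NULL_PID || !hasPayload) continue;
    const prev = lastCc.get(pid);
    if (prev !== undefined) {
      if (cc === prev) report.duplicates += 1;
      else if (cc !== ((prev + 1) & 0x0f)) report.discontinuities += 1;
    }
    lastCc.set(pid, cc);
  }
  return report;
}
```

A re-sent multi-packet chunk would more likely register as a discontinuity (the counter steps backwards) than as a single duplicate, so both counts are worth watching.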

Test plan

  • mise run lint clean
  • mise run test — 178 tests passing
  • Run a real listening party against Mux and confirm the stats log line appears in pino output every ~2s
  • Confirm that a forced disconnect (e.g. SIGSTOP libsrt's worker, or kill network briefly) triggers the reconnect path and the party continues; verify the disconnect + reconnect messages show up in the Discord debug channel
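For reference when checking the test plan, a stats poller along these lines would produce one log line every ~2s. The field names are taken from libsrt's SRT_TRACEBSTATS struct, but how the binding exposes them, the `read` callback, and the log line format are all assumptions rather than this PR's code.

```typescript
// Subset of libsrt's SRT_TRACEBSTATS fields matching what the PR says it logs.
interface SrtStats {
  msRTT: number;            // round-trip time, ms
  pktRetransTotal: number;  // retransmitted packets
  pktSndLossTotal: number;  // packets reported lost by the receiver
  mbpsSendRate: number;     // sending rate, Mbps
  pktFlightSize: number;    // packets in flight
  msSndBuf: number;         // send-buffer depth, ms
}

// Hypothetical one-line format; the real log line may differ.
function formatStats(s: SrtStats): string {
  return (
    `srt stats rtt=${s.msRTT.toFixed(1)}ms retrans=${s.pktRetransTotal} ` +
    `loss=${s.pktSndLossTotal} rate=${s.mbpsSendRate.toFixed(2)}Mbps ` +
    `flight=${s.pktFlightSize} sndbuf=${s.msSndBuf}ms`
  );
}

// Polls `read` on a fixed interval and passes each formatted line to `log`.
// Returns a function that stops the polling.
function startStatsPolling(
  read: () => SrtStats,
  log: (line: string) => void,
  intervalMs = 2000,
): () => void {
  const timer = setInterval(() => {
    try {
      log(formatStats(read()));
    } catch {
      // The socket may be mid-reconnect; skip this tick rather than crash.
    }
  }, intervalMs);
  return () => clearInterval(timer);
}
```

Returning a stop function lets the reconnect path cancel polling on the broken caller before starting it on the new one.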

🤖 Generated with Claude Code

fusion2004 and others added 3 commits May 21, 2026 01:00
@eyevinn/srt defaults to a 3-year-old version; this puts us on latest
stable.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Use the config:best-practices preset with group:allNonMajor instead of
hand-rolling the non-major grouping. Keep the node engines opt-out and
the bump rangeStrategy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Production has been seeing "Connection was broken" from srt_sendmsg2
mid-stream, which kills the whole party. Three changes work together to
make these recoverable:

- Mux livestream: set reconnect_window to 30s. Low Latency mode defaults
  this to 0 (no reconnect allowed), so we have to opt back in.
- SRT caller: poll srt_bstats every 2s and log RTT, retrans/loss
  counters, send-rate, and buffer depth. Until now we had no ground
  truth on why connections drop; this gives us one.
- streamPacketFile: on "Connection was broken", tear down the broken
  caller and reopen against the same livestream creds, retrying the
  failed write. Bounded to 5 attempts per file invocation. packetCtx is
  preserved so PCR/PTS stay continuous across the gap. Disconnect and
  reconnect messages both reach the Discord debug channel via
  debugWarn/debugInfo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@fusion2004 fusion2004 merged commit b5b0c16 into main May 21, 2026
6 checks passed
@fusion2004 fusion2004 deleted the update-srt-version branch May 21, 2026 15:05
fusion2004 added a commit that referenced this pull request May 21, 2026
Two bugs surfaced in production stats and logs after #1087 deployed:

- srtWrite retried the same Buffer after a reconnect, but @eyevinn/srt
  transfers the chunk's ArrayBuffer to its worker via postMessage on
  the first call, leaving it detached in our process. The retry threw
  "Cannot transfer object of unsupported type" and killed the party.
  Refactor srtWrite to take the persistent chunkBuf as a source and
  allocate a fresh Buffer.allocUnsafeSlow copy on each loop iteration.

- 15-second pacing lead against a 1000ms SRTO_LATENCY meant libsrt was
  TLPKTDROP'ing ~50% of attempted packets from the very first stats
  snapshot (msSndBuf hovering at ~800ms with zero loss/retrans). Mux
  saw 26 seconds of gappy audio before dropping the connection.
  Convert PACING_LEAD_SEC=15 to PACING_LEAD_MS=900 so the lead stays
  comfortably under the SRT latency window.
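The first bug can be reproduced in isolation. In this sketch `structuredClone` with a transfer list stands in for the worker `postMessage` hand-off; `freshCopy` and `simulateTransfer` are hypothetical names, not the refactored srtWrite itself.

```typescript
import { Buffer } from "node:buffer";

// Returns a copy of `source` backed by its own unpooled ArrayBuffer,
// so handing it off cannot affect `source` or any shared pool.
function freshCopy(source: Uint8Array): Buffer {
  const copy = Buffer.allocUnsafeSlow(source.byteLength);
  copy.set(source); // overwrite every byte; allocUnsafeSlow memory is uninitialized
  return copy;
}

// Stand-in for the worker hand-off: transferring an ArrayBuffer detaches it
// on the sender's side, leaving any view over it with length 0.
function simulateTransfer(chunk: Buffer): void {
  const backing = chunk.buffer as ArrayBuffer;
  structuredClone(backing, { transfer: [backing] });
}

const persistent = Buffer.from([1, 2, 3, 4]);

const firstAttempt = freshCopy(persistent);
simulateTransfer(firstAttempt);
// firstAttempt is now detached and must not be reused for a retry.

const retryAttempt = freshCopy(persistent); // a retry copies from the intact source again
```

Keeping the persistent buffer on the caller's side and only ever handing off copies costs one allocation per write attempt, which is what makes the retry after a reconnect safe.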

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
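The pacing fix in the second bullet comes down to keeping the lead below the SRT latency window. A rough sketch of that relationship follows; the pacing model and the `sendDelayMs` and `assertLeadFitsLatency` helpers are assumptions for illustration, and only the 900 ms and 1000 ms values come from the commit message.

```typescript
const SRT_LATENCY_MS = 1000; // SRTO_LATENCY value quoted in the commit message
const PACING_LEAD_MS = 900;  // replaces PACING_LEAD_SEC = 15

// Fails fast if the sender would run further ahead than the latency window allows.
function assertLeadFitsLatency(leadMs: number, latencyMs: number): void {
  if (leadMs >= latencyMs) {
    throw new RangeError(
      `pacing lead ${leadMs}ms must stay under SRT latency ${latencyMs}ms`,
    );
  }
}

// Assumed pacing model: a packet with media timestamp `ptsMs` becomes eligible
// to send `leadMs` before its real-time position. Returns how long to wait.
function sendDelayMs(ptsMs: number, elapsedMs: number, leadMs: number): number {
  return Math.max(0, ptsMs - leadMs - elapsedMs);
}

assertLeadFitsLatency(PACING_LEAD_MS, SRT_LATENCY_MS);
```

Under this model a 15-second lead makes every packet in the first 15 seconds eligible immediately, which is consistent with the send buffer filling to the latency limit from the first stats snapshot.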