Feat/webrtc transport with Sam config merged on top #2595
Draft
spomichter wants to merge 99 commits into
…e SFU

Implements a new pubsub transport backed by WebRTC DataChannels over Cloudflare's Realtime SFU.

Two new classes in dimos/protocol/pubsub/impl/webrtcpubsub.py:
- CloudflareSession: manages the WebRTC PeerConnection lifecycle. Opens two CF sessions (publisher + subscriber) so a single process can do loopback pubsub. Runs aiortc on a dedicated background asyncio thread with its own ThreadPoolExecutor (so we don't leak asyncio_N worker threads). Uses negotiated=True placeholder DCs with id=100 during transport establishment to avoid stream-id collisions with CF-assigned ids.
- WebRTCPubSub: bytes-on-the-wire pubsub facade matching the LCMPubSubBase / BytesSharedMemory interface (string topics, bytes payloads). Lazily creates pub/sub DataChannel pairs on first publish/subscribe per topic.

Also adds:
- WebRTCTransport in dimos/core/transport.py (mirrors LCMTransport pattern, no encoding - bytes only).
- WebRTC benchmark testcase in dimos/protocol/pubsub/benchmark/testdata.py, gated on aiortc + CF_TELEOP_APP_ID / CF_TELEOP_APP_SECRET env vars.
- Integration test in dimos/protocol/pubsub/impl/test_webrtcpubsub.py covering basic pub/sub, latency, and throughput (all live tests skip without CF credentials).
- aiortc + httpx as new 'webrtc' optional extra in pyproject.toml.

Live benchmark (us-east-2 -> CF edge):
- 64-256B: ~10K msgs/s, 0% loss
- 1KiB: ~7K msgs/s, 0% loss
- >= 64KiB: dropped (above SCTP message size)
- Median single-RTT: ~2.5 ms
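For readers unfamiliar with the "dedicated background asyncio thread with its own ThreadPoolExecutor" arrangement, here is a minimal sketch of the general pattern. It is not the CloudflareSession code: `BackgroundLoop`, the `webrtc-io` thread prefix, and the helper coroutine are hypothetical names used only for illustration. Only standard-library calls are used.

```python
import asyncio
import threading
from concurrent.futures import ThreadPoolExecutor


class BackgroundLoop:
    """Hypothetical helper: one asyncio loop on a daemon thread with a bounded executor."""

    def __init__(self, max_workers: int = 2) -> None:
        self._loop = asyncio.new_event_loop()
        self._executor = ThreadPoolExecutor(
            max_workers=max_workers, thread_name_prefix="webrtc-io"
        )
        # run_in_executor(None, ...) now uses our pool, not an implicit default one.
        self._loop.set_default_executor(self._executor)
        self._thread = threading.Thread(
            target=self._run, name="webrtc-loop", daemon=True
        )
        self._thread.start()

    def _run(self) -> None:
        asyncio.set_event_loop(self._loop)
        self._loop.run_forever()

    def run(self, coro, timeout: float = 5.0):
        """Submit a coroutine from any thread and block for its result."""
        return asyncio.run_coroutine_threadsafe(coro, self._loop).result(timeout)

    def stop(self) -> None:
        """Stop the loop, join the thread, then release the worker pool."""
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join(timeout=5.0)
        self._executor.shutdown(wait=True)
        self._loop.close()


async def worker_thread_name() -> str:
    """Report which thread services blocking work submitted from the loop."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(
        None, lambda: threading.current_thread().name
    )
```

Owning the executor means shutdown can join its workers explicitly, which is the property the commit message is after when it mentions leaked worker threads.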
…sport

- Add BrokerProvider: DataChannelProvider that works through the hosted teleop broker (dimensional-teleop) instead of directly with CF credentials. Handles session registration, heartbeat loop, and DataChannel creation when an operator joins via the broker's bridge-datachannel API.
- Extend WebRTCTransport with optional msg_type parameter for typed LCM encode/decode with fingerprint-based filtering. Multiple transports can share a single multiplexed DataChannel and each receives only its type.
- Add hosted teleop blueprints (dimos/teleop/hosted/) demonstrating the module-free architecture: make_teleop_hosted_go2() uses pure transport (zero modules), make_teleop_hosted_go2_scaled() adds a thin TeleopScalerModule for speed scaling only.
- Add unit tests for typed mode, fingerprint filtering, multiplexed dispatch, and BrokerProvider credential validation.
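The multiplexed dispatch described in this commit can be pictured with a small sketch. It assumes each payload starts with an 8-byte type fingerprint, as LCM-encoded messages do; `FingerprintDemux` and the two fingerprint constants are made up for illustration and are not taken from the PR.

```python
import struct
from typing import Callable

Handler = Callable[[bytes], None]


class FingerprintDemux:
    """Hypothetical demultiplexer keyed on the first 8 bytes of each payload."""

    def __init__(self) -> None:
        self._handlers: dict[bytes, list[Handler]] = {}

    def register(self, fingerprint: bytes, handler: Handler) -> None:
        if len(fingerprint) != 8:
            raise ValueError("fingerprint must be exactly 8 bytes")
        self._handlers.setdefault(fingerprint, []).append(handler)

    def dispatch(self, payload: bytes) -> int:
        """Deliver to handlers whose fingerprint matches; return how many ran."""
        handlers = self._handlers.get(payload[:8], [])
        for handler in handlers:
            handler(payload)
        return len(handlers)


# Made-up fingerprints for illustration only; real ones come from the message types.
TWIST_FP = struct.pack(">Q", 0x1111111111111111)
IMAGE_FP = struct.pack(">Q", 0x2222222222222222)
```

Payloads whose prefix matches no registered fingerprint are silently ignored, which is what lets several typed transports share one channel.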
- Rebase on main and regenerate uv.lock (resolve conflict)
- Add _LoopbackProvider (in-process, no network) to benchmark testdata
- Enables local WebRTC transport benchmarking without CF credentials
- All 12 message sizes pass locally (2.78s total)
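As a rough illustration of what an in-process, no-network stand-in with string topics and bytes payloads can look like, consider the sketch below. It is not the `_LoopbackProvider` from this PR; the class name, the `(payload, topic)` callback order, and the returned unsubscribe function are all assumptions.

```python
from collections import defaultdict
from typing import Callable

Callback = Callable[[bytes, str], None]


class LoopbackPubSub:
    """Hypothetical in-process pubsub: string topics, bytes payloads, no network."""

    def __init__(self) -> None:
        # Per-topic subscriber lists are created lazily on first subscribe.
        self._subs: dict[str, list[Callback]] = defaultdict(list)

    def publish(self, topic: str, payload: bytes) -> None:
        # Iterate over a snapshot so a callback may unsubscribe during delivery.
        for callback in list(self._subs.get(topic, ())):
            callback(payload, topic)

    def subscribe(self, topic: str, callback: Callback) -> Callable[[], None]:
        self._subs[topic].append(callback)

        def unsubscribe() -> None:
            self._subs[topic].remove(callback)

        return unsubscribe
```

Because delivery is synchronous and local, a benchmark run against something like this measures harness and encoding overhead rather than network behaviour.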
The previous lock regen dropped the `exclude-newer-span` marker, leaving only the frozen `exclude-newer` timestamp. uv then treats every resolve as "cooldown was newly added" and forces a re-resolve against today minus 7 days — which currently excludes md-babel-py 1.2.0 (published 2026-05-15) and breaks `uv sync --extra all` / `uv lock`. Re-adding the span line tells uv the lock was generated with P7D semantics, so the existing pinned versions are honored.
- Remove __init__.py files (project policy: no init files)
- Remove section markers from test_webrtcpubsub.py
- Regenerate all_blueprints.py (adds TeleopScalerModule)
- Fix WebRTCTransport.__reduce__ to preserve msg_type across pickle
- Fix CloudflareProvider.publish() race: snapshot loop ref before use
- Fix CloudflareProvider.subscribe() race: check sub_channels inside lock
- Add comment clarifying TwistStamped→Twist type safety in blueprint
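The `__reduce__` fix is easier to follow with a toy example. The sketch below shows the general technique of rebuilding an object through a module-level helper so that constructor arguments such as `msg_type` survive a pickle round trip even though the instance holds an unpicklable lock. `PickleSafeTransport`, `Twist`, and `_rebuild_transport` are hypothetical stand-ins, not the PR's classes.

```python
import threading
from typing import Optional


class Twist:
    """Hypothetical stand-in for a message class."""


def _rebuild_transport(cls, topic, msg_type):
    """Module-level helper so pickle can locate it by qualified name."""
    return cls(topic, msg_type)


class PickleSafeTransport:
    def __init__(self, topic: str, msg_type: Optional[type] = None) -> None:
        self.topic = topic
        self.msg_type = msg_type
        # Locks cannot be pickled, so default pickling of this object would fail.
        self._init_lock = threading.Lock()

    def __reduce__(self):
        # Carry msg_type explicitly; omitting it here would silently drop
        # the type on the receiving side.
        return (_rebuild_transport, (type(self), self.topic, self.msg_type))
```

Passing `type(self)` rather than a hard-coded class keeps subclasses intact after unpickling; `pickle.loads(pickle.dumps(t))` then yields a fresh instance with its own lock.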
- Add return type annotations to CloudflareProvider event handlers
- Fix type: ignore codes to match actual mypy errors (attr-defined)
- Add type annotation to __setstate__ dict parameter
- Add type: ignore[arg-type] for WebRTCPubSub→PubSub duck typing in benchmark
- Remove TwistStamped subclass comment from blueprint
…ew/webrtc-transport

# Conflicts:
#   dimos/robot/all_blueprints.py
#   pyproject.toml
#   uv.lock
…le provider configs

- Move webrtcpubsub + providers into protocol/pubsub/impl/webrtc/ with providers/spec.py (Provider protocol, ProviderConfig, AsyncProviderBase)
- ProviderConfig: picklable, hashable factory resolving to a per-process singleton provider; transports survive pickling into module workers and share one PeerConnection per process
- WebRTCTransport rebuilt on DimosMsg-bound typevar; CloudflareTransport subclass binds BrokerConfig for blueprint use
- Fingerprint filter now derives from the wire format (TwistStamped inherits Twist's fingerprint but encodes as LCM TwistStamped)
- BrokerProvider: operator rejoin via SCTP id tracking, heartbeat task held and cancelled on disconnect, X-Robot-API-Key auth, id=0 throwaway channel, publish() raises (broker is receive-only for now)
- CloudflareProvider: locking discipline, asyncio channel-creation lock, collision-safe DC names
- Benchmark: WebRTC case in the standard harness, env-overridable knobs (DIMOS_BENCH_DURATION_S / _MAX_MESSAGES / _RECEIVE_TIMEOUT_S)
- teleop-hosted-go2-transport: transport-only go2 blueprint (3 lines)
- Delete dimos/teleop/hosted (duplicate scaler), add webrtc extra to all
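The "picklable, hashable factory resolving to a per-process singleton" idea can be sketched with a frozen dataclass and a module-level registry. This is an illustration under assumptions: the field names `app_id` and `region`, the `resolve()` method, and the `Provider` class are hypothetical and do not reflect the actual ProviderConfig in providers/spec.py.

```python
import threading
from dataclasses import dataclass


class Provider:
    """Hypothetical provider; a real one would own the PeerConnection."""

    def __init__(self, config: "ProviderConfig") -> None:
        self.config = config


@dataclass(frozen=True)
class ProviderConfig:
    """Frozen dataclass: equality, hashing, and pickling come for free."""

    app_id: str  # hypothetical field
    region: str = "auto"  # hypothetical field

    def resolve(self) -> Provider:
        # Equal configs map to the same provider within one process.
        with _registry_lock:
            provider = _registry.get(self)
            if provider is None:
                provider = _registry[self] = Provider(self)
            return provider


# Per-process registry: each worker process builds its own on first resolve().
_registry: dict[ProviderConfig, Provider] = {}
_registry_lock = threading.Lock()
```

Since only the config crosses the pickle boundary, a transport unpickled in a worker process resolves to that worker's own provider instead of trying to ship a live connection.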
# Conflicts:
#   dimos/robot/all_blueprints.py
#   dimos/robot/test_all_blueprints.py
#   dimos/teleop/quest_hosted/README.md
#   dimos/teleop/quest_hosted/blueprints.py
#   dimos/teleop/quest_hosted/hosted_extensions.py
#   dimos/teleop/quest_hosted/hosted_teleop_module.py
#   dimos/teleop/quest_hosted/video_track.py
#   dimos/teleop/utils/recorder.py
#   dimos/teleop/utils/report.py
#   dimos/teleop/utils/stream_stats.py
#   dimos/teleop/utils/video_stats.py
#   docs/capabilities/teleoperation/hosted.md
…azy start

- delete impl/webrtc/__init__.py + providers/__init__.py (repo forbids non-root inits; this was the only red CI check); importers now hit the source modules directly
- subscribe_all callbacks fired once per *subscription* on a topic, not once per message; the dispatcher is now attached to each topic exactly once (regression test added)
- WebRTCTransport.start() guards first-use init with a lock so two threads racing subscribe()/broadcast() can't construct two WebRTCPubSub wrappers and orphan subscribe_all state
- test fixture annotation: Generator[X, None, None] -> Iterator[X]
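The start() race fix follows the usual lock-guarded first-use initialisation pattern, sketched below. `LazyStartTransport` and its `factory` argument are hypothetical; the real WebRTCTransport constructs a WebRTCPubSub wrapper at this point.

```python
import threading
from typing import Any, Callable, Optional


class LazyStartTransport:
    """Hypothetical transport that builds its pubsub wrapper on first use."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._pubsub: Optional[Any] = None
        self._start_lock = threading.Lock()

    def start(self) -> Any:
        # Check and construct under the same lock, so concurrent callers
        # cannot both observe None and build two wrappers.
        with self._start_lock:
            if self._pubsub is None:
                self._pubsub = self._factory()
            return self._pubsub
```

Without the lock, two threads can each see `None`, each build a wrapper, and the loser's subscriptions end up attached to an object nobody references.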
The main merge silently kept stale hunks from ruthwik's branch at our old merge point (mem2 recorder warned_frames/recv_ts, TwistStamped seq field) that he later dropped before #2411's final squash. The hunks did not overlap, so git auto-merged without conflict. This PR never intended to touch either file; both now match main byte-for-byte.
aiortc stands in for the browser operator (same join/bridge/negotiated-channel protocol as the teleop web client); the robot side is the exact CloudflareTransport the teleop-hosted-go2-transport blueprint binds. Env-gated on TELEOP_API_KEY/TELEOP_ROBOT_ID/TELEOP_OPERATOR_TOKEN, tool-marked. Verified against the live broker: 40/40 TwistStamped delivered and decoded.
BrokerProvider now owns all three broker-bridged channels (topic == DataChannel name): cmd_unreliable + state_reliable inbound, and state_reliable_back outbound via publish(), so robot telemetry flows through CloudflareTransport instead of raising. Heartbeat acks carry all three SCTP ids; channels are (re)opened per operator join/leave/rejoin. Publishes drop while no operator is connected (normal pubsub semantics).

HostedTeleopModule is deprecated: data planes are covered by CloudflareTransport; the module remains only for video-track publishing until BrokerProvider grows media support.

Live e2e (test_broker_e2e.py) now covers both directions against teleop.dimensionalos.com: operator-sim -> cmd_unreliable -> transport (40/40 decoded) and transport -> state_reliable_back -> operator-sim.
The broker derives the canonical robot identity from the API key (dimensional-teleop now accepts session create without robot_id), so BrokerProvider only requires TELEOP_API_KEY. An explicit robot_id is still sent for consistency-checking when configured. Kills the 'TELEOP_ROBOT_ID or BrokerConfig.robot_id required' failure mode when the env var doesn't reach module worker processes.
…replacement
The provider's session offer now always carries a sendonly video track
(CameraVideoTrack + propagate_bundle_candidates moved from quest_hosted
into the webrtc impl; the deprecated module imports them from here).
CloudflareVideoTransport feeds a blueprint's Image stream into that
track via the shared provider singleton, so hosted teleop with video is
a transport mapping on the base blueprint, with no module wrapper:

    ("cmd_vel", Twist): CloudflareTransport("cmd_unreliable", TwistStamped),
    ("color_image", Image): CloudflareVideoTransport(),

TELEOP_ROBOT_ID also no longer required by the e2e test (broker derives
identity from the key). Live e2e now covers all three planes: cmd
40/40 decoded, telemetry back, and video RTP forwarded to the operator
(decode asserted best-effort: aiortc's receiver is lazy about PLI when
joining mid-stream; the browser requests keyframes immediately).
Note: frames must be flowing before the operator bridges — CF infers
the pulled track's kind from live RTP.
Mirrors the WebRTCTransport -> CloudflareTransport pattern: the generic base works with any ProviderConfig whose provider exposes set_video_frame() (clear NotImplementedError otherwise); CloudflareVideoTransport is the thin BrokerConfig binding. Pickling rebuilds the concrete subclass.
lcm_decode already verifies the wire fingerprint and raises ValueError on other types, so the typed subscriber just attempts decode and skips mismatches. Kills the default-instance-encoding hack (which existed only because _get_packed_fingerprint disagrees with the wire format for subclasses like TwistStamped). The subclass regression test stays, now exercised through lcm_decode.
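The "attempt decode and skip mismatches" approach amounts to a thin wrapper around the decoder. In the sketch below, `decode_twist` is a hypothetical decoder that raises ValueError on a foreign type tag, standing in for lcm_decode; `FINGERPRINT` is a made-up 8-byte tag and `typed_subscriber` is an illustrative name.

```python
from typing import Any, Callable

FINGERPRINT = b"TWIST\x00\x00\x00"  # made-up 8-byte type tag


def decode_twist(payload: bytes) -> dict:
    """Hypothetical decoder: raises ValueError when the type tag does not match."""
    if payload[:8] != FINGERPRINT:
        raise ValueError("fingerprint mismatch")
    return {"type": "Twist", "body": payload[8:]}


def typed_subscriber(
    decode: Callable[[bytes], Any], on_msg: Callable[[Any], None]
) -> Callable[[bytes], bool]:
    """Wrap a raw-bytes callback so only decodable payloads reach on_msg."""

    def on_bytes(payload: bytes) -> bool:
        try:
            msg = decode(payload)
        except ValueError:
            # Another message type on the shared channel: skip it.
            return False
        on_msg(msg)
        return True

    return on_bytes
```

The decoder stays the single authority on what counts as a match, so there is no second fingerprint computation that could disagree with the wire format.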
- aiortc/av/httpx/CameraVideoTrack now import on first provider use, not at module scope: the broker chain hangs off dimos.core.transport, so eager imports taxed every dimos process (~320ms; import now 510->207ms). WEBRTC_AVAILABLE uses find_spec. Verified live: CF loopback tests + broker connect both pass on the lazy path.
- tests trimmed against the LCM coverage bar: fingerprint_filter merged into multiple_types_multiplexed (same demux contract), kwargs-plumbing constructor test dropped (pickle test covers config equality), loopback_rtt dropped (DIY benchmarking; the benchmark harness owns RTT/throughput measurement).
- WebRTCVideoTransport: custom __reduce__ + rebuild helper deleted; its only state is a picklable frozen dataclass, so default pickling preserves subclass and config (unlike WebRTCTransport, which keeps its rebuild path for the unpicklable init lock).
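A minimal sketch of the availability check plus deferred import described in the first bullet, using only `importlib`. The helper names `is_available` and `load_backend` are hypothetical; only the `WEBRTC_AVAILABLE` flag name comes from the commit message.

```python
import importlib
import importlib.util
from types import ModuleType


def is_available(module_name: str) -> bool:
    """True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Cheap to compute at module scope: nothing heavy is imported here.
WEBRTC_AVAILABLE = is_available("aiortc")


def load_backend(module_name: str = "aiortc") -> ModuleType:
    """Import the heavy dependency only when a provider is first used."""
    return importlib.import_module(module_name)
```

`find_spec` only locates a top-level module, so the flag can be evaluated in every process while the actual import cost is paid only by processes that use the transport.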
Nothing sets them; only DIMOS_BENCH_RECEIVE_TIMEOUT_S is exercised (networked drain window for the webrtc/CF benchmark runs).
Co-authored-by: Paul Nechifor <paul@nechifor.net>
❌ 4 Tests Failed:
webrtc = [
    # WebRTC DataChannel pubsub (Cloudflare Realtime SFU)
    "aiortc>=1.14.0",
Contributor
You added just two packages, but uv.lock differs by thousands of lines.
I think you should reset uv.lock to main and run `uv lock` again. I assume you upgraded all the packages, which produced such a large change.
Problem
Closes DIM-XXX
Solution
How to Test
Contributor License Agreement