Vision-Sync is a real-time, distributed AI assistive system that maps a live visual scene into a 2D spatial audio field for visually impaired users. It combines low-latency video streaming, on-device/backend inference, and directional audio feedback so users can hear where a detected person is and roughly how far away they are.
Vision-Sync is engineered as a production-style, real-time perception pipeline rather than a single-script demo:
- Captures live camera frames on a client device.
- Streams frames to a backend for computer vision inference.
- Returns processed video plus metadata (distance + horizontal position).
- Converts metadata into spatialized audio cues and spoken alerts.
The core product goal is to transform visual context into actionable auditory signals at latencies below human-noticeable thresholds.
Vision-Sync follows a Client-Server architecture with clear separation of concerns.
Frontend (Client):
- Stack: React + Vite + WebRTC + Web Audio API
- Captures camera stream with constrained resolution/FPS for low-latency transport.
- Sends video to backend over WebRTC.
- Receives processed video track and metadata over a negotiated WebRTC Data Channel.
- Renders spatial cues using `StereoPannerNode` and proximity beeps via `OscillatorNode`.
Backend (Server):
- Stack: FastAPI + Uvicorn + aiortc + Ultralytics YOLOv8n
- Accepts SDP offers at `/offer` and establishes the peer connection.
- Runs real-time person detection and monocular distance estimation (see the sketch after this list):
  `Distance = (Real Width * Focal Length) / Pixel Width`
- Streams annotated video back to the client.
- Publishes low-latency metadata payloads (`distance`, `pan`, `label`) over the Data Channel.
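A minimal sketch of the distance estimate, assuming an average person width and a one-time focal-length calibration (both constants below are illustrative, not the project's calibrated values):

```python
# Illustrative constants; real values come from camera calibration.
KNOWN_PERSON_WIDTH_M = 0.45  # assumed real-world width of a person, meters
FOCAL_LENGTH_PX = 600.0      # assumed focal length in pixels

def estimate_distance_m(bbox_pixel_width: float) -> float:
    """Distance = (Real Width * Focal Length) / Pixel Width."""
    return (KNOWN_PERSON_WIDTH_M * FOCAL_LENGTH_PX) / max(bbox_pixel_width, 1e-6)

# A person whose bounding box is 200 px wide reads as (0.45 * 600) / 200 = 1.35 m.
print(f"{estimate_distance_m(200.0):.2f} m")
```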
Problem: Early implementations processed every incoming frame sequentially. Inference time exceeded frame arrival time, causing queue growth and multi-second lag.
Solution: Re-architected to an asynchronous "leaky bucket" model:
- Dedicated frame reader always pulls newest frame.
- Old frames are dropped instead of queued.
- Inference runs every Nth frame (frame skipping).
- Output queue is bounded (`maxsize=1`) to discard stale processed frames.
- Reuses last known detections between inference frames for visual continuity.
Result: End-to-end latency dropped from ~20-25 s to sub-second performance, targeting <500 ms interactive behavior.
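A minimal asyncio sketch of the leaky-bucket model (function names and the `detect` callable are illustrative; the real pipeline wraps aiortc tracks and YOLOv8n inference):

```python
import asyncio

# Bounded queue of size 1: only the newest frame is ever waiting.
latest_frame: asyncio.Queue = asyncio.Queue(maxsize=1)

def annotate_and_forward(frame, detections) -> None:
    """Placeholder: draw boxes and push the frame to the outgoing track."""
    ...

async def frame_reader(track) -> None:
    """Always pull the newest frame; drop the stale one instead of queueing."""
    while True:
        frame = await track.recv()       # aiortc MediaStreamTrack API
        if latest_frame.full():
            latest_frame.get_nowait()    # discard the unprocessed old frame
        latest_frame.put_nowait(frame)

async def inference_loop(detect, every_n: int = 3) -> None:
    """Run detection on every Nth frame; reuse last detections in between."""
    count, last_detections = 0, []
    while True:
        frame = await latest_frame.get()
        count += 1
        if count % every_n == 0:
            last_detections = detect(frame)  # expensive YOLOv8n call
        annotate_and_forward(frame, last_detections)
```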
Problem: Mobile camera/WebRTC capabilities require a secure context (HTTPS) and trusted certificates; plain local HTTP blocks critical APIs on many devices.
Solution: Established a local PKI with `mkcert`:
- Generated locally trusted SSL/TLS certificates.
- Served frontend/backend over HTTPS in local Wi-Fi testing setups.
- Enabled cross-device mobile camera access without disabling browser security.
Result: Reliable camera + WebRTC functionality from phone to WSL2 backend across local network.
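On the backend side, a hedged sketch of serving Uvicorn over the mkcert-issued certificates (the file names are assumptions based on the mkcert invocation shown in the setup section; adjust them to your actual output):

```python
# Sketch: run the FastAPI backend over HTTPS using mkcert-issued certs.
# File names are assumptions; mkcert prints the actual names it generates.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "server:app",
        host="0.0.0.0",
        port=8000,
        ssl_certfile="localhost+2.pem",     # assumed mkcert cert name
        ssl_keyfile="localhost+2-key.pem",  # assumed mkcert key name
    )
```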
Problem: Dynamic Data Channel discovery (`ondatachannel`) was intermittently silent in the target environment, resulting in an "empty console" and no audio metadata updates.
Solution: Switched to pre-negotiated Data Channels:
- Explicit channel creation on both peers with `label: "vision-data"`, `negotiated: true`, `id: 0`.
- Added connection-state gating so metadata sends begin only when transport is ready.
- Added handshake telemetry logs for deterministic debugging.
Result: Stable metadata delivery with deterministic channel setup and sub-millisecond channel overhead under normal LAN conditions.
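On the Python side, pre-negotiation looks roughly like this (aiortc's `createDataChannel` accepts `negotiated` and `id`; `send_metadata` is an illustrative helper, not the project's actual function):

```python
from aiortc import RTCPeerConnection

pc = RTCPeerConnection()

# Both peers create the channel with identical parameters, so delivery does
# not depend on the ondatachannel event firing; mirrors the frontend's settings.
channel = pc.createDataChannel("vision-data", negotiated=True, id=0)

def send_metadata(payload: str) -> None:
    """Connection-state gating: send only once the transport is ready."""
    if channel.readyState == "open":
        channel.send(payload)
```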
- Vision: YOLOv8n (Ultralytics), OpenCV
- Networking/Realtime: WebRTC (`aiortc`), FastAPI, Uvicorn
- Frontend: React, Vite, Web Audio API (`StereoPannerNode`, `OscillatorNode`)
- Security: SSL/TLS, local PKI via `mkcert`
- Runtime Environment: Python backend on WSL2/Ubuntu, browser client on desktop/mobile
- **Spatial Audio Mapping** (see the combined sketch after this list)
  - Maps the detected person's X-coordinate to the stereo pan range `[-1.0, 1.0]`.
  - Enables left/right directional localization with headphones.
- **Proximity Audio Feedback**
  - Computes distance from bounding box geometry in real time.
  - Modulates beep pitch and interval by distance (closer = higher/faster).
- **Speech Announcements**
  - Announces distance changes with thresholding to avoid audio clutter.
- **Low-Latency Streaming Pipeline**
  - Frame dropping + inference throttling to prevent queue backlog.
  - Optimized camera constraints for real-time perception.
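Hedged backend-side sketches of these mappings (clamp ranges, beep constants, and the announcement threshold are illustrative choices, not the project's tuned values):

```python
def x_to_pan(x_center_px: float, frame_width_px: float) -> float:
    """Map the detected person's bbox center X to the stereo pan range [-1.0, 1.0]."""
    pan = (x_center_px / frame_width_px) * 2.0 - 1.0
    return max(-1.0, min(1.0, pan))

def beep_interval_s(distance_m: float) -> float:
    """Closer targets beep faster; clamp to an assumed usable interval range."""
    return max(0.1, min(1.0, 0.4 * distance_m))

def should_announce(distance_m: float, last_announced_m: float,
                    threshold_m: float = 0.5) -> bool:
    """Thresholded speech announcements to avoid audio clutter."""
    return abs(distance_m - last_announced_m) >= threshold_m
```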
Example metadata payload sent over the Data Channel:

```json
{
  "distance": 1.2,
  "pan": -0.5,
  "label": "person"
}
```
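For illustration, assembling one such payload in Python (`build_payload` is a hypothetical helper; field names follow the schema above):

```python
import json

def build_payload(distance_m: float, pan: float, label: str = "person") -> str:
    """Serialize one metadata update in the Data Channel schema shown above."""
    return json.dumps({"distance": round(distance_m, 2),
                       "pan": round(pan, 2),
                       "label": label})

print(build_payload(1.2, -0.5))  # -> {"distance": 1.2, "pan": -0.5, "label": "person"}
```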
Prerequisites:
- Python 3.10+ (3.11 recommended)
- Node.js 18+ / npm 9+
- `ffmpeg` (optional but recommended for some WebRTC codecs)
- `mkcert` (for HTTPS on the local network; optional but recommended)
Install the backend dependencies:

```bash
pip install -r requirements.txt
```

Install the frontend dependencies:

```bash
cd frontend
npm install
```

Start the backend:

```bash
# from repository root
uvicorn server:app --reload --host 0.0.0.0 --port 8000
```

Start the frontend:

```bash
cd frontend
npm run dev -- --host
```

- Go to `https://localhost:5173` (or the URL shown in the terminal)
- Allow camera access
- Click connect/start to initialize the WebRTC handshake with `http://localhost:8000/offer`
To enable HTTPS for local network testing:

```bash
cd frontend
mkcert -install
mkcert localhost 127.0.0.1 ::1
# configure vite / backend to use the local certs in their respective configs
```

- Watch the console logs for `WebRTC connection established` and `received metadata` events
- Confirm the live camera feed, overlay, and spatial audio beeps
Vision-Sync demonstrates engineering depth across distributed systems, realtime media, CV inference optimization, and human-centered assistive UX. It is intentionally built to highlight practical system design tradeoffs (latency vs. quality, determinism vs. dynamic negotiation, security vs. developer speed) and production-minded debugging methodology.