Vision-Sync is a real-time, distributed AI assistive system that maps a live visual scene into a 2D spatial audio field for visually impaired users. It combines low-latency video streaming, on-device/backend inference, and directional audio feedback so users can hear where a detected person is and roughly how far away they are.
Vision-Sync is engineered as a production-style, real-time perception pipeline rather than a single-script demo:
- Captures live camera frames on a client device.
- Streams frames to a backend for computer vision inference.
- Returns processed video plus metadata (distance + horizontal position).
- Converts metadata into spatialized audio cues and spoken alerts.
The core product goal is to transform visual context into actionable auditory signals at latencies below human-noticeable thresholds.
Vision-Sync follows a Client-Server architecture with clear separation of concerns.
Frontend (Client):
- Stack: React + Vite + WebRTC + Web Audio API
- Captures camera stream with constrained resolution/FPS for low-latency transport.
- Sends video to backend over WebRTC.
- Receives processed video track and metadata over a negotiated WebRTC Data Channel.
- Renders spatial cues using `StereoPannerNode` and proximity beeps via `OscillatorNode`.
Backend (Server):
- Stack: FastAPI + Uvicorn + aiortc + Ultralytics YOLOv8n
- Accepts SDP offers at `/offer` and establishes the peer connection.
- Runs real-time person detection and monocular distance estimation (see the sketch after this list):
  `Distance = (Real Width * Focal Length) / Pixel Width`
- Streams annotated video back to the client.
- Publishes low-latency metadata payloads (`distance`, `pan`, `label`) over the Data Channel.
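A minimal sketch of the distance estimate, assuming an average person width and a one-time focal-length calibration (both constants below are illustrative, not the project's calibrated values):

```python
# Illustrative constants; real values come from camera calibration.
KNOWN_PERSON_WIDTH_M = 0.45  # assumed real-world width of a person, meters
FOCAL_LENGTH_PX = 600.0      # assumed focal length in pixels

def estimate_distance_m(bbox_pixel_width: float) -> float:
    """Distance = (Real Width * Focal Length) / Pixel Width."""
    return (KNOWN_PERSON_WIDTH_M * FOCAL_LENGTH_PX) / max(bbox_pixel_width, 1e-6)

# A person whose bounding box is 200 px wide reads as (0.45 * 600) / 200 = 1.35 m.
print(f"{estimate_distance_m(200.0):.2f} m")
```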
Problem: Early implementations processed every incoming frame sequentially. Inference time exceeded frame arrival time, causing queue growth and multi-second lag.
Solution: Re-architected to an asynchronous "leaky bucket" model:
- Dedicated frame reader always pulls newest frame.
- Old frames are dropped instead of queued.
- Inference runs every Nth frame (frame skipping).
- Output queue is bounded (`maxsize=1`) to discard stale processed frames.
- Reuses last known detections between inference frames for visual continuity.
Result: End-to-end latency dropped from ~20-25 s to sub-second performance, targeting <500 ms interactive behavior.
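A minimal asyncio sketch of the leaky-bucket model (function names and the `detect` callable are illustrative; the real pipeline wraps aiortc tracks and YOLOv8n inference):

```python
import asyncio

# Bounded queue of size 1: only the newest frame is ever waiting.
latest_frame: asyncio.Queue = asyncio.Queue(maxsize=1)

def annotate_and_forward(frame, detections) -> None:
    """Placeholder: draw boxes and push the frame to the outgoing track."""
    ...

async def frame_reader(track) -> None:
    """Always pull the newest frame; drop the stale one instead of queueing."""
    while True:
        frame = await track.recv()       # aiortc MediaStreamTrack API
        if latest_frame.full():
            latest_frame.get_nowait()    # discard the unprocessed old frame
        latest_frame.put_nowait(frame)

async def inference_loop(detect, every_n: int = 3) -> None:
    """Run detection on every Nth frame; reuse last detections in between."""
    count, last_detections = 0, []
    while True:
        frame = await latest_frame.get()
        count += 1
        if count % every_n == 0:
            last_detections = detect(frame)  # expensive YOLOv8n call
        annotate_and_forward(frame, last_detections)
```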
Problem: Mobile camera/WebRTC capabilities require a secure context (HTTPS) and trusted certificates; plain local HTTP blocks critical APIs on many devices.
Solution: Established a local PKI with `mkcert`:
- Generated locally trusted SSL/TLS certificates.
- Served frontend/backend over HTTPS in local Wi-Fi testing setups.
- Enabled cross-device mobile camera access without disabling browser security.
Result: Reliable camera + WebRTC functionality from phone to WSL2 backend across local network.
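On the backend side, a hedged sketch of serving Uvicorn over the mkcert-issued certificates (the file names are assumptions based on the mkcert invocation shown in the setup section; adjust them to your actual output):

```python
# Sketch: run the FastAPI backend over HTTPS using mkcert-issued certs.
# File names are assumptions; mkcert prints the actual names it generates.
import uvicorn

if __name__ == "__main__":
    uvicorn.run(
        "server:app",
        host="0.0.0.0",
        port=8000,
        ssl_certfile="localhost+2.pem",     # assumed mkcert cert name
        ssl_keyfile="localhost+2-key.pem",  # assumed mkcert key name
    )
```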
Problem: Dynamic Data Channel discovery (`ondatachannel`) was intermittently silent in the target environment, resulting in an "empty console" and no audio metadata updates.
Solution: Switched to pre-negotiated Data Channels:
- Explicit channel creation on both peers with `label: "vision-data"`, `negotiated: true`, `id: 0`.
- Added connection-state gating so metadata sends begin only when transport is ready.
- Added handshake telemetry logs for deterministic debugging.
Result: Stable metadata delivery with deterministic channel setup and sub-millisecond channel overhead under normal LAN conditions.
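On the Python side, pre-negotiation looks roughly like this (aiortc's `createDataChannel` accepts `negotiated` and `id`; `send_metadata` is an illustrative helper, not the project's actual function):

```python
from aiortc import RTCPeerConnection

pc = RTCPeerConnection()

# Both peers create the channel with identical parameters, so delivery does
# not depend on the ondatachannel event firing; mirrors the frontend's settings.
channel = pc.createDataChannel("vision-data", negotiated=True, id=0)

def send_metadata(payload: str) -> None:
    """Connection-state gating: send only once the transport is ready."""
    if channel.readyState == "open":
        channel.send(payload)
```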
- Vision: YOLOv8n (Ultralytics), OpenCV
- Networking/Realtime: WebRTC (`aiortc`), FastAPI, Uvicorn
- Frontend: React, Vite, Web Audio API (`StereoPannerNode`, `OscillatorNode`)
- Security: SSL/TLS, local PKI via `mkcert`
- Runtime Environment: Python backend on WSL2/Ubuntu, browser client on desktop/mobile
- **Spatial Audio Mapping** (see the combined sketch after this list)
  - Maps the detected person's X-coordinate to the stereo pan range `[-1.0, 1.0]`.
  - Enables left/right directional localization with headphones.
- **Proximity Audio Feedback**
  - Computes distance from bounding box geometry in real time.
  - Modulates beep pitch and interval by distance (closer = higher/faster).
- **Speech Announcements**
  - Announces distance changes with thresholding to avoid audio clutter.
- **Low-Latency Streaming Pipeline**
  - Frame dropping + inference throttling to prevent queue backlog.
  - Optimized camera constraints for real-time perception.
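Hedged backend-side sketches of these mappings (clamp ranges, beep constants, and the announcement threshold are illustrative choices, not the project's tuned values):

```python
def x_to_pan(x_center_px: float, frame_width_px: float) -> float:
    """Map the detected person's bbox center X to the stereo pan range [-1.0, 1.0]."""
    pan = (x_center_px / frame_width_px) * 2.0 - 1.0
    return max(-1.0, min(1.0, pan))

def beep_interval_s(distance_m: float) -> float:
    """Closer targets beep faster; clamp to an assumed usable interval range."""
    return max(0.1, min(1.0, 0.4 * distance_m))

def should_announce(distance_m: float, last_announced_m: float,
                    threshold_m: float = 0.5) -> bool:
    """Thresholded speech announcements to avoid audio clutter."""
    return abs(distance_m - last_announced_m) >= threshold_m
```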
Example metadata payload sent over the Data Channel:

```json
{
  "distance": 1.2,
  "pan": -0.5,
  "label": "person"
}
```
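For illustration, assembling one such payload in Python (`build_payload` is a hypothetical helper; field names follow the schema above):

```python
import json

def build_payload(distance_m: float, pan: float, label: str = "person") -> str:
    """Serialize one metadata update in the Data Channel schema shown above."""
    return json.dumps({"distance": round(distance_m, 2),
                       "pan": round(pan, 2),
                       "label": label})

print(build_payload(1.2, -0.5))  # -> {"distance": 1.2, "pan": -0.5, "label": "person"}
```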
Prerequisites:
- Python 3.10+ (3.11 recommended)
- Node.js 18+ / npm 9+
- `ffmpeg` (optional but recommended for some WebRTC codecs)
- `mkcert` (for HTTPS on the local network; optional but recommended)
Install the backend dependencies:

```bash
pip install -r requirements.txt
```

Install the frontend dependencies:

```bash
cd frontend
npm install
```

Start the backend:

```bash
# from repository root
uvicorn server:app --reload --host 0.0.0.0 --port 8000
```

Start the frontend:

```bash
cd frontend
npm run dev -- --host
```

- Go to `https://localhost:5173` (or the URL shown in the terminal)
- Allow camera access
- Click connect/start to initialize the WebRTC handshake with `http://localhost:8000/offer`
To enable HTTPS for local network testing:

```bash
cd frontend
mkcert -install
mkcert localhost 127.0.0.1 ::1
# configure vite / backend to use the local certs in their respective configs
```

- Watch the console logs for `WebRTC connection established` and `received metadata` events
- Confirm the live camera feed, overlay, and spatial audio beeps
Vision-Sync demonstrates engineering depth across distributed systems, realtime media, CV inference optimization, and human-centered assistive UX. It is intentionally built to highlight practical system design tradeoffs (latency vs. quality, determinism vs. dynamic negotiation, security vs. developer speed) and production-minded debugging methodology.