Detect objects in real-time video streams and track them with persistent IDs. Built for speed, designed for clarity.
Real-time multi-object tracking on a live webcam feed: each object is detected by YOLOv8, assigned a persistent ID by Deep SORT, and annotated with a unique color-coded bounding box.
⚡ TL;DR: This project combines YOLOv8's blazing-fast object detection with Deep SORT's identity-preserving tracker to deliver frame-by-frame detections with cross-frame object continuity, the foundation of any serious computer vision pipeline.
This project is a complete, production-ready real-time object detection and multi-object tracking system developed as Task 4 of the CodeAlpha Computer Vision Internship. It processes video streams, whether from a live webcam or a video file, and outputs annotated frames where every detected object is bounded, classified, and tracked with a unique, persistent ID.
Object detection alone tells you what is in a frame. Tracking tells you who is where across time. Bridging the two is what transforms a simple detector into a system capable of powering surveillance and security monitoring, autonomous vehicle perception, crowd analytics, sports player tracking, and smart retail insights. This project demonstrates that bridge: cleanly architected, thoroughly documented, and optimized for real-time performance.
┌──────────────────┐      ┌──────────────────────┐      ┌───────────────────────┐      ┌──────────────────────┐
│                  │      │                      │      │                       │      │                      │
│   Video Source   │─────▶│   YOLOv8 Detector    │─────▶│   Deep SORT Tracker   │─────▶│   Annotated Output   │
│  (Webcam / MP4)  │      │  (Per-Frame Boxes)   │      │   (Persistent IDs)    │      │ (Boxes + IDs + FPS)  │
│                  │      │                      │      │                       │      │                      │
└──────────────────┘      └──────────────────────┘      └───────────────────────┘      └──────────────────────┘
         │                           │                              │                             │
         ▼                           ▼                              ▼                             ▼
  cv2.VideoCapture         Bounding Boxes (xywh)          Kalman Prediction              cv2.rectangle()
  Frame-by-frame           Confidence Scores              Hungarian Assignment           cv2.putText()
  BGR image                Class IDs                      Appearance Embeddings          Color-coded IDs
- Video Source: OpenCV captures frames from a webcam or a video file and feeds them into the pipeline one at a time.
- YOLOv8 Detection: Each frame is independently processed by the YOLOv8 Nano model, which outputs bounding boxes, confidence scores, and class labels for all 80 COCO categories. Detections below the confidence threshold are discarded immediately.
- Deep SORT Tracking: The filtered detections are passed to Deep SORT, which matches them to existing tracks using a combination of Kalman Filter motion prediction and cosine-distance appearance embeddings. New objects receive tentative tracks; confirmed tracks retain their IDs across frames.
- Annotated Output: Confirmed tracks are rendered with color-coded bounding boxes, class names, tracking IDs, and confidence percentages. A smoothed FPS counter is overlaid in the top-left corner.
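The hand-off between the detection and tracking steps above can be sketched as a small conversion function. This is a minimal illustration, not the project's actual code: it assumes the detector yields corner-format boxes `(x1, y1, x2, y2)` and that the tracker expects `([left, top, width, height], confidence, class)` tuples, which is the input format of the `deep-sort-realtime` package. The function name `to_tracker_input` and the type aliases are hypothetical.

```python
from typing import List, Sequence, Tuple

# Hypothetical type aliases for illustration; not taken from the project source.
XYXYBox = Tuple[float, float, float, float]      # (x1, y1, x2, y2)
RawDetection = Tuple[List[float], float, int]    # ([left, top, w, h], confidence, class_id)


def to_tracker_input(
    boxes_xyxy: Sequence[XYXYBox],
    confidences: Sequence[float],
    class_ids: Sequence[int],
    min_confidence: float = 0.5,
) -> List[RawDetection]:
    """Drop low-confidence detections and convert corner boxes to [left, top, w, h]."""
    detections: List[RawDetection] = []
    for (x1, y1, x2, y2), conf, cls in zip(boxes_xyxy, confidences, class_ids):
        if conf < min_confidence:
            continue  # discarded before the tracker ever sees it
        detections.append(([x1, y1, x2 - x1, y2 - y1], conf, cls))
    return detections


if __name__ == "__main__":
    boxes = [(10.0, 20.0, 110.0, 220.0), (300.0, 40.0, 340.0, 90.0)]
    # The second detection (confidence 0.32) falls below the 0.5 threshold.
    print(to_tracker_input(boxes, [0.91, 0.32], [0, 2]))
```

Filtering before tracking keeps spurious low-confidence boxes from spawning tentative tracks in the first place.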
- OOP Architecture: The entire pipeline is encapsulated in a single `ObjectTracker` class with clean method separation (`process_frame`, `draw_annotations`, `run`), making it trivial to extend or integrate into larger systems.
- Real-Time FPS Counter: An exponential moving average FPS readout ensures a stable, jitter-free performance metric displayed on every frame.
- Deep SORT Re-Identification: A MobileNet-based appearance embedder allows the tracker to re-associate objects after brief occlusions, surviving up to 30 missed frames (`max_age=30`) before retiring an ID.
- Color-Coded Tracking IDs: Each unique track ID maps to a deterministic color from a curated 20-color palette, providing instant visual continuity across frames.
- Robust Error Handling: Missing video files, unavailable cameras, and per-frame processing failures are all caught and reported gracefully, so the system never crashes silently.
- Coordinate Clamping: All bounding box coordinates are clamped to frame boundaries before rendering, eliminating OpenCV drawing errors on edge cases.
- Flexible Input: Supports both live webcam feeds and offline video files via a simple argparse CLI, with automatic source-type detection.
- Configurable Confidence: Detection threshold is adjustable at runtime, letting you trade precision for recall depending on your use case.
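Three of the features above (the smoothed FPS counter, deterministic ID colors, and coordinate clamping) are small enough to sketch in a few lines. The helpers below are illustrative assumptions rather than the project's implementation: the names `SmoothedFPS`, `color_for_id`, and `clamp_box` are hypothetical, the smoothing factor `alpha=0.1` is an arbitrary choice, and the palette is generated arithmetically because the README does not list the actual 20 colors.

```python
from typing import List, Optional, Tuple

# Illustrative 20-entry BGR palette; the project's curated colors are not shown in the README.
PALETTE: List[Tuple[int, int, int]] = [
    ((37 * i) % 256, (91 * i + 60) % 256, (173 * i + 120) % 256) for i in range(20)
]


def color_for_id(track_id: int) -> Tuple[int, int, int]:
    """Map a track ID to a palette entry; the same ID always gets the same color."""
    return PALETTE[track_id % len(PALETTE)]


def clamp_box(x1: float, y1: float, x2: float, y2: float,
              frame_w: int, frame_h: int) -> Tuple[int, int, int, int]:
    """Clamp corner coordinates into the valid pixel range of the frame."""
    cx1 = max(0, min(int(x1), frame_w - 1))
    cy1 = max(0, min(int(y1), frame_h - 1))
    cx2 = max(0, min(int(x2), frame_w - 1))
    cy2 = max(0, min(int(y2), frame_h - 1))
    return cx1, cy1, cx2, cy2


class SmoothedFPS:
    """Exponential moving average over instantaneous frames-per-second values."""

    def __init__(self, alpha: float = 0.1) -> None:
        self.alpha = alpha                    # weight given to the newest sample
        self.value: Optional[float] = None    # None until the first frame is timed

    def update(self, frame_seconds: float) -> float:
        instant = 1.0 / max(frame_seconds, 1e-6)  # guard against a zero-length frame
        if self.value is None:
            self.value = instant
        else:
            self.value = self.alpha * instant + (1.0 - self.alpha) * self.value
        return self.value


if __name__ == "__main__":
    fps = SmoothedFPS()
    for dt in (0.05, 0.10, 0.05):
        print(round(fps.update(dt), 2))
    print(clamp_box(-5, 10, 700, 500, frame_w=640, frame_h=480))
```

A smaller `alpha` gives a steadier readout at the cost of reacting more slowly to genuine changes in frame rate.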
Understanding the distinction between detection and tracking is fundamental to appreciating this system's design.
YOLOv8 is a single-frame detector. It processes each image independently: it divides the input into grids at several scales, predicts bounding boxes and class probabilities at each grid cell, and applies non-maximum suppression to remove duplicates. The result is a set of anonymous detections: bounding boxes with class labels and confidence scores, but no memory between frames. To YOLO, a person detected at position (100, 200) in frame N and at (105, 202) in frame N+1 is two completely unrelated observations. YOLO answers the question: "What objects exist in this frame, and where?"
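To make the duplicate-removal step concrete, here is a minimal sketch of greedy non-maximum suppression in plain Python. It is a simplified, class-agnostic illustration (production detectors usually run this per class and on tensors), and the function names `iou` and `nms` as well as the `0.5` threshold are assumptions for the example, not values taken from the project.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two corner-format boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_threshold: float = 0.5) -> List[int]:
    """Return indices of boxes to keep, highest score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep a box only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    boxes = [(0, 0, 100, 100), (5, 5, 105, 105), (200, 200, 300, 300)]
    scores = [0.9, 0.8, 0.7]
    print(nms(boxes, scores))  # the second box heavily overlaps the first and is dropped
```

Note that nothing here links a kept box to any box from a previous frame, which is exactly the gap the tracker fills.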
Deep SORT transforms anonymous detections into persistent identities by solving the data association problem across time. It operates through three tightly integrated mechanisms:
| Component | Role | How It Works |
|---|---|---|
| Kalman Filter | Motion prediction | Models each track's position and velocity as a linear dynamical system. Before each frame, it predicts where every existing track should appear, then updates its state once a measurement is matched. This allows the tracker to maintain continuity through brief occlusions where no detection is available. |
| Hungarian Algorithm | Optimal assignment | Computes a cost matrix between all predicted track positions and all new detections, then solves for the globally optimal one-to-one matching. The cost is a weighted combination of Mahalanobis distance (motion consistency) and cosine distance (appearance similarity). Unmatched detections spawn new tentative tracks; unmatched tracks age toward deletion. |
| Appearance Embeddings | Re-identification | A lightweight MobileNet CNN extracts a compact feature vector from each detection's image patch. These embeddings are stored in a gallery per track and compared against new detections using cosine similarity. This is what allows Deep SORT to distinguish two visually similar objects moving close together, the critical advantage over pure motion-based trackers. |
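The appearance side of the association step can be sketched as follows. This is a deliberately reduced illustration under several assumptions: it uses only the cosine-distance term (no Mahalanobis gating or distance threshold), the embeddings are toy 2-D vectors rather than MobileNet features, and the optimal one-to-one matching is found by brute force over permutations instead of the Hungarian algorithm. Brute force reaches the same minimum-cost assignment but is only practical for a handful of tracks. All function names are hypothetical.

```python
import math
from itertools import permutations
from typing import Dict, List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """1 - cosine similarity; 0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as maximally dissimilar
    return 1.0 - dot / (norm_a * norm_b)


def gallery_distance(gallery: Sequence[Vector], embedding: Vector) -> float:
    """Smallest cosine distance between a detection and any stored embedding of a track."""
    return min(cosine_distance(stored, embedding) for stored in gallery)


def optimal_assignment(cost: List[List[float]]) -> Dict[int, int]:
    """Minimum-cost track -> detection matching by exhaustive search.

    Assumes there are at least as many detections as tracks.
    """
    if not cost:
        return {}
    n_tracks, n_dets = len(cost), len(cost[0])
    best_perm, best_total = (), math.inf
    for perm in permutations(range(n_dets), n_tracks):
        total = sum(cost[t][d] for t, d in enumerate(perm))
        if total < best_total:
            best_perm, best_total = perm, total
    return {t: d for t, d in enumerate(best_perm)}


if __name__ == "__main__":
    galleries = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0]]]   # one gallery per track
    detections = [[0.1, 0.9], [1.0, 0.05]]                 # new-frame embeddings
    cost = [[gallery_distance(g, d) for d in detections] for g in galleries]
    print(optimal_assignment(cost))
```

Solving the assignment globally, rather than greedily picking the nearest detection for each track in turn, avoids one track stealing a detection that is a far better fit for another.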
The result: Once a track is confirmed (after surviving `n_init=3` consecutive frames), it receives a unique integer ID that persists across the entire video, through partial occlusions, motion blur, and temporary detection dropouts, as long as it reappears within the `max_age=30` frame window. This is what transforms a detector into a tracker, and what powers applications that require counting distinct objects or following individual trajectories over time.
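The confirmation and retirement rules described above amount to a small state machine per track. The sketch below is a simplified model of that lifecycle, assuming the behavior common to Deep SORT implementations: a tentative track is dropped on its first miss, and a confirmed track is deleted once it has gone more than `max_age` frames without a match. The `TrackState` class and its method names are hypothetical and not part of any library.

```python
from dataclasses import dataclass


@dataclass
class TrackState:
    track_id: int
    n_init: int = 3         # matched frames required before the track is confirmed
    max_age: int = 30       # missed frames a confirmed track may survive
    hits: int = 1           # matched frames so far, counting the one that created the track
    misses: int = 0         # consecutive frames without a matching detection
    confirmed: bool = False
    deleted: bool = False

    def mark_matched(self) -> None:
        """Call when a detection was assigned to this track in the current frame."""
        self.hits += 1
        self.misses = 0
        if not self.confirmed and self.hits >= self.n_init:
            self.confirmed = True

    def mark_missed(self) -> None:
        """Call when no detection was assigned to this track in the current frame."""
        if not self.confirmed:
            self.deleted = True  # tentative tracks do not get a grace period
            return
        self.misses += 1
        if self.misses > self.max_age:
            self.deleted = True


if __name__ == "__main__":
    track = TrackState(track_id=1)
    track.mark_matched()
    track.mark_matched()
    print(track.confirmed)   # confirmed after three matched frames in a row

    for _ in range(30):
        track.mark_missed()
    print(track.deleted)     # still alive after exactly max_age misses
    track.mark_missed()
    print(track.deleted)     # retired on the next miss
```

Requiring several consecutive matches before confirmation filters out one-frame false positives, while the `max_age` window is what lets an ID survive a short occlusion.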
- Python 3.9+ (download from python.org)
- pip: comes bundled with Python
- A webcam (for live demo) or an MP4/AVI video file
# 1. Clone the repository
git clone https://github.com/<your-username>/realtime-object-tracking.git
cd realtime-object-tracking
# 2. Create and activate a virtual environment (recommended)
python -m venv venv
source venv/bin/activate # Linux / macOS
# venv\Scripts\activate # Windows
# 3. Install dependencies
pip install -r requirements.txt

Note: On first run, YOLOv8 will automatically download the `yolov8n.pt` weights file (~6 MB) and Deep SORT will download its MobileNet embedder. An internet connection is required for the initial setup only.
python object_tracker.py --source 0 --confidence 0.5

python object_tracker.py --source ./videos/sample.mp4 --confidence 0.4

| Flag | Default | Description |
|---|---|---|
| `--source` | `0` | Video source: `0` for the default webcam, or a file path to an MP4/AVI video |
| `--confidence` | `0.5` | Minimum detection confidence threshold (0.0–1.0) |
| `--model` | `yolov8n.pt` | YOLOv8 weights file: `yolov8n` (fastest), `yolov8s`, `yolov8m`, `yolov8l` |
# Balanced speed/accuracy
python object_tracker.py --source 0 --model yolov8s.pt
# Higher accuracy (requires stronger GPU)
python object_tracker.py --source 0 --model yolov8m.pt

Tip: Press `q` at any time to cleanly exit the tracking loop and release all resources.
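For readers who want to see how the three flags could be wired up, here is a minimal `argparse` sketch. It mirrors the flag names and defaults documented above, but how the project itself detects the source type is not shown in this README, so the rule used here (an all-digit value is a camera index, anything else is a file path) is an assumption, as are the helper names `parse_source` and `parse_args`.

```python
import argparse
from typing import List, Optional, Union


def parse_source(value: str) -> Union[int, str]:
    """Assumed rule: an all-digit value is a camera index, anything else is a path."""
    return int(value) if value.isdigit() else value


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Real-time object detection and tracking")
    parser.add_argument("--source", type=parse_source, default=0,
                        help="0 for the default webcam, or a path to a video file")
    parser.add_argument("--confidence", type=float, default=0.5,
                        help="minimum detection confidence (0.0 to 1.0)")
    parser.add_argument("--model", default="yolov8n.pt",
                        help="YOLOv8 weights file")
    return parser


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    if not 0.0 <= args.confidence <= 1.0:
        parser.error("--confidence must be between 0.0 and 1.0")
    return args


if __name__ == "__main__":
    # Explicit argument lists are used here so the example runs without a real command line.
    print(parse_args(["--source", "0"]))
    print(parse_args(["--source", "./videos/sample.mp4", "--confidence", "0.4"]))
```

Returning an `int` for camera indices matters because `cv2.VideoCapture` treats an integer as a device index and a string as a file name or URL.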
realtime-object-tracking/
│
├── object_tracker.py     # Main script: ObjectTracker class + CLI entry point
├── requirements.txt      # Python dependencies
├── README.md             # You are here
├── LICENSE               # MIT License
│
└── assets/
    └── demo.gif          # Placeholder for demo animation
Meraj Basiri, Mani Alagheband
Computer Vision Intern @ CodeAlpha
Passionate about building intelligent systems that perceive, reason, and act in the real world.
- CodeAlpha: For the internship opportunity and structured learning path that made this project possible.
- Ultralytics: For the state-of-the-art YOLOv8 framework that democratizes real-time object detection.
- Deep SORT Realtime: For the seamless Python integration of the Deep SORT tracking algorithm.
Built with ❤️ and curiosity by Meraj Basiri and Mani Alagheband