Setup and Run Instructions

Run the following commands to set up the environment:

./run.sh
python3 -m venv test_env
source test_env/bin/activate
pip3 install -r requirements.txt
python3 -m pip install 'git+https://github.com/facebookresearch/segment-anything.git'

To run the final inference, use the script below:

python3 main.py

Introduction

Building a generalizable and scalable tracker

Approach

To achieve a generalized and scalable model, the following high-level requirements were formulated:

The tracker's inference time should not be affected by a larger dataset (100x the test set).
The tracker must work equally well on unseen data and must improve with more annotated data.
The tracker should be capable of handling occlusions, motion blur, and weather effects.

1. Addressing Scalability

I reviewed the literature on deep-learning trackers, including 'Tracking Everything Everywhere All at Once', RAFT, and Deep Sort.

'Tracking Everything Everywhere All at Once' seems to have the best results but, just like RAFT, it is built for dense tracking, which does not seem to be a requirement for this problem.

Furthermore, 'Tracking Everything Everywhere All at Once' performs tracking by optimizing a canonical 3D volume (in a NeRF-like representation). While this may be suitable for accurate dense tracking, I was not sure how well it would scale to large geometry and large spaces.

On the other hand, Deep Sort, which is a Kalman-filter-based estimator, is viable due to its speed. However, it requires a well-trained object detector to operate. This is addressed below.
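To make the Kalman-filter idea concrete, here is a minimal sketch of the predict/update cycle for a single coordinate under a constant-velocity model. It is not Deep Sort's implementation, which tracks a full bounding-box state; the class name `ConstantVelocityKalman1D` and the noise values `q` and `r` are assumptions chosen for illustration.

```python
class ConstantVelocityKalman1D:
    """Track one coordinate (e.g. a box centre x) with state [position, velocity]."""

    def __init__(self, position=0.0, velocity=0.0, q=1e-3, r=1.0):
        self.p = position                      # estimated position
        self.v = velocity                      # estimated velocity
        self.P = [[10.0, 0.0], [0.0, 10.0]]    # state covariance (assumed prior)
        self.q = q                             # process noise (assumed value)
        self.r = r                             # measurement noise (assumed value)

    def predict(self, dt=1.0):
        """Propagate the state: x <- F x, P <- F P F^T + q * I, F = [[1, dt], [0, 1]]."""
        self.p += self.v * dt
        (a, b), (c, d) = self.P
        self.P = [
            [a + dt * (b + c) + dt * dt * d + self.q, b + dt * d],
            [c + dt * d, d + self.q],
        ]
        return self.p

    def update(self, z):
        """Correct the state with a position measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        innovation = z - self.p
        s = a + self.r                 # innovation variance
        k0, k1 = a / s, c / s          # Kalman gain for position and velocity
        self.p += k0 * innovation
        self.v += k1 * innovation
        self.P = [
            [(1.0 - k0) * a, (1.0 - k0) * b],
            [c - k1 * a, d - k1 * b],
        ]
        return self.p


# Usage: an object moving 2 pixels per frame, observed without noise.
kf = ConstantVelocityKalman1D()
for step in range(1, 31):
    kf.predict()
    kf.update(2.0 * step)
```

Each step costs a fixed handful of arithmetic operations per track, regardless of how much data the detector was trained on, which is why this family of trackers is attractive for the scalability requirement.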

2. Addressing Unseen Data

2.1. Choosing a Zero Shot Detector

Previously we saw that Deep Sort would work well if we had an object detector. However, object detectors like YOLO and its newer variants such as YOLOv8 are usually trained on huge amounts of data (over a million images) to give tight boxes.

To build a generic tracker without large amounts of annotated data, we would need a general-purpose detector. I evaluated two options for this:

Zero-shot detectors, e.g. OWL-ViT, OWLv2
Segment Anything

Zero-shot detectors can also be conditioned on query images, which made them my first choice. I have attached a few images taken directly from the Hugging Face tutorial below:

However, after a few trials, I found the detections to be inconsistent (and, surprisingly, not scale-invariant on our test data).

Hence, I moved on to Segment Anything, which was much more stable at the cost of a longer runtime. For scalability, fine-tuning the zero-shot detector would be a better option, but I chose Segment Anything for this proof of concept.

2.2. Filtering the Zero Shot Detections

When running inference on non-annotated data, the zero-shot detector's bounding boxes do not come with predicted labels. Hence, the detections need to be filtered.
Building a filter from only 100 images of the video sequence, where only one side of the car is visible, would heavily overfit any classification network.
Hence, to build a common pipeline that can fine-tune a classifier on any unseen data, I needed more views of the object of interest.

I used One-2-3-45, a pre-trained diffusion-based model that builds 3D geometry by optimizing a signed distance function (SDF). I had to slightly hack this pipeline to output ~5600 views instead of a discrete 4 views of the object.
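As an illustration of what "many views" can mean in practice, the sketch below enumerates camera positions on a sphere around the object using a regular azimuth/elevation grid. This is an assumption made for illustration only: the source does not describe how the One-2-3-45 pipeline was modified or how the views were distributed. The function name `sample_viewpoints`, the elevation band, and the 80 x 70 split are all hypothetical.

```python
import math


def sample_viewpoints(n_azimuth, n_elevation, radius=1.0,
                      min_elev_deg=-30.0, max_elev_deg=60.0):
    """Return (x, y, z) camera positions on a sphere centred on the object."""
    views = []
    for i in range(n_elevation):
        # Spread elevations evenly across the allowed band.
        t = i / (n_elevation - 1) if n_elevation > 1 else 0.5
        elev = math.radians(min_elev_deg + t * (max_elev_deg - min_elev_deg))
        for j in range(n_azimuth):
            # Spread azimuths evenly around the full circle.
            azim = 2.0 * math.pi * j / n_azimuth
            x = radius * math.cos(elev) * math.cos(azim)
            y = radius * math.cos(elev) * math.sin(azim)
            z = radius * math.sin(elev)
            views.append((x, y, z))
    return views


# 80 azimuth steps x 70 elevation steps is one way to reach 5600 views.
views = sample_viewpoints(n_azimuth=80, n_elevation=70)
```

Each position would then be handed to whatever renderer produces the image for that view; that step depends on the 3D pipeline and is left out here.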

2.3. Building and Training the Classifier

Finally, I tried both a pretrained ResNet18 (fine-tuning just the fully connected layer) and my own implementation of ConvNeXt (which has fewer parameters than what was available online).

I chose ResNet18 for the final pipeline due to its better performance (mostly due to its pre-training).

2.4. Integrating the Pipeline

Finally, the end-to-end pipeline relies on bounding-box predictions from Segment Anything, filtered using the fine-tuned classifier.
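The filtering stage can be sketched as below. This is a simplified outline under stated assumptions, not the code in `main.py`: proposals are assumed to be dicts with a `bbox` in (x, y, w, h) pixels and an `area` in pixels, as Segment Anything's automatic mask generator reports them, and `score_fn` is a hypothetical stand-in for the fine-tuned classifier. The area check and both thresholds are illustrative additions.

```python
def filter_detections(mask_records, score_fn, image_area,
                      min_area_frac=0.001, score_threshold=0.5):
    """Keep proposals that the classifier accepts.

    mask_records: dicts with 'bbox' as (x, y, w, h) and 'area' in pixels.
    score_fn: hypothetical callable mapping a (x, y, w, h) box to the
        probability that the crop shows the object of interest.
    Returns a list of ((x, y, w, h), score) tuples to hand to the tracker.
    """
    kept = []
    for record in mask_records:
        x, y, w, h = record["bbox"]
        if w <= 0 or h <= 0:
            continue  # skip degenerate boxes
        if record["area"] < min_area_frac * image_area:
            continue  # skip tiny segments before they reach the classifier
        score = score_fn((x, y, w, h))
        if score >= score_threshold:
            kept.append(((x, y, w, h), score))
    return kept


def fake_score(box):
    """Stand-in classifier for the example: favours wide boxes."""
    return 0.9 if box[2] >= 180 else 0.2


proposals = [
    {"bbox": (100, 80, 200, 120), "area": 18000},
    {"bbox": (5, 5, 4, 4), "area": 12},
    {"bbox": (300, 50, 150, 100), "area": 9000},
]
detections = filter_detections(proposals, fake_score, image_area=640 * 480)
```

In a real run, `score_fn` would crop the frame to the box, resize the crop, and run it through the classifier.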

Pipeline Step 1

Pipeline Step 2

Pipeline Step 3 : Inference

Potential Improvements

Detector

Segment Anything is used here as a zero-shot detector. However, it is slow and requires a classifier to filter detections.
A potential improvement would be to fine-tune an existing zero-shot detector like OWL-ViT, which could yield runtime improvements and remove the need for a classifier.

Classifier

The classifier can be improved with better augmentation (currently the generated views of the car only have a white background).
I used the SUN dataset with outdoor images (where no cars are present) as negative labels when training the classifier. However, for other objects of interest where it may be difficult to find negative labels, methods like mixup can be used to build the dataset.
Train the classifier with images at different scales. Currently, small segments manage to pass through the classifier, as shown below:
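The mixup idea mentioned in the list above can be sketched as follows. This is a minimal, framework-free illustration, assuming images are flat lists of floats and labels are one-hot lists; the function name `mixup` and the `alpha` value are assumptions, and mixup was not used in the current pipeline.

```python
import random


def mixup(image_a, label_a, image_b, label_b, alpha=0.2, rng=random):
    """Blend two training samples and their labels with a Beta-distributed weight."""
    lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
    image = [lam * a + (1.0 - lam) * b for a, b in zip(image_a, image_b)]
    label = [lam * a + (1.0 - lam) * b for a, b in zip(label_a, label_b)]
    return image, label, lam


# Usage: blend a positive sample with a negative one (toy two-pixel images).
rng = random.Random(0)
mixed_image, mixed_label, lam = mixup(
    [1.0, 0.0], [1.0, 0.0],   # positive image and its one-hot label
    [0.0, 1.0], [0.0, 1.0],   # negative image and its one-hot label
    rng=rng,
)
```

Because the labels are blended with the same weight as the pixels, the classifier is trained on soft targets rather than hard positive/negative classes.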

Tracker

Use an updated tracker like ByteTrack or BoT-SORT for better tracking.

Other Approaches Considered

I also considered the following:

Using the generated views (5600 views) of the object of interest to fine-tune an existing object detection network like DETR
Running the zero-shot detector over images at different scales

While both of the above options were viable, they had a lower chance of success in the short timeframe.
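For reference, the multi-scale option could look roughly like the sketch below: run the detector on several resized copies of the frame, map the boxes back to original coordinates, and merge duplicates with greedy non-maximum suppression. This was not implemented in the repository; `detect_fn` is a hypothetical callable that resizes the frame by the given scale and returns boxes in the resized frame's coordinates, and the scales and IoU threshold are illustrative.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def detect_multiscale(detect_fn, scales=(0.5, 1.0, 2.0), iou_threshold=0.5):
    """Run detect_fn at each scale and merge results in original coordinates.

    detect_fn: hypothetical callable, scale -> list of
        ((x1, y1, x2, y2), score) in the resized frame's pixel coordinates.
    """
    candidates = []
    for s in scales:
        for (x1, y1, x2, y2), score in detect_fn(s):
            # Undo the resize so every box lives in the original frame.
            candidates.append(((x1 / s, y1 / s, x2 / s, y2 / s), score))

    # Greedy NMS: keep the highest-scoring box, drop overlapping duplicates.
    candidates.sort(key=lambda c: c[1], reverse=True)
    kept = []
    for box, score in candidates:
        if all(iou(box, kept_box) < iou_threshold for kept_box, _ in kept):
            kept.append((box, score))
    return kept


def fake_detector(scale):
    """Stand-in detector: finds the same object, more confidently when upscaled."""
    confidence = {0.5: 0.4, 1.0: 0.6, 2.0: 0.9}[scale]
    box = tuple(v * scale for v in (100, 100, 200, 200))
    return [(box, confidence)]


merged = detect_multiscale(fake_detector)
```

The cost is one detector pass per scale, so this trades runtime for robustness to object size.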

Appendix

Installing Pytorch3D

This is a bit tricky. I run the following in a Jupyter notebook (.ipynb):

import os
import sys
import torch
need_pytorch3d=False
try:
    import pytorch3d
except ModuleNotFoundError:
    need_pytorch3d=True
if need_pytorch3d:
    if torch.__version__.startswith("2.1.") and sys.platform.startswith("linux"):
        # We try to install PyTorch3D via a released wheel.
        pyt_version_str=torch.__version__.split("+")[0].replace(".", "")
        version_str="".join([
            f"py3{sys.version_info.minor}_cu",
            torch.version.cuda.replace(".",""),
            f"_pyt{pyt_version_str}"
        ])
        !pip install fvcore iopath
        !pip install --no-index --no-cache-dir pytorch3d -f https://dl.fbaipublicfiles.com/pytorch3d/packaging/wheels/{version_str}/download.html
    else:
        # We try to install PyTorch3D from source.
        !pip install 'git+https://github.com/facebookresearch/pytorch3d.git@stable'

sushanthj/Object-Tracker