
【ICVGIP'2023】ViLP: Knowledge Exploration using Vision, Language and Pose Embeddings for Video Action Recognition


This is the official implementation of ViLP (ICVGIP'23 Oral), which leverages a cross-modal bridge to enhance video recognition by exploring tri-directional knowledge.

Overview

ViLP explores cross-modal knowledge from a pre-trained vision-language model (e.g., CLIP) and combines pose, visual information, and text attributes, a combination that has not been explored before.

[Figure: ViLP architecture overview]
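
As a rough illustration of how the three modalities could be combined, here is a minimal PyTorch sketch. It is not the paper's actual architecture; the embedding dimensions and the simple averaging fusion are assumptions for demonstration only.

import torch
import torch.nn as nn

# Minimal sketch only: project pose, visual, and text embeddings into a
# shared space and fuse them. Dimensions and the averaging fusion are
# illustrative assumptions, not ViLP's actual design.
class TriModalFusion(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, pose_dim=256, dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, dim)
        self.txt_proj = nn.Linear(txt_dim, dim)
        self.pose_proj = nn.Linear(pose_dim, dim)

    def forward(self, vis_emb, txt_emb, pose_emb):
        # Late fusion by averaging the projected embeddings.
        return (self.vis_proj(vis_emb)
                + self.txt_proj(txt_emb)
                + self.pose_proj(pose_emb)) / 3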


Prerequisites

The code is built with the following libraries:

  • PyTorch >= 1.8
  • RandAugment
  • pprint
  • tqdm
  • dotmap
  • yaml
  • csv
  • Optional: decord (for on-the-fly video training)
  • Optional: torchnet

Data Preparation

Video Loader

(Recommended) To train all of our models, we extract videos into frames for fast reading. Please refer to the MVFNet repo for a detailed guide on dataset processing.
The annotation file is a text file with one line per video; each line gives the directory containing the video's frames, the total number of frames, and the label, separated by whitespace.

Example of annotation
abseiling/-7kbO0v4hag_000107_000117 300 0
abseiling/-bwYZwnwb8E_000013_000023 300 0
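
For illustration, a small Python snippet that parses this format (the helper name is ours, not part of the codebase):

# Illustrative parser for the offline-frame annotation format shown above
# (frame directory, total frames, label, separated by whitespace).
def load_annotations(path):
    samples = []
    with open(path) as f:
        for line in f:
            frame_dir, num_frames, label = line.strip().rsplit(" ", 2)
            samples.append((frame_dir, int(num_frames), int(label)))
    return samples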

(Optional) We can also decode videos on the fly using decord. This approach should work but has not been tested; all of the models we provide were trained with offline frames.

Example of annotation
  abseiling/-7kbO0v4hag_000107_000117.mp4 0
  abseiling/-bwYZwnwb8E_000013_000023.mp4 0
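
If you go this route, a minimal sketch of reading frames with decord might look like the following; the 8-frame uniform sampling is an illustrative assumption, not the repo's actual sampling scheme.

from decord import VideoReader, cpu
import numpy as np

# Illustrative on-the-fly decoding with decord; not tested with our models.
vr = VideoReader("abseiling/-7kbO0v4hag_000107_000117.mp4", ctx=cpu(0))
indices = np.linspace(0, len(vr) - 1, num=8).astype(int)  # assumed 8-frame sampling
frames = vr.get_batch(indices).asnumpy()                  # (8, H, W, 3) uint8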

Annotation

Annotation information consists of two parts: the video label and the category description.
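
As a rough example of how category descriptions can be turned into text inputs for CLIP (the prompt template below is an assumption, not necessarily the one used in this repo):

import clip  # OpenAI CLIP package

# Hypothetical example: build text tokens from category names.
categories = ["abseiling", "air drumming"]                   # taken from the label file
prompts = [f"a video of a person {c}" for c in categories]   # assumed prompt template
text_tokens = clip.tokenize(prompts)                         # shape: (num_classes, 77)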

Heatmap Preparation

We follow and have cloned this OpenPose repo to generate heatmaps for the corresponding frames. Our implementation is outlined in img_to_heatmap.
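
For reference, the general idea of rendering joint heatmaps looks roughly like the sketch below. This is a generic Gaussian-heatmap illustration, not the actual img_to_heatmap implementation; the keypoint format (x, y, confidence) and the sigma value are assumptions.

import numpy as np

# Generic sketch: one Gaussian heatmap per detected joint.
def keypoints_to_heatmaps(keypoints, height, width, sigma=6.0):
    ys, xs = np.mgrid[0:height, 0:width]
    heatmaps = np.zeros((len(keypoints), height, width), dtype=np.float32)
    for j, (x, y, conf) in enumerate(keypoints):
        if conf <= 0:  # skip joints the pose detector did not find
            continue
        heatmaps[j] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
    return heatmaps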

🚀 Training

  1. Single GPU: To train our model on a single machine with one GPU, run:
sh scripts/run_train.sh  configs/ucf101/ucf_ViLP.yaml
# For ablation studies, replace train.py with train_pose_text.py, train_withut_text.py, etc. in scripts/run_train.sh as required.
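
The YAML configs are typically read into a DotMap (both packages are listed under Prerequisites). A minimal illustration of loading one, independent of the actual train.py code:

import yaml
from dotmap import DotMap

# Illustrative only; the real entry point is train.py via scripts/run_train.sh.
with open("configs/ucf101/ucf_ViLP.yaml") as f:
    cfg = DotMap(yaml.safe_load(f))
print(cfg.network)  # hypothetical field; actual keys depend on the config file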

⚡ Testing

We support single-view validation (default) and multi-view (4x3 views) validation.

# The testing command for obtaining top-1/top-5 accuracy.
sh scripts/run_test.sh Your-Config.yaml Your-Trained-Model.pt
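
For multi-view validation, the logits of all views of a video are typically averaged before taking the argmax. A rough PyTorch sketch of that idea (the helper and the dataloader layout are assumptions, not the repo's actual test code):

import torch

@torch.no_grad()
def multi_view_top1(model, loader, num_views=12):
    # Assumes each batch stacks all views of a video consecutively:
    # frames: (batch * num_views, T, C, H, W), labels: (batch,)
    correct, total = 0, 0
    for frames, labels in loader:
        logits = model(frames)                                    # (batch * num_views, n_cls)
        logits = logits.view(-1, num_views, logits.size(-1)).mean(dim=1)
        correct += (logits.argmax(dim=-1).cpu() == labels).sum().item()
        total += labels.numel()
    return correct / total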

📌 BibTeX & Citation

If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry😁.

@inproceedings{chaudhuri2023vilp,
  title={{ViLP}: Knowledge exploration using vision, language, and pose embeddings for video action recognition},
  author={Chaudhuri, Soumyabrata and Bhattacharya, Saumik},
  booktitle={Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing},
  pages={1--7},
  year={2023}
}

🎗️ Acknowledgement

This repository is built on top of BIKE and Text4Vis. Sincere thanks for their wonderful work.

👫 Contact

For any questions, please open an issue.
