【ICVGIP'2023】ViLP: Knowledge Exploration using Vision, Language and Pose Embeddings for Video Action Recognition
This is the official implementation of ViLP (ICVGIP'23 Oral), which leverages a cross-modal bridge to enhance video recognition by exploring tri-directional knowledge.
ViLP explores cross-modal knowledge from a pre-trained vision-language model (e.g., CLIP) to combine pose, visual information, and text attributes, a combination that had not been explored before.
The code is built with the following libraries:
- PyTorch >= 1.8
- RandAugment
- pprint
- tqdm
- dotmap
- yaml
- csv
- Optional: decord (for on-the-fly video training)
- Optional: torchnet
(Recommended) To train all of our models, we extract videos into frames for fast reading. Please refer to the MVFNet repo for a detailed guide on dataset processing.
The annotation file is a text file with multiple lines. Each line gives the directory containing a video's frames, the total number of frames, and the video's label, separated by whitespace.
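Such a file can be loaded with a few lines of Python. This is a minimal sketch; the function name and error handling are our own, not part of the repo:

```python
# Parse a frame-based annotation file where each line is:
#   <frame_dir> <num_frames> <label>
# separated by whitespace (sketch; not part of the official code).
def parse_annotation_file(path):
    records = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split()
            if not parts:  # skip blank lines
                continue
            frame_dir, num_frames, label = parts[0], int(parts[1]), int(parts[2])
            records.append((frame_dir, num_frames, label))
    return records
```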
Example of annotation
abseiling/-7kbO0v4hag_000107_000117 300 0
abseiling/-bwYZwnwb8E_000013_000023 300 0

(Optional) We can also decode the videos in an online fashion using decord. This approach should work but has not been tested; all of the models offered were trained using offline frames.
Example of annotation
abseiling/-7kbO0v4hag_000107_000117.mp4 0
abseiling/-bwYZwnwb8E_000013_000023.mp4 0

The annotation information consists of two parts: the video label and the category description.
- Video Label: As mentioned above, this part is the same as in traditional video recognition. Please refer to lists/ucf101/train_rgb_split_1.txt for the format.
- Category Description: We also need a textual description for each video category. Please refer to lists/ucf101/ucf_labels.csv for the format.
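A category CSV like this can be read with the standard library. This is a minimal sketch; we assume an `id,name` column layout with a header row, which may differ from the actual ucf_labels.csv, so adjust accordingly:

```python
import csv

# Load "class id -> category description" from a CSV file
# (sketch; assumes an "id,name" layout with a header row).
def load_label_csv(csv_path):
    id_to_name = {}
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the assumed header row
        for row in reader:
            id_to_name[int(row[0])] = row[1]
    return id_to_name
```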
We followed and cloned the OpenPose repo to generate heatmaps for the corresponding frames. Our implementation is outlined in img_to_heatmap.
- Single GPU: To train our model on a single machine with 1 GPU, run:
sh scripts/run_train.sh configs/ucf101/ucf_ViLP.yaml
# For ablation studies, replace train.py with train_pose_text.py, train_without_text.py, etc. in scripts/run_train.sh as required.

We support single-view validation (default) and multi-view (4x3 views) validation.
# The testing command for obtaining top-1/top-5 accuracy.
sh scripts/run_test.sh Your-Config.yaml Your-Trained-Model.pt
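Multi-view testing fuses the model's predictions over all spatial/temporal views of a video before picking the top classes. A minimal sketch of the usual score-averaging step (pure Python, names our own; the repo's actual fusion may differ):

```python
# Average per-class scores over all views of one video
# (e.g., 4 temporal clips x 3 spatial crops = 12 views).
# `view_logits` is a list of per-view score lists, one entry per view.
def fuse_views(view_logits):
    num_views = len(view_logits)
    num_classes = len(view_logits[0])
    return [sum(view[c] for view in view_logits) / num_views
            for c in range(num_classes)]
```

The fused scores are then ranked as usual to compute top-1/top-5 accuracy.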
If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry😁.
@inproceedings{chaudhuri2023vilp,
title={Vilp: Knowledge exploration using vision, language, and pose embeddings for video action recognition},
author={Chaudhuri, Soumyabrata and Bhattacharya, Saumik},
booktitle={Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing},
pages={1--7},
year={2023}
}

This repository is built upon BIKE and Text4Vis. Sincere thanks to the authors for their wonderful work.
For any question, please file an issue.
