【ICVGIP'2023】ViLP: Knowledge Exploration using Vision, Language and Pose Embeddings for Video Action Recognition
This is the official implementation of ViLP (ICVGIP'23 Oral), which leverages a cross-modal bridge to enhance video recognition by exploring tri-directional knowledge.
ViLP explores cross-modal knowledge from a pre-trained vision-language model (e.g., CLIP) to combine pose, visual information, and text attributes, a combination that had not been explored before.
The code is built with the following libraries:
- PyTorch >= 1.8
- RandAugment
- pprint
- tqdm
- dotmap
- yaml
- csv
- Optional: decord (for on-the-fly video training)
- Optional: torchnet
(Recommended) To train all of our models, we extract videos into frames for fast reading. Please refer to the MVFNet repo for a detailed guide on dataset processing.
The annotation file is a text file with multiple lines. Each line gives the directory containing a video's frames, the total number of frames, and the video's label, separated by whitespace.
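Such a file can be loaded with a few lines of Python. This is a minimal sketch; the function name and error handling are our own, not part of the repo:

```python
# Parse a frame-based annotation file where each line is:
#   <frame_dir> <num_frames> <label>
# separated by whitespace (sketch; not part of the official code).
def parse_annotation_file(path):
    records = []
    with open(path) as f:
        for line in f:
            parts = line.strip().split()
            if not parts:  # skip blank lines
                continue
            frame_dir, num_frames, label = parts[0], int(parts[1]), int(parts[2])
            records.append((frame_dir, num_frames, label))
    return records
```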
Example of annotation
abseiling/-7kbO0v4hag_000107_000117 300 0
abseiling/-bwYZwnwb8E_000013_000023 300 0

(Optional) We can also decode the videos in an online fashion using decord. This approach should work but has not been tested; all of the models offered were trained using offline frames.
Example of annotation
abseiling/-7kbO0v4hag_000107_000117.mp4 0
abseiling/-bwYZwnwb8E_000013_000023.mp4 0

The annotation information consists of two parts: the video label and the category description.
- Video Label: As mentioned above, this part is the same as in traditional video recognition. Please refer to lists/ucf101/train_rgb_split_1.txt for the format.
- Category Description: We also need a textual description for each video category. Please refer to lists/ucf101/ucf_labels.csv for the format.
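A category CSV like this can be read with the standard library. This is a minimal sketch; we assume an `id,name` column layout with a header row, which may differ from the actual ucf_labels.csv, so adjust accordingly:

```python
import csv

# Load "class id -> category description" from a CSV file
# (sketch; assumes an "id,name" layout with a header row).
def load_label_csv(csv_path):
    id_to_name = {}
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the assumed header row
        for row in reader:
            id_to_name[int(row[0])] = row[1]
    return id_to_name
```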
We followed and cloned the OpenPose repo to generate heatmaps for the corresponding frames. Our implementation is outlined in img_to_heatmap.
- Single GPU: To train our model on a single machine with 1 GPU, run:
sh scripts/run_train.sh configs/ucf101/ucf_ViLP.yaml
# For ablation studies, replace train.py with train_pose_text.py, train_without_text.py, etc. in scripts/run_train.sh as required.

We support single-view validation (default) and multi-view (4x3 views) validation.
# The testing command for obtaining top-1/top-5 accuracy.
sh scripts/run_test.sh Your-Config.yaml Your-Trained-Model.pt
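Multi-view testing fuses the model's predictions over all spatial/temporal views of a video before picking the top classes. A minimal sketch of the usual score-averaging step (pure Python, names our own; the repo's actual fusion may differ):

```python
# Average per-class scores over all views of one video
# (e.g., 4 temporal clips x 3 spatial crops = 12 views).
# `view_logits` is a list of per-view score lists, one entry per view.
def fuse_views(view_logits):
    num_views = len(view_logits)
    num_classes = len(view_logits[0])
    return [sum(view[c] for view in view_logits) / num_views
            for c in range(num_classes)]
```

The fused scores are then ranked as usual to compute top-1/top-5 accuracy.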
If you use our code in your research or wish to refer to the baseline results, please use the following BibTeX entry😁.
@inproceedings{chaudhuri2023vilp,
title={Vilp: Knowledge exploration using vision, language, and pose embeddings for video action recognition},
author={Chaudhuri, Soumyabrata and Bhattacharya, Saumik},
booktitle={Proceedings of the Fourteenth Indian Conference on Computer Vision, Graphics and Image Processing},
pages={1--7},
year={2023}
}

This repository is built upon BIKE and Text4Vis. Sincere thanks to the authors for their wonderful work.
For any question, please file an issue.
