
# VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning

Ye Liu¹†, Kevin Qinghong Lin²†, Chang Wen Chen¹, Mike Zheng Shou²

¹The Hong Kong Polytechnic University  ²Show Lab, National University of Singapore

**TL;DR:** Pioneering DeepSearch-like video understanding.

VideoMind is a multi-modal agent framework that enhances video reasoning by emulating human-like processes, such as breaking down tasks, localizing and verifying moments, and synthesizing answers. This approach addresses the unique challenges of temporally grounded reasoning through a progressive strategy.
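As an illustration only, the progressive strategy above can be sketched as a loop that dispatches each step to a dedicated role. Every function name below is hypothetical (the real entry points live in this repo's code); each role stands in for a LoRA adapter activated on a shared base model:

```python
# Hypothetical sketch of VideoMind's role-switching loop. Function names are
# illustrative, not the repo's actual API; each role stands in for a LoRA
# adapter activated on a shared base model.

def plan(question):
    """Planner: break the question into sub-tasks (stubbed here)."""
    return ["ground", "verify", "answer"]

def ground(question, video):
    """Grounder: localize a candidate moment as (start, end) in seconds."""
    return (12.0, 34.0)

def verify(moment, question, video):
    """Verifier: accept or reject the candidate moment."""
    start, end = moment
    return end > start  # trivial stand-in check

def answer(question, video, moment):
    """Answerer: produce the final response over the localized clip."""
    return f"answer grounded in segment {moment[0]:.1f}-{moment[1]:.1f}s"

def chain_of_lora(question, video):
    """Dispatch each planned step to its role, progressively."""
    moment = None
    for step in plan(question):
        if step == "ground":
            moment = ground(question, video)
        elif step == "verify" and not verify(moment, question, video):
            moment = None  # rejected: fall back to the full video
        elif step == "answer":
            return answer(question, video, moment)
```

The key design point this sketch captures is that grounding and verification happen *before* answering, so the answerer only ever reasons over a vetted segment rather than the whole long video.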

## 🔥 News

- **2025.04.05** 📊 See BENCHMARK.md for evaluation results of VideoMind on public benchmarks.
- **2025.03.28** 🚀 VideoMind-2B is ready on Hugging Face Spaces. Check it out!
- **2025.03.21** ⭐️ Code, model, and dataset release.
- **2025.03.17** 🎉 Our tech report is available online.

๐Ÿ† VideoMind on Public Benchmarks

| Benchmark | Evaluation Results (2B/7B) |
|---|---|
| ZS CG-Bench (mini) | long-acc: 31.0/38.4, rec@IoU: 8.50/9.93, acc@IoU: 4.02/4.67 |
| ZS ReXTime (val) | mIoU: 24.83/27.61, Acc: 69.06/74.59, Acc@IoU: 17.26/20.20 |
| ZS NExT-GQA (test) | mIoU: 28.6/31.4, mIoP: 36.4/39.0, Acc@GQA: 25.2/28.2 |
| ZS DeVE-QA (val)\* | mIoU: 26.3/30.1, mIoP: 49.9/51.9, Acc@GQA: 41.2/44.2 |
| ZS Charades-STA (test) | R@0.5: 51.1/59.1, R@0.7: 26.0/31.2, mIoU: 45.2/50.2 |
| ZS ActivityNet-Captions (val_2) | R@0.5: 26.5/30.3, R@0.7: 12.6/15.7, mIoU: 30.1/33.3 |
| FT QVHighlights (test) | R@0.5: 75.42/78.53, R@0.7: 59.35/61.09, mAP: 51.60/54.19 |
| FT TACoS (test) | R@0.5: 26.9/36.2, R@0.7: 15.5/21.4, mIoU: 27.4/34.4 |
| ZS Ego4D-NLQ (val) | R@0.5: 2.9/3.7, R@0.7: 1.2/1.7, mIoU: 4.7/5.4 |
| ZS ActivityNet-RTL (val) | P@0.5: 20.1/28.0, mIoU: 22.7/31.3 |
| ZS Video-MME (w/o subs) | All: 55.4/58.2, Long: 46.3/49.2 |
| ZS MLVU | M-Avg: 58.7/64.4 |
| ZS LVBench | Overall: 35.4/40.8 |
| ZS MVBench | Acc: 62.5/64.6 |
| ZS LongVideoBench | Acc: 48.8/56.3 |

ZS and FT refer to zero-shot and fine-tuned settings, respectively. \* denotes third-party results.

See BENCHMARK.md for full evaluation results.
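For readers unfamiliar with the grounding metrics in the table above, a minimal sketch of temporal IoU and recall at an IoU threshold (e.g. R@0.5) follows. This is a generic illustration, not the repo's actual evaluation code:

```python
def temporal_iou(pred, gt):
    """IoU between two (start, end) moments in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def recall_at(preds, gts, thresh=0.5):
    """R@thresh: fraction of samples whose prediction reaches the IoU threshold."""
    hits = sum(temporal_iou(p, g) >= thresh for p, g in zip(preds, gts))
    return hits / len(gts)
```

mIoU in the table is simply `temporal_iou` averaged over all samples, while R@0.5 and R@0.7 are `recall_at` with those thresholds.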

๐Ÿ•น๏ธ Gradio Demo

*(Demo video: demo.mp4)*

Play with our online demo, or see DEMO.md for instructions on deploying it locally.

## 📦 VideoMind-SFT Dataset

We provide raw videos, compressed videos, and pre-processed annotations for 27 video grounding / QA datasets, including our VideoMind-SFT (481K) for training and multiple benchmarks for evaluation. We also release the datasets used during our early exploration (but not included in the final version) to facilitate future research.

The list of source datasets is shown below. See our dataset repo for more details.

**Grounder (210K):**

| Dataset | Source | Processed (Recommended) |
|---|---|---|
| QVHighlights | Link | `qvhighlights` |
| DiDeMo | Link | `didemo` |
| TACoS | Link | `tacos` |
| QuerYD | Link | `queryd` |
| HiREST (Grounding) | Link | `hirest` |
| HiREST (Step Captioning) | Link | `hirest` |
| CosMo-Cap | Link | `cosmo_cap` |
| InternVid-VTime | Link | `internvid_vtime` |

**Verifier (232K):**

| Dataset | Source | Processed (Recommended) |
|---|---|---|
| QVHighlights-Verify | Link | `verifying`, `qvhighlights` |
| DiDeMo-Verify | Link | `verifying`, `didemo` |
| TACoS-Verify | Link | `verifying`, `tacos` |

**Planner (39K):**

| Dataset | Source | Processed (Recommended) |
|---|---|---|
| NExT-QA-Plan | Link | `planning`, `nextqa` |
| QVHighlights-Plan | Link | `planning`, `qvhighlights` |

**Benchmarks:**

| Dataset | Task | Source | Processed (Recommended) |
|---|---|---|---|
| CG-Bench | Grounded VideoQA | Link | `cgbench` |
| ReXTime | Grounded VideoQA | Link | `rextime`, `activitynet`, `qvhighlights` |
| NExT-GQA | Grounded VideoQA | Link | `nextgqa` |
| Charades-STA | VTG | Link | `charades_sta` |
| ActivityNet-Captions | VTG | Link | `activitynet_captions`, `activitynet` |
| QVHighlights | VTG | Link | `qvhighlights` |
| TACoS | VTG | Link | `tacos` |
| Ego4D-NLQ | VTG | Link | `ego4d_nlq`, `ego4d` |
| ActivityNet-RTL | VTG | Link | `activitynet_rtl`, `activitynet` |
| Video-MME | General VideoQA | Link | `videomme` |
| MLVU | General VideoQA | Link | `mlvu` |
| LVBench | General VideoQA | Link | `lvbench` |
| MVBench | General VideoQA | Link | `mvbench` |
| LongVideoBench | General VideoQA | Link | `longvideobench` |

The following datasets were not used in our project (some were partially used during early exploration), but we share them to facilitate future research.

| Dataset | Task | Training | Evaluation | Source | Processed (Recommended) |
|---|---|---|---|---|---|
| QaEgo4D | Grounded VideoQA | ✅ | ✅ | Link | `qa_ego4d`, `ego4d` |
| Ego4D-NaQ | VTG | ✅ | ✅ | Link | `ego4d_naq`, `ego4d` |
| Ego-TimeQA | VTG | ✅ | ❌ | Link | `ego_timeqa`, `ego4d` |
| Vid-Morp | VTG | ✅ | ❌ | Link | `vid_morp` |
| VideoXum | VTG (originally VS) | ✅ | ✅ | Link | `videoxum` |
| YouCook2 | VTG (originally DVC) | ✅ | ✅ | Link | `youcook2` |
| STAR | VideoQA | ✅ | ✅ | Link | `star`, `charades_sta` |
| COIN | - | - | - | Link | `coin` |

**Notes:**

1. For some datasets (e.g., ReXTime), the annotations and videos are stored in different folders. All the directories listed in *Processed* need to be downloaded.
2. Use the following commands to concatenate and extract video tar splits (e.g., `videos.tar.gz.00`, `videos_3fps_480_noaudio.tar.gz.00`):

```shell
# videos.tar.gz.00, videos.tar.gz.01
cat videos.tar.gz.* | tar -zxvf -

# videos_3fps_480_noaudio.tar.gz.00, videos_3fps_480_noaudio.tar.gz.01
cat videos_3fps_480_noaudio.tar.gz.* | tar -zxvf -
```
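If the `cat | tar` pipeline is inconvenient (e.g., on Windows without a POSIX shell), the same concatenate-then-extract step can be done in Python. This is a generic sketch, not a script shipped with this repo:

```python
import glob
import tarfile

def extract_splits(prefix, out_dir="."):
    """Concatenate tar.gz splits (prefix.00, prefix.01, ...) and extract them.

    `prefix` is the archive name without the split suffix,
    e.g. "videos.tar.gz" for parts "videos.tar.gz.00", "videos.tar.gz.01".
    """
    # Rebuild the full archive by appending the parts in order.
    with open(prefix, "wb") as out:
        for part in sorted(glob.glob(prefix + ".*")):
            with open(part, "rb") as f:
                out.write(f.read())
    # Extract the reassembled archive.
    with tarfile.open(prefix, "r:gz") as tar:
        tar.extractall(out_dir)
```

This works because the splits are plain byte-level chunks of one gzip stream, so appending them in lexicographic order reproduces the original `.tar.gz` exactly, just as `cat` does.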

## 🚀 Training

Our codebase supports training and evaluation on 27 video datasets and benchmarks, with the following features:

- Flexible hardware settings: NVIDIA GPU / Ascend NPU, single-node / multi-node
- Efficient training techniques: DeepSpeed ZeRO, BF16, LoRA, SDPA, FlashAttention2, Liger-Kernel
- Customizable base LLM and conversation templates
- Monitoring the training process via TensorBoard / Wandb
- Group sampling for mixed dataset training
- Multi-process / multi-device evaluation on public benchmarks

See TRAIN.md for a quick start guide.

## 🔮 Evaluation

See EVAL.md for details about evaluating VideoMind on public benchmarks.

## 📖 Citation

Please cite our paper if you find this project helpful.

```bibtex
@article{liu2025videomind,
  title={VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning},
  author={Liu, Ye and Lin, Kevin Qinghong and Chen, Chang Wen and Shou, Mike Zheng},
  journal={arXiv preprint arXiv:2503.13444},
  year={2025}
}
```
