This is the official repository of the paper "SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA".
Introduction | News | Usage | Statement
- We identify the limitations of existing Video TextVQA methods and Video-LLMs, and introduce SFA, the first training-free Video-LLM-based method tailored for the Video TextVQA task, which integrates visual text perception with video content comprehension to enable more accurate answering.
- To effectively guide the model’s attention toward key textual regions, we instantiate a three-step Scan–Focus–Amplify strategy, inspired by the human question-answering process: it first adaptively scans video frames to identify candidate areas, then filters and selects the most relevant text regions, and finally amplifies these regions to enhance textual clarity and improve answer accuracy.
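The Focus and Amplify steps described above can be illustrated with a small, purely hypothetical sketch. It is not the implementation in this repository: the word-overlap relevance score, the `pad` ratio, and the function names `focus` and `amplify_box` are all assumptions made for illustration, and the candidate regions are assumed to come from an earlier scan step.

```python
def focus(question, candidates, top_k=1):
    """Keep the top_k candidate text regions most relevant to the question.

    Relevance here is plain word overlap, a stand-in for whatever
    scoring the actual method uses (assumption).
    """
    q_words = set(question.lower().split())

    def score(candidate):
        return len(q_words & set(candidate["text"].lower().split()))

    return sorted(candidates, key=score, reverse=True)[:top_k]


def amplify_box(box, frame_w, frame_h, pad=0.2):
    """Expand an (x1, y1, x2, y2) box by `pad` of its size per side, clamped to the frame.

    The enlarged crop could then be upscaled before being passed to the model.
    """
    x1, y1, x2, y2 = box
    dx, dy = (x2 - x1) * pad, (y2 - y1) * pad
    return (max(0, x1 - dx), max(0, y1 - dy),
            min(frame_w, x2 + dx), min(frame_h, y2 + dy))


if __name__ == "__main__":
    # Hypothetical output of a scan step: recognized text with its bounding box.
    candidates = [
        {"text": "coffee shop", "box": (0, 0, 50, 20)},
        {"text": "speed limit 40", "box": (100, 100, 200, 150)},
    ]
    best = focus("what is the speed limit", candidates)[0]
    print(best["text"], amplify_box(best["box"], frame_w=640, frame_h=480))
```

Padding the box before cropping keeps some surrounding context around the text, which is one plausible design choice; the actual method may select and enlarge regions differently.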
26/11/2025
- The paper has been uploaded to arXiv!
You should first download M4-ViteVQA and RoadTextVQA, then organize the data as follows:
|- datasets
|--- M4-ViteVQA
|    |--- video
|    |    |--- 00000.mp4
|    |    └--- ...
|    └--- Annotations
|         |--- ViteVQA_0.0.2_t1s1val.json
|         └--- ...
|
|--- RoadTextVQA
|    |--- videos
|    |    |--- 1.mp4
|    |    └--- ...
|    |--- train.json
|    |--- val.json
|    └--- test.json
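Before running inference, it may help to confirm that the data matches the directory tree above. The following is a minimal sketch, not part of this repository: the helper name `missing_dataset_paths` and the default root `datasets` are assumptions, and it only checks the directories and split files named in the tree, not individual videos or annotation files.

```python
from pathlib import Path

# Paths taken from the directory tree above, relative to the datasets root.
EXPECTED_PATHS = [
    "M4-ViteVQA/video",
    "M4-ViteVQA/Annotations",
    "RoadTextVQA/videos",
    "RoadTextVQA/train.json",
    "RoadTextVQA/val.json",
    "RoadTextVQA/test.json",
]


def missing_dataset_paths(root="datasets"):
    """Return the expected paths that do not exist under `root`."""
    root = Path(root)
    return [p for p in EXPECTED_PATHS if not (root / p).exists()]


if __name__ == "__main__":
    missing = missing_dataset_paths()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

Run it from the directory that contains `datasets`, or pass a different root to `missing_dataset_paths` if the data lives elsewhere.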
Python_3.10 + PyTorch_2.6.0 + CUDA_12.2
git clone https://github.com/Hxyz-123/SFA.git
cd SFA
conda create -n sfa python=3.10
conda activate sfa
pip install -r requirements.txt
cd GoMatching/third_party
sh install.sh
We share the trained GoMatching weights used in SFA. You can download them to ./GoMatching/models.
sh examples/qwen_infer.sh
This project is for research purposes only. For any other questions, please contact haibinhe@whu.edu.cn.
If you find SFA helpful, please consider giving this repo a star and citing:
@article{he2025sfa,
title={SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA},
author={He, Haibin and Zhong, Qihuang and Liu, Juhua and Du, Bo and Wang, Peng and Zhang, Jing},
journal={arXiv preprint arXiv:2511.20190},
year={2025}
}