SFA

This is the official repository of the paper SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA.

Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang, Jing Zhang

Introduction | News | Usage | Statement

Introduction

We identify the limitations of existing Video TextVQA methods and Video-LLMs, and introduce SFA, the first training-free Video-LLM-based method tailored for the Video TextVQA task, which integrates visual text perception with video content comprehension to enable more accurate answering.
To effectively guide the model’s attention toward key textual regions, we instantiate a three-step Scan–Focus–Amplify strategy, inspired by the human question-answering process: it first adaptively scans video frames to identify candidate areas, then filters and selects the most relevant text regions, and finally amplifies these regions to enhance textual clarity and improve answer accuracy.
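To make the three steps concrete, here is a toy sketch of the control flow only. It is not the SFA implementation: the `TextRegion` type, the function names, the uniform frame stride (a stand-in for adaptive scanning), the word-overlap score, and the box-enlargement reading of "amplify" are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRegion:
    """Hypothetical container for one detected text region."""
    frame: int    # index of the frame the region was found in
    box: tuple    # (x0, y0, x1, y1) in pixels
    text: str     # recognized text (would come from a text spotter)


def scan(regions, stride=2):
    """Scan: keep regions from every `stride`-th frame (uniform stand-in)."""
    return [r for r in regions if r.frame % stride == 0]


def focus(regions, question, top_k=1):
    """Focus: rank regions by word overlap with the question, keep `top_k`."""
    q_words = set(question.lower().split())

    def score(region):
        return len(q_words & set(region.text.lower().split()))

    return sorted(regions, key=score, reverse=True)[:top_k]


def amplify(region, scale=2.0, frame_size=(1280, 720)):
    """Amplify: enlarge the box around its centre by `scale`, clamped to the frame."""
    x0, y0, x1, y1 = region.box
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    half_w, half_h = (x1 - x0) * scale / 2, (y1 - y0) * scale / 2
    width, height = frame_size
    return (max(0, cx - half_w), max(0, cy - half_h),
            min(width, cx + half_w), min(height, cy + half_h))


if __name__ == "__main__":
    detections = [
        TextRegion(0, (100, 100, 200, 140), "SPEED LIMIT 60"),
        TextRegion(1, (0, 0, 10, 10), "SPEED LIMIT 80"),
        TextRegion(2, (300, 300, 340, 320), "EXIT 12"),
    ]
    best = focus(scan(detections), "what is the speed limit")[0]
    print(best.text, amplify(best))
```

In the actual method the selected regions guide a Video-LLM; the sketch stops at producing the enlarged box.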

News

26/11/2025

The paper has been uploaded to arXiv!

Usage

Dataset

You should first download M4-ViteVQA and RoadTextVQA, then organize the data as follows:

|- datasets
		|--- M4-ViteVQA
		|      |--- video
		|            |--- 00000.mp4
		|            └--- ...
		|      └--- Annotations
		|            |--- ViteVQA_0.0.2_t1s1val.json
		|            └--- ...
		|
		|--- RoadTextVQA
		|      |--- videos
		|            |--- 1.mp4
		|            └--- ...
		|      |--- train.json
		|      |--- val.json
		|      └--- test.json
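
Before running inference, it may help to confirm that the layout matches the tree above. The following is a minimal sketch, not part of the repository: the helper name `missing_entries` and the default root `datasets` are assumptions, and the paths are read off the tree as shown.

```python
from pathlib import Path

# Entries read off the directory tree above (relative to the dataset root).
EXPECTED = [
    "M4-ViteVQA/video",
    "M4-ViteVQA/Annotations",
    "RoadTextVQA/videos",
    "RoadTextVQA/train.json",
    "RoadTextVQA/val.json",
    "RoadTextVQA/test.json",
]


def missing_entries(root="datasets"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_entries()
    if missing:
        print("Missing:", ", ".join(missing))
    else:
        print("Dataset layout looks complete.")
```

The check only tests that the directories and annotation files exist; it does not validate their contents.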

Installation

Python 3.10 + PyTorch 2.6.0 + CUDA 12.2

git clone https://github.com/Hxyz-123/SFA.git
cd SFA
conda create -n sfa python=3.10
conda activate sfa
pip install -r requirements.txt
cd GoMatching/third_party
sh install.sh
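
After installation, a quick check that the active environment matches the versions listed above can save debugging time. This is a small sketch using only the standard library; the function name `environment_report` is an assumption and not part of the repository.

```python
import sys
from importlib import metadata


def environment_report():
    """Collect the Python version and the installed torch version, if any."""
    report = {"python": f"{sys.version_info.major}.{sys.version_info.minor}"}
    try:
        report["torch"] = metadata.version("torch")
    except metadata.PackageNotFoundError:
        report["torch"] = None  # torch is not installed in this environment
    return report


if __name__ == "__main__":
    for name, version in environment_report().items():
        print(f"{name}: {version}")
```

If PyTorch is installed, `torch.version.cuda` additionally reports the CUDA version that build targets.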

Pre-trained model

We share the trained GoMatching weights used in SFA. Download them to ./GoMatching/models.

Evaluation

sh examples/qwen_infer.sh

Statement

This project is for research purposes only. For any other questions, please contact haibinhe@whu.edu.cn.

Citation

If you find SFA helpful, please consider giving this repo a star and citing:

@article{he2025sfa,
  title={SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA},
  author={He, Haibin and Zhong, Qihuang and Liu, Juhua and Du, Bo and Wang, Peng and Zhang, Jing},
  journal={arXiv preprint arXiv:2511.20190},
  year={2025}
}
