SFA

Haibin He, Qihuang Zhong, Juhua Liu, Bo Du, Peng Wang, Jing Zhang

Introduction | News | Usage | Statement

Introduction

  1. We identify the limitations of existing Video TextVQA methods and Video-LLMs, and introduce SFA, the first training-free Video-LLM-based method tailored for the Video TextVQA task, which integrates visual text perception with video content comprehension to enable more accurate answering.
  2. To effectively guide the model’s attention toward key textual regions, we instantiate a three-step Scan–Focus–Amplify strategy, inspired by the human question-answering process: it first adaptively scans video frames to identify candidate areas, then filters and selects the most relevant text regions, and finally amplifies these regions to enhance textual clarity and improve answer accuracy.
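The three steps above can be illustrated with a toy sketch. This is not the SFA implementation: the `TextRegion` type, the fixed frame stride, the word-overlap relevance score, and the `scale` factor are all hypothetical stand-ins chosen only to make the Scan, Focus, and Amplify stages concrete.

```python
import re
from dataclasses import dataclass


@dataclass
class TextRegion:
    """A detected text region (hypothetical structure, not from the SFA code)."""
    frame_idx: int
    box: tuple  # (x1, y1, x2, y2) in pixels
    text: str   # text assumed to come from some text spotter


def words(s):
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"\w+", s.lower()))


def scan(regions, frame_stride=2):
    """Scan: keep candidate regions from every `frame_stride`-th frame.
    (A fixed stride is an assumption; the paper describes adaptive scanning.)"""
    return [r for r in regions if r.frame_idx % frame_stride == 0]


def focus(candidates, question, top_k=1):
    """Focus: rank candidates by word overlap with the question (toy score)."""
    q = words(question)
    return sorted(candidates, key=lambda r: len(q & words(r.text)), reverse=True)[:top_k]


def amplify(region, scale=2):
    """Amplify: describe a crop of the region and its enlarged target size."""
    x1, y1, x2, y2 = region.box
    return {"crop": region.box, "size": ((x2 - x1) * scale, (y2 - y1) * scale)}


regions = [
    TextRegion(0, (10, 10, 110, 40), "SPEED LIMIT 60"),
    TextRegion(1, (0, 0, 50, 20), "EXIT"),
    TextRegion(2, (5, 5, 65, 25), "Main Street"),
]
question = "What is the speed limit?"

best = focus(scan(regions), question)[0]
zoomed = amplify(best)
print(best.text, zoomed)
```

In a real pipeline the enlarged crop would be passed to the Video-LLM together with the question; that step is omitted here.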

News

26/11/2025

  • The paper has been uploaded to arXiv!

Usage

Dataset

First download M4-ViteVQA and RoadTextVQA, then organize the data as follows:

|- datasets
    |--- M4-ViteVQA
    |      |--- video
    |      |      |--- 00000.mp4
    |      |      └--- ...
    |      └--- Annotations
    |             |--- ViteVQA_0.0.2_t1s1val.json
    |             └--- ...
    |
    |--- RoadTextVQA
           |--- videos
           |      |--- 1.mp4
           |      └--- ...
           |--- train.json
           |--- val.json
           └--- test.json
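Before running evaluation, it can help to confirm that the layout above is in place. The following is a minimal sketch, not part of the repository: the helper name `missing_paths` is hypothetical, and it checks only the directories and JSON files named in the tree, not individual videos or annotation files.

```python
from pathlib import Path

# Entries taken from the directory tree above, relative to the datasets root.
EXPECTED = [
    "M4-ViteVQA/video",
    "M4-ViteVQA/Annotations",
    "RoadTextVQA/videos",
    "RoadTextVQA/train.json",
    "RoadTextVQA/val.json",
    "RoadTextVQA/test.json",
]


def missing_paths(root="datasets"):
    """Return the expected entries that do not exist under `root`."""
    root = Path(root)
    return [rel for rel in EXPECTED if not (root / rel).exists()]


if __name__ == "__main__":
    missing = missing_paths()
    if missing:
        print("Missing dataset entries:")
        for rel in missing:
            print("  " + rel)
    else:
        print("Dataset layout looks complete.")
```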

Installation

Python 3.10 + PyTorch 2.6.0 + CUDA 12.2

git clone https://github.com/Hxyz-123/SFA.git
cd SFA
conda create -n sfa python=3.10
conda activate sfa
pip install -r requirements.txt
cd GoMatching/third_party
sh install.sh
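After installation, a quick check like the one below can confirm that the environment matches the versions listed above. This is a sketch, not a script shipped with the repository: the function name `check_environment` is hypothetical, and it only reports versions without enforcing them.

```python
import importlib.util
import sys


def check_environment():
    """Report the Python version and, if PyTorch is installed, its version and CUDA build."""
    report = {"python": "%d.%d" % sys.version_info[:2], "torch": None, "cuda": None}
    if importlib.util.find_spec("torch") is not None:
        import torch

        report["torch"] = torch.__version__
        # torch.version.cuda is None for CPU-only builds.
        report["cuda"] = torch.version.cuda
    return report


if __name__ == "__main__":
    for name, value in check_environment().items():
        print(name + ": " + str(value))
```

With the setup described above, the report would be expected to show Python 3.10, a PyTorch 2.6.0 build, and a CUDA 12.x toolkit version.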

Pre-trained model

We share the trained GoMatching weights used in SFA. Download them to ./GoMatching/models.

Evaluation

sh examples/qwen_infer.sh

Statement

This project is for research purposes only. For any other questions, please contact haibinhe@whu.edu.cn.

Citation

If you find SFA helpful, please consider giving this repo a star and citing:

@article{he2025sfa,
  title={SFA: Scan, Focus, and Amplify toward Guidance-aware Answering for Video TextVQA},
  author={He, Haibin and Zhong, Qihuang and Liu, Juhua and Du, Bo and Wang, Peng and Zhang, Jing},
  journal={arXiv preprint arXiv:2511.20190},
  year={2025}
}
