Jiwan Chung* Junhyeok Kim* Siyeol Kim Jaeyoung Lee Minsoo Kim Youngjae Yu
```bash
conda create -n v1 python=3.10 -y
conda activate v1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
Launching the Gradio demo is highly recommended, as the copy tokens are displayed directly on the image:

```bash
python run_gradio.py
```

Alternatively, run the inference script:

```bash
python inference.py
```

The script uses a default image URL and text prompt. To use your own inputs, modify the `image` variable within the `messages` list and the `text` field of the user prompt.
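For reference, here is a minimal sketch of what such a `messages` list typically looks like. It assumes the Qwen-VL-style chat format common to scripts like this; the exact keys in `inference.py` may differ, so treat this as illustrative rather than authoritative:

```python
# Hypothetical example of the `messages` structure; the actual keys in
# inference.py may differ. Swap in your own image URL/path and prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/your_image.jpg"},
            {"type": "text", "text": "Which object is the person pointing at?"},
        ],
    }
]
```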
We have released a 100-item sample of our v1g dataset on the Hugging Face Hub. You can load it easily using the `datasets` library:
```python
from datasets import load_dataset

ds = load_dataset("kjunh/v1g-sample")
```
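To sanity-check the sample before using it, you can inspect its splits and schema. This snippet relies only on generic `datasets` introspection, so it makes no assumptions about the column names:

```python
from datasets import load_dataset

ds = load_dataset("kjunh/v1g-sample")

# List splits and sizes, then show the column schema and one raw
# record from the first split (whatever it is named).
print(ds)
split = next(iter(ds.values()))
print(split.features)
print(split[0])
```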
Release checklist:

- Inference code
- Training data sample
- Training data
- Evaluation code
- Training code
If you find our work valuable, please cite:
```bibtex
@misc{chung2025v1learningpointvisual,
    title={v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning},
    author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu},
    year={2025},
    eprint={2505.18842},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.18842},
}
```
