Jiwan Chung* Junhyeok Kim* Siyeol Kim Jaeyoung Lee Minsoo Kim Youngjae Yu
```bash
conda create -n v1 python=3.10 -y
conda activate v1
pip install -r requirements.txt
pip install flash-attn --no-build-isolation
```
Launching the Gradio demo is highly recommended, as the copy tokens are displayed directly on the image:

```bash
python run_gradio.py
```

Alternatively, run the inference script:

```bash
python inference.py
```

The script uses a default image URL and text prompt. To use your own inputs, modify the `image` variable within the `messages` list and the `text` field of the user prompt.
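For reference, here is a minimal sketch of what such a `messages` list typically looks like. It assumes the Qwen-VL-style chat format common to scripts like this; the exact keys in `inference.py` may differ, so treat this as illustrative rather than authoritative:

```python
# Hypothetical example of the `messages` structure; the actual keys in
# inference.py may differ. Swap in your own image URL/path and prompt.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://example.com/your_image.jpg"},
            {"type": "text", "text": "Which object is the person pointing at?"},
        ],
    }
]
```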
We have released a 100-item sample of our v1g dataset on the Hugging Face Hub. You can load it easily using the `datasets` library:
```python
from datasets import load_dataset

ds = load_dataset("kjunh/v1g-sample")
```
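To sanity-check the sample before using it, you can inspect its splits and schema. This snippet relies only on generic `datasets` introspection, so it makes no assumptions about the column names:

```python
from datasets import load_dataset

ds = load_dataset("kjunh/v1g-sample")

# List splits and sizes, then show the column schema and one raw
# record from the first split (whatever it is named).
print(ds)
split = next(iter(ds.values()))
print(split.features)
print(split[0])
```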
Release checklist:

- Inference code
- Training data sample
- Training data
- Evaluation code
- Training code
If you find our work valuable, please cite:
```bibtex
@misc{chung2025v1learningpointvisual,
    title={v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning},
    author={Jiwan Chung and Junhyeok Kim and Siyeol Kim and Jaeyoung Lee and Min Soo Kim and Youngjae Yu},
    year={2025},
    eprint={2505.18842},
    archivePrefix={arXiv},
    primaryClass={cs.CL},
    url={https://arxiv.org/abs/2505.18842},
}
```
