Planting a SEED of Vision in Large Language Model

Abstract

We present SEED, an elaborate image tokenizer that empowers Large Language Models (LLMs) with the emergent ability to SEE and Draw at the same time. Research on image tokenizers has previously reached an impasse: frameworks employing quantized visual tokens have lost prominence due to subpar performance and convergence in multimodal comprehension (compared to BLIP-2, etc.) or generation (compared to Stable Diffusion, etc.). Despite these limitations, we remain confident in the natural capacity of discrete visual tokens to unify visual and textual representations, facilitating scalable multimodal training with the LLM's original recipe. In this study, we identify two crucial principles for the architecture and training of SEED that effectively ease subsequent alignment with LLMs. (1) Image tokens should be independent of 2D physical patch positions and instead be produced with a 1D causal dependency, exhibiting intrinsic interdependence that aligns with the left-to-right autoregressive prediction mechanism in LLMs. (2) Image tokens should capture high-level semantics consistent with the degree of semantic abstraction in words, and be optimized for both discriminativeness and reconstruction during the tokenizer training phase. As a result, an off-the-shelf LLM is able to perform both image-to-text and text-to-image generation by incorporating SEED through efficient LoRA tuning. Comprehensive multimodal pretraining and instruction tuning, which may yield improved results, are reserved for future investigation. This version of SEED was trained in 5.7 days using only 64 V100 GPUs and 5M publicly available image-text pairs. Our preliminary study emphasizes the great potential of discrete visual tokens in versatile multimodal LLMs and the importance of proper image tokenizers in broader research.
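To make principle (1) concrete, here is a minimal sketch of how discrete image tokens with a 1D ordering could share one sequence with text tokens, so that image-to-text and text-to-image both reduce to left-to-right next-token prediction. This is not the SEED implementation: the vocabulary size, codebook size, special-token ids, and all function names below are assumptions chosen for illustration.

```python
# Illustrative only: every constant and name here is hypothetical,
# not taken from the SEED paper or code.
TEXT_VOCAB_SIZE = 50_000                     # assumed LLM text vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192                  # assumed visual codebook size
BOI = TEXT_VOCAB_SIZE + IMAGE_CODEBOOK_SIZE  # assumed "begin of image" token id
EOI = BOI + 1                                # assumed "end of image" token id


def image_code_to_token_id(code: int) -> int:
    """Shift a visual codebook index into the id range after the text vocabulary."""
    if not 0 <= code < IMAGE_CODEBOOK_SIZE:
        raise ValueError(f"code {code} is outside the codebook")
    return TEXT_VOCAB_SIZE + code


def build_sequence(text_ids, image_codes, image_first):
    """Concatenate text ids and image ids into a single 1D token sequence."""
    image_ids = [BOI] + [image_code_to_token_id(c) for c in image_codes] + [EOI]
    if image_first:
        return image_ids + list(text_ids)  # image-to-text: caption follows image
    return list(text_ids) + image_ids      # text-to-image: image follows prompt


def next_token_pairs(sequence):
    """Return (prefix, target) pairs for left-to-right next-token prediction."""
    return [(sequence[:i], sequence[i]) for i in range(1, len(sequence))]


caption = [11, 42, 7]  # made-up text token ids
codes = [5, 900, 31]   # made-up 1D image codes from a tokenizer

i2t = build_sequence(caption, codes, image_first=True)
t2i = build_sequence(caption, codes, image_first=False)
print(i2t)
print(t2i)
```

Because the image codes are already ordered in 1D, the same training objective covers both directions; only the order of the two segments changes.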

SEED Tokenizer for Image Reconstruction

SEED-OPT_2.7B for Multimodal Comprehension

SEED-OPT_2.7B for Multimodal Generation

To Do

Release SEED Tokenizer
Release SEED-LLM

Citation

If you find the work helpful, please consider citing:

@misc{ge2023planting,
      title={Planting a SEED of Vision in Large Language Model}, 
      author={Yuying Ge and Yixiao Ge and Ziyun Zeng and Xintao Wang and Ying Shan},
      year={2023},
      eprint={2307.08041},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

The project is still in progress. Stay tuned for more updates!
