I recommend adding a line on vision foundation models, from MaskGIT [1] to Muse [2] to Meissonic [3]. Covering both the vision and language perspectives would make the survey more comprehensive, especially in the context of Figure 1.
[1] MaskGIT: Masked Generative Image Transformer
[2] Muse: Text-To-Image Generation via Masked Generative Transformers
[3] Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis