<!DOCTYPE html>
<html>
<head>
<title>InterleavedEval</title>
<link rel="stylesheet" href="./index.css">
<script src="./index.js"></script>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css">
<link rel="shortcut icon" href="./imgs/vision_flan_logo.jpg"/>
</head>
<body>
<!-- NAVBAR -->
<nav class="navbar">
<div class="navbarmenu">
<a class="visionFlan">InterleavedEval</a>
</div>
</nav>
<div class="major_section">
<h1>InterleavedEval: Holistic Evaluation for Interleaved Text-and-Image Generation</h1>
<p>Created by Virginia Tech's NLP Lab. </p>
<hr>
<br>
<!-- <center>
<img src="./imgs/vision_flan_logo.jpg" width="400px">
<i><p>Generated by <a href="https://ideogram.ai/">https://ideogram.ai/</a></p></i>
</center> -->
<br>
<center>
<div>
<a href="https://github.com/VT-NLP/InterleavedBench"><button><i class="fa fa-github"></i> Code</button></a>
<a href="https://arxiv.org/abs/2406.14643"><button><i class="fa fa-newspaper-o"></i> Paper</button></a>
<!-- <a href="./tasks.html"><button><i class="fa fa-book"></i> Tasks</button></a> -->
<!-- <a href="#data_samples"><button><i class="fa fa-picture-o"></i> Samples</button></a> -->
<a href="https://huggingface.co/mqliu/InterleavedBench"><button><i class="fa fa-database"></i> Dataset</button></a>
<!-- <a href="#models"><button><i class="fa fa-gears"></i> Models</button></a> -->
<a href="#acknowledgement"><button><i class="fa fa-arrow-circle-o-right"></i> Acknowledgement</button></a>
</div>
</center>
<br>
<br>
<img src="./images/teaser.png" id="figure1">
<br>
<p class="dataset_description"> Interleaved text-and-image generation has been an intriguing research direction, where the models are required to generate both images and text pieces in an arbitrary order. Despite the emerging advancements in interleaved generation, the progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics which fall short in assessing the quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks to cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate the existing models with a strong correlation with human judgments surpassing previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
</p>
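<p class="dataset_description">The five aspect definitions above are what InterleavedEval scores. The snippet below is only a minimal sketch of how a GPT-4o judge could be queried for a single aspect on a 0-5 scale, assuming the OpenAI Python SDK; the prompt wording and function name are illustrative and are not the exact prompts used by InterleavedEval.</p>
<pre><code>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ASPECTS = ["text quality", "perceptual quality", "image coherence",
           "text-image coherence", "helpfulness"]

def score_aspect(instruction, output_text, image_urls, aspect):
    """Illustrative single-aspect query; the official InterleavedEval prompts may differ."""
    content = [{"type": "text",
                "text": (f"Instruction: {instruction}\n"
                         f"Generated text: {output_text}\n"
                         f"Rate the {aspect} of this interleaved text-and-image output "
                         "on a 0-5 scale and briefly justify the rating, "
                         "ending with 'Score:' followed by the number.")}]
    # Attach the generated images so GPT-4o can judge the visual content as well.
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
</code></pre>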
</div><br><br><br><br>
<div class="section_title">
<h1>InterleavedBench<hr></h1>
<center><img src="./images/figure2.png" id="figure2"></center>
<p class="dataset_description">We introduce INTERLEAVEDBENCH, the first comprehensive benchmark meticulously constructed
to evaluate text-and-image interleaved generation.</p>
<p class="dataset_description">Our dataset includes <b>two subsets</b>:</p>
<ul>
<li><p><b>context-based</b>:
a subset where each instance contains a multimodal
context of interleaved text and images in the input (<b><i>first row in the figure above</i></b>)</p></li>
<li><p><b>context-free</b>:
a subset with text-only inputs (<b><i>second row in the figure above</i></b>). The context-free subset assesses whether
the model can creatively generate interleaved content based on a text-only instruction, while the
context-based subset better benchmarks the coherence and consistency of the generated outputs.</p>
</li>
</ul>
</div>
<div class="section_title">
<h1>Comparison with Existing Benchmarks<hr></h1>
<table>
<tr>
<th>Dataset Name</th>
<th>Detailed Instruction</th>
<th>Image Input</th>
<th>Text Output</th>
<th>Image Output</th>
</tr>
<tr>
<td>MagicBrush</td>
<td>No</td>
<td>Single</td>
<td>No</td>
<td>Single</td>
</tr>
<tr>
<td>DreamBench</td>
<td>No</td>
<td>Multiple</td>
<td>No</td>
<td>Single</td>
</tr>
<tr>
<td>CustomDiffusion</td>
<td>No</td>
<td>Multiple</td>
<td>No</td>
<td>Single</td>
</tr>
<tr>
<td>DreamEditBench</td>
<td>No</td>
<td>Multiple</td>
<td>No</td>
<td>Single</td>
</tr>
<tr>
<td>Mantis-Eval</td>
<td>Yes</td>
<td>Multiple</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>InterleavedBench (Ours)</td>
<td>Yes</td>
<td>Multiple</td>
<td>Yes</td>
<td>Multiple</td>
</tr>
</table>
<p class="dataset_description">
We highlight the following key differences and
unique challenges introduced by our INTERLEAVEDBENCH compared with the existing benchmark:
<ul>
<li><p><b>(1) Output modality</b>: our benchmark requires the models to generate interleaved text and
multiple images that can appear in an arbitrary
order, whereas existing benchmarks only cover outputs with a single modality or a single image;</p></li>
<li><p><b>(2)
Requirement on coherence</b>: given that both inputs and outputs in our benchmark can contain
multiple pieces of text and images, our dataset
can assess whether the outputs are coherent and
consistent with the input instruction and context, as well as within the outputs themselves;</p></li>
<li><p><b>(3) Instruction following</b>: each instance in our benchmark contains a detailed
human-annotated instruction describing the task.
Thus, our dataset can evaluate models' instruction-following and generalization capabilities. We show
the difference between our benchmark and existing
datasets in the table above.</p></li>
</ul>
</p>
</div>
<div class="section_title">
<h1>Main Results<hr></h1>
<h3>Baselines</h3><hr>
<p class="dataset_description">
<ul>
<li><p><b>(1) MiniGPT-5 (Zheng et al., 2023a)</b> which connects a large language model with a Stable Diffusion
model via generative vokens, enabling description-free multimodal generation.</p></li>
<li><p><b>(2) GILL (Koh et al.,
2023)</b> which allows a pretrained large language
model to generate multimodal responses by mapping the hidden states of text into the embedding
space of an image generation model.</p></li>
<li><p><b>(3) EMU-2 (Sun et al., 2023a)</b> which induces in-context
learning capabilities of LLMs by scaling up the
model size and the size of the pretraining dataset;
</p></li>
<li><p><b>(4) EMU-2 Gen + Gold Text</b> where EMU-2 Gen
is a pretrained EMU-2 model instruction-tuned on
various controllable image generation tasks. However, EMU-2 Gen cannot generate
text, so we combine its generated images with the ground-truth textual responses to form
complete text-and-image interleaved content for evaluation.</p></li>
<li><p><b>(5) GPT-4o (OpenAI, 2024) + DALL·E 3 (Betker
et al.)</b> where GPT-4o is the state-of-the-art proprietary
LMM that can comprehend interleaved text-and-image inputs and generate text-only responses.
We leverage GPT-4o to generate text responses as
well as captions for image responses in the desired
positions. Then the captions are fed into DALL·E
3 to generate images. Finally, we combine the
text responses with the generated images in their original order (a rough sketch of this pipeline is shown after this list).</p></li>
<li><p><b>(6) Gemini-1.5 (Anil et al., 2023) +
SDXL (Podell et al., 2023)</b>: we build this baseline
in a similar way to GPT-4o + DALL·E 3 but use
Gemini-1.5 Pro as the LMM and Stable Diffusion
XL Turbo as the image generation model.</p></li>
</ul>
</p>
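<p class="dataset_description">For baseline (5), the sketch below illustrates the two-stage pipeline described above, assuming the OpenAI Python SDK; the prompt wording, the 'IMAGE:' placeholder convention, the function name, and the image size are illustrative assumptions rather than the exact setup used for the baseline.</p>
<pre><code>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_interleaved(instruction):
    """Illustrative two-stage pipeline; prompt wording and placeholder format are assumed."""
    # Stage 1: GPT-4o writes the textual response and marks where images should go
    # with caption lines that start with "IMAGE:".
    plan = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": (instruction + "\nWrite the response. Wherever an image "
                               "should appear, insert a line starting with 'IMAGE:' "
                               "followed by a caption describing that image.")}],
    ).choices[0].message.content

    # Stage 2: each caption line is rendered by DALL-E 3, and text and images are
    # kept in their original order.
    pieces = []
    for line in plan.splitlines():
        if line.startswith("IMAGE:"):
            image = client.images.generate(model="dall-e-3",
                                           prompt=line[len("IMAGE:"):].strip(),
                                           n=1, size="1024x1024")
            pieces.append(("image", image.data[0].url))
        else:
            pieces.append(("text", line))
    return pieces
</code></pre>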
<h3>Automatic Evaluation</h3><hr>
<i>Note: TIC stands for "Text-Image Coherence"; scores in this table are on a 0-5 scale.</i>
<div class="tag_holder">
<p id="best">Best</p>
<p id="middle">Middle</p>
<p id="worst">Worst</p>
</div>
<table>
<tr>
<th>Model</th>
<th>Text Quality</th>
<th>Perceptual Quality</th>
<th>Image Coherence</th>
<th>TIC</th>
<th>Helpfulness</th>
<th>AVG</th>
</tr>
<tr>
<td class="model_text">MiniGPT-5</td>
<td>1.22</td>
<td>2.45</td>
<td>1.62</td>
<td>2.03</td>
<td>1.77</td>
<td>1.82</td>
</tr>
<tr>
<td class="model_text">GILL</td>
<td>0.75</td>
<td>3.21</td>
<td>2.25</td>
<td>1.53</td>
<td>1.48</td>
<td>1.84</td>
</tr>
<tr>
<td class="model_text">EMU-2</td>
<td>1.26</td>
<td>2.28</td>
<td>1.89</td>
<td>1.34</td>
<td>1.64</td>
<td>1.68</td>
</tr>
<tr>
<td class="model_text">EMU-2 (Gold Text)</td>
<td>1.56</td>
<td>3.35</td>
<td>2.89</td>
<td>1.43</td>
<td>2.10</td>
<td>2.27</td>
</tr>
<tr style="background-color: rgba(231, 235, 1, 0.11);">
<td class="model_text">Gemini1.5 + SDXL</td>
<td><b>4.40</b></td>
<td>3.99</td>
<td><b>3.64</b></td>
<td>4.13</td>
<td>3.62</td>
<td>3.96</td>
</tr>
<tr style="background-color: rgba(1, 235, 1, 0.11);">
<td class="model_text">GPT-4o + DALLE3</td>
<td>4.37</td>
<td><b>4.36</b></td>
<td>3.51</td>
<td><b>4.55</b></td>
<td><b>3.88</b></td>
<td><b>4.13</b></td>
</tr>
</table>
<br><br>
<h3>Human Evaluation</h3><hr>
<i>Note: TIC stands for "Text-Image Coherence"; scores in this table are on a 0-3 scale.</i>
<div class="tag_holder">
<p id="best">Best</p>
<p id="middle">Middle</p>
<p id="worst">Worst</p>
</div>
<table>
<tr>
<th>Model</th>
<th>Text Quality</th>
<th>Perceptual Quality</th>
<th>Image Coherence</th>
<th>TIC</th>
<th>Helpfulness</th>
<th>AVG</th>
</tr>
<tr>
<td class="model_text">GILL</td>
<td>1.35</td>
<td>1.89</td>
<td>1.72</td>
<td>1.43</td>
<td>1.19</td>
<td>1.52</td>
</tr>
<tr>
<td class="model_text">EMU-2</td>
<td>1.23</td>
<td>1.74</td>
<td>1.87</td>
<td>1.24</td>
<td>1.2</td>
<td>1.46</td>
</tr>
<tr style="background-color: rgba(1, 235, 1, 0.11);">
<td class="model_text">Gemini1.5 + SDXL</td>
<td><b>2.59</b></td>
<td>2.36</td>
<td><b>2.13</b></td>
<td>2.27</td>
<td>2.08</td>
<td>2.28</td>
</tr>
<tr style="background-color: rgba(1, 235, 1, 0.11);">
<td class="model_text">GPT-4o + DALLE3</td>
<td>2.49</td>
<td><b>2.51</b></td>
<td>2.02</td>
<td><b>2.31</b></td>
<td><b>2.13</b></td>
<td><b>2.29</b></td>
</tr>
</table>
<br><br>
<h3>Evaluation results on each aspect for each task<hr></h3>
<div id="side-by-side">
<img src="./images/figure5.png" id="figure5"><img src="./images/figure3.png" id="figure3">
</div>
<br><br>
<h3>Qualitative Analysis<hr></h3>
<center><img src="./images/figure4.png" id="figure4"></center>
</div>
<!-- <div class="section_title" id="data_samples">
<h1>Data Samples<hr></h1>
</div>
<div class="content-block" id="content-block">
<p class="data_loading">DATA LOADING . . .</p>
</div>
<div class="section_title">
<p class="dataset_description">Each instance consists of 3 primary elements: Image, Instruction, and Ouput.</p>
<ul>
<li><p><b>Image:</b> An image that is used as reference when performing the task specified by the instruction.</p></li>
<li><p><b>Instruction:</b> A description or prompt of a task that is executed by the vision-language model.</p></li>
<li><p><b>Output:</b> The expected answer to the instruction given the provided image.</p></li>
</ul>
<br><br><br>
</div>
-->
<!-- <div class="section_title" id="collection_annotation">
<h1>Data Collection and Annotation<hr></h1>
<br><br>
<center><img src="./imgs/pipeline_snip.png" width="100%"><br><i><p>Icon source: <a href="https://flaticon.com/">https://flaticon.com</a></p></i></center>
<br><br>
<p class="dataset_description">To ensure the coverage and quality of tasks, we proposed an annotation pipeline as demonstrated in the above figure. First, the authors search on the internet to identify interesting vision-language tasks. Second, the tasks are assigned to the annotators and the annotators write download and preprocessing scripts to prepare the data. Once the dataset is processed into the required format, the authors and annotators start discuss potential tasks that can be derived from the existing annotations. Third, the annotators write instructions and templates for each task and the authors provide feedbacks for revising the instructions. This step can repeat multiple times until the instructions meet the requirement. Forth, the annotators upload the processed datasets and instructions to our database. Finally, the authors double-check the correctness of the instructions, images and outputs. The authors also check the grammar and fluency of the instructions. All the annotators are graduate computer science students who have strong background in machine learning and deep learning.
</p>
</div>
<br><br><br><br>
<div class="section_title" id="download">
<h1>Download<hr></h1>
<table>
<tr>
<th>File</th>
<th>Size on Disk</th>
<th>Sample Size</th>
</tr>
<tr>
<td><a href="https://huggingface.co/datasets/InterleavedEval/InterleavedEval_191-task_1k">annotation_191-task_1k.json</a></td>
<td>108M</td>
<td>186k</td>
</tr>
<tr>
<td><a href="https://huggingface.co/datasets/InterleavedEval/InterleavedEval_191-task_1k/tree/main">image_191-task_1k.zip</a></td>
<td>37GB</td>
<td>186k</td>
</tr>
<!-- <tr>
<td><a href="">instructions.json</a></td>
<td>1TB</td>
<td>300k</td>
</tr> -->
<!-- </table><br><br>
<p class="dataset_description">We provide the download links to the annotations and images above. In the annotations file, we merged instructions and templates with original tasks' inputs and outputs. To train a model on InterleavedEval, you can simply download the annotations and images. The annotations file consists of 191 tasks and for each task we randomly sampled 1K instances which should be sufficient for the purpose of instruction tuning. By now we can not release all tasks since some datasets are not allowed to be distributed.
</p> -->
<!-- </div> -->
<br><br><br><br>
<div class="section_title">
<h1>Citation<hr></h1>
<p class="dataset_description">If you use InterleavedEval in your research, please cite the following papers.</p>
<center><div class="bibtex">
<pre><code>
@article{liu_holistic_2024,
author = {Minqian Liu and
Zhiyang Xu and
Zihao Lin and
Trevor Ashby and
Joy Rimchala and
Jiaxin Zhang and
Lifu Huang},
title = {Holistic Evaluation for Interleaved Text-and-Image Generation},
journal = {CoRR},
volume = {abs/2406.14643},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2406.14643},
doi = {10.48550/ARXIV.2406.14643},
eprinttype = {arXiv},
eprint = {2406.14643},
timestamp = {Tue, 16 Jul 2024 16:17:50 +0200}
}
</code></pre>
</div></center>
</div>
<br><br><br><br>
<div class="section_title" id="acknowledgement">
<h1>Acknowledgement<hr></h1>
<p class="dataset_description"><span style="color:red;">InterleavedEval dataset is for research purpose only.
Please carefully check the licenses of the original datasets before using InterleavedEval.</span>
We provide the URLs to the original datasets and their Bibtex on this <a href="./bibtex.html">page</a>.
The images and tasks may be taken down at any time when requested by the original
dataset owners or owners of the referenced images. If you hope to take
down any tasks or the images, please contact Minqian Liu and Lifu Huang at <span class="email_text">minqianliu@vt.edu</span> and <span class="email_text">lifuh@cs.vt.edu</span>.
</p>
</div>
</body>
</html>