<!DOCTYPE html>
<html>
<head>
<title>InterleavedEval</title>
<link rel="stylesheet" href="./index.css">
<script src="./index.js"></script>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.4.0/css/font-awesome.min.css">
<link rel="shortcut icon" href="./imgs/vision_flan_logo.jpg"/>
</head>
<body>
<!-- NAVBAR -->
<nav class="navbar">
<div class="navbarmenu">
<a class="visionFlan">InterleavedEval</a>
</div>
</nav>
<div class="major_section">
<h1>InterleavedEval: Holistic Evaluation for Interleaved Text-and-Image Generation</h1>
<p>Created by Virginia Tech's NLP Lab. </p>
<hr>
<br>
<!-- <center>
<img src="./imgs/vision_flan_logo.jpg" width="400px">
<i><p>Generated by <a href="https://ideogram.ai/">https://ideogram.ai/</a></p></i>
</center> -->
<br>
<center>
<div>
<a href="https://github.com/VT-NLP/InterleavedBench"><button><i class="fa fa-github"></i> Code</button></a>
<a href="https://arxiv.org/abs/2406.14643"><button><i class="fa fa-newspaper-o"></i> Paper</button></a>
<!-- <a href="./tasks.html"><button><i class="fa fa-book"></i> Tasks</button></a> -->
<!-- <a href="#data_samples"><button><i class="fa fa-picture-o"></i> Samples</button></a> -->
<a href="https://huggingface.co/mqliu/InterleavedBench"><button><i class="fa fa-database"></i> Dataset</button></a>
<!-- <a href="#models"><button><i class="fa fa-gears"></i> Models</button></a> -->
<a href="#acknowledgement"><button><i class="fa fa-arrow-circle-o-right"></i> Acknowledgement</button></a>
</div>
</center>
<br>
<br>
<img src="./images/teaser.png" id="figure1">
<br>
<p class="dataset_description"> Interleaved text-and-image generation has been an intriguing research direction, where the models are required to generate both images and text pieces in an arbitrary order. Despite the emerging advancements in interleaved generation, the progress in its evaluation still significantly lags behind. Existing evaluation benchmarks do not support arbitrarily interleaved images and text for both inputs and outputs, and they only cover a limited number of domains and use cases. Also, current works predominantly use similarity-based metrics which fall short in assessing the quality in open-ended scenarios. To this end, we introduce InterleavedBench, the first benchmark carefully curated for the evaluation of interleaved text-and-image generation. InterleavedBench features a rich array of tasks to cover diverse real-world use cases. In addition, we present InterleavedEval, a strong reference-free metric powered by GPT-4o to deliver accurate and explainable evaluation. We carefully define five essential evaluation aspects for InterleavedEval, including text quality, perceptual quality, image coherence, text-image coherence, and helpfulness, to ensure a comprehensive and fine-grained assessment. Through extensive experiments and rigorous human evaluation, we show that our benchmark and metric can effectively evaluate the existing models with a strong correlation with human judgments surpassing previous reference-based metrics. We also provide substantial findings and insights to foster future research in interleaved generation and its evaluation.
</p>
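<p class="dataset_description">The five aspect definitions above are what InterleavedEval scores. The snippet below is only a minimal sketch of how a GPT-4o judge could be queried for a single aspect on a 0-5 scale, assuming the OpenAI Python SDK; the prompt wording and function name are illustrative and are not the exact prompts used by InterleavedEval.</p>
<pre><code>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ASPECTS = ["text quality", "perceptual quality", "image coherence",
           "text-image coherence", "helpfulness"]

def score_aspect(instruction, output_text, image_urls, aspect):
    """Illustrative single-aspect query; the official InterleavedEval prompts may differ."""
    content = [{"type": "text",
                "text": (f"Instruction: {instruction}\n"
                         f"Generated text: {output_text}\n"
                         f"Rate the {aspect} of this interleaved text-and-image output "
                         "on a 0-5 scale and briefly justify the rating, "
                         "ending with 'Score:' followed by the number.")}]
    # Attach the generated images so GPT-4o can judge the visual content as well.
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url}})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return response.choices[0].message.content
</code></pre>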
</div><br><br><br><br>
<div class="section_title">
<h1>InterleavedBench<hr></h1>
<center><img src="./images/figure2.png" id="figure2"></center>
<p class="dataset_description">We introduce INTERLEAVEDBENCH, the first comprehensive benchmark meticulously constructed
to evaluate text-and-image interleaved generation.</p>
<p class="dataset_description">Our dataset includes <b>two subsets</b>:</p>
<ul>
<li><p><b>context-based</b>:
a subset where each instance contains a multimodal
context of interleaved text and images in the input (<b><i>first row in the figure above</i></b>)</p></li>
<li><p><b>context-free</b>:
a subset with text-only inputs (<b><i>second row in the figure above</i></b>). The context-free subset assesses whether
the model can creatively generate interleaved content based on a text-only instruction, while the
context-based subset better benchmarks the coherence and consistency of the generated outputs.</p>
</li>
</ul>
</div>
<div class="section_title">
<h1>Comparison with Existing Benchmarks<hr></h1>
<table>
<tr>
<th>Dataset Name</th>
<th>Detailed Instruction</th>
<th>Image Input</th>
<th>Text Output</th>
<th>Image Output</th>
</tr>
<tr>
<td>MagicBrush</td>
<td>No</td>
<td>Single</td>
<td>No</td>
<td>Single</td>
</tr>
<tr>
<td>DreamBench</td>
<td>No</td>
<td>Multiple</td>
<td>No</td>
<td>Single</td>
</tr>
<tr>
<td>CustomDiffusion</td>
<td>No</td>
<td>Multiple</td>
<td>No</td>
<td>Single</td>
</tr>
<tr>
<td>DreamEditBench</td>
<td>No</td>
<td>Multiple</td>
<td>No</td>
<td>Single</td>
</tr>
<tr>
<td>Mantis-Eval</td>
<td>Yes</td>
<td>Multiple</td>
<td>Yes</td>
<td>No</td>
</tr>
<tr>
<td>InterleavedBench (Ours)</td>
<td>Yes</td>
<td>Multiple</td>
<td>Yes</td>
<td>Multiple</td>
</tr>
</table>
<p class="dataset_description">
We highlight the following key differences and
unique challenges introduced by our INTERLEAVEDBENCH compared with the existing benchmark:
<ul>
<li><p><b>(1) Output modality</b>: our benchmark requires the models to generate interleaved text and
multiple images that can appear in an arbitrary
order, whereas existing benchmarks only cover outputs with a single modality or a single image;</p></li>
<li><p><b>(2)
Requirement on coherence</b>: given that both inputs and outputs in our benchmark can contain
multiple pieces of text and images, our dataset
can assess whether the outputs are coherent and
consistent with the input instruction and context, as well as within the outputs themselves;</p></li>
<li><p><b>(3) Instruction following</b>: each instance in our benchmark contains a detailed
human-annotated instruction describing the task.
Thus, our dataset can evaluate models' instruction-following and generalization capabilities. We show
the difference between our benchmark and existing
datasets in the table above.</p></li>
</ul>
</p>
</div>
<div class="section_title">
<h1>Main Results<hr></h1>
<h3>Baselines</h3><hr>
<p class="dataset_description">
<ul>
<li><p><b>(1) MiniGPT-5 (Zheng et al., 2023a)</b> which connects a large language model with a Stable Diffusion
model via generative vokens, enabling description-free multimodal generation.</p></li>
<li><p><b>(2) GILL (Koh et al.,
2023)</b> which allows a pretrained large language
model to generate multimodal responses by mapping the hidden states of text into the embedding
space of an image generation model.</p></li>
<li><p><b>(3) EMU-2 (Sun et al., 2023a)</b> which induces in-context
learning capabilities of LLMs by scaling up the
model size and the size of the pretraining dataset;
</p></li>
<li><p><b>(4) EMU-2 Gen + Gold Text</b> where EMU-2 Gen
is a pretrained EMU-2 model instruction-tuned on
various controllable image generation tasks. However, EMU-2 Gen cannot generate
text, so we combine its generated images with the ground-truth textual responses to form
complete text-and-image interleaved content for evaluation.</p></li>
<li><p><b>(5) GPT-4o (OpenAI, 2024) + DALL·E 3 (Betker
et al.)</b> where GPT-4o is the state-of-the-art proprietary
LMM that can comprehend interleaved text-and-image inputs and generate text-only responses.
We leverage GPT-4o to generate text responses as
well as captions for image responses in the desired
positions. Then the captions are fed into DALL·E
3 to generate images. Finally, we combine the
text responses with the generated images in their original order (a rough sketch of this pipeline is shown after this list).</p></li>
<li><p><b>(6) Gemini-1.5 (Anil et al., 2023) +
SDXL (Podell et al., 2023)</b>: we build this baseline
in a similar way to GPT-4o + DALL·E 3 but use
Gemini-1.5 Pro as the LMM and Stable Diffusion
XL Turbo as the image generation model.</p></li>
</ul>
</p>
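<p class="dataset_description">For baseline (5), the sketch below illustrates the two-stage pipeline described above, assuming the OpenAI Python SDK; the prompt wording, the 'IMAGE:' placeholder convention, the function name, and the image size are illustrative assumptions rather than the exact setup used for the baseline.</p>
<pre><code>
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_interleaved(instruction):
    """Illustrative two-stage pipeline; prompt wording and placeholder format are assumed."""
    # Stage 1: GPT-4o writes the textual response and marks where images should go
    # with caption lines that start with "IMAGE:".
    plan = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": (instruction + "\nWrite the response. Wherever an image "
                               "should appear, insert a line starting with 'IMAGE:' "
                               "followed by a caption describing that image.")}],
    ).choices[0].message.content

    # Stage 2: each caption line is rendered by DALL-E 3, and text and images are
    # kept in their original order.
    pieces = []
    for line in plan.splitlines():
        if line.startswith("IMAGE:"):
            image = client.images.generate(model="dall-e-3",
                                           prompt=line[len("IMAGE:"):].strip(),
                                           n=1, size="1024x1024")
            pieces.append(("image", image.data[0].url))
        else:
            pieces.append(("text", line))
    return pieces
</code></pre>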
<h3>Automatic Evaluation</h3><hr>
<i>Note: TIC stands for "Text-Image Coherence"; scores in this table are on a 0-5 scale.</i>
<div class="tag_holder">
<p id="best">Best</p>
<p id="middle">Middle</p>
<p id="worst">Worst</p>
</div>
<table>
<tr>
<th>Model</th>
<th>Text Quality</th>
<th>Perceptual Quality</th>
<th>Image Coherence</th>
<th>TIC</th>
<th>Helpfulness</th>
<th>AVG</th>
</tr>
<tr>
<td class="model_text">MiniGPT-5</td>
<td>1.22</td>
<td>2.45</td>
<td>1.62</td>
<td>2.03</td>
<td>1.77</td>
<td>1.82</td>
</tr>
<tr>
<td class="model_text">GILL</td>
<td>0.75</td>
<td>3.21</td>
<td>2.25</td>
<td>1.53</td>
<td>1.48</td>
<td>1.84</td>
</tr>
<tr>
<td class="model_text">EMU-2</td>
<td>1.26</td>
<td>2.28</td>
<td>1.89</td>
<td>1.34</td>
<td>1.64</td>
<td>1.68</td>
</tr>
<tr>
<td class="model_text">EMU-2 (Gold Text)</td>
<td>1.56</td>
<td>3.35</td>
<td>2.89</td>
<td>1.43</td>
<td>2.10</td>
<td>2.27</td>
</tr>
<tr style="background-color: rgba(231, 235, 1, 0.11);">
<td class="model_text">Gemini1.5 + SDXL</td>
<td><b>4.40</b></td>
<td>3.99</td>
<td><b>3.64</b></td>
<td>4.13</td>
<td>3.62</td>
<td>3.96</td>
</tr>
<tr style="background-color: rgba(1, 235, 1, 0.11);">
<td class="model_text">GPT-4o + DALLE3</td>
<td>4.37</td>
<td><b>4.36</b></td>
<td>3.51</td>
<td><b>4.55</b></td>
<td><b>3.88</b></td>
<td><b>4.13</b></td>
</tr>
</table>
<br><br>
<h3>Human Evaluation</h3><hr>
<i>Note: TIC stands for "Text-Image Coherence"; scores in this table are on a 0-3 scale.</i>
<div class="tag_holder">
<p id="best">Best</p>
<p id="middle">Middle</p>
<p id="worst">Worst</p>
</div>
<table>
<tr>
<th>Model</th>
<th>Text Quality</th>
<th>Perceptual Quality</th>
<th>Image Coherence</th>
<th>TIC</th>
<th>Helpfulness</th>
<th>AVG</th>
</tr>
<tr>
<td class="model_text">GILL</td>
<td>1.35</td>
<td>1.89</td>
<td>1.72</td>
<td>1.43</td>
<td>1.19</td>
<td>1.52</td>
</tr>
<tr>
<td class="model_text">EMU-2</td>
<td>1.23</td>
<td>1.74</td>
<td>1.87</td>
<td>1.24</td>
<td>1.2</td>
<td>1.46</td>
</tr>
<tr style="background-color: rgba(1, 235, 1, 0.11);">
<td class="model_text">Gemini1.5 + SDXL</td>
<td><b>2.59</b></td>
<td>2.36</td>
<td><b>2.13</b></td>
<td>2.27</td>
<td>2.08</td>
<td>2.28</td>
</tr>
<tr style="background-color: rgba(1, 235, 1, 0.11);">
<td class="model_text">GPT-4o + DALLE3</td>
<td>2.49</td>
<td><b>2.51</b></td>
<td>2.02</td>
<td><b>2.31</b></td>
<td><b>2.13</b></td>
<td><b>2.29</b></td>
</tr>
</table>
<br><br>
<h3>Evaluation results on each aspect for each task<hr></h3>
<div id="side-by-side">
<img src="./images/figure5.png" id="figure5"><img src="./images/figure3.png" id="figure3">
</div>
<br><br>
<h3>Qualitative Analysis<hr></h3>
<center><img src="./images/figure4.png" id="figure4"></center>
</div>
<!-- <div class="section_title" id="data_samples">
<h1>Data Samples<hr></h1>
</div>
<div class="content-block" id="content-block">
<p class="data_loading">DATA LOADING . . .</p>
</div>
<div class="section_title">
<p class="dataset_description">Each instance consists of 3 primary elements: Image, Instruction, and Ouput.</p>
<ul>
<li><p><b>Image:</b> An image that is used as reference when performing the task specified by the instruction.</p></li>
<li><p><b>Instruction:</b> A description or prompt of a task that is executed by the vision-language model.</p></li>
<li><p><b>Output:</b> The expected answer to the instruction given the provided image.</p></li>
</ul>
<br><br><br>
</div>
-->
<!-- <div class="section_title" id="collection_annotation">
<h1>Data Collection and Annotation<hr></h1>
<br><br>
<center><img src="./imgs/pipeline_snip.png" width="100%"><br><i><p>Icon source: <a href="https://flaticon.com/">https://flaticon.com</a></p></i></center>
<br><br>
<p class="dataset_description">To ensure the coverage and quality of tasks, we proposed an annotation pipeline as demonstrated in the above figure. First, the authors search on the internet to identify interesting vision-language tasks. Second, the tasks are assigned to the annotators and the annotators write download and preprocessing scripts to prepare the data. Once the dataset is processed into the required format, the authors and annotators start discuss potential tasks that can be derived from the existing annotations. Third, the annotators write instructions and templates for each task and the authors provide feedbacks for revising the instructions. This step can repeat multiple times until the instructions meet the requirement. Forth, the annotators upload the processed datasets and instructions to our database. Finally, the authors double-check the correctness of the instructions, images and outputs. The authors also check the grammar and fluency of the instructions. All the annotators are graduate computer science students who have strong background in machine learning and deep learning.
</p>
</div>
<br><br><br><br>
<div class="section_title" id="download">
<h1>Download<hr></h1>
<table>
<tr>
<th>File</th>
<th>Size on Disk</th>
<th>Sample Size</th>
</tr>
<tr>
<td><a href="https://huggingface.co/datasets/InterleavedEval/InterleavedEval_191-task_1k">annotation_191-task_1k.json</a></td>
<td>108M</td>
<td>186k</td>
</tr>
<tr>
<td><a href="https://huggingface.co/datasets/InterleavedEval/InterleavedEval_191-task_1k/tree/main">image_191-task_1k.zip</a></td>
<td>37GB</td>
<td>186k</td>
</tr>
<!-- <tr>
<td><a href="">instructions.json</a></td>
<td>1TB</td>
<td>300k</td>
</tr> -->
<!-- </table><br><br>
<p class="dataset_description">We provide the download links to the annotations and images above. In the annotations file, we merged instructions and templates with original tasks' inputs and outputs. To train a model on InterleavedEval, you can simply download the annotations and images. The annotations file consists of 191 tasks and for each task we randomly sampled 1K instances which should be sufficient for the purpose of instruction tuning. By now we can not release all tasks since some datasets are not allowed to be distributed.
</p> -->
<!-- </div> -->
<br><br><br><br>
<div class="section_title">
<h1>Citation<hr></h1>
<p class="dataset_description">If you use InterleavedEval in your research, please cite the following papers.</p>
<center><div class="bibtex">
<pre><code>
@article{liu_holistic_2024,
author = {Minqian Liu and
Zhiyang Xu and
Zihao Lin and
Trevor Ashby and
Joy Rimchala and
Jiaxin Zhang and
Lifu Huang},
title = {Holistic Evaluation for Interleaved Text-and-Image Generation},
journal = {CoRR},
volume = {abs/2406.14643},
year = {2024},
url = {https://doi.org/10.48550/arXiv.2406.14643},
doi = {10.48550/ARXIV.2406.14643},
eprinttype = {arXiv},
eprint = {2406.14643},
timestamp = {Tue, 16 Jul 2024 16:17:50 +0200}
}
</code></pre>
</div></center>
</div>
<br><br><br><br>
<div class="section_title" id="acknowledgement">
<h1>Acknowledgement<hr></h1>
<p class="dataset_description"><span style="color:red;">InterleavedEval dataset is for research purpose only.
Please carefully check the licenses of the original datasets before using InterleavedEval.</span>
We provide the URLs to the original datasets and their Bibtex on this <a href="./bibtex.html">page</a>.
The images and tasks may be taken down at any time when requested by the original
dataset owners or owners of the referenced images. If you hope to take
down any tasks or the images, please contact Minqian Liu and Lifu Huang at <span class="email_text">minqianliu@vt.edu</span> and <span class="email_text">lifuh@cs.vt.edu</span>.
</p>
</div>
</body>
</html>