_data/news.yml: +3 lines changed (3 additions & 0 deletions)
@@ -1,3 +1,6 @@
+- date: 2025-09-21
+  details: >-
+    Po-han's paper <a href="https://openreview.net/forum?id=C35FCYZBXp">VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR</a> was accepted to NeurIPS 2025!
 - date: 2025-08-01
   details: >-
     Oguzhan B.'s paper <a href="https://openreview.net/forum?id=M1e2PEMLp2">Fair Resource Allocation for Fleet Intelligence</a> was accepted to GLOBECOM 2025!

_posts/2025-09-21-VIBE.md: +12 −14 lines changed (12 additions & 14 deletions)
@@ -20,10 +20,9 @@ Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
 ## Motivation
 Many real-world tasks still require human oversight: a traffic officer sifting through dashcam footage, or a researcher screening long conference videos. Watching raw video is slow, and existing vision-language models (VLMs) often produce verbose, redundant captions that hinder efficiency.
 Current video-to-text evaluation methods depend on costly human annotations and ignore whether summaries actually help humans make decisions. We ask:
-- Q1 (Annotation-Free): Can we evaluate video summaries without relying on gold-standard human captions?
-- Q2 (Task-Aware): Can we measure how much a summary improves human performance on downstream tasks?
+- **Q1 (Annotation-Free):** Can we evaluate video summaries without relying on gold-standard human captions?
+- **Q2 (Task-Aware):** Can we measure how much a summary improves human performance on downstream tasks?
 
-### System Plot:
 <figure style="text-align: center;">
 <img src="{{site.baseurl}}/images/post/TLDR_system.png" alt="TLDR System Plot" height="auto" style="margin: auto; display: block;">
 <figcaption>VIBE Framework Overview</figcaption>
@@ -32,10 +31,9 @@ Current video-to-text evaluation methods depend on costly human annotations and ignore whether summaries actually help humans make decisions. We ask:
 
 ## Contributions
 We introduce VIBE (Video-to-text Information Bottleneck Evaluation), a novel framework that evaluates and selects VLM summaries without annotations or retraining.
-- Grounding Score: Measures how faithfully a summary reflects the video using pointwise mutual information between video and text.
-- Utility Score: Measures how informative the summary is for a downstream task.
-
-- Annotation-Free Rejection Sampling: VIBE ranks multiple VLM-generated summaries and selects the one that maximizes grounding and/or utility, supporting efficient human decision-making.
+- **Grounding Score:** Measures how faithfully a summary reflects the video using pointwise mutual information between video and text.
+- **Utility Score:** Measures how informative the summary is for a downstream task.
+- **Annotation-Free Rejection Sampling:** VIBE ranks multiple VLM-generated summaries and selects the one that maximizes grounding and/or utility, supporting efficient human decision-making.
 
 ## VIBE Framework
 <figure style="text-align: center;">
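
To make the two scores concrete, here is a minimal sketch of how PMI-style grounding and utility estimates could be computed from log-probabilities; a selection sketch follows the framework hunk below. The `logp` interface and all function names are illustrative assumptions, not the paper's API, which instead tests reconstruction of a masked summary with a VLM.

```python
# Minimal sketch, assuming a scorer logp(text, context) that returns the
# log-likelihood of `text` under a VLM, optionally conditioned on `context`
# (a video or a summary). Every name here is an assumption for illustration.

def grounding_score(logp, video, summary):
    """PMI-style grounding, a per-sample stand-in for I(V;T):
    log p(summary | video) - log p(summary).
    High values mean the video strongly supports the summary text."""
    return logp(summary, context=video) - logp(summary, context=None)

def utility_score(logp, summary, task_label):
    """Task-aware utility, a per-sample stand-in for I(T;Y):
    log p(label | summary) - log p(label).
    High values mean the summary alone helps predict the task outcome."""
    return logp(task_label, context=summary) - logp(task_label, context=None)
```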
@@ -46,8 +44,8 @@ We introduce VIBE (Video-to-text Information Bottleneck Evaluation), a novel framework that evaluates and selects VLM summaries without annotations or retraining.
 
 VIBE adapts the information bottleneck principle to video summarization:
 
-- Grounding approximates I(V;T), testing how well the video supports reconstruction of a masked summary.
-- Utility approximates I(T;Y), testing how well the summary compensates for missing video information to predict a task label.
+- **Grounding** approximates I(V;T), testing how well the video supports reconstruction of a masked summary.
+- **Utility** approximates I(T;Y), testing how well the summary compensates for missing video information to predict a task label.
 
 By maximizing both, VIBE selects concise, task-relevant summaries—without gold labels or retraining.
 
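
The annotation-free rejection sampling step then reduces to scoring each candidate summary and keeping the best. A usage sketch building on the functions above; the convex trade-off weight `lam` is an invented knob, not something the paper specifies:

```python
def select_summary(logp, video, candidates, task_label=None, lam=0.5):
    """Annotation-free rejection sampling: rank VLM-generated candidate
    summaries and return the highest-scoring one. Falls back to grounding
    alone (fully self-supervised) when no task signal is available."""
    def score(summary):
        g = grounding_score(logp, video, summary)
        if task_label is None:
            return g
        return lam * g + (1.0 - lam) * utility_score(logp, summary, task_label)
    return max(candidates, key=score)
```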
@@ -69,11 +67,11 @@ We validate VIBE through user studies with 243 participants across three datasets
 <p></p>
 </figure>
 
-Key findings:
-- Accuracy Gains: VIBE-selected summaries boost task accuracy by up to 61.23%.
-- Speed Improvements: Response time drops by up to 75.77% compared to raw video.
+Key findings:
+- **Accuracy Gains:** VIBE-selected summaries boost task accuracy by up to 61.23%.
+- **Speed Improvements:** Response time drops by up to 75.77% compared to raw video.
+- **Scalability:** Grounding works in a fully self-supervised manner, while utility correlates strongly with higher task accuracy.
 
 ## Impact
 VIBE reframes video caption evaluation from a human decision support perspective. Unlike reference-based metrics, it scales to unseen data, works with black-box VLMs, and requires no human annotations.