
Commit f7c1296 (1 parent: 38477dc)

Add VIBE paper announcement and update formatting in news and blog post

2 files changed: 15 additions & 14 deletions


_data/news.yml

Lines changed: 3 additions & 0 deletions
@@ -1,3 +1,6 @@
+- date: 2025-09-21
+  details: >-
+    Po-han's paper <a href="https://openreview.net/forum?id=C35FCYZBXp"> VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR </a> was accepted to NeurIPS 2025!
 - date: 2025-08-01
   details: >-
     Oguzhan B.'s paper <a href="https://openreview.net/forum?id=M1e2PEMLp2"> Fair Resource Allocation for Fleet Intelligence </a> was accepted to GLOBECOM 2025!

_posts/2025-09-21-VIBE.md

Lines changed: 12 additions & 14 deletions
@@ -20,10 +20,9 @@ Proceedings of the 39th Conference on Neural Information Processing Systems (Neu
 ## Motivation
 Many real-world tasks still require human oversight: a traffic officer sifting through dashcam footage, or a researcher screening long conference videos. Watching raw video is slow, and existing vision-language models (VLMs) often produce verbose, redundant captions that hinder efficiency.
 Current video-to-text evaluation methods depend on costly human annotations and ignore whether summaries actually help humans make decisions. We ask:
-- Q1 (Annotation-Free): Can we evaluate video summaries without relying on gold-standard human captions?
-- Q2 (Task-Aware): Can we measure how much a summary improves human performance on downstream tasks?
+- **Q1 (Annotation-Free):** Can we evaluate video summaries without relying on gold-standard human captions?
+- **Q2 (Task-Aware):** Can we measure how much a summary improves human performance on downstream tasks?
 
-### System Plot:
 <figure style="text-align: center;">
 <img src="{{site.baseurl}}/images/post/TLDR_system.png" alt="TLDR System Plot" height="auto" style="margin: auto; display: block;">
 <figcaption>VIBE Framework Overview</figcaption>
@@ -32,10 +31,9 @@ Current video-to-text evaluation methods depend on costly human annotations and
 
 ## Contributions
 We introduce VIBE (Video-to-text Information Bottleneck Evaluation), a novel framework that evaluates and selects VLM summaries without annotations or retraining.
-- Grounding Score: Measures how faithfully a summary reflects the video using pointwise mutual information between video and text.
-- Utility Score: Measures how informative the summary is for a downstream task.
-
-- Annotation-Free Rejection Sampling: VIBE ranks multiple VLM-generated summaries and selects the one that maximizes grounding and/or utility, supporting efficient human decision-making.
+- **Grounding Score:** Measures how faithfully a summary reflects the video using pointwise mutual information between video and text.
+- **Utility Score:** Measures how informative the summary is for a downstream task.
+- **Annotation-Free Rejection Sampling:** VIBE ranks multiple VLM-generated summaries and selects the one that maximizes grounding and/or utility, supporting efficient human decision-making.
 
 ## VIBE Framework
 <figure style="text-align: center;">
@@ -46,8 +44,8 @@ We introduce VIBE (Video-to-text Information Bottleneck Evaluation), a novel fra
 
 VIBE adapts the information bottleneck principle to video summarization:
 
-- Grounding approximates I(V;T), testing how well the video supports reconstruction of a masked summary.
-- Utility approximates I(T;Y), testing how well the summary compensates for missing video information to predict a task label.
+- **Grounding** approximates I(V;T), testing how well the video supports reconstruction of a masked summary.
+- **Utility** approximates I(T;Y), testing how well the summary compensates for missing video information to predict a task label.
 
 By maximizing both, VIBE selects concise, task-relevant summaries—without gold labels or retraining.

@@ -69,11 +67,11 @@ We validate VIBE through user studies with 243 participants across three dataset
 <p></p>
 </figure>
 
-Key findings:
-- Accuracy Gains: VIBE-selected summaries boost task accuracy by up to 61.23%.
-- Speed Improvements: Response time drops by up to 75.77% compared to raw video.
-- Efficiency: VIBE achieves lower inverse efficiency scores (time/accuracy), showing superior speed–accuracy trade-offs.
-- Scalability: Grounding works in a fully self-supervised manner, while utility correlates strongly with higher task accuracy.
+### Key findings:
+- **Accuracy Gains:** VIBE-selected summaries boost task accuracy by up to 61.23%.
+- **Speed Improvements:** Response time drops by up to 75.77% compared to raw video.
+- **Efficiency:** VIBE achieves lower inverse efficiency scores (time/accuracy), showing superior speed–accuracy trade-offs.
+- **Scalability:** Grounding works in a fully self-supervised manner, while utility correlates strongly with higher task accuracy.
 
 ## Impact
 VIBE reframes video caption evaluation from a human decision support perspective. Unlike reference-based metrics, it scales to unseen data, works with black-box VLMs, and requires no human annotations.
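For readers skimming the post diff above: the "Annotation-Free Rejection Sampling" step is simple enough to sketch. Below is a minimal, hypothetical Python illustration of the ranking logic; the `grounding_score` and `utility_score` callables are stand-ins for the paper's estimates of I(V;T) and I(T;Y) (per the post, grounding is computed via pointwise mutual information between video and text), not the authors' implementation.

```python
# Minimal sketch of VIBE-style annotation-free rejection sampling.
# Hypothetical: `grounding_score` and `utility_score` stand in for the
# paper's estimates of I(V;T) and I(T;Y); this is NOT the authors' code.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ScoredSummary:
    text: str
    grounding: float  # ~ I(V;T): how faithfully the text reflects the video
    utility: float    # ~ I(T;Y): how useful the text is for the downstream task


def select_summary(
    candidates: List[str],
    grounding_score: Callable[[str], float],
    utility_score: Callable[[str], float],
) -> ScoredSummary:
    """Rank candidate VLM summaries and keep the one maximizing
    grounding + utility. No gold captions are consulted, so the
    selection stays annotation-free."""
    scored = [
        ScoredSummary(t, grounding_score(t), utility_score(t))
        for t in candidates
    ]
    return max(scored, key=lambda s: s.grounding + s.utility)


if __name__ == "__main__":
    # Toy scorers for illustration only: reward lexical coverage,
    # penalize length (a crude proxy for reader effort).
    candidates = [
        "A car runs a red light at 5th and Main; a cyclist swerves to avoid it.",
        "Traffic footage showing various vehicles and events over several minutes.",
    ]
    best = select_summary(
        candidates,
        grounding_score=lambda t: len(set(t.lower().split())) / 20.0,
        utility_score=lambda t: 1.0 / (1.0 + len(t) / 40.0),
    )
    print(best.text)
```

In the paper's setting, both scorers would presumably be computed with a VLM over the video, a masked summary, and the task query; the point of the sketch is only that selection reduces to an argmax over candidate summaries, with no retraining and no reference captions.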
