_data/news.yml: +3 lines changed (3 additions & 0 deletions)
@@ -1,3 +1,6 @@
+- date: 2025-09-21
+  details: >-
+    Po-han's paper <a href="https://openreview.net/forum?id=C35FCYZBXp">VIBE: Annotation-Free Video-to-Text Information Bottleneck Evaluation for TL;DR</a> was accepted to NeurIPS 2025!
 - date: 2025-08-01
   details: >-
     Oguzhan B.'s paper <a href="https://openreview.net/forum?id=M1e2PEMLp2">Fair Resource Allocation for Fleet Intelligence</a> was accepted to GLOBECOM 2025!

_posts/2025-09-21-VIBE.md: +12 −14 lines changed (12 additions & 14 deletions)
@@ -20,10 +20,9 @@ Proceedings of the 39th Conference on Neural Information Processing Systems (NeurIPS 2025)
 ## Motivation
 Many real-world tasks still require human oversight: a traffic officer sifting through dashcam footage, or a researcher screening long conference videos. Watching raw video is slow, and existing vision-language models (VLMs) often produce verbose, redundant captions that hinder efficiency.
 Current video-to-text evaluation methods depend on costly human annotations and ignore whether summaries actually help humans make decisions. We ask:
-- Q1 (Annotation-Free): Can we evaluate video summaries without relying on gold-standard human captions?
-- Q2 (Task-Aware): Can we measure how much a summary improves human performance on downstream tasks?
+- **Q1 (Annotation-Free):** Can we evaluate video summaries without relying on gold-standard human captions?
+- **Q2 (Task-Aware):** Can we measure how much a summary improves human performance on downstream tasks?
 
-### System Plot:
 <figure style="text-align: center;">
 <img src="{{site.baseurl}}/images/post/TLDR_system.png" alt="TLDR System Plot" height="auto" style="margin: auto; display: block;">
 <figcaption>VIBE Framework Overview</figcaption>
@@ -32,10 +31,9 @@ Current video-to-text evaluation methods depend on costly human annotations and ignore whether summaries actually help humans make decisions. We ask:
 
 ## Contributions
 We introduce VIBE (Video-to-text Information Bottleneck Evaluation), a novel framework that evaluates and selects VLM summaries without annotations or retraining.
-- Grounding Score: Measures how faithfully a summary reflects the video using pointwise mutual information between video and text.
-- Utility Score: Measures how informative the summary is for a downstream task.
-
-- Annotation-Free Rejection Sampling: VIBE ranks multiple VLM-generated summaries and selects the one that maximizes grounding and/or utility, supporting efficient human decision-making.
+- **Grounding Score:** Measures how faithfully a summary reflects the video using pointwise mutual information between video and text.
+- **Utility Score:** Measures how informative the summary is for a downstream task.
+- **Annotation-Free Rejection Sampling:** VIBE ranks multiple VLM-generated summaries and selects the one that maximizes grounding and/or utility, supporting efficient human decision-making.
 
 ## VIBE Framework
 <figure style="text-align: center;">
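
To make the two scores concrete, here is a minimal sketch of how PMI-style grounding and utility estimates could be computed from log-probabilities; a selection sketch follows the framework hunk below. The `logp` interface and all function names are illustrative assumptions, not the paper's API, which instead tests reconstruction of a masked summary with a VLM.

```python
# Minimal sketch, assuming a scorer logp(text, context) that returns the
# log-likelihood of `text` under a VLM, optionally conditioned on `context`
# (a video or a summary). Every name here is an assumption for illustration.

def grounding_score(logp, video, summary):
    """PMI-style grounding, a per-sample stand-in for I(V;T):
    log p(summary | video) - log p(summary).
    High values mean the video strongly supports the summary text."""
    return logp(summary, context=video) - logp(summary, context=None)

def utility_score(logp, summary, task_label):
    """Task-aware utility, a per-sample stand-in for I(T;Y):
    log p(label | summary) - log p(label).
    High values mean the summary alone helps predict the task outcome."""
    return logp(task_label, context=summary) - logp(task_label, context=None)
```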
@@ -46,8 +44,8 @@ We introduce VIBE (Video-to-text Information Bottleneck Evaluation), a novel framework that evaluates and selects VLM summaries without annotations or retraining.
 
 VIBE adapts the information bottleneck principle to video summarization:
 
-- Grounding approximates I(V;T), testing how well the video supports reconstruction of a masked summary.
-- Utility approximates I(T;Y), testing how well the summary compensates for missing video information to predict a task label.
+- **Grounding** approximates I(V;T), testing how well the video supports reconstruction of a masked summary.
+- **Utility** approximates I(T;Y), testing how well the summary compensates for missing video information to predict a task label.
 
 By maximizing both, VIBE selects concise, task-relevant summaries—without gold labels or retraining.
 
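
The annotation-free rejection sampling step then reduces to scoring each candidate summary and keeping the best. A usage sketch building on the functions above; the convex trade-off weight `lam` is an invented knob, not something the paper specifies:

```python
def select_summary(logp, video, candidates, task_label=None, lam=0.5):
    """Annotation-free rejection sampling: rank VLM-generated candidate
    summaries and return the highest-scoring one. Falls back to grounding
    alone (fully self-supervised) when no task signal is available."""
    def score(summary):
        g = grounding_score(logp, video, summary)
        if task_label is None:
            return g
        return lam * g + (1.0 - lam) * utility_score(logp, summary, task_label)
    return max(candidates, key=score)
```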
@@ -69,11 +67,11 @@ We validate VIBE through user studies with 243 participants across three datasets
 <p></p>
 </figure>
 
-Key findings:
-- Accuracy Gains: VIBE-selected summaries boost task accuracy by up to 61.23%.
-- Speed Improvements: Response time drops by up to 75.77% compared to raw video.
+Key findings:
+- **Accuracy Gains:** VIBE-selected summaries boost task accuracy by up to 61.23%.
+- **Speed Improvements:** Response time drops by up to 75.77% compared to raw video.
+- **Scalability:** Grounding works in a fully self-supervised manner, while utility correlates strongly with higher task accuracy.
 
 ## Impact
 VIBE reframes video caption evaluation from a human decision support perspective. Unlike reference-based metrics, it scales to unseen data, works with black-box VLMs, and requires no human annotations.