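Zheyu Fan*<sup>1,2</sup> · Jiateng Liu<sup>1</sup> · Yuji Zhang<sup>1</sup> · Zihan Wang<sup>2</sup> · Yi R. (May) Fung<sup>1</sup> · Manling Li<sup>2</sup> · Heng Ji<sup>1</sup>
<sup>1</sup>University of Illinois Urbana-Champaign · <sup>2</sup>Northwestern University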
*Work done during internship at UIUC.
ACL 2026 Findings
- 2026.05 🎉 EMCompress benchmark and reproduction code released on HuggingFace & GitHub.
- 2026.04 📝 Paper accepted to ACL 2026 Findings 🔥.
Current Video-LLMs treat long-video reasoning as a one-shot, sparse frame-sampling problem — diluting evidence and missing fine-grained temporal semantics. We propose Endomorphic Multimodal Compression (EMC), a cognitively-inspired task that compresses a (Video, Query) pair into a shorter, semantically coherent pair within the same bimodal space:
$F_{\mathrm{EMC}} : (V, Q) \to (v, q)$, preserving answer invariance across reasonable downstream Video-LLMs.
This filter-before-reason economy mirrors human attentional pre-screening (e.g., scrubbing the seek bar before detailed viewing) and recasts long-video QA as a sufficient-statistic problem under the Markov chain A → (V, Q) → (v, q).
This repository provides:
- 🧪 ReSimplifyIt — a strong EMC baseline framework (multi-agent: Launcher → Validator → Viewer).
- 📊 EMCompress benchmark — 2,754 cooking-domain QA samples (built on YouCook2) with both EMC-process and standard VideoQA labels.
- 🛠️ Two-stage reproduction pipeline (Stage 1: EMC process · Stage 2: EMC-guided downstream VideoQA).
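We treat the ground-truth answer $A$ as a random variable. Under the Markov chain $A \to (V, Q) \to (v, q)$, the Data Processing Inequality bounds the information the compressed pair can carry about $A$: $I(A; (v, q)) \le I(A; (V, Q))$, where the right-hand side is measured over the original multimodal task space.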
Admissibility conditions.
- (C1) Structural Continuity. $v$ is the concatenation of $n \ge 1$ non-overlapping contiguous sub-segments of $V$.
- (C2) Answer Sufficiency. For any reasonable VideoQA agent $M$, the answer must stay invariant when $(V, Q)$ is replaced by $(v, q)$.
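Minimality objectives. Resolved via video-priority lexicographic optimization: the video $v$ is shortened first, and the query $q$ is shortened only among pairs whose video is already minimal.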
See paper §2 (and Appendix I) for the full derivation.
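
The admissibility conditions and the video-first ordering can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: segments are represented as `(start_sec, end_sec)` pairs, query length is counted in whitespace-separated words, agents are plain callables, and the names `Candidate`, `satisfies_c1`, `satisfies_c2` and `pick_minimal` are hypothetical. It is not the paper's formulation or the repository's code.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); this representation is an assumption


@dataclass(frozen=True)
class Candidate:
    """A hypothetical compressed pair (v, q): kept segments of V plus a rewritten query."""
    segments: List[Segment]
    query: str


def satisfies_c1(segments: Sequence[Segment], video_duration: float) -> bool:
    """C1: at least one contiguous sub-segment of V, and no two of them overlap."""
    if len(segments) < 1:
        return False
    ordered = sorted(segments)
    for start, end in ordered:
        # Each sub-segment must be non-empty and lie inside the source video.
        if not (0.0 <= start < end <= video_duration):
            return False
    # After sorting, every segment must end no later than the next one starts.
    return all(prev[1] <= nxt[0] for prev, nxt in zip(ordered, ordered[1:]))


def satisfies_c2(
    agents: Sequence[Callable[[Candidate], str]],
    candidate: Candidate,
    reference_answer: str,
) -> bool:
    """C2 (illustrative): every agent still produces the reference answer on (v, q)."""
    return all(agent(candidate) == reference_answer for agent in agents)


def total_duration(candidate: Candidate) -> float:
    """Length of the compressed video in seconds."""
    return sum(end - start for start, end in candidate.segments)


def pick_minimal(candidates: Sequence[Candidate]) -> Candidate:
    """Video-priority lexicographic choice: shortest video first, query length breaks ties."""
    return min(candidates, key=lambda c: (total_duration(c), len(c.query.split())))
```

Sorting by the tuple `(video length, query length)` is what makes the ordering lexicographic: a shorter query can never compensate for a longer video.
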
See the paper for the full algorithm (Appendix D) and the EMCompress generation protocol (Appendix B).
```bash
git clone https://github.com/LordUky/EMCompress.git
cd EMCompress
conda create -n emc python=3.10 -y
conda activate emc
pip install -r requirements.txt  # openai, transformers, decord, opencv-python, tqdm, ...
```

Create a local `config.py` (gitignored) with your OpenAI key and dataset roots; read the inline comments at the top of the file for what each variable holds.
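
One possible shape for such a settings loader is sketched below, reading values from the environment so that no secret is written to disk. The variable names are placeholders: `EMC_DATASETS_DIR` is borrowed from the dataset path used in this README and `OPENAI_API_KEY` is the conventional name for an OpenAI key, but the names the repository actually expects are the ones described in its own inline comments. `load_config` is a hypothetical helper, not part of the codebase.

```python
import os
from typing import Dict, Mapping


def load_config(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Collect an API key and a dataset root (names are assumptions, see the note above)."""
    required = ("OPENAI_API_KEY", "EMC_DATASETS_DIR")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("missing settings: " + ", ".join(missing))
    return {
        "OPENAI_API_KEY": env["OPENAI_API_KEY"],
        # Normalise the root so later path joins behave the same on every machine.
        "EMC_DATASETS_DIR": os.path.abspath(os.path.expanduser(env["EMC_DATASETS_DIR"])),
    }
```
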
The EMCompress benchmark and all 1,080 source videos (~150 GB) are hosted on HuggingFace.
For external benchmarks (EgoSchema, LVBench, MLVU, Video-MME, ActivityNet-QA, NExT-QA, NExT-OE), please obtain them from their original sources and place each at `${EMC_DATASETS_DIR}/<DatasetName>/test_split.json` in the schema documented in `emc_utils/utils.py::load_test_split`.
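
Before launching a long run, it can help to confirm that every dataset sits where the path pattern above expects it. The sketch below only checks that the files exist; it does not validate the JSON schema. The helper names are hypothetical, and the folder names passed in are assumed to match the `<DatasetName>` spelling the repository uses.

```python
import os
from typing import Dict, Iterable


def split_path(datasets_dir: str, dataset_name: str) -> str:
    """Path where this README says a dataset's test split should live."""
    return os.path.join(datasets_dir, dataset_name, "test_split.json")


def find_missing_splits(datasets_dir: str, dataset_names: Iterable[str]) -> Dict[str, str]:
    """Map each dataset whose test_split.json is absent to the path that was checked."""
    missing = {}
    for name in dataset_names:
        path = split_path(datasets_dir, name)
        if not os.path.isfile(path):
            missing[name] = path
    return missing
```

Calling `find_missing_splits(os.environ["EMC_DATASETS_DIR"], ["EgoSchema", "LVBench"])` returns an empty dictionary when both files are in place.
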
Produces screened_timestamps and screened_question for each (video, question) sample:
```bash
# All 7 datasets, ReSimplifyIt-simple + ReSimplifyIt-full
bash run_emc_process.sh 50  # 50 parallel API threads
# Or a single dataset
python run_emc_simple_baseline.py --dataset EMCompress --num_threads 50
python run_emc_full_baseline.py --dataset EMCompress --num_threads 30
```

Runs each of 11 video-LLMs (8 local + 3 API) on each dataset, twice (with vs without EMC), for paper Table 2:
```bash
bash run_emc_guided_inference.sh 1 200  # 1 GPU per local-model torchrun, 200 API threads
```

The script auto-routes OpenAI models (GPT-4o / GPT-4.1-mini / GPT-4-turbo) through plain Python + threading, and routes local VLMs (InternVL3.5 / Qwen2.5-VL / Qwen3-VL / LLaVA-OneVision) through `torchrun` for multi-GPU inference.
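
To make the with-EMC vs without-EMC comparison concrete, here is a minimal sketch of how a fixed frame budget could be spent in each case. It assumes `screened_timestamps` is a list of `(start_sec, end_sec)` pairs; the field's real format and the repository's actual frame sampler are not described in this README, and all function names below are hypothetical.

```python
from typing import List, Sequence, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec); assumed format of screened_timestamps


def uniform_timestamps(start: float, end: float, n: int) -> List[float]:
    """Midpoints of n equal slices of [start, end]; empty when n is not positive."""
    if n <= 0:
        return []
    step = (end - start) / n
    return [start + (i + 0.5) * step for i in range(n)]


def sample_without_emc(video_duration: float, budget: int) -> List[float]:
    """Baseline: spread the frame budget over the whole video."""
    return uniform_timestamps(0.0, video_duration, budget)


def sample_with_emc(screened: Sequence[Segment], budget: int) -> List[float]:
    """Spend the same budget only inside the segments kept by Stage 1."""
    kept = sorted(screened)
    total = sum(end - start for start, end in kept)
    if not kept or total <= 0:
        return []
    picks = []
    # Sample the concatenated clip uniformly, then map each offset back to source time.
    for offset in uniform_timestamps(0.0, total, budget):
        for start, end in kept:
            length = end - start
            if offset < length:
                picks.append(start + offset)
                break
            offset -= length
        else:
            # Guard against floating-point drift: fall back to the end of the last segment.
            picks.append(kept[-1][1])
    return picks
```

With a budget of 4 frames on a 100-second video, the baseline places one frame every 25 seconds, while the screened variant places all four inside the kept segments, which is the filter-before-reason idea in miniature.
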
caption_all_videos.py precomputes per-frame captions used by ReSimplifyIt's Viewer. Supports OpenAI API and local Qwen-VL / LLaVA-1.5 / LLaVA-NeXT backends auto-routed by --model:
```bash
# OpenAI API (any vision-capable model)
python caption_all_videos.py --model gpt-4o --num_threads 32 --skip_existing
# Local Qwen-VL family
torchrun --nproc_per_node=4 caption_all_videos.py --model Qwen/Qwen3-VL-32B-Instruct --skip_existing
# Local LLaVA family
torchrun --nproc_per_node=2 caption_all_videos.py --model llava-hf/llava-v1.6-mistral-7b-hf --skip_existing
```

EMC integration yields relative gains of 7.33% in training and 33.7% in inference for video-language understanding (see paper Table 2 / Table 3).
| Setup | Δ vs. w/o EMC |
|---|---|
| 11 video-LLMs × 8 datasets, inference-time EMC | +33.7% rel. |
| Video Instruction Tuning w/ EMC labels | +7.33% rel. |
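
For readers unfamiliar with the "rel." notation, a relative gain is conventionally the improvement divided by the baseline score. The helper below shows that arithmetic; how the paper aggregates across models and datasets is not stated here, so treat this as the generic definition rather than the authors' exact procedure. The function name is hypothetical.

```python
def relative_gain(with_emc: float, without_emc: float) -> float:
    """Relative improvement in percent: 100 * (with - without) / without."""
    if without_emc <= 0:
        raise ValueError("baseline score must be positive")
    return 100.0 * (with_emc - without_emc) / without_emc
```

As a made-up example, a model that moves from 50.0 to 60.0 accuracy gains 10 points absolute, which is 20% relative.
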
ReSimplifyIt also surpasses prior baselines for similar tasks by 0.40 F1 on EMCompress's stage-1 query-rewriting evaluation.
Full per-dataset and per-model numbers are in the paper.
If you find this work useful, please cite:
```bibtex
@inproceedings{fan2026emcompress,
  title     = {{EMCompress}: Video-LLMs with Endomorphic Multimodal Compression},
  author    = {Fan, Zheyu and Liu, Jiateng and Zhang, Yuji and Wang, Zihan and
               Fung, Yi R. and Li, Manling and Ji, Heng},
  booktitle = {Findings of the Association for Computational Linguistics: ACL 2026},
  year      = {2026}
}
```

- Code in this repository: MIT (see `LICENSE`).
- EMCompress benchmark data (videos and annotations on HuggingFace): CC-BY-NC-SA 4.0, inheriting from YouCook2. Non-commercial research use only.
Built on the YouCook2 cooking-video dataset. We thank the authors of all baseline Video-LLMs evaluated in this work.
