Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs
Minji Kim*, Taekyung Kim*, Bohyung Han
(* Equal Contribution)
Official PyTorch implementation of the ICLR 2026 paper "Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs"
- 2026/03/03: Code and models released.
- 2026/01/26: Our paper is accepted to ICLR 2026 with strong reviews!
TL;DR: This paper presents a systematic analysis of where and how information flows in VideoLLMs for temporal reasoning in VideoQA, revealing key patterns and effective pathways.
Summary of our findings on VideoLLMs' information flow:
(a) Temporal reasoning begins with cross-frame interactions within video tokens at early-middle layers, followed by video-language integration into temporal keywords in the question. This information is conveyed to the last token at middle-late layers, where answer generation occurs.
(b) These effective pathways are identified via Attention Knockout, which disconnects attention pairs and tracks the drop in probability of the final answer to quantify their impact.
(c) Layer-wise answer probability rises immediately after video-language integration, indicating that the model is ready to predict correct answers after the middle layers.
Based on our analysis, we show that VideoLLMs can retain their VideoQA performance by keeping only the effective information pathways while suppressing a substantial fraction of attention edges, e.g., 58% in LLaVA-NeXT-7B-Video-FT.
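As a toy illustration of the Attention Knockout idea described above (a hedged sketch with random tensors, not the repository's implementation), the snippet below disconnects chosen (query, key) attention edges by adding negative infinity to the pre-softmax scores, so the blocked source positions contribute exactly zero weight:

```python
# Toy sketch of Attention Knockout (illustrative only, not the repo's API):
# disconnect chosen (query, key) attention edges by adding -inf to the
# pre-softmax scores, then compare outputs with and without the knockout.
import torch

def knockout_mask(seq_len: int, blocked_pairs):
    """Additive mask: blocked (query_pos, key_pos) edges get -inf."""
    mask = torch.zeros(seq_len, seq_len)
    for q, k in blocked_pairs:
        mask[q, k] = float("-inf")
    return mask

def attention(Q, K, V, mask=None):
    """Single-head scaled dot-product attention with an optional additive mask."""
    scores = Q @ K.transpose(-2, -1) / K.shape[-1] ** 0.5
    if mask is not None:
        scores = scores + mask
    weights = torch.softmax(scores, dim=-1)
    return weights @ V, weights

# Block the last token from attending to the first two (say, "video") positions.
torch.manual_seed(0)
Q, K, V = (torch.randn(4, 8) for _ in range(3))
mask = knockout_mask(4, [(3, 0), (3, 1)])
out, weights = attention(Q, K, V, mask)
# weights[3, 0] and weights[3, 1] are exactly zero after the knockout.
```

In the paper's setting, the drop in the final answer's probability under such a knockout quantifies how much the blocked edges matter.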
This repository supports:
- Causal intervention tools for VideoLLMs (e.g., Attention Knockout, Logit Lens, Attention Map Visualization)
- Reproducible experiments from our paper, including figure plotting code
- Training and evaluation across various model series and video benchmarks
You can download all model checkpoints from the Hugging Face links below. We fine-tuned LLaVA-NeXT and Mini-InternVL on VideoChat2-IT to analyze the impact of video instruction tuning on model behavior. We also adopted VideoLLaMA3 without additional fine-tuning.
| Model | Link | Initialized From |
|---|---|---|
| LLaVA-NeXT-7B-Video-FT | | llava-hf/llava-v1.6-vicuna-7b-hf |
| LLaVA-NeXT-13B-Video-FT | | llava-hf/llava-v1.6-vicuna-13b-hf |
| Mini-InternVL-4B-Video-FT | | OpenGVLab/Mini-InternVL-Chat-4B-V1-5 |
| VideoLLaMA3-7B | - | DAMO-NLP-SG/VideoLLaMA3-7B |
Tested with Python 3.10, PyTorch 2.2.1, CUDA 11.8. Other versions may be compatible.
Step 1: Create a virtual environment

- Option 1: PyTorch Docker image with torch==2.2.1, torchaudio==2.2.1, torchvision==0.17.1

  ```shell
  docker run -it --gpus all --ipc=host --rm --name=map_the_flow \
      pytorch/pytorch:2.2.1-cuda11.8-cudnn8-devel
  ```

- Option 2: Conda environment

  ```shell
  conda create -n map_the_flow python=3.10 -y
  conda activate map_the_flow
  conda install pytorch==2.2.1 torchvision==0.17.1 torchaudio==2.2.1 \
      pytorch-cuda=11.8 -c pytorch -c nvidia -y
  ```
Step 2: Clone the repository and install dependencies

```shell
git clone https://github.com/byminji/map-the-flow.git
cd map-the-flow
pip install -r requirements.txt
pip install mmcv-full==1.7.2 --no-build-isolation  # mmcv-full must be built from source
```

You can download all evaluation data from the Hugging Face links below. After downloading, set the paths in tasks/eval/config_dataset.py.
- TVBench: Our main benchmark for the analysis.
- TOMATO: Adopted for effective pathway analysis.
- LongVideoBench: Adopted for long-video understanding analysis.
- Video-MME: Adopted for spatial understanding analysis.
- VCGBench: Adopted for open-ended analysis. We followed the original repo to prepare the evaluation data.
All implementations are in the analysis folder, and run scripts are in scripts/analysis.
Results including graph plots and raw data are saved under ${output_path}/${dataset_name}/${target}/${model_name}.
To reproduce the plot style used in our paper, run analysis/visualize_graph_plots.py on the saved JSONs.
Modify these variables at the top of each script before running.
| Variable | Description | Example |
|---|---|---|
| dataset_name | Evaluation dataset | tvbench |
| output_path | Root directory for saving results | workspace/outputs/information_flow_analysis |
| video_model_path | Path to the fine-tuned model | workspace/models/LLaVA-NeXT-7B-Video-FT |
| base_model_path | Path to the base model | workspace/models/llava-v1.6-vicuna-7b-hf |
| conv_mode | Conversation template | eval_mvbench |
| pooling_shape | Token pooling shape (T-H-W) | 8-12-12 |
| task_id | Task index (-1 = full dataset) | 0 |
Task IDs for TVBench: 0=Action Antonym, 3=Action Sequence, 5=Moving Direction, 6=Object Count, 8=Scene Transition.
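Putting these together, a script header for a TVBench run might look like the following (illustrative values only; point the paths at your own checkpoints and output directory):

```shell
# Illustrative variable block for a scripts/analysis/*.sh script
# (example paths; adjust to your environment).
dataset_name=tvbench
output_path=workspace/outputs/information_flow_analysis
video_model_path=workspace/models/LLaVA-NeXT-7B-Video-FT
base_model_path=workspace/models/llava-v1.6-vicuna-7b-hf
conv_mode=eval_mvbench
pooling_shape=8-12-12
task_id=0   # 0 = Action Antonym; use -1 to run the full dataset
```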
- Causally traces the impact of specific token interactions using Attention Knockout.
- Script: scripts/analysis/information_flow_analysis_*.sh
- Implementation: analysis/information_flow_analysis.py
| --target | Description |
|---|---|
| cross-frame | Block cross-frame interactions among video tokens |
| vql-to-ql | Block video/question/last → question/last flows |
| question-and-options-to-last | Block question-only, true, and false options → last token |
| vq-to-true-opt | Block video/question → true option token |
- Traces layer-wise answer probability changes for true/false options (Fig. 9 in our paper).
- Script: included at the end of scripts/analysis/information_flow_analysis_*.sh
- Implementation: analysis/gen_prob_analysis.py
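Conceptually (a hedged toy sketch with random tensors, not the script's actual interface), layer-wise answer-probability tracing projects each layer's last-token hidden state through the unembedding matrix and records the probability assigned to a given answer token:

```python
# Toy sketch of layer-wise answer-probability tracing (illustrative only):
# project each layer's last-token hidden state through the unembedding
# matrix and record the probability of a chosen answer token id.
import torch

def answer_prob_by_layer(last_token_states, unembed, answer_id):
    """last_token_states: (n_layers, d_model); unembed: (vocab, d_model)."""
    logits = last_token_states @ unembed.T   # (n_layers, vocab)
    probs = torch.softmax(logits, dim=-1)    # per-layer vocabulary distribution
    return probs[:, answer_id]               # (n_layers,) probability curve

torch.manual_seed(0)
n_layers, d_model, vocab = 6, 16, 32
states = torch.randn(n_layers, d_model)      # stand-in for last-token states
unembed = torch.randn(vocab, d_model)        # stand-in for the output head
curve = answer_prob_by_layer(states, unembed, answer_id=7)
```

Comparing such curves for the true and false option tokens across layers is the pattern behind Fig. 9.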
- Disconnects attention edges except those defined as effective pathways, showing that a substantial fraction of attention edges can be suppressed while retaining VideoQA performance.
- Script: scripts/analysis/effective_pathway_analysis_*.sh
- Implementation: analysis/effective_pathway_analysis.py
- Logit probing by projecting layer-wise video token representations into the language vocabulary space.
- Script: scripts/analysis/logit_lens_analysis.sh
- Implementation: analysis/logit_lens_analysis.py
- After obtaining the JSON file, you can also run analysis/visualize_logit_lens_vocab_frequency.py to generate layer-wise vocabulary frequency plots (Fig. 4 in our paper).
- Add --visualize_on_video to generate per-frame visualizations (Fig. 5 in our paper). We use task_id=3 (Action Sequence) in our paper.
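The logit-lens idea can be sketched with toy tensors (hedged illustration; real use reads the model's own final LayerNorm and unembedding matrix, and this is not the repository's implementation):

```python
# Toy logit-lens sketch (illustrative shapes, not the repo's implementation):
# read intermediate token states through the final LayerNorm and the
# unembedding matrix to get a vocabulary distribution per token.
import torch

def logit_lens(hidden, ln_f, unembed):
    """hidden: (seq, d_model) -> (seq, vocab) logits via the output head."""
    return ln_f(hidden) @ unembed.T

torch.manual_seed(0)
d_model, vocab, seq = 16, 32, 5
ln_f = torch.nn.LayerNorm(d_model)       # stand-in for the model's final norm
unembed = torch.randn(vocab, d_model)    # stand-in for the unembedding matrix
hidden = torch.randn(seq, d_model)       # e.g., video-token states at some layer
logits = logit_lens(hidden, ln_f, unembed)
top_ids = logits.argmax(dim=-1)          # most likely vocab id per token
```

Counting the decoded top tokens per layer yields the vocabulary-frequency plots of the kind shown in Fig. 4.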
- Visualizes attention maps comparing baseline vs. attention knockout conditions (Fig. 6 in our paper).
- Script: scripts/analysis/attention_visualization.sh
- Implementation: analysis/attention_visualization.py
If you want to reproduce our training process, please refer to docs/TRAIN.md.
This project is built upon the following works:
- dissecting_factual_predictions, cross-modal-information-flow-in-MLLM: Causal intervention analysis
- PLLaVA: Base codebase and LLaVA-NeXT integration
- InternVL, VideoLLaMA3: Mini-InternVL and VideoLLaMA3 integration
We thank all authors who contributed to these foundational projects.
If you find our paper useful in your research, please consider citing:
@inproceedings{kim2026map,
author = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
title = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
booktitle = {International Conference on Learning Representations (ICLR)},
year = {2026},
}
@article{kim2025map,
author = {Kim, Minji and Kim, Taekyung and Han, Bohyung},
title = {Map the Flow: Revealing Hidden Pathways of Information in VideoLLMs},
journal = {arXiv preprint arXiv:2510.13251},
year = {2025},
}If you have any questions, please create an issue or contact minji@snu.ac.kr and taekyung.k@navercorp.com.
