Yuwen Tan1 · Joey Huang1 · Jin Huang2 · Haoxiang Li2 · Boqing Gong1
1 Boston University 2 Pixocial Technology
Understanding camera movement in natural language is critical for training and evaluating video generation models, among other applications. However, we demonstrate that existing vision-language models (VLMs) fail this task in surprising ways, frequently confusing translation with rotation, left with right, and object movement with camera movement. To address these limitations, we establish natural language camera movement understanding as a standalone research task. We introduce a two-level cinematographic taxonomy and an extensive, atomic benchmark featuring both real and synthetic videos. Furthermore, we curate a large-scale, multi-source training set enhanced by targeted camera movement augmentation. Our fine-tuned VLM-8B outperforms Gemini 3 Pro by 10% and 17% on our benchmark's real and synthetic videos, respectively. Despite these gains, a significant gap remains relative to human performance, underscoring the need to promote and facilitate future research on natural language camera movement understanding.
| Resource | Link |
|---|---|
| 🤗 Benchmark (ACaM-Bench) | https://huggingface.co/datasets/Yuwen2024/ACaM-Bench |
| 🤗 Model checkpoints | https://huggingface.co/collections/Yuwen2024/acam |
ACaM-Bench covers a two-level cinematographic taxonomy of 17 atomic camera movement classes spanning translations, rotations, focal-length changes, static shots, and object-centric movements. It provides three splits:
| Split | # Items | Task |
|---|---|---|
| `real` | 1464 | 4-way multiple choice (real-world clips) |
| `syn` | 1179 | 4-way multiple choice (synthetic clips) |
| `binary` | 1510 | Yes/No question (balanced) |
Note: 454 of the `real` clips come from CameraBench and are not redistributed; download them from there (filenames match).
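Because the CameraBench clips must be fetched separately and are matched by filename, a quick check for missing files can help before running evaluation. The following is a minimal sketch, not part of this repository: `find_missing_clips` is a hypothetical helper, and it assumes you have already collected the expected clip filenames as plain strings (how they are stored in the benchmark files is not specified here).

```python
from pathlib import Path
from typing import Iterable, List


def find_missing_clips(expected: Iterable[str], video_dir: str) -> List[str]:
    """Return the expected clip names that have no matching file under video_dir.

    Hypothetical helper: matching is by base filename only, searched
    recursively, which mirrors the "filenames match" note above.
    """
    # Collect the base names of every file found under video_dir.
    present = {p.name for p in Path(video_dir).rglob("*") if p.is_file()}
    # Keep the expected entries whose base name was not found on disk.
    return sorted(name for name in set(expected) if Path(name).name not in present)
```

An empty result means every expected clip was found locally; any names returned are the ones still to be downloaded.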
.
├── evaluation/ # Evaluation code
│ ├── MCQ_evaluation/ # 4-way multiple-choice eval (per model)
│ ├── Binary_evaluation/ # Yes/No binary eval (per model)
│ └── scripts/ # Launch scripts
├── metrics/ # Accuracy / P-R-F1 computation
├── train/ # Fine-tuning configs + patched vision_process.py
└── dataset/ # Benchmark JSONLs and data-prep scripts
Per-model evaluation scripts live in evaluation/:
- `MCQ_evaluation/eval_<model>.py`: 4-way multiple choice
- `Binary_evaluation/eval_<model>.py`: Yes/No
Compute metrics with the scripts in metrics/ (e.g.
metrics/compute_accuracy_real.py, metrics/compute_binary.py).
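For orientation, the two kinds of metrics mentioned above (accuracy for multiple choice, precision/recall/F1 for Yes/No) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the code in `metrics/`: the function names are hypothetical, predictions and labels are assumed to be already parsed into plain strings, and "yes" is assumed to be the positive class.

```python
from typing import Sequence, Tuple


def accuracy(preds: Sequence[str], golds: Sequence[str]) -> float:
    """Fraction of predictions equal to the gold label (case-insensitive)."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    if not golds:
        return 0.0
    correct = sum(p.strip().upper() == g.strip().upper() for p, g in zip(preds, golds))
    return correct / len(golds)


def binary_prf1(
    preds: Sequence[str], golds: Sequence[str], positive: str = "yes"
) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for Yes/No answers, treating `positive` as the positive class."""
    if len(preds) != len(golds):
        raise ValueError("preds and golds must have the same length")
    pos = positive.lower()
    tp = fp = fn = 0
    for p, g in zip(preds, golds):
        pred_pos = p.strip().lower() == pos
        gold_pos = g.strip().lower() == pos
        if pred_pos and gold_pos:
            tp += 1  # true positive
        elif pred_pos:
            fp += 1  # predicted yes, gold is no
        elif gold_pos:
            fn += 1  # predicted no, gold is yes
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, `accuracy(["A", "b", "C", "D"], ["A", "B", "D", "D"])` counts three matches out of four. Zero denominators are mapped to 0.0 here as a convention of this sketch; the repository scripts may handle them differently.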
We fine-tune Qwen3-VL with the
2U1/Qwen-VL-Series-Finetune
codebase. See train/README.md for environment setup
(including building qwen-vl-utils from source and patching vision_process.py)
and the training launch script.
@inproceedings{XXXX,
title = {Natural Language Camera Movement Understanding},
booktitle = {European Conference on Computer Vision (ECCV)},
year = {2026}
}