Natural Language Camera Movement Understanding

Yuwen Tan¹ · Joey Huang¹ · Jin Huang² · Haoxiang Li² · Boqing Gong¹

¹ Boston University ² Pixocial Technology

Abstract

Understanding camera movement in natural language is critical for training and evaluating video generation models, among other applications. However, we demonstrate that existing vision-language models (VLMs) fail this task in surprising ways, frequently confusing translation with rotation, left with right, and object movement with camera movement. To address these limitations, we establish natural language camera movement understanding as a standalone research task. We introduce a two-level cinematographic taxonomy and an extensive, atomic benchmark featuring both real and synthetic videos. Furthermore, we curate a large-scale, multi-source training set enhanced by targeted camera movement augmentation. Our fine-tuned VLM-8B outperforms Gemini 3 Pro by 10% and 17% on our benchmark's real and synthetic videos, respectively. Despite these gains, a significant gap remains relative to human performance, underscoring the need to promote and facilitate future research on natural language camera movement understanding.

Links

| Resource | Link |
| --- | --- |
| 🤗 Benchmark (ACaM-Bench) | https://huggingface.co/datasets/Yuwen2024/ACaM-Bench |
| 🤗 Model checkpoints | https://huggingface.co/collections/Yuwen2024/acam |

Benchmark

ACaM-Bench covers a two-level cinematographic taxonomy of 17 atomic camera movement classes spanning translations, rotations, focal-length changes, static shots, and object-centric movements. It provides three splits:

| Split | # Items | Task |
| --- | --- | --- |
| `real` | 1464 | 4-way multiple choice (real-world clips) |
| `syn` | 1179 | 4-way multiple choice (synthetic clips) |
| `binary` | 1510 | Yes/No question (balanced) |

Note: 454 of the real clips come from CameraBench and are not redistributed here. Download them from CameraBench; the filenames match.
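Because some real clips have to be fetched separately, it can help to check which referenced files are still missing before running an evaluation. The following is a minimal sketch, not part of this repository: the function name `find_missing_clips` and the JSONL field name `video` are assumptions, so adjust them to the actual benchmark schema.

```python
import json
from pathlib import Path


def find_missing_clips(jsonl_path, video_dir, key="video"):
    """Return sorted clip filenames referenced in a JSONL file but absent from video_dir.

    `key` is a hypothetical field name; change it to match the real benchmark JSONL.
    """
    video_dir = Path(video_dir)
    referenced = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            # Compare by base filename only, since the clips are matched by filename.
            referenced.add(Path(record[key]).name)
    present = {p.name for p in video_dir.iterdir() if p.is_file()}
    return sorted(referenced - present)
```

Calling `find_missing_clips("real.jsonl", "videos/")` with your own paths would list the clips that still need to be downloaded.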

Repository Structure

.
├── evaluation/          # Evaluation code
│   ├── MCQ_evaluation/      # 4-way multiple-choice eval (per model)
│   ├── Binary_evaluation/   # Yes/No binary eval (per model)
│   └── scripts/             # Launch scripts
├── metrics/             # Accuracy / P-R-F1 computation
├── train/               # Fine-tuning configs + patched vision_process.py
└── dataset/             # Benchmark JSONLs and data-prep scripts

Evaluation

Per-model evaluation scripts live in evaluation/:

MCQ_evaluation/eval_<model>.py — 4-way multiple choice
Binary_evaluation/eval_<model>.py — Yes/No
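A 4-way multiple-choice evaluation typically has to map a free-form model response to one of the option letters. The sketch below illustrates one way this could be done; it is an assumption for illustration and does not reproduce the parsing in the repository's `eval_<model>.py` scripts. The function name `extract_choice` and the option labels A to D are hypothetical.

```python
import re

# Patterns are tried in order, from most explicit to least explicit.
_PATTERNS = [
    # "Answer: B", "the answer is (C)"; the letter itself must be uppercase.
    re.compile(r"(?i:answer)\s*(?i:is|:)?\s*\(?([A-D])\)?\b"),
    # The whole response is a single letter, e.g. "B", "(B)", "B."
    re.compile(r"^\s*\(?([A-D])\)?\s*[.:)]?\s*$"),
    # The response starts with a labeled option, e.g. "A. Pan left"
    re.compile(r"^\s*\(?([A-D])[.:)]"),
]


def extract_choice(response):
    """Return the chosen option letter (A-D) from a model response, or None."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1)
    return None  # unparsable responses can be scored as incorrect
```

Returning `None` for unparsable responses, rather than guessing, keeps a sentence that merely begins with the article "A" from being counted as option A.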

Compute metrics with the scripts in metrics/ (e.g. metrics/compute_accuracy_real.py, metrics/compute_binary.py).
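For reference, the quantities involved (accuracy for the multiple-choice splits, precision/recall/F1 for the binary split) can be sketched as follows. This is an illustrative stand-in, not the code in `metrics/`: the function names are hypothetical, and treating "yes" as the positive class is an assumption.

```python
def accuracy(predictions, labels):
    """Fraction of predictions that exactly match their labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def binary_prf1(predictions, labels, positive="yes"):
    """Precision, recall, and F1 for the `positive` class (assumed to be "yes")."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    tp = fp = fn = 0
    for p, y in zip(predictions, labels):
        if p == positive and y == positive:
            tp += 1  # true positive
        elif p == positive:
            fp += 1  # predicted positive, label is not
        elif y == positive:
            fn += 1  # label is positive, prediction is not
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1
```

A prediction of `None` (for example, an unparsable response) never equals a label, so it counts as incorrect in `accuracy` and as a missed positive in `binary_prf1` when the label is "yes".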

Training

We fine-tune Qwen3-VL with the 2U1/Qwen-VL-Series-Finetune codebase. See train/README.md for environment setup (including building qwen-vl-utils from source and patching vision_process.py) and the training launch script.

Citation

@inproceedings{XXXX,
  title     = {Natural Language Camera Movement Understanding},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
