1yuwen/AcaM--Camera-Understanding

Natural Language Camera Movement Understanding

Yuwen Tan¹ · Joey Huang¹ · Jin Huang² · Haoxiang Li² · Boqing Gong¹

¹ Boston University    ² Pixocial Technology

Abstract

Understanding camera movement in natural language is critical for training and evaluating video generation models, among other applications. However, we demonstrate that existing vision-language models (VLMs) fail this task in surprising ways, frequently confusing translation with rotation, left with right, and object movement with camera movement. To address these limitations, we establish natural language camera movement understanding as a standalone research task. We introduce a two-level cinematographic taxonomy and an extensive, atomic benchmark featuring both real and synthetic videos. Furthermore, we curate a large-scale, multi-source training set enhanced by targeted camera movement augmentation. Our fine-tuned VLM-8B outperforms Gemini 3 Pro by 10% and 17% on our benchmark's real and synthetic videos, respectively. Despite these gains, a significant gap remains relative to human performance, underscoring the need to promote and facilitate future research on natural language camera movement understanding.

Links

| Resource | Link |
| --- | --- |
| 🤗 Benchmark (ACaM-Bench) | https://huggingface.co/datasets/Yuwen2024/ACaM-Bench |
| 🤗 Model checkpoints | https://huggingface.co/collections/Yuwen2024/acam |

Benchmark

ACaM-Bench covers a two-level cinematographic taxonomy of 17 atomic camera movement classes spanning translations, rotations, focal-length changes, static shots, and object-centric movements. It provides three splits:

| Split | # Items | Task |
| --- | --- | --- |
| real | 1464 | 4-way multiple choice (real-world clips) |
| syn | 1179 | 4-way multiple choice (synthetic clips) |
| binary | 1510 | Yes/No question (balanced) |

Note: 454 of the real clips come from CameraBench and are not redistributed in ACaM-Bench. Download them from CameraBench directly; the filenames match.
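Because some real clips must be fetched separately, it can help to check which referenced videos are still missing locally before running an evaluation. The following is a minimal sketch, not part of this repository: it assumes each JSONL record stores the clip path under a key named `video`, which is a hypothetical field name, and the function name and paths are illustrative only.

```python
import json
from pathlib import Path


def find_missing_clips(jsonl_path, video_dir, video_key="video"):
    """Return sorted filenames referenced in a JSONL file but absent from video_dir.

    `video_key` is an assumed field name; adjust it to the actual benchmark schema.
    """
    video_dir = Path(video_dir)
    missing = set()
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Compare by filename only, since the note says filenames match.
            name = Path(record[video_key]).name
            if not (video_dir / name).is_file():
                missing.add(name)
    return sorted(missing)
```

Calling `find_missing_clips("real.jsonl", "videos/")` with your own paths returns the filenames that still need to be downloaded; an empty list means every referenced clip was found.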

Repository Structure

.
├── evaluation/          # Evaluation code
│   ├── MCQ_evaluation/      # 4-way multiple-choice eval (per model)
│   ├── Binary_evaluation/   # Yes/No binary eval (per model)
│   └── scripts/             # Launch scripts
├── metrics/             # Accuracy / P-R-F1 computation
├── train/               # Fine-tuning configs + patched vision_process.py
└── dataset/             # Benchmark JSONLs and data-prep scripts

Evaluation

Per-model evaluation scripts live in evaluation/:

  • MCQ_evaluation/eval_<model>.py — 4-way multiple choice
  • Binary_evaluation/eval_<model>.py — Yes/No
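For 4-way multiple choice, a free-form model response has to be mapped to one option letter before scoring. The sketch below shows one conservative way to do that. It is an illustration under stated assumptions, not the parsing used by the scripts in evaluation/: the option letters A to D and the function name `extract_choice` are hypothetical.

```python
import re

# A leading option letter such as "B", "(C) Pan left", or "C. Pan left".
_LEADING = re.compile(r"^\(?([A-D])\)?\s*(?:[.:,)\-]|$)", re.MULTILINE)
# A phrase such as "The answer is D." or "Answer: A".
# Only the words are case-insensitive; the option letter must be uppercase.
_PHRASE = re.compile(r"(?i:answer(?:\s+is)?)\s*:?\s*\(?([A-D])\)?(?![A-Za-z])")


def extract_choice(response):
    """Return the option letter found in a model response, or None if ambiguous."""
    text = response.strip()
    match = _LEADING.match(text) or _PHRASE.search(text)
    return match.group(1) if match else None
```

Returning `None` rather than guessing keeps unparseable responses visible; a caller would typically count them as incorrect. The patterns deliberately reject a sentence that merely begins with the article "A", such as "A pan to the left".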

Compute metrics with the scripts in metrics/ (e.g. metrics/compute_accuracy_real.py, metrics/compute_binary.py).
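For reference, the metrics named above follow the standard definitions: accuracy is the fraction of correct predictions, precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The sketch below is a self-contained illustration of those formulas and is not the code in metrics/; the function names and the choice of "yes" as the positive label are assumptions.

```python
def accuracy(predictions, labels):
    """Fraction of predictions equal to the reference labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        return 0.0
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def precision_recall_f1(predictions, labels, positive="yes"):
    """Precision, recall, and F1 for the `positive` class (assumed to be "yes")."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    # Define each ratio as 0.0 when its denominator is zero.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

As a point of comparison when reading results, uniform random guessing gives an expected accuracy of 25% on the 4-way splits and 50% on the balanced binary split.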

Training

We fine-tune Qwen3-VL with the 2U1/Qwen-VL-Series-Finetune codebase. See train/README.md for environment setup (including building qwen-vl-utils from source and patching vision_process.py) and the training launch script.

Citation

@inproceedings{XXXX,
  title     = {Natural Language Camera Movement Understanding},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}

About

Natural Language Camera Movement Understanding (ECCV 2026)
