TriMotion is a unified framework for camera-controlled video generation that maps video, pose, and text inputs describing the same camera trajectory into a shared motion embedding space. This modality-agnostic design enables flexible and consistent camera control from heterogeneous user inputs, built on top of WAN-Video.
Existing camera-control methods are typically restricted to a single input modality — pose-conditioned methods require precise geometric trajectories, reference-video methods lack explicit control, and text-based methods struggle with temporal consistency. TriMotion addresses all three limitations in a single framework.
Key components:
- Unified Motion Embedding Space — aligns video, pose, and text in a shared representation via contrastive learning, temporal synchronization, and geometric fidelity regularization
- Motion Triplet Dataset — 136K synchronized (video, pose, text) triplets built on the Multi-Cam Video Dataset with LLM-generated geometry-grounded captions
- Latent Motion Consistency — a Motion Embedding Predictor that enforces trajectory fidelity directly in latent space, avoiding costly pixel-space decoding
Three-Stage Training Pipeline:
Stage 1: Train Unified Motion Embedding Space (video + text + pose alignment)
↓
Stage 2: Train Motion Embedding Predictor (latent → motion embedding)
↓
Stage 3: Train WAN-Video Diffusion Model (camera-controlled I2V / V2V)
```shell
pip install torch torchvision
pip install transformers diffusers accelerate deepspeed
pip install pytorch-lightning
pip install decord einops scipy pillow numpy
```

Tested with Python 3.10, PyTorch 2.x, and CUDA 11.8+.
The Motion Triplet Dataset is built upon the Multi-Cam Video Dataset (136K videos, 13.6K scenes, 40 Unreal Engine 5 environments) by adding geometry-grounded motion descriptions.
- First, download the Multi-Cam Video Dataset into the MotionTriplet-Dataset directory.
- Then, download the Motion Descriptions from Google Drive and place them in the MotionTriplet-Dataset directory.
- Finally, run the command below to prepare the dataset for training.
```shell
python merge_datasets.py
```

The structure of the full dataset is as follows:
```
MotionTriplet-Dataset/
├── train/
│   └── f00/
│       └── scene1/
│           ├── cameras/
│           │   ├── camera_extrinsics.json
│           │   ├── text_description_long.json
│           │   └── text_description_short.json
│           ├── videos/
│           │   ├── cam01.mp4
│           │   ├── ...
│           │   └── cam10.mp4
│           ├── text/
│           │   └── text_description.json
│           └── merged_conditions.json
```
Precompute and cache embeddings before training:
```shell
python latent_preprocess.py \
  --base_path ./data \
  --json_path ./data/merged_camera_dataset.json \
  --t5_path path/to/t5-base \
  --vggt_path path/to/vggt \
  --embedding_model_path path/to/stage1_checkpoint \
  --output_dir ./data
```

Stage 1 trains motion encoders for all three modalities with a composite loss: global InfoNCE alignment, temporal synchronization, and geometric fidelity regularization.
```shell
python train_embedding_space.py \
  --base_path ./data \
  --json_path ./data/merged_camera_dataset.json \
  --t5_path path/to/t5-base \
  --vggt_path path/to/vggt \
  --output_dir ./checkpoint/stage1 \
  --batch_size 16 \
  --lr 1e-4 \
  --epochs 100
```

Stage 2 trains the predictor (3D convolutions + temporal Transformer) to estimate motion embeddings from VAE latents, using a dual-granularity cosine-similarity loss (global + frame-wise).
```shell
python train_motion_embedding_projector.py \
  --base_path ./data \
  --json_path ./data/merged_camera_dataset.json \
  --embedding_model_path ./checkpoint/stage1/best.ckpt \
  --output_dir ./checkpoint/stage2 \
  --batch_size 24 \
  --lr 1e-4 \
  --epochs 10
```

Stage 3 fine-tunes WAN-Video with motion embedding conditioning via block-specific projection MLPs, jointly training I2V and V2V with equal probability per iteration.
```shell
deepspeed train_TriMotion.py \
  --base_path ./data \
  --json_path ./data/merged_camera_dataset.json \
  --wan_model_path path/to/wan-video \
  --t5_path path/to/t5-base \
  --embedding_model_path ./checkpoint/stage1/best.ckpt \
  --projector_path ./checkpoint/stage2/best.ckpt \
  --output_dir ./checkpoint/stage3 \
  --batch_size 4 \
  --lr 1e-4 \
  --deepspeed_stage 2
```

Training was performed on 4 × NVIDIA H200 GPUs with AdamW (β₁=0.9, β₂=0.999, weight decay=0.01, lr=1×10⁻⁴).
```shell
python demo_multimodal.py \
  --wan_model_path path/to/wan-video \
  --embedding_model_path ./checkpoint/stage1/best.ckpt \
  --projector_path ./checkpoint/stage2/best.ckpt \
  --stage3_path ./checkpoint/stage3 \
  --input_video path/to/reference.mp4 \
  --prompt "The camera starts with a steady dolly-in motion while gradually panning left." \
  --output_path output.mp4
```

Each modality encoder produces N temporal motion tokens + 1 global token, processed by a lightweight temporal Transformer T_m:
| Modality | Encoder | Key design |
|---|---|---|
| Video | VGGT Aggregator (Alternating-Attention blocks) | Camera tokens aggregate multi-view 3D geometry |
| Text | Frozen T5 + N learnable motion queries (cross-attention) | Lifts static text into temporal motion sequence |
| Pose | Frame-wise MLP (GELU) on flattened 3×4 extrinsic matrix | Preserves geometric trajectory structure |
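As an illustration, the pose branch can be sketched as a frame-wise MLP over flattened extrinsics. The hidden width, depth, and the learnable global token below are illustrative assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class PoseMotionEncoder(nn.Module):
    """Sketch: maps per-frame 3x4 camera extrinsics to N motion tokens + 1 global token."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Flattened 3x4 extrinsic matrix -> 12-dim input per frame.
        self.mlp = nn.Sequential(
            nn.Linear(12, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )
        # One learnable global token summarizing the whole trajectory (assumption).
        self.global_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:
        # extrinsics: (B, N, 3, 4) -> tokens: (B, N, dim)
        b, n = extrinsics.shape[:2]
        tokens = self.mlp(extrinsics.reshape(b, n, 12))
        g = self.global_token.expand(b, -1, -1)   # (B, 1, dim)
        return torch.cat([g, tokens], dim=1)      # (B, N+1, dim)

enc = PoseMotionEncoder()
out = enc(torch.randn(2, 16, 3, 4))
print(out.shape)  # torch.Size([2, 17, 512])
```

Operating frame-wise (no mixing across time inside the MLP) preserves the per-step trajectory structure, which the shared temporal Transformer T_m then contextualizes.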
Training objectives:
- L_NCE (Global Alignment): InfoNCE contrastive loss over all 3 modality pairs
- L_temp (Temporal Synchronization): Cosine distance between corresponding temporal tokens
- L_pose (Geometric Fidelity): Shared pose regressor predicting camera extrinsics from each modality embedding (L1 loss)
L_align = L_NCE + λ_t · L_temp + λ_p · L_pose
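A minimal sketch of L_align, assuming global tokens of shape (B, D), temporal tokens of shape (B, N, D), and a shared linear pose regressor; the temperature and loss weights are placeholders:

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of global embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def align_loss(g, t, pose_head, pose_gt, lam_t=1.0, lam_p=1.0):
    # g: dict of (B, D) global tokens; t: dict of (B, N, D) temporal tokens.
    pairs = [("video", "text"), ("video", "pose"), ("text", "pose")]
    # L_NCE: contrastive alignment over all three modality pairs.
    l_nce = sum(info_nce(g[a], g[b]) for a, b in pairs)
    # L_temp: cosine distance between corresponding temporal tokens.
    l_temp = sum(1 - F.cosine_similarity(t[a], t[b], dim=-1).mean()
                 for a, b in pairs)
    # L_pose: shared regressor predicts extrinsics from each modality (L1).
    l_pose = sum(F.l1_loss(pose_head(t[m]), pose_gt) for m in t)
    return l_nce + lam_t * l_temp + lam_p * l_pose

B, N, D = 4, 8, 64
g = {m: torch.randn(B, D) for m in ("video", "text", "pose")}
t = {m: torch.randn(B, N, D) for m in ("video", "text", "pose")}
pose_head = torch.nn.Linear(D, 12)   # predicts flattened 3x4 extrinsics
pose_gt = torch.randn(B, N, 12)
loss = align_loss(g, t, pose_head, pose_gt)
```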
A frozen Motion Embedding Predictor M_pred estimates a motion embedding from the reconstructed clean latent during diffusion training:
ẑ_0 = z̃_t - t · v_θ(z̃_t, t, y, I, e_m)
L_total = L_denoise + λ_m · L_motion(M_pred(ẑ_0), e_m)
This enforces trajectory adherence without pixel-space decoding.
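The latent motion consistency term can be sketched as below, assuming a rectified-flow parameterization (z̃_t = (1−t)·z_0 + t·z_1, so ẑ_0 = z̃_t − t·v_θ) and a cosine-distance motion loss; the toy predictor stands in for the frozen M_pred:

```python
import torch
import torch.nn.functional as F

def motion_consistency_loss(z_t, t, v_pred, m_pred, e_m):
    """Reconstruct the clean latent and compare its motion embedding to e_m."""
    # z0_hat = z_t - t * v_theta(...)  (rectified-flow assumption)
    z0_hat = z_t - t.view(-1, 1, 1, 1, 1) * v_pred   # (B, C, T, H, W)
    e_hat = m_pred(z0_hat)                            # (B, D)
    return 1 - F.cosine_similarity(e_hat, e_m, dim=-1).mean()

# Toy frozen "predictor": pools the latent and projects it (illustrative only).
class ToyPredictor(torch.nn.Module):
    def __init__(self, c=4, d=16):
        super().__init__()
        self.proj = torch.nn.Linear(c, d)
    def forward(self, z):
        return self.proj(z.mean(dim=(2, 3, 4)))       # pool over T, H, W

B, C, T, H, W = 2, 4, 8, 8, 8
m_pred = ToyPredictor().requires_grad_(False)         # M_pred stays frozen
z_t = torch.randn(B, C, T, H, W)
t = torch.rand(B)
v_pred = torch.randn(B, C, T, H, W)
e_m = torch.randn(B, 16)
loss = motion_consistency_loss(z_t, t, v_pred, m_pred, e_m)
```

Because M_pred is frozen, gradients flow only through ẑ_0 into the diffusion model, steering it toward the conditioning trajectory without any pixel-space decode.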
Motion embeddings are injected into each DiT block via a block-specific projection MLP with residual addition:
h_in = h + F_proj(e_m)
Only the 3D spatial-temporal attention layers and projection MLPs are updated during fine-tuning.
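The injection rule h_in = h + F_proj(e_m) can be sketched as one small MLP per DiT block; the MLP depth and the broadcast of a single motion vector over tokens are assumptions here:

```python
import torch
import torch.nn as nn

class MotionInjector(nn.Module):
    """Block-specific projection MLPs with residual addition: h_in = h + F_proj(e_m)."""

    def __init__(self, num_blocks: int, motion_dim: int, hidden_dim: int):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Sequential(nn.Linear(motion_dim, hidden_dim),
                          nn.GELU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_blocks)
        )

    def forward(self, h: torch.Tensor, e_m: torch.Tensor, block: int):
        # h: (B, L, hidden_dim); e_m: (B, motion_dim)
        # Project e_m with this block's MLP and add it to every token.
        return h + self.projs[block](e_m).unsqueeze(1)

inj = MotionInjector(num_blocks=4, motion_dim=32, hidden_dim=64)
h = torch.randn(2, 10, 64)
e_m = torch.randn(2, 32)
print(inj(h, e_m, block=0).shape)  # torch.Size([2, 10, 64])
```

A separate MLP per block lets each depth of the DiT read the motion signal at its own scale while the residual form keeps the pretrained features intact at initialization.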
Concatenate two motion sequences (offset by the final state of the first) to generate compound multi-stage camera trajectories across any modality combination.
Linearly interpolate between motion embeddings e_a and e_b from different modalities to produce smooth blended camera motion.
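Both composition operations can be sketched on motion token sequences of shape (N, D); the additive offset by the first sequence's final state is a simplifying assumption about how the chaining is realized:

```python
import torch

def concat_motions(e_a: torch.Tensor, e_b: torch.Tensor) -> torch.Tensor:
    """Chain e_b after e_a, shifting e_b so it starts from e_a's final state."""
    offset = e_a[-1] - e_b[0]
    return torch.cat([e_a, e_b + offset], dim=0)      # (Na + Nb, D)

def blend_motions(e_a: torch.Tensor, e_b: torch.Tensor, alpha: float = 0.5):
    """Linear interpolation between two same-length motion sequences."""
    return (1 - alpha) * e_a + alpha * e_b

e_a, e_b = torch.randn(8, 16), torch.randn(8, 16)
chained = concat_motions(e_a, e_b)                    # compound trajectory
blended = blend_motions(e_a, e_b, alpha=0.3)          # smooth blend
print(chained.shape, blended.shape)
```

Because the embedding space is modality-agnostic, e_a and e_b here may come from any two modalities, e.g. a reference video chained into a text-described pan.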
```bibtex
@inproceedings{trimotion2026,
  title={TriMotion: Modality-Agnostic Camera Control for Video Generation},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}
```

- WAN-Video — diffusion backbone
- VGGT — video motion encoder
- ReCamMaster — Multi-Cam Video Dataset
- Qwen3 — geometry-grounded caption generation
- Hugging Face Transformers — T5 text encoder