TriMotion is a unified framework for camera-controlled video generation that maps video, pose, and text inputs describing the same camera trajectory into a shared motion embedding space. This modality-agnostic design enables flexible and consistent camera control from heterogeneous user inputs, built on top of WAN-Video.
Existing camera-control methods are typically restricted to a single input modality — pose-conditioned methods require precise geometric trajectories, reference-video methods lack explicit control, and text-based methods struggle with temporal consistency. TriMotion addresses all three limitations in a single framework.
Key components:
- Unified Motion Embedding Space — aligns video, pose, and text in a shared representation via contrastive learning, temporal synchronization, and geometric fidelity regularization
- Motion Triplet Dataset — 136K synchronized (video, pose, text) triplets built on the Multi-Cam Video Dataset with LLM-generated geometry-grounded captions
- Latent Motion Consistency — a Motion Embedding Predictor that enforces trajectory fidelity directly in latent space, avoiding costly pixel-space decoding
Three-Stage Training Pipeline:
Stage 1: Train Unified Motion Embedding Space (video + text + pose alignment)
↓
Stage 2: Train Motion Embedding Predictor (latent → motion embedding)
↓
Stage 3: Train WAN-Video Diffusion Model (camera-controlled I2V / V2V)
```shell
pip install torch torchvision
pip install transformers diffusers accelerate deepspeed
pip install pytorch-lightning
pip install decord einops scipy pillow numpy
```

Tested with Python 3.10, PyTorch 2.x, and CUDA 11.8+.
The Motion Triplet Dataset is built upon the Multi-Cam Video Dataset (136K videos, 13.6K scenes, 40 Unreal Engine 5 environments) by adding geometry-grounded motion descriptions.
- First, download the Multi-Cam Video Dataset into the MotionTriplet-Dataset directory.
- Then, download the Motion Descriptions from Google Drive and place them in the MotionTriplet-Dataset directory.
- Finally, run the command below to prepare the dataset for training.
```shell
python merge_datasets.py
```

The structure of the full dataset is as follows:
```
MotionTriplet-Dataset/
├── train/
│   └── f00/
│       └── scene1/
│           ├── cameras/
│           │   ├── camera_extrinsics.json
│           │   ├── text_description_long.json
│           │   └── text_description_short.json
│           ├── videos/
│           │   ├── cam01.mp4
│           │   ├── ...
│           │   └── cam10.mp4
│           ├── text/
│           │   └── text_description.json
│           └── merged_conditions.json
```
Precompute and cache embeddings before training:
```shell
python latent_preprocess.py \
  --base_path ./data \
  --json_path ./data/merged_camera_dataset.json \
  --t5_path path/to/t5-base \
  --vggt_path path/to/vggt \
  --embedding_model_path path/to/stage1_checkpoint \
  --output_dir ./data
```

Stage 1 trains motion encoders for all three modalities with a composite loss: global InfoNCE alignment, temporal synchronization, and geometric fidelity regularization.
```shell
python train_embedding_space.py \
  --base_path ./data \
  --json_path ./data/merged_camera_dataset.json \
  --t5_path path/to/t5-base \
  --vggt_path path/to/vggt \
  --output_dir ./checkpoint/stage1 \
  --batch_size 16 \
  --lr 1e-4 \
  --epochs 100
```

Stage 2 trains the predictor (3D convolutions + temporal Transformer) to estimate motion embeddings from VAE latents, using a dual-granularity cosine-similarity loss (global + frame-wise).
```shell
python train_motion_embedding_projector.py \
  --base_path ./data \
  --json_path ./data/merged_camera_dataset.json \
  --embedding_model_path ./checkpoint/stage1/best.ckpt \
  --output_dir ./checkpoint/stage2 \
  --batch_size 24 \
  --lr 1e-4 \
  --epochs 10
```

Stage 3 fine-tunes WAN-Video with motion embedding conditioning via block-specific projection MLPs, jointly training I2V and V2V with equal probability per iteration.
```shell
deepspeed train_TriMotion.py \
  --base_path ./data \
  --json_path ./data/merged_camera_dataset.json \
  --wan_model_path path/to/wan-video \
  --t5_path path/to/t5-base \
  --embedding_model_path ./checkpoint/stage1/best.ckpt \
  --projector_path ./checkpoint/stage2/best.ckpt \
  --output_dir ./checkpoint/stage3 \
  --batch_size 4 \
  --lr 1e-4 \
  --deepspeed_stage 2
```

Training was performed on 4 × NVIDIA H200 GPUs with AdamW (β₁=0.9, β₂=0.999, weight decay=0.01, lr=1×10⁻⁴).
```shell
python demo_multimodal.py \
  --wan_model_path path/to/wan-video \
  --embedding_model_path ./checkpoint/stage1/best.ckpt \
  --projector_path ./checkpoint/stage2/best.ckpt \
  --stage3_path ./checkpoint/stage3 \
  --input_video path/to/reference.mp4 \
  --prompt "The camera starts with a steady dolly-in motion while gradually panning left." \
  --output_path output.mp4
```

Each modality encoder produces N temporal motion tokens + 1 global token, processed by a lightweight temporal Transformer T_m:
| Modality | Encoder | Key design |
|---|---|---|
| Video | VGGT Aggregator (Alternating-Attention blocks) | Camera tokens aggregate multi-view 3D geometry |
| Text | Frozen T5 + N learnable motion queries (cross-attention) | Lifts static text into temporal motion sequence |
| Pose | Frame-wise MLP (GELU) on flattened 3×4 extrinsic matrix | Preserves geometric trajectory structure |
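As an illustration, the pose branch can be sketched as a frame-wise MLP over flattened extrinsics. The hidden width, depth, and the learnable global token below are illustrative assumptions, not the released architecture:

```python
import torch
import torch.nn as nn

class PoseMotionEncoder(nn.Module):
    """Sketch: maps per-frame 3x4 camera extrinsics to N motion tokens + 1 global token."""

    def __init__(self, dim: int = 512):
        super().__init__()
        # Flattened 3x4 extrinsic matrix -> 12-dim input per frame.
        self.mlp = nn.Sequential(
            nn.Linear(12, dim),
            nn.GELU(),
            nn.Linear(dim, dim),
        )
        # One learnable global token summarizing the whole trajectory (assumption).
        self.global_token = nn.Parameter(torch.zeros(1, 1, dim))

    def forward(self, extrinsics: torch.Tensor) -> torch.Tensor:
        # extrinsics: (B, N, 3, 4) -> tokens: (B, N, dim)
        b, n = extrinsics.shape[:2]
        tokens = self.mlp(extrinsics.reshape(b, n, 12))
        g = self.global_token.expand(b, -1, -1)   # (B, 1, dim)
        return torch.cat([g, tokens], dim=1)      # (B, N+1, dim)

enc = PoseMotionEncoder()
out = enc(torch.randn(2, 16, 3, 4))
print(out.shape)  # torch.Size([2, 17, 512])
```

Operating frame-wise (no mixing across time inside the MLP) preserves the per-step trajectory structure, which the shared temporal Transformer T_m then contextualizes.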
Training objectives:
- L_NCE (Global Alignment): InfoNCE contrastive loss over all 3 modality pairs
- L_temp (Temporal Synchronization): Cosine distance between corresponding temporal tokens
- L_pose (Geometric Fidelity): Shared pose regressor predicting camera extrinsics from each modality embedding (L1 loss)
L_align = L_NCE + λ_t · L_temp + λ_p · L_pose
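A minimal sketch of L_align, assuming global tokens of shape (B, D), temporal tokens of shape (B, N, D), and a shared linear pose regressor; the temperature and loss weights are placeholders:

```python
import torch
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    """Symmetric InfoNCE between two batches of global embeddings."""
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))

def align_loss(g, t, pose_head, pose_gt, lam_t=1.0, lam_p=1.0):
    # g: dict of (B, D) global tokens; t: dict of (B, N, D) temporal tokens.
    pairs = [("video", "text"), ("video", "pose"), ("text", "pose")]
    # L_NCE: contrastive alignment over all three modality pairs.
    l_nce = sum(info_nce(g[a], g[b]) for a, b in pairs)
    # L_temp: cosine distance between corresponding temporal tokens.
    l_temp = sum(1 - F.cosine_similarity(t[a], t[b], dim=-1).mean()
                 for a, b in pairs)
    # L_pose: shared regressor predicts extrinsics from each modality (L1).
    l_pose = sum(F.l1_loss(pose_head(t[m]), pose_gt) for m in t)
    return l_nce + lam_t * l_temp + lam_p * l_pose

B, N, D = 4, 8, 64
g = {m: torch.randn(B, D) for m in ("video", "text", "pose")}
t = {m: torch.randn(B, N, D) for m in ("video", "text", "pose")}
pose_head = torch.nn.Linear(D, 12)   # predicts flattened 3x4 extrinsics
pose_gt = torch.randn(B, N, 12)
loss = align_loss(g, t, pose_head, pose_gt)
```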
A frozen Motion Embedding Predictor M_pred estimates a motion embedding from the reconstructed clean latent during diffusion training:
ẑ_0 = z̃_t - t · v_θ(z̃_t, t, y, I, e_m)
L_total = L_denoise + λ_m · L_motion(M_pred(ẑ_0), e_m)
This enforces trajectory adherence without pixel-space decoding.
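The latent motion consistency term can be sketched as below, assuming a rectified-flow parameterization (z̃_t = (1−t)·z_0 + t·z_1, so ẑ_0 = z̃_t − t·v_θ) and a cosine-distance motion loss; the toy predictor stands in for the frozen M_pred:

```python
import torch
import torch.nn.functional as F

def motion_consistency_loss(z_t, t, v_pred, m_pred, e_m):
    """Reconstruct the clean latent and compare its motion embedding to e_m."""
    # z0_hat = z_t - t * v_theta(...)  (rectified-flow assumption)
    z0_hat = z_t - t.view(-1, 1, 1, 1, 1) * v_pred   # (B, C, T, H, W)
    e_hat = m_pred(z0_hat)                            # (B, D)
    return 1 - F.cosine_similarity(e_hat, e_m, dim=-1).mean()

# Toy frozen "predictor": pools the latent and projects it (illustrative only).
class ToyPredictor(torch.nn.Module):
    def __init__(self, c=4, d=16):
        super().__init__()
        self.proj = torch.nn.Linear(c, d)
    def forward(self, z):
        return self.proj(z.mean(dim=(2, 3, 4)))       # pool over T, H, W

B, C, T, H, W = 2, 4, 8, 8, 8
m_pred = ToyPredictor().requires_grad_(False)         # M_pred stays frozen
z_t = torch.randn(B, C, T, H, W)
t = torch.rand(B)
v_pred = torch.randn(B, C, T, H, W)
e_m = torch.randn(B, 16)
loss = motion_consistency_loss(z_t, t, v_pred, m_pred, e_m)
```

Because M_pred is frozen, gradients flow only through ẑ_0 into the diffusion model, steering it toward the conditioning trajectory without any pixel-space decode.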
Motion embeddings are injected into each DiT block via a block-specific projection MLP with residual addition:
h_in = h + F_proj(e_m)
Only the 3D spatial-temporal attention layers and projection MLPs are updated during fine-tuning.
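The injection rule h_in = h + F_proj(e_m) can be sketched as one small MLP per DiT block; the MLP depth and the broadcast of a single motion vector over tokens are assumptions here:

```python
import torch
import torch.nn as nn

class MotionInjector(nn.Module):
    """Block-specific projection MLPs with residual addition: h_in = h + F_proj(e_m)."""

    def __init__(self, num_blocks: int, motion_dim: int, hidden_dim: int):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Sequential(nn.Linear(motion_dim, hidden_dim),
                          nn.GELU(),
                          nn.Linear(hidden_dim, hidden_dim))
            for _ in range(num_blocks)
        )

    def forward(self, h: torch.Tensor, e_m: torch.Tensor, block: int):
        # h: (B, L, hidden_dim); e_m: (B, motion_dim)
        # Project e_m with this block's MLP and add it to every token.
        return h + self.projs[block](e_m).unsqueeze(1)

inj = MotionInjector(num_blocks=4, motion_dim=32, hidden_dim=64)
h = torch.randn(2, 10, 64)
e_m = torch.randn(2, 32)
print(inj(h, e_m, block=0).shape)  # torch.Size([2, 10, 64])
```

A separate MLP per block lets each depth of the DiT read the motion signal at its own scale while the residual form keeps the pretrained features intact at initialization.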
Concatenate two motion sequences (offset by the final state of the first) to generate compound multi-stage camera trajectories across any modality combination.
Linearly interpolate between motion embeddings e_a and e_b from different modalities to produce smooth blended camera motion.
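Both composition operations can be sketched on motion token sequences of shape (N, D); the additive offset by the first sequence's final state is a simplifying assumption about how the chaining is realized:

```python
import torch

def concat_motions(e_a: torch.Tensor, e_b: torch.Tensor) -> torch.Tensor:
    """Chain e_b after e_a, shifting e_b so it starts from e_a's final state."""
    offset = e_a[-1] - e_b[0]
    return torch.cat([e_a, e_b + offset], dim=0)      # (Na + Nb, D)

def blend_motions(e_a: torch.Tensor, e_b: torch.Tensor, alpha: float = 0.5):
    """Linear interpolation between two same-length motion sequences."""
    return (1 - alpha) * e_a + alpha * e_b

e_a, e_b = torch.randn(8, 16), torch.randn(8, 16)
chained = concat_motions(e_a, e_b)                    # compound trajectory
blended = blend_motions(e_a, e_b, alpha=0.3)          # smooth blend
print(chained.shape, blended.shape)
```

Because the embedding space is modality-agnostic, e_a and e_b here may come from any two modalities, e.g. a reference video chained into a text-described pan.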
```bibtex
@inproceedings{trimotion2026,
  title={TriMotion: Modality-Agnostic Camera Control for Video Generation},
  booktitle={European Conference on Computer Vision (ECCV)},
  year={2026}
}
```

- WAN-Video — diffusion backbone
- VGGT — video motion encoder
- ReCamMaster — Multi-Cam Video Dataset
- Qwen3 — geometry-grounded caption generation
- Hugging Face Transformers — T5 text encoder