This document provides a comprehensive guide to training DVD (Deterministic Video Depth).
Before starting, it is helpful to understand the core scripts involved in the training process:
- train_script/train_video_new.sh, examples/wanvideo/model_training/train_with_accelerate_video.py: The main entry point for the training loop.
- examples/wanvideo/model_training/WanTrainingModule.py: Handles training and validation logic. Note that, to save time, we validate on only a single window during training; use the inference script if you need more thorough validation.
- examples/dataset: Handles the datasets (both train and val).
- train_config/normal_config/video_config_new.yaml: Contains all hyperparameters, including learning rate, batch size, dataset config, and so on.
- diffsynth/pipelines/wan_video_new_determine.py: The core model architecture.
As mentioned in our paper, DVD requires only 367K frames to unlock generative priors. We mainly use Hypersim (image), TartanAir (video), and Virtual KITTI (video, image) for training. For detailed preprocessing instructions, please refer to Lotus.
Please download the raw datasets from their official websites and organize them as follows:
vkitti
├── Scene01
├── Scene02
├── ...
hypersim
├── test
├── train
└── val
ttr
├── abandonedfactory
├── abandonedfactory_night
├── amusement
├── ...
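Before launching a long run, it can help to confirm that the folders are where the dataloaders expect them. The following is a minimal sketch, not part of the repository: `find_missing_dirs` and `EXPECTED_LAYOUT` are hypothetical names, and the check covers only the directories explicitly listed above (entries elided with "..." are not checked).

```python
from pathlib import Path

# Directories named explicitly in the layout above.
# Entries elided there with "..." are intentionally not checked.
EXPECTED_LAYOUT = {
    "vkitti": ["Scene01", "Scene02"],
    "hypersim": ["test", "train", "val"],
    "ttr": ["abandonedfactory", "abandonedfactory_night", "amusement"],
}


def find_missing_dirs(data_root):
    """Return the expected 'dataset/subdir' entries that are absent under data_root."""
    root = Path(data_root)
    missing = []
    for dataset, subdirs in EXPECTED_LAYOUT.items():
        for sub in subdirs:
            if not (root / dataset / sub).is_dir():
                missing.append(f"{dataset}/{sub}")
    return missing
```

Calling `find_missing_dirs("/path/to/data")` with the folder that contains `vkitti`, `hypersim`, and `ttr` returns an empty list when all listed directories are present.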
All training hyperparameters are centralized in train_config/normal_config/video_config_new.yaml.
Key parameters you might want to adjust based on your hardware:
- batch_size: Reduce this if you encounter Out-Of-Memory (OOM) errors.
- gradient_accumulation_steps: Increase this to maintain the effective batch size if you reduce batch_size.
- use_gradient_checkpointing: Set this to True if you are facing OOM errors.
- learning_rate: Default is set to 1e-4.
- {test/train}_{min/max}_num_frame: The number of frames processed in one clip (for example, 45-45 by default).
- denoise_step: The $\tau$ condition in our paper.
- grad_loss, grad_co: The LMR and the $\lambda_{LMR}$ in our paper, respectively.
- lora_rank: Set to 512 following Lotus-2.
- init_validate: Whether to perform initial validation before training.
- log_step: Interval for logging training state.
- prob: The ratio for Hypersim (img), Virtual KITTI (img), TartanAir (vid), Virtual KITTI (vid).
- batch_size: The batch size for image data. The default batch size for video is 1.
- dataset settings: Please refer to examples/dataset for more details.
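The trade-off between batch_size and gradient_accumulation_steps, and the meaning of a sampling ratio such as prob, can be illustrated with a short sketch. It rests on two assumptions that are not confirmed by the repository: that batch_size is a per-GPU value in standard data-parallel training, and that prob acts as relative sampling weights over the four data sources. All function names and the source labels are hypothetical.

```python
import random

# Hypothetical labels, in the order given for `prob` above.
SOURCES = ["hypersim_img", "vkitti_img", "tartanair_vid", "vkitti_vid"]


def effective_batch_size(batch_size, gradient_accumulation_steps, num_gpus):
    """Samples contributing to one optimizer step (assumes per-GPU batch_size)."""
    return batch_size * gradient_accumulation_steps * num_gpus


def accumulation_steps_for(target_effective, batch_size, num_gpus):
    """Smallest gradient_accumulation_steps that reaches target_effective."""
    per_step = batch_size * num_gpus
    return -(-target_effective // per_step)  # ceiling division


def sample_sources(prob, num_draws, seed=0):
    """Draw data sources with `prob` as relative weights (an assumed interpretation)."""
    rng = random.Random(seed)
    return rng.choices(SOURCES, weights=prob, k=num_draws)
```

For instance, halving batch_size from 4 to 2 on 4 GPUs while keeping an effective batch size of 32 means raising gradient_accumulation_steps from 2 to 4, which is what `accumulation_steps_for(32, 2, 4)` computes.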
Make sure you have downloaded the base weights (e.g., Wan2.1) before starting. If you use the training script we provide, the weights are downloaded automatically.
We use accelerate for multi-GPU training. To train on 4 GPUs:
bash train_script/train_video_new.sh
You might also alter the files under train_config/accelerate_config to change the GPU configuration (e.g., for a single GPU or DeepSpeed).
Checkpoints are saved automatically in the output_path directory every validate_step steps. If your training is interrupted, you can resume it by specifying training_state_dir and setting resume and load_optimizer to True. Please also set global_step if possible. Then rerun the training script:
bash train_script/train_video_new.sh
Please refer to the training script for more details.
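The resume settings described above can be summarized as a small set of overrides. This is only a sketch: the key names (training_state_dir, resume, load_optimizer, global_step) come from the text above, but the helper functions are hypothetical, and where these values are actually set (the YAML config or the training script) should be checked in the repository.

```python
def resume_overrides(training_state_dir, global_step=None):
    """Build the settings needed to resume an interrupted run."""
    overrides = {
        "training_state_dir": training_state_dir,
        "resume": True,
        "load_optimizer": True,
    }
    # global_step is optional; set it when the interrupted step is known.
    if global_step is not None:
        overrides["global_step"] = global_step
    return overrides


def apply_overrides(config, overrides):
    """Return a copy of config with the overrides applied; config is not modified."""
    merged = dict(config)
    merged.update(overrides)
    return merged
```

For example, `apply_overrides(config, resume_overrides("/path/to/training_state", global_step=2000))` yields a config with resume and load_optimizer enabled; the path and step number here are placeholders.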