Environment
- Repository: facebookresearch/vjepa2
- Model: VJEPA2.1 (ViT-L)
- Hardware: 2× H200 NVL GPUs
- Dataset (test): ~22k videos (custom domain)
Bug: crop size mismatch between checkpoint and cooldown config
The checkpoint uploaded by Meta for VJEPA2.1 was trained with a crop size of 384, but the accompanying cooldown config file specifies a crop size of 256.
Affected files
- Checkpoint: trained and saved at resolution 384×384
- Cooldown config:
crop_size: 256
Reproducing the two-stage pretraining of VJEPA2.1 with the provided configs may therefore yield a model that does not match the released checkpoint. Moreover, would loading the 384-pretrained checkpoint and continuing training with the cooldown config as-is cause a positional embedding shape mismatch, or would it silently train at the wrong resolution?
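To make the size of the discrepancy concrete, here is a minimal sketch of the token-count arithmetic for the two crop sizes. The patch size (16), frame count (16) and tubelet size (2) are assumed values for illustration and are not taken from the released configs; `token_grid` and `num_tokens` are hypothetical helpers, not functions from the repository.

```python
def token_grid(crop_size, patch_size=16, num_frames=16, tubelet_size=2):
    """Return (time, height, width) token counts for a square video crop.

    patch_size, num_frames and tubelet_size are assumed values for
    illustration; read the real ones from the config in use.
    """
    if crop_size % patch_size or num_frames % tubelet_size:
        raise ValueError("crop/frames must be divisible by patch/tubelet size")
    side = crop_size // patch_size
    return (num_frames // tubelet_size, side, side)


def num_tokens(grid):
    """Total number of tokens in a (time, height, width) grid."""
    t, h, w = grid
    return t * h * w


ckpt_grid = token_grid(384)  # resolution the checkpoint was trained at
cfg_grid = token_grid(256)   # resolution requested by the cooldown config

print("checkpoint grid:", ckpt_grid, "tokens:", num_tokens(ckpt_grid))
print("config grid:    ", cfg_grid, "tokens:", num_tokens(cfg_grid))
```

Under these assumptions the sequence length differs by more than 2× between the two settings. If the positional information were stored as a fixed-length learned tensor, this would surface as a shape error at load time; if it is computed on the fly, loading would likely succeed and training would simply run at 256.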
Follow-up: NaN loss during continuous pretraining of VJEPA2.1
Context
I ran continuous pretraining of VJEPA2 (ViT-L) on a custom dataset and measured an accuracy improvement: a fine-tuned linear probe on my domain-pretrained VJEPA2 checkpoint outperformed the same probe on the base VJEPA2 checkpoint. Given this result, I wanted to replicate the approach with VJEPA2.1, hoping for an even stronger encoder baseline. Unfortunately, the run diverges almost immediately.
Observed behaviour
- Loss: NaN after the first few iterations.
- Dataset size (test run): ~22k videos.
- Target dataset size (full run): ~500k videos.
What I have tried
- Checked for inf / NaN in the input tensors before the forward pass.
- Lowered the learning rate significantly.
- Suspected the crop size mismatch as a potential cause (positional embedding interpolation artefacts).
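The crop-size suspicion can be tested before any training step by comparing parameter shapes in the checkpoint against those of the freshly built model. The sketch below is a plain-Python helper that works on name-to-shape mappings; `diff_shapes` and the `pos_embed` key are hypothetical names chosen for illustration and may not correspond to anything in the released checkpoint.

```python
def diff_shapes(ckpt_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns (missing, unexpected, mismatched), where `mismatched` maps a
    parameter name to (checkpoint_shape, model_shape).
    """
    missing = sorted(set(model_shapes) - set(ckpt_shapes))
    unexpected = sorted(set(ckpt_shapes) - set(model_shapes))
    mismatched = {
        name: (tuple(ckpt_shapes[name]), tuple(model_shapes[name]))
        for name in sorted(set(ckpt_shapes) & set(model_shapes))
        if tuple(ckpt_shapes[name]) != tuple(model_shapes[name])
    }
    return missing, unexpected, mismatched


# Hypothetical shapes: a learned positional embedding sized for 384 vs 256
# crops (assuming patch 16, 8 temporal tokens, width 1024). With PyTorch the
# mappings could be built as {k: tuple(v.shape) for k, v in sd.items()}.
ckpt = {"pos_embed": (1, 4608, 1024), "norm.weight": (1024,)}
model = {"pos_embed": (1, 2048, 1024), "norm.weight": (1024,)}

missing, unexpected, mismatched = diff_shapes(ckpt, model)
print("missing:", missing)
print("unexpected:", unexpected)
print("mismatched:", mismatched)
```

An empty `mismatched` result would not rule out the resolution problem: if positions are computed rather than stored, all shapes agree and the run would proceed at the config's crop size without any error.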
Questions
- Are the pretrain and cooldown config files intended to be used with the released checkpoint or as a standalone stage-2 config from scratch? If the former, can the crop size be corrected to 384?
- Are there known instability issues when continuously pretraining VJEPA2.1 on small datasets (~22k videos)?
- Is there a recommended starting LR / warmup schedule for continuous pretraining with the released VJEPA2.1 checkpoint?
- Does the crop size mismatch (384 checkpoint --> 256 config) cause positional embedding interpolation that could lead to NaN gradients?
Additional info
The small-dataset test run (~22k videos) is intentional: I want to validate the approach before scaling to the full ~500k video corpus.