[VJEPA2.1] Crop size mismatch between uploaded checkpoint (384) and cooldown config (256) + NaN loss during continuous pretraining #163

## Environment
* Repository: facebookresearch/vjepa2
* Model: VJEPA2.1 (ViT-L)
* Hardware: 2× H200 NVL GPUs
* Dataset (test): ~22k videos (custom domain)


## Bug: crop size mismatch between checkpoint and cooldown config

The checkpoint uploaded by Meta for VJEPA2.1 was trained with a crop size of `384`, but the accompanying cooldown config file specifies a crop size of `256`.

### Affected files
- Checkpoint: trained and saved at resolution 384×384
- Cooldown config: `crop_size: 256`

Reproducing the two-stage pretraining of VJEPA2.1 with the provided configs may therefore not match the released checkpoint. In addition, if I load the 384-pretrained checkpoint and continue training with the cooldown config as-is, does this cause a positional embedding shape mismatch, or does it silently train at the wrong resolution?
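To make the size of the mismatch concrete, the sketch below counts patch tokens per clip at each resolution. The patch size (16), tubelet size (2), and clip length (16 frames) are assumptions for illustration only and should be checked against the actual config; `tokens_per_clip` is a hypothetical helper, not a function from the repository.

```python
def tokens_per_clip(crop_size: int, num_frames: int = 16,
                    patch_size: int = 16, tubelet_size: int = 2) -> int:
    """Count spatio-temporal patch tokens for one clip.

    All default values are illustrative assumptions, not values read
    from the released configs.
    """
    if crop_size % patch_size or num_frames % tubelet_size:
        raise ValueError("crop_size and num_frames must divide evenly")
    per_side = crop_size // patch_size       # patches along height (and width)
    temporal = num_frames // tubelet_size    # tubelets along time
    return per_side * per_side * temporal


for crop in (256, 384):
    print(crop, tokens_per_clip(crop))
```

Under these assumptions a 384 crop produces 2.25× as many tokens per clip as a 256 crop, so any component whose shape depends on the token grid would either need interpolation or fail to load.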

## Follow-up: NaN loss during continuous pretraining of VJEPA2.1

### Context

I ran continuous pretraining of VJEPA2 (ViT-L) on a custom dataset and measured an accuracy improvement: a linear probe trained on my domain-pretrained VJEPA2 checkpoint outperformed the same probe trained on the base VJEPA2 checkpoint. Given this result, I wanted to replicate the approach with VJEPA2.1, hoping for an even stronger encoder baseline. Unfortunately, the run diverges immediately.

### Observed behaviour

* Loss: NaN after the first few iterations.
* Dataset size (test run): ~22k videos.
* Target dataset size (full run): ~500k videos.

### What I have tried

- Checked for `inf` / `NaN` in the input tensors before the forward pass.
- Lowered the learning rate significantly.
- Suspected the crop size mismatch as a potential cause (positional embedding interpolation artefacts).
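A further check that may help narrow this down is to stop at the first non-finite loss so the offending step and batch can be inspected. The sketch below is framework-agnostic and works on plain floats; `first_non_finite` and `check_loss` are hypothetical helpers, not part of the repository.

```python
import math
from typing import Iterable, Optional


class NonFiniteLossError(RuntimeError):
    """Raised when a loss value is NaN or +/-inf."""


def check_loss(loss_value: float, step: int) -> float:
    """Return the loss unchanged, or raise at the first non-finite value."""
    if not math.isfinite(loss_value):
        raise NonFiniteLossError(
            f"non-finite loss {loss_value!r} at step {step}"
        )
    return loss_value


def first_non_finite(losses: Iterable[float]) -> Optional[int]:
    """Return the index of the first NaN/inf loss, or None if all are finite."""
    for step, value in enumerate(losses):
        if not math.isfinite(value):
            return step
    return None
```

In a PyTorch training loop this would be called as `check_loss(loss.item(), step)` right after the forward pass. Enabling `torch.autograd.set_detect_anomaly(True)` for a short debug run can additionally point to the backward operation that first produces a NaN, at the cost of slower iterations.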

### Questions

* Are the pretrain and cooldown config files intended to be used together with the released checkpoint, or as a standalone stage-2 config when training from scratch? If the former, can the crop size in the cooldown config be corrected to 384?
* Are there known instability issues when continuously pretraining VJEPA2.1 on small datasets (~22k videos)?
* Is there a recommended starting LR / warmup schedule for continuous pretraining with the released VJEPA2.1 checkpoint?
* Does the crop size mismatch (384 checkpoint → 256 config) cause positional embedding interpolation that could lead to NaN gradients?
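To make the LR / warmup question concrete, the kind of schedule meant here is a linear warmup followed by cosine decay. The sketch below is generic; `lr_at_step` is a hypothetical helper and every numeric value is a placeholder, not a setting from my run or from the repository configs.

```python
import math


def lr_at_step(step: int, total_steps: int, warmup_steps: int,
               peak_lr: float, start_lr: float = 0.0,
               final_lr: float = 0.0) -> float:
    """Linear warmup from start_lr to peak_lr, then cosine decay to final_lr."""
    if step < warmup_steps:
        # Linear ramp over the warmup phase.
        return start_lr + (peak_lr - start_lr) * step / max(1, warmup_steps)
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    return final_lr + 0.5 * (peak_lr - final_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values for illustration only.
for s in (0, 5, 10, 55, 100):
    print(s, lr_at_step(s, total_steps=100, warmup_steps=10, peak_lr=1e-4))
```

The open question is which `peak_lr`, `warmup_steps`, and `final_lr` are reasonable when starting from the released VJEPA2.1 checkpoint rather than from scratch.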

### Additional info

The small-dataset test run (~22k videos) is intentional: I want to validate the approach before scaling to the full ~500k video corpus.
